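Get all images from a board from a Pinterest web address

This question sounds easy, but it is not as simple as it sounds.

Brief summary of what's wrong

For an example, take this board: http://pinterest.com/dodo/web-designui-and-mobile/

Examining the HTML for the board itself (inside the div with class GridItems) at the top of the page yields: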

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 0px; left: 0px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 3343px; left: 1000px; visibility: visible;">..</div>
</div>

However, at the bottom of the page, after activating the infinite scroll a couple of times, we get this as the HTML:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 12431px; left: 750px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 19944px; left: 750px; visibility: visible;">..</div>
</div>

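As you can see, some of the containers for the images higher up the page have disappeared, and not all of the containers for the images are loaded when the page first loads.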


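What I want to do

I want to be able to create a C# script (or a script in any server-side language at the moment) that can download the page's full HTML (i.e. retrieve every image on the page) and then download the images from their URLs. Downloading the web page and using an appropriate XPath is easy, but the real challenge is getting the full HTML that covers every image.

Is there a way I can emulate scrolling to the bottom of the page, or is there an even easier way to retrieve every image? I imagine Pinterest uses AJAX to change the HTML; is there a way I can programmatically trigger the events to receive all the HTML? Thank you in advance for any suggestions and solutions, and kudos for even reading this very long question if you don't have any!

Pseudo code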

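using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string pinterestUrl = "http://www.pinterest.com/...";
        string xPath = ".../img";

        // HtmlWeb fetches the page over HTTP; HtmlDocument.Load only reads files and streams.
        HtmlDocument doc = new HtmlWeb().Load(pinterestUrl);

        // Currently only finds the first 25 images.
        var imageLinks = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode img in nodes)
            {
                imageLinks.Add(img.GetAttributeValue("src", ""));
                // Use image links
            }
        }
    }
}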

OK, so I think this may be (with a few alterations) what you need.

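Caveats:

  • This is PHP, not C# (but you said you were interested in any server-side language).
  • This code hooks into (unofficial) Pinterest search endpoints. You'll need to change $data and $search_res to reflect the appropriate endpoints for your task (e.g. BoardFeedResource). Note: at least for search, Pinterest currently uses two endpoints, one for the initial page load and another for the infinite scroll actions. Each has its own expected parameter structure.
  • Pinterest has no official public API; expect this to break whenever they change anything, and without warning.
  • You may find pinterestapi.co.uk easier to implement and acceptable for what you're doing.
  • I have some demo/debug code beneath the class that shouldn't be there once you're getting the data you want, and a default page-fetch limit that you may want to change.

Points of interest:

  • The underscore _ parameter takes a timestamp in JavaScript format, i.e. like Unix time but with milliseconds added. It is not actually used for pagination.
  • Pagination uses the bookmarks property. You make the first request to the 'new' endpoint, which doesn't require it, then take the bookmarks from the result and use it in your next request to get the next 'page' of results, take the bookmarks from those results to fetch the page after that, and so on until you run out of results or reach your pre-set limit (or hit the server maximum for script execution time). I'd be curious to know exactly what the bookmarks field encodes. I'd like to think there is some fun secret sauce beyond just a pin ID or some other page marker.
  • I'm skipping the HTML and dealing with the JSON instead, as it's easier (for me) than using a DOM manipulation solution or a bunch of regex.

    <?php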
    
    if(!class_exists('Skrivener_Pins')) {
    
      class Skrivener_Pins {
    
        /**
         * Constructor
         */
        public function __construct() {
        }
    
        /**
         * Pinterest search function. Uses Pinterest's "internal" page APIs, so likely to break if they change.
         * @author [@skrivener] Philip Tillsley
         * @param $search_str     The string used to search for matching pins.
         * @param $limit          Max number of pages to get, defaults to 1 to avoid excessively large queries. Use care when passing in a value.
         * @param $bookmarks_str  Used internally for recursive fetches.
         * @param $page           Used internally to limit recursion.
         * @return array()        int['id'], obj['image'], str['pin_link'], str['orig_link'], bool['video_flag']
         * 
         * TODO:
            * 
            * 
         */
        public function get_tagged_pins($search_str, $limit = 1, $bookmarks_str = null, $page = 1) {
    
          // limit depth of recursion, ie. number of pages of 25 returned, otherwise we can hang on huge queries
          if( $page > $limit ) return false;
    
          // are we getting a next page of pins or not
          $next_page = false;
          if( isset($bookmarks_str) ) $next_page = true;
    
          // build url components
          if( !$next_page ) {
    
            // 1st time
            $search_res = 'BaseSearchResource'; // end point
            $path = '&module_path=' . urlencode('SearchInfoBar(query=' . $search_str . ', scope=boards)');
            $data = preg_replace("'[\n\r\s\t]'","",'{
              "options":{
                "scope":"pins",
                "show_scope_selector":true,
                "query":"' . $search_str . '"
              },
              "context":{
                "app_version":"2f83a7e"
              },
              "module":{
                "name":"SearchPage",
                "options":{
                  "scope":"pins",
                  "query":"' . $search_str . '"
                }
              },
              "append":false,
              "error_strategy":0
              }');
          } else {
    
            // this is a fetch for 'scrolling', what changes is the bookmarks reference, 
            // so pass the previous bookmarks value to this function and it is included
            // in query
            $search_res = 'SearchResource'; // different end point from 1st time search
            $path = '';
            $data = preg_replace("'[\n\r\s\t]'","",'{
              "options":{
                "query":"' . $search_str . '",
                "bookmarks":["' . $bookmarks_str . '"],
                "show_scope_selector":null,
                "scope":"pins"
              },
              "context":{
                "app_version":"2f83a7e"
              },
                "module":{
                  "name":"GridItems",
                "options":{
                  "scrollable":true,
                  "show_grid_footer":true,
                  "centered":true,
                  "reflow_all":true,
                  "virtualize":true,
                  "item_options":{
                    "show_pinner":true,
                    "show_pinned_from":false,
                    "show_board":true
                  },
                  "layout":"variable_height"
                }
              },
              "append":true,
              "error_strategy":2
            }');
          }
          $data = urlencode($data);
          $timestamp = time() * 1000; // unix time but in JS format (ie. has ms vs normal server time in secs), * 1000 to add ms (ie. 0ms)
    
          // build url
          $url = 'http://pinterest.com/resource/' . $search_res . '/get/?source_url=/search/pins/?q=' . $search_str
              . '&data=' . $data
              . $path
              . '&_=' . $timestamp;//'1378150472669';
    
          // setup curl
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_URL, $url);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
          curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));
    
          // get result
          $curl_result = curl_exec ($ch); // returns the response body as a string, because CURLOPT_RETURNTRANSFER is set
          $curl_result = json_decode($curl_result);
          curl_close ($ch);
    
          // clear html to make var_dumps easier to see when debugging
          // $curl_result->module->html = '';
    
          // isolate the pin data, different end points have different data structures
          if(!$next_page) $pin_array = $curl_result->module->tree->children[1]->children[0]->children[0]->children;
          else $pin_array = $curl_result->module->tree->children;
    
          // map the pin data into desired format
          $pin_data_array = array();
          $bookmarks = null;
          if(is_array($pin_array)) {
            if(count($pin_array)) {
    
              foreach ($pin_array as $pin) {
    
                //setup data
                $image_id = $pin->options->pin_id;
                $image_data = ( isset($pin->data->images->originals) ) ? $pin->data->images->originals : $pin->data->images->orig;
                $pin_url = 'http://pinterest.com/pin/' . $image_id . '/';
                $original_url = $pin->data->link;
                $video = $pin->data->is_video;
    
                array_push($pin_data_array, array(
                  "id"          => $image_id,
                  "image"       => $image_data,
                  "pin_link"    => $pin_url,
                  "orig_link"   => $original_url,
                  "video_flag"  => $video,
                  ));
              }
              $bookmarks = reset($curl_result->module->tree->resource->options->bookmarks);
    
            } else {
              $pin_data_array = false;
            }
          }
    
          // recurse until we're done
          if( !($pin_data_array === false) && !is_null($bookmarks) ) {
    
            // more pins to get
            $more_pins = $this->get_tagged_pins($search_str, $limit, $bookmarks, ++$page);
            if( !($more_pins === false) ) $pin_data_array = array_merge($pin_data_array, $more_pins);
            return $pin_data_array;
          }
    
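          // end of recursion: keep the pins from the last page, return false only if there were none
          return empty($pin_data_array) ? false : $pin_data_array;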
        }
    
      } // end class Skrivener_Pins
    } // end if
    
    
    
    /**
     * Debug/Demo Code
     * delete or comment this section for production
     */
    
    // output headers to control how the content displays
    // header("Content-Type: application/json");
    header("Content-Type: text/plain");
    // header("Content-Type: text/html");
    
    // define search term
    // $tag = "vader";
    $tag = "haemolytic";
    // $tag = "qjkjgjerbjjkrekhjk";
    
    if(class_exists('Skrivener_Pins')) {
    
      // instantiate the class
      $pin_handler = new Skrivener_Pins();
    
      // get pins, pinterest returns 25 per batch, function pages through this recursively, pass in limit to 
      // override default limit on number of pages to retrieve, avoid high limits (eg. limit of 20 * 25 pins/page = 500 pins to pull 
      // and 20 separate calls to Pinterest)
      $pins1 = $pin_handler->get_tagged_pins($tag, 2);
    
      // display the pins for demo purposes
      echo '<h1>Images on Pinterest mentioning "' . $tag . '"</h1>' . "\n";
      if( $pins1 != false ) {
        echo '<p><em>' . count($pins1) . ' images found.</em></p>' . "\n";
        skrivener_dump_images($pins1, 5);
      } else {
        echo '<p><em>No images found.</em></p>' . "\n";
      }
    }
    
    // demo function, dumps images in array to html img tags, can pass limit to only display part of array
    function skrivener_dump_images($pin_array, $limit = false) {
      if(is_array($pin_array)) {
        if($limit) $pin_array = array_slice($pin_array, -($limit));
        foreach ($pin_array as $pin) {
          echo '<img src="' . $pin['image']->url . '" width="' . $pin['image']->width . '" height="' . $pin['image']->height . '" >' . "\n";
        }
      }
    }
    
    ?>
    

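Let me know if you run into problems adapting this to your particular endpoints. Apologies for any sloppiness in the code; it didn't make it to production originally.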
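
For readers who would rather not follow the recursion, the bookmark-driven paging described above can also be written as a plain loop. This is only a minimal sketch in Python, not the PHP class itself: `fetch_page` is a hypothetical callback standing in for the HTTP call and JSON parsing, and `max_pages` plays the role of `$limit`.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callback type: takes the previous bookmark (None for the first
# request) and returns (pins, next_bookmark). next_bookmark is None when the
# response carries no bookmark for a further page.
FetchPage = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def fetch_all_pins(fetch_page: FetchPage, max_pages: int = 2) -> List[dict]:
    """Follow bookmarks from page to page, making at most max_pages requests."""
    pins: List[dict] = []
    bookmark: Optional[str] = None
    for _ in range(max_pages):
        page_pins, bookmark = fetch_page(bookmark)
        pins.extend(page_pins)
        # Stop when a page is empty or no bookmark came back for a next page.
        if not page_pins or bookmark is None:
            break
    return pins


if __name__ == "__main__":
    # Stand-in for the real endpoint: three fake pages keyed by bookmark.
    fake_pages = {
        None: ([{"id": 1}, {"id": 2}], "bm1"),
        "bm1": ([{"id": 3}], "bm2"),
        "bm2": ([{"id": 4}], None),
    }
    print(fetch_all_pins(lambda bm: fake_pages[bm], max_pages=10))
```

Passing the fetch step in as a callback keeps the paging logic testable without touching the network, which matters for an unofficial endpoint that may change or disappear.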


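A couple of people have suggested using JavaScript to emulate scrolling.

I don't think you need to emulate scrolling at all. I think you just need to find out the format of the URIs called via AJAX whenever scrolling occurs, and then you can get each "page" of results sequentially. A little reverse engineering is required.

Using the Network tab of the Chrome inspector, I can see that once I reach a certain distance down the page, the following URI is called: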

    http://pinterest.com/resource/BoardFeedResource/get/?source_url=%2Fdodo%2Fweb-designui-and-mobile%2F&data=%7B%22options%22%3A%7B%22board_id%22%3A%22158400180582875562%22%2C%22access%22%3A%5B%5D%2C%22bookmarks%22%3A%5B%22LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQyZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw%3D%3D%22%5D%7D%2C%22context%22%3A%7B%22app_version%22%3A%22fb43cdb%22%7D%2C%22module%22%3A%7B%22name%22%3A%22GridItems%22%2C%22options%22%3A%7B%22scrollable%22%3Atrue%2C%22show_grid_footer%22%3Atrue%2C%22centered%22%3Atrue%2C%22reflow_all%22%3Atrue%2C%22virtualize%22%3Atrue%2C%22item_options%22%3A%7B%22show_rich_title%22%3Afalse%2C%22squish_giraffe_pins%22%3Afalse%2C%22show_board%22%3Afalse%2C%22show_via%22%3Afalse%2C%22show_pinner%22%3Afalse%2C%22show_pinned_from%22%3Atrue%7D%2C%22layout%22%3A%22variable_height%22%7D%7D%2C%22append%22%3Atrue%2C%22error_strategy%22%3A1%7D&_=1377092055381

If we decode that, we see that it's mostly JSON:

    http://pinterest.com/resource/BoardFeedResource/get/?source_url=/dodo/web-designui-and-mobile/&data=
    {
    "options": {
        "board_id": "158400180582875562",
        "access": [],
        "bookmarks": [
            "LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQyZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw=="
        ]
    },
    "context": {
        "app_version": "fb43cdb"
    },
    "module": {
        "name": "GridItems",
        "options": {
            "scrollable": true,
            "show_grid_footer": true,
            "centered": true,
            "reflow_all": true,
            "virtualize": true,
            "item_options": {
                "show_rich_title": false,
                "squish_giraffe_pins": false,
                "show_board": false,
                "show_via": false,
                "show_pinner": false,
                "show_pinned_from": true
            },
            "layout": "variable_height"
        }
    },
    "append": true,
    "error_strategy": 1
    }
    &_=1377091719636
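
The decoding step can be done with Python's standard library rather than by hand. The sketch below assumes the URL has the three query parameters seen above (`source_url`, `data`, `_`); the function name `decode_resource_url` and the shortened sample payload are made up for illustration.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit


def decode_resource_url(url: str) -> dict:
    """Split a resource URL into source_url, data (parsed JSON) and _ parts."""
    query = parse_qs(urlsplit(url).query)  # parse_qs percent-decodes each value
    return {
        "source_url": query["source_url"][0],
        "data": json.loads(query["data"][0]),
        "_": query["_"][0],
    }


if __name__ == "__main__":
    # Shortened stand-in for the captured URL: same shape, fewer options.
    sample_data = {
        "options": {"board_id": "158400180582875562", "access": []},
        "append": True,
    }
    sample_url = (
        "http://pinterest.com/resource/BoardFeedResource/get/"
        "?source_url=" + quote("/dodo/web-designui-and-mobile/", safe="")
        + "&data=" + quote(json.dumps(sample_data), safe="")
        + "&_=1377092055381"
    )
    decoded = decode_resource_url(sample_url)
    print(decoded["source_url"])
    print(json.dumps(decoded["data"], indent=4))
```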
    

Scroll down until we get the second request, and we see this:

    http://pinterest.com/resource/BoardFeedResource/get/?source_url=/dodo/web-designui-and-mobile/&data=
    {
        "options": {
            "board_id": "158400180582875562",
            "access": [],
            "bookmarks": [
                "LT4xNTg0MDAxMTE4NjcwNTk1ODQ6NDl8ODFlMDUwYzVlYWQxNzVmYzdkMzI0YTJiOWJkYzUwOWFhZGFkM2M1MzhiNzA0ZDliZDIzYzE3NjkzNTg1ZTEyOQ=="
            ]
        },
        "context": {
            "app_version": "fb43cdb"
        },
        "module": {
            "name": "GridItems",
            "options": {
                "scrollable": true,
                "show_grid_footer": true,
                "centered": true,
                "reflow_all": true,
                "virtualize": true,
                "item_options": {
                    "show_rich_title": false,
                    "squish_giraffe_pins": false,
                    "show_board": false,
                    "show_via": false,
                    "show_pinner": false,
                    "show_pinned_from": true
                },
                "layout": "variable_height"
            }
        },
        "append": true,
        "error_strategy": 2
    }
    &_=1377092231234
    

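As you can see, not much has changed. The board_id is the same; the bookmarks value is different, error_strategy is now 2, and the &_ at the end is different.

The &_ parameter is the key here. My bet is that it tells the page where to begin the next set of photos. I can't find a reference to it in either of the responses or in the original page HTML, but it has to be in there somewhere, or be generated by JavaScript on the client side. Either way, the page/browser has to know what to ask for next, so this information is something you should be able to get at.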
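
One way to look for that information is to diff the two decoded payloads mechanically and then inspect whatever changed. The sketch below is an illustration only: `changed_paths` and `decode_bookmark` are made-up helper names, and the two dictionaries hold just the fields copied from the requests above. The bookmarks strings happen to be valid Base64, so the sketch decodes them as well; what the decoded text means is a guess.

```python
import base64


def changed_paths(old: dict, new: dict, prefix: str = "") -> list:
    """Return the dotted paths whose values differ between two nested dicts."""
    paths = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            paths.extend(changed_paths(a, b, path + "."))
        elif a != b:
            paths.append(path)
    return paths


def decode_bookmark(bookmark: str) -> str:
    """Base64-decode a bookmarks entry into text."""
    return base64.b64decode(bookmark).decode("utf-8", errors="replace")


# Only the fields needed for the comparison, copied from the two requests.
FIRST = {
    "options": {
        "board_id": "158400180582875562",
        "bookmarks": ["LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQyZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw=="],
    },
    "error_strategy": 1,
}
SECOND = {
    "options": {
        "board_id": "158400180582875562",
        "bookmarks": ["LT4xNTg0MDAxMTE4NjcwNTk1ODQ6NDl8ODFlMDUwYzVlYWQxNzVmYzdkMzI0YTJiOWJkYzUwOWFhZGFkM2M1MzhiNzA0ZDliZDIzYzE3NjkzNTg1ZTEyOQ=="],
    },
    "error_strategy": 2,
}

if __name__ == "__main__":
    print(changed_paths(FIRST, SECOND))
    for payload in (FIRST, SECOND):
        print(decode_bookmark(payload["options"]["bookmarks"][0]))
```

Both decoded bookmarks begin with `->`, a long number, a colon, a small count (25, then 49) and a `|` followed by what looks like a hex digest. That would be consistent with a paging cursor, though nothing here confirms how the server interprets it.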


You can trigger the JSON endpoint by making a request with this header: X-Requested-With: XMLHttpRequest

Try this command in the console:

    curl -H "X-Requested-With:XMLHttpRequest" "http://pinterest.com/resource/CategoryFeedResource/get/?source_url=%2Fall%2Fgeek%2F&data=%7B%22options%22%3A%7B%22feed%22%3A%22geek%22%2C%22scope%22%3Anull%2C%22bookmarks%22%3A%5B%22Pz8xMzc3NjU4MjEyLjc0Xy0xfDE1ZjczYzc4YzNlNDg3M2YyNDQ4NGU1ZTczMmM0ZTQyYzBjMWFiMWNhYjRhMDRhYjg2MTYwMGVkNWQ0ZDg1MTY%3D%22%5D%2C%22is_category_feed%22%3Atrue%7D%2C%22context%22%3A%7B%22app_version%22%3A%22addc92b%22%7D%2C%22module%22%3A%7B%22name%22%3A%22GridItems%22%2C%22options%22%3A%7B%22scrollable%22%3Atrue%2C%22show_grid_footer%22%3Atrue%2C%22centered%22%3Atrue%2C%22reflow_all%22%3Atrue%2C%22virtualize%22%3Atrue%2C%22item_options%22%3A%7B%22show_pinner%22%3Atrue%2C%22show_pinned_from%22%3Afalse%2C%22show_board%22%3Atrue%2C%22show_via%22%3Afalse%7D%2C%22layout%22%3A%22variable_height%22%7D%7D%2C%22append%22%3Atrue%2C%22error_strategy%22%3A2%7D&module_path=App()%3EHeader()%3EDropdownButton()%3EDropdown()%3ECategoriesMenu(resource%3D%5Bobject+Object%5D%2C+name%3DCategoriesMenu%2C+resource%3DCategoriesResource(browsable%3Dtrue))&_=1377658213300" | python -mjson.tool
    

You'll see the pin data in the output JSON. You should be able to parse it and grab the next images you need.
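
The same request can be assembled from code instead of curl. This is a rough sketch using only Python's standard library; `build_request` and `fetch_json` are hypothetical helper names, the payload in the demo is cut down to one option, and filling `_` with a millisecond timestamp is an assumption based on how the captured values look.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

BASE = "http://pinterest.com/resource/{resource}/get/"


def build_request(resource: str, source_url: str, data: dict) -> urllib.request.Request:
    """Build (but do not send) a GET request shaped like the curl call above."""
    query = urlencode({
        "source_url": source_url,
        "data": json.dumps(data, separators=(",", ":")),
        "_": int(time.time() * 1000),  # assumption: a millisecond timestamp
    })
    return urllib.request.Request(
        BASE.format(resource=resource) + "?" + query,
        headers={"X-Requested-With": "XMLHttpRequest"},
    )


def fetch_json(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON body. Needs network access."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_request("CategoryFeedResource", "/all/geek/", {"options": {"feed": "geek"}})
    print(req.full_url)
    print(req.header_items())
    # fetch_json(req) would perform the call; it is left out because the
    # endpoint is unofficial and may no longer respond as shown here.
```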

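As for this bit: &_=1377658213300. I surmise that this is the ID of the last pin of the previous list. You should be able to replace it on each call with the last pin from the previous response.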
