HTML scraping and CSS queries

2018-07-03 06:08:20

What are the advantages and disadvantages of the following libraries?

PHP Simple HTML DOM Parser

QP

phpQuery

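Of the above, I have used QP, which fails to parse invalid HTML, and Simple HTML DOM Parser, which does a good job but leaks some memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need an object.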

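Are there any more scrapers? What are your experiences with them? I'm going to make this a community wiki so that we can build a useful list of libraries for scraping.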

I ran some tests based on Byron's answer:

    <?php
    include("lib/simplehtmldom/simple_html_dom.php");
    include("lib/phpQuery/phpQuery/phpQuery.php");


    echo "<pre>";

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['pq'] = $data['dom'] = $data['simple_dom'] = array();

    // Benchmark 1: PHP's built-in DOM extension, queried with XPath.
    $timer_start = microtime(true);

    $dom = new DOMDocument();
    // @ suppresses the warnings libxml raises on malformed real-world HTML.
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);

    foreach($x->query("//a") as $node)
    {
         $data['dom'][] = $node->getAttribute("href");
    }

    foreach($x->query("//img") as $node)
    {
         $data['dom'][] = $node->getAttribute("src");
    }

    foreach($x->query("//input") as $node)
    {
         $data['dom'][] = $node->getAttribute("name");
    }

    $dom_time =  microtime(true) - $timer_start;
    echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

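    // Benchmark 2: phpQuery with CSS selectors.
    $timer_start = microtime(true);
    $doc = phpQuery::newDocument($html);
    // Iterating a phpQuery result yields plain DOMElement nodes,
    // so attributes are read with getAttribute() rather than as properties.
    foreach( $doc->find("a") as $node)
    {
       $data['pq'][] = $node->getAttribute("href");
    }

    foreach( $doc->find("img") as $node)
    {
       $data['pq'][] = $node->getAttribute("src");
    }

    foreach( $doc->find("input") as $node)
    {
       $data['pq'][] = $node->getAttribute("name");
    }
    $time =  microtime(true) - $timer_start;
    echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";
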
    // Benchmark 3: Simple HTML DOM Parser, which exposes attributes as properties.
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach( $simple_dom->find("a") as $node)
    {
       $data['simple_dom'][] = $node->href;
    }

    foreach( $simple_dom->find("img") as $node)
    {
       $data['simple_dom'][] = $node->src;
    }

    foreach( $simple_dom->find("input") as $node)
    {
       $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time =  microtime(true) - $timer_start;
    echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";


    echo "</pre>";

The results:

dom:         0.00359296798706 . Got 115 items 
PQ:          0.010568857193 . Got 115 items 
simple_dom:  0.0770139694214 . Got 115 items

I used to use Simple HTML DOM exclusively, until some bright SO'ers showed me the light. Hallelujah.

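Just use the built-in DOM functions. They are written in C and are part of the PHP core, so they are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is dead simple. This one change made my PHP-based scrapers run faster while saving my precious time.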
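For anyone who has not used the built-in DOM functions before, here is a minimal sketch of the approach. The inline markup and the XPath expressions are made up for illustration, and `libxml_use_internal_errors()` is used here as a common alternative to the `@` operator for silencing warnings about invalid HTML; treat it as one possible way to do this, not the only one.

```php
<?php
// Illustrative input (an assumption, not from the original post):
// deliberately sloppy HTML with an unclosed <p>.
$html = <<<HTML
<html><body>
  <a href="/questions/1">First</a>
  <a href="/questions/2">Second</a>
  <img src="/logo.png">
  <p>Unclosed paragraph
</body></html>
HTML;

// Collect libxml parse warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// XPath can select attribute nodes directly, so no getAttribute() call is needed.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

$srcs = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $srcs[] = $attr->nodeValue;
}

print_r($hrefs);
print_r($srcs);
```

In a real scraper, `$html` would come from `file_get_contents()` or curl, and the XPath expression would be whatever the browser's inspector gives you for the element you are after.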

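My scrapers used to take about 60 megabytes to scrape 10 sites asynchronously with curl. That was even with the Simple HTML DOM memory fix you mentioned.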

Now my PHP processes never go above 8 megabytes.

Highly recommended.

EDIT

OK, I did some benchmarks. The built-in DOM is at least an order of magnitude faster (roughly 17 times faster in the run below).

Built in php DOM: 0.007061
Simple html  DOM: 0.117781

<?php
include("../lib/simple_html_dom.php");

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
$data['dom'] = $data['simple_dom'] = array();

$timer_start = microtime(true);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom); 

foreach($x->query("//a") as $node) 
{
     $data['dom'][] = $node->getAttribute("href");
}

foreach($x->query("//img") as $node) 
{
     $data['dom'][] = $node->getAttribute("src");
}

foreach($x->query("//input") as $node) 
{
     $data['dom'][] = $node->getAttribute("name");
}

$dom_time =  microtime(true) - $timer_start;

echo "built in php DOM : $dom_time\n";

$timer_start = microtime(true);
$simple_dom = new simple_html_dom();
$simple_dom->load($html);
foreach( $simple_dom->find("a") as $node)
{
   $data['simple_dom'][] = $node->href;
}

foreach( $simple_dom->find("img") as $node)
{
   $data['simple_dom'][] = $node->src;
}

foreach( $simple_dom->find("input") as $node)
{
   $data['simple_dom'][] = $node->name;
}
$simple_dom_time =  microtime(true) - $timer_start;

echo "simple html  DOM : $simple_dom_time\n";
