php: Extract text between specific tags from a webpage

Possible Duplicate:
Best methods to parse HTML with PHP

I understand I should be using an HTML parser like PHP's DOMDocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or TagSoup.

How would I use DOMDocument to extract the text between specific tags, for example the text inside h1, h2, h3, p, and table elements? It seems I can only do this for one tag at a time with getElementsByTagName.

Is there a better HTML parser for this task? Or how would I loop through the DOMDocument?


You are correct: use DOMDocument, since regex is NOT a good idea for parsing HTML.

getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:

$document = new DOMDocument();
$document->loadHTML($html);

$tags = array ('h1', 'h2', 'h3', 'h4', 'p');
$texts = array ();
foreach($tags as $tag)
{
  $elementList = $document->getElementsByTagName($tag);
  foreach($elementList as $element)
  {
     $texts[$element->tagName][] = $element->textContent;
  }
}
return $texts;

Note that you should probably have some error handling in there, and you will also lose the context of the texts, but you can probably edit this code as you see fit.
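One way to keep the context of the texts is to select all the wanted tags with a single DOMXPath query, which returns the nodes in document order rather than grouped by tag. The following is only a minimal sketch, assuming the markup is already in a string; the function name extractTextInOrder and the sample HTML are made up for illustration, and the libxml calls are there as one possible form of the error handling mentioned above:

```php
<?php
// Hypothetical helper (not from the answer above): returns array(tagName, text)
// pairs in the order the elements appear in the document.
function extractTextInOrder(string $html, array $tags): array
{
    if ($html === '' || count($tags) === 0) {
        return array();
    }

    // Collect libxml parse warnings instead of emitting them; real pages are rarely valid.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $loaded = $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return array();
    }

    // Build one predicate such as "self::h1 or self::h2 or self::p".
    // Assumes $tags holds plain, trusted element names.
    $tests = array();
    foreach ($tags as $tag) {
        $tests[] = 'self::' . $tag;
    }

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query('//*[' . implode(' or ', $tests) . ']');

    $texts = array();
    foreach ($nodes as $node) {
        $texts[] = array($node->nodeName, trim($node->textContent));
    }
    return $texts;
}

// Made-up sample input.
$html = '<html><body><h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body</p></body></html>';
$result = extractTextInOrder($html, array('h1', 'h2', 'p'));

foreach ($result as $pair) {
    echo $pair[0] . ': ' . $pair[1] . "\n";
}
```

Because each entry carries its tag name, you can still regroup the texts by tag afterwards if you need the structure of the original loop.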


You can do this with regular expressions.

preg_match_all('#<h1>([^<]*)</h1>#Usi', $html_string, $matches);
foreach ($matches[1] as $match)
{
  // $matches[1] holds the captured text of each <h1>; do something with $match
}

I am not sure what your source is, so I added a call to fetch the content from a URL.

$file = file_get_contents($url);

$doc = new DOMDocument();
$doc->loadHTML($file);

// getElementsByTagName returns a DOMNodeList, so take the first body element from it
$body = $doc->getElementsByTagName('body')->item(0);
$items = $body->getElementsByTagName('h1');

I am not sure about this part, but you should be able to loop over the result like this:

for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "\n";
}

Or:

foreach ($items as $item) {
    echo $item->nodeValue . "\n";
}

Here is more info on nodeValue: http://docs.php.net/manual/en/function.domnode-node-value.php
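To illustrate what nodeValue gives you when a tag contains nested markup, here is a small sketch. The sample HTML is made up; for an element node, nodeValue and textContent should return the same string.

```php
<?php
// Made-up sample: an h1 that contains a nested <em> element.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><h1>Hello <em>world</em></h1></body></html>');

$heading = $doc->getElementsByTagName('h1')->item(0);

// For an element node, both properties return the concatenated text
// of all descendants, with the markup itself dropped.
$viaNodeValue = $heading->nodeValue;
$viaTextContent = $heading->textContent;

echo $viaNodeValue . "\n";   // prints "Hello world"
```

So if you need the inner tags as well as the text, nodeValue alone is not enough; it only gives you the flattened text.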

Hope it helps!
