PHP Native DOMDocument and Simple DOM Parser

2018-06-10 01:25:11

I need to parse the contents of an HTML document (produced by Microsoft Word), traverse the DOM to get the information I need, and then output the result as a CSV. Hardly brain surgery, I know.

Now, as PHP isn't really my thing and I have a tight schedule, I was going to use the PHP Simple HTML DOM Parser from http://simplehtmldom.sourceforge.net/

I noticed that my script isn't working. After some trial and error I realised that this is due to the size of the HTML files produced by Word (they are 3 MB and have as many as 30,000 lines of HTML!). I assume there is a file size limit on what can be parsed with the PHP Simple HTML DOM Parser, and perhaps with the native PHP DOMDocument API as well? If so, does anyone know what this limit is? I've been googling for 40 minutes now with no success.

Maybe I should just use Node.js?

PHP's "native" DOMDocument and its little sister SimpleXMLElement do not have a hard-coded size limit, but they are limited by the memory you allow PHP to use (see the memory_limit setting in the PHP docs).
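To see which limit applies to your script, you can read the setting at runtime. This is a minimal sketch; the 512M value is an arbitrary example, and some hosts do not allow scripts to change the limit.

```php
<?php
// Read the configured limit; typical values look like "128M" or "-1" (no limit).
$limit = ini_get('memory_limit');
echo "memory_limit: {$limit}\n";

// Raise the limit for this script only. 512M is an arbitrary example value.
ini_set('memory_limit', '512M');
$newLimit = ini_get('memory_limit');
echo "memory_limit now: {$newLimit}\n";
```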

Also, you must not assume that loading a 100 MB XML or HTML file will consume an equal amount of memory. It is most often much less than the file size (e.g. a fifth or a tenth). The ratio depends on the XML, so there is no fixed factor X; you need to measure your own documents if you want precise numbers.

The file size you give in your question, 3 MB, is rather small, I'd say. Maybe not small for an HTML file on the internet, but small for the libxml-based PHP extensions. You can find out how much memory PHP uses when loading that file with memory_get_usage().
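A rough way to take that measurement is sketched below. The generated HTML string is only a stand-in for the Word export (for a real file you would call loadHTMLFile() with your own path), and the numbers only cover memory that PHP itself accounts for, so treat them as an approximation.

```php
<?php
// Stand-in for the Word export: a repetitive HTML string built in memory.
$html = '<html><body>'
    . str_repeat('<p class="MsoNormal">Some text</p>', 30000)
    . '</body></html>';

$before = memory_get_usage();

// Collect parser warnings internally instead of printing them;
// Word markup tends to produce a lot of them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);
libxml_clear_errors();

$after = memory_get_usage();

printf("input size:   %d bytes\n", strlen($html));
printf("memory delta: %d bytes\n", $after - $before);
printf("peak usage:   %d bytes\n", memory_get_peak_usage());
```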

If you have really large XML files (in that case normally X(HT)ML), let's say 1.5 gigabytes, parsing with DOMDocument will take a long time before you can start working with the data. In that case XMLReader lets you parse the document without loading it (completely) into memory. It is no silver bullet, because you still pay the parse time, but you can better control what to parse and which parts to skip, which leaves more room for optimization in PHP userland.
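A minimal sketch of that streaming style follows. The element names (rows, row, skip) and the inline sample are made up for illustration; for a large file you would use $reader->open() with your own path instead of $reader->XML().

```php
<?php
// Small inline sample; the element names are hypothetical.
$xml = '<rows><row id="1">alpha</row><skip>ignored</skip><row id="2">beta</row></rows>';

$reader = new XMLReader();
$reader->XML($xml);

$rows = [];
// read() advances node by node, so the whole tree is never built in memory.
while ($reader->read()) {
    // Only look at <row> start tags; everything else is skipped.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'row') {
        $rows[] = [
            'id'   => $reader->getAttribute('id'),
            'text' => $reader->readString(),
        ];
    }
}
$reader->close();

print_r($rows);
```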

The PHP Simple HTML DOM Parser library does not impose a specific size limit either. However, it is not a binary PHP extension; it runs in PHP userland. So you need to understand what exactly that library does (see simple_html_dom.php in the HEAD revision). If you review the code you can see that the parser is written purely in PHP. This is because it was originally written for PHP 4, where DOMDocument with DOMDocument::loadHTML did not exist yet.

As you can imagine, a PHP extension can manage memory much better than a library written in PHP code, especially when it comes to tree structures such as an HTML document object model. (That is not inherently true; however, a memory-optimized implementation in PHP takes a lot of work and a good design, which is not easy to create or to maintain.)

However, for many years now it has not been necessary to use that library. Many PHP users do not know that, and they find outdated code examples using that once-popular library. The PHP Simple HTML DOM Parser even still gets suggested from time to time here on Stack Overflow.

So the best suggestion I can give is: unless you need to write PHP 4 compatible code, do not use that library at all and do not worry about its limits. Instead, port your code to DOMDocument::loadHTML().
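For the task in the question, a port could look roughly like the sketch below. The table layout and the XPath expressions are assumptions about what the Word export contains, so adjust them to your document; for a real file, replace loadHTML() on the inline string with loadHTMLFile() and your own path.

```php
<?php
// Inline stand-in for a Word export; the table layout is an assumption.
$html = <<<HTML
<html><body>
<table>
  <tr><td><p class="MsoNormal">Name</p></td><td><p class="MsoNormal">Qty</p></td></tr>
  <tr><td><p class="MsoNormal">Widget</p></td><td><p class="MsoNormal">3</p></td></tr>
</table>
</body></html>
HTML;

// Keep libxml's warnings about non-standard markup out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the trimmed text of every cell, row by row.
$xpath = new DOMXPath($doc);
$rows = [];
foreach ($xpath->query('//table//tr') as $tr) {
    $cells = [];
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

// Write CSV to standard output; separator, enclosure and escape are passed explicitly.
$out = fopen('php://output', 'w');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);
```

To write to a file instead, open a path of your choice with fopen() in place of php://output.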

PHP Simple HTML DOM Parser has a limit of 600KB.

define('MAX_FILE_SIZE', 600000);

You can, of course, edit your copy of the library and change this constant.
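If you would rather not patch the library, you could check the file size before handing it over. This is only a sketch: SIZE_LIMIT and exceedsLimit() are names made up here, and it assumes the library rejects input larger than the constant quoted above.

```php
<?php
// Mirrors the value quoted above; the constant name is local to this sketch.
const SIZE_LIMIT = 600000;

// Returns true when the file is larger than the given limit in bytes.
function exceedsLimit(string $path, int $limit = SIZE_LIMIT): bool
{
    $size = filesize($path);
    return $size !== false && $size > $limit;
}

// Demo: create a temporary file that is one byte over the limit.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($tmp, str_repeat('x', SIZE_LIMIT + 1));
$over = exceedsLimit($tmp);
unlink($tmp);

var_dump($over); // bool(true)
```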
