Handling </ inside a <script> tag

Update: html5lib (see the bottom of the question) seems to come close; I just need to understand better how it is used.

I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:

<script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
</script>

Most parsers stop parsing prematurely because HTML 4.01 ends script content as soon as it finds ETAGO (</) inside a <script> tag. HTML5, however, allows </ before the closing </script>. All of the parsers I have tried so far have either failed or are so poorly documented that I could not work out whether they handle this case.

My requirements:

  • Real parser, not regex hacks.
  • Ability to load full pages or HTML fragments.
  • Ability to pull script contents back out, selecting by the tag's id attribute.
Input:

    <script id="foo"><td>bar</td></script>
    

    Example of failing output (no closing </td>):

    <script id="foo"><td>bar</script>
    

    Some parsers and their results:


    DOMDocument (fails)

    Source:

    <?php
    
    header('Content-type: text/plain');
    $d = new DOMDocument;
    $d->loadHTML('<script id="foo"><td>bar</td></script>');
    echo $d->saveHTML();
    

    Output:

    Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><head><script id="foo"><td>bar</script></head></html>
    


    FluentDOM (fails)

    Source:

    <?php
    
    header('Content-type: text/plain');
    require_once 'FluentDOM/src/FluentDOM.php';
    $html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
    echo FluentDOM($html, 'text/html');
    

    Output:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><head></head><body><script id="foo"><td></script></body></html>
    


    phpQuery (fails)

    Source:

    <?php
    
    header('Content-type: text/plain');
    
    require_once 'phpQuery.php';
    
    phpQuery::newDocumentHTML(<<<EOF
    <script type="text/x-jquery-tmpl" id="foo">
    <td>test</td>
    </script>
    EOF
    );
    
    echo (string)pq('#foo');

    Output:

    <script type="text/x-jquery-tmpl" id="foo">
    <td>test
    </script>
    


    html5lib (passes)

    Possibly promising. Can I get at the contents of the script#foo tag?

    Source:

    <?php
    
    header('Content-type: text/plain');
    
    include 'HTML5/Parser.php';
    
    $html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
    $d = HTML5_Parser::parse($html);
    
    echo $d->saveHTML();
    

    Output:

    <html><head></head><body><script id="foo"><td></td></script></body></html>
    

    I had the same problem. Apparently you can hack your way through it by loading the document as XML and saving it as HTML:

    $d = new DOMDocument;
    $d->loadXML('<script id="foo"><td>bar</td></script>');
    echo $d->saveHTML();
    

    Of course, the markup must be well-formed XML for loadXML to work.
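To illustrate that caveat, here is a minimal sketch of checking whether loadXML accepted the markup. The sample input with an unclosed <br> is made up for the demonstration; libxml_use_internal_errors() is used so that parse errors are collected instead of being emitted as warnings.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$d = new DOMDocument;
// Hypothetical input: <br> is never closed, so this is not well-formed XML.
$ok = $d->loadXML('<script id="foo"><td>bar<br></td></script>');

$errors = libxml_get_errors();
libxml_clear_errors();

if (!$ok) {
    // loadXML() returned false; report what the parser complained about.
    foreach ($errors as $e) {
        echo 'line ', $e->line, ': ', trim($e->message), "\n";
    }
} else {
    echo $d->saveHTML();
}
```

If the templates are plain HTML rather than XHTML-style markup, this check will probably fail often, which is the main limit of the workaround.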


    Re: html5lib

    You click on the download tab and download the PHP version of the parser.

    You untar the archive in a local folder:

    tar -zxvf html5lib-php-0.1.tar.gz
    x html5lib-php-0.1/
    x html5lib-php-0.1/VERSION
    x html5lib-php-0.1/docs/
    ... etc
    

    You change directories and create a file named hello.php

    cd html5lib-php-0.1
    touch hello.php 
    

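    You place the following PHP code in hello.php:

    <?php
    // Assumed location of the parser inside the archive; adjust if yours differs.
    require_once 'library/HTML5/Parser.php';

    $html = '<html><head></head><body>
    <script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
    </script>
    </body></html>';
    $dom = HTML5_Parser::parse($html); // returns a DOMDocument
    var_dump($dom->saveXml());
    echo "\nDone\n";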
    

    You run hello.php from the command line

    php hello.php
    

    The parser will parse the document tree, and return a DOMDocument object, which can be manipulated as any other DOMDocument object.
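Since the result is an ordinary DOMDocument, pulling the script contents back out by id can be sketched with DOMXPath. The helper name scriptContentsById is hypothetical, and the query uses local-name() on the assumption that the parser may or may not place elements in a namespace. To keep the sketch runnable without html5lib installed, the demo document is built with loadXML; with html5lib you would pass the result of HTML5_Parser::parse($html) instead, in which case the script body should arrive as a single text node.

```php
<?php
// Hypothetical helper: return the contents of the <script> element with the
// given id, or null if there is none. Assumes $id contains no double quotes.
function scriptContentsById(DOMDocument $dom, $id)
{
    $xpath = new DOMXPath($dom);
    // local-name() matches the element whether or not it is namespaced.
    $nodes = $xpath->query('//*[local-name()="script" and @id="' . $id . '"]');
    if ($nodes->length === 0) {
        return null;
    }

    $out = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        // Text children hold the template verbatim; element children
        // (as produced by loadXML) are serialized back to markup.
        $out .= ($child instanceof DOMElement)
            ? $dom->saveXML($child)
            : $child->nodeValue;
    }
    return $out;
}

// Demo document built with loadXML so the sketch needs only the DOM extension.
$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo scriptContentsById($d, 'foo'), "\n";
```

An XPath query is used rather than getElementById because the latter only works when the document declares which attribute is an ID.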


    FluentDOM uses DOMDocument but suppresses loading notices and warnings. It does not have its own parser. You can add your own loaders (for example, one that uses html5lib).
