Parsing RDFa in HTML/XHTML?

I'm using the RDF::RDFa::Parser module in Perl to parse RDF data out of a website. On websites with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> it works, but on sites using XHTML, i.e. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, I get no output.

Test website: http://www.filmstarts.de/kritiken/186918.html

use RDF::RDFa::Parser;

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new_from_url($url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

(I'm the author of RDF::RDFa::Parser.)

It looks like the HTML parser used by the RDFa parser is failing on that page. (I'm also the maintainer of the HTML parser in question, so I can't shift the blame onto anyone else!) Thus, by the time the RDFa parsing starts, all it sees is an empty DOM tree.

The page is quite hideously invalid XHTML, yet I would still have expected the HTML parser to do a reasonable job. I've filed a bug report for you.

In the meantime, a workaround might be to build the XML::LibXML DOM tree outside of RDF::RDFa::Parser (perhaps using libxml's built-in HTML parser?). You could then pass that tree directly to the RDFa parser:

use RDF::RDFa::Parser;
use LWP::Simple qw(get);

my $url     = 'http://www.filmstarts.de/kritiken/186918.html';
my $xhtml   = get($url);
my $dom     = somehow_build_a_dom_tree($xhtml);  # hand-waving!!
my $options = RDF::RDFa::Parser::Config->tagsoup;
my $rdfa    = RDF::RDFa::Parser->new($dom, $url, $options);

print $rdfa->opengraph('image');
print $rdfa->opengraph('description');

I hope that helps!

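Update: here's a possible implementation of somehow_build_a_dom_tree ...

use XML::LibXML;

sub somehow_build_a_dom_tree {
    my ($html) = @_;              # the raw page source as a string
    my $p = XML::LibXML->new;
    $p->recover_silently(1);      # tolerate invalid markup without printing warnings
    return $p->load_html( string => $html );
}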
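
Before handing the tree to RDF::RDFa::Parser, it can be worth checking that the recovering parse really produced a non-empty DOM. The following is only a minimal sketch: the inline markup is a made-up stand-in for the real page (the example.com URL and the content values are assumptions), and it uses plain XML::LibXML calls rather than anything specific to the RDFa parser.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, deliberately malformed XHTML standing in for the real page.
my $xhtml = <<'HTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test</title>
<meta property="og:image" content="http://example.com/poster.jpg">
<meta property="og:description" content="An example description">
</head>
<body><p>Unclosed paragraph<div>stray div</body>
</html>
HTML

# Same recovering HTML parse as in somehow_build_a_dom_tree.
my $parser = XML::LibXML->new;
$parser->recover_silently(1);
my $dom = $parser->load_html( string => $xhtml );

# List every <meta> element that carries a property attribute.
my @meta = $dom->findnodes('//meta[@property]');
printf "%s = %s\n", $_->getAttribute('property'), $_->getAttribute('content')
    for @meta;
```

If this prints nothing for the real page source, the problem lies in building the DOM rather than in the RDFa extraction itself.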