Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm interested in hearing about other languages as well.

The story so far:

  • Python
      • Beautiful Soup
      • lxml
      • HTQL
      • Scrapy
      • Mechanize
  • Ruby
      • Nokogiri
      • Hpricot
      • Mechanize
      • scrAPI
      • scRUBYt!
      • wombat
      • Watir
  • .NET
      • Html Agility Pack
      • WatiN
  • Perl
      • WWW::Mechanize
      • Web-Scraper
  • Java
      • Tag Soup
      • HtmlUnit
      • Web-Harvest
      • jARVEST
      • jsoup
      • Jericho HTML Parser
  • JavaScript
      • request
      • cheerio
      • artoo
      • node-horseman
      • phantomjs
  • PHP
      • Goutte
      • htmlSQL
      • PHP Simple HTML DOM Parser
      • PHP Scraping with CURL
  • Most of them
      • Screen-Scraper

  • In the Ruby world, the equivalent of Beautiful Soup is why_the_lucky_stiff's Hpricot.


    In the .NET world, I recommend the HTML Agility Pack. It is not nearly as simple as some of the above options (like htmlSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.

    http://www.codeplex.com/htmlagilitypack


    BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it is a lot more Pythonic. If you want to try Ruby, BeautifulSoup was ported under the name RubyfulSoup, but it hasn't been updated in a while.

    Other useful tools are HTMLParser or sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter or exit a tag or encounter HTML text. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where building a DOM tree would be slow and expensive.
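
    As a rough illustration of the callback style described above, here is a minimal sketch using the standard library parser. It assumes Python 3, where HTMLParser lives in the html.parser module (sgmllib was removed in Python 3). The names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records href values from <a> tags as they stream past."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed the markup through the parser and return the collected hrefs."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()  # flush anything still buffered
    return parser.links


# The second <a> and the <p> are never closed; the parser still reports both links.
sample = '<p>See <a href="http://example.com/a">A</a> and <a href="/b">B'
print(extract_links(sample))
```

    No tree is built here: the parser only keeps whatever the handler methods choose to store, which is why this approach suits very large inputs.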

    Regular expressions are rarely necessary. BeautifulSoup accepts regular expressions, so if you need their power you can use them there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.
