How to use HTML Agility pack

How do I use the HTML Agility Pack?

My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.


First, install the HTMLAgilityPack nuget package into your project.

Then, as an example:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags=true;

// filePath is a path to a file containing the html
htmlDoc.Load(filePath);

// Use:  htmlDoc.LoadHtml(xmlString);  to load from a string (was htmlDoc.LoadXML(xmlString)

// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
    // Handle any parse errors as required

}
else
{

    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

        if (bodyNode != null)
        {
            // Do something with bodyNode
        }
    }
}

(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)

The HtmlDocument.Load() method also accepts a stream which is very useful in integrating with other stream oriented classes in the .NET framework. While HtmlEntity.DeEntitize() is another useful method for processing html entities correctly. (thanks Matthew)

HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, it provides the selectSingleNode and selectNodes methods that accept XPath expressions.

Pay attention to the HtmlDocument.Option?????? boolean properties. These control how the Load and LoadXML methods will process your HTML/XHTML.

There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.


I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.

  • HtmlAgilityPack Article Series
  • Introduction To The HtmlAgilityPack Library
  • Easily extracting links from a snippet of html with HtmlAgilityPack
  • The next article is 95% complete, I just have to write up explanations of the last few parts of the code I have written. If you are interested then I will try to remember to post here when I publish it.


    HtmlAgilityPack uses XPath syntax, and though many argues that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp

    To parse

    <h2>
      <a href="">Jack</a>
    </h2>
    <ul>
      <li class="tel">
        <a href="">81 75 53 60</a>
      </li>
    </ul>
    <h2>
      <a href="">Roy</a>
    </h2>
    <ul>
      <li class="tel">
        <a href="">44 52 16 87</a>
      </li>
    </ul>
    

    I did this:

    string url = "http://website.com";
    var Webget = new HtmlWeb();
    var doc = Webget.Load(url);
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
    {
      names.Add(node.ChildNodes[0].InnerHtml);
    }
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//li[@class='tel']//a"))
    {
      phones.Add(node.ChildNodes[0].InnerHtml);
    }
    
    链接地址: http://www.djcxy.com/p/83346.html

    上一篇: jquery或JS创建子元素并分配一个ID

    下一篇: 如何使用HTML Agility Pack