Introduction
HTML Agility Pack (HAP) is a robust and flexible .NET library designed for parsing and manipulating HTML documents. This article provides an overview of its capabilities, loading mechanisms, selector usage, node manipulation, traversal, and attribute handling.
Official Resources
- Official Website: http://html-agility-pack.net/
- NuGet Package: https://www.nuget.org/packages/HtmlAgilityPack/
- GitHub Repository: https://github.com/zzzprojects/html-agility-pack
Usage and Examples
1. Loading HTML Content
Before parsing, content must be loaded. The library supports four loading methods:
(1) From File
Load HTML directly from a file on disk:
var filePath = @"D:\data.html";
var document = new HtmlDocument();
document.Load(filePath);
Replace filePath with the actual path to your HTML file. Once loaded, the document is ready for parsing.
(2) From String
Parse an HTML string directly:
var rawHtml = @"<!DOCTYPE html>
<html>
<body>
<h1>This is <b>bold</b> heading</h1>
<p>This is <u>underlined</u> paragraph</p>
<h2>This is <i>italic</i> heading</h2>
</body>
</html>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(rawHtml);
var bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
This approach is commonly used in web scraping when the HTML content has already been retrieved as a string from an HTTP response.
(3) From Web (URL)
Load directly from a URL using the HtmlWeb class:
var targetUrl = @"http://html-agility-pack.net/";
HtmlWeb webFetcher = new HtmlWeb();
var loadedDocument = webFetcher.Load(targetUrl);
This method simplifies the process when only a URL is available, eliminating the need for manual HTTP request code.
(4) From Browser
This method is useful for pages that rely on dynamic JavaScript rendering. Note that it requires a WinForms environment.
string dynamicUrl = "http://html-agility-pack.net/from-browser";

// Option 1: inspect the WebBrowser control to decide when the page is ready
var webLoader1 = new HtmlWeb();
var docFromBrowser1 = webLoader1.LoadFromBrowser(dynamicUrl, o =>
{
    var browserInstance = (WebBrowser) o;
    // Wait until dynamic element is populated
    return !string.IsNullOrEmpty(browserInstance.Document.GetElementById("uiDynamicText").InnerText);
});
var extractedText1 = docFromBrowser1.DocumentNode.SelectSingleNode("//div[@id='uiDynamicText']").InnerText;

// Option 2: inspect the rendered HTML string instead
var webLoader2 = new HtmlWeb();
var docFromBrowser2 = webLoader2.LoadFromBrowser(dynamicUrl, html =>
{
    // Wait until specific placeholder is replaced
    return !html.Contains("<div id=\"uiDynamicText\"></div>");
});
var extractedText2 = docFromBrowser2.DocumentNode.SelectSingleNode("//div[@id='uiDynamicText']").InnerText;

Console.WriteLine("Text 1: " + extractedText1);
Console.WriteLine("Text 2: " + extractedText2);
2. Node Selectors
HAP provides two primary methods for selecting nodes. Both use XPath expressions.
SelectNodes()
Returns a collection of nodes matching the given XPath expression. If no match is found, the result is null.
Example 1: Retrieve all input elements inside td elements.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var inputNodes = htmlDoc.DocumentNode.SelectNodes("//td/input");
To read a value from the first match, use the null-conditional operator (?.) so that First() is only called when the collection is not null. First() comes from System.Linq and returns an HtmlNode, not a string:
string firstValue = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    ?.First().GetAttributeValue("value", "");
Alternatively, check for null explicitly before accessing elements:
var matchedNodes = htmlDoc.DocumentNode.SelectNodes("//td/input");
if (matchedNodes != null)
{
    HtmlNode first = matchedNodes.First();
}
Example 2: Select all div elements with a specific class.
var newsItems = htmlDoc.DocumentNode.SelectNodes(@"//div[@class='news-item']");
You can adapt this pattern to target other elements (e.g., span, h1, img) and other attributes (e.g., id).
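As a rough sketch of that adaptation, the snippet below selects an element by its id and collects the src of every img element. The markup in SampleHtml and the names XPathVariations, ImageSources, and TextById are invented for this illustration and are not part of HAP.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathVariations
{
    // Hypothetical markup used only for this illustration.
    public const string SampleHtml =
        "<html><body>" +
        "<h1 id='page-title'>News</h1>" +
        "<img src='a.png'><img src='b.png'>" +
        "<span class='tag'>local</span>" +
        "</body></html>";

    // Returns the src value of every img element that carries a src attribute.
    public static List<string> ImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();
        // SelectNodes returns null when nothing matches, so guard before iterating.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
        {
            foreach (var img in images)
            {
                sources.Add(img.GetAttributeValue("src", ""));
            }
        }
        return sources;
    }

    // Returns the text of the element with the given id, or null if no such element exists.
    public static string TextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node?.InnerText;
    }
}
```

Only the XPath expression changes between the variants: the element name selects the tag, and the predicate in square brackets filters on an attribute.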
SelectSingleNode()
Returns the first node that matches the XPath expression, or null if no matching node exists.
Example 1:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// input is a void element with no inner HTML, so read OuterHtml to see its markup
string inputMarkup = htmlDoc.DocumentNode
    .SelectSingleNode("//td/input")?.OuterHtml;
Example 2:
var viewportNode = htmlDoc.DocumentNode.SelectSingleNode(@"//div[@class='viewport']");
Combining with LINQ
LINQ enables more expressive filtering based on attributes, content, or other criteria:
// model.Url is assumed to hold the address of the page to fetch
HtmlWeb web = new HtmlWeb();
var document = web.Load(model.Url);
var contentDiv = document.DocumentNode
    .Descendants("div")
    .FirstOrDefault(m => m.GetAttributeValue("class", "") == "text" &&
                         m.GetAttributeValue("dir", "") == "ltr");
// SelectNodes returns null when nothing matches, so use ?. before the LINQ call
var linkCell = htmlDoc.DocumentNode
    .SelectNodes(@"//td[@colspan='3']")
    ?.FirstOrDefault(n => n.InnerHtml.Contains("<a href=") && n.InnerHtml.Contains("<p align="));
3. Manipulating Nodes
After selecting nodes, you can perform various operations.
InnerHtml: Gets or sets the inner HTML of a node.
var headingNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");
if (headingNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in headingNodes)
    {
        Console.WriteLine(node.InnerHtml);
    }
}
InnerText: Gets the text content of a node with the HTML tags removed.
var headingNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");
if (headingNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in headingNodes)
    {
        Console.WriteLine(node.InnerText);
    }
}
AppendChild: Adds a new child node at the end of a node's children.
var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
HtmlNode newHeading = HtmlNode.CreateNode("<h2> This is h2 heading</h2>");
body.AppendChild(newHeading);
Remove: Removes a specific node from the document.
var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
HtmlNode nodeToRemove = body.ChildNodes[1];
nodeToRemove.Remove();
4. Traversing Nodes
ChildNodes: Returns all immediate children of a node.
var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
HtmlNodeCollection children = body.ChildNodes;
foreach (var child in children)
{
    if (child.NodeType == HtmlNodeType.Element)
    {
        Console.WriteLine(child.OuterHtml);
    }
}
FirstChild: Retrieves the first child node.
var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
HtmlNode first = body.FirstChild;
Console.WriteLine(first.OuterHtml);
LastChild: Retrieves the last child node (usage is analogous to FirstChild).
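A minimal sketch of LastChild follows, assuming the goal is the last element rather than the last node. When the markup is indented, the last child of body is often a whitespace text node, so the helper steps backwards with PreviousSibling until it reaches an element. The names LastChildExample and LastElementName are hypothetical.

```csharp
using HtmlAgilityPack;

public static class LastChildExample
{
    // Returns the tag name of the last element child of <body>,
    // or null if there is no body or it has no element children.
    public static string LastElementName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return null;
        }

        // LastChild may be a text or comment node; walk back to the nearest element.
        HtmlNode last = body.LastChild;
        while (last != null && last.NodeType != HtmlNodeType.Element)
        {
            last = last.PreviousSibling;
        }
        return last?.Name;
    }
}
```

The same loop works from the other end with FirstChild and NextSibling.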
Descendants(): Iterates over all descendant nodes (children, grandchildren, etc.).
var root = htmlDoc.DocumentNode.SelectSingleNode("//body");
foreach (var descendant in root.Descendants())
{
    if (descendant.NodeType == HtmlNodeType.Element)
    {
        Console.WriteLine(descendant.Name);
    }
}
Descendants(String): Returns descendants with a specific tag name.
var root = htmlDoc.DocumentNode.SelectSingleNode("//body");
foreach (var h2Node in root.Descendants("h2"))
{
    if (h2Node.NodeType == HtmlNodeType.Element)
    {
        Console.WriteLine(h2Node.Name);
    }
}
5. Working with Attributes
SetAttributeValue: Sets or updates an attribute on a node.
var heading = htmlDoc.DocumentNode.SelectSingleNode("//h1");
// SetAttributeValue creates the attribute when it does not exist yet,
// so no separate Attributes.Append call is needed.
heading.SetAttributeValue("style", "color:blue");
GetAttributeValue: Retrieves the value of a specified attribute. Useful in scraping scenarios to extract URLs or metadata.
// Extract audio source
var sourceElement = htmlDoc.DocumentNode.SelectSingleNode(@"//source[@type='audio/mpeg']");
var audioSrc = sourceElement.GetAttributeValue("src", "");
// Extract input value
var titleInput = htmlDoc.DocumentNode.SelectSingleNode(@"//input[@name='title']");
var titleValue = titleInput.GetAttributeValue("value", "");
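The second argument of GetAttributeValue is the value returned when the attribute is missing. The sketch below wraps that behavior in a helper that also handles a missing node, since SelectSingleNode returns null when nothing matches. AttributeReader and ReadAttribute are hypothetical names introduced for illustration.

```csharp
using HtmlAgilityPack;

public static class AttributeReader
{
    // Returns the attribute value of the first node matching xpath,
    // or fallback when either the node or the attribute is missing.
    public static string ReadAttribute(string html, string xpath, string attribute, string fallback)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? fallback : node.GetAttributeValue(attribute, fallback);
    }
}
```

For example, ReadAttribute(html, "//input[@name='title']", "value", "") yields an empty string instead of throwing when the page has no such input.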