Installation
Install the library using pip:
pip install lxml
XPath Core Concepts
Node Types
XPath defines seven node types: element, attribute, text, namespace, processing instruction, comment, and the document (root) node. An XML document is represented as a node tree, with the root of the tree being the document or root node.
Consider this sample XML document:
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
<name language="en">Sample Product</name>
<manufacturer>Example Corp</manufacturer>
<release_date>2023</release_date>
<cost>49.95</cost>
</product>
</catalog>
Examples of nodes from the document above:
<catalog> (root element node)
<manufacturer>Example Corp</manufacturer> (element node)
language="en" (attribute node)
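The node types above can be inspected with lxml. The following is a minimal sketch, assuming the sample document is held in a bytes constant (the name CATALOG_XML is only illustrative and not part of lxml):

```python
from lxml import etree

# Sample document from above; bytes are used because lxml rejects
# str input that carries an XML encoding declaration.
CATALOG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
<name language="en">Sample Product</name>
<manufacturer>Example Corp</manufacturer>
<release_date>2023</release_date>
<cost>49.95</cost>
</product>
</catalog>"""

# fromstring() returns the root element of the parsed tree
root = etree.fromstring(CATALOG_XML)

# Element node
manufacturer = root.xpath('/catalog/product/manufacturer')[0]
# Attribute node (lxml returns its value as a string)
language = root.xpath('/catalog/product/name/@language')[0]
# Text node
name_text = root.xpath('/catalog/product/name/text()')[0]

print(root.tag)                              # catalog
print(manufacturer.tag, manufacturer.text)   # manufacturer Example Corp
print(language)                              # en
print(name_text)                             # Sample Product
```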
Node Relationships
Parent
Every element and attribute has one parent. In the example above, the product element is the parent of name, manufacturer, release_date, and cost.
Children
An element node may have zero, one, or many children. In the example, name, manufacturer, release_date, and cost are all children of the product element.
Siblings
Nodes that share the same parent are siblings. name, manufacturer, release_date, and cost are all siblings.
Ancestor
A node's parent, its parent's parent, and so on. For the name element, its ancestors are the product and catalog elements.
Descendant
A node's children, their children, etc. The descendants of catalog are product, name, manufacturer, release_date, and cost.
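These relationships correspond to XPath axes (parent, child, following-sibling, ancestor, descendant). A short sketch using lxml, assuming the sample document is stored in an illustrative constant named CATALOG_XML:

```python
from lxml import etree

# Sample document from above (CATALOG_XML is an illustrative name)
CATALOG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
<name language="en">Sample Product</name>
<manufacturer>Example Corp</manufacturer>
<release_date>2023</release_date>
<cost>49.95</cost>
</product>
</catalog>"""

root = etree.fromstring(CATALOG_XML)
name = root.xpath('//name')[0]
product = root.xpath('//product')[0]

# Parent of <name>
parent_tag = name.xpath('parent::*')[0].tag
# Children of <product>
children = [el.tag for el in product.xpath('child::*')]
# Siblings that come after <name>
later_siblings = [el.tag for el in name.xpath('following-sibling::*')]
# Ancestors of <name>: product and catalog
ancestors = [el.tag for el in name.xpath('ancestor::*')]
# Descendants of <catalog>
descendants = [el.tag for el in root.xpath('descendant::*')]

print(parent_tag)
print(children)
print(later_siblings)
print(ancestors)
print(descendants)
```

lxml elements also offer Python-side equivalents such as getparent(), iterancestors(), and iterdescendants() when an XPath expression is not needed.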
Selecting Nodes
XPath uses path expressions to select nodes within an XML document. Nodes are selected by following steps along a path.
Key path expressions:
| Expression | Description |
|---|---|
| nodename | Selects all child elements of the current node that are named nodename. |
| / | Selects from the root node. |
| // | Selects matching nodes at any depth below the starting point, regardless of position. At the start of a path, the starting point is the root of the document. |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
| @ | Selects attributes. |
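Each expression in the table can be tried against the sample catalog document. This is a sketch under the assumption that the document is held in an illustrative constant, CATALOG_XML:

```python
from lxml import etree

# Sample document from above (CATALOG_XML is an illustrative name)
CATALOG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
<name language="en">Sample Product</name>
<manufacturer>Example Corp</manufacturer>
<release_date>2023</release_date>
<cost>49.95</cost>
</product>
</catalog>"""

root = etree.fromstring(CATALOG_XML)
product = root.xpath('/catalog/product')[0]

# nodename: child elements called "name", relative to <product>
names = [el.text for el in product.xpath('name')]
# /: absolute path starting at the root
costs = root.xpath('/catalog/product/cost/text()')
# //: match at any depth
manufacturers = root.xpath('//manufacturer/text()')
# .: the current node itself
current_tag = product.xpath('.')[0].tag
# ..: the parent of the current node
parent_tag = product.xpath('..')[0].tag
# @: attribute selection
languages = root.xpath('//name/@language')

print(names)          # ['Sample Product']
print(costs)          # ['49.95']
print(manufacturers)  # ['Example Corp']
print(current_tag)    # product
print(parent_tag)     # catalog
print(languages)      # ['en']
```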
Predicates
Predicates are used to find a specific node or a node containing a particular value. They are enclosed in square brackets [].
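A brief sketch of common predicate forms follows. To make the predicates meaningful, it assumes a hypothetical catalog with two products; the second product and the constant name TWO_PRODUCTS_XML are invented for illustration only.

```python
from lxml import etree

# Hypothetical two-product catalog, used only for this illustration
TWO_PRODUCTS_XML = b"""<catalog>
<product><name language="en">Sample Product</name><cost>49.95</cost></product>
<product><name language="de">Beispielprodukt</name><cost>19.50</cost></product>
</catalog>"""

root = etree.fromstring(TWO_PRODUCTS_XML)

# Positional predicate: XPath positions start at 1, not 0
first_name = root.xpath('/catalog/product[1]/name/text()')
# last() selects the final matching element
last_name = root.xpath('/catalog/product[last()]/name/text()')
# Attribute-value predicate
english_names = root.xpath('//name[@language="en"]/text()')
# Value comparison on a child element (cost is converted to a number)
over_twenty = root.xpath('/catalog/product[cost > 20]/name/text()')

print(first_name)     # ['Sample Product']
print(last_name)      # ['Beispielprodukt']
print(english_names)  # ['Sample Product']
print(over_twenty)    # ['Sample Product']
```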
Wildcards
XPath wildcards can select XML nodes whose names are not known in advance.
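The standard wildcards are * (any element) and @* (any attribute). A minimal sketch against the sample catalog, assuming it is stored in an illustrative constant named CATALOG_XML:

```python
from lxml import etree

# Sample document from above (CATALOG_XML is an illustrative name)
CATALOG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
<name language="en">Sample Product</name>
<manufacturer>Example Corp</manufacturer>
<release_date>2023</release_date>
<cost>49.95</cost>
</product>
</catalog>"""

root = etree.fromstring(CATALOG_XML)

# *: every child element of <product>, whatever its name
child_tags = [el.tag for el in root.xpath('/catalog/product/*')]
# //*: every element in the document
all_tags = [el.tag for el in root.xpath('//*')]
# [@*]: elements that carry at least one attribute
tags_with_attributes = [el.tag for el in root.xpath('//*[@*]')]
# //@*: every attribute value in the document
attribute_values = root.xpath('//@*')

print(child_tags)            # ['name', 'manufacturer', 'release_date', 'cost']
print(all_tags)
print(tags_with_attributes)  # ['name']
print(attribute_values)      # ['en']
```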
Selecting Multiple Paths
Use the | operator within a path expression to select several paths.
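As an illustration, the union operator can combine two independent paths in one query. The sketch below assumes the sample catalog is stored in an illustrative constant named CATALOG_XML:

```python
from lxml import etree

# Sample document from above (CATALOG_XML is an illustrative name)
CATALOG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
<name language="en">Sample Product</name>
<manufacturer>Example Corp</manufacturer>
<release_date>2023</release_date>
<cost>49.95</cost>
</product>
</catalog>"""

root = etree.fromstring(CATALOG_XML)

# Select both <name> and <cost> elements with a single expression
union_tags = [el.tag for el in root.xpath('//name | //cost')]
# The same idea applied to text nodes
union_texts = root.xpath('//name/text() | //cost/text()')

print(union_tags)
print(union_texts)
```

The result is a single node list containing the matches of both paths, with duplicates removed.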
Practical Example with lxml
# -*- coding: utf-8 -*-
from lxml import etree
html_content = """
<div class="container">
<i class="icon icon-previous" id="prev"></i>
<a rel="nofollow" href="/home" id="main_link">News Portal</a>
<ul id="navigation">
<li><a rel="nofollow" href="https://news.local" title="Local">Local</a></li>
<li><a rel="nofollow" href="https://news.global" title="Global">Global</a></li>
<li><a rel="nofollow" href="https://news.defense" title="Defense">Defense</a></li>
<li><a rel="nofollow" href="https://news.gallery" title="Gallery">Gallery</a></li>
<li><a rel="nofollow" href="https://news.community" title="Community">Community</a></li>
<li><a rel="nofollow" href="https://news.entertainment" title="Entertainment">Entertainment</a></li>
<li><a rel="nofollow" href="https://news.technology" title="Technology">Technology</a></li>
<li><a rel="nofollow" href="https://news.sports" title="Sports">Sports</a></li>
<li><a rel="nofollow" href="https://news.finance" title="Finance">Finance</a></li>
<li><a rel="nofollow" href="https://news.automotive" title="Automotive">Automotive</a></li>
</ul>
<i class="icon icon-list" id="list_menu"></i>
</div>
"""
# Parse the HTML string
parsed_html = etree.HTML(html_content)
# Retrieve href attributes from all anchor tags
all_links = parsed_html.xpath('//a/@href')
print(all_links)
# Get the href from the anchor tag with id="main_link"
specific_link = parsed_html.xpath('//a[@id="main_link"]/@href')
print(specific_link)
# Extract text content from anchor tags within list items
link_texts = parsed_html.xpath('//li/a/text()')
print(link_texts)
Expected output:
['/home', 'https://news.local', 'https://news.global', 'https://news.defense', 'https://news.gallery', 'https://news.community', 'https://news.entertainment', 'https://news.technology', 'https://news.sports', 'https://news.finance', 'https://news.automotive']
['/home']
['Local', 'Global', 'Defense', 'Gallery', 'Community', 'Entertainment', 'Technology', 'Sports', 'Finance', 'Automotive']