Applying XPath Expressions with Python's lxml Library

Installation

Install the library using pip:

pip install lxml

XPath Core Concepts

Node Types

XPath defines seven node types: element, attribute, text, namespace, processing instruction, comment, and the document (root) node. An XML document is represented as a node tree, with the root of the tree being the document or root node.

Consider this sample XML document:

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<product>
  <name language="en">Sample Product</name>
  <manufacturer>Example Corp</manufacturer>
  <release_date>2023</release_date>
  <cost>49.95</cost>
</product>
</catalog>

Examples of nodes from the document above:

  • <catalog> (Root element node; the document node is its parent)
  • <manufacturer>Example Corp</manufacturer> (Element node)
  • language="en" (Attribute node)
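
To make the node types concrete, here is a minimal sketch that parses the sample catalog document and selects one element node, one attribute node, and one text node. The variable names are illustrative, and the XML declaration is left out because lxml refuses to parse a Python str that carries an encoding declaration (passing bytes would be an alternative).

```python
from lxml import etree

# The sample catalog document, without the XML declaration
catalog_xml = """
<catalog>
<product>
  <name language="en">Sample Product</name>
  <manufacturer>Example Corp</manufacturer>
  <release_date>2023</release_date>
  <cost>49.95</cost>
</product>
</catalog>
"""

# fromstring() returns the root element, here <catalog>
root = etree.fromstring(catalog_xml)

# Element node: xpath() returns a list of element objects
manufacturer = root.xpath('/catalog/product/manufacturer')[0]
print(manufacturer.tag)   # manufacturer

# Attribute node: the attribute value comes back as a string
language = root.xpath('/catalog/product/name/@language')
print(language)           # ['en']

# Text node: the character data inside <name>
name_text = root.xpath('/catalog/product/name/text()')
print(name_text)          # ['Sample Product']
```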

Node Relationships

Parent Every element and attribute has one parent. In the example above, the product element is the parent of name, manufacturer, release_date, and cost.

Children An element node may have zero, one, or many children. In the example, name, manufacturer, release_date, and cost are all children of the product element.

Siblings Nodes that share the same parent are siblings. name, manufacturer, release_date, and cost are all siblings.

Ancestor A node's parent, its parent's parent, and so on. For the name element, its ancestors are the product and catalog elements.

Descendant A node's children, their children, etc. The descendants of catalog are product, name, manufacturer, release_date, and cost.
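
The relationships described above can be checked against the sample catalog document. The sketch below uses lxml's getparent() method together with the XPath axes ancestor::, descendant::, and following-sibling::. The variable names are illustrative only.

```python
from lxml import etree

catalog_xml = """
<catalog>
<product>
  <name language="en">Sample Product</name>
  <manufacturer>Example Corp</manufacturer>
  <release_date>2023</release_date>
  <cost>49.95</cost>
</product>
</catalog>
"""

root = etree.fromstring(catalog_xml)
name = root.xpath('//name')[0]
product = root.xpath('/catalog/product')[0]

# Parent: getparent() returns the enclosing element
parent_tag = name.getparent().tag
print(parent_tag)        # product

# Children: iterating over an element yields its child elements
child_tags = [child.tag for child in product]
print(child_tags)        # ['name', 'manufacturer', 'release_date', 'cost']

# Siblings that come after <name> under the same parent
sibling_tags = [el.tag for el in name.xpath('following-sibling::*')]
print(sibling_tags)      # ['manufacturer', 'release_date', 'cost']

# Ancestors of <name>
ancestor_tags = [el.tag for el in name.xpath('ancestor::*')]
print(ancestor_tags)     # ['catalog', 'product']

# Descendants of <catalog>
descendant_tags = [el.tag for el in root.xpath('descendant::*')]
print(descendant_tags)   # ['product', 'name', 'manufacturer', 'release_date', 'cost']
```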

Selecting Nodes

XPath uses path expressions to select nodes within an XML document. Nodes are selected by following steps along a path.

Key path expressions:

Expression   Description
nodename     Selects the child elements of the current node that are named "nodename".
/            Selects from the root node.
//           Selects matching nodes at any depth, starting from the root when it begins an expression and from the current node otherwise.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
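
Each of these expressions can be tried against the sample catalog document. The following is a rough sketch, with illustrative variable names, that runs one query per expression.

```python
from lxml import etree

catalog_xml = """
<catalog>
<product>
  <name language="en">Sample Product</name>
  <manufacturer>Example Corp</manufacturer>
  <release_date>2023</release_date>
  <cost>49.95</cost>
</product>
</catalog>
"""

root = etree.fromstring(catalog_xml)   # the <catalog> element

# nodename: child elements of the current node with that name
product = root.xpath('product')[0]
print(product.tag)                     # product

# /: an absolute path that starts at the root
abs_name = root.xpath('/catalog/product/name/text()')
print(abs_name)                        # ['Sample Product']

# //: match at any depth
any_cost = root.xpath('//cost/text()')
print(any_cost)                        # ['49.95']

# . and ..: the current node and its parent
current_tag = product.xpath('.')[0].tag
parent_tag = product.xpath('..')[0].tag
print(current_tag, parent_tag)         # product catalog

# @: attribute selection
languages = root.xpath('//name/@language')
print(languages)                       # ['en']
```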

Predicates Predicates are used to find a specific node or a node containing a particular value. They are enclosed in square brackets [].
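
Predicates are easier to see with more than one product. The sketch below uses a made-up three-product catalog (the extra products and their prices are assumptions for illustration, not part of the sample document) and filters by position, by attribute, and by child value.

```python
from lxml import etree

# Hypothetical catalog with three products, for illustration only
shop_xml = """
<catalog>
  <product><name language="en">Sample Product</name><cost>49.95</cost></product>
  <product><name language="fr">Produit Exemple</name><cost>19.50</cost></product>
  <product><name>Unlabeled Product</name><cost>75.00</cost></product>
</catalog>
"""

root = etree.fromstring(shop_xml)

# Position: XPath positions start at 1, not 0
first = root.xpath('/catalog/product[1]/name/text()')
print(first)           # ['Sample Product']

# last() picks the final matching node
last = root.xpath('/catalog/product[last()]/name/text()')
print(last)            # ['Unlabeled Product']

# Attribute presence: names that have a language attribute
with_language = root.xpath('//name[@language]/text()')
print(with_language)   # ['Sample Product', 'Produit Exemple']

# Attribute value
french = root.xpath('//name[@language="fr"]/text()')
print(french)          # ['Produit Exemple']

# Child value: products whose cost is greater than 40
expensive = root.xpath('/catalog/product[cost > 40]/name/text()')
print(expensive)       # ['Sample Product', 'Unlabeled Product']
```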

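Wildcards Wildcards match nodes whose names are not known in advance: * matches any element node, @* matches any attribute node, and node() matches a node of any kind.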
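
As a small illustration of wildcards, the sketch below runs *, @*, and node() against the sample catalog document. The variable names are illustrative.

```python
from lxml import etree

catalog_xml = """
<catalog>
<product>
  <name language="en">Sample Product</name>
  <manufacturer>Example Corp</manufacturer>
  <release_date>2023</release_date>
  <cost>49.95</cost>
</product>
</catalog>
"""

root = etree.fromstring(catalog_xml)

# *: every child element of <product>, whatever its name
child_tags = [el.tag for el in root.xpath('/catalog/product/*')]
print(child_tags)         # ['name', 'manufacturer', 'release_date', 'cost']

# //*: every element in the document
element_count = len(root.xpath('//*'))
print(element_count)      # 6

# @*: every attribute, whatever its name
all_attributes = root.xpath('//@*')
print(all_attributes)     # ['en']

# node(): any kind of node; <name> contains a single text node
name_nodes = root.xpath('//name/node()')
print(name_nodes)         # ['Sample Product']
```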

Selecting Multiple Paths Use the | operator to combine the results of two or more path expressions into a single result.
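
A short sketch of the | operator against the sample catalog document follows. The variable names are illustrative.

```python
from lxml import etree

catalog_xml = """
<catalog>
<product>
  <name language="en">Sample Product</name>
  <manufacturer>Example Corp</manufacturer>
  <release_date>2023</release_date>
  <cost>49.95</cost>
</product>
</catalog>
"""

root = etree.fromstring(catalog_xml)

# Union of two element paths
picked = root.xpath('//manufacturer | //release_date')
picked_tags = [el.tag for el in picked]
print(picked_tags)    # ['manufacturer', 'release_date']

# Union of two text paths
values = root.xpath('//name/text() | //cost/text()')
print(values)         # ['Sample Product', '49.95']
```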

Practical Example with lxml

# -*- coding: utf-8 -*-

from lxml import etree

html_content = """
<div class="container">
    <i class="icon icon-previous" id="prev"></i>
    <a rel="nofollow" href="/home" id="main_link">News Portal</a>
    <ul id="navigation">
        <li><a rel="nofollow" href="https://news.local" title="Local">Local</a></li>
        <li><a rel="nofollow" href="https://news.global" title="Global">Global</a></li>
        <li><a rel="nofollow" href="https://news.defense" title="Defense">Defense</a></li>
        <li><a rel="nofollow" href="https://news.gallery" title="Gallery">Gallery</a></li>
        <li><a rel="nofollow" href="https://news.community" title="Community">Community</a></li>
        <li><a rel="nofollow" href="https://news.entertainment" title="Entertainment">Entertainment</a></li>
        <li><a rel="nofollow" href="https://news.technology" title="Technology">Technology</a></li>
        <li><a rel="nofollow" href="https://news.sports" title="Sports">Sports</a></li>
        <li><a rel="nofollow" href="https://news.finance" title="Finance">Finance</a></li>
        <li><a rel="nofollow" href="https://news.automotive" title="Automotive">Automotive</a></li>
    </ul>
    <i class="icon icon-list" id="list_menu"></i>
</div>
"""

# Parse the HTML string
parsed_html = etree.HTML(html_content)

# Retrieve href attributes from all anchor tags
all_links = parsed_html.xpath('//a/@href')
print(all_links)

# Get the href from the anchor tag with id="main_link"
specific_link = parsed_html.xpath('//a[@id="main_link"]/@href')
print(specific_link)

# Extract text content from anchor tags within list items
link_texts = parsed_html.xpath('//li/a/text()')
print(link_texts)

Expected output:

['/home', 'https://news.local', 'https://news.global', 'https://news.defense', 'https://news.gallery', 'https://news.community', 'https://news.entertainment', 'https://news.technology', 'https://news.sports', 'https://news.finance', 'https://news.automotive']
['/home']
['Local', 'Global', 'Defense', 'Gallery', 'Community', 'Entertainment', 'Technology', 'Sports', 'Finance', 'Automotive']
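
The queries above return attributes and text as separate lists. When each link's text needs to stay paired with its href, one option is to select the anchor elements themselves and read both values from each element. The sketch below uses a shortened, two-item copy of the navigation list so that it runs on its own; the variable names are illustrative.

```python
from lxml import etree

# Shortened copy of the navigation markup, for illustration only
html_snippet = """
<ul id="navigation">
    <li><a rel="nofollow" href="https://news.local" title="Local">Local</a></li>
    <li><a rel="nofollow" href="https://news.global" title="Global">Global</a></li>
</ul>
"""

tree = etree.HTML(html_snippet)

links = []
# Select the <a> elements, then read text and attribute from each one
for anchor in tree.xpath('//ul[@id="navigation"]/li/a'):
    links.append((anchor.text, anchor.get('href')))

print(links)
# [('Local', 'https://news.local'), ('Global', 'https://news.global')]
```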

Tags: python web scraping xpath lxml HTML Parsing

Posted on Wed, 20 May 2026 18:13:14 +0000 by alego