Practical XPath Parsing with Python's lxml Library

The lxml library is a Pythonic wrapper around the C libraries libxml2 and libxslt, which gives it fast parsing of HTML and XML documents. Its full support for XPath 1.0 makes it a good fit for targeted data extraction.

Setup

Install the package via pip:

pip install lxml

Initializing Parser Objects

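There are two common ways to build a parsed tree. etree.parse() reads from a file and returns an ElementTree, while etree.HTML() parses a string and returns the root element:

from lxml import etree

# Method 1: Load from a local file.
# The default parser expects well-formed XML, so pass an HTMLParser for HTML files.
tree = etree.parse('document.html', etree.HTMLParser())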

# Method 2: Parse from string content
html_string = "<html><body><div>Example</div></body></html>"
dom = etree.HTML(html_string)
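
To make the difference between the two entry points concrete, here is a minimal sketch that parses the same markup both ways. An in-memory io.BytesIO buffer stands in for a file on disk (an assumption, so the snippet runs without any local files), and the variable names are purely illustrative.

```python
import io
from lxml import etree

markup = b"<html><body><div>Example</div></body></html>"

# etree.parse() accepts a filename or a file-like object and returns an ElementTree.
# BytesIO is a stand-in for a real file here; HTMLParser tolerates messy HTML.
tree_from_file = etree.parse(io.BytesIO(markup), etree.HTMLParser())
root_from_file = tree_from_file.getroot()

# etree.HTML() takes the markup as a string and returns the root element directly.
root_from_string = etree.HTML(markup.decode("utf-8"))

print(root_from_file.tag, root_from_string.tag)
```

Both objects expose xpath(), so the rest of the examples work with either one.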

Core XPath Syntax

XPath expressions navigate the document tree and select sets of nodes. The core building blocks are:

  • // - Recursive descent: searches through all descendant nodes
  • / - Direct child: selects immediate children only
  • @ - Attribute access: retrieves attribute values
  • text() - Extracts text nodes
  • . - Current context node
  • .. - Parent node
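
The operators listed above can be tried out on a small document. The following is only a sketch: the markup and the ids 'outer' and 'inner' are made up for illustration.

```python
from lxml import etree

# Hypothetical sample markup with one div nested inside another.
sample = etree.HTML(
    "<html><body>"
    "<div id='outer'><p>One</p><div id='inner'><p>Two</p></div></div>"
    "</body></html>"
)

all_p = sample.xpath("//p")                      # //     : p elements at any depth
child_p = sample.xpath("//div[@id='outer']/p")   # /      : immediate p children only
ids = sample.xpath("//div/@id")                  # @      : attribute values
texts = sample.xpath("//p/text()")               # text() : text nodes

inner = sample.xpath("//div[@id='inner']")[0]
same_node = inner.xpath(".")[0]                  # .      : the context node itself
parent = inner.xpath("..")[0]                    # ..     : parent of the context node

print(len(all_p), len(child_p), ids, texts, parent.get("id"))
```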

Location Path Examples

# All three expressions select div elements, but with different scope
direct_path = dom.xpath('/html/body/div')      # Strict hierarchy
flexible_path = dom.xpath('/html//div')        # Any depth under html
anywhere_path = dom.xpath('//div')             # All divs in document
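
To see how the scope of these three paths differs in practice, consider a page where one div is nested inside another. The markup below is an invented example, not taken from any real site.

```python
from lxml import etree

# Hypothetical page: one top-level div containing a nested div.
page = etree.HTML(
    "<html><body>"
    "<div>top<div>nested</div></div>"
    "</body></html>"
)

direct = page.xpath("/html/body/div")   # only divs that are direct children of body
under_html = page.xpath("/html//div")   # divs at any depth below html
anywhere = page.xpath("//div")          # divs anywhere in the document

print(len(direct), len(under_html), len(anywhere))  # 1 2 2
```

Because html is the root element, the last two expressions match the same nodes here. The strict path is the one that excludes the nested div.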

Attribute and Index Positioning

# Select elements by attribute
songs = dom.xpath("//section[@class='playlist']")

# Select by position (1-based, counted among siblings under each parent)
first_paragraph = dom.xpath("//article[@id='main']/p[1]")
last_item = dom.xpath("//ul/li[last()]")
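
One detail that often surprises people is that a positional predicate applies per parent, not across the whole result. The sketch below, using a made-up pair of lists, contrasts li[1] with the parenthesized form (//ul/li)[1], which indexes into the combined result instead.

```python
from lxml import etree

# Hypothetical document with two separate lists.
doc = etree.HTML(
    "<html><body>"
    "<ul><li>a1</li><li>a2</li></ul>"
    "<ul><li>b1</li><li>b2</li></ul>"
    "</body></html>"
)

first_per_list = doc.xpath("//ul/li[1]/text()")       # first li of each ul
last_per_list = doc.xpath("//ul/li[last()]/text()")   # last li of each ul
first_overall = doc.xpath("(//ul/li)[1]/text()")      # first li in the whole document

print(first_per_list)   # ['a1', 'b1']
print(last_per_list)    # ['a2', 'b2']
print(first_overall)    # ['a1']
```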

Extracting Content

# Direct text extraction
immediate_text = dom.xpath("//h1/text()")

# All descendant text
full_text = dom.xpath("//div[@class='content']//text()")

# Attribute value retrieval
image_sources = dom.xpath("//img/@src")
link_urls = dom.xpath("//a/@href")
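
The difference between /text() and //text() shows up as soon as an element contains inline children. Here is a small sketch with invented markup. It also uses the XPath string() function, which returns the concatenated text of the first matching node.

```python
from lxml import etree

# Hypothetical heading with an inline <em> child, plus one link.
doc = etree.HTML(
    "<html><body>"
    "<h1>Hello <em>big</em> world</h1>"
    "<a href='/x'>link</a>"
    "</body></html>"
)

direct = doc.xpath("//h1/text()")        # text nodes that are direct children of h1
all_text = doc.xpath("//h1//text()")     # text nodes at any depth inside h1
joined = "".join(all_text)               # stitch the pieces back together
as_string = doc.xpath("string(//h1)")    # same result via the XPath string() function
hrefs = doc.xpath("//a/@href")           # attribute values come back as strings

print(direct)      # ['Hello ', ' world']
print(all_text)    # ['Hello ', 'big', ' world']
print(joined)      # Hello big world
print(hrefs)       # ['/x']
```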

Advanced Predicates

Filter nodes using bracket notation:

# Elements with specific attribute existence
titled_elements = dom.xpath("//*[@title]")

# Numeric comparisons (the attribute value is converted to a number)
expensive_items = dom.xpath("//item[@price > 100]")

# Position-based selection
middle_rows = dom.xpath("//tr[position() = 2 or position() = 3]")
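
These predicates are easier to follow against concrete data. The sketch below runs them on a small XML catalog. The element names, prices, and titles are invented for the example, and the last query shows that conditions can be combined with and.

```python
from lxml import etree

# Hypothetical catalog; etree.XML() parses a well-formed XML string.
catalog = etree.XML(
    "<catalog>"
    "<item price='50' title='Budget'>A</item>"
    "<item price='150'>B</item>"
    "<item price='300' title='Premium'>C</item>"
    "</catalog>"
)

with_title = catalog.xpath("//item[@title]/text()")          # attribute exists
over_100 = catalog.xpath("//item[@price > 100]/text()")      # numeric comparison
second_or_third = catalog.xpath(
    "//item[position() = 2 or position() = 3]/text()"        # position test
)
titled_and_pricey = catalog.xpath(
    "//item[@title and @price > 100]/text()"                 # combined conditions
)

print(with_title)          # ['A', 'C']
print(over_100)            # ['B', 'C']
print(second_or_third)     # ['B', 'C']
print(titled_and_pricey)   # ['C']
```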

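For expressions that select nodes, the xpath() method returns a list of matches, which is empty if nothing satisfies the expression. Expressions that evaluate to a value, such as count(...) or string(...), return a float, string, or boolean instead.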
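
Since a node-selecting query can come back empty, it is worth guarding before indexing into the result. The following sketch shows one way to do that on invented markup. It also shows a related caveat: expressions built on XPath functions such as count(), boolean(), or string() hand back a scalar rather than a list.

```python
from lxml import etree

# Hypothetical document with two paragraphs and no tables.
doc = etree.HTML("<html><body><p>one</p><p>two</p></body></html>")

paragraphs = doc.xpath("//p")        # list of elements
tables = doc.xpath("//table")        # no match: empty list, not None

# Guard before indexing so a missing element does not raise IndexError.
first_table = tables[0] if tables else None

p_count = doc.xpath("count(//p)")        # float
has_p = doc.xpath("boolean(//p)")        # bool
first_p_text = doc.xpath("string(//p)")  # string value of the first match

print(len(paragraphs), tables, first_table)   # 2 [] None
print(p_count, has_p, first_p_text)           # 2.0 True one
```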

Tags: python lxml xpath web-scraping data-extraction

Posted on Fri, 15 May 2026 15:42:04 +0000 by zackcez