The lxml library is a Pythonic wrapper around the C libraries libxml2 and libxslt, which gives it fast parsing of HTML and XML documents. Its full support for XPath 1.0 makes it a good fit for targeted data extraction tasks.
Setup
Install the package via pip:
pip install lxml
Initializing Parser Objects
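There are two common ways to obtain a parsed document. etree.parse() reads a file and returns an ElementTree, while etree.HTML() parses a string and returns the root Element: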
from lxml import etree
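# Method 1: Load from a local file
# (the default parser expects well-formed XML, so pass HTMLParser for HTML files)
tree = etree.parse('document.html', etree.HTMLParser())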
# Method 2: Parse from string content
html_string = "<html><body><div>Example</div></body></html>"
dom = etree.HTML(html_string)
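As a minimal sketch of the difference between the two entry points, the snippet below parses the same markup both ways. It uses io.StringIO as a stand-in for a file on disk, which is an assumption made only to keep the example self-contained.

```python
from io import StringIO
from lxml import etree

html_string = "<html><body><div>Example</div></body></html>"

# etree.HTML parses a string and returns the root Element
root = etree.HTML(html_string)

# etree.parse accepts a path or file-like object and returns an ElementTree;
# StringIO stands in for a real file here
tree = etree.parse(StringIO(html_string), etree.HTMLParser())

root_tag = root.tag                    # 'html'
tree_root_tag = tree.getroot().tag     # 'html'
div_text = root.xpath("//div/text()")  # ['Example']
```

Both objects expose xpath(), so the rest of the examples work with either one.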
Core XPath Syntax
XPath expressions navigate document structures and return node lists:
// - Recursive descent: searches through all descendant nodes
/ - Direct child: selects immediate children only
@ - Attribute access: retrieves attribute values
text() - Extracts text nodes
. - Current context node
.. - Parent node
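The symbols above can be tried against a small document. The markup here is hypothetical sample data, chosen only so that each symbol has something to match.

```python
from lxml import etree

# Hypothetical sample markup for illustration
sample = (
    "<html><body>"
    "<div id='outer'><p class='note'>Hello <b>world</b></p></div>"
    "</body></html>"
)
dom = etree.HTML(sample)

all_b = dom.xpath("//b")               # // : match at any depth
direct_p = dom.xpath("//div/p")        # /  : immediate children of div
class_value = dom.xpath("//p/@class")  # @  : attribute value -> ['note']
p_text = dom.xpath("//p/text()")       # text(): direct text of p -> ['Hello ']

# Relative expressions are evaluated from the element they are called on
b = all_b[0]
self_node = b.xpath(".")               # .  : the context node itself
parent_tag = b.xpath("..")[0].tag      # .. : its parent -> 'p'
```

Note that text() on p returns only the text directly inside p, not the text of the nested b element.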
Location Path Examples
# All three expressions select div elements, but with different scope
direct_path = dom.xpath('/html/body/div') # Strict hierarchy
flexible_path = dom.xpath('/html//div') # Any depth under html
anywhere_path = dom.xpath('//div') # All divs in document
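To make the difference in scope concrete, here is a small sketch that runs the three expressions against a document with nested divs. The markup and the id values are assumptions made for the example.

```python
from lxml import etree

# Hypothetical markup: div 'b' is nested inside div 'a'
doc = (
    "<html><body>"
    "<div id='a'><div id='b'>nested</div></div>"
    "<div id='c'>sibling</div>"
    "</body></html>"
)
dom = etree.HTML(doc)

# Only divs that are immediate children of body
direct_ids = [d.get("id") for d in dom.xpath("/html/body/div")]   # ['a', 'c']

# Divs at any depth below html, in document order
flexible_ids = [d.get("id") for d in dom.xpath("/html//div")]     # ['a', 'b', 'c']

# Divs anywhere in the document
anywhere_ids = [d.get("id") for d in dom.xpath("//div")]          # ['a', 'b', 'c']
```

Because html is the root element here, the last two expressions return the same nodes; they differ only when the first step names something narrower than the root.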
Attribute and Index Positioning
# Select elements by attribute
playlist_sections = dom.xpath("//section[@class='playlist']")
# Select by position (1-based indexing)
first_paragraph = dom.xpath("//article[@id='main']/p[1]")
last_item = dom.xpath("//ul/li[last()]")
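One detail worth seeing in action is that a positional predicate applies per parent, not to the whole result. The sketch below uses hypothetical markup with two lists to show this, along with the parenthesized form that indexes the combined result instead.

```python
from lxml import etree

# Hypothetical markup with two separate lists
doc = (
    "<html><body>"
    "<article id='main'><p>first</p><p>second</p></article>"
    "<ul><li>a</li><li>b</li></ul>"
    "<ul><li>c</li></ul>"
    "</body></html>"
)
dom = etree.HTML(doc)

# p[1] is the first p child of the matching article (XPath counts from 1)
first_p = [p.text for p in dom.xpath("//article[@id='main']/p[1]")]   # ['first']

# li[last()] is evaluated within each ul, so every list contributes its last item
last_per_list = [li.text for li in dom.xpath("//ul/li[last()]")]      # ['b', 'c']

# Parentheses apply the predicate to the whole node set instead
last_overall = [li.text for li in dom.xpath("(//ul/li)[last()]")]     # ['c']
```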
Extracting Content
# Direct text extraction
immediate_text = dom.xpath("//h1/text()")
# All descendant text
full_text = dom.xpath("//div[@class='content']//text()")
# Attribute value retrieval
image_sources = dom.xpath("//img/@src")
link_urls = dom.xpath("//a/@href")
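The following sketch contrasts /text() with //text() and shows attribute retrieval on a hypothetical page fragment. Joining the pieces with str.join is one possible way to get a single string and is an illustration, not the only approach.

```python
from lxml import etree

# Hypothetical markup for illustration
doc = (
    "<html><body>"
    "<h1>Title <em>here</em></h1>"
    "<div class='content'><p>One</p><p>Two</p></div>"
    "<a href='/home'>Home</a><img src='logo.png'/>"
    "</body></html>"
)
dom = etree.HTML(doc)

# /text() returns only text directly inside h1, skipping the nested em
immediate_text = dom.xpath("//h1/text()")                   # ['Title ']

# //text() walks every descendant and returns each text node separately
full_text = dom.xpath("//div[@class='content']//text()")    # ['One', 'Two']
joined_text = " ".join(full_text)                           # 'One Two'

# Attribute steps return the attribute values as strings
image_sources = dom.xpath("//img/@src")                     # ['logo.png']
link_urls = dom.xpath("//a/@href")                          # ['/home']
```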
Advanced Predicates
Filter nodes using bracket notation:
# Elements with specific attribute existence
titled_elements = dom.xpath("//*[@title]")
# Numeric comparisons
expensive_items = dom.xpath("//item[@price > 100]")
# Position-based selection
middle_rows = dom.xpath("//tr[position() = 2 or position() = 3]")
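For node-selecting expressions like the ones above, the xpath() method always returns a list of matches, which is empty if no elements satisfy the expression. Expressions that compute a value, such as count(...) or boolean(...), return a float or bool instead.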
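A short sketch of these predicates and return values, using a hypothetical XML catalog parsed with etree.fromstring. The element and attribute names are assumptions made for the example.

```python
from lxml import etree

# Hypothetical XML catalog for illustration
xml = (
    "<catalog>"
    "<item price='50'>pen</item>"
    "<item price='150' title='Deluxe'>bag</item>"
    "<item price='250'>desk</item>"
    "</catalog>"
)
root = etree.fromstring(xml)

# Attribute existence and numeric comparison (the attribute is converted to a number)
titled = [i.text for i in root.xpath("//*[@title]")]                # ['bag']
expensive = [i.text for i in root.xpath("//item[@price > 100]")]    # ['bag', 'desk']

# Position-based selection within the parent
middle = [i.text for i in root.xpath("//item[position() = 2 or position() = 3]")]

# No match yields an empty list, so guard before indexing
missing = root.xpath("//item[@price > 1000]")                       # []
first_missing = missing[0] if missing else None

# Value-computing expressions do not return lists
item_count = root.xpath("count(//item)")                            # 3.0
has_cheap = root.xpath("boolean(//item[@price < 100])")             # True
```

Checking the list before indexing avoids an IndexError when a page does not contain the expected element.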