Core Overview
PyQuery provides a jQuery-style API for querying and manipulating HTML and XML documents in Python. It is built on lxml, which does the parsing and keeps it fast on large or complex documents.
Environment Setup
Install PyQuery with pip. lxml is a required dependency, so pip normally pulls it in automatically; naming it explicitly is harmless.
pip install lxml pyquery
Initializing the Document Object
Processing begins by instantiating a document object. Multiple data sources are supported.
Text Strings
Raw HTML markup can be passed directly.
import pyquery
markup = """
<main>
<section id="content">
<article class="post">Entry One</article>
<article class="post">Entry Two</article>
</section>
</main>
"""
doc = pyquery.PyQuery(markup)
print(doc('#content'))
Network Resources
Fetch remote pages with a dedicated HTTP library such as requests, then pass the response text to PyQuery. This keeps network handling separate from parsing logic.
import requests
import pyquery
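# Fetch the page; the timeout avoids hanging on an unresponsive server
resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()
# Force UTF-8 only when the server reports the wrong charset
resp.encoding = 'utf-8'
doc = pyquery.PyQuery(resp.text)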
print(doc('h1'))
Local Files
Load content from local storage via the filename argument.
doc = pyquery.PyQuery(filename='index.html')
print(doc('body'))
Querying Elements
CSS selectors enable precise targeting of nodes.
Direct Selection
Match tags, IDs, or classes.
items = doc('.post')
print(items)
Nested Searching
Search within the scope of the current selection.
# Locate descendant elements at any depth
result = doc('main').find('article')
# Locate immediate children only
children = doc('main > section')
Filtering
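Narrow an existing selection to the elements that also match a second selector.
# Keep only the articles that also carry the "featured" class
filtered = doc('article').filter('.featured')
# Keep only the section whose id is "content"
target = doc('section').filter('#content')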
Iterating Results
A selection is a PyQuery object that may contain several elements. Call items() to iterate over them one at a time, each wrapped in its own PyQuery object.
list_items = doc('.post').items()
for node in list_items:
    print(node.text())
Accessing and Modifying Data
Retrieve attribute values or extract inner content.
Content Extraction
text_data = doc('article').text()
html_raw = doc('article').html()
Attribute Management
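Call attr() with one argument to read an attribute from the first matched element, or with two arguments to set it on every matched element.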
href_val = doc('a').attr('href')
doc('article').attr('data-status', 'new')
DOM Navigation
Move from a selection to its parent, its children, or its siblings.
parent_node = doc('.post').parent()
child_nodes = doc('.post').children()
sibling_nodes = doc('.post').siblings()
Modifying Nodes
Alter the parsed structure during runtime.
element = doc('.post')
# Add styling class
element.addClass('highlight')
# Remove specific class
element.removeClass('old-class')
# Delete element entirely
element.remove()
Selector Pseudo-classes
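Pseudo-classes select elements by their position among siblings or by their text content. Note that :contains() is a non-standard extension rather than part of CSS.
first_item = doc('li:first-child')
last_item = doc('li:last-child')
indexed = doc('li:nth-child(3)')
contains_text = doc(':contains("target")')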