To install the library along with a high-performance parser:
pip install beautifulsoup4 lxml
Begin by importing the class and initializing the parser with your markup:
from bs4 import BeautifulSoup
markup = """
<article class="product-listing">
<header>
<h1 id="main-title">Electronics Catalog</h1>
</header>
<section class="inventory">
<div class="item" data-sku="A100">
<span class="name">Wireless Mouse</span>
<span class="price">$29.99</span>
</div>
<div class="item" data-sku="A101">
<span class="name">Mechanical Keyboard</span>
<span class="price">$89.50</span>
</div>
</section>
</article>
"""
parsed_doc = BeautifulSoup(markup, 'lxml') # or 'html.parser'
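Since lxml is an optional dependency, a script may run where it is not installed. The following is a minimal sketch, assuming a hypothetical helper named make_soup (not part of bs4), that tries lxml first and falls back to the built-in html.parser when Beautiful Soup raises FeatureNotFound:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for this example, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, 'html.parser')


doc = make_soup('<h1 id="main-title">Electronics Catalog</h1>')
print(doc.h1.string)  # Electronics Catalog
```

The two parsers can build slightly different trees: lxml wraps a fragment in html and body tags, while html.parser leaves it as written. Lookups such as doc.h1 behave the same in both cases.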
To navigate the parsed tree, access tag names directly as attributes of the parsed object; each lookup returns the first matching tag:
# Accessing single elements
header = parsed_doc.header
heading_text = parsed_doc.h1.string # "Electronics Catalog"
parent_container = parsed_doc.h1.find_parent() # <header> tag
# Working with attributes
article_class = parsed_doc.article['class'] # ['product-listing']
sku_value = parsed_doc.div['data-sku'] # A100
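A few edge cases of direct access are worth knowing. The sketch below uses a small made-up snippet (not the catalog markup above) to show that class values come back as a list, that .string is None when a tag has more than one child, and that get() avoids a KeyError for missing attributes:

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration only
snippet = (
    '<div class="item featured" data-sku="A100">'
    '<span class="name">Wireless Mouse</span>'
    '<span class="price">$29.99</span>'
    '</div>'
)
soup = BeautifulSoup(snippet, 'html.parser')
div = soup.div

# class is a multi-valued attribute, so it is returned as a list
classes = div['class']                  # ['item', 'featured']

# .string is None when a tag has more than one child
div_string = div.string                 # None
name_string = div.span.string           # 'Wireless Mouse'

# get() returns None (or a supplied default) instead of raising KeyError
status = div.get('data-status')                 # None
status_or_default = div.get('data-status', 'unknown')  # 'unknown'

# Attribute-style access returns None when no such tag exists
missing = soup.table                    # None
```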
To locate specific elements programmatically, use the search methods find() and find_all():
# Finding multiple elements
all_items = parsed_doc.find_all('div', class_='item')
# or using an attribute filter (True matches any tag that has the attribute)
all_items = parsed_doc.find_all(attrs={'data-sku': True})
# Finding single elements
title_element = parsed_doc.find(id='main-title')
first_price = parsed_doc.find('span', class_='price')
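Note that find() returns None when nothing matches, so chained calls can fail on incomplete markup. As a sketch of how the search methods combine in practice, the hypothetical extract_items function below turns each item into a dictionary and guards against missing children. The second item (A102, USB Hub) is made-up sample data that deliberately lacks a price:

```python
from bs4 import BeautifulSoup

# Sample markup; the A102 entry is invented and has no price on purpose
html = """
<section class="inventory">
  <div class="item" data-sku="A100">
    <span class="name">Wireless Mouse</span>
    <span class="price">$29.99</span>
  </div>
  <div class="item" data-sku="A102">
    <span class="name">USB Hub</span>
  </div>
</section>
"""


def extract_items(soup):
    """Return one dict per item; missing fields become None."""
    records = []
    for item in soup.find_all('div', class_='item'):
        name = item.find('span', class_='name')
        price = item.find('span', class_='price')   # may be None
        records.append({
            'sku': item.get('data-sku'),
            'name': name.get_text(strip=True) if name else None,
            'price': price.get_text(strip=True) if price else None,
        })
    return records


records = extract_items(BeautifulSoup(html, 'html.parser'))
```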
The find_all() method accepts various filter types:
# By tag list
headers = parsed_doc.find_all(['h1', 'h2', 'h3'])
# By attribute dictionary
specific_items = parsed_doc.find_all(attrs={'data-sku': 'A100'})
# Limiting results
first_two = parsed_doc.find_all('div', limit=2)
# Text content search (returns the matching strings, not the tags that contain them)
import re
price_strings = parsed_doc.find_all(string=re.compile(r'\$\d+'))
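Two details of these filters are easy to miss: a string search yields text nodes rather than tags, and find_all() also accepts a function as a filter. The sketch below, using a small made-up snippet and a hypothetical has_sku predicate, shows how to get back to the enclosing tag through .parent and how to filter with a function:

```python
import re

from bs4 import BeautifulSoup

# Made-up sample markup for illustration only
html = (
    '<div class="item" data-sku="A100"><span class="price">$29.99</span></div>'
    '<div class="item" data-sku="A101"><span class="price">$89.50</span></div>'
    '<div class="banner">Free shipping</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# A string search returns text nodes (NavigableString objects)
price_strings = soup.find_all(string=re.compile(r'\$\d+'))

# .parent leads from each text node back to the tag that contains it
price_spans = [s.parent for s in price_strings]

# Text nodes behave like str, so ordinary string methods apply
amounts = [float(s.strip().lstrip('$')) for s in price_strings]


def has_sku(tag):
    """Hypothetical predicate: match <div> tags carrying a data-sku attribute."""
    return tag.name == 'div' and tag.has_attr('data-sku')


sku_divs = soup.find_all(has_sku)
```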
When working with individual tag objects:
target = parsed_doc.find('div', class_='item')
# Attribute manipulation
target['data-status'] = 'in-stock' # Add new attribute
del target['data-sku'] # Remove attribute
attribute_dict = target.attrs # Get all attributes as dict
# Text extraction
raw_text = target.get_text(' ', strip=True) # "Wireless Mouse $29.99" (without the ' ' separator the pieces are joined with no space)
specific_span = target.find('span', class_='name').string # "Wireless Mouse"
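How get_text() joins the pieces of text depends on its separator argument. The following sketch, on a trimmed copy of one item from the markup above, compares the default join, an explicit separator, and the stripped_strings generator:

```python
from bs4 import BeautifulSoup

html = """
<div class="item">
  <span class="name">Wireless Mouse</span>
  <span class="price">$29.99</span>
</div>
"""
div = BeautifulSoup(html, 'html.parser').div

# strip=True trims each piece, then joins with the default empty separator
joined = div.get_text(strip=True)        # 'Wireless Mouse$29.99'

# Passing a separator keeps the pieces apart
spaced = div.get_text(' ', strip=True)   # 'Wireless Mouse $29.99'

# stripped_strings yields each non-empty piece separately
parts = list(div.stripped_strings)       # ['Wireless Mouse', '$29.99']
```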
Advanced navigation traverses relationships between nodes:
current = parsed_doc.find('span', class_='name')
# Moving up
container = current.find_parent('div')
section = current.find_parents('section') # returns a list of all matching ancestors, nearest first
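# Moving sideways
next_sibling = current.find_next_sibling() # <span class="price">, skipping whitespace text
# Moving backward in document order (not limited to siblings)
previous_item = current.find_previous('div') # here, the enclosing <div class="item">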
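The find_* navigation methods differ from the plain navigation attributes in that they skip text nodes and accept filters. The sketch below, on a trimmed copy of the catalog markup, contrasts .next_sibling with find_next_sibling() and shows that find_previous() can return an ancestor, because it walks backward through the document rather than across siblings:

```python
from bs4 import BeautifulSoup

html = """<section class="inventory">
<div class="item" data-sku="A100">
<span class="name">Wireless Mouse</span>
<span class="price">$29.99</span>
</div>
</section>"""
soup = BeautifulSoup(html, 'html.parser')
name = soup.find('span', class_='name')

# The raw attribute returns whatever node comes next, here a newline string
raw_next = name.next_sibling                 # '\n'

# The method skips text nodes and returns the next sibling tag
tag_next = name.find_next_sibling()          # <span class="price">

# find_parents() returns matching ancestors, nearest first
ancestor_names = [t.name for t in name.find_parents(['div', 'section'])]

# find_previous() searches backward in document order, so the first
# <div> it meets is the one that encloses this span
previous_div = name.find_previous('div')
```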