Extracting Structured Data from HTML with Python's BeautifulSoup

To install the library along with lxml, a high-performance parser:

pip install beautifulsoup4 lxml

Begin by importing the class and initializing the parser with your markup:

from bs4 import BeautifulSoup

markup = """
<article class="product-listing">
  <header>
    <h1 id="main-title">Electronics Catalog</h1>
  </header>
  <section class="inventory">
    <div class="item" data-sku="A100">
      <span class="name">Wireless Mouse</span>
      <span class="price">$29.99</span>
    </div>
    <div class="item" data-sku="A101">
      <span class="name">Mechanical Keyboard</span>
      <span class="price">$89.50</span>
    </div>
  </section>
</article>
"""

parsed_doc = BeautifulSoup(markup, 'lxml')  # or 'html.parser'
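
If lxml might not be installed wherever the script runs, one option is to fall back to the built-in parser. The helper below is only a sketch: the name make_soup is made up for illustration, and it relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p class='note'>Hello</p>")
print(soup.p.string)  # Hello
```

Keep in mind that different parsers can build slightly different trees from malformed markup, so results may vary between the two on messy pages.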

To navigate the parsed structure, access tags directly as properties of the parsed object. Each property returns the first matching tag in the document:

# Accessing single elements
header = parsed_doc.header
heading_text = parsed_doc.h1.string  # "Electronics Catalog"
parent_container = parsed_doc.h1.find_parent()  # <header> tag

# Working with attributes
article_class = parsed_doc.article['class']  # ['product-listing']
sku_value = parsed_doc.div['data-sku']  # 'A100' (the first <div> in the document)
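
Indexing a tag with square brackets raises KeyError when the attribute is missing. As a small sketch on a made-up one-line snippet (the data-color attribute is hypothetical and deliberately absent), .get() and has_attr() give safer access:

```python
from bs4 import BeautifulSoup

html = '<div class="item featured" data-sku="A100"><span>Wireless Mouse</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# class is a multi-valued attribute, so it comes back as a list
print(div["class"])                      # ['item', 'featured']

# .get() returns None (or a supplied default) instead of raising KeyError
print(div.get("data-sku"))               # A100
print(div.get("data-color"))             # None
print(div.get("data-color", "unknown"))  # unknown

# has_attr() checks for presence without reading the value
has_sku = div.has_attr("data-sku")
```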

To locate elements by tag name, class, id, or other attributes, use the search methods:

# Finding multiple elements
all_items = parsed_doc.find_all('div', class_='item')
# or by attribute presence: True matches any tag that has a data-sku attribute
all_items = parsed_doc.find_all(attrs={'data-sku': True})

# Finding single elements
title_element = parsed_doc.find(id='main-title')
first_price = parsed_doc.find('span', class_='price')
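
The same lookups can also be written as CSS selectors with select() and select_one(). This is a sketch on a trimmed copy of the markup, assuming the soupsieve package is present (it is normally installed together with beautifulsoup4). The discount class is hypothetical and only there to show what a failed find() returns.

```python
from bs4 import BeautifulSoup

html = """
<section class="inventory">
  <div class="item" data-sku="A100">
    <span class="name">Wireless Mouse</span>
    <span class="price">$29.99</span>
  </div>
  <div class="item" data-sku="A101">
    <span class="name">Mechanical Keyboard</span>
    <span class="price">$89.50</span>
  </div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list, like find_all()
items = soup.select("div.item")
names = [span.get_text() for span in soup.select("div.item > span.name")]

# select_one() returns the first match or None, like find()
first_price = soup.select_one("span.price")

# find() returns None when nothing matches, so check before using the result
missing = soup.find("span", class_="discount")
print(len(items), names, first_price.string, missing)
```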

The find_all() method accepts various filter types:

# By tag list
headers = parsed_doc.find_all(['h1', 'h2', 'h3'])

# By attribute dictionary
specific_items = parsed_doc.find_all(attrs={'data-sku': 'A100'})

# Limiting results
first_two = parsed_doc.find_all('div', limit=2)

# Text content search (returns the matching strings, not their enclosing tags)
import re
price_strings = parsed_doc.find_all(string=re.compile(r'\$\d+'))
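
Two more filter types are worth a quick sketch: a function that takes a tag and returns True or False, and a compiled regular expression used as an attribute value. The function name is_item_with_sku is made up for this example. The snippet also shows how to get from a matched string back to its enclosing tag through .parent.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="item" data-sku="A100">
  <span class="name">Wireless Mouse</span>
  <span class="price">$29.99</span>
</div>
<div class="item" data-sku="A101">
  <span class="name">Mechanical Keyboard</span>
  <span class="price">$89.50</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def is_item_with_sku(tag):
    """Match <div> tags that carry a data-sku attribute."""
    return tag.name == "div" and tag.has_attr("data-sku")


items = soup.find_all(is_item_with_sku)

# A regex as an attribute value matches against the attribute's text
a_series = soup.find_all("div", attrs={"data-sku": re.compile(r"^A\d+$")})

# string= yields text nodes; .parent gives the tag that contains each one
price_strings = soup.find_all(string=re.compile(r"\$\d+"))
price_spans = [s.parent for s in price_strings]
print([str(s) for s in price_strings])  # ['$29.99', '$89.50']
```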

When working with individual tag objects:

target = parsed_doc.find('div', class_='item')

# Attribute manipulation
target['data-status'] = 'in-stock'  # Add new attribute
del target['data-sku']             # Remove attribute
attribute_dict = target.attrs       # Get all attributes as dict

# Text extraction
raw_text = target.get_text(' ', strip=True)  # "Wireless Mouse $29.99" (without the ' ' separator the pieces are joined with no space)
specific_span = target.find('span', class_='name').string  # "Wireless Mouse"
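
The text accessors differ in ways that are easy to trip over, so here is a short side-by-side sketch on a single item from the markup:

```python
from bs4 import BeautifulSoup

html = """
<div class="item">
  <span class="name">Wireless Mouse</span>
  <span class="price">$29.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .string is None when a tag has more than one child
print(div.string)                     # None

# strip=True trims each piece, but the default separator is empty
print(div.get_text(strip=True))       # Wireless Mouse$29.99

# Passing a separator keeps the pieces apart
print(div.get_text(" ", strip=True))  # Wireless Mouse $29.99

# stripped_strings yields each non-empty piece individually
pieces = list(div.stripped_strings)
print(pieces)                         # ['Wireless Mouse', '$29.99']
```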

Advanced navigation traverses relationships between nodes:

current = parsed_doc.find('span', class_='name')

# Moving up
container = current.find_parent('div')
sections = current.find_parents('section')  # list of all matching ancestors

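# Moving sideways
next_sibling = current.find_next_sibling()  # the neighbouring price <span>

# Moving backward in document order (not limited to siblings)
previous_item = current.find_previous('div')  # the enclosing <div class="item">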
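
Putting the navigation calls together, the following sketch turns the catalog markup into a list of plain dictionaries. The record layout (sku, name, price) and the float conversion of the price are choices made for this example, not something the library prescribes.

```python
from bs4 import BeautifulSoup

markup = """
<section class="inventory">
  <div class="item" data-sku="A100">
    <span class="name">Wireless Mouse</span>
    <span class="price">$29.99</span>
  </div>
  <div class="item" data-sku="A101">
    <span class="name">Mechanical Keyboard</span>
    <span class="price">$89.50</span>
  </div>
</section>
"""
soup = BeautifulSoup(markup, "html.parser")

records = []
for name_span in soup.find_all("span", class_="name"):
    # Move up to the enclosing item and sideways to the matching price
    container = name_span.find_parent("div", class_="item")
    price_span = name_span.find_next_sibling("span", class_="price")
    records.append({
        "sku": container.get("data-sku") if container else None,
        "name": name_span.get_text(strip=True),
        # Drop the leading "$" before converting; None if the price is missing
        "price": float(price_span.get_text(strip=True).lstrip("$"))
        if price_span else None,
    })

for record in records:
    print(record)
```

Guarding against None at each step keeps the loop from crashing on items that lack a price or a container, which is common on real pages.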

Tags: beautifulsoup python web scraping HTML Parsing XML Parsing

Posted on Mon, 25 May 2026 21:15:27 +0000 by forum