Parsing HTML Content with Beautiful Soup in Python

Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree that makes it easy to extract data from web pages, and it provides simple methods for navigating, searching, and modifying that tree.

Installation

pip install beautifulsoup4

# The examples below use the lxml parser, which is a separate package
pip install lxml

Basic Usage

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div class="content">
<h1>Main Heading</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

Common Parsing Methods

Selecting Elements

# Get first matching element
first_p = soup.p

# Get all matching elements
all_p = soup.find_all('p')
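
The two lookups above can be compared side by side. The following is a minimal sketch, assuming a trimmed copy of the sample markup and the built-in 'html.parser' (so it runs without lxml); the names missing, content_divs, and limited are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (assumption: same structure as above)
html = (
    '<div class="content"><h1>Main Heading</h1>'
    '<p>First paragraph</p><p>Second paragraph</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches
first_p = soup.find("p")
missing = soup.find("table")

# find_all() always returns a list-like result, possibly empty
all_p = soup.find_all("p")

# class_ (with trailing underscore) filters on the CSS class attribute
content_divs = soup.find_all("div", class_="content")

# limit caps the number of results returned
limited = soup.find_all("p", limit=1)

print(first_p.get_text(), missing, len(all_p), len(content_divs), len(limited))
```

Because find() can return None, it is usually worth checking the result before accessing attributes on it.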

Extracting Text

# Get text from element
heading = soup.h1.text

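# Alternative: .string returns the text only when the tag contains a single
# string child; it is None for tags with mixed or nested content
paragraph = soup.p.string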
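
The difference between .string and the text accessors shows up as soon as a tag has nested children. This is a small sketch under the assumption of a slightly modified sample (a <b> tag added inside the paragraph); it uses 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Assumption: sample markup with a nested <b> tag added for illustration
html = (
    '<div class="content"><h1>Main Heading</h1>'
    '<p>First <b>bold</b> paragraph</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# <h1> has exactly one string child, so .string works
heading_string = soup.h1.string

# <p> has three children (text, <b>, text), so .string is None
mixed_string = soup.p.string

# get_text() concatenates all descendant strings
mixed_text = soup.p.get_text()

# A separator plus strip=True gives cleaner output across several tags
div_text = soup.div.get_text(" ", strip=True)

print(repr(heading_string), mixed_string, repr(mixed_text), repr(div_text))
```

In practice get_text() is the safer default, while .string is handy when a None result for complex tags is actually what you want.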

Working with Attributes

# Get all attributes
div_attrs = soup.div.attrs

# Get a specific attribute (class is multi-valued, so this returns a list: ['content'])
div_class = soup.div['class']
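
Indexing a tag with a missing attribute raises KeyError, so .get() is often the more forgiving choice. A minimal sketch, assuming a hypothetical <a> element that is not part of the sample page:

```python
from bs4 import BeautifulSoup

# Hypothetical link element, used only to illustrate attribute access
html = '<a class="btn primary" href="/docs" id="docs-link">Docs</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# class is a multi-valued attribute, so it comes back as a list
classes = link["class"]

# Single-valued attributes come back as plain strings
href = link["href"]

# .get() returns None (or a supplied default) instead of raising KeyError
title = link.get("title")
title_or_default = link.get("title", "untitled")

# has_attr() checks for presence without reading the value
has_id = link.has_attr("id")

print(classes, href, title, title_or_default, has_id)
```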

Navigation Methods

# Parent element
parent = soup.p.parent

# Child nodes (an iterator that also yields whitespace text nodes)
children = soup.div.children

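# Next sibling node (here the newline text node after <h1>, not the <p> tag)
next_sib = soup.h1.next_sibling

# Next sibling that is a tag, skipping whitespace text nodes
next_tag = soup.h1.find_next_sibling('p')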
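
Whitespace between tags is part of the parse tree, which affects the navigation attributes above. The sketch below assumes the <div> portion of the sample document with its line breaks preserved, and uses 'html.parser' so it runs without lxml; child_tags is an illustrative name.

```python
from bs4 import BeautifulSoup, Tag

# The <div> from the sample page, keeping the newlines between tags
html = """<div class="content">
<h1>Main Heading</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# .next_sibling returns the very next node, which is the newline text
raw_sibling = soup.h1.next_sibling

# find_next_sibling() skips text nodes and returns the next tag
next_tag = soup.h1.find_next_sibling()

# Filter .children down to tags to ignore whitespace nodes
child_tags = [child.name for child in soup.div.children if isinstance(child, Tag)]

# .parent moves one level up the tree
parent_name = soup.p.parent.name

print(repr(raw_sibling), next_tag, child_tags, parent_name)
```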

Advanced Selection

CSS Selectors

# Select <p> elements nested inside any element with class "content"
items = soup.select('.content p')

# Select by id (returns a list; it is empty for the sample page,
# which has no element with id "main-content")
main = soup.select('#main-content')
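
select() always returns a list, while select_one() returns a single tag or None. A short sketch, assuming a variant of the sample markup with an id and an extra class added purely for illustration:

```python
from bs4 import BeautifulSoup

# Assumption: sample markup extended with id="main-content" and class="intro"
html = (
    '<div id="main-content" class="content">'
    '<p class="intro">First paragraph</p><p>Second paragraph</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <p> inside .content
paragraphs = soup.select(".content p")

# select_one() returns the first match as a Tag
main = soup.select_one("#main-content")
intro = soup.select_one("p.intro")

# No match: select() gives an empty list, select_one() gives None
missing_list = soup.select("#sidebar")
missing_one = soup.select_one("#sidebar")

print(len(paragraphs), main["id"], intro.get_text(), missing_list, missing_one)
```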

Regular Expressions

import re

# Find text nodes matching a pattern (returns the matching strings, not their tags)
matches = soup.find_all(string=re.compile('paragraph'))
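
Since a string= search returns text nodes, getting back to the enclosing tags takes one more step. The sketch below assumes a trimmed sample with an extra <h2> added for illustration, and also shows a compiled pattern used to match tag names.

```python
import re

from bs4 import BeautifulSoup

# Assumption: trimmed sample markup with an extra <h2> for illustration
html = (
    "<div><h1>Main Heading</h1><h2>Sub Heading</h2>"
    "<p>First paragraph</p><p>Second paragraph</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

# string= matches text nodes; each result is a NavigableString
strings = soup.find_all(string=re.compile("paragraph"))

# .parent recovers the tag that contains each matching string
tags = [s.parent for s in strings]

# A compiled pattern as the first argument matches against tag names
headings = soup.find_all(re.compile(r"^h[1-6]$"))

print([str(s) for s in strings], [t.name for t in tags], [h.name for h in headings])
```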

Tags: python web scraping HTML Parsing Beautiful Soup Data Extraction

Posted on Sun, 10 May 2026 16:24:05 +0000 by cornix