Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree that makes it easy to extract data from web pages, and it provides simple methods for navigating, searching, and modifying that tree.
Installation
pip install beautifulsoup4
Basic Usage
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div class="content">
<h1>Main Heading</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</body>
</html>
"""
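# 'lxml' is a separate package (pip install lxml); pass 'html.parser' to use Python's built-in parser
soup = BeautifulSoup(html_doc, 'lxml')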
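As a quick check that parsing worked, the resulting object can be inspected directly. The following is a minimal sketch; it assumes a shortened stand-in for the sample document and uses the built-in "html.parser" so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the sample document above.
html_doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><h1>Main Heading</h1></body></html>"
)

# "html.parser" ships with Python's standard library.
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title           # the <title> Tag object
title_text = soup.title.string   # the text inside <title>
pretty = soup.prettify()         # indented rendering of the whole tree

print(title_tag.name)
print(title_text)
print(pretty)
```

Printing the prettified tree is often the fastest way to see how the parser interpreted a messy page.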
Common Parsing Methods
Selecting Elements
# Get first matching element
first_p = soup.p
# Get all matching elements
all_p = soup.find_all('p')
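The attribute shortcut (soup.p) is equivalent to find('p'), which returns None when nothing matches, while find_all() returns a list. A small sketch of the difference, assuming a trimmed copy of the sample markup:

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the sample markup.
html_doc = """
<div class="content">
<h1>Main Heading</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class.
content_div = soup.find("div", class_="content")

# Searches can be scoped to a subtree by calling them on a tag.
paragraphs = content_div.find_all("p")
first_only = content_div.find_all("p", limit=1)  # stop after one match

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("table")

paragraph_texts = [p.get_text() for p in paragraphs]
print(paragraph_texts)
```

Because soup.table would also be None here, chaining attribute access on a missing tag raises AttributeError; checking the result first avoids that.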
Extracting Text
# Get text from element
heading = soup.h1.text
# Alternative: .string works when the element has a single child, and returns None otherwise
paragraph = soup.p.string
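The difference between .string and get_text() shows up as soon as an element contains nested tags. A minimal sketch, assuming a paragraph with an inline <b> element that is not part of the original sample:

```python
from bs4 import BeautifulSoup

# Assumption: the <b> element is added here to illustrate nested content.
html_doc = (
    '<div class="content"><h1>Main Heading</h1>'
    "<p>First <b>bold</b> paragraph</p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

heading_string = soup.h1.string   # single text child, so this is the text
para_string = soup.p.string       # several children, so this is None
para_text = soup.p.get_text()     # concatenates all nested text

# A separator plus strip=True gives cleaner output across several elements.
div_text = soup.div.get_text(" ", strip=True)

print(heading_string, para_string, para_text, div_text, sep="\n")
```

For scraping, get_text() is usually the safer default, since it never returns None for an existing tag.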
Working with Attributes
# Get all attributes
div_attrs = soup.div.attrs
# Get specific attribute (class is multi-valued, so this returns a list: ['content'])
div_class = soup.div['class']
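Indexing a tag like a dictionary raises KeyError when the attribute is absent, so .get() is often more convenient. The sketch below assumes a small snippet with an extra class, an id, and a link, none of which appear in the original sample:

```python
from bs4 import BeautifulSoup

# Assumption: illustrative markup with attributes not present in the sample page.
html_doc = '<div class="content main" id="wrapper"><a href="/about">About</a></div>'
soup = BeautifulSoup(html_doc, "html.parser")

div = soup.div
div_classes = div["class"]   # multi-valued attribute, returned as a list
div_id = div["id"]           # single-valued attribute, returned as a string

link = soup.a
link_href = link.get("href")               # value if present
link_title = link.get("title")             # None when the attribute is missing
link_target = link.get("target", "_self")  # fallback default

print(div_classes, div_id, link_href, link_title, link_target)
```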
Navigation Methods
# Parent element
parent = soup.p.parent
# Child nodes (an iterator that also yields whitespace text nodes; wrap in list() to index it)
children = soup.div.children
# Next sibling (in the sample this is the newline after </h1>; find_next_sibling() returns the next tag)
next_sib = soup.h1.next_sibling
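Whitespace between tags is part of the tree, which is a common source of surprises when navigating. A short sketch of how text nodes and tags differ, assuming a trimmed copy of the sample markup:

```python
from bs4 import BeautifulSoup, Tag

# Assumption: a trimmed copy of the sample markup, newlines included.
html_doc = """<div class="content">
<h1>Main Heading</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

h1 = soup.h1
raw_sibling = h1.next_sibling       # the newline text node after </h1>
next_tag = h1.find_next_sibling()   # skips text nodes, returns the first <p>

# Keep only Tag objects when iterating over children.
child_tags = [child.name for child in soup.div.children if isinstance(child, Tag)]

parent_name = soup.p.parent.name

print(repr(raw_sibling), next_tag, child_tags, parent_name)
```

find_previous_sibling() and find_parent() follow the same pattern in the other directions.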
Advanced Selection
CSS Selectors
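# Select <p> elements inside an element with class "content"
items = soup.select('.content p')
# Select by id (returns an empty list for the sample page, which has no element with this id)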
main = soup.select('#main-content')
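select() always returns a list, while select_one() returns the first match or None. The following sketch assumes markup with an id of "main-content" and an "intro" class, both added here for illustration since the sample page has neither:

```python
from bs4 import BeautifulSoup

# Assumption: id="main-content" and class="intro" are added for illustration.
html_doc = """
<div id="main-content" class="content">
<h1>Main Heading</h1>
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

main = soup.select_one("#main-content")             # first match or None
intro = soup.select_one("p.intro")                  # tag name plus class
direct_paragraphs = soup.select("div.content > p")  # direct children only
missing = soup.select("#does-not-exist")            # empty list, not None

print(main["id"], intro.get_text(), len(direct_paragraphs), missing)
```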
Regular Expressions
import re
# Find text nodes matching a pattern (returns the strings themselves; use .parent to get the enclosing tag)
matches = soup.find_all(string=re.compile('paragraph'))
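A compiled pattern can be used to match either text or tag names. A minimal sketch, assuming a copy of the sample markup with an extra <h2> added for illustration:

```python
import re

from bs4 import BeautifulSoup

# Assumption: the <h2> is added so the tag-name pattern has two matches.
html_doc = """
<div class="content">
<h1>Main Heading</h1>
<h2>Sub Heading</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# string= matches text nodes; each result is a string, not a tag.
text_nodes = soup.find_all(string=re.compile("paragraph"))
matching_tags = [node.parent.name for node in text_nodes]

# A pattern as the first argument is matched against tag names.
headings = soup.find_all(re.compile(r"^h[1-6]$"))
heading_names = [h.name for h in headings]

print(matching_tags, heading_names)
```

Anchoring the tag-name pattern with ^ and $ keeps it from also matching names such as html or head.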