Install Required Dependencies
To begin web scraping, install the required Python packages, requests and beautifulsoup4:
pip install requests beautifulsoup4
Construct a Simple Data Scraper
This script demonstrates how to retrieve and parse content from a static webpage.
import requests
from bs4 import BeautifulSoup
# Define the target web address
target_page = 'https://example.com'
# Fetch the webpage content
page_response = requests.get(target_page)
# Verify the request was successful
if page_response.status_code == 200:
    # Create a BeautifulSoup object to parse the HTML
    html_parser = BeautifulSoup(page_response.content, 'html.parser')
    # Locate all instances of a specific HTML tag
    main_headings = html_parser.find_all('h1')
    for heading in main_headings:
        print(heading.get_text())
else:
    print(f"Page request failed with status: {page_response.status_code}")
Managing Multi-Page Content
To collect data from multiple pages that use sequential URLs, loop over the page numbers and request each page in turn.
import requests
from bs4 import BeautifulSoup
# Base URL pattern for pagination
root_url = 'https://example.com/data/page/'
for i in range(1, 6):  # Iterate through pages 1 to 5
    full_url = f"{root_url}{i}"
    page_data = requests.get(full_url)
    if page_data.status_code == 200:
        soup_parser = BeautifulSoup(page_data.content, 'html.parser')
        page_headings = soup_parser.find_all('h1')
        for heading in page_headings:
            print(heading.get_text())
    else:
        print(f"Failed to fetch page {i}. Error code: {page_data.status_code}")
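When the number of pages is not known in advance, one option is to keep requesting sequential URLs until a page fails to load. The sketch below illustrates that idea under some assumptions: `iter_pages`, the `Fetch` type, and the `max_pages` safety cap are hypothetical names introduced here, and the fetch function is passed in as a parameter so the loop can be tried without touching a real site.

```python
from typing import Callable, Iterator, Tuple

# Hypothetical fetch signature: takes a URL and returns (status_code, body).
Fetch = Callable[[str], Tuple[int, str]]


def iter_pages(root_url: str, fetch: Fetch, max_pages: int = 100) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, body) for sequential pages until one request fails."""
    for page_number in range(1, max_pages + 1):
        status_code, body = fetch(f"{root_url}{page_number}")
        if status_code != 200:
            # Stop at the first missing or failing page.
            break
        yield page_number, body


# With requests, a fetch function could look like this (assumption, not run here):
#
#     def fetch_with_requests(url):
#         response = requests.get(url, timeout=10)
#         return response.status_code, response.text
```

Each yielded body can then be handed to BeautifulSoup exactly as in the loop above. The `max_pages` cap is there so that a site returning 200 for every page number cannot keep the loop running indefinitely.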
Extracting JavaScript-Rendered Content
For websites where content is loaded dynamically via JavaScript, a tool such as Selenium, which controls a real web browser, is required.
First, install Selenium and obtain the appropriate WebDriver for your browser.
pip install selenium
Example using ChromeDriver:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
# Initialize the browser driver
chrome_service = Service('/path/to/chromedriver')
browser = webdriver.Chrome(service=chrome_service)
# Navigate to the target URL
browser.get('https://example.com')
# Wait for dynamic content to load
time.sleep(3) # Consider using explicit waits for production code
# Find elements after JavaScript execution
dynamic_headings = browser.find_elements(By.TAG_NAME, 'h1')
for heading in dynamic_headings:
    print(heading.text)
# Close the browser window
browser.quit()
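The fixed time.sleep(3) above either waits longer than necessary or not long enough. An explicit wait polls until a condition holds or a timeout expires. The following is a minimal sketch using Selenium's WebDriverWait and expected_conditions; the function name `fetch_heading_texts` and its parameters are hypothetical, and the driver path is the same placeholder as in the example above.

```python
def fetch_heading_texts(url, driver_path, tag_name="h1", timeout_seconds=10):
    """Return the text of every matching element once at least one is present."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome(service=Service(driver_path))
    try:
        browser.get(url)
        # Poll until matching elements exist in the DOM; raises
        # selenium.common.exceptions.TimeoutException if none appear in time.
        elements = WebDriverWait(browser, timeout_seconds).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, tag_name))
        )
        return [element.text for element in elements]
    finally:
        # Close the browser even if the wait times out.
        browser.quit()
```

A call such as fetch_heading_texts('https://example.com', '/path/to/chromedriver') would return the heading texts as a list. The try/finally block ensures the browser window is closed when the wait fails.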
Implementing Responsible Scraping Practices
Always respect the target site's robots.txt file and avoid overwhelming the server with rapid, successive requests. Introducing delays between requests is a common practice.
import time
import requests
from bs4 import BeautifulSoup
root_url = 'https://example.com/data/page/'
for i in range(1, 6):
    full_url = f"{root_url}{i}"
    page_data = requests.get(full_url)
    if page_data.status_code == 200:
        soup_parser = BeautifulSoup(page_data.content, 'html.parser')
        page_headings = soup_parser.find_all('h1')
        for heading in page_headings:
            print(heading.get_text())
    else:
        print(f"Failed to fetch page {i}. Error code: {page_data.status_code}")
    # Pause for 1 second before the next request
    time.sleep(1)
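Checking robots.txt can also be automated. The sketch below uses the standard library's urllib.robotparser to ask whether a URL may be fetched and whether the site requests a crawl delay. The rules in `SAMPLE_ROBOTS_TXT`, the helper `build_robots_parser`, and the user agent name are made up for illustration; they are parsed from a string so the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Illustrative rules only; a real site's robots.txt will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = build_robots_parser(SAMPLE_ROBOTS_TXT)
user_agent = "example-scraper"  # hypothetical user agent name

print(parser.can_fetch(user_agent, "https://example.com/data/page/1"))     # True
print(parser.can_fetch(user_agent, "https://example.com/private/report"))  # False
print(parser.crawl_delay(user_agent))                                      # 2
```

Against a live site, the parser can instead be pointed at the file with set_url('https://example.com/robots.txt') followed by read(). When crawl_delay returns a number, it can replace the fixed one-second pause in the loop above.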