Web Scraping for Practical Data Extraction Using Python

Install Required Dependencies

To begin, install the two required Python packages, requests and beautifulsoup4:

pip install requests beautifulsoup4

Construct a Simple Data Scraper

This script demonstrates how to retrieve and parse content from a static webpage.

import requests
from bs4 import BeautifulSoup

# Define the target web address
target_page = 'https://example.com'

# Fetch the webpage content (the timeout keeps a stalled server from hanging the script)
page_response = requests.get(target_page, timeout=10)

# Verify the request was successful
if page_response.status_code == 200:
    # Create a BeautifulSoup object to parse the HTML
    html_parser = BeautifulSoup(page_response.content, 'html.parser')

    # Locate all instances of a specific HTML tag
    main_headings = html_parser.find_all('h1')

    for heading in main_headings:
        print(heading.get_text())
else:
    print(f"Page request failed with status: {page_response.status_code}")
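The script above can also be split into a fetch step and a parse step, so the parsing logic can be tried on a literal HTML string without touching the network. The following is a minimal sketch under that assumption; the function names extract_headings and fetch_html are illustrative and not part of either library.

```python
import requests
from bs4 import BeautifulSoup


def extract_headings(html, tag="h1"):
    """Return the stripped text of every `tag` element in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.find_all(tag)]


def fetch_html(url, timeout=10):
    """Download a page, raising requests.HTTPError on a 4xx/5xx response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Parse a hard-coded document instead of a live page
    sample = "<html><body><h1> First </h1><p>text</p><h1>Second</h1></body></html>"
    print(extract_headings(sample))
```

For a live page, the two steps chain together as extract_headings(fetch_html(target_page)).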

Managing Multi-Page Content

To collect data from multiple pages that use sequential URLs, loop over the page numbers and build each URL in turn.

import requests
from bs4 import BeautifulSoup

# Base URL pattern for pagination
root_url = 'https://example.com/data/page/'

for i in range(1, 6):  # Iterate through pages 1 to 5
    full_url = f"{root_url}{i}"
    page_data = requests.get(full_url, timeout=10)  # Give up after 10 s rather than hang

    if page_data.status_code == 200:
        soup_parser = BeautifulSoup(page_data.content, 'html.parser')
        page_headings = soup_parser.find_all('h1')

        for heading in page_headings:
            print(heading.get_text())
    else:
        print(f"Failed to fetch page {i}. Error code: {page_data.status_code}")
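The loop above assumes the page count is known in advance. When it is not, one possible variation is to keep requesting pages until the site stops returning 200. The sketch below assumes the site answers with a non-200 status once the pages run out, which not every site does. The name iter_pages and the max_pages safety cap are illustrative. The session argument is expected to behave like a requests.Session, which reuses the underlying connection across requests.

```python
def iter_pages(session, root_url, max_pages=50, timeout=10):
    """Yield (page_number, html) until a page is missing or max_pages is reached."""
    for page in range(1, max_pages + 1):
        response = session.get(f"{root_url}{page}", timeout=timeout)
        if response.status_code != 200:
            # Treat the first non-200 response as the end of the sequence
            break
        yield page, response.text
```

Called as iter_pages(requests.Session(), root_url), it yields one (page_number, html) pair per page, and each html string can be handed to BeautifulSoup exactly as in the loop above.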

Extracting JavaScript-Rendered Content

For websites where content is loaded dynamically via JavaScript, tools like Selenium that control a real web browser are required.

First, install Selenium and obtain the appropriate WebDriver for your browser.

pip install selenium

Example using ChromeDriver:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Initialize the browser driver
chrome_service = Service('/path/to/chromedriver')
browser = webdriver.Chrome(service=chrome_service)

# Navigate to the target URL
browser.get('https://example.com')

# Wait for dynamic content to load
time.sleep(3)  # Consider using explicit waits for production code

# Find elements after JavaScript execution
dynamic_headings = browser.find_elements(By.TAG_NAME, 'h1')
for heading in dynamic_headings:
    print(heading.text)

# Close the browser window
browser.quit()

Implementing Responsible Scraping Practices

Always respect the target site's robots.txt file and avoid overwhelming the server with rapid, successive requests. Introducing delays between requests is a common practice.

import time
import requests
from bs4 import BeautifulSoup

root_url = 'https://example.com/data/page/'

for i in range(1, 6):
    full_url = f"{root_url}{i}"
    page_data = requests.get(full_url, timeout=10)  # Give up after 10 s rather than hang

    if page_data.status_code == 200:
        soup_parser = BeautifulSoup(page_data.content, 'html.parser')
        page_headings = soup_parser.find_all('h1')

        for heading in page_headings:
            print(heading.get_text())
    else:
        print(f"Failed to fetch page {i}. Error code: {page_data.status_code}")
    
    # Pause for 1 second before the next request
    time.sleep(1)
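The loop above handles the pacing. The robots.txt check can be done with the standard library's urllib.robotparser. The sketch below parses a hard-coded sample file so that it runs offline. The user-agent string "my-scraper", the sample rules, and the helper names load_rules and delay_for are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical user-agent string


def load_rules(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(parser, user_agent=USER_AGENT, default=1.0):
    """Return the site's Crawl-delay for this agent, or `default` if none is set."""
    delay = parser.crawl_delay(user_agent)
    return default if delay is None else float(delay)


if __name__ == "__main__":
    # Sample rules standing in for a real site's robots.txt
    sample = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
    rules = load_rules(sample)
    print(rules.can_fetch(USER_AGENT, "https://example.com/data/page/1"))
    print(rules.can_fetch(USER_AGENT, "https://example.com/private/report"))
    print(delay_for(rules))
```

With the sample rules, the first URL is allowed and the second is not. Against a live site, the parser can load the file itself with set_url('https://example.com/robots.txt') followed by read(). In the loop above, can_fetch can then gate each request and delay_for can supply the argument to time.sleep.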

Tags: web scraping python Data Extraction beautifulsoup Selenium

Posted on Sat, 13 Jun 2026 17:08:30 +0000 by toyfruit