Building a Basic Web Scraper with Python

A web scraper automates the extraction of data from websites. The core process involves two primary steps: fetching web content and parsing the desired information.

To begin, install the requests library, which handles HTTP requests.

pip install requests

Many websites reject requests that do not appear to come from a browser. Sending a browser-like User-Agent header often helps. You can copy the value your own browser sends from the request headers listed in the Network tab of its Developer Tools.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'
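# A timeout keeps the script from hanging if the server never responds
response = requests.get(url, headers=headers, timeout=10)

if response.ok:
    html_content = response.text
    print('Page fetched successfully.')
else:
    # html_content is not defined on this path, so stop before the parsing step
    print(f'Failed to retrieve the page (HTTP {response.status_code}).')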
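
The fetch step can also be wrapped in a small reusable function. The following is a minimal sketch rather than a definitive implementation: the names fetch_html and DEFAULT_HEADERS and the 10-second timeout are assumptions made for illustration. It uses raise_for_status() so that HTTP error codes and network failures are handled in one place.

```python
import requests

# Hypothetical constant; the value mirrors the User-Agent used earlier.
DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}


def fetch_html(url, headers=DEFAULT_HEADERS, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions so they share one handler
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print(f'Failed to retrieve {url}: {exc}')
        return None
    return response.text

# Example call (performs a real network request):
# html_content = fetch_html('https://movie.douban.com/top250')
```

Returning None on failure lets the caller decide whether to retry, skip the page, or stop.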

After obtaining the HTML, the next step is parsing it. BeautifulSoup is a popular library for this task.

pip install beautifulsoup4

Use BeautifulSoup to locate HTML elements. For instance, to find all movie titles within <span> tags that have a class attribute of "title":

from bs4 import BeautifulSoup

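soup = BeautifulSoup(html_content, 'html.parser')
# class_ has a trailing underscore because class is a reserved word in Python
title_elements = soup.find_all('span', class_='title')

for element in title_elements:
    # .string is None when the tag contains more than one child node
    movie_title = element.string
    if movie_title and '/' not in movie_title:
        print(movie_title)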

The condition '/' not in movie_title skips any title text that contains a slash, since alternative titles and translations are often set off by one.
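
To see the filter in action without making a network request, the same logic can be run against a small inline HTML string. This is only a sketch: the markup in sample_html is made up for illustration and is not copied from any real page, and extract_primary_titles is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; a real page may be structured differently.
sample_html = '''
<div class="item">
  <span class="title">Example Movie</span>
  <span class="title">&nbsp;/&nbsp;Alternative Name</span>
</div>
<div class="item">
  <span class="title">Another Film</span>
</div>
'''


def extract_primary_titles(html):
    """Collect the text of span.title elements that contain no slash."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for element in soup.find_all('span', class_='title'):
        text = element.string
        # Skip empty tags and entries marked as alternatives by a slash
        if text and '/' not in text:
            titles.append(text.strip())
    return titles


print(extract_primary_titles(sample_html))
# prints ['Example Movie', 'Another Film']
```

Returning a list instead of printing inside the loop makes the result easier to reuse, for example to write it to a file later.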

Websites frequently paginate content. Analyze the URL pattern to handle multiple pages. For example, a site might use a query parameter like start to control pagination.

base_url = 'https://movie.douban.com/top250'

# start takes the values 0, 25, ..., 225: one request per page of 25 entries
for page_start in range(0, 250, 25):
    page_url = f'{base_url}?start={page_start}'
    page_response = requests.get(page_url, headers=headers, timeout=10)

    if not page_response.ok:
        # Skip this page rather than abort the whole run
        print(f'Skipping {page_url} (HTTP {page_response.status_code}).')
        continue

    page_soup = BeautifulSoup(page_response.text, 'html.parser')
    titles_on_page = page_soup.find_all('span', class_='title')

    for title_tag in titles_on_page:
        title_text = title_tag.string
        if title_text and '/' not in title_text:
            print(title_text)

This loop requests 10 pages (start = 0, 25, 50, ..., 225) and prints the primary title of each entry, so it covers the full multi-page list.
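
The URL pattern itself can be checked without sending any requests. The sketch below builds the list of page URLs with the standard library's urlencode; build_page_urls is a hypothetical helper, and the defaults of 250 items and 25 per page are taken from the loop above.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, total_items=250, page_size=25):
    """Return one URL per page, using a start query parameter as the offset."""
    urls = []
    for start in range(0, total_items, page_size):
        query = urlencode({'start': start})
        urls.append(f'{base_url}?{query}')
    return urls


urls = build_page_urls('https://movie.douban.com/top250')
print(len(urls))   # prints 10
print(urls[0])     # prints https://movie.douban.com/top250?start=0
print(urls[-1])    # prints https://movie.douban.com/top250?start=225
```

Printing the first and last URL before scraping is a quick way to catch an off-by-one error in the range.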


Posted on Tue, 19 May 2026 09:30:13 +0000 by lorenzo-s