A web scraper automates the extraction of data from websites. The core process involves two primary steps: fetching web content and parsing the desired information.
To begin, install the requests library, which handles HTTP requests.
pip install requests
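Many websites reject requests that do not look like they come from a browser. By default, requests identifies itself with a python-requests User-Agent, so a common workaround is to send a browser-like User-Agent header instead. You can copy the value your own browser sends from the Network tab of its Developer Tools.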
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'
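# A timeout keeps the script from hanging if the server never answers.
response = requests.get(url, headers=headers, timeout=10)
if response.ok:
    html_content = response.text
    print('Page fetched successfully.')
else:
    # Fall back to an empty document so the parsing step still has input.
    html_content = ''
    print(f'Failed to retrieve the page (status {response.status_code}).')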
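The effect of the header can be checked without sending any traffic. The sketch below is one possible way to do it, not the only one: it uses a requests.Session, compares the default User-Agent with the overridden one on a prepared (unsent) request, and wraps the fetch in a small helper. The names BROWSER_UA and fetch_html are hypothetical and introduced only for illustration.

```python
import requests

# Hypothetical constant: the same browser-like value used in the example above.
BROWSER_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'


def fetch_html(url, session, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException:
        return None
    return response.text


session = requests.Session()
default_ua = session.headers['User-Agent']   # what requests sends by default
session.headers['User-Agent'] = BROWSER_UA   # applies to every later request

# Build the request without sending it, to inspect the outgoing headers.
prepared = session.prepare_request(
    requests.Request('GET', 'https://movie.douban.com/top250')
)
sent_ua = prepared.headers['User-Agent']
```

With this in place, fetch_html(url, session) would perform the actual download. Setting the header on the session rather than on each call is a design choice that keeps later requests consistent.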
After obtaining the HTML, the next step is parsing it. BeautifulSoup is a popular library for this task.
pip install beautifulsoup4
Use BeautifulSoup to locate HTML elements. For instance, to find all movie titles within <span> tags that have a class attribute of "title":
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
title_elements = soup.find_all('span', class_='title')
for element in title_elements:
    movie_title = element.string
    if movie_title and '/' not in movie_title:
        print(movie_title)
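The condition '/' not in movie_title skips alternative titles or translations, which are typically set off by a slash. The truthiness check on movie_title guards against element.string being None, which happens when a tag contains more than one child node.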
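The filtering logic can be tried offline on a small fragment. The sketch below assumes markup of the shape described above. SAMPLE_HTML is a made-up fragment, not copied from a live page, and primary_titles is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the structure described in the text.
SAMPLE_HTML = """
<div class="hd">
  <span class="title">Primary Title</span>
  <span class="title">&nbsp;/&nbsp;Alternative Title</span>
  <span class="title">Nested <b>Title</b></span>
</div>
"""


def primary_titles(html):
    """Collect the text of title spans that hold a single, slash-free string."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for element in soup.find_all('span', class_='title'):
        text = element.string  # None when the tag has more than one child node
        if text and '/' not in text:
            titles.append(text.strip())
    return titles


titles = primary_titles(SAMPLE_HTML)
print(titles)
```

Only the first span passes both checks: the second contains a slash, and the third has mixed content, so its .string is None.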
Websites frequently paginate content. Analyze the URL pattern to handle multiple pages. For example, a site might use a query parameter like start to control pagination.
base_url = 'https://movie.douban.com/top250'
for page_start in range(0, 250, 25):
    page_url = f'{base_url}?start={page_start}'
    page_response = requests.get(page_url, headers=headers)
    if not page_response.ok:
        continue
    page_soup = BeautifulSoup(page_response.text, 'html.parser')
    titles_on_page = page_soup.find_all('span', class_='title')
    for title_tag in titles_on_page:
        title_text = title_tag.string
        if title_text and '/' not in title_text:
            print(title_text)
This loop requests 10 pages, using the start offsets 0, 25, 50, ..., 225, and prints the primary title of each entry, which covers the whole multi-page list.
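The offset arithmetic can be separated from the network calls, which makes it easy to check. A minimal sketch, assuming the site keeps the start parameter and a page size of 25 as in the example; page_urls and its parameters are hypothetical names.

```python
from urllib.parse import urlencode


def page_urls(base_url, total_items=250, page_size=25):
    """Build one URL per page by stepping the start offset by page_size."""
    urls = []
    for start in range(0, total_items, page_size):
        query = urlencode({'start': start})
        urls.append(f'{base_url}?{query}')
    return urls


urls = page_urls('https://movie.douban.com/top250')
print(len(urls), urls[0], urls[-1])
```

Each URL from this list could then be fetched in turn. When doing so, pausing briefly between requests (for example with time.sleep) is a courtesy to the server.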