This article demonstrates how to create a Python scraper to collect online TV show data and implement advanced search functionality. We use requests and BeautifulSoup for scraping, and pandas for data processing and storage.
#### 1. Scraping Online TV Show Information
First, we need a website that provides TV show listings and that we are legally permitted to scrape. For this example, we use a hypothetical site, tvshows.example.com, whose list page shows a title, description, and link for each show.
Implementation:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd


    def fetch_tv_series(page_url):
        """
        Scrape TV series information from the given URL.

        :param page_url: URL of the series listing page.
        :return: pandas DataFrame containing series data.
        """
        # A timeout keeps the request from hanging indefinitely.
        response = requests.get(page_url, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        html = BeautifulSoup(response.text, 'html.parser')

        # Each show is assumed to live in a <div class="tv-show"> card.
        series_cards = html.find_all('div', class_='tv-show')
        series_list = []
        for card in series_cards:
            title = card.find('h2').get_text(strip=True)
            desc = card.find('p', class_='description').get_text(strip=True)
            link = card.find('a')['href']
            series_list.append({
                'Title': title,
                'Description': desc,
                'Link': link
            })
        return pd.DataFrame(series_list)


    # Usage
    url = "https://tvshows.example.com/list"
    df = fetch_tv_series(url)
    print(df)
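
The scraper above assumes every card contains an `h2`, a description paragraph, and a link. If one is missing, `find` returns `None` and the next call raises an error. The sketch below shows one way to tolerate such gaps. The `SAMPLE_HTML` fragment and the helper names `text_or_none` and `parse_tv_series` are illustrative assumptions that only mirror the hypothetical page structure described earlier; they are not part of any real site or library.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to mirror the listing page described above.
SAMPLE_HTML = """
<div class="tv-show">
  <h2>Night Shift</h2>
  <p class="description">A hospital drama.</p>
  <a href="/shows/night-shift">Details</a>
</div>
<div class="tv-show">
  <h2>Untitled Pilot</h2>
  <a href="/shows/untitled-pilot">Details</a>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_tv_series(html, base_url):
    """Parse listing HTML into a DataFrame, tolerating missing elements."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="tv-show"):
        anchor = card.find("a", href=True)
        rows.append({
            "Title": text_or_none(card.find("h2")),
            "Description": text_or_none(card.find("p", class_="description")),
            # Resolve relative links against the page URL.
            "Link": urljoin(base_url, anchor["href"]) if anchor else None,
        })
    return pd.DataFrame(rows, columns=["Title", "Description", "Link"])


df = parse_tv_series(SAMPLE_HTML, "https://tvshows.example.com/list")
print(df)
```

Parsing an inline string like this also makes it possible to check selectors without sending any requests to the site.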
#### 2. Implementing Advanced Search
Beyond simply scraping a full list, we can add keyword‑based search functionality.
Implementation:
    def search_series(base_url, keyword):
        """
        Search for series matching a keyword.

        :param base_url: Search endpoint URL.
        :param keyword: Keyword to search for.
        :return: DataFrame with search results.
        """
        # requests URL-encodes the keyword and appends it as ?q=<keyword>.
        params = {'q': keyword}
        response = requests.get(base_url, params=params, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        soup = BeautifulSoup(response.text, 'html.parser')

        # Each hit is assumed to live in a <div class="search-result"> card.
        result_cards = soup.find_all('div', class_='search-result')
        results = []
        for card in result_cards:
            title = card.find('h3').get_text(strip=True)
            desc = card.find('p', class_='description').get_text(strip=True)
            link = card.find('a')['href']
            results.append({
                'Title': title,
                'Description': desc,
                'Link': link
            })
        return pd.DataFrame(results)


    # Usage
    search_url = "https://tvshows.example.com/search"
    results_df = search_series(search_url, "action")
    print(results_df)
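
When the full listing has already been scraped, a keyword search can also run locally on the DataFrame instead of calling the site's search endpoint. The following is a minimal sketch under that assumption. The function `filter_series` and the sample rows are hypothetical, and the in-memory buffer stands in for a CSV file on disk.

```python
import io

import pandas as pd


def filter_series(df, keyword, columns=("Title", "Description")):
    """Return rows where any of the given columns contains the keyword."""
    mask = pd.Series(False, index=df.index)
    for column in columns:
        # regex=False matches the keyword literally; na=False skips missing text.
        mask |= df[column].str.contains(keyword, case=False, regex=False, na=False)
    return df[mask]


# Hypothetical rows, shaped like the DataFrames returned by the scrapers above.
shows = pd.DataFrame({
    "Title": ["Night Shift", "Action Squad", "Quiet Lake"],
    "Description": [
        "A hospital drama.",
        "Explosions every week.",
        "Slow-burn mystery with bursts of action.",
    ],
    "Link": ["/shows/night-shift", "/shows/action-squad", "/shows/quiet-lake"],
})

matches = filter_series(shows, "action")
print(matches)

# Store the results as CSV; pass a file path instead of the buffer to write to disk.
buffer = io.StringIO()
matches.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Matching is case-insensitive and covers both the title and the description, so a show is found even when the keyword appears only in its summary.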
#### 3. Important Considerations
- Respect the site's robots.txt and terms of service, as well as copyright and privacy laws.
- Some websites employ anti-scraping measures; you may need to modify headers, use proxies, or implement delays.
- Advanced search logic depends on the target site's URL parameters and HTML structure. Adjust accordingly in real projects.
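
The first two points can be sketched with the standard library's `urllib.robotparser` and a small delay helper. This is an illustration under stated assumptions, not a complete solution: `USER_AGENT`, `SAMPLE_ROBOTS`, `Throttle`, and `polite_get` are hypothetical names, the robots.txt content is made up, and proxies are not covered. The `session` argument is assumed to be a `requests.Session` (or any object with a compatible `get` method).

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; a real scraper would use its own name and contact.
USER_AGENT = "tv-scraper-example/0.1"

# Made-up robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""


def build_robot_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay (in seconds) between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def polite_get(session, url, robots, throttle):
    """Fetch a URL only if robots.txt allows it, pausing between requests."""
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    throttle.wait()
    return session.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)


robots = build_robot_parser(SAMPLE_ROBOTS)
print(robots.can_fetch(USER_AGENT, "https://tvshows.example.com/list"))
print(robots.can_fetch(USER_AGENT, "https://tvshows.example.com/admin/users"))
```

In a real project, the robots.txt text would be downloaded from the target site first, and `polite_get(requests.Session(), url, robots, Throttle(1.0))` would replace the direct `requests.get` calls.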