Handling Pagination and Multi-Level Extraction
To effectively scrape structured data across multiple pages, developers must implement logic to identify and follow pagination links. A common use case involves extracting recruitment data where job details are located on separate pages from the listing.
Suppose we are targeting a recruitment portal. The workflow involves extracting the URL for each specific job posting from the main list, requesting the detail page to retrieve the full job description, and subsequently locating the URL for the next page of listings.
The core mechanism relies on yielding new scrapy.Request objects. The callback parameter determines which method handles the response. For pagination, the callback often points back to the current parse method (recursive), while detail pages point to a dedicated parsing method.
Below is a structural outline of a spider implementing this logic:
# -*- coding: utf-8 -*-
import scrapy


class RecruitmentSpider(scrapy.Spider):
    name = 'recruiter'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/position.php']

    def parse(self, response):
        # Select table rows, excluding header and footer
        rows = response.xpath('//table[@class="list"]//tr')[1:-1]
        for row in rows:
            # Initialize a dictionary for the item
            item = {}
            # Extract the relative URL for the detail page
            relative_link = row.xpath('.//td[1]/a/@href').extract_first()
            if relative_link:
                item['detail_link'] = response.urljoin(relative_link)
                # Request the detail page and pass the item dictionary via meta
                yield scrapy.Request(
                    item['detail_link'],
                    callback=self.parse_detail,
                    meta={'data': item}
                )

        # Identify and follow the next page link
        next_page = response.xpath('//a[@id="next"]/@href').extract_first()
        if next_page and next_page != 'javascript:;':
            absolute_next_url = response.urljoin(next_page)
            yield scrapy.Request(absolute_next_url, callback=self.parse)

    def parse_detail(self, response):
        # Retrieve the item passed from the parse method
        item = response.meta['data']
        # Extract specific details from the detail page
        item['job_title'] = response.xpath('//tr[@class="header"]/td/text()').extract_first()
        item['category'] = response.xpath('//tr[@class="info"]/td[2]/text()').extract_first()
        item['responsibilities'] = response.xpath('//ul[@class="req-list"]/li/text()').extract()
        yield item
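Scrapy documents response.urljoin as a wrapper around the standard library's urljoin, resolved against the response's own URL. The sketch below reproduces the pagination guard with the standard library only, so the resolution rules can be checked without running a crawl. The page URL, the sample href values, and the helper name resolve_next_page are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical listing-page URL, used only for illustration.
PAGE_URL = "http://example.com/jobs/position.php?start=10"


def resolve_next_page(page_url, href):
    """Return the absolute next-page URL, or None when there is nothing to follow."""
    # Same guard as the spider: skip missing links and JavaScript placeholders.
    if not href or href == "javascript:;":
        return None
    return urljoin(page_url, href)


# A document-relative href is resolved against the directory of the page.
print(resolve_next_page(PAGE_URL, "position.php?start=20"))
# A root-relative href replaces the whole path.
print(resolve_next_page(PAGE_URL, "/detail.php?id=42"))
# A placeholder link on the last page yields None, which ends pagination.
print(resolve_next_page(PAGE_URL, "javascript:;"))
```

The first call prints http://example.com/jobs/position.php?start=20, the second prints http://example.com/detail.php?id=42, and the third prints None.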
Defining Structured Items and Data Cleaning
For larger scale scraping, it is best practice to define data fields explicitly in the items.py file. This ensures consistency and serves as documentation for the data structure.
Consider scraping a government complaint platform where we need the ID, title, URL, status, author, timestamp, and content.
In items.py:
import scrapy


class ComplaintItem(scrapy.Item):
    ticket_id = scrapy.Field()
    subject = scrapy.Field()
    source_url = scrapy.Field()
    status = scrapy.Field()
    poster = scrapy.Field()
    posted_at = scrapy.Field()
    body_images = scrapy.Field()
    body_text = scrapy.Field()
The spider utilizes this class to instantiate items. Note the use of the meta dictionary to pass the partially populated item from the list parser to the detail parser.
import scrapy
from project_name.items import ComplaintItem


class ComplaintSpider(scrapy.Spider):
    name = 'complaints'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

    def parse(self, response):
        # Extract rows from the list table
        entry_nodes = response.xpath('//div[@class="greyframe"]//table[2]//table/tr')
        for node in entry_nodes:
            item = ComplaintItem()
            # Populate list-level data
            item['ticket_id'] = node.xpath('.//td[1]/text()').extract_first()
            item['subject'] = node.xpath('.//td[2]/a[2]/@title').extract_first()
            item['source_url'] = node.xpath('.//td[2]/a[2]/@href').extract_first()
            item['status'] = node.xpath('.//td[3]/span/text()').extract_first()
            item['poster'] = node.xpath('.//td[4]/text()').extract_first()
            item['posted_at'] = node.xpath('.//td[5]/text()').extract_first()
            # Request the detail page; urljoin also handles relative hrefs
            if item['source_url']:
                item['source_url'] = response.urljoin(item['source_url'])
                yield scrapy.Request(
                    item['source_url'],
                    callback=self.parse_content,
                    meta={'base_item': item}
                )

        # Follow the pagination link if it exists (e.g., the ">" symbol).
        # This runs once per page, after the loop, rather than once per row.
        next_page = response.xpath('//a[text()=">"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_content(self, response):
        # Load the item passed from the list view
        item = response.meta['base_item']
        # Extract content details
        item['body_images'] = response.xpath('//div[@class="textpic"]/img/@src').extract()
        item['body_text'] = response.xpath('//div[@class="c1 text14_2"]/div[@class="contentext"]/text()').extract()
        yield item
Data cleaning is efficiently handled in the pipelines.py file. This separates the extraction logic from the processing logic.
class ComplaintPipeline(object):
    def process_item(self, item, spider):
        # Convert relative image URLs to absolute
        if item.get('body_images'):
            item['body_images'] = ["http://wz.sun0769.com" + img for img in item['body_images']]
        # Clean text fields by removing whitespace characters
        if item.get('body_text'):
            cleaned_text = []
            for text in item['body_text']:
                text = text.replace("\xa0", "").replace("\t", "")
                if text:
                    cleaned_text.append(text)
            item['body_text'] = cleaned_text
        # For demonstration, print the item
        print(item)
        return item
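A pipeline class only runs once it is registered in the ITEM_PIPELINES setting in settings.py. The fragment below is a minimal sketch, assuming the project package is named project_name as in the spider's import above. The priority value 300 is an arbitrary illustrative choice.

```python
# settings.py (fragment)
# "project_name" is the placeholder package name used in the spider's import.
ITEM_PIPELINES = {
    # Pipelines with lower numbers run first; values conventionally fall in 0-1000.
    "project_name.pipelines.ComplaintPipeline": 300,
}
```

With several pipelines registered, each item passes through them in ascending order of these numbers.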
Interactive Debugging with Scrapy Shell
Scrapy Shell is a powerful tool for inspecting responses and testing XPath or CSS selectors without running the full spider. It allows for rapid prototyping of extraction logic.
To launch the shell:
scrapy shell http://www.example.com/page
Once inside, the response object is available for inspection:
response.url: The URL of the current response.
response.headers: The HTTP headers of the response.
response.body: The raw content (bytes) of the page.
response.request.headers: The headers sent with the request.
You can test selectors immediately:
>>> response.xpath('//title/text()').extract_first()
Automating Crawls with CrawlSpider
Standard spiders require manual extraction of every link you wish to follow. The CrawlSpider class automates this by defining rules that match link patterns. It is ideal for sites where you need to follow specific URL structures comprehensively.
Scenario: Scraping Blog Articles
The goal is to crawl a blog domain, identify author profiles, follow pagination, and extract article content.
Create the project using the crawl template:
scrapy startproject blog_project
cd blog_project
scrapy genspider -t crawl blog_spider blog.csdn.net
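The spider definition relies on LinkExtractor objects wrapped in Rule objects, which are collected in the rules tuple. Each rule specifies a URL pattern to match, an optional callback to invoke on matching pages, and whether to keep following links found on those pages. When no callback is given, follow defaults to True; when a callback is given, it defaults to False unless set explicitly. Avoid naming a callback parse, because CrawlSpider uses that method internally to apply the rules.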
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BlogCrawler(CrawlSpider):
    name = 'blog_spider'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['http://blog.csdn.net/SunnyYoona']

    rules = (
        # Rule 1: Follow author profile pages (dots escaped so they match literally)
        Rule(LinkExtractor(allow=r'blog\.csdn\.net/\w+$'), follow=True),
        # Rule 2: Follow pagination within article lists
        Rule(LinkExtractor(allow=r'channelid=\d+&page=\d+$'), follow=True),
        # Rule 3: Extract data from individual article pages
        Rule(LinkExtractor(allow=r'/\w+/article/details/\d+'), callback='parse_article'),
        # Rule 4: Follow list page variations
        Rule(LinkExtractor(allow=r'/\w+/article/list/\d+$'), follow=True),
    )

    def parse_article(self, response):
        item = {}
        item['title'] = response.xpath("//h1/text()").extract_first()
        item['date'] = response.xpath("//span[@class='time']/text()").extract_first()
        # Collect raw tag text
        item['tags'] = response.xpath("//ul[@class='article_tags clearfix csdn-tracking-statistics']//text()").extract()
        yield item
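Because each allow value is a regular expression searched against the absolute URL, the patterns can be sanity-checked with the re module before starting a crawl. The sketch below uses the same four patterns, with the dots in the host name escaped. The labels, the helper matching_rules, and every sample URL other than the start URL are hypothetical and chosen only for illustration.

```python
import re

# The four `allow` patterns from the rules, keyed by a descriptive label.
PATTERNS = {
    "author_profile": r"blog\.csdn\.net/\w+$",
    "list_pagination": r"channelid=\d+&page=\d+$",
    "article": r"/\w+/article/details/\d+",
    "article_list": r"/\w+/article/list/\d+$",
}


def matching_rules(url):
    """Return the labels of all patterns found anywhere in the URL."""
    return [label for label, pattern in PATTERNS.items() if re.search(pattern, url)]


# Sample URLs; only the first one comes from the spider's start_urls.
SAMPLE_URLS = [
    "http://blog.csdn.net/SunnyYoona",
    "http://blog.csdn.net/SunnyYoona/article/details/12345",
    "http://blog.csdn.net/SunnyYoona/article/list/2",
    "http://blog.csdn.net/column/list.html?channelid=3&page=2",
]

for url in SAMPLE_URLS:
    print(url, "->", matching_rules(url))
```

Each sample URL matches exactly one pattern here. If a URL matched several rules, it is worth checking the rule order, since the Scrapy documentation states that the first matching rule in the tuple is the one applied.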
Finally, clean the extracted tags in the pipeline to remove whitespace and separators:
import re


class BlogPipeline(object):
    def process_item(self, item, spider):
        if item.get('tags'):
            # Remove whitespace and slashes
            item['tags'] = [re.sub(r'\s|/', '', tag) for tag in item['tags']]
            # Filter out empty strings and the static label '标签:' ("Tags:")
            item['tags'] = [tag for tag in item['tags'] if tag and tag != '标签:']
        print(item)
        return item
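The cleaning step can be tried in isolation on a plain list, which makes its behavior easier to see than inside a running crawl. The sketch below applies the same two list comprehensions. The function name clean_tags and the sample input are made up for illustration.

```python
import re


def clean_tags(raw_tags):
    """Apply the same two cleaning steps as the pipeline to a list of strings."""
    # Remove all whitespace characters and slashes from every fragment.
    stripped = [re.sub(r"\s|/", "", tag) for tag in raw_tags]
    # Drop fragments that are now empty, and the static "Tags:" label.
    return [tag for tag in stripped if tag and tag != "标签:"]


# Made-up fragments resembling what //text() returns for a tag list.
raw = ["\n  ", "标签:", "\n", "python", "  /  ", "web scraping", "\n"]
print(clean_tags(raw))  # prints ['python', 'webscraping']
```

Note that the pattern also removes spaces inside a tag, so "web scraping" becomes "webscraping". If multi-word tags must keep their spaces, calling strip() on each fragment and removing only the slash separators is one alternative.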