Advanced Scrapy Techniques: Pagination, Data Pipelines, and Crawling Strategies

Handling Pagination and Multi-Level Extraction

To effectively scrape structured data across multiple pages, developers must implement logic to identify and follow pagination links. A common use case involves extracting recruitment data where job details are located on separate pages from the listing.

Suppose we are targeting a recruitment portal. The workflow involves extracting the URL for each specific job posting from the main list, requesting the detail page to retrieve the full job description, and subsequently locating the URL for the next page of listings.

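The core mechanism relies on yielding new scrapy.Request objects. The callback parameter determines which method handles the response. For pagination, the callback usually points back to the same parse method, so every listing page is handled by the same code; this is not recursion in the call-stack sense, because Scrapy schedules the request and calls parse again when the response arrives. Detail pages point to a dedicated parsing method.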

Below is a structural outline of a spider implementing this logic:

# -*- coding: utf-8 -*-
import scrapy

class RecruitmentSpider(scrapy.Spider):
    name = 'recruiter'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/position.php']

    def parse(self, response):
        # Select table rows, excluding header and footer
        rows = response.xpath('//table[@class="list"]//tr')[1:-1]

        for row in rows:
            # Initialize a dictionary for the item
            item = {}
            
            # Extract the relative URL for the detail page
            relative_link = row.xpath('.//td[1]/a/@href').extract_first()
            if relative_link:
                item['detail_link'] = response.urljoin(relative_link)
                # Request the detail page and pass the item dictionary via meta
                yield scrapy.Request(
                    item['detail_link'], 
                    callback=self.parse_detail,
                    meta={'data': item}
                )

        # Identify and follow the next page link
        next_page = response.xpath('//a[@id="next"]/@href').extract_first()
        if next_page and next_page != 'javascript:;':
            absolute_next_url = response.urljoin(next_page)
            yield scrapy.Request(absolute_next_url, callback=self.parse)

    def parse_detail(self, response):
        # Retrieve the item passed from the parse method
        item = response.meta['data']
        
        # Extract specific details from the detail page
        item['job_title'] = response.xpath('//tr[@class="header"]/td/text()').extract_first()
        item['category'] = response.xpath('//tr[@class="info"]/td[2]/text()').extract_first()
        item['responsibilities'] = response.xpath('//ul[@class="req-list"]/li/text()').extract()
        
        yield item
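
The pagination guard and URL resolution can be tried outside a spider. The following is a minimal sketch, assuming a hypothetical helper named resolve_next_page and a made-up relative link; for plain cases response.urljoin behaves like the standard library's urllib.parse.urljoin with the response URL as the base.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Return an absolute next-page URL, or None when there is nothing to follow.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    # A missing href or a javascript: placeholder means there is no real next page
    if not href or href.startswith('javascript:'):
        return None
    # Resolve the (possibly relative) href against the current page URL
    return urljoin(page_url, href)


if __name__ == '__main__':
    base = 'http://example.com/position.php'
    print(resolve_next_page(base, 'position.php?start=10'))
    print(resolve_next_page(base, 'javascript:;'))
```

Checking for the javascript: prefix rather than one exact string is a design choice of this sketch; it also covers variants such as javascript:void(0).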

Defining Structured Items and Data Cleaning

For larger-scale scraping, it is best practice to define data fields explicitly in the items.py file. This keeps the output consistent and documents the data structure; in addition, assigning to a field that was not declared on a scrapy.Item raises a KeyError, which catches typos in field names early.

Consider scraping a government complaint platform where we need the ID, title, URL, status, author, timestamp, and content (both text and images).

In items.py:

import scrapy

class ComplaintItem(scrapy.Item):
    ticket_id = scrapy.Field()
    subject = scrapy.Field()
    source_url = scrapy.Field()
    status = scrapy.Field()
    poster = scrapy.Field()
    posted_at = scrapy.Field()
    body_images = scrapy.Field()
    body_text = scrapy.Field()

The spider utilizes this class to instantiate items. Note the use of the meta dictionary to pass the partially populated item from the list parser to the detail parser.

import scrapy
from project_name.items import ComplaintItem

class ComplaintSpider(scrapy.Spider):
    name = 'complaints'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

    def parse(self, response):
        # Extract rows from the list table
        entry_nodes = response.xpath('//div[@class="greyframe"]//table[2]//table/tr')
        
        for node in entry_nodes:
            item = ComplaintItem()
            
            # Populate list-level data
            item['ticket_id'] = node.xpath('.//td[1]/text()').extract_first()
            item['subject'] = node.xpath('.//td[2]/a[2]/@title').extract_first()
            item['source_url'] = node.xpath('.//td[2]/a[2]/@href').extract_first()
            item['status'] = node.xpath('.//td[3]/span/text()').extract_first()
            item['poster'] = node.xpath('.//td[4]/text()').extract_first()
            item['posted_at'] = node.xpath('.//td[5]/text()').extract_first()
            
            # Request the detail page, passing the partially filled item along
            if item['source_url']:
                item['source_url'] = response.urljoin(item['source_url'])
                yield scrapy.Request(
                    item['source_url'],
                    callback=self.parse_content,
                    meta={'base_item': item}
                )

        # Follow the pagination link (the ">" anchor) once per page, outside the row loop
        next_page = response.xpath('//a[text()=">"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_content(self, response):
        # Load the item passed from the list view
        item = response.meta['base_item']
        
        # Extract content details
        item['body_images'] = response.xpath('//div[@class="textpic"]/img/@src').extract()
        item['body_text'] = response.xpath('//div[@class="c1 text14_2"]/div[@class="contentext"]/text()').extract()
        
        yield item

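Data cleaning belongs in the pipelines.py file, which keeps extraction logic in the spider separate from post-processing. A pipeline only runs once it is registered in the ITEM_PIPELINES setting in settings.py, for example with the entry 'project_name.pipelines.ComplaintPipeline': 300.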

from urllib.parse import urljoin

class ComplaintPipeline(object):
    def process_item(self, item, spider):
        # Convert relative image URLs to absolute ones (absolute URLs pass through unchanged)
        if item.get('body_images'):
            item['body_images'] = [urljoin("http://wz.sun0769.com", img) for img in item['body_images']]
        
        # Clean text fragments: drop non-breaking spaces and tabs, trim what is left
        if item.get('body_text'):
            cleaned_text = []
            for text in item['body_text']:
                text = text.replace("\xa0", "").replace("\t", "").strip()
                # Skip fragments that are empty after cleaning
                if text:
                    cleaned_text.append(text)
            item['body_text'] = cleaned_text
            
        # For demonstration, print the item
        print(item)
        return item
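
The text-cleaning step can be checked in isolation, without running a crawl. The following is a minimal sketch, assuming a hypothetical helper named clean_fragments and made-up sample fragments; it drops non-breaking spaces and tabs, trims surrounding whitespace, and discards fragments that end up empty.

```python
def clean_fragments(fragments):
    """Return the non-empty text fragments with \\xa0, tabs and outer whitespace removed.

    Hypothetical helper for illustration; the name is not from the original project.
    """
    cleaned = []
    for text in fragments:
        # Remove non-breaking spaces and tabs, then trim remaining outer whitespace
        text = text.replace('\xa0', '').replace('\t', '').strip()
        # Keep only fragments that still contain something
        if text:
            cleaned.append(text)
    return cleaned


if __name__ == '__main__':
    # Made-up fragments resembling what a text() XPath can return
    sample = ['\xa0\xa0First paragraph.\t', '\r\n', '\tSecond paragraph.']
    print(clean_fragments(sample))
```

Keeping the cleaning logic in a plain function like this makes it easy to unit test; process_item can then simply call it.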

Interactive Debugging with Scrapy Shell

Scrapy Shell is a powerful tool for inspecting responses and testing XPath or CSS selectors without running the full spider. It allows for rapid prototyping of extraction logic.

To launch the shell:

scrapy shell http://www.example.com/page

Once inside, the response object is available for inspection:

  • response.url: The URL of the current response.
  • response.headers: The HTTP headers of the response.
  • response.body: The raw content (bytes) of the page.
  • response.request.headers: The headers sent with the request.

You can test selectors immediately:

>>> response.xpath('//title/text()').extract_first()

Automating Crawls with CrawlSpider

Standard spiders require manual extraction of every link you wish to follow. The CrawlSpider class automates this by defining rules that match link patterns. It is ideal for sites where you need to follow specific URL structures comprehensively.

Scenario: Scraping Blog Articles

The goal is to crawl a blog domain, identify author profiles, follow pagination, and extract article content.

Create the project using the crawl template:

scrapy startproject blog_project
cd blog_project
scrapy genspider -t crawl blog_spider blog.csdn.net

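The spider definition relies on a tuple of Rule objects, each wrapping a LinkExtractor. A rule specifies the URL pattern to match, an optional callback to invoke on the matching responses, and a follow flag that controls whether links found on those responses are extracted in turn. When follow is omitted, it defaults to True if no callback is given and to False otherwise. Avoid naming a callback parse, because CrawlSpider uses that method internally to apply the rules.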

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BlogCrawler(CrawlSpider):
    name = 'blog_spider'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['http://blog.csdn.net/SunnyYoona']

    rules = (
        # Rule 1: Follow author profile pages
        Rule(LinkExtractor(allow=r'blog\.csdn\.net/\w+$'), follow=True),
        
        # Rule 2: Follow pagination within article lists
        Rule(LinkExtractor(allow=r'channelid=\d+&page=\d+$'), follow=True),
        
        # Rule 3: Extract data from individual article pages
        Rule(LinkExtractor(allow=r'/\w+/article/details/\d+'), callback='parse_article'),
        
        # Rule 4: Follow list page variations
        Rule(LinkExtractor(allow=r'/\w+/article/list/\d+$'), follow=True),
    )

    def parse_article(self, response):
        item = {}
        item['title'] = response.xpath("//h1/text()").extract_first()
        item['date'] = response.xpath("//span[@class='time']/text()").extract_first()
        # Collect raw tag text
        item['tags'] = response.xpath("//ul[@class='article_tags clearfix csdn-tracking-statistics']//text()").extract()
        yield item
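
Because LinkExtractor applies its allow patterns as regular-expression searches against absolute URLs, the patterns can be sanity-checked with the re module before starting a crawl. The following is a minimal sketch, assuming made-up sample URLs, hypothetical rule labels, and a hypothetical helper named matching_rules; the profile pattern here escapes its dots so that they match literally.

```python
import re

# The four allow patterns, keyed by hypothetical descriptive labels
RULE_PATTERNS = {
    'profile': r'blog\.csdn\.net/\w+$',
    'pagination': r'channelid=\d+&page=\d+$',
    'article': r'/\w+/article/details/\d+',
    'list': r'/\w+/article/list/\d+$',
}


def matching_rules(url):
    """Return the labels of every pattern that matches the given URL."""
    # re.search mirrors how allow patterns are tested: anywhere in the URL
    return [name for name, pattern in RULE_PATTERNS.items()
            if re.search(pattern, url)]


if __name__ == '__main__':
    # Made-up URLs for illustration only
    samples = [
        'http://blog.csdn.net/SunnyYoona',
        'http://blog.csdn.net/SunnyYoona/article/details/12345',
        'http://blog.csdn.net/SunnyYoona/article/list/2',
        'http://blog.csdn.net/SunnyYoona?channelid=0&page=2',
    ]
    for url in samples:
        print(url, matching_rules(url))
```

A URL that matches no pattern is never requested by the rules, and one that matches several is handled by the first matching rule in the rules tuple, so a quick check like this helps catch patterns that are too broad or too narrow.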

Finally, clean the extracted tags in the pipeline to remove whitespace and separators:

import re

class BlogPipeline(object):
    def process_item(self, item, spider):
        if item.get('tags'):
            # Remove whitespace and slashes
            item['tags'] = [re.sub(r'\s|/', '', tag) for tag in item['tags']]
            # Filter out empty strings and the static label '标签:' (Chinese for "Tags:")
            item['tags'] = [tag for tag in item['tags'] if tag and tag != '标签:']
        print(item)
        return item
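
To see what the tag cleanup does, the same two steps can be run on a small sample. The following is a minimal sketch, assuming a hypothetical helper named clean_tags and a made-up list of raw text nodes of the kind a //text() query might return.

```python
import re


def clean_tags(raw_tags):
    """Strip whitespace and slashes from each tag, then drop empties and the label.

    Hypothetical helper for illustration; it mirrors the two list comprehensions.
    """
    # Step 1: remove all whitespace characters and slash separators
    stripped = [re.sub(r'\s|/', '', tag) for tag in raw_tags]
    # Step 2: discard empty strings and the static label ("Tags:" in Chinese)
    return [tag for tag in stripped if tag and tag != '标签:']


if __name__ == '__main__':
    # Made-up raw text nodes, including layout whitespace and a separator
    sample = ['\n  标签:\n', '\n', 'scrapy', ' / ', 'python ', '\n']
    print(clean_tags(sample))
```

Note that the regular expression removes whitespace inside a tag as well, so a multi-word tag would be collapsed into a single word.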

Tags: scrapy python web scraping CrawlSpider data pipeline

Posted on Sat, 27 Jun 2026 16:27:16 +0000 by torvald_helmer