Using CrawlSpider for Automated Web Scraping in Scrapy

Overview

When scraping an entire website such as Qiushibaike (a Chinese joke site), you have two approaches:

Method 1: Use Scrapy's base Spider class and crawl recursively by yielding follow-up requests from your own callbacks.

Method 2: Use CrawlSpider, which extracts and follows links automatically according to declared rules (less hand-written code for site-wide crawls).

This guide covers:

  • CrawlSpider introduction
  • CrawlSpider usage
    • Creating CrawlSpider-based crawlers
    • LinkExtractor
    • Rule parser

Introduction

CrawlSpider is a subclass of the base Spider class that adds rule-based link following. Its most notable companion is the LinkExtractor class, which extracts links from each response automatically. The base Spider only requests the URLs in start_urls unless you yield further requests yourself, whereas CrawlSpider is better suited to recursively extracting and following links from the pages it crawls.

Usage

1. Create a Scrapy project

scrapy startproject projectName

2. Create a CrawlSpider-based crawler

scrapy genspider -t crawl spiderName www.example.com

The -t crawl flag creates a crawler based on CrawlSpider rather than the base Spider class.

3. Generated CrawlSpider File Structure

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyCrawlerSpider(CrawlSpider):
    name = 'myCrawler'
    start_urls = ['http://www.example.com/']


    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item

Key components:

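  • The imports from scrapy.linkextractors and scrapy.spiders: bring in LinkExtractor, CrawlSpider, and Rule
  • class MyCrawlerSpider(CrawlSpider): the spider inherits from CrawlSpider instead of Spider
  • rules: the tuple of link extraction rules
  • parse_item: the callback that parses each matched page. Do not name this callback parse, because CrawlSpider uses parse internally to apply the rules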

4. LinkExtractor

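A LinkExtractor extracts the URLs from a response that satisfy its filters.

# The empty tuples below are the library defaults (no filtering)
LinkExtractor(
    allow=r'Items/',       # Regex (or list of regexes) a URL must match to be extracted
    deny=(),               # Regex(es) that exclude a URL; takes precedence over allow
    restrict_xpaths=(),    # XPath(s) selecting the page regions to search for links
    restrict_css=(),       # CSS selector(s) selecting the page regions to search for links
    deny_domains=(),       # Domains whose links are never extracted
)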

5. Rule

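A Rule pairs a LinkExtractor with instructions for handling the links it finds: Scrapy requests each extracted link and passes the response to the rule's callback.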

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)

Parameters:

Parameter   Description
1st         LinkExtractor instance that selects the links to request
2nd         callback: name of the spider method that parses each response
3rd         follow: whether to keep extracting links from the pages reached through this rule (defaults to True when callback is None, otherwise False)

6. Rules Tuple

The rules tuple contains one or more Rule objects, each representing a different extraction strategy. If several rules match the same link, the first matching rule in the tuple is used.

7. CrawlSpider Workflow

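  1. Scrapy fetches each URL in start_urls
  2. For every response, each rule's LinkExtractor extracts the URLs that match its filters
  3. Scrapy requests the extracted URLs and passes each response to the rule's callback; when follow is True, that response also goes back through step 2
  4. The callback returns items, which are sent to the item pipelines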
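
The loop described by these steps can be imitated without Scrapy at all. The toy model below is an illustration of the idea, not Scrapy's actual implementation: SITE is a made-up in-memory site, and a single regular expression stands in for one Rule.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> hrefs found on that page.
SITE = {
    "/": ["/page/1", "/about"],
    "/page/1": ["/page/2", "/"],
    "/page/2": ["/page/1", "/page/3"],
    "/page/3": [],
    "/about": ["/"],
}


def crawl(start_urls, allow, follow=True):
    """Breadth-first crawl: extract links, filter, skip duplicates, follow."""
    pattern = re.compile(allow)
    seen = set(start_urls)
    queue = deque((url, False) for url in start_urls)  # (url, reached_via_rule)
    parsed = []
    while queue:
        url, via_rule = queue.popleft()
        if via_rule:
            parsed.append(url)  # stands in for the rule's callback
            if not follow:
                continue  # do not extract links from this page
        for link in SITE.get(url, []):
            if pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append((link, True))
    return parsed


all_pages = crawl(["/"], r"/page/\d+", follow=True)
first_only = crawl(["/"], r"/page/\d+", follow=False)
print(all_pages, first_only)
```

With follow=True the crawl reaches every page in the chain; with follow=False it stops at the links found on the start page.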

Practical Example

Scraping Qiushibaike Picture Section

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PicCrawlerSpider(CrawlSpider):
    name = 'picSpider'
    start_urls = ['https://www.qiushibaike.com/pic/']

    # Link extractors for pagination
    # Later pages carry a random s query parameter after the page number
    pagination_links = LinkExtractor(allow=r'/pic/page/\d+\?')
    # The first page is /pic/ itself
    first_page_link = LinkExtractor(allow=r'/pic/$')

    rules = (
        Rule(pagination_links, callback='parse_page', follow=True),
        Rule(first_page_link, callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print(response)
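
Because allow patterns are regular expressions searched within each absolute URL, they can be checked with the re module before running a crawl. In the sketch below the candidate URLs, including the value of the s parameter, are made up for illustration.

```python
import re

# The two allow patterns from the spider above.
pagination = re.compile(r'/pic/page/\d+\?')
first_page = re.compile(r'/pic/$')

# Hypothetical URLs; the value of the s parameter is invented.
candidates = [
    'https://www.qiushibaike.com/pic/',
    'https://www.qiushibaike.com/pic/page/2?s=123',
    'https://www.qiushibaike.com/pic/page/3/',
    'https://www.qiushibaike.com/text/',
]


def matching_rule(url):
    """Return the name of the extractor that would pick up the URL, or None."""
    if pagination.search(url):
        return 'pagination_links'
    if first_page.search(url):
        return 'first_page_link'
    return None


matches = {url: matching_rule(url) for url in candidates}
for url, rule in matches.items():
    print(url, '->', rule)
```

A pagination URL without the trailing query string is not matched, because the pattern requires a literal ? after the page number.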

Main Crawler Implementation

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from myproject.items import JokeItem


class JokeSpider(CrawlSpider):
    name = 'jokeSpider'
    start_urls = ['http://www.qiushibaike.com/']

    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        Rule(page_link, callback='extract_jokes', follow=True),
    )

    def extract_jokes(self, response):
        # Each child div of #content-left holds one joke
        joke_divs = response.xpath('//div[@id="content-left"]/div')

        for div in joke_divs:
            item = JokeItem()
            # default='' avoids calling .strip() on None when a node is missing
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first(default='').strip('\n')
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='').strip('\n')

            yield item

Item Definition

# -*- coding: utf-8 -*-
import scrapy


class JokeItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

Pipeline Implementation

# -*- coding: utf-8 -*-
class JokePipeline(object):

    def __init__(self):
        self.file_handler = None

    def open_spider(self, spider):
        print('Spider started')
        # Explicit UTF-8 so Chinese text is written correctly on any platform
        self.file_handler = open('./jokes.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file_handler.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.file_handler.close()
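
A pipeline's hooks can be exercised by hand, without running a crawl, to check what ends up in the file. The sketch below uses a variant class, FileJokePipeline, whose output path is a constructor argument. That parameter, the temporary directory, and the sample items are assumptions introduced for this offline check only; Scrapy itself would instantiate the pipeline without arguments.

```python
import os
import tempfile


class FileJokePipeline:
    """Same hooks as JokePipeline, with the output path as a parameter."""

    def __init__(self, path):
        self.path = path
        self.file_handler = None

    def open_spider(self, spider):
        self.file_handler = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file_handler.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        self.file_handler.close()


# Drive the hooks by hand in the order Scrapy calls them.
out_path = os.path.join(tempfile.mkdtemp(), 'jokes.txt')
pipeline = FileJokePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'author': 'alice', 'content': 'a short joke'}, spider=None)
pipeline.process_item({'author': 'bob', 'content': 'another joke'}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    written = f.read()
print(written)
```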

Settings Configuration

Add the pipeline to settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.JokePipeline': 300,
}
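
The number assigned to each pipeline sets the order in which items pass through them: lower values run first, and values are customarily chosen in the 0-1000 range. The fragment below illustrates this with a second pipeline, CleanupPipeline, which is hypothetical and not part of this project.

```python
# settings.py fragment; CleanupPipeline is a hypothetical second pipeline.
ITEM_PIPELINES = {
    'myproject.pipelines.JokePipeline': 300,
    'myproject.pipelines.CleanupPipeline': 200,
}

# Items visit pipelines in ascending order of their number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```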

Execution

Run the crawler:

scrapy crawl jokeSpider

Results are saved to jokes.txt.

Posted on Sun, 24 May 2026 19:45:48 +0000 by sheraz