Using CrawlSpider for Automated Web Scraping in Scrapy

Overview

When scraping an entire website such as Qiushibaike (a Chinese joke site), you have two approaches:

Method 1: Use Scrapy's base Spider class with recursive crawling (manual request callbacks).

Method 2: Use CrawlSpider for automated link extraction and crawling (less boilerplate, because link following is declared in rules instead of written by hand).

This guide covers:

CrawlSpider introduction
CrawlSpider usage
- Creating CrawlSpider-based crawlers
- LinkExtractor
- Rule parser

Introduction

CrawlSpider extends the base Spider class with additional features. The most notable is its use of LinkExtractor objects for automatic link extraction. While the base Spider class is designed for scraping the URLs in the start_urls list, CrawlSpider is better suited to recursively extracting and following links from crawled pages.

Usage

1. Create a Scrapy project

scrapy startproject projectName

2. Create a CrawlSpider-based crawler

scrapy genspider -t crawl spiderName www.example.com

The -t crawl flag creates a crawler based on CrawlSpider rather than the base Spider class.

3. Generated CrawlSpider File Structure

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyCrawlerSpider(CrawlSpider):
    name = 'myCrawler'
    start_urls = ['http://www.example.com/']


    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item

Key components:

Lines 3-4: Import LinkExtractor, CrawlSpider, and Rule
Line 7: Class inherits from CrawlSpider
Lines 12-14: Define link extraction rules
Line 16: Data parsing method (the callback named in the rule)

4. LinkExtractor

Extracts URLs from responses based on specified rules.

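# xxx marks a placeholder value, not a literal argument
LinkExtractor(
    allow=r'Items/',          # Regex (or list of regexes); only matching URLs are extracted
    deny=xxx,                 # Regex (or list); matching URLs are excluded, takes precedence over allow
    restrict_xpaths=xxx,      # XPath(s) selecting the page regions to search for links
    restrict_css=xxx,         # CSS selector(s) selecting the page regions to search for links
    deny_domains=xxx,         # Domain(s) whose links are never extracted
)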
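
To make the allow and deny options concrete, here is a minimal stdlib-only sketch of the filtering idea. It is not Scrapy's implementation: it assumes patterns are applied as a regex search anywhere in the URL and that deny wins over allow. The helper name filter_links and the sample URLs are hypothetical.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep URLs that match any allow pattern and no deny pattern.

    Hypothetical helper for illustration only; an empty allow tuple
    is treated as "allow everything".
    """
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        # Skip URLs that match none of the allow patterns.
        if allow_res and not any(r.search(url) for r in allow_res):
            continue
        # Skip URLs that match any deny pattern, even if allowed.
        if any(r.search(url) for r in deny_res):
            continue
        kept.append(url)
    return kept


# Hypothetical sample URLs.
sample_urls = [
    "http://www.example.com/Items/1",
    "http://www.example.com/Items/private/2",
    "http://www.example.com/about",
]
matched = filter_links(sample_urls, allow=(r"Items/",), deny=(r"/private/",))
print(matched)
```

Only the first URL survives: the second matches the deny pattern and the third matches no allow pattern.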

5. Rule

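Defines what happens to the links a LinkExtractor finds: which callback parses the pages they point to, and whether links on those pages are followed in turn.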

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)

Parameters:

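Parameter	Description
1st (link_extractor)	LinkExtractor instance that selects the links
2nd (callback)	Method called with the response of each extracted link, given as a method name or a callable
3rd (follow)	Follow flag - whether to continue extracting links from the pages this rule fetches (defaults to True when callback is None, otherwise False)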
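
The default for the follow flag can be sketched as a small function. This is an illustration of the rule stated in the table, not code from Scrapy, and the name resolve_follow is hypothetical.

```python
def resolve_follow(callback=None, follow=None):
    """Return the effective follow flag for a rule (illustrative helper)."""
    if follow is not None:
        # An explicit value always wins.
        return follow
    # No explicit value: follow links only when there is no callback.
    return callback is None


print(resolve_follow(callback=None))                       # no callback, follow not set
print(resolve_follow(callback="parse_item"))               # callback set, follow not set
print(resolve_follow(callback="parse_item", follow=True))  # explicit follow
```

A rule that has a callback and should also keep crawling therefore needs follow=True spelled out, as in the generated template.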

6. Rules Tuple

The rules tuple contains one or more Rule objects, each pairing a link pattern with its own callback and follow setting. If several rules match the same link, the first one in the tuple is used.

7. CrawlSpider Workflow

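Fetch the pages listed in start_urls
Each Rule's LinkExtractor extracts the matching URLs from the response
A request is sent for every extracted URL; the response is passed to the Rule's callback, and if follow is True, links are extracted from that response as well
Data returned by the callback is stored in items and sent to the pipelines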
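
The workflow above can be sketched without Scrapy as a loop over an in-memory "site". Everything here is an assumption made for illustration: the SITE dictionary stands in for real HTTP responses, rules are plain (pattern, callback, follow) tuples, and the crawl function is hypothetical. It mirrors the described behaviour (links are always extracted from the start page, already-seen URLs are skipped, the first matching rule wins), not Scrapy's internals.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "/": ["/page/1/", "/about"],
    "/page/1/": ["/page/2/", "/"],
    "/page/2/": ["/page/1/"],
    "/about": [],
}


def crawl(start_url, rules):
    """Crawl SITE from start_url; rules is a list of (pattern, callback, follow)."""
    seen = {start_url}
    # The start page has no rule callback, but its links are extracted.
    queue = deque([(start_url, None, True)])
    results = []
    while queue:
        url, callback, follow = queue.popleft()
        if callback is not None:
            results.append(callback(url))
        if not follow:
            continue
        for link in SITE.get(url, []):
            if link in seen:
                continue  # duplicate requests are dropped
            for pattern, rule_callback, rule_follow in rules:
                if re.search(pattern, link):
                    seen.add(link)
                    queue.append((link, rule_callback, rule_follow))
                    break  # first matching rule wins
    return results


def parse_page(url):
    """Stand-in for a spider callback; returns a dict as the scraped item."""
    return {"url": url}


items = crawl("/", [(r"/page/\d+/", parse_page, True)])
print(items)
```

Both paginated pages are parsed once each, while /about is never requested because no rule matches it.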

Practical Example

Scraping Qiushibaike Picture Section

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PicCrawlerSpider(CrawlSpider):
    name = 'picSpider'
    start_urls = ['https://www.qiushibaike.com/pic/']

    # Link extractors for pagination
    # Paginated URLs carry a query string (?s=<random value>), hence the trailing \?
    pagination_links = LinkExtractor(allow=r'/pic/page/\d+\?')
    first_page_link = LinkExtractor(allow=r'/pic/$')

    rules = (
        Rule(pagination_links, callback='parse_page', follow=True),
        Rule(first_page_link, callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print(response)
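
Before running the spider, it can help to check the two allow patterns against a few URLs with the re module alone. The sample URLs below are hypothetical and only illustrate which shapes each pattern accepts; the check uses a regex search anywhere in the URL, which is an assumption about how the patterns are applied.

```python
import re

pagination = re.compile(r'/pic/page/\d+\?')
first_page = re.compile(r'/pic/$')

# Hypothetical URLs for illustration.
samples = [
    "https://www.qiushibaike.com/pic/page/2?s=5127071",
    "https://www.qiushibaike.com/pic/",
    "https://www.qiushibaike.com/pic/page/2/",
    "https://www.qiushibaike.com/text/",
]

labels = {}
for url in samples:
    if pagination.search(url):
        labels[url] = "pagination"
    elif first_page.search(url):
        labels[url] = "first page"
    else:
        labels[url] = None  # neither rule would pick this link up
    print(url, "->", labels[url])
```

Note that a paginated URL without a query string, such as the third sample, matches neither pattern.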

Main Crawler Implementation

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from myproject.items import JokeItem
import re


class JokeSpider(CrawlSpider):
    name = 'jokeSpider'
    start_urls = ['http://www.qiushibaike.com/']

    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        Rule(page_link, callback='extract_jokes', follow=True),
    )

    def extract_jokes(self, response):
        joke_divs = response.xpath('//div[@id="content-left"]/div')

        for div in joke_divs:
            item = JokeItem()
            # default='' avoids an AttributeError when the XPath matches nothing
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first(default='').strip()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='').strip()

            yield item
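
When no default is supplied, extract_first() returns None if the XPath matches nothing, so any string method chained onto the result needs a guard. One way to centralise that guard is a small helper; clean_text is a hypothetical name and this is only a sketch.

```python
def clean_text(value):
    """Return value without surrounding whitespace, or '' if value is None."""
    return value.strip() if value is not None else ""


print(repr(clean_text("\n  A short joke\n")))
print(repr(clean_text(None)))
```

In the callback, the extracted value would be passed through this helper before being assigned to the item field.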

Item Definition

# -*- coding: utf-8 -*-
import scrapy


class JokeItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

Pipeline Implementation

# -*- coding: utf-8 -*-
class JokePipeline(object):

    def __init__(self):
        self.file_handler = None

    def open_spider(self, spider):
        print('Spider started')
        # Explicit UTF-8 so Chinese text is written correctly on every platform
        self.file_handler = open('./jokes.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file_handler.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.file_handler.close()
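
The order in which the three hooks are called can be tried out without running a crawl. The sketch below drives a stand-in pipeline by hand; FilePipeline, the configurable output path, the UTF-8 encoding, and the sample items are assumptions made for illustration, and passing spider=None only works because the hooks never use the spider.

```python
import os
import tempfile


class FilePipeline:
    """Stand-in with the same three hooks as JokePipeline."""

    def __init__(self, path):
        self.path = path
        self.file_handler = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file_handler = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per scraped item; must return the item.
        self.file_handler.write(item["author"] + ":" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file_handler.close()


out_path = os.path.join(tempfile.mkdtemp(), "jokes.txt")
pipeline = FilePipeline(out_path)

pipeline.open_spider(spider=None)
for item in [
    {"author": "alice", "content": "first joke"},
    {"author": "bob", "content": "second joke"},
]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Opening the file once in open_spider and closing it in close_spider keeps a single handle for the whole crawl instead of reopening the file for every item.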

Settings Configuration

Add the pipeline to settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.JokePipeline': 300,
}
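
The number 300 is the pipeline's priority: when several pipelines are enabled, items pass through them in ascending order of this value, conventionally chosen from the 0-1000 range. A quick sketch of that ordering follows; DedupPipeline is a hypothetical second pipeline that does not exist in this project.

```python
ITEM_PIPELINES = {
    "myproject.pipelines.JokePipeline": 300,
    "myproject.pipelines.DedupPipeline": 100,  # hypothetical, for illustration
}

# Lower values run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```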

Execution

Run the crawler:

scrapy crawl jokeSpider

Results are saved to jokes.txt in the directory where the command is run.

Tags: scrapy CrawlSpider web scraping python LinkExtractor

Posted on Sun, 24 May 2026 19:45:48 +0000 by sheraz
