Overview
When scraping an entire website such as Qiushibaike (a Chinese joke site), you have two approaches:
Method 1: Use Scrapy's base Spider class and crawl recursively by issuing follow-up requests from your own callbacks.
Method 2: Use CrawlSpider, which extracts and follows links automatically from declarative rules (usually less code to write and maintain).
This guide covers:
- CrawlSpider introduction
- CrawlSpider usage
- Creating CrawlSpider-based crawlers
- LinkExtractor
- Rule parser
Introduction
CrawlSpider extends the base Spider class with additional features, most notably rule-based link extraction through LinkExtractor. The base Spider class only requests the URLs in the start_urls list unless you schedule further requests yourself, whereas CrawlSpider is better suited to recursively extracting and following links from crawled pages.
Usage
1. Create a Scrapy project
scrapy startproject projectName
2. Create a CrawlSpider-based crawler
scrapy genspider -t crawl spiderName www.example.com
The -t crawl flag creates a crawler based on CrawlSpider rather than the base Spider class.
3. Generated CrawlSpider File Structure
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyCrawlerSpider(CrawlSpider):
    name = 'myCrawler'
    start_urls = ['http://www.example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item
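Key components:
- The two from ... import lines: import LinkExtractor, CrawlSpider, and Rule
- The class statement: the spider inherits from CrawlSpider
- rules: defines the link extraction rules
- parse_item: the data parsing method (do not name a callback parse, because CrawlSpider uses parse internally to apply its rules)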
4. LinkExtractor
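A LinkExtractor extracts the links in a response that match the filters you specify. In the signature below, xxx stands for a value you supply.
LinkExtractor(
    allow=r'Items/',       # Regex (or list of regexes): only matching URLs are extracted
    deny=xxx,              # Regex (or list): matching URLs are excluded; takes precedence over allow
    restrict_xpaths=xxx,   # XPath(s) selecting the page regions to search for links
    restrict_css=xxx,      # CSS selector(s) selecting the page regions to search for links
    deny_domains=xxx,      # Domain(s) whose links are never extracted
)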
5. Rule
A Rule specifies what happens to the links its LinkExtractor finds: each link is requested, and the response is passed to the rule's callback.
Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
Parameters:
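| Parameter | Description |
|---|---|
| link_extractor (1st argument above) | LinkExtractor instance that selects the links to request |
| callback | Spider method (or its name as a string) called with each response to parse it |
| follow | Whether to keep extracting links from the pages this rule fetched (defaults to True when callback is None, otherwise False) |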
6. Rules Tuple
The rules tuple contains one or more Rule objects, each representing a different extraction strategy.
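Order matters inside the tuple: when several rules match the same link, Scrapy uses the first one. The standard-library sketch below imitates that selection with plain regexes. The patterns, callback names, and URLs are hypothetical, and the code is a simplified model rather than Scrapy's implementation.

```python
import re

# (pattern, callback name) pairs standing in for
# Rule(LinkExtractor(allow=pattern), callback=name), in declaration order.
RULES = (
    (r"/Items/\d+$", "parse_item"),    # detail pages
    (r"/Items/", "parse_listing"),     # any other URL under /Items/
)


def pick_callback(url):
    """Return the callback of the first rule whose pattern matches, else None."""
    for pattern, callback in RULES:
        if re.search(pattern, url):
            return callback
    return None


print(pick_callback("http://www.example.com/Items/42"))    # parse_item
print(pick_callback("http://www.example.com/Items/list"))  # parse_listing
print(pick_callback("http://www.example.com/about"))       # None
```

Putting the more specific pattern first keeps the broader one from swallowing its links.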
7. CrawlSpider Workflow
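- Fetch the pages listed in start_urls
- Each rule's LinkExtractor extracts the matching URLs from the response
- The spider requests those URLs and passes each response to the rule's callback; if follow is True, links are extracted from that response as well
- The callback stores the parsed data in items, which are sent to the pipelines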
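The workflow above can be modelled as a toy crawl loop over an in-memory site. This is a simplified sketch using only the standard library, not how Scrapy is implemented: the `SITE` mapping and the `crawl` function are hypothetical, and a single allow pattern stands in for one Rule.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "/": ["/page/1", "/about"],
    "/page/1": ["/page/2", "/"],
    "/page/2": ["/page/1", "/page/3"],
    "/page/3": [],
    "/about": [],
}


def crawl(start_url, allow, follow=True):
    """Return the URLs that would be handed to the rule's callback, in crawl order."""
    pattern = re.compile(allow)
    seen = {start_url}                    # duplicate filter
    queue = deque([(start_url, False)])   # (url, fetched_by_rule)
    parsed = []
    while queue:
        url, fetched_by_rule = queue.popleft()
        if fetched_by_rule:
            parsed.append(url)            # stands in for calling the callback
            if not follow:
                continue                  # follow=False: do not extract links here
        for link in SITE.get(url, []):
            if link not in seen and pattern.search(link):
                seen.add(link)
                queue.append((link, True))
    return parsed


print(crawl("/", r"/page/\d+"))                # ['/page/1', '/page/2', '/page/3']
print(crawl("/", r"/page/\d+", follow=False))  # ['/page/1']
```

As in CrawlSpider, the start page is only used as a source of links here; it is not passed to the rule's callback (CrawlSpider sends it to parse_start_url instead).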
Practical Example
Scraping Qiushibaike Picture Section
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PicCrawlerSpider(CrawlSpider):
    name = 'picSpider'
    start_urls = ['https://www.qiushibaike.com/pic/']

    # Link extractors for pagination
    # Numbered pages; their URLs carry a random s= query parameter
    pagination_links = LinkExtractor(allow=r'/pic/page/\d+\?')
    # The first page, whose URL ends in /pic/
    first_page_link = LinkExtractor(allow=r'/pic/$')

    rules = (
        Rule(pagination_links, callback='parse_page', follow=True),
        Rule(first_page_link, callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print(response)
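Before running the spider, it can help to check the two allow patterns against a few URLs. LinkExtractor applies allow as a regex search on the absolute URL, so plain `re.search` is a reasonable stand-in. The sample URLs are hypothetical, including the value of the s parameter.

```python
import re

PAGINATION = re.compile(r"/pic/page/\d+\?")
FIRST_PAGE = re.compile(r"/pic/$")

# Hypothetical URLs in the shapes the patterns are meant to handle.
samples = [
    "https://www.qiushibaike.com/pic/page/2?s=5123456",
    "https://www.qiushibaike.com/pic/",
    "https://www.qiushibaike.com/pic/page/2/",
    "https://www.qiushibaike.com/text/",
]

# True if either rule would extract the URL.
matched = {
    url: bool(PAGINATION.search(url) or FIRST_PAGE.search(url))
    for url in samples
}

for url, ok in matched.items():
    print(ok, url)
```

The third URL is rejected because the pagination pattern requires a literal `?` directly after the page number, which is worth remembering if the site changes its URL format.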
Main Crawler Implementation
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import JokeItem


class JokeSpider(CrawlSpider):
    name = 'jokeSpider'
    start_urls = ['http://www.qiushibaike.com/']

    # Pagination links of the form /8hr/page/<number>/
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        Rule(page_link, callback='extract_jokes', follow=True),
    )

    def extract_jokes(self, response):
        # Each child div of #content-left holds one joke
        joke_divs = response.xpath('//div[@id="content-left"]/div')
        for div in joke_divs:
            item = JokeItem()
            # default='' avoids an AttributeError when a node is missing
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first(default='').strip('\n')
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='').strip('\n')
            yield item
Item Definition
# -*- coding: utf-8 -*-
import scrapy


class JokeItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
Pipeline Implementation
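# -*- coding: utf-8 -*-


class JokePipeline(object):
    def __init__(self):
        self.file_handler = None

    def open_spider(self, spider):
        # Called once when the spider starts
        print('Spider started')
        # Explicit UTF-8 so Chinese text is written correctly on any platform
        self.file_handler = open('./jokes.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.file_handler.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        print('Spider finished')
        self.file_handler.close()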
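The pipeline life cycle (open_spider, then process_item for each item, then close_spider) can be exercised by hand, which is a quick way to check the output format. The sketch below uses a variant class with a configurable path. `TextFilePipeline`, its `path` argument, and the sample items are additions for illustration and are not part of the project above.

```python
import os
import tempfile


class TextFilePipeline:
    """Same life cycle as JokePipeline; the path argument is a hypothetical addition."""

    def __init__(self, path="./jokes.txt"):
        self.path = path
        self.file_handler = None

    def open_spider(self, spider):
        self.file_handler = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file_handler.write(item["author"] + ":" + item["content"] + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file_handler.close()


# Drive the pipeline manually, in the order Scrapy would call its methods.
out_path = os.path.join(tempfile.mkdtemp(), "jokes.txt")
pipeline = TextFilePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"author": "alice", "content": "first joke"}, spider=None)
pipeline.process_item({"author": "bob", "content": "second joke"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(lines)  # ['alice:first joke', 'bob:second joke']
```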
Settings Configuration
Add the pipeline to settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.JokePipeline': 300,
}
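The number assigned to each pipeline sets its position in the chain: values are conventionally in the 0 to 1000 range, and items pass through pipelines from the lowest value to the highest. A minimal sketch, assuming a second, hypothetical CleanupPipeline that should run before JokePipeline:

```python
# settings.py fragment; CleanupPipeline is a hypothetical class used for illustration.
ITEM_PIPELINES = {
    "myproject.pipelines.JokePipeline": 300,
    "myproject.pipelines.CleanupPipeline": 200,
}

# Items visit pipelines in ascending order of the assigned value,
# regardless of the order of the dictionary keys.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

The `run_order` line is only there to show the resulting sequence and would not normally appear in settings.py.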
Execution
Run the crawler:
scrapy crawl jokeSpider
Results are saved to jokes.txt.