Overview
This article covers an introduction to the Scrapy framework, installation instructions, and fundamental usage patterns.
What is Scrapy?
Scrapy is a powerful Python framework designed for extracting structured data from websites. It provides a complete solution for web crawling tasks, integrating features like asynchronous downloading, queuing mechanisms, distributed processing, parsing capabilities, and data persistence.
A framework essentially offers a reusable project template with pre-built functionality. Learning a framework means understanding its core features and how to use each component effectively.
Installation
Linux:
pip3 install scrapy
Windows:
- Install the wheel package:
pip3 install wheel
- Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
- Navigate to the download directory and run (the file name depends on your Python version and architecture; this one is for CPython 3.5 on 64-bit Windows):
pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
- Install pywin32:
pip3 install pywin32
- Finally, install Scrapy:
pip3 install scrapy
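If installation fails (for example, because downloads time out), create a pip configuration file that raises the timeout and points pip at a mirror:
- Open File Explorer
- Enter %appdata% in the address bar
- Create a folder named pip
- Inside pip, create pip.ini with the following content: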
[global]
timeout = 6000
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
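The manual steps above could also be scripted. The following is a minimal sketch, not part of pip or Scrapy: the function name write_pip_config and its parameters are hypothetical, and it only assumes that the target directory is the one %appdata% resolves to.

```python
import configparser
from pathlib import Path


def write_pip_config(appdata_dir,
                     index_url="https://mirrors.aliyun.com/pypi/simple/",
                     trusted_host="mirrors.aliyun.com",
                     timeout=6000):
    """Create <appdata_dir>/pip/pip.ini with the settings listed above."""
    config = configparser.ConfigParser()
    config["global"] = {
        "timeout": str(timeout),
        "index-url": index_url,
        "trusted-host": trusted_host,
    }
    # Create the pip folder if it does not exist yet
    pip_dir = Path(appdata_dir) / "pip"
    pip_dir.mkdir(parents=True, exist_ok=True)
    config_path = pip_dir / "pip.ini"
    with config_path.open("w", encoding="utf-8") as handle:
        config.write(handle)
    return config_path
```

On Windows it would typically be called as write_pip_config(os.environ["APPDATA"]), after importing os.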
Basic Usage
Project Creation
Use the command line to initiate a new project:
scrapy startproject project_name
The generated project structure includes:
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Key files:
- scrapy.cfg: Main project configuration
- items.py: Defines data models for structured storage
- pipelines.py: Handles data persistence operations
- settings.py: Configuration parameters, including concurrency limits and delays
- spiders/: Directory containing spider implementations
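To make the role of pipelines.py more concrete, here is a minimal sketch of a pipeline that writes each item to a JSON Lines file. The class name JsonLinesPipeline and the output file name items.jl are assumptions chosen for illustration; open_spider, close_spider, and process_item are the hook methods Scrapy calls on a pipeline object.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each scraped item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path  # output file name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines still receive it
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs once it is listed in the ITEM_PIPELINES dictionary in settings.py, for example under the key 'project_name.pipelines.JsonLinesPipeline' with a priority value such as 300.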
Spider Generation
Navigate to your project directory:
cd project_name
Create a new spider with:
scrapy genspider app_name target_url
Example:
scrapy genspider example example.com
Spider Implementation
After generation, a Python file will be created in the spiders directory with the following basic structure:
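# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl example
    name = 'example'
    # Requests to hosts outside these domains are filtered out
    allowed_domains = ['example.com']
    # The crawl starts from these URLs
    start_urls = ['http://example.com/']

    def parse(self, response):
        # response.text is the decoded page (str); response.body is the raw bytes
        print(response.text)
        print(response.body)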
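The allowed_domains attribute expects bare domain names rather than URLs. The sketch below is a simplified stand-in for that kind of host check, written with the standard library only; it is not Scrapy's actual implementation, and the function name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


print(is_allowed("http://example.com/page", ["example.com"]))      # True
print(is_allowed("http://blog.example.com/", ["example.com"]))     # True
print(is_allowed("http://example.org/", ["example.com"]))          # False
# An entry written as a URL never matches a host name
print(is_allowed("http://example.com/", ["http://example.com/"]))  # False
```

The last call shows why an entry such as 'http://example.com/' in allowed_domains does not behave as intended.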
Settings Configuration
Modify settings.py to customize behavior:
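# Around line 19 of the generated file: send a browser-like User-Agent header
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
# Around line 22: do not apply robots.txt rules (the generated file sets this to True)
ROBOTSTXT_OBEY = False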
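The concurrency limits and delays mentioned earlier are also plain assignments in settings.py. The setting names below are standard Scrapy settings; the values are illustrative assumptions, not recommendations from the original article.

```python
# settings.py (additional entries; values are illustrative assumptions)

# Maximum number of requests Scrapy performs in parallel
CONCURRENT_REQUESTS = 8

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1.0

# Let Scrapy adjust the delay based on how quickly the server responds
AUTOTHROTTLE_ENABLED = True
```

Lower concurrency and a non-zero delay reduce the load placed on the target site, at the cost of a slower crawl.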
Execution
Run the spider using:
scrapy crawl app_name
Optional: Suppress logs with:
scrapy crawl app_name --nolog
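If the crawl needs to be started from another Python script, the same commands can be wrapped with the standard library. This is only a sketch: build_crawl_command and run_spider are hypothetical helper names, and it assumes the scrapy executable is on the PATH and that project_dir is the directory containing scrapy.cfg.

```python
import subprocess


def build_crawl_command(spider_name, quiet=False):
    """Build the argument list for 'scrapy crawl', optionally with --nolog."""
    command = ["scrapy", "crawl", spider_name]
    if quiet:
        command.append("--nolog")
    return command


def run_spider(spider_name, project_dir=".", quiet=False):
    """Run the spider from the project directory and return the exit code."""
    completed = subprocess.run(
        build_crawl_command(spider_name, quiet=quiet),
        cwd=project_dir,
        check=False,
    )
    return completed.returncode
```

For example, run_spider("example", project_dir="project_name", quiet=True) corresponds to running scrapy crawl example --nolog inside the project directory.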
Practical Example: Scraping Jokes from Qiushibaike
# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/']

    def parse(self, response):
        # Each child div of #content-left holds one joke
        joke_containers = response.xpath('//div[@id="content-left"]/div')
        results = []
        for container in joke_containers:
            # [0] raises IndexError if the XPath matches nothing
            author = container.xpath('.//div[@class="author clearfix"]/a/h2/text()')[0].extract()
            content = container.xpath('.//div[@class="content"]/span/text()')[0].extract()
            item = {
                'author': author,
                'content': content
            }
            results.append(item)
        return results
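Indexing a selector result with [0] fails with an IndexError when a container lacks the expected element. The sketch below illustrates a more defensive extraction pattern using only the standard library, so it runs without Scrapy or network access. The SAMPLE markup is a hypothetical, well-formed stand-in for the page, and first_text and parse_jokes are illustrative names. In a Scrapy spider, calling .get(default='') on the selector result should play a similar role.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup
SAMPLE = """
<div id="content-left">
  <div>
    <div class="author clearfix"><a><h2> alice </h2></a></div>
    <div class="content"><span> first joke </span></div>
  </div>
  <div>
    <div class="content"><span> anonymous joke </span></div>
  </div>
</div>
"""


def first_text(node, path, default=""):
    """Return the stripped text of the first match, or default if nothing matches."""
    match = node.find(path)
    if match is None or match.text is None:
        return default
    return match.text.strip()


def parse_jokes(markup):
    """Extract author and content from every joke container in the markup."""
    root = ET.fromstring(markup.strip())
    jokes = []
    for container in root.findall("./div"):
        jokes.append({
            "author": first_text(container, ".//div[@class='author clearfix']/a/h2", default="unknown"),
            "content": first_text(container, ".//div[@class='content']/span"),
        })
    return jokes


print(parse_jokes(SAMPLE))
```

The second container has no author block, so it falls back to the default value instead of aborting the whole parse.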