Introduction to Scrapy Framework and Basic Usage

Overview

This article covers an introduction to the Scrapy framework, installation instructions, and fundamental usage patterns.

What is Scrapy?

Scrapy is a Python framework for extracting structured data from websites. It covers the whole crawling workflow: asynchronous downloading, request queuing and scheduling, response parsing, and data persistence. Distributed crawling is not built in but is available through extensions such as scrapy-redis.

A framework is essentially a reusable project template with pre-built functionality. Learning one means understanding its core features and how to use each component effectively.

Installation

Linux:

pip3 install scrapy

Windows:

  1. Install wheel package:
    pip3 install wheel
    
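  2. Download the Twisted wheel that matches your Python version and architecture from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  3. Navigate to the download directory and install the file you downloaded, for example (Python 3.5, 64-bit):
    pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl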
    
  4. Install pywin32:
    pip3 install pywin32
    
  5. Finally, install Scrapy:
    pip3 install scrapy
    

If installation fails because downloads are slow or time out, configure pip to use a mirror:

  1. Open File Explorer
  2. Enter %appdata% in the address bar
  3. Create folder pip
  4. Inside pip, create pip.ini with the following content:
[global]
timeout = 6000
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com

Basic Usage

Project Creation

Use the command line to initiate a new project:

scrapy startproject project_name

The generated project structure includes:

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py

Key files:

  • scrapy.cfg: Project configuration file (points to the settings module and holds deployment options)
  • items.py: Defines data models for structured storage
  • pipelines.py: Handles data persistence operations
  • settings.py: Configuration parameters including concurrency limits and delays
  • spiders/: Directory containing spider implementations
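
To make the roles of items.py and pipelines.py more concrete, here is a minimal sketch of what the two files might contain. The names JokeItem and JsonLinesPipeline and the jokes.jsonl path are hypothetical; open_spider, close_spider, and process_item are the hooks Scrapy calls on a pipeline. The item is written as a dataclass, which recent Scrapy versions (2.2 or later) accept alongside scrapy.Item subclasses and plain dicts, so the sketch runs without Scrapy installed.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class JokeItem:
    """Hypothetical data model that could live in items.py."""
    author: str
    content: str


class JsonLinesPipeline:
    """Hypothetical pipeline for pipelines.py: one JSON object per line."""

    def __init__(self, path="jokes.jsonl"):
        self.path = path  # assumed output location
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Accept both dataclass items and plain dicts
        record = asdict(item) if is_dataclass(item) else dict(item)
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

A pipeline only runs once it is enabled through the ITEM_PIPELINES setting in settings.py, for example ITEM_PIPELINES = {'project_name.pipelines.JsonLinesPipeline': 300}.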

Spider Generation

Navigate to your project directory:

cd project_name

Create a new spider by giving it a name and the domain it should crawl:

scrapy genspider app_name target_url

Example:

scrapy genspider example example.com

Spider Implementation

After generation, a Python file will be created in the spiders directory with the following basic structure:

# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

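    def parse(self, response):
        # Called with the downloaded response for each URL in start_urls
        print(response.text)  # body decoded to a str
        print(response.body)  # raw body as bytes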
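
The two attributes printed in parse differ in type: response.body is the raw payload as bytes, and response.text is that payload decoded to a str. The relationship can be sketched in plain Python without Scrapy. The sample markup and the fixed UTF-8 encoding are assumptions for illustration; Scrapy works out the encoding from the response itself.

```python
# Hypothetical page content, encoded the way a server might send it
body = "<html><title>caf\u00e9</title></html>".encode("utf-8")

# Roughly what response.text provides: the body decoded to a str
text = body.decode("utf-8")

# The byte count and character count differ because "é" takes two bytes in UTF-8
print(type(body).__name__, len(body))
print(type(text).__name__, len(text))
```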

Settings Configuration

Modify settings.py to customize behavior:

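# Send a browser-like User-Agent instead of Scrapy's default
# (this setting is commented out in the generated file)
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Do not consult robots.txt before crawling (the generated file sets this to True)
ROBOTSTXT_OBEY = False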
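
The concurrency limits and delays mentioned in the list of key files are also set in settings.py. The fragment below is a sketch with illustrative values, not recommendations: the setting names are standard Scrapy settings, but suitable numbers depend on the target site.

```python
# Maximum number of requests Scrapy performs at the same time
CONCURRENT_REQUESTS = 8

# Upper bound on simultaneous requests to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to the observed server latency
AUTOTHROTTLE_ENABLED = True
```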

Execution

Run the spider by the name it was given at generation (its name attribute):

scrapy crawl app_name

Optional: Suppress logs with:

scrapy crawl app_name --nolog

Practical Example: Scraping Jokes from Qiushibaike

# -*- coding: utf-8 -*-
import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    allowed_domains = ['www.qiushibaike.com']  # domain names only, no scheme or path
    start_urls = ['https://www.qiushibaike.com/']

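    def parse(self, response):
        # Each direct child div of #content-left holds one joke
        joke_containers = response.xpath('//div[@id="content-left"]/div')

        for container in joke_containers:
            # .get() returns None instead of raising IndexError when nothing matches
            author = container.xpath('.//div[@class="author clearfix"]/a/h2/text()').get()
            content = container.xpath('.//div[@class="content"]/span/text()').get()
            if author is None or content is None:
                continue  # skip containers without the expected markup

            # Yield one item per joke; Scrapy collects the yielded dicts
            yield {
                'author': author.strip(),
                'content': content.strip(),
            }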
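
The container-relative lookups in this spider (paths starting with .//) can be tried offline. The sketch below is not Scrapy code: it swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step, so the .text attribute is read instead), and it runs on a hypothetical, well-formed fragment that mimics the structure the spider expects. The real page markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the markup the spider's XPath assumes
SAMPLE = """
<div id="content-left">
  <div>
    <div class="author clearfix"><a><h2>alice</h2></a></div>
    <div class="content"><span>First joke</span></div>
  </div>
  <div>
    <div class="author clearfix"><a><h2>bob</h2></a></div>
    <div class="content"><span>Second joke</span></div>
  </div>
</div>
"""


def extract_jokes(markup):
    """Return one dict per joke container found in the markup."""
    root = ET.fromstring(markup.strip())
    jokes = []
    # The root is the content-left div; its direct div children are the containers
    for container in root.findall("./div"):
        # Paths starting with .// search only inside the current container
        author = container.find(".//div[@class='author clearfix']/a/h2")
        content = container.find(".//div[@class='content']/span")
        if author is None or content is None:
            continue  # skip containers without the expected markup
        jokes.append({
            "author": author.text.strip(),
            "content": content.text.strip(),
        })
    return jokes


jokes = extract_jokes(SAMPLE)
for joke in jokes:
    print(joke["author"], "-", joke["content"])
```

Keeping each lookup relative to its container is what pairs every author with the right joke; an absolute // path inside the loop would search the whole document on every iteration.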

Tags: scrapy web scraping python Framework Data Extraction

Posted on Sun, 21 Jun 2026 17:18:44 +0000 by faizanno1