Scraping Classical Poetry Websites with Scrapy

Project Setup in PyCharm

Create a new Python project named ScrapyProject in PyCharm.

Scrapy Installation

Package Installation

pip install scrapy

For faster installation in China, use the Tsinghua PyPI mirror:

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple/

Project Structure Initialization

scrapy startproject poetry_scraper

Key directories and files:

  • spiders/: Contains spider implementations
  • items.py: Defines data structure
  • pipelines.py: Processes scraped items
  • settings.py: Configures project settings

Spider Implementation

cd poetry_scraper
scrapy genspider verse_extractor gushiwen.cn

Data Model Definition

import scrapy

class VerseItem(scrapy.Item):
    quote = scrapy.Field()
    origin = scrapy.Field()
    source_link = scrapy.Field()
    full_text = scrapy.Field()

Spider Logic

import scrapy
from ..items import VerseItem

class VerseSpider(scrapy.Spider):
    name = 'verse_extractor'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://so.gushiwen.cn/mingjus/']

    def parse(self, response):
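        for element in response.css('div.sons'):
            relative_url = element.css('a::attr(href)').get()
            if not relative_url:
                continue  # skip blocks that contain no link
            # urljoin resolves relative and already-absolute hrefs alike
            absolute_url = response.urljoin(relative_url)
            excerpt = element.css('a::text').get()
            provenance = element.css('p a::text').get()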
            
            data_record = VerseItem()
            data_record['source_link'] = absolute_url
            data_record['quote'] = excerpt
            data_record['origin'] = provenance
            
            yield scrapy.Request(
                url=absolute_url,
                meta={'item': data_record},
                callback=self.get_full_content
            )
        
        # Follow the pagination link, if any, and parse it with this method
        next_page = response.css('a.amore::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def get_full_content(self, response):
        data_record = response.meta['item']
        text_elements = response.css('div.contson ::text').getall()
        cleaned_content = ''.join(text_elements).strip()
        data_record['full_text'] = cleaned_content
        yield data_record
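
The spider turns each extracted href into an absolute URL before requesting it. As a minimal sketch of how that resolution behaves, the standard library's urllib.parse.urljoin can be tried outside Scrapy; Scrapy's response.urljoin applies the same rules relative to the page being parsed. The helper name resolve and the href values below are hypothetical and chosen only to illustrate the three common cases.

```python
from urllib.parse import urljoin

# Page the links were found on (the spider's start URL).
PAGE_URL = "https://so.gushiwen.cn/mingjus/"


def resolve(href, page_url=PAGE_URL):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical href values, not taken from the real site.
root_relative = resolve("/mingju/example.aspx")
page_relative = resolve("default.aspx?page=2")
already_absolute = resolve("https://www.gushiwen.cn/example.aspx")

print(root_relative)     # https://so.gushiwen.cn/mingju/example.aspx
print(page_relative)     # https://so.gushiwen.cn/mingjus/default.aspx?page=2
print(already_absolute)  # https://www.gushiwen.cn/example.aspx
```

Prefixing a fixed host string onto the href only produces a valid URL in the first case, which is why resolving against the page URL is the safer choice.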

Data Processing Pipeline

import json

class VersePipeline:
    def open_spider(self, spider):
        self.output_file = open('poetry_data.jsonl', 'w', encoding='utf-8')
    
    def close_spider(self, spider):
        self.output_file.close()
        
    def process_item(self, item, spider):
        json_line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.output_file.write(json_line)
        return item
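
The pipeline writes one JSON object per line, so the output can be read back line by line. The following is a rough sketch of that, assuming the file layout produced above; the helper name load_jsonl is hypothetical, and the sample record is hand-written for illustration (its source_link is a placeholder), not scraped output.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Return a list of dicts, one per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hand-written sample record with the same fields as VerseItem.
sample = {
    "quote": "床前明月光",
    "origin": "李白《静夜思》",
    "source_link": "https://so.gushiwen.cn/example",  # placeholder URL
    "full_text": "sample text",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "poetry_data.jsonl")
    # Write the record the same way the pipeline does.
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(sample, ensure_ascii=False) + "\n")
    loaded = load_jsonl(sample_path)

print(loaded[0]["quote"])
```

Because the pipeline passes ensure_ascii=False and opens the file as UTF-8, Chinese text is stored readably in the file and survives the round trip unchanged.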

Project Configuration

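BOT_NAME = 'poetry_scraper'
# Send a browser-like User-Agent instead of Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
# Do not fetch or apply the site's robots.txt rules
ROBOTSTXT_OBEY = False
# Log only warnings and errors
LOG_LEVEL = 'WARNING'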

ITEM_PIPELINES = {
    'poetry_scraper.pipelines.VersePipeline': 300,
}

Execution Methods

Terminal Execution

scrapy crawl verse_extractor

Python Script Execution

Create run.py in the project root, next to scrapy.cfg:

from scrapy import cmdline
cmdline.execute("scrapy crawl verse_extractor".split())

Tags: scrapy WebScraping python DataExtraction

Posted on Fri, 12 Jun 2026 16:32:14 +0000 by junrey