Project Setup in PyCharm
Create a new Python project named ScrapyProject in PyCharm.
Scrapy Installation
Package Installation
pip install scrapy
For faster installation from within China, use the Tsinghua mirror:
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple/
Project Structure Initialization
scrapy startproject poetry_scraper
Key directories and files:
spiders/: Contains spider implementations
items.py: Defines data structure
pipelines.py: Processes scraped items
settings.py: Configures project settings
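As a quick sanity check after running startproject, the entries listed above can be verified with a few lines of standard-library Python. This is only a sketch: the helper name missing_entries and the throwaway demo directory are made up for illustration and are not part of Scrapy.

```python
from pathlib import Path
import tempfile

# Entries named in the list above, relative to the inner package directory
EXPECTED = ["spiders", "items.py", "pipelines.py", "settings.py"]


def missing_entries(package_dir):
    """Return the expected entries that do not exist under package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


# Demo on a throwaway directory that mimics an incomplete layout
demo = Path(tempfile.mkdtemp()) / "poetry_scraper"
(demo / "spiders").mkdir(parents=True)
(demo / "items.py").touch()
print(missing_entries(demo))  # ['pipelines.py', 'settings.py']
```

Against a real project, pass the inner poetry_scraper/poetry_scraper directory; an empty list means every listed entry is present.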
Spider Implementation
cd poetry_scraper
scrapy genspider verse_extractor gushiwen.cn
Data Model Definition
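import scrapy


class VerseItem(scrapy.Item):
    quote = scrapy.Field()        # the quoted line shown on the listing page
    origin = scrapy.Field()       # the work the quote comes from
    source_link = scrapy.Field()  # absolute URL of the detail page
    full_text = scrapy.Field()    # text extracted from the detail page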
Spider Logic
import scrapy

from ..items import VerseItem


class VerseSpider(scrapy.Spider):
    name = 'verse_extractor'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://so.gushiwen.cn/mingjus/']

    def parse(self, response):
        # Each div.sons block is treated as one quote entry on the listing page
        for element in response.css('div.sons'):
            relative_url = element.css('a::attr(href)').get()
            if not relative_url:
                continue  # no link in this block, so there is no detail page to fetch
            # urljoin resolves the href against the current page URL
            absolute_url = response.urljoin(relative_url)
            excerpt = element.css('a::text').get()
            provenance = element.css('p a::text').get()

            data_record = VerseItem()
            data_record['source_link'] = absolute_url
            data_record['quote'] = excerpt
            data_record['origin'] = provenance

            # Pass the partially filled item to the detail-page callback
            yield scrapy.Request(
                url=absolute_url,
                meta={'item': data_record},
                callback=self.get_full_content,
            )

        # Follow pagination once per listing page, after all entries are queued
        next_page = response.css('a.amore::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def get_full_content(self, response):
        data_record = response.meta['item']
        # Join every text node inside div.contson into a single string
        text_elements = response.css('div.contson ::text').getall()
        cleaned_content = ''.join(text_elements).strip()
        data_record['full_text'] = cleaned_content
        yield data_record
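The spider has to turn scraped href values into absolute URLs. Prefixing the host by hand is only correct when the href starts with a slash, so the sketch below shows how the standard-library urljoin resolves the other cases too; Scrapy's response.urljoin performs comparable resolution relative to the page being parsed. The helper name resolve and the sample href values are hypothetical and not taken from the live site.

```python
from urllib.parse import urljoin

BASE_URL = "https://so.gushiwen.cn/mingjus/"  # listing page from start_urls


def resolve(href, base=BASE_URL):
    """Return an absolute URL for href, or None when href is missing."""
    if not href:
        return None
    return urljoin(base, href)


# Hypothetical href values, for illustration only
print(resolve("/mingju/juv_example.aspx"))   # root-relative path
print(resolve("default.aspx?page=2"))        # path relative to the listing page
print(resolve("https://www.gushiwen.cn/"))   # already absolute, returned unchanged
print(resolve(None))                         # missing href
```

Checking for a missing href first avoids requesting a URL that ends in the literal text "None".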
Data Processing Pipeline
import json


class VersePipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.output_file = open('poetry_data.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.output_file.close()

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        json_line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.output_file.write(json_line)
        return item
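The pipeline writes one JSON object per line (the JSON Lines format). A minimal sketch of reading such a file back for inspection follows, assuming the field names from VerseItem. The helper name load_records and the sample record are made up for illustration; the demo writes to a temporary directory rather than to the real output file.

```python
import json
import os
import tempfile


def load_records(path):
    """Read a JSON Lines file and return a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Made-up record, written the same way the pipeline writes items
sample = {
    "quote": "示例名句",
    "origin": "示例出处",
    "source_link": "https://so.gushiwen.cn/",
    "full_text": "示例全文",
}
demo_path = os.path.join(tempfile.mkdtemp(), "poetry_data.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps(sample, ensure_ascii=False) + "\n")

records = load_records(demo_path)
print(len(records), records[0]["quote"])
```

Because each line is an independent JSON document, the file can be processed line by line without loading everything into memory at once.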
Project Configuration
BOT_NAME = 'poetry_scraper'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'WARNING'
ITEM_PIPELINES = {
    'poetry_scraper.pipelines.VersePipeline': 300,
}
Execution Methods
Terminal Execution
scrapy crawl verse_extractor
Python Script Execution
Create run.py in the project root (the directory that contains scrapy.cfg):
from scrapy import cmdline
cmdline.execute("scrapy crawl verse_extractor".split())