Scraping Douban Book Data with Scrapy

Scrapy is an asynchronous web crawling framework for Python, built on Twisted, that makes data extraction efficient and scalable. To scrape book information from Douban's website, first install Scrapy with pip (the -i option points pip at the Tsinghua PyPI mirror):

pip install Scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

Create a new project named douban:

scrapy startproject douban
cd douban

Generate a spider for the Douban books domain:

scrapy genspider book_spider book.douban.com

The generated project structure includes these key files:

  • items.py: Define data fields like title, author, rating.
  • pipelines.py: Process and store extracted data.
  • settings.py: Configure settings such as user agent and pipelines.
  • spiders/: Contains spider logic.

Douban rejects requests that do not send a browser-like User-Agent header. Before configuring Scrapy, verify access with the requests library:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
}
resp = requests.get('https://book.douban.com/tag/%E5%90%8D%E8%91%97', headers=headers)
print(resp.status_code)  # Expect 200 if the request was accepted

Add the same User-Agent to settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
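settings.py is also where pipelines are enabled and the crawl rate is tuned. The fragment below is a minimal sketch of such additions. The pipeline path assumes the DoubanPipeline class that startproject generates by default; replace it with whichever class pipelines.py actually defines. The delay value is an arbitrary, conservative choice rather than a known Douban limit.

```python
# Additional settings.py entries (values are illustrative assumptions).

# Wait about two seconds between requests to the same site to reduce
# the chance of being rate limited.
DOWNLOAD_DELAY = 2

# Enable the item pipeline; lower numbers run earlier (valid range 0-1000).
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
}
```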

In the spider file (spiders/book_spider.py), set start_urls to https://book.douban.com/tag/%E5%90%8D%E8%91%97. Then use the browser's developer tools to inspect the page, identify the element that wraps each book, and write XPath expressions that extract the title, author, and rating from it.

Tags: scrapy web scraping python Douban xpath

Posted on Fri, 15 May 2026 10:30:34 +0000 by webmaster1