Scrapy is an asynchronous web crawling framework built on Twisted, enabling efficient and scalable data extraction in Python. To begin scraping book information from Douban's website, first install Scrapy with pip (here via the Tsinghua mirror):
pip install Scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
Create a new project named douban:
scrapy startproject douban
cd douban
Generate a spider for the Douban books domain (genspider takes a domain name, not a full URL):
scrapy genspider book_spider book.douban.com
The project structure includes these key files:
items.py: defines data fields such as title, author, and rating.
pipelines.py: processes and stores the extracted data.
settings.py: configures options such as the user agent and pipelines.
spiders/: contains the spider logic.
Douban blocks requests without proper headers. Test access using requests:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
}
resp = requests.get('https://book.douban.com/tag/%E5%90%8D%E8%91%97', headers=headers)
print(resp.status_code)  # Should print 200
Add the same User-Agent to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
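Beyond the User-Agent, settings.py is also where pipelines and crawl pacing are configured. A hedged sketch of the relevant options (the pipeline path assumes the default DoubanPipeline class that Scrapy generates in pipelines.py; the delay value is a suggestion, not a requirement):

```python
# settings.py — additional options worth setting.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'

# Enable a pipeline; 'douban.pipelines.DoubanPipeline' assumes the
# default class name Scrapy scaffolds in pipelines.py.
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

# Throttle requests to reduce the chance of being blocked.
DOWNLOAD_DELAY = 1
```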
In the spider file, set start_urls to ['https://book.douban.com/tag/%E5%90%8D%E8%91%97']. Use the browser's developer tools to inspect the page and construct XPath expressions that extract each book's title, author, and rating from the page structure.