Core Framework Architecture
Scrapy operates as an asynchronous web scraping framework built upon Twisted. The main components coordinating data flow are:
- Engine: Controls the data flow between all other components and fires events as actions occur.
- Scheduler: Accepts requests, maintains a priority queue, and deduplicates URLs.
- Downloader: Retrieves page content via non-blocking I/O and returns responses.
- Spiders: Define extraction logic for Items and new request generation.
- Item Pipeline: Processes parsed Items for validation, filtering, or storage.
- Downloader Middleware: Intercepts requests/responses between Engine and Downloader.
- Spider Middleware: Hooks between Engine and Spiders to process the responses going in and the Items and requests coming out.
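To make the Item Pipeline role concrete, here is a minimal sketch of a pipeline class. The class name JsonLinesPipeline and the output path items.jsonl are hypothetical; the method names open_spider, close_spider, and process_item are the hooks Scrapy calls on a pipeline. The sketch treats items as plain dicts and is one possible design, not the only way to store results.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: trims string fields, then appends each item to a JSON Lines file."""

    def __init__(self, path="items.jsonl"):
        self.path = path  # output path is an assumption made for this sketch
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        if self.file:
            self.file.close()

    def process_item(self, item, spider):
        # Called for every item; the return value is handed to the next pipeline.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        return cleaned
```

A pipeline only runs once it is listed in the ITEM_PIPELINES setting of the project's settings.py.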
Execution Flow
- The Spider's start URLs are wrapped as Requests and queued in the Scheduler.
- Engine takes the next Request from the Scheduler and forwards it to the Downloader.
- Downloader fetches the resource and returns a Response.
- A Spider callback processes the Response.
- Extracted Items are sent to the Pipeline.
- Extracted URLs are wrapped as new Requests and sent back to the Scheduler.
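The loop described above can be imitated with a toy model that uses only the standard library. Everything in it is an assumption made for illustration: the names ToyScheduler, download, parse, and crawl, the FAKE_PAGES data, and the site.test URLs are all hypothetical. It is not how Scrapy is implemented; the real engine runs many requests concurrently on Twisted and deduplicates on request fingerprints rather than raw URL strings.

```python
import heapq

# Fake "web": URL -> (page title, outgoing links). Stands in for real HTTP.
FAKE_PAGES = {
    "http://site.test/": ("Home", ["http://site.test/a", "http://site.test/b"]),
    "http://site.test/a": ("Page A", ["http://site.test/b"]),
    "http://site.test/b": ("Page B", ["http://site.test/"]),
}


class ToyScheduler:
    """Priority queue of URLs that drops anything it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal priorities leave in insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate, ignored
        self._seen.add(url)
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Toy downloader: looks the page up instead of making a network call."""
    title, links = FAKE_PAGES[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Toy spider callback: yields one item, then every link found on the page."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"follow": link}


def crawl(start_url):
    """Toy engine loop tying scheduler, downloader, spider, and pipeline together."""
    scheduler, collected = ToyScheduler(), []
    scheduler.enqueue(start_url)
    url = scheduler.next_url()
    while url is not None:
        response = download(url)
        for result in parse(response):
            if "item" in result:
                collected.append(result["item"])     # "pipeline" step
            else:
                scheduler.enqueue(result["follow"])  # back to the scheduler
        url = scheduler.next_url()
    return collected


items = crawl("http://site.test/")
print([item["title"] for item in items])  # ['Home', 'Page A', 'Page B']
```

Each page is visited once even though the pages link to each other in a cycle, which is the job the deduplicating scheduler performs in the real framework.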
Installation and Project Setup
Windows
pip3 install wheel
# Download the appropriate Twisted wheel from:
# http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
pip3 install pywin32
pip3 install scrapy
Linux
pip3 install scrapy
Scaffolding a Project
scrapy startproject my_crawler
cd my_crawler
Generate a spider targeting a domain:
scrapy genspider example_spider example.com
Directory Layout
my_crawler/
    scrapy.cfg
    my_crawler/
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            example_spider.py
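Most project-wide behavior is configured in settings.py. The excerpt below is a sketch of commonly adjusted settings, not the full generated file. The setting names are real Scrapy settings; the values are illustrative assumptions, and the pipeline path my_crawler.pipelines.MyCrawlerPipeline is assumed to match the class in your pipelines.py.

```python
# settings.py (excerpt; values are illustrative assumptions)

BOT_NAME = "my_crawler"

# Where Scrapy looks for spiders and where genspider writes new ones.
SPIDER_MODULES = ["my_crawler.spiders"]
NEWSPIDER_MODULE = "my_crawler.spiders"

# Respect robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Throttling: seconds to wait between requests, and the global concurrency cap.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS = 8

# Enabled pipelines; lower numbers run first (conventionally 0 to 1000).
ITEM_PIPELINES = {
    "my_crawler.pipelines.MyCrawlerPipeline": 300,  # assumed class name
}
```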
Execution Commands
# Run the spider by its name attribute
scrapy crawl example_spider
# Same run, with log output suppressed
scrapy crawl example_spider --nolog
Spider Implementation Example
Below is a practical spider illustrating XPath selection and recursive following of paginated links.
import scrapy
from scrapy.http import Request


class NewsSpider(scrapy.Spider):
    name = 'newsbot'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # Each matched div holds one headline entry
        entries = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        with open('crawled_links.log', 'a+', encoding='utf-8') as log_file:
            for entry in entries:
                title = entry.xpath('.//a/text()').get()
                link = entry.xpath('.//a/@href').get()
                if title and link:
                    log_file.write(f"{link} - {title.strip()}\n")

        # Discover next page URLs; Scrapy's duplicate filter skips pages already requested
        next_pages = response.xpath('//div[@id="dig_lcpage"]//a/@href').getall()
        for relative_url in next_pages:
            full_url = f"https://dig.chouti.com{relative_url}"
            yield Request(url=full_url, callback=self.parse)
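The spider above builds next-page URLs by prefixing a fixed host, which only works while every link is a root-relative path on that host. A more general option is to resolve each link against the URL of the page it came from. Scrapy exposes this as response.urljoin(), which builds on the standard library's urljoin. The sketch below uses the standard library directly so it runs without a crawl; the page URL and link paths are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed.
page_url = "https://dig.chouti.com/all/hot/recent/1"

# Three shapes a pagination href can take.
root_relative = urljoin(page_url, "/all/hot/recent/2")     # starts at the host root
page_relative = urljoin(page_url, "3")                     # replaces the last path segment
already_absolute = urljoin(page_url, "https://example.com/x")  # returned unchanged

print(root_relative)     # https://dig.chouti.com/all/hot/recent/2
print(page_relative)     # https://dig.chouti.com/all/hot/recent/3
print(already_absolute)  # https://example.com/x
```

Inside a callback the equivalent call would be response.urljoin(relative_url), which removes the need to hard-code the host.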
XPath Selectors Deep Dive
Scrapy relies on Selector objects built from a Response to apply XPath queries. A consistent pattern is to instantiate a Selector and chain methods for extraction.
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html_body = '''
<html>
<body>
<ul>
<li class="item-0"><a id="link1" href="page1.html">first item</a></li>
<li class="item-1"><a href="page2.html">second item<span>span content</span></a></li>
</ul>
<div><a href="external.html">external link</a></div>
</body>
</html>
'''
resp = HtmlResponse(url='http://demo.com', body=html_body, encoding='utf-8')
selector = Selector(response=resp)
# Select all anchor elements
all_anchors = selector.xpath('//a')
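# Select every <a> that is the second <a> child of its own parent
# (no match in this sample; '(//a)[2]' selects the second anchor in the whole document)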
second_anchor = selector.xpath('//a[2]')
# Anchor with any id attribute
anchors_with_id = selector.xpath('//a[@id]')
# Specific id match
specific_id = selector.xpath('//a[@id="link1"]')
# Combined attribute conditions
combined_attrs = selector.xpath('//a[contains(@href, "page") and @id]')
# Starts-with function
starts_with_link = selector.xpath('//a[starts-with(@href, "page")]')
# Regex test on attribute (EXSLT extension; Scrapy selectors pre-register the
# 're' prefix, so passing namespaces explicitly is optional)
regex_id = selector.xpath(
    '//a[re:test(@id, "link\\d+")]',
    namespaces={'re': 'http://exslt.org/regular-expressions'}
)
# Extract text for matched elements
text_values = selector.xpath('//a[re:test(@id, "link\\d+")]/text()').getall()
# Extract href attributes
href_attributes = selector.xpath('//a[re:test(@id, "link\\d+")]/@href').getall()
# Full absolute path
absolute_path_href = selector.xpath('/html/body/ul/li/a/@href').getall()
# .get() returns only the first match (older code calls this extract_first())
first_href = selector.xpath('//body/ul/li/a/@href').get()
# Iterate and extract nested elements
list_items = selector.xpath('//body/ul/li')
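for li in list_items:
    # The path is relative to the current <li>; 'a/span' is equivalent to './a/span'
    span = li.xpath('./a/span')
    # Prints None for list items whose anchor has no <span>
    print(span.get())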
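Positional predicates are a frequent source of confusion: a[2] counts anchors per parent, not across the document. The sketch below demonstrates this with the standard library's xml.etree.ElementTree instead of Scrapy's Selector, so it runs without Scrapy installed. ElementTree supports only a subset of XPath and needs well-formed markup, so treat it as an illustration of the rule rather than a replacement for Scrapy selectors.

```python
import xml.etree.ElementTree as ET

markup = """
<html>
<body>
<ul>
<li class="item-0"><a id="link1" href="page1.html">first item</a></li>
<li class="item-1"><a href="page2.html">second item<span>span content</span></a></li>
</ul>
<div><a href="external.html">external link</a></div>
</body>
</html>
"""

root = ET.fromstring(markup.strip())

# a[2] means "an <a> that is the second <a> child of its parent".
# Every parent here has exactly one <a>, so nothing matches.
per_parent_second = root.findall(".//a[2]")

# To get the second anchor in document order, index the full result list.
second_in_document = root.findall(".//a")[1]

# Alternatively, apply the position to the element that actually repeats.
second_li_anchor = root.findall(".//li[2]/a")

print(len(per_parent_second))                   # 0
print(second_in_document.get("href"))           # page2.html
print(second_li_anchor[0].get("href"))          # page2.html
print("".join(second_in_document.itertext()))   # second itemspan content
```

In a Scrapy selector, the document-order version can be written in XPath itself as (//a)[2].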
Mastering these selection patterns enables precise data harvesting across complex page structures.