Data Flow Between Scrapy Components
Scrapy manages communication between different components through two fundamental objects: Request and Response. The Spider generates Request objects, which travel through the engine and download middleware before being executed by the downloader. Each Request eventually produces a Response that flows back through the middleware to the originating spider for parsing.
Creating Request Objects in Spider
Request objects are instantiated inside spider methods, typically in start_requests() or in other callbacks that yield follow-up requests. The following demonstrates how a crawler initiates multiple requests:
from urllib.parse import quote

from scrapy.http import Request
from scrapy.spiders import CrawlSpider


class ProductSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    def start_requests(self):
        search_terms = self.settings.getlist('SEARCH_TERMS')
        max_pages = self.settings.getint('MAX_PAGES', 10)
        for term in search_terms:
            for page_num in range(1, max_pages + 1):
                # The URL is identical for every page; the download middleware
                # shown later reads meta['page'] to navigate to the right page.
                search_url = f'https://www.example.com/search?q={quote(term)}'
                yield Request(
                    url=search_url,
                    callback=self.parse_results,
                    meta={'page': page_num, 'term': term},
                    dont_filter=True
                )
The yield statement generates Request instances that the Scrapy engine subsequently processes.
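The spider above reads its search terms and page limit from the project settings. A minimal sketch of the corresponding settings.py entries, with names matching the code above and values chosen purely for illustration:

# settings.py -- illustrative values consumed by start_requests() above
SEARCH_TERMS = ['laptop', 'tablet']
MAX_PAGES = 5

The same settings can also be supplied per run with the -s option of the scrapy crawl command.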
Handling Requests in Download Middleware
Middleware intercepts Request objects through the process_request() method. This example integrates Selenium to handle JavaScript-rendered pages:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from logging import getLogger


class BrowserMiddleware:
    def __init__(self, driver_path, wait_timeout):
        self.logger = getLogger(__name__)
        self.driver_path = driver_path
        self.timeout = wait_timeout

    def process_request(self, request, spider):
        # Requests flagged with 'skip_browser' fall through to the regular downloader.
        if 'skip_browser' in request.meta:
            return None
        self.logger.debug(f'Loading {request.url} in browser')
        current_page = request.meta.get('page', 1)
        # Create the driver before the try block so that finally can always close it.
        # Note: Selenium 4 replaced executable_path with a Service object.
        driver = webdriver.Firefox(executable_path=self.driver_path)
        try:
            driver.set_window_size(1366, 768)
            driver.get(request.url)
            if current_page > 1:
                # Navigate to the requested page via the site's pagination form.
                page_input = WebDriverWait(driver, self.timeout).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'input[name="page"]'))
                )
                page_input.clear()
                page_input.send_keys(str(current_page))
                driver.find_element(By.CSS_SELECTOR, 'button.submit').click()
                WebDriverWait(driver, self.timeout).until(
                    EC.text_to_be_present_in_element(
                        (By.CSS_SELECTOR, '.pagination .active'),
                        str(current_page)
                    )
                )
            return HtmlResponse(
                url=request.url,
                body=driver.page_source.encode('utf-8'),
                request=request,
                encoding='utf-8',
                status=200
            )
        except Exception as e:
            self.logger.error(f'Browser error: {e}')
            return HtmlResponse(url=request.url, status=500, request=request)
        finally:
            driver.quit()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            driver_path=crawler.settings.get('GECKODRIVER_PATH'),
            wait_timeout=crawler.settings.getint('SELENIUM_TIMEOUT', 30)
        )
When process_request() returns a Response object, Scrapy skips the remaining download middlewares and the downloader itself; the response is passed back through the installed middlewares' process_response() methods and then delivered to the request's callback.
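For any of this to run, the middleware has to be enabled in the project settings. A hedged sketch, assuming the class lives in a hypothetical myproject.middlewares module; the priority number and setting values are illustrative:

# settings.py -- illustrative registration of BrowserMiddleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BrowserMiddleware': 543,
}
GECKODRIVER_PATH = '/usr/local/bin/geckodriver'   # assumed driver location
SELENIUM_TIMEOUT = 30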
Request Class Parameters
scrapy.http.Request(
    url,
    callback=None,
    method='GET',
    headers=None,
    body=None,
    cookies=None,
    meta=None,
    encoding='utf-8',
    priority=0,
    dont_filter=False,
    errback=None
)
url — The target URL for the HTTP request.
callback — Callable invoked with the downloaded Response. Defaults to the spider's parse() method when omitted.
method — HTTP verb, defaults to 'GET'. Supported methods include 'GET', 'POST', 'PUT', 'HEAD', 'OPTIONS'.
headers — Dictionary of HTTP headers sent with the request.
body — Request payload as bytes or string. Used for POST and PUT requests.
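As a sketch of method, headers, and body working together, the following would send a JSON payload from inside a spider; the endpoint, field names, and callback name are made up for illustration:

import json

def start_requests(self):
    # POST a JSON document to a hypothetical search API
    yield Request(
        url='https://api.example.com/search',
        method='POST',
        headers={'Content-Type': 'application/json'},
        body=json.dumps({'query': 'laptop', 'page': 1}),
        callback=self.parse_api_response,
    )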
cookies — Supports two formats:
# Simple dictionary format
simple_request = Request(
    url='http://api.example.com/endpoint',
    cookies={'session_id': 'abc123', 'user_pref': 'dark'}
)

# List format for cookie attributes
detailed_request = Request(
    url='http://api.example.com/endpoint',
    cookies=[
        {
            'name': 'tracking',
            'value': 'xyz789',
            'domain': '.example.com',
            'path': '/api'
        }
    ]
)
The list format allows controlling cookie scope through domain and path attributes.
To prevent automatic cookie merging from previous responses:
Request(
    url='http://example.com',
    cookies={'token': 'secret'},
    meta={'dont_merge_cookies': True}
)
meta — Dictionary for passing arbitrary metadata between components. When provided, the dict is shallow-copied.
encoding — Character encoding for URL and headers, defaults to 'utf-8'.
priority — Integer affecting scheduling order; requests with higher values are dequeued first (see the short sketch after this list).
dont_filter — When True, bypasses the scheduler's duplicate request filter. Use sparingly.
errback — Callable invoked when the request fails. Receives a Twisted Failure instance.
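As a short sketch of priority, the following two requests could be yielded from the same callback; the detail request, having the higher value, is scheduled ahead of the listing request (URLs and callback names are illustrative):

# Detail pages jump ahead of further listing pages in the scheduler queue
yield Request(detail_url, callback=self.parse_detail, priority=10)
yield Request(next_listing_url, callback=self.parse_listing, priority=0)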
Request Attributes and Methods
url — Read-only string containing request URL.
method — HTTP method string such as 'GET' or 'POST'.
headers — Case-insensitive dict-like object for HTTP headers.
body — Raw request body as bytes.
meta — Mutable dict storing metadata. Modified by middleware and extensions.
copy() — Returns a clone of the current Request.
replace(**kwargs) — Returns a new Request with the specified fields overridden:
original = Request(url='http://example.com', meta={'id': 1})
modified = original.replace(method='POST', meta={'id': 2, 'extra': 'data'})
Passing Data Through Callbacks
The Request.meta dictionary enables data sharing between callback functions without relying on class attributes:
def extract_category(self, response):
    category = response.css('.category::text').get()
    item = ProductItem()
    item['category'] = category
    # .getall() yields plain strings; iterating the SelectorList directly
    # would pass Selector objects to urljoin()
    for product_url in response.css('.product-link::attr(href)').getall():
        new_request = Request(
            url=response.urljoin(product_url),
            callback=self.parse_product
        )
        new_request.meta['item'] = item
        new_request.meta['source_page'] = response.url
        yield new_request

def parse_product(self, response):
    stored_item = response.meta['item']
    source = response.meta['source_page']
    item = ProductItem()
    item['name'] = response.css('.product-title::text').get()
    item['price'] = response.css('.price::text').re_first(r'[\d.]+')
    item['category'] = stored_item['category']
    item['discovered_via'] = source
    yield item
Error Handling with errback
The errback parameter registers a handler that is called when a request fails, covering network problems such as DNS lookup errors and timeouts as well as HTTP error responses:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import (
    DNSLookupError,
    TimeoutError,
    TCPTimedOutError,
    ConnectionRefusedError
)


class RobustSpider(scrapy.Spider):
    name = 'reliable_crawler'

    def start_requests(self):
        urls = [
            'https://httpbin.org/',                 # succeeds
            'https://httpbin.org/status/404',       # HttpError
            'https://httpbin.org/status/500',       # HttpError
            'https://httpbin.org:12345/',           # connection error (unused port)
            'https://nonexistent-domain.invalid/',  # DNSLookupError
        ]
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.handle_success,
                errback=self.handle_failure,
                dont_filter=True
            )

    def handle_success(self, response):
        self.logger.info(f'Response {response.status} from {response.url}')

    def handle_failure(self, failure):
        self.logger.error(f'Request failed: {failure}')
        if failure.check(HttpError):
            # Non-2xx responses reach the errback as HttpError; the original
            # response is available on the exception value
            response = failure.value.response
            self.logger.error(f'HTTP {response.status} at {response.url}')
        elif failure.check(DNSLookupError):
            self.logger.error(f'DNS resolution failed for {failure.request.url}')
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error(f'Timeout reaching {failure.request.url}')
        elif failure.check(ConnectionRefusedError):
            self.logger.error(f'Connection refused at {failure.request.url}')
Reserved Request.meta Keys
Scrapy reserves several meta keys for internal use:
| Key | Purpose |
|---|---|
| dont_redirect | Prevents the request from following redirects |
| dont_retry | Disables automatic retry on failure |
| handle_httpstatus_list | Specifies acceptable HTTP status codes |
| handle_httpstatus_all | Accepts all HTTP status codes |
| dont_merge_cookies | Prevents merging cookies from prior responses |
| cookiejar | Enables per-request cookie jar isolation |
| dont_cache | Instructs middleware to skip response caching |
| redirect_urls | Stores the redirect chain for the request |
| bindaddress | Source IP address for outgoing connections |
| dont_obey_robotstxt | Ignores robots.txt restrictions |
| download_timeout | Per-request download timeout in seconds |
| download_maxsize | Maximum response size in bytes |
| download_latency | Actual time taken to complete the download |
| proxy | Proxy URL in the form http://host:port |
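As a brief illustration, several of these keys can be combined on a single request; the URL, status list, and proxy address below are made up:

# Inside a spider callback: a few reserved meta keys on one request
yield Request(
    url='https://www.example.com/reports/latest',
    callback=self.parse_report,
    meta={
        'handle_httpstatus_list': [404],    # let 404 responses reach the callback
        'download_timeout': 15,             # per-request timeout in seconds
        'dont_cache': True,                 # skip the HTTP cache for this request
        'proxy': 'http://127.0.0.1:8080',   # route through a local proxy
    },
)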