Understanding Scrapy's Request Object and Data Flow Between Components

Data Flow Between Scrapy Components

Scrapy coordinates communication between its components through two fundamental objects: Request and Response. The spider generates Request objects, which the engine queues in the scheduler and then passes through the downloader middleware before the downloader executes them. Each Request eventually produces a Response that flows back through the middleware to the originating spider's callback for parsing.

Creating Request Objects in Spider

Request objects are instantiated within Spider methods, typically in start_requests() or other generators. The following demonstrates how a crawler initiates multiple requests:

from urllib.parse import quote

from scrapy import Request
from scrapy.spiders import CrawlSpider

class ProductSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']
    
    def start_requests(self):
        search_terms = self.settings.getlist('SEARCH_TERMS')
        max_pages = self.settings.getint('MAX_PAGES', 10)
        
        for term in search_terms:
            for page_num in range(1, max_pages + 1):
                # The URL is identical for every page; the page number travels
                # in meta and is used by the browser middleware shown below.
                search_url = f'https://www.example.com/search?q={quote(term)}'
                yield Request(
                    url=search_url,
                    callback=self.parse_results,
                    meta={'page': page_num, 'term': term},
                    dont_filter=True
                )

The yield statement generates Request instances that the Scrapy engine subsequently processes.
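
SEARCH_TERMS and MAX_PAGES are not built-in Scrapy settings; they are custom values this example assumes are declared in the project's settings.py and read back through self.settings:

# settings.py (hypothetical custom settings for ProductSpider)
SEARCH_TERMS = ['laptop', 'tablet']   # read with self.settings.getlist('SEARCH_TERMS')
MAX_PAGES = 5                         # read with self.settings.getint('MAX_PAGES', 10)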

Handling Requests in Downloader Middleware

Downloader middleware intercepts outgoing Request objects through the process_request() method. This example integrates Selenium to handle JavaScript-rendered pages:

from logging import getLogger

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class BrowserMiddleware:
    
    def __init__(self, driver_path, wait_timeout):
        self.logger = getLogger(__name__)
        self.driver_path = driver_path
        self.timeout = wait_timeout
        
    def process_request(self, request, spider):
        # Returning None hands the request back to the normal download path.
        if 'skip_browser' in request.meta:
            return None
            
        self.logger.debug(f'Loading {request.url} in browser')
        current_page = request.meta.get('page', 1)
        
        driver = None
        try:
            # Selenium 4 takes the geckodriver path through a Service object.
            driver = webdriver.Firefox(service=Service(self.driver_path))
            driver.set_window_size(1366, 768)
            driver.get(request.url)
            
            if current_page > 1:
                # Jump to the requested page through the site's page input.
                page_input = WebDriverWait(driver, self.timeout).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'input[name="page"]'))
                )
                page_input.clear()
                page_input.send_keys(str(current_page))
                driver.find_element(By.CSS_SELECTOR, 'button.submit').click()
            
            # Wait until the pagination widget confirms the page has loaded.
            WebDriverWait(driver, self.timeout).until(
                EC.text_to_be_present_in_element(
                    (By.CSS_SELECTOR, '.pagination .active'),
                    str(current_page)
                )
            )
            
            # Returning a Response here short-circuits the downloader for this request.
            return HtmlResponse(
                url=request.url,
                body=driver.page_source.encode('utf-8'),
                request=request,
                encoding='utf-8',
                status=200
            )
        except Exception as e:
            self.logger.error(f'Browser error: {e}')
            return HtmlResponse(url=request.url, status=500, request=request)
        finally:
            # Only quit if the driver was actually created.
            if driver is not None:
                driver.quit()
    
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            driver_path=crawler.settings.get('GECKODRIVER_PATH'),
            wait_timeout=crawler.settings.getint('SELENIUM_TIMEOUT', 30)
        )

When process_request() returns a Response object, Scrapy skips the remaining process_request() methods and the downloader itself; the response still passes through the installed middlewares' process_response() methods before reaching the spider callback.
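
The middleware only runs if it is registered in the project settings. A minimal sketch, assuming the class lives in middlewares.py inside a project package named myproject (both names are placeholders):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BrowserMiddleware': 543,
}
GECKODRIVER_PATH = '/usr/local/bin/geckodriver'   # example path
SELENIUM_TIMEOUT = 30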

Request Class Parameters

scrapy.http.Request(
    url,
    callback=None,
    method='GET',
    headers=None,
    body=None,
    cookies=None,
    meta=None,
    encoding='utf-8',
    priority=0,
    dont_filter=False,
    errback=None
)

url — The target URL for the HTTP request.

callback — Callable invoked with the downloaded Response. Defaults to the spider's parse() method when omitted.

method — HTTP verb, defaults to 'GET'. Supported methods include 'GET', 'POST', 'PUT', 'HEAD', 'OPTIONS'.

headers — Dictionary of HTTP headers sent with the request.

body — Request payload as bytes or string. Used for POST and PUT requests.
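
method, headers, and body together cover form or JSON submissions. A brief sketch of a JSON POST against a placeholder endpoint:

import json
from scrapy import Request

json_request = Request(
    url='https://api.example.com/search',      # placeholder endpoint
    method='POST',
    headers={'Content-Type': 'application/json'},
    body=json.dumps({'query': 'laptops', 'page': 1}),
)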

cookies — Supports two formats:

# Simple dictionary format
simple_request = Request(
    url='http://api.example.com/endpoint',
    cookies={'session_id': 'abc123', 'user_pref': 'dark'}
)

# List format for cookie attributes
detailed_request = Request(
    url='http://api.example.com/endpoint',
    cookies=[
        {
            'name': 'tracking',
            'value': 'xyz789',
            'domain': '.example.com',
            'path': '/api'
        }
    ]
)

The list format allows controlling cookie scope through domain and path attributes.

To prevent automatic cookie merging from previous responses:

Request(
    url='http://example.com',
    cookies={'token': 'secret'},
    meta={'dont_merge_cookies': True}
)

meta — Dictionary for passing arbitrary metadata between components. When provided, the dict is shallow-copied.

encoding — Character encoding for URL and headers, defaults to 'utf-8'.

priority — Integer affecting scheduling order. Higher values execute first.

dont_filter — When True, bypasses the scheduler's duplicate request filter. Use sparingly.

errback — Callable invoked when the request fails. Receives a Twisted Failure instance.

Request Attributes and Methods

url — Read-only string containing request URL.

method — HTTP method string such as 'GET' or 'POST'.

headers — Case-insensitive dict-like object for HTTP headers.

body — Raw request body as bytes.

meta — Mutable dict storing metadata. Modified by middleware and extensions.

copy() — Returns a clone of the current Request.

replace(**kwargs) — Returns a new Request with specified fields overridden:

original = Request(url='http://example.com', meta={'id': 1})
modified = original.replace(method='POST', meta={'id': 2, 'extra': 'data'})

Passing Data Through Callbacks

The Request.meta dictionary enables data sharing between callback functions without relying on class attributes:

def extract_category(self, response):
    category = response.css('.category::text').get()
    item = ProductItem()
    item['category'] = category
    
    for product_url in response.css('.product-link::attr(href)').getall():
        new_request = Request(
            url=response.urljoin(product_url),
            callback=self.parse_product
        )
        # Attach the partially filled item so the next callback can finish it.
        new_request.meta['item'] = item
        new_request.meta['source_page'] = response.url
        yield new_request

def parse_product(self, response):
    stored_item = response.meta['item']
    source = response.meta['source_page']
    
    item = ProductItem()
    item['name'] = response.css('.product-title::text').get()
    item['price'] = response.css('.price::text').re_first(r'[\d.]+')
    item['category'] = stored_item['category']
    item['discovered_via'] = source
    yield item
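
Since Scrapy 1.7, the same hand-off can also be done with the Request's cb_kwargs argument, which delivers values as named arguments of the callback instead of going through meta. A brief sketch of the category example rewritten that way:

def extract_category(self, response):
    category = response.css('.category::text').get()
    for product_url in response.css('.product-link::attr(href)').getall():
        yield Request(
            url=response.urljoin(product_url),
            callback=self.parse_product,
            cb_kwargs={'category': category}    # arrives as a keyword argument
        )

def parse_product(self, response, category):
    item = ProductItem()
    item['name'] = response.css('.product-title::text').get()
    item['category'] = category
    yield item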

Error Handling with errback

The errback parameter registers a callback for handling request failures including timeouts and DNS errors:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import (
    DNSLookupError,
    TimeoutError,
    TCPTimedOutError,
    ConnectionRefusedError
)

class RobustSpider(scrapy.Spider):
    name = 'reliable_crawler'
    
    def start_requests(self):
        urls = [
            'https://httpbin.org/',                 # normal 200 response
            'https://httpbin.org/status/404',       # HttpError handled in errback
            'https://httpbin.org/status/500',       # HttpError handled in errback
            'https://httpbin.org:12345/',           # non-responding port: timeout or connection error
            'https://nonexistent-domain.invalid/',  # DNS lookup failure
        ]
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.handle_success,
                errback=self.handle_failure,
                dont_filter=True
            )
    
    def handle_success(self, response):
        self.logger.info(f'Response {response.status} from {response.url}')
        
    def handle_failure(self, failure):
        self.logger.error(f'Request failed: {failure}')
        
        if failure.check(HttpError):
            http_error = failure.value
            response = http_error.response
            self.logger.error(f'HTTP {response.status} at {response.url}')
            
        elif failure.check(DNSLookupError):
            self.logger.error(f'DNS resolution failed for {failure.request.url}')
            
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error(f'Timeout reaching {failure.request.url}')
            
        elif failure.check(ConnectionRefusedError):
            self.logger.error(f'Connection refused at {failure.request.url}')

Reserved Request.meta Keys

Scrapy reserves several meta keys for internal use:

dont_redirect — Prevents the request from following redirects
dont_retry — Disables automatic retry on failure
handle_httpstatus_list — Specifies acceptable HTTP status codes
handle_httpstatus_all — Accepts all HTTP status codes
dont_merge_cookies — Prevents merging cookies from prior responses
cookiejar — Enables per-request cookie jar isolation
dont_cache — Instructs middleware to skip response caching
redirect_urls — Stores the redirect chain for the request
bindaddress — Source address for outgoing connections
dont_obey_robotstxt — Ignores robots.txt restrictions
download_timeout — Per-request download timeout in seconds
download_maxsize — Maximum response size in bytes
download_latency — Actual time taken to complete the download
proxy — Proxy URL in the format http://host:port
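
Several of these keys can be combined on one request; the proxy address, status codes, and URL below are placeholders for illustration:

tuned_request = Request(
    url='https://www.example.com/flaky-endpoint',
    callback=self.parse_results,                   # assumes a spider context
    meta={
        'proxy': 'http://proxy.example.com:8080',  # route this request through a proxy
        'download_timeout': 15,                    # override the global timeout
        'handle_httpstatus_list': [404, 429],      # let the callback see these statuses
        'dont_redirect': True,                     # inspect redirects instead of following them
    }
)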

Tags: scrapy python web-scraping request-object middleware

Posted on Fri, 08 May 2026 21:33:48 +0000 by bow-viper1