Scrapy Framework Setup and XPath Querying Techniques

Core Framework Architecture

Scrapy operates as an asynchronous web scraping framework built upon Twisted. The main components coordinating data flow are:

  • Engine: Controls the data flow between all other components and fires events when specific actions occur.
  • Scheduler: Accepts requests, maintains a priority queue, and deduplicates URLs.
  • Downloader: Retrieves page content via non-blocking I/O and returns responses.
  • Spiders: Define extraction logic for Items and new request generation.
  • Item Pipeline: Processes parsed Items for validation, filtering, or storage.
  • Downloader Middleware: Intercepts requests/responses between Engine and Downloader.
  • Spider Middleware: Hooks into Spider input/output for processing.

Execution Flow

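  1. Engine asks the Scheduler for the next Request to crawl (the first Requests are built from the Spider's start URLs).
  2. Engine forwards the Request to the Downloader, passing it through the Downloader Middleware.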
  3. Downloader fetches the resource and returns a Response.
  4. A Spider callback processes the Response.
  5. Extracted Items are sent to the Pipeline.
  6. Extracted URLs are dispatched back into the Scheduler.
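
The loop above can be mimicked with a toy model that uses only the standard library. This is not how Scrapy is implemented (the real engine is asynchronous and driven by Twisted); FAKE_SITE, ToyScheduler, toy_download, toy_parse, and run_engine are all made-up names for illustration.

```python
import heapq
from itertools import count

# Hypothetical in-memory "site": URL -> (title, outgoing links).
FAKE_SITE = {
    "/page/1": ("First", ["/page/2", "/page/1"]),
    "/page/2": ("Second", ["/page/3", "/page/1"]),
    "/page/3": ("Third", []),
}


class ToyScheduler:
    """Priority queue plus a seen-set, mimicking queueing and deduplication."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker keeps insertion order stable

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate request is filtered out
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def toy_download(url):
    """Stand-in for the Downloader: returns a 'response' dict."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def toy_parse(response):
    """Stand-in for a spider callback: yields one item, then follow-up URLs."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_urls):
    scheduler, pipeline_output = ToyScheduler(), []
    for url in start_urls:
        scheduler.enqueue(url)
    url = scheduler.next_request()                      # step 1
    while url is not None:
        response = toy_download(url)                    # steps 2-3
        for result in toy_parse(response):              # step 4
            if "item" in result:
                pipeline_output.append(result["item"])  # step 5
            else:
                scheduler.enqueue(result["request"])    # step 6
        url = scheduler.next_request()
    return pipeline_output


items = run_engine(["/page/1"])
print([item["title"] for item in items])  # ['First', 'Second', 'Third']
```

Each page is visited once even though /page/1 is linked repeatedly, which is the effect of the Scheduler's deduplication.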

Installation and Project Setup

Windows

pip3 install wheel
# Download the appropriate Twisted wheel from:
# http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
pip3 install pywin32
pip3 install scrapy

Linux

pip3 install scrapy

Scaffolding a Project

scrapy startproject my_crawler
cd my_crawler

Generate a spider targeting a domain:

scrapy genspider example_spider example.com

Directory Layout

my_crawler/
    scrapy.cfg
    my_crawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            example_spider.py
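
Most project-wide behaviour is configured in settings.py. The fragment below is a sketch of a few commonly adjusted settings; the setting names are real Scrapy settings, but every value, including the user agent string, is an illustrative assumption rather than a recommendation.

```python
# settings.py (fragment) -- values are illustrative assumptions

BOT_NAME = "my_crawler"

# Respect robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Seconds to wait between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Upper bound on simultaneous requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Hypothetical user agent string identifying the crawler.
USER_AGENT = "my_crawler (+https://example.com/contact)"

# Reduce log noise without silencing warnings and errors.
LOG_LEVEL = "WARNING"
```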

Execution Commands

scrapy crawl example_spider
scrapy crawl example_spider --nolog

Spider Implementation Example

Below is a practical spider illustrating XPath selection and recursive following of paginated links.

import scrapy
from scrapy.http import Request

class NewsSpider(scrapy.Spider):
    name = 'newsbot'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        entries = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        with open('crawled_links.log', 'a+', encoding='utf-8') as log_file:
            for entry in entries:
                title = entry.xpath('.//a/text()').get()
                link = entry.xpath('.//a/@href').get()
                if title and link:
                    log_file.write(f"{link} - {title.strip()}\n")

        # Discover next page URLs
        next_pages = response.xpath('//div[@id="dig_lcpage"]//a/@href').getall()
        for relative_url in next_pages:
            # urljoin resolves the href against the URL of the current response
            full_url = response.urljoin(relative_url)
            yield Request(url=full_url, callback=self.parse)

XPath Selectors Deep Dive

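Scrapy applies XPath queries through Selector objects built from a Response. Inside a spider callback, response.xpath() is a shortcut for this; outside a spider, a common pattern is to instantiate a Selector directly and chain calls such as .xpath(), .get(), and .getall().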

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html_body = '''
<html>
 <body>
  <ul>
   <li class="item-0"><a id="link1" href="page1.html">first item</a></li>
   <li class="item-1"><a href="page2.html">second item<span>span content</span></a></li>
  </ul>
  <div><a href="external.html">external link</a></div>
 </body>
</html>
'''

resp = HtmlResponse(url='http://demo.com', body=html_body, encoding='utf-8')
selector = Selector(response=resp)

# Select all anchor elements
all_anchors = selector.xpath('//a')

# Select every anchor that is the second <a> child of its parent
# (matches nothing in this sample, since no parent holds two anchors)
second_anchor = selector.xpath('//a[2]')

# Anchor with any id attribute
anchors_with_id = selector.xpath('//a[@id]')

# Specific id match
specific_id = selector.xpath('//a[@id="link1"]')

# Combined attribute conditions
combined_attrs = selector.xpath('//a[contains(@href, "page") and @id]')

# Starts-with function
starts_with_link = selector.xpath('//a[starts-with(@href, "page")]')

# Regex test on attribute (Scrapy selectors pre-register the EXSLT "re"
# namespace, so passing it explicitly as below is optional)
regex_id = selector.xpath(
    '//a[re:test(@id, "link\\d+")]',
    namespaces={'re': 'http://exslt.org/regular-expressions'}
)

# Extract text for matched elements
text_values = selector.xpath('//a[re:test(@id, "link\\d+")]/text()').getall()

# Extract href attributes
href_attributes = selector.xpath('//a[re:test(@id, "link\\d+")]/@href').getall()

# Full absolute path
absolute_path_href = selector.xpath('/html/body/ul/li/a/@href').getall()

# First match only: .get() is the current name for extract_first()
first_href = selector.xpath('//body/ul/li/a/@href').get()

# Iterate and extract nested elements
list_items = selector.xpath('//body/ul/li')
for li in list_items:
    span = li.xpath('./a/span')
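    # Equivalent: 'a/span'. Note that '*/a/span' is NOT equivalent: it looks
    # for an <a> one level deeper, under any child element of the <li>.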
    print(span.get())

Mastering these selection patterns enables precise data harvesting across complex page structures.

Tags: scrapy web scraping xpath python

Posted on Mon, 15 Jun 2026 17:21:19 +0000 by Buffas