Web Scraping with Feapder: Architecture, Configuration, and Browser Rendering

Framework Overview

feapder is a Python scraping framework built around four spider templates: AirSpider, Spider, TaskSpider, and BatchSpider. It supports resumable crawling, alert notifications, browser rendering, and large-scale data deduplication out of the box. Deployment and scheduling are handled by feaplat, its companion management platform.

Environment Setup

Official Documentation: https://feapder.com

conda create -n feapder_env python=3.9
conda activate feapder_env
pip install "feapder[all]"

Core Architecture

Component        Functionality
spider           Central scheduling engine
parser_control   Manages and dispatches parsing tasks
collector        Batch-retrieves tasks from the queue into memory to reduce database access frequency
parser           Data extraction logic handler
start_request    Entry point for initial task generation
item_buffer      Buffer queue for batching database insert operations
request_buffer   Buffer queue for batching task storage operations
request          HTTP client wrapping requests for network data retrieval
response         Response object supporting xpath, css, and re parsing with automatic encoding handling

Execution Workflow

  1. The spider engine invokes start_requests (the start_request entry point) to generate initial tasks.
  2. Generated tasks are pushed into the request_buffer.
  3. The spider flushes tasks from request_buffer to the task queue database in batches.
  4. The spider triggers the collector to pull tasks from the database queue into memory.
  5. parser_control retrieves tasks from the collector's memory queue.
  6. parser_control delegates the task to the request module for fetching.
  7. The request module downloads the web content.
  8. Downloaded data is encapsulated into a response object.
  9. The response is returned to parser_control (handled across multiple threads).
  10. parser_control invokes the corresponding parser to extract data from the response.
  11. Extracted item objects and newly discovered request objects are routed to item_buffer and request_buffer respectively.
  12. The spider flushes both buffers to persist data and new tasks into the database.
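
The workflow above can be approximated with a small in-memory model. This is only a sketch of the data flow, not feapder's internals: BatchBuffer, run_pipeline, fake_fetch, and fake_parse are hypothetical names, a deque stands in for the task queue database, and a list stands in for the data table.

```python
from collections import deque


class BatchBuffer:
    """Collects entries and writes them to a sink in batches."""

    def __init__(self, sink, batch_size):
        self.sink = sink
        self.batch_size = batch_size
        self.pending = []

    def put(self, entry):
        self.pending.append(entry)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.extend(self.pending)
            self.pending.clear()


def run_pipeline(start_urls, fetch, parse, batch_size=3):
    task_queue = deque()   # stands in for the task queue database
    stored_items = []      # stands in for the data table
    request_buffer = BatchBuffer(task_queue, batch_size)
    item_buffer = BatchBuffer(stored_items, batch_size)

    for url in start_urls:            # steps 1-2: initial tasks enter the buffer
        request_buffer.put(url)
    request_buffer.flush()            # step 3: buffer is flushed to the queue

    while task_queue:
        # Step 4: the collector pulls a batch of tasks into memory.
        batch = [task_queue.popleft() for _ in range(min(batch_size, len(task_queue)))]
        for url in batch:             # steps 5-10: fetch, wrap, and parse
            response = fetch(url)
            for result in parse(url, response):
                # Step 11: items and new requests go to separate buffers.
                if isinstance(result, dict):
                    item_buffer.put(result)
                else:
                    request_buffer.put(result)
        request_buffer.flush()        # step 12: persist new tasks and data
        item_buffer.flush()
    return stored_items


def fake_fetch(url):
    return f"<html>{url}</html>"


def fake_parse(url, response):
    if url.startswith("list"):
        yield "detail-" + url[-1]                      # a newly discovered request
    else:
        yield {"source": url, "size": len(response)}   # an extracted item


items = run_pipeline(["list-1", "list-2"], fake_fetch, fake_parse)
print(items)
```

The real framework runs parser_control across several threads and persists the queue so a crawl can resume; the single-threaded loop here only shows the order in which data moves between components.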

Component Deep Dive

Initializing a Basic Scraper

Generate a new spider file using the CLI:

feapder create -s movie_scraper

Upon execution, select the AirSpider template. This lightweight template is ideal for rapid development. The generated class inherits from feapder.AirSpider and contains start_requests and parse methods.

import feapder

class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(f"https://movie.example.com/top?offset={offset}")

    def parse(self, request, response):
        cards = response.xpath('//div[@class="movie-card"]')
        for card in cards:
            data = dict()
            data['name'] = card.xpath('.//h3/a/text()').extract_first()
            data['url'] = card.xpath('.//h3/a/@href').extract_first()
            data['rating'] = card.xpath('.//span[@class="score"]/text()').extract_first()
            yield feapder.Request(data['url'], callback=self.parse_movie_info, item=data)

    def parse_movie_info(self, request, response):
        # Prefer the full synopsis; fall back to the short one when it is missing.
        synopsis = response.xpath('//div[@class="full-synopsis"]//text()').extract_first()
        if not synopsis:
            synopsis = response.xpath('//div[@class="short-synopsis"]//text()').extract_first()
        # extract_first() returns None when nothing matches, so guard before strip().
        request.item['synopsis'] = (synopsis or '').strip()
        print(request.item)

if __name__ == "__main__":
    MovieScraper(thread_count=5).start()
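
The pagination in start_requests is easy to check in isolation before running the spider. The sketch below rebuilds the same listing URLs with the standard library; build_page_urls is a hypothetical helper, and the host is the placeholder used in the example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.example.com/top"  # placeholder host from the example


def build_page_urls(total=250, page_size=25):
    """Return one listing URL per page, mirroring range(0, total, page_size)."""
    return [
        f"{BASE_URL}?{urlencode({'offset': offset})}"
        for offset in range(0, total, page_size)
    ]


page_urls = build_page_urls()
print(len(page_urls))   # 10 pages
print(page_urls[0])     # https://movie.example.com/top?offset=0
print(page_urls[-1])    # https://movie.example.com/top?offset=225
```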

Persisting Data to MySQL

Create a script to initialize the database schema:

from feapder.db.mysqldb import MysqlDB

db = MysqlDB(ip='localhost', port=3306, user_name='admin', user_pass='password', db='scraper_db')

create_table_sql = """
    CREATE TABLE IF NOT EXISTS movie_data (
        id INT PRIMARY KEY AUTO_INCREMENT,
        name VARCHAR(255) NOT NULL,
        rating VARCHAR(255) NOT NULL,
        url VARCHAR(255) NOT NULL,
        synopsis TEXT
    );
"""
db.execute(create_table_sql)

insert_query = """
    INSERT IGNORE INTO movie_data (id, name, rating, url, synopsis) VALUES (
        0, 'Test Film', '9.5', 'https://example.com', 'A test description'
    );
"""
db.add(insert_query)
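
One caveat with the test insert: INSERT IGNORE only skips a row that collides with a primary or unique key. Because id is auto-generated here, running the script twice will normally store the test row twice. The sketch below shows how a unique key on url makes the insert idempotent. It swaps in the standard library's sqlite3 (with INSERT OR IGNORE) so it runs without a MySQL server; the UNIQUE constraint is an assumption added for illustration and is not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS movie_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        rating TEXT NOT NULL,
        url TEXT NOT NULL UNIQUE,   -- assumed unique key, added for this sketch
        synopsis TEXT
    )
""")

row = ("Test Film", "9.5", "https://example.com", "A test description")
for _ in range(2):
    # The second insert collides on url and is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO movie_data (name, rating, url, synopsis) "
        "VALUES (?, ?, ?, ?)",
        row,
    )
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM movie_data").fetchone()[0]
print(row_count)  # 1
```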

Generate the configuration file to connect the spider to MySQL:

feapder create --setting

Update setting.py with database credentials:

MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "scraper_db"
MYSQL_USER_NAME = "admin"
MYSQL_USER_PASS = "password"
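
Hard-coded credentials are fine for a local test, but they tend to leak into version control. A minimal alternative, assuming the same setting names, is to read them from environment variables and keep the values above as local defaults:

```python
import os

# Each value falls back to the local default when the variable is unset.
MYSQL_IP = os.getenv("MYSQL_IP", "localhost")
MYSQL_PORT = int(os.getenv("MYSQL_PORT", "3306"))
MYSQL_DB = os.getenv("MYSQL_DB", "scraper_db")
MYSQL_USER_NAME = os.getenv("MYSQL_USER_NAME", "admin")
MYSQL_USER_PASS = os.getenv("MYSQL_USER_PASS", "password")
```

The port is converted explicitly because environment variables are always strings.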

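Generate an Item class mapped to the database table. The name passed to -i must match the table name, because the Item's fields are generated from the table's columns; the resulting file is movie_data_item.py: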

feapder create -i movie_data
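
The naming convention can be read off the imports used by this article's spiders: table movie_data becomes module movie_data_item and class MovieDataItem. The helper below is a hypothetical illustration of that mapping, not a feapder function.

```python
def item_names(table_name):
    """Map a snake_case table name to the module and class names used in this article."""
    class_name = "".join(part.capitalize() for part in table_name.split("_")) + "Item"
    module_name = f"{table_name}_item"
    return module_name, class_name


print(item_names("movie_data"))    # ('movie_data_item', 'MovieDataItem')
print(item_names("product_info"))  # ('product_info_item', 'ProductInfoItem')
```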

Implement the generated Item within the spider:

import feapder
from movie_data_item import MovieDataItem

class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(f"https://movie.example.com/top?offset={offset}")

    def parse(self, request, response):
        cards = response.xpath('//div[@class="movie-card"]')
        for card in cards:
            item = MovieDataItem()
            item['name'] = card.xpath('.//h3/a/text()').extract_first()
            item['url'] = card.xpath('.//h3/a/@href').extract_first()
            item['rating'] = card.xpath('.//span[@class="score"]/text()').extract_first()
            yield feapder.Request(item['url'], callback=self.parse_movie_info, item=item)

    def parse_movie_info(self, request, response):
        # Prefer the full synopsis; fall back to the short one when it is missing.
        synopsis = response.xpath('//div[@class="full-synopsis"]//text()').extract_first()
        if not synopsis:
            synopsis = response.xpath('//div[@class="short-synopsis"]//text()').extract_first()
        # extract_first() returns None when nothing matches, so guard before strip().
        request.item['synopsis'] = (synopsis or '').strip()
        yield request.item

if __name__ == "__main__":
    MovieScraper().start()

Request Interception Middleware

Middleware allows modification of requests prior to dispatch, such as injecting headers or proxies. Overriding download_midware applies globally to all requests.

import feapder

class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(f"https://movie.example.com/top?offset={offset}")

    def download_midware(self, request):
        request.headers = {
            'User-Agent': 'Mozilla/5.0 (Custom Agent)'
        }
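        # The target URLs are HTTPS, so the proxy needs an "https" entry as well.
        request.proxies = {
            "http": "http://proxy.local:8080",
            "https": "http://proxy.local:8080",
        }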
        return request

For request-specific middleware, pass the function reference to the download_midware parameter in the Request constructor:

import feapder

class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(
                f"https://movie.example.com/top?offset={offset}",
                download_midware=self.custom_proxy_handler
            )

    def custom_proxy_handler(self, request):
        request.headers = {'X-Custom-Header': 'value'}
        return request
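
Both handlers above assign a fresh dictionary to request.headers, which discards any headers the request already carried. A gentler variant merges defaults in instead. The sketch below shows the idea on a stand-in object (SimpleNamespace) rather than a real feapder.Request; apply_default_headers and the USER_AGENTS list are hypothetical.

```python
import random
from types import SimpleNamespace

# Illustrative values only; use real browser strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Custom Agent A)",
    "Mozilla/5.0 (Custom Agent B)",
]


def apply_default_headers(request, rng=random):
    """Add default headers without discarding ones already set on the request."""
    headers = dict(getattr(request, "headers", None) or {})
    # setdefault leaves any header the caller provided untouched.
    headers.setdefault("User-Agent", rng.choice(USER_AGENTS))
    headers.setdefault("Accept-Language", "en-US,en;q=0.9")
    request.headers = headers
    return request


fake_request = SimpleNamespace(headers={"X-Custom-Header": "value"})
apply_default_headers(fake_request)
print(fake_request.headers)
```

The same function body could serve as a download_midware method, since it takes a request and returns it, but that wiring is left as an assumption to verify against the installed feapder version.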

Response Validation and Retry Logic

The validate method checks response integrity. Raising an exception triggers an automatic retry, returning False discards the request, and returning True or None proceeds to the parsing method.

import feapder

class MovieScraper(feapder.AirSpider):
    # ... other methods ...

    def validate(self, request, response):
        if response.status_code != 200:
            raise Exception(f"Invalid status: {response.status_code}")
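
The three outcomes can be combined in one validator. The sketch below is written as a plain function and exercised with stand-in response objects so it runs without feapder; the 404 rule and the "captcha" marker are hypothetical policies chosen for illustration, and outcome is a helper that only exists to show which branch was taken.

```python
from types import SimpleNamespace


def validate(request, response):
    if response.status_code == 404:
        return False  # the page is gone: discard instead of retrying
    if response.status_code != 200:
        raise Exception(f"Invalid status: {response.status_code}")  # retry
    if "captcha" in response.text.lower():
        raise Exception("Blocked by a challenge page")  # retry
    return True  # proceed to the parsing method


def outcome(response):
    """Translate validate()'s behaviour into the action described above."""
    try:
        result = validate(None, response)
    except Exception:
        return "retry"
    return "discard" if result is False else "parse"


print(outcome(SimpleNamespace(status_code=200, text="<html>ok</html>")))   # parse
print(outcome(SimpleNamespace(status_code=404, text="")))                   # discard
print(outcome(SimpleNamespace(status_code=503, text="")))                   # retry
print(outcome(SimpleNamespace(status_code=200, text="Solve the CAPTCHA")))  # retry
```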

Dynamic Content Rendering via Selenium

For JavaScript-heavy pages, set render=True in the request. The framework maintains a browser pool and reuses instances automatically.

def start_requests(self):
    yield feapder.Request("https://news.example.com/", render=True)

Enable WebDriver configuration in setting.py:

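WEBDRIVER = dict(
    pool_size=1,  # number of browser instances in the pool
    load_images=True,  # whether to load images
    user_agent=None,  # a string, or a zero-argument function returning one
    proxy=None,  # "ip:port", or a zero-argument function returning one
    headless=False,  # run the browser without a visible window
    driver_type="CHROME",  # browser type, e.g. CHROME or FIREFOX
    timeout=30,  # request timeout in seconds
    window_size=(1024, 800),  # browser window size (width, height)
    executable_path=None,  # driver path; None uses the default location
    render_time=0,  # seconds to wait after opening a page before reading its source
    custom_argument=["--ignore-certificate-errors"],  # extra browser launch arguments
    xhr_url_regexes=None,  # list of regexes selecting XHR URLs to intercept
    auto_install_driver=True,  # download a matching driver automatically
    download_path=None,  # directory for files downloaded by the browser
    use_stealth_js=False,  # inject stealth.min.js to hide automation fingerprints
)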
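
For unattended runs on a server, a few of these values usually change: no visible window, no images, and a short wait for scripts to finish. The sketch below derives such a variant from a base dictionary. The keys come from the configuration above; the override values and the webdriver_settings helper are illustrative assumptions, and only a subset of keys is repeated to keep the example short.

```python
# Subset of the keys shown above, with the same defaults.
BASE_WEBDRIVER = dict(
    pool_size=1,
    load_images=True,
    headless=False,
    render_time=0,
    timeout=30,
)


def webdriver_settings(**overrides):
    """Return a copy of the base settings with overrides applied, rejecting unknown keys."""
    unknown = set(overrides) - set(BASE_WEBDRIVER)
    if unknown:
        raise KeyError(f"Unknown WEBDRIVER keys: {sorted(unknown)}")
    return {**BASE_WEBDRIVER, **overrides}


# Illustrative server profile: headless, no images, two browsers, 2 s render wait.
WEBDRIVER = webdriver_settings(headless=True, load_images=False, pool_size=2, render_time=2)
print(WEBDRIVER)
```

Rejecting unknown keys catches typos such as headles=True, which would otherwise be ignored silently.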

Interacting with the rendered browser instance:

import feapder
from selenium.webdriver.common.by import By
from feapder.utils.webdriver import WebDriver

class SearchScraper(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.searchengine.com", render=True)

    def parse(self, request, response):
        driver: WebDriver = response.browser
        search_input = driver.find_element(By.ID, 'search-box')
        search_input.send_keys('feapder')
        driver.find_element(By.ID, 'submit-btn').click()

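To intercept XHR network requests, list the URL patterns to capture in xhr_url_regexes in the WEBDRIVER settings. The intercepted data is then available through xhr_text, xhr_json, or xhr_response, each of which takes one of the configured patterns.

driver: WebDriver = response.browser
# The pattern must be one of the regexes listed in xhr_url_regexes.
api_data = driver.xhr_json("regex_pattern")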
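
Patterns are worth testing against sample URLs before a crawl, because a pattern that matches nothing leaves no data to read back. The sketch below assumes the patterns are applied to the full request URL with regex search semantics; that matching rule, the example patterns, and the matching_pattern helper are assumptions to verify against the installed feapder version.

```python
import re

# Hypothetical patterns for a search API and a suggestion API.
xhr_url_regexes = [r"/api/search\?", r"/api/suggest"]


def matching_pattern(url, patterns=xhr_url_regexes):
    """Return the first pattern that matches the URL, or None if none do."""
    for pattern in patterns:
        if re.search(pattern, url):
            return pattern
    return None


print(matching_pattern("https://www.searchengine.com/api/search?q=feapder"))
print(matching_pattern("https://www.searchengine.com/static/app.css"))
```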

Structuring a Complete Project

Generate a structured project directory:

feapder create -p retail_scraper

This creates folders for items and spiders, alongside main.py and setting.py.

After configuring the database credentials in setting.py, define the table schema:

from feapder.db.mysqldb import MysqlDB

db = MysqlDB(ip='localhost', user_name='admin', user_pass='password', db='scraper_db')
create_table_sql = """
    CREATE TABLE IF NOT EXISTS product_info (
        id INT PRIMARY KEY AUTO_INCREMENT,
        product_name VARCHAR(255) DEFAULT NULL,
        cost VARCHAR(255) DEFAULT NULL
    );
"""
db.execute(create_table_sql)

Generate the corresponding Item inside the items directory:

cd items
feapder create -i product_info

Create the spider script inside the spiders directory:

cd ../spiders
feapder create -s product_spider

Implement the spider with scrolling and rendering logic:

import time
import feapder
from random import uniform
from items.product_info_item import ProductInfoItem
from selenium.webdriver.common.by import By
from feapder.utils.webdriver import WebDriver

class ProductSpider(feapder.AirSpider):
    def start_requests(self):
        base_url = 'https://store.example.com/search?q=laptops&page={}'
        for idx in range(1, 6):
            yield feapder.Request(url=base_url.format(idx), render=True)

    def parse(self, request, response):
        driver: WebDriver = response.browser
        time.sleep(2)
        self.scroll_to_bottom(driver)

        products = driver.find_elements(By.XPATH, '//div[@class="product-card"]')
        for product in products:
            cost = product.find_element(By.XPATH, './/span[@class="price"]').text
            product_name = product.find_element(By.XPATH, './/h4').text

            item = ProductInfoItem()
            item.product_name = product_name
            item.cost = cost
            yield item

    def scroll_to_bottom(self, driver):
        for step in range(1, 10):
            driver.execute_script(f'window.scrollTo(0, {step * 500});')
            time.sleep(uniform(0.5, 1.5))

if __name__ == "__main__":
    ProductSpider(thread_count=1).start()
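
The fixed nine-step scroll works when the page length is known in advance. For pages that keep loading products as you scroll, an alternative is to scroll until the document height stops growing. The sketch below relies only on execute_script, which the spider above already uses; scroll_until_stable is a hypothetical helper, and FakeDriver is a test double so the logic can be checked without a browser.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20, sleep=time.sleep):
    """Scroll to the bottom repeatedly until the page height stops changing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Test double: the page grows by 1000px per scroll until it reaches 3000px."""

    def __init__(self):
        self.height = 1000

    def execute_script(self, script):
        if script.startswith("return"):
            return self.height
        self.height = min(self.height + 1000, 3000)


rounds_needed = scroll_until_stable(FakeDriver(), pause=0, sleep=lambda seconds: None)
print(rounds_needed)  # 3
```

The max_rounds cap keeps an endlessly growing feed from blocking the spider forever.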


Posted on Tue, 02 Jun 2026 16:22:41 +0000 by Ekano