Framework Overview
feapder is a robust Python scraping framework that simplifies data extraction through four built-in spider templates: AirSpider, Spider, TaskSpider, and BatchSpider. It natively supports resumable crawling, alert notifications, browser rendering, and large-scale data deduplication. Deployment and scheduling are managed via the feaplat management system.
Environment Setup
Official Documentation: https://feapder.com
conda create -n feapder_env python=3.9
conda activate feapder_env
pip install "feapder[all]"
Core Architecture
| Component | Functionality |
|---|---|
| spider | Central scheduling engine |
| parser_control | Manages and dispatches parsing tasks |
| collector | Batch retrieves tasks from the queue into memory to reduce database access frequency |
| parser | Data extraction logic handler |
| start_request | Entry point for initial task generation |
| item_buffer | Buffer queue for batching database insert operations |
| request_buffer | Buffer queue for batching task storage operations |
| request | HTTP client wrapping requests for network data retrieval |
| response | Response object supporting xpath, css, and re parsing with automatic encoding handling |
Execution Workflow
- The spider engine invokes start_request to generate initial tasks.
- Generated tasks are pushed into the request_buffer.
- The spider flushes tasks from request_buffer to the task queue database in batches.
- The spider triggers the collector to pull tasks from the database queue into memory.
- parser_control retrieves tasks from the collector's memory queue.
- parser_control delegates the task to the request module for fetching.
- The request module downloads the web content.
- Downloaded data is encapsulated into a response object.
- The response is returned to parser_control (handled across multiple threads).
- parser_control invokes the corresponding parser to extract data from the response.
- Extracted item objects and newly discovered request objects are routed to item_buffer and request_buffer respectively.
- The spider flushes both buffers to persist data and new tasks into the database.
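The loop described above can be pictured with a toy model in plain Python. This is only a sketch under simplifying assumptions, not feapder's implementation: it is single-threaded, the "database" is an in-memory deque and list, and the names BatchBuffer, run_pipeline, and demo_parse are hypothetical.

```python
from collections import deque


class BatchBuffer:
    """Holds entries in memory and writes them out in fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def put(self, entry):
        self.pending.append(entry)

    def flush(self, sink):
        """Move all pending entries into `sink`; return the number of batch writes."""
        writes = 0
        while self.pending:
            batch = self.pending[:self.batch_size]
            del self.pending[:self.batch_size]
            sink.extend(batch)  # one simulated database write per batch
            writes += 1
        return writes


def run_pipeline(start_urls, parse, batch_size=3):
    task_queue = deque()   # stand-in for the task queue database
    stored_items = []      # stand-in for the data table
    request_buffer = BatchBuffer(batch_size)
    item_buffer = BatchBuffer(batch_size)

    # start_request: initial tasks go into the request buffer first.
    for url in start_urls:
        request_buffer.put(url)

    while True:
        # The spider flushes buffered tasks into the task queue in batches.
        request_buffer.flush(task_queue)
        if not task_queue:
            break
        # The collector pulls a batch of tasks into memory.
        take = min(batch_size, len(task_queue))
        in_memory = [task_queue.popleft() for _ in range(take)]
        # parser_control: "download" each task and hand the response to the parser.
        for url in in_memory:
            response = f"<html for {url}>"  # fake download, no network access
            for result in parse(url, response):
                if isinstance(result, dict):
                    item_buffer.put(result)      # extracted item
                else:
                    request_buffer.put(result)   # newly discovered task
        item_buffer.flush(stored_items)
    return stored_items


def demo_parse(url, response):
    """List pages yield a follow-up request; detail pages yield an item."""
    if url.startswith("list/"):
        yield "detail/" + url.split("/", 1)[1]
    else:
        yield {"url": url, "length": len(response)}


items = run_pipeline(["list/1", "list/2"], demo_parse)
print(items)
```

The point of the two buffers is visible in flush: several entries reach the sink in one write instead of one write per entry.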
Component Deep Dive
Initializing a Basic Scraper
Generate a new spider file using the CLI:
feapder create -s movie_scraper
Upon execution, select the AirSpider template. This lightweight template is ideal for rapid development. The generated class inherits from feapder.AirSpider and contains start_requests and parse methods.
import feapder


class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(f"https://movie.example.com/top?offset={offset}")

    def parse(self, request, response):
        cards = response.xpath('//div[@class="movie-card"]')
        for card in cards:
            data = dict()
            data['name'] = card.xpath('.//h3/a/text()').extract_first()
            data['url'] = card.xpath('.//h3/a/@href').extract_first()
            data['rating'] = card.xpath('.//span[@class="score"]/text()').extract_first()
            yield feapder.Request(data['url'], callback=self.parse_movie_info, item=data)

    def parse_movie_info(self, request, response):
        synopsis_block = response.xpath('//div[@class="full-synopsis"]//text()')
        if synopsis_block:
            request.item['synopsis'] = synopsis_block.extract_first().strip()
        else:
            request.item['synopsis'] = response.xpath('//div[@class="short-synopsis"]//text()').extract_first().strip()
        print(request.item)


if __name__ == "__main__":
    MovieScraper(thread_count=5).start()
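One fragile spot in the spider above: extract_first() returns None when an XPath matches nothing, so chaining .strip() onto it raises AttributeError for pages that have neither synopsis block. A small helper can absorb that case. The name clean_text and its default parameter are hypothetical, introduced here only for illustration.

```python
def clean_text(value, default=""):
    """Return `value` with whitespace collapsed, or `default` if it is None or blank."""
    if value is None:
        return default
    text = " ".join(value.split())  # strips the ends and collapses inner whitespace
    return text or default


print(clean_text("  A  test\n film "))
print(clean_text(None, default="n/a"))
```

In parse_movie_info, the assignment could then read request.item['synopsis'] = clean_text(synopsis_block.extract_first()), which stores an empty string instead of failing when the text is missing.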
Persisting Data to MySQL
Create a script to initialize the database schema:
from feapder.db.mysqldb import MysqlDB

db = MysqlDB(ip='localhost', port=3306, user_name='admin', user_pass='password', db='scraper_db')

create_table_sql = """
CREATE TABLE IF NOT EXISTS movie_data (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    rating VARCHAR(255) NOT NULL,
    url VARCHAR(255) NOT NULL,
    synopsis TEXT
);
"""
db.execute(create_table_sql)

insert_query = """
INSERT IGNORE INTO movie_data (id, name, rating, url, synopsis) VALUES (
    0, 'Test Film', '9.5', 'https://example.com', 'A test description'
);
"""
db.add(insert_query)
Generate the configuration file to connect the spider to MySQL:
feapder create --setting
Update setting.py with database credentials:
MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "scraper_db"
MYSQL_USER_NAME = "admin"
MYSQL_USER_PASS = "password"
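Because setting.py is an ordinary Python module, the credentials do not have to be hard-coded. The sketch below reads them from environment variables and falls back to the values shown above. The SCRAPER_MYSQL_* variable names and the helper mysql_settings_from_env are assumptions made for this example, not something feapder defines.

```python
import os


def mysql_settings_from_env(env):
    """Build the MYSQL_* values from a mapping such as os.environ."""
    return {
        "MYSQL_IP": env.get("SCRAPER_MYSQL_IP", "localhost"),
        "MYSQL_PORT": int(env.get("SCRAPER_MYSQL_PORT", "3306")),  # keep the port an int
        "MYSQL_DB": env.get("SCRAPER_MYSQL_DB", "scraper_db"),
        "MYSQL_USER_NAME": env.get("SCRAPER_MYSQL_USER", "admin"),
        "MYSQL_USER_PASS": env.get("SCRAPER_MYSQL_PASS", "password"),
    }


# Module-level names, matching the ones edited in setting.py.
_values = mysql_settings_from_env(os.environ)
MYSQL_IP = _values["MYSQL_IP"]
MYSQL_PORT = _values["MYSQL_PORT"]
MYSQL_DB = _values["MYSQL_DB"]
MYSQL_USER_NAME = _values["MYSQL_USER_NAME"]
MYSQL_USER_PASS = _values["MYSQL_USER_PASS"]
```

This keeps the password out of version control while leaving local development unchanged when no variables are set.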
Generate an Item class mapped to the database table. The argument must be the table name; the command reads the table's columns and writes the class to movie_data_item.py:
feapder create -i movie_data
Implement the generated Item within the spider:
import feapder
from movie_data_item import MovieDataItem


class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(f"https://movie.example.com/top?offset={offset}")

    def parse(self, request, response):
        cards = response.xpath('//div[@class="movie-card"]')
        for card in cards:
            item = MovieDataItem()
            item['name'] = card.xpath('.//h3/a/text()').extract_first()
            item['url'] = card.xpath('.//h3/a/@href').extract_first()
            item['rating'] = card.xpath('.//span[@class="score"]/text()').extract_first()
            yield feapder.Request(item['url'], callback=self.parse_movie_info, item=item)

    def parse_movie_info(self, request, response):
        synopsis_block = response.xpath('//div[@class="full-synopsis"]//text()')
        if synopsis_block:
            request.item['synopsis'] = synopsis_block.extract_first().strip()
        else:
            request.item['synopsis'] = response.xpath('//div[@class="short-synopsis"]//text()').extract_first().strip()
        yield request.item


if __name__ == "__main__":
    MovieScraper().start()
Request Interception Middleware
Middleware allows modification of requests prior to dispatch, such as injecting headers or proxies. Overriding download_midware applies globally to all requests.
import feapder


class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(f"https://movie.example.com/top?offset={offset}")

    def download_midware(self, request):
        request.headers = {
            'User-Agent': 'Mozilla/5.0 (Custom Agent)'
        }
        request.proxies = {
            "http": "http://proxy.local:8080"
        }
        return request
For request-specific middleware, pass the function reference to the download_midware parameter in the Request constructor:
import feapder


class MovieScraper(feapder.AirSpider):
    def start_requests(self):
        for offset in range(0, 250, 25):
            yield feapder.Request(
                f"https://movie.example.com/top?offset={offset}",
                download_midware=self.custom_proxy_handler
            )

    def custom_proxy_handler(self, request):
        request.headers = {'X-Custom-Header': 'value'}
        return request
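Both middleware examples assign a fresh dictionary to request.headers, which replaces any headers the request already carried. If that is not wanted, the values can be computed by small pure functions and then assigned inside the middleware. The following is a sketch only: merge_headers, next_proxies, and the proxy addresses are hypothetical, and the rotation is not written with multi-threaded access in mind.

```python
from itertools import cycle

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Custom Agent)"}

# Hypothetical proxy addresses, rotated in round-robin order.
PROXY_POOL = cycle(["http://proxy-a.local:8080", "http://proxy-b.local:8080"])


def merge_headers(existing, defaults=DEFAULT_HEADERS):
    """Return the defaults overlaid with whatever headers the request already has."""
    merged = dict(defaults)
    merged.update(existing or {})  # request-specific values win over defaults
    return merged


def next_proxies(pool=PROXY_POOL):
    """Take the next proxy from the pool and return it as a scheme-to-URL mapping."""
    address = next(pool)
    return {"http": address, "https": address}


print(merge_headers({"X-Custom-Header": "value"}))
print(next_proxies())
```

Inside download_midware this might be used as request.headers = merge_headers(getattr(request, "headers", None)) followed by request.proxies = next_proxies(); the getattr guard is there because the example does not assume a headers attribute is always present.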
Response Validation and Retry Logic
The validate method checks response integrity. Raising an exception triggers an automatic retry, returning False discards the request, and returning True or None proceeds to the parsing method.
import feapder


class MovieScraper(feapder.AirSpider):
    # ... other methods ...

    def validate(self, request, response):
        if response.status_code != 200:
            raise Exception(f"Invalid status: {response.status_code}")
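The example above uses only one of the three outcomes. A sketch that exercises all of them is shown below as a standalone function so it can be tried without the framework. The split between retryable and non-retryable status codes is an assumption made for illustration, and check_response is a hypothetical name; SimpleNamespace merely stands in for a response object.

```python
from types import SimpleNamespace

# Assumed policy: these statuses are treated as temporary and worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def check_response(response):
    """Return True to parse, False to discard, or raise to request a retry."""
    status = response.status_code
    if status == 200:
        return True
    if status in RETRYABLE_STATUSES:
        raise Exception(f"Retryable status: {status}")
    return False  # e.g. 404: retrying is unlikely to help


print(check_response(SimpleNamespace(status_code=200)))
print(check_response(SimpleNamespace(status_code=404)))
```

A spider could delegate to it with a one-line validate method that returns check_response(response).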
Dynamic Content Rendering via Selenium
For JavaScript-heavy pages, set render=True in the request. The framework maintains a browser pool and reuses instances automatically.
def start_requests(self):
    yield feapder.Request("https://news.example.com/", render=True)
Enable WebDriver configuration in setting.py:
WEBDRIVER = dict(
    pool_size=1,
    load_images=True,
    user_agent=None,
    proxy=None,
    headless=False,
    driver_type="CHROME",
    timeout=30,
    window_size=(1024, 800),
    executable_path=None,
    render_time=0,
    custom_argument=["--ignore-certificate-errors"],
    xhr_url_regexes=None,
    auto_install_driver=True,
    download_path=None,
    use_stealth_js=False,
)
Interacting with the rendered browser instance:
import feapder
from selenium.webdriver.common.by import By
from feapder.utils.webdriver import WebDriver


class SearchScraper(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.searchengine.com", render=True)

    def parse(self, request, response):
        driver: WebDriver = response.browser
        search_input = driver.find_element(By.ID, 'search-box')
        search_input.send_keys('feapder')
        driver.find_element(By.ID, 'submit-btn').click()
To intercept XHR network requests, set xhr_url_regexes in the WEBDRIVER settings dictionary to a list of URL patterns. Access the intercepted data with xhr_text, xhr_json, or xhr_response, passing the pattern of the request you want.
driver: WebDriver = response.browser
api_data = driver.xhr_json("regex_pattern")
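Patterns for xhr_url_regexes are easy to get subtly wrong, for example by forgetting to escape the question mark. They can be checked offline against sample URLs before a browser is started. The sketch below assumes each pattern is applied as a regular-expression search against the request URL; how feapder matches internally may differ. The patterns, URLs, and the helper matching_pattern are all hypothetical.

```python
import re

# Hypothetical patterns for two API endpoints.
XHR_URL_REGEXES = [r"/api/search\?", r"/api/items/\d+"]


def matching_pattern(url, patterns=XHR_URL_REGEXES):
    """Return the first pattern found in `url`, or None if nothing matches."""
    for pattern in patterns:
        if re.search(pattern, url):
            return pattern
    return None


for sample in (
    "https://news.example.com/api/search?q=feapder",
    "https://news.example.com/api/items/42",
    "https://news.example.com/static/app.js",
):
    print(sample, "->", matching_pattern(sample))
```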
Structuring a Complete Project
Generate a structured project directory:
feapder create -p retail_scraper
This creates folders for items and spiders, alongside main.py and setting.py.
After configuring the database credentials in setting.py, define the table schema:
from feapder.db.mysqldb import MysqlDB

db = MysqlDB(ip='localhost', user_name='admin', user_pass='password', db='scraper_db')

create_table_sql = """
CREATE TABLE product_info (
    id INT PRIMARY KEY AUTO_INCREMENT,
    product_name VARCHAR(255) DEFAULT NULL,
    cost VARCHAR(255) DEFAULT NULL
);
"""
db.execute(create_table_sql)
Generate the corresponding Item inside the items directory:
cd items
feapder create -i product_info
Create the spider script inside the spiders directory:
cd ../spiders
feapder create -s product_spider
Implement the spider with scrolling and rendering logic:
import time
import feapder
from random import uniform
from items.product_info_item import ProductInfoItem
from selenium.webdriver.common.by import By
from feapder.utils.webdriver import WebDriver


class ProductSpider(feapder.AirSpider):
    def start_requests(self):
        base_url = 'https://store.example.com/search?q=laptops&page={}'
        for idx in range(1, 6):
            yield feapder.Request(url=base_url.format(idx), render=True)

    def parse(self, request, response):
        driver: WebDriver = response.browser
        time.sleep(2)
        self.scroll_to_bottom(driver)
        products = driver.find_elements(By.XPATH, '//div[@class="product-card"]')
        for product in products:
            cost = product.find_element(By.XPATH, './/span[@class="price"]').text
            product_name = product.find_element(By.XPATH, './/h4').text
            item = ProductInfoItem()
            item.product_name = product_name
            item.cost = cost
            yield item

    def scroll_to_bottom(self, driver):
        for step in range(1, 10):
            driver.execute_script(f'window.scrollTo(0, {step * 500});')
            time.sleep(uniform(0.5, 1.5))


if __name__ == "__main__":
    ProductSpider(thread_count=1).start()
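The scroll_to_bottom method above always scrolls nine fixed steps, which may stop short on long pages or waste time on short ones. One possible alternative is to keep scrolling until the page height stops growing. The sketch below relies only on execute_script, which the spider above already calls; scroll_until_stable, its parameters, and StubDriver (a stand-in that lets the function run without a browser) are hypothetical.

```python
import time


def scroll_until_stable(driver, step=500, pause=0.5, max_steps=50):
    """Scroll down in fixed steps until the bottom is reached and the page stops growing."""
    position = 0
    for _ in range(max_steps):
        height = driver.execute_script("return document.body.scrollHeight")
        if position >= height:
            break  # at the bottom, and no new content was appended
        position = min(position + step, height)
        driver.execute_script(f"window.scrollTo(0, {position});")
        time.sleep(pause)  # give lazy-loaded content time to arrive
    return position


class StubDriver:
    """Stand-in for a browser: reports a scripted sequence of page heights."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.scrolls = []

    def execute_script(self, script):
        if script.startswith("return"):
            # Keep repeating the last height once the sequence runs out.
            return self.heights.pop(0) if len(self.heights) > 1 else self.heights[0]
        self.scrolls.append(script)


# The page grows from 1000px to 1500px while scrolling, as lazy loading would cause.
stub = StubDriver([1000, 1000, 1500])
final_position = scroll_until_stable(stub, pause=0)
print(final_position, stub.scrolls)
```

The max_steps cap guards against pages with infinite feeds, where the height never stabilizes.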