Extracting Structured Data from HTML with Python's BeautifulSoup
To install the library along with a high-performance parser:
pip install beautifulsoup4 lxml
Begin by importing the class and initializing the parser with your markup:
from bs4 import BeautifulSoup
markup = """
<article class="product-listing">
<header>
<h1 id="main-title">Electronics ...
Posted on Mon, 25 May 2026 21:15:27 +0000 by forum
Using CrawlSpider for Automated Web Scraping in Scrapy
Overview
When scraping an entire website such as Qiushibaike (a Chinese joke site), you have two approaches:
Method 1: Use Scrapy's base Spider class with recursive crawling (manual request callbacks).
Method 2: Use CrawlSpider for automated link extraction and crawling (cleaner and more efficient).
This guide covers:
CrawlSpider introduction
Crawl ...
Posted on Sun, 24 May 2026 19:45:48 +0000 by sheraz
Python Web Scraping Fundamentals: Request Handling and Network Operations
GET Requests with Dictionary Parameters
When making GET requests with query parameters, we can construct URLs dynamically using dictionaries:
import urllib.request
import urllib.parse
import string
def get_params():
base_url = "http://www.baidu.com/s"
params = {
"query": "中文",
" ...
Posted on Sat, 23 May 2026 17:18:26 +0000 by tj71587
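As a minimal sketch of the dictionary-based GET parameters described in the excerpt above, the standard library's urllib.parse.urlencode can turn a dictionary into a percent-encoded query string. The base URL and the "query" key come from the excerpt. The "page" key is a hypothetical second parameter added only for illustration. No request is sent here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://www.baidu.com/s"

# "query" appears in the excerpt; "page" is a hypothetical extra parameter.
params = {"query": "中文", "page": 2}

# urlencode percent-encodes non-ASCII values as UTF-8 by default.
query_string = urlencode(params)
full_url = f"{base_url}?{query_string}"

# Decode the query string again to confirm the round trip is lossless.
decoded = parse_qs(urlsplit(full_url).query)

print(full_url)
print(decoded)
```

Building the query string from a dictionary avoids hand-written escaping, which matters most when values contain non-ASCII text or reserved characters such as & and =.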
Applying XPath Expressions with Python's lxml Library
Installation
Install the library using pip:
pip install lxml
XPath Core Concepts
Node Types
XPath defines seven node types: element, attribute, text, namespace, processing instruction, comment, and the document (root) node. An XML document is represented as a node tree, with the root of the tree being the document or root node.
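To make a few of these node types concrete, here is a small sketch using lxml. The XML snippet and its element names are hypothetical, chosen only to show element, attribute, text, and comment nodes side by side.

```python
from lxml import etree

# Hypothetical sample document, not taken from the original article.
xml = """<catalog>
  <!-- sample -->
  <book lang="en">XPath Basics</book>
</catalog>"""

root = etree.fromstring(xml)

elements = root.xpath("//book")          # element nodes
attributes = root.xpath("//book/@lang")  # attribute nodes, returned as strings
texts = root.xpath("//book/text()")      # text nodes, returned as strings
comments = root.xpath("//comment()")     # comment nodes

print(elements[0].tag, attributes, texts, comments[0].text)
```

Each expression selects a different kind of node from the same tree, which is why the results come back as different Python types: element objects for elements and comments, string-like values for attributes and text.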
Consider this s ...
Posted on Wed, 20 May 2026 18:13:14 +0000 by alego
Building a Basic Web Scraper with Python
A web scraper automates the extraction of data from websites. The core process involves two primary steps: fetching web content and parsing the desired information.
To begin, install the requests library, which handles HTTP requests.
pip install requests
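The two steps named above, fetching and parsing, can be sketched as follows. This is only an illustration under stated assumptions: fetch_html and extract_title are hypothetical helper names, the parsing step uses the standard library's html.parser rather than any parser the article may choose, and the sample HTML is invented so the parsing step runs without network access.

```python
from html.parser import HTMLParser


def fetch_html(url):
    """Step 1 (fetch): download a page and return its body as text."""
    # Imported here so the parsing step below runs even without requests.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.text


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Step 2 (parse): pull the page title out of raw HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Invented sample markup standing in for a downloaded page.
sample_html = "<html><head><title>Example Domain</title></head><body></body></html>"
page_title = extract_title(sample_html)
print(page_title)
```

In real use the two helpers would be chained, for example extract_title(fetch_html(url)) with a URL of your choosing.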
Many websites restrict automated access. To mimic a real browser, you need to set a User-A ...
Posted on Tue, 19 May 2026 09:30:13 +0000 by lorenzo-s
Introduction to the Scrapy Framework for Web Scraping
This article explores the fundamentals of web scraping using Python, covering aspects from basic browser automation to the powerful Scrapy framework.
Web Scraping with Selenium
Selenium is a popular tool for browser automation, enabling the simulation of user interactions with web pages. The following example demonstrates how to use Selenium wi ...
Posted on Sun, 17 May 2026 23:33:31 +0000 by strago
DrissionPage: Unifying Browser Automation and HTTP Requests in Python
Web automation is frequently used to monitor e-commerce prices. The following script demonstrates how to track a product price and trigger an API notification when a target threshold is met. This example utilizes the mixed mode to handle navigation and data extraction seamlessly.
from DrissionPage import WebPage
import requests
import time
# I ...
Posted on Sun, 17 May 2026 13:29:16 +0000 by Dodon
Web Scraping and Visualization Techniques for Location Data
Pandas can be used to load and filter Excel datasets containing geographic coordinates. For example, to extract Starbucks store locations in Shanghai from a spreadsheet, read the file and apply a city-based filter.
import pandas as pd
data_frame = pd.read_excel("stores_data.xlsx")
shanghai_locations = data_frame[data_frame['city'] == ...
Posted on Sun, 17 May 2026 08:54:17 +0000 by rckehoe
Extracting Web Table Data and Exporting to Excel Using a Tampermonkey Script
We often need to download tabular data displayed on a web page as an Excel file. When the site does not provide a suitable download option, a custom script can gather the data and export it. Packaging the script as a Tampermonkey userscript makes it easy to share the solution with others. Below are two practical approaches: one that simulates manual ...
Posted on Sun, 17 May 2026 06:41:49 +0000 by WendyB
Building Distributed Scrapy Spiders with Redis
RedisSpider Overview
RedisSpider extends Scrapy's base Spider class to enable distributed crawling. Instead of using a static start_urls list, this spider reads URLs from a Redis queue.
Key Differences from Standard Spider
The main modifications involve imports, inheritance, and replacing the static URL list with a Redis key:
from scrapy_redis. ...
Posted on Sun, 17 May 2026 03:26:31 +0000 by glassroof