Extracting Structured Data from HTML with Python's BeautifulSoup

To install the library along with a high-performance parser: pip install beautifulsoup4 lxml Begin by importing the class and initializing the parser with your markup: from bs4 import BeautifulSoup markup = """ <article class="product-listing"> <header> <h1 id="main-title">Electronics ...

Posted on Mon, 25 May 2026 21:15:27 +0000 by forum

Using CrawlSpider for Automated Web Scraping in Scrapy

Overview When scraping an entire website like Qiushibaike (a Chinese joke site), you have two approaches: Method 1: Use Scrapy's base Spider class with recursive crawling (manual request callbacks). Method 2: Use CrawlSpider for automated link extraction and crawling (cleaner and more efficient). This guide covers: CrawlSpider introduction Crawl ...

Posted on Sun, 24 May 2026 19:45:48 +0000 by sheraz

Python Web Scraping Fundamentals: Request Handling and Network Operations

GET Requests with Dictionary Parameters When making GET requests with query parameters, we can construct URLs dynamically using dictionaries: import urllib.request import urllib.parse import string def get_params(): base_url = "http://www.baidu.com/s" params = { "query": "中文", ...
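
As a rough sketch of the dictionary-to-query-string step this excerpt describes, the standard library's urllib.parse.urlencode percent-encodes non-ASCII values before they are appended to the base URL. The helper name build_search_url is hypothetical, the "query" parameter is copied from the excerpt as shown, and no request is actually sent.

```python
import urllib.parse


def build_search_url(base_url: str, params: dict) -> str:
    """Append percent-encoded query parameters to base_url (hypothetical helper)."""
    # urlencode encodes each value as UTF-8 and percent-escapes it
    query_string = urllib.parse.urlencode(params)
    return f"{base_url}?{query_string}"


url = build_search_url("http://www.baidu.com/s", {"query": "中文"})
print(url)
```

Encoding through urlencode avoids hand-building the query string and keeps the URL valid when values contain Chinese characters or spaces.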

Posted on Sat, 23 May 2026 17:18:26 +0000 by tj71587

Applying XPath Expressions with Python's lxml Library

Installation Install the library using pip: pip install lxml XPath Core Concepts Node Types XPath defines seven node types: element, attribute, text, namespace, processing instruction, comment, and the document (root) node. An XML document is represented as a node tree, with the root of the tree being the document or root node. Consider this s ...
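
A small illustration of several of these node types, selected with lxml's xpath() method. The sample XML and every name in it are made up for this sketch and are not the article's own example.

```python
from lxml import etree

# Hypothetical sample document, written only for this sketch
SAMPLE = """
<catalog>
  <!-- featured titles -->
  <?render mode="compact"?>
  <book id="b1">Dune</book>
</catalog>
"""

root = etree.fromstring(SAMPLE)

elements = root.xpath("//book")                      # element nodes
attributes = root.xpath("//book/@id")                # attribute nodes (returned as strings)
texts = root.xpath("//book/text()")                  # text nodes (returned as strings)
comments = root.xpath("//comment()")                 # comment nodes
instructions = root.xpath("//processing-instruction()")  # processing-instruction nodes

print(elements[0].tag, attributes, texts)
print(comments[0].text.strip(), instructions[0].target)
```

Note that lxml returns attribute and text nodes as string-like objects, while element, comment, and processing-instruction nodes come back as tree objects.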

Posted on Wed, 20 May 2026 18:13:14 +0000 by alego

Building a Basic Web Scraper with Python

A web scraper automates the extraction of data from websites. The core process involves two primary steps: fetching web content and parsing the desired information. To begin, install the requests library, which handles HTTP requests. pip install requests Many websites restrict automated access. To mimic a real browser, you need to set a User-A ...
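
A minimal sketch of attaching a User-Agent header with requests. The header string, the URL, and the helper name prepare_get are illustrative assumptions; the request is only prepared here, not sent, so the example runs without network access.

```python
import requests

# Example browser-like User-Agent string (an assumption, not taken from the article)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    )
}


def prepare_get(url: str) -> requests.PreparedRequest:
    """Build a GET request carrying the custom headers without sending it."""
    return requests.Request("GET", url, headers=HEADERS).prepare()


prepared = prepare_get("https://example.com")
print(prepared.method, prepared.headers["User-Agent"])
```

To actually fetch a page, the same dictionary can be passed directly, for instance requests.get(url, headers=HEADERS, timeout=10).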

Posted on Tue, 19 May 2026 09:30:13 +0000 by lorenzo-s

Introduction to the Scrapy Framework for Web Scraping

This article explores the fundamentals of web scraping using Python, covering aspects from basic browser automation to the powerful Scrapy framework. Web Scraping with Selenium Selenium is a popular tool for browser automation, enabling the simulation of user interactions with web pages. The following example demonstrates how to use Selenium wi ...

Posted on Sun, 17 May 2026 23:33:31 +0000 by strago

DrissionPage: Unifying Browser Automation and HTTP Requests in Python

Web automation is frequently used to monitor e-commerce prices. The following script demonstrates how to track a product price and trigger an API notification when a target threshold is met. This example utilizes the mixed mode to handle navigation and data extraction seamlessly. from DrissionPage import WebPage import requests import time # I ...

Posted on Sun, 17 May 2026 13:29:16 +0000 by Dodon

Web Scraping and Visualization Techniques for Location Data

Pandas can be used to load and filter Excel datasets containing geographic coordinates. For example, to extract Starbucks store locations in Shanghai from a spreadsheet, read the file and apply a city-based filter. import pandas as pd data_frame = pd.read_excel("stores_data.xlsx") shanghai_locations = data_frame[data_frame['city'] == ...

Posted on Sun, 17 May 2026 08:54:17 +0000 by rckehoe

Extracting Web Table Data and Exporting to Excel Using a Tampermonkey Script

Often we need to download tabular data displayed on a web page as an Excel file. When the site does not provide a suitable download option, a custom script can be used to gather the data and export it. Packaging the script as a Tampermonkey userscript makes it easy to share the solution with others. Below are two practical approaches: one that simulates manual ...

Posted on Sun, 17 May 2026 06:41:49 +0000 by WendyB

Building Distributed Scrapy Spiders with Redis

RedisSpider Overview RedisSpider extends Scrapy's base Spider class to enable distributed crawling. Instead of using a static start_urls list, this spider reads URLs from a Redis queue. Key Differences from Standard Spider The main modifications involve imports, inheritance, and replacing the static URL list with a Redis key: from scrapy_redis. ...

Posted on Sun, 17 May 2026 03:26:31 +0000 by glassroof