Web Scraping with XPath: Extracting News Headlines from 36Kr
Having previously explored the powerful BeautifulSoup library for HTML parsing and techniques for capturing HTTP requests through online tools, we now turn our attention to another fundamental web scraping approach: XPath. XPath serves as a query language designed to navigate and select specific portions of XML documents. While originally deve ...
Posted on Fri, 12 Jun 2026 18:32:30 +0000 by 9mm
Essential Web Scraping Techniques using Urllib and Requests in Python
Utilizing the urllib Module for Web Requests
The urllib library is a built-in Python module used for handling URLs. It provides several ways to fetch data from the web, ranging from simple function calls to complex custom handlers.
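As a rough illustration of that range, the sketch below builds a request with a custom header and an opener with a custom handler, using only the standard library. The helper names, the User-Agent string, and the choice of ProxyHandler are assumptions made for this example, not taken from the original post.

```python
import urllib.request


def build_request(url, user_agent="example-scraper/0.1"):
    """Return a Request carrying a custom User-Agent header (hypothetical value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def build_custom_opener(proxies=None):
    """Return an opener built around a ProxyHandler.

    An empty dict tells ProxyHandler to use no proxies at all;
    pass e.g. {"http": "http://127.0.0.1:8080"} to route through one.
    """
    handler = urllib.request.ProxyHandler(proxies or {})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch a URL through the custom opener and return the raw bytes."""
    opener = build_custom_opener()
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Network access is required for this call; the URL is a placeholder.
    body = fetch("https://example.com/")
    print(len(body), "bytes received")
```

For a one-off request, urllib.request.urlopen(url) is enough; an opener becomes useful once handlers such as proxies or cookies need to be combined.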
Basic Web Access and Custom Openers
The simplest way to retrieve a webpage is using urlopen. For more advanced con ...
Posted on Tue, 26 May 2026 16:22:22 +0000 by andybrooke
Information Technology Terminology Analysis and Visualization System
Core Functionality
The system automates the collection, processing, and presentation of trending IT terminology through multiple analytical stages.
Data Harvesting Module
Content is systematically extracted from technical news sources using HTTP protocols. The collection mechanism targets blog.cnblogs.com for initial dataset generation.
import ...
Posted on Tue, 19 May 2026 15:11:58 +0000 by cabaz777
Integrating Selenium with Scrapy for Dynamic Content Extraction
Dynamic Data Handling in Scrapy with Selenium Integration
When scraping websites with the Scrapy framework, you often encounter pages where content is dynamically loaded through JavaScript. Direct HTTP requests made by Scrapy to these URLs will not retrieve the dynamically generated data. However, browsers successfully render and display this co ...
Posted on Sat, 16 May 2026 15:50:31 +0000 by jediman
Advanced Web Scraping Techniques for Rankings, Products, and Images
Extracting Structured Ranking Data
Utilizing requests alongside BeautifulSoup enables efficient extraction of tabular data from static web pages. The following implementation targets university ranking lists, parsing specific DOM elements to compile rank, institution name, location, type, and score.
import requests
from bs4 import BeautifulSoup ...
Posted on Sat, 16 May 2026 07:09:57 +0000 by sarathi
Practical XPath Parsing with Python's lxml Library
The lxml library serves as a powerful Pythonic wrapper around C libraries like libxml2 and libxslt, delivering exceptional performance for parsing HTML and XML documents. Its comprehensive support for XPath 1.0 makes it an ideal choice for targeted data extraction tasks.
Setup
Install the package via pip:
pip install lxml
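With the package installed, a minimal XPath query might look like the sketch below. The sample HTML, its id and class names, and the variable names are invented for illustration and do not come from the original post.

```python
from lxml import html

# Hypothetical markup standing in for a real page.
SAMPLE = """
<html><body>
  <ul id="news">
    <li class="item"><a href="/a/1">First headline</a></li>
    <li class="item"><a href="/a/2">Second headline</a></li>
  </ul>
</body></html>
"""

# Parse the string into an element tree.
tree = html.fromstring(SAMPLE)

# text() selects the text nodes of the matched <a> elements.
titles = tree.xpath('//ul[@id="news"]/li[@class="item"]/a/text()')

# @href selects the attribute values instead of the elements.
links = tree.xpath('//ul[@id="news"]//a/@href')

# XPath 1.0 functions such as count() return a float.
item_count = tree.xpath('count(//li[@class="item"])')

for title, link in zip(titles, links):
    print(title, link)
```

An xpath() call returns a list for node-set expressions and a plain value for functions like count(), so the result type depends on the expression.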
Initializing Parser O ...
Posted on Fri, 15 May 2026 15:42:04 +0000 by zackcez
Advanced HTML Parsing Strategies with PyQuery
Core Overview
PyQuery provides an efficient interface for DOM manipulation in Python, mirroring the functionality of jQuery. It leverages the lxml parser back end to handle complex HTML structures.
Environment Setup
Installation requires the core parsing libraries.
pip install lxml pyquery
Initializing the Document Object
Processing begins by ...
Posted on Wed, 13 May 2026 21:53:15 +0000 by Eckstra
Using Docker with Python for Web Scraping: A Practical Guide
Python has emerged as one of the fastest-growing mainstream programming languages and ranks as the second most beloved language among developers according to the Stack Overflow 2019 survey. For developers working with .NET or Java, learning Python can serve as an excellent second language—especially given its powerful capabilities in areas like ...
Posted on Tue, 12 May 2026 21:27:19 +0000 by freelancer
Understanding Scrapy's Request Object and Data Flow Between Components
Data Flow Between Scrapy Components
Scrapy manages communication between different components through two fundamental objects: Request and Response. The Spider generates Request objects, which travel through the engine and download middleware before being executed by the downloader. Each Request eventually produces a Response that flows back th ...
Posted on Fri, 08 May 2026 21:33:48 +0000 by bow-viper1
Automating 12306 Train Ticket Queries: A Comparative Study of HTTP Scraping and Selenium Automation
Automating routine tasks on web platforms often requires choosing between lightweight API interaction and robust browser automation. This article details two Python implementations for querying train availability on the 12306 railway system. The ...
Posted on Fri, 08 May 2026 15:57:37 +0000 by maxelcat