Web Scraping with XPath: Extracting News Headlines from 36Kr

Having previously explored the powerful BeautifulSoup library for HTML parsing and techniques for capturing HTTP requests through online tools, we now turn our attention to another fundamental web scraping approach: XPath. XPath serves as a query language designed to navigate and select specific portions of XML documents. While originally deve ...

Posted on Fri, 12 Jun 2026 18:32:30 +0000 by 9mm

Essential Web Scraping Techniques using Urllib and Requests in Python

Utilizing the urllib Module for Web Requests The urllib library is a built-in Python module used for handling URLs. It provides several ways to fetch data from the web, ranging from simple function calls to complex custom handlers. Basic Web Access and Custom Openers The simplest way to retrieve a webpage is using urlopen. For more advanced con ...

Posted on Tue, 26 May 2026 16:22:22 +0000 by andybrooke

Information Technology Terminology Analysis and Visualization System

Core Functionality The system automates the collection, processing, and presentation of trending IT terminology through multiple analytical stages. Data Harvesting Module Content is systematically extracted from technical news sources using HTTP protocols. The collection mechanism targets blog.cnblogs.com for initial dataset generation. import ...

Posted on Tue, 19 May 2026 15:11:58 +0000 by cabaz777

Integrating Selenium with Scrapy for Dynamic Content Extraction

Dynamic Data Handling in Scrapy with Selenium Integration When scraping websites with the Scrapy framework, you often encounter pages where content is dynamically loaded through JavaScript. Direct HTTP requests made by Scrapy to these URLs will not retrieve the dynamically generated data. However, browsers successfully render and display this co ...

Posted on Sat, 16 May 2026 15:50:31 +0000 by jediman

Advanced Web Scraping Techniques for Rankings, Products, and Images

Extracting Structured Ranking Data Utilizing requests alongside BeautifulSoup enables efficient extraction of tabular data from static web pages. The following implementation targets university ranking lists, parsing specific DOM elements to compile rank, institution name, location, type, and score. import requests from bs4 import BeautifulSoup ...

Posted on Sat, 16 May 2026 07:09:57 +0000 by sarathi

Practical XPath Parsing with Python's lxml Library

The lxml library serves as a powerful Pythonic wrapper around C libraries like libxml2 and libxslt, delivering exceptional performance for parsing HTML and XML documents. Its comprehensive support for XPath 1.0 makes it an ideal choice for targeted data extraction tasks. Setup Install the package via pip: pip install lxml Initializing Parser O ...

Posted on Fri, 15 May 2026 15:42:04 +0000 by zackcez

Advanced HTML Parsing Strategies with PyQuery

Core Overview PyQuery provides an efficient interface for DOM manipulation in Python, mirroring the functionality of jQuery. It leverages the lxml parser back end to handle complex HTML structures. Environment Setup Installation requires the core parsing libraries. pip install lxml pyquery Initializing the Document Object Processing begins by ...

Posted on Wed, 13 May 2026 21:53:15 +0000 by Eckstra

Using Docker with Python for Web Scraping: A Practical Guide

Python has emerged as one of the fastest-growing mainstream programming languages and ranks as the second most beloved language among developers according to the Stack Overflow 2019 survey. For developers working with .NET or Java, learning Python can serve as an excellent second language—especially given its powerful capabilities in areas like ...

Posted on Tue, 12 May 2026 21:27:19 +0000 by freelancer

Understanding Scrapy's Request Object and Data Flow Between Components

Data Flow Between Scrapy Components Scrapy manages communication between different components through two fundamental objects: Request and Response. The Spider generates Request objects, which travel through the engine and download middleware before being executed by the downloader. Each Request eventually produces a Response that flows back th ...

Posted on Fri, 08 May 2026 21:33:48 +0000 by bow-viper1

Automating 12306 Train Ticket Queries: A Comparative Study of HTTP Scraping and Selenium Automation

Automating routine tasks on web platforms often requires choosing between lightweight API interaction and robust browser automation. This article details two Python implementations for querying train availability on the 12306 railway system. The ...

Posted on Fri, 08 May 2026 15:57:37 +0000 by maxelcat