Python has emerged as one of the fastest-growing mainstream programming languages and ranks as the second most loved language among developers in the Stack Overflow 2019 survey. For developers working with .NET or Java, Python makes an excellent second language, especially given its strengths in web scraping, machine learning, and artificial intelligence. It is also simple enough that beginners can get started without a textbook.
Due to preparations for the Changsha Developer Conference on April 21st, progress was delayed. However, we've invited notable experts including Zhang Shanyou (Tencent senior tech expert and Microsoft MVP), Liang Tongming (author of 52ABP framework and Microsoft MVP), Wang Peng (renowned tech writer), Zhuo Wei (senior engineer at Tencent), and Hu Liwei (senior product manager at Tencent Cloud). Those interested can register via the public account menu [Contact Us] → [Register] to join us. Technology transcends language boundaries, and we look forward to sharing knowledge and exchanging ideas with you!
Table of Contents
Introduction to Python
Official Docker Image
Web Scraping Blog Posts Using Python
Requirements Overview
Understanding BeautifulSoup
Analyzing Data Extraction Patterns
Implementing Scraping Logic
Creating a Dockerfile
Running and Viewing Results
Introduction to Python
Python is a high-level, interpreted programming language that was originally designed for writing automation scripts. Over time, it has evolved into a versatile tool suitable for large-scale applications. As one of the fastest-growing mainstream languages and the second most loved among developers, Python is used across many domains:
- Web and Internet development
- Scientific computing and statistics
- Education
- Desktop application development
- Software engineering
- Backend services
Learning Python is straightforward, and it gives you quick access to machine learning and even deep learning. Basic Python proficiency is valuable on its own, and pairing it with a statically typed platform such as .NET or Java broadens what you can build. We therefore recommend that .NET and Java developers consider Python as a second language for its strength in data scraping, AI, and algorithmic work.
Official Docker Image
The official Python image can be found here: https://hub.docker.com/_/python
Always ensure you're using the official image.
Web Scraping Blog Posts Using Python
Requirements Overview
This tutorial demonstrates how to scrape a list of blog posts from my personal blog using Python, extracting titles, links, dates, and summaries.
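As a rough sketch of the target record, the four fields named above could be held in a small dataclass. The class name `BlogPost` and its field names are illustrative assumptions, not something defined by the blog or by BeautifulSoup, and the values are placeholders rather than scraped data.

```python
from dataclasses import dataclass, asdict


@dataclass
class BlogPost:
    """One scraped list entry; the name and fields are hypothetical."""
    title: str
    link: str
    date: str      # kept as the raw text shown on the page
    summary: str


# Placeholder values, not real scraped data.
post = BlogPost(
    title="Example title",
    link="https://example.com/post",
    date="2019-04-01",
    summary="Example summary text.",
)
print(asdict(post))
```

Keeping the date as plain text avoids guessing at the page's date format before it has been inspected.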
Blog URL: http://www.cnblogs.com/codelove/
The blog's home page shows the post list (screenshot omitted).
Understanding BeautifulSoup
BeautifulSoup is a Python library used for parsing HTML and XML documents efficiently. It supports multiple parsers and serves as a robust tool for web scraping. In this guide, we'll use BeautifulSoup to extract blog data.
BeautifulSoup Official Site: https://beautifulsoup.readthedocs.io
Parser Options Include:
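For reference, the Beautiful Soup documentation lists Python's built-in `html.parser`, lxml's HTML and XML parsers, and `html5lib`. The parser is chosen by the second argument to the `BeautifulSoup` constructor. A minimal sketch, using an inline HTML string invented for illustration and the built-in parser so that nothing beyond `beautifulsoup4` needs to be installed:

```python
from bs4 import BeautifulSoup

# Sample markup made up for this example.
html = "<html><body><h1>Hello</h1><p class='intro'>First paragraph.</p></body></html>"

# "html.parser" ships with Python; "lxml" and "html5lib" are separate
# packages that must be installed before they can be named here.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").get_text()
intro = soup.select_one("p.intro").get_text()
print(heading)
print(intro)
```

The script later in this post names `html5lib` instead, which is why that package appears in its requirements.txt.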
Analyzing Data Extraction Patterns
To begin, open Chrome and navigate to: http://www.cnblogs.com/codelove/
Press F12 to open Developer Tools. From there, identify key elements:
- Blog entries (div.day)
- Post titles (div.postTitle a)
- Additional metadata such as date (.dayTitle a), link (the href of the title anchor), and summary (.postCon > div); screenshots are omitted here.
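To make the selectors concrete, here is a small sketch that runs them against a hand-written fragment. The markup is a simplified assumption about the page structure, based only on the class names used in this post, and `html.parser` is used instead of `html5lib` to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Simplified, made-up fragment that mimics the class names seen in DevTools.
sample_html = """
<div class="forFlow">
  <div class="day">
    <div class="dayTitle"><a href="/archive/1.html">2019-04-01</a></div>
    <div class="postTitle"><a href="/p/1.html">First post</a></div>
    <div class="postCon"><div>Summary of the first post.</div></div>
  </div>
  <div class="day">
    <div class="dayTitle"><a href="/archive/2.html">2019-03-28</a></div>
    <div class="postTitle"><a href="/p/2.html">Second post</a></div>
    <div class="postCon"><div>Summary of the second post.</div></div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for day in soup.select(".forFlow .day"):
    # The title anchor carries both the title text and the link.
    title_link = day.select_one(".postTitle a")
    rows.append({
        "title": title_link.get_text(strip=True),
        "link": title_link["href"],
        "date": day.select_one(".dayTitle a").get_text(strip=True),
        "summary": day.select_one(".postCon > div").get_text(strip=True),
    })

for row in rows:
    print(row)
```

Trying selectors on a saved or hand-written fragment first makes it easier to tell a wrong selector from a network problem.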
We also observed pagination patterns through URL analysis:
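The script later in this post addresses list pages through a `page` query parameter on `default.html`. A small sketch of that pattern follows; the helper name `page_url` is an illustrative assumption.

```python
BASE_URL = "https://www.cnblogs.com/codelove/default.html?page={page}"


def page_url(page: int) -> str:
    """Return the list URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL.format(page=page)


# Print the URLs of the first three list pages.
for n in range(1, 4):
    print(page_url(n))
```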
With these insights, we’re ready to proceed with implementation.
Implementing Scraping Logic
Before coding, consult the BeautifulSoup documentation. Based on our needs, here’s the Python script:
# Refer to the BeautifulSoup docs: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id52
from bs4 import BeautifulSoup
import requests

# List pages are addressed by a page number in the query string
base_url = "https://www.cnblogs.com/codelove/default.html?page={page}"
current_page = 0

while True:
    current_page += 1
    target_url = base_url.format(page=current_page)
    # Fetch the page; the timeout keeps a stalled connection from hanging the script
    response = requests.get(target_url, timeout=10)
    soup = BeautifulSoup(response.text, 'html5lib')
    # Each .day block is one blog entry in the list
    posts = soup.select(".forFlow .day")
    # An empty page means we have run past the last page
    if not posts:
        break
    print(f"Fetching: {target_url}")
    for post in posts:
        # The title anchor carries both the title text and the link
        title_link = post.select(".postTitle a")[0]
        # get_text() also works when the title is wrapped in a child tag
        title = title_link.get_text(strip=True)
        print(f"{'-' * 30}{title}{'-' * 30}")
        link = title_link["href"]
        print(link)
        date = post.select(".dayTitle a")[0].get_text()
        print(date)
        description = post.select(".postCon > div")[0].get_text()
        print(description)
        print("-" * 80)
This script requests list pages one after another until a page returns no posts, extracts each post's title, link, date, and summary with CSS selectors, and prints them.
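One caveat with `while True`: if the site ever answers an out-of-range page number with a non-empty list, the loop would not end. Below is a hedged sketch of the same stop-on-empty-page logic with an upper bound, written against an injected callable so it can be tried without network access. The names `crawl`, `fetch_posts`, and `max_pages` are assumptions made for this example.

```python
from typing import Callable, Iterator, List


def crawl(fetch_posts: Callable[[int], List[str]], max_pages: int = 100) -> Iterator[str]:
    """Yield posts page by page, stopping at the first empty page
    or after max_pages pages, whichever comes first."""
    for page in range(1, max_pages + 1):
        posts = fetch_posts(page)
        if not posts:
            break
        yield from posts


# Stand-in for a real fetcher: three fake pages, then nothing.
fake_site = {1: ["a", "b"], 2: ["c"], 3: ["d"]}


def fake_fetch(page: int) -> List[str]:
    return fake_site.get(page, [])


all_posts = list(crawl(fake_fetch))                 # ends at page 4, which is empty
capped_posts = list(crawl(fake_fetch, max_pages=2)) # ends early because of the cap
print(all_posts)
print(capped_posts)
```

In a real run, `fetch_posts` would wrap the `requests.get` and `soup.select` calls from the script above.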
Creating a Dockerfile
With the code written, we package it into a Docker image so that it runs the same way everywhere without installing Python locally. The Dockerfile below expects the script to be saved as app.py in the same directory:
# Base image
FROM python:3.7-slim
# Set working directory
WORKDIR /app
# Copy project files
COPY . /app
# Install dependencies
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Run the application
CMD ["python", "app.py"]
Since we rely on libraries like BeautifulSoup, we need to specify dependencies in requirements.txt:
html5lib
beautifulsoup4
requests
Running and Viewing Results
Once built, execute the container to see output similar to the following: