Using Docker with Python for Web Scraping: A Practical Guide

Python has emerged as one of the fastest-growing mainstream programming languages, and it ranked as the second most loved language among developers in the Stack Overflow 2019 survey. For developers working with .NET or Java, Python makes an excellent second language, especially given its strengths in areas like web scraping, machine learning, and artificial intelligence. Its simplicity also makes it beginner-friendly: you can get started without working through a formal textbook.

Due to preparations for the Changsha Developer Conference on April 21st, progress was delayed. However, we've invited notable experts including Zhang Shanyou (Tencent senior tech expert and Microsoft MVP), Liang Tongming (author of the 52ABP framework and Microsoft MVP), Wang Peng (renowned tech writer), Zhuo Wei (senior engineer at Tencent), and Hu Liwei (senior product manager at Tencent Cloud). Those interested can register via the public account menu [Contact Us] → [Register] to join us. Technology transcends language boundaries, and we look forward to sharing knowledge and exchanging ideas with you!

Table of Contents

Introduction to Python

Official Docker Image

Web Scraping Blog Posts Using Python

Requirements Overview

Understanding BeautifulSoup

Analyzing Data Extraction Patterns

Implementing Scraping Logic

Creating a Dockerfile

Running and Viewing Results

Introduction to Python

Python is a high-level, interpreted programming language that first gained popularity for automation and scripting tasks. Over time, it has evolved into a versatile tool suitable for large-scale applications. As one of the fastest-growing mainstream languages, Python excels across various domains:

  • Web and Internet development
  • Scientific computing and statistics
  • Education
  • Desktop application development
  • Software engineering
  • Backend services

Learning Python is straightforward, yet it equips you to pick up machine learning and even deep learning quickly. While basic Python proficiency is valuable on its own, pairing it with a statically typed language such as C# or Java broadens your capabilities further. For that reason, we recommend that .NET and Java developers consider Python as a second language, given its strength in fields like web scraping, AI, and algorithmic work.

Official Docker Image

The official Python image can be found here: https://hub.docker.com/_/python

When pulling Python for your projects, make sure you use this official image rather than an unverified third-party build.

Web Scraping Blog Posts Using Python

Requirements Overview

This tutorial demonstrates how to scrape a list of blog posts from my personal blog using Python, extracting titles, links, dates, and summaries.

Blog URL: http://www.cnblogs.com/codelove/

(Screenshot of the blog's post list omitted here.)

Understanding BeautifulSoup

BeautifulSoup is a Python library used for parsing HTML and XML documents efficiently. It supports multiple parsers and serves as a robust tool for web scraping. In this guide, we'll use BeautifulSoup to extract blog data.

BeautifulSoup Official Site: https://beautifulsoup.readthedocs.io

BeautifulSoup supports several parsers, including:

  • html.parser — Python's built-in parser; no extra install required
  • lxml — a fast HTML parser (requires the lxml package)
  • lxml-xml — lxml's XML parser
  • html5lib — parses pages the same way a web browser does (used in this guide)
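As a minimal sketch of how a parser is chosen, the snippet below parses a small fragment with the built-in html.parser; passing "lxml" or "html5lib" instead (once installed) would work the same way. The HTML fragment and the /p/1 link are made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, made-up fragment; html.parser ships with the standard library,
# so no extra package is needed for this example.
html = '<div class="postTitle"><a href="/p/1">Hello</a></div>'
soup = BeautifulSoup(html, "html.parser")

anchor = soup.select(".postTitle a")[0]
print(anchor.get_text())   # Hello
print(anchor["href"])      # /p/1
```

The parser name is just the second argument to the BeautifulSoup constructor, so switching parsers later requires no other code changes.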

Analyzing Data Extraction Patterns

To begin, open Chrome and navigate to: http://www.cnblogs.com/codelove/

Press F12 to open Developer Tools. From there, identify key elements:

  • Blog entries (div.day)
  • Post titles (div.postTitle a)
  • Additional metadata such as the date, link, and summary

We also observed the pagination pattern from the URL: each page of the list is served at default.html?page=N, where N is the page number.

With these insights, we’re ready to proceed with implementation.
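To check that the selectors behave as expected before hitting the live site, you can run them against a small fragment that mirrors the structure seen in DevTools. The class names (.forFlow, .day, .postTitle) match the real page; the sample content and example.com link are invented:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the markup observed in DevTools
sample = """
<div class="forFlow">
  <div class="day">
    <div class="dayTitle"><a href="#">April 2019</a></div>
    <div class="postTitle"><a href="https://example.com/post-1">First post</a></div>
    <div class="postCon"><div>Short summary...</div></div>
  </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
# The same selector the scraper will use: entries, then title anchors
titles = [a.get_text() for a in soup.select(".forFlow .day .postTitle a")]
print(titles)   # ['First post']
```

Testing selectors on a static fragment like this is much faster than re-fetching the page after every tweak.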

Implementing Scraping Logic

Before coding, consult the BeautifulSoup documentation. Based on our needs, here’s the Python script:

# Refer to the BeautifulSoup docs: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id52

from bs4 import BeautifulSoup
import requests

base_url = "https://www.cnblogs.com/codelove/default.html?page={page}"
current_page = 0

while True:
    current_page += 1
    target_url = base_url.format(page=current_page)
    response = requests.get(target_url)

    # html5lib parses the page the way a browser would
    soup = BeautifulSoup(response.text, 'html5lib')

    # Each blog entry is a div.day inside the .forFlow container
    posts = soup.select(".forFlow .day")

    # An empty page means we've paged past the last entry
    if not posts:
        break

    print(f"Fetching: {target_url}")
    
    for post in posts:
        # get_text() is safer than .string when the anchor has nested children
        title = post.select(".postTitle a")[0].get_text(strip=True)
        print(f"{'-' * 30}{title}{'-' * 30}")

        # The same anchor carries the post's URL
        link = post.select(".postTitle a")[0]["href"]
        print(link)

        # The publication date sits in the .dayTitle header
        date = post.select(".dayTitle a")[0].get_text()
        print(date)

        # The summary is the first div inside .postCon
        description = post.select(".postCon > div")[0].get_text()
        print(description)

        print("-" * 80)

The script pages through the blog list, uses CSS selectors to pull out each post's title, link, date, and summary, and prints them to the console.
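One caveat: the script indexes [0] on every selector result, so a single malformed entry on a real page would raise an IndexError. A possible hardening, sketched below under the assumption that the page structure matches the selectors above (parse_posts is a hypothetical helper, not part of the original script), is to isolate parsing in a function that skips incomplete entries:

```python
from bs4 import BeautifulSoup

def parse_posts(html):
    """Return a list of dicts, one per complete blog entry on a page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for post in soup.select(".forFlow .day"):
        anchors = post.select(".postTitle a")
        if not anchors:
            # Skip malformed entries instead of raising IndexError
            continue
        dates = post.select(".dayTitle a")
        summaries = post.select(".postCon > div")
        results.append({
            "title": anchors[0].get_text(strip=True),
            "link": anchors[0].get("href", ""),
            "date": dates[0].get_text(strip=True) if dates else "",
            "summary": summaries[0].get_text(strip=True) if summaries else "",
        })
    return results
```

Keeping the parsing separate from the fetching also makes it easy to unit-test against a saved HTML fixture, with no network access required.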

Creating a Dockerfile

After writing the code, we'll package it into a Docker image so it runs consistently anywhere, without having to install a Python environment locally. Here's the Dockerfile:

# Base image
FROM python:3.7-slim

# Set working directory
WORKDIR /app

# Copy project files
COPY . /app

# Install dependencies
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Run the application
CMD ["python", "app.py"]

Since we rely on libraries like BeautifulSoup, we need to specify dependencies in requirements.txt:

html5lib
beautifulsoup4
requests

Running and Viewing Results

Once the image is built, run the container: each post's title, link, date, and summary will be printed to the console.
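Assuming the Dockerfile and requirements.txt sit next to app.py, the usual build-and-run commands look like this (the image name py-scraper is arbitrary, chosen here for illustration):

```shell
# Build the image from the Dockerfile in the current directory
docker build -t py-scraper .

# Run the scraper; --rm removes the container after it exits
docker run --rm py-scraper
```

Note that the container needs outbound network access at runtime, since the script fetches the blog pages when it starts.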

Tags: python docker web-scraping beautifulsoup Requests

Posted on Tue, 12 May 2026 21:27:19 +0000 by freelancer