Practical Python Method for Batch Scraping WeChat Official Account Article Links

Modern large language models have streamlined post-scraping text processing, replacing manual tag stripping and formatting with fast, robust cleaning workflows. Beyond cleaning, these tools enable efficient core idea extraction and content rephrasing for legitimate use cases.

Scraping web content requires identifying consistent, traversable resource URLs. Common strategies include inferring pagination patterns via query parameters or parsing embedded hyperlinks from source HTML. WeChat Official Account articles resist both approaches: individual post URLs contain complex, non-sequential encrypted parameters, and posts lack direct navigation to adjacent entries from the same account.
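For contrast, the hyperlink-parsing strategy that works on ordinary sites can be sketched with the standard library alone. This is a minimal illustration, not part of the WeChat workflow: the LinkCollector class, the extract_links helper, the sample HTML, and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page fragment, used only for illustration
sample_html = '<ul><li><a href="/posts/1">One</a></li><li><a href="/posts/2">Two</a></li></ul>'
print(extract_links(sample_html, "https://example.com/archive"))
```

A WeChat article page offers no such list of sibling links, which is why the rest of this post goes through the platform's article-list request instead.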

A reliable workaround exists using WeChat Public Platform’s internal tools. The process requires an active Official Account (registration is straightforward and free for personal use). After logging in, navigate to Drafts > Create New Graphic, click the Hyperlink button at the top, select Official Account Articles, and search for the target account. When the article list loads, open browser developer tools (F12) and switch to the Network tab. Clicking "Next Page" reveals a structured API request with predictable parameters.
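Once the request shows up in the Network tab, its URL can be copied and split into individual parameters rather than read by eye. The sketch below assumes the URL carries the same query parameters used later in this post; the captured_url value is made up and the split_request_url helper is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL shaped like the request seen in the Network tab; all values are fake
captured_url = (
    "https://mp.weixin.qq.com/cgi-bin/appmsg"
    "?action=list_ex&begin=5&count=5&fakeid=EXAMPLE_FAKEID%3D%3D"
    "&type=9&query=&token=123456789&lang=zh_CN&f=json&ajax=1"
)


def split_request_url(url):
    """Return (endpoint, params) with each query value flattened to a single string."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # keep_blank_values preserves empty parameters such as query=
    parsed = parse_qs(parts.query, keep_blank_values=True)
    params = {key: values[0] for key, values in parsed.items()}
    return endpoint, params


endpoint, params = split_request_url(captured_url)
print(endpoint)
print(params["token"], params["fakeid"], params["begin"], params["count"])
```

The resulting dictionary can be pasted into the request payload, which avoids copy errors in long percent-encoded values.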

Key API parameters include begin (the zero-based index of the first article in the batch), count (the number of articles per request; the platform caps it at a low value, and the script below uses 5), and fakeid (the unique encrypted identifier of the target account). The response is a JSON payload containing the account's total article count and a list of article metadata.
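To make the paging arithmetic and the payload shape concrete, here is a small offline sketch. The sample payload is invented and mirrors only the fields this post relies on (app_msg_cnt, app_msg_list, title, link, create_time); the batch_offsets and rows_from_payload helpers are hypothetical, and timestamps are rendered in UTC so the output does not depend on the local time zone.

```python
import json
import math
from datetime import datetime, timezone

# Made-up payload that mirrors only the fields used in this post
sample_payload = json.loads("""
{
  "app_msg_cnt": 12,
  "app_msg_list": [
    {"title": "Example post A", "link": "https://mp.weixin.qq.com/s/EXAMPLE_A", "create_time": 1700000000},
    {"title": "Example post B", "link": "https://mp.weixin.qq.com/s/EXAMPLE_B", "create_time": 1700086400}
  ]
}
""")


def batch_offsets(total_articles, count):
    """List the begin value for every request needed to cover total_articles."""
    return [i * count for i in range(math.ceil(total_articles / count))]


def rows_from_payload(payload):
    """Flatten app_msg_list into (title, link, UTC timestamp) tuples."""
    rows = []
    for post in payload.get("app_msg_list", []):
        published = datetime.fromtimestamp(post["create_time"], tz=timezone.utc)
        rows.append((post["title"], post["link"], published.strftime("%Y-%m-%d %H:%M:%S")))
    return rows


print(batch_offsets(sample_payload["app_msg_cnt"], 5))
for row in rows_from_payload(sample_payload):
    print(row)
```

With 12 articles and count set to 5, three requests are needed, and the last one returns a partial batch.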

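The following implementation targets Python 3.10+ and uses requests and pandas together with the standard-library time, math, and random modules. It pages through the article list with randomized delays between requests and writes the collected links to a CSV file: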

import requests
import pandas as pd
import time
import math
import random

# API endpoint for article list retrieval
API_ENDPOINT = "https://mp.weixin.qq.com/cgi-bin/appmsg"

# Replace with values captured from developer tools
REQUEST_HEADERS = {
    "Cookie": "PLACEHOLDER_COOKIE_FROM_DEVTOOLS",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}

BASE_PAYLOAD = {
    "token": "PLACEHOLDER_TOKEN_FROM_API_URL",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
    "action": "list_ex",
    "count": "5",
    "query": "",
    "fakeid": "PLACEHOLDER_TARGET_ACCOUNT_FAKEID",
    "type": "9"
}

OUTPUT_CSV = "wechat_article_links.csv"
CSV_COLUMNS = ["headline", "permalink", "publish_timestamp"]

def save_records(records, first_write):
    # The first flush overwrites the file and writes the header; later flushes append
    frame = pd.DataFrame(records, columns=CSV_COLUMNS)
    frame.to_csv(OUTPUT_CSV, mode="w" if first_write else "a", encoding="utf-8-sig", index=False, header=first_write)

def fetch_article_metadata():
    article_records = []
    first_write = True
    page_size = int(BASE_PAYLOAD["count"])
    # Initial request to get total article count
    try:
        initial_resp = requests.get(API_ENDPOINT, headers=REQUEST_HEADERS, params={**BASE_PAYLOAD, "begin": "0"}, timeout=10)
        initial_resp.raise_for_status()
        initial_data = initial_resp.json()
        total_articles = int(initial_data.get("app_msg_cnt", 0))
        total_batches = math.ceil(total_articles / page_size)
        print(f"Found {total_articles} articles across {total_batches} batches")
    except Exception as e:
        print(f"Failed to fetch initial metadata: {e}")
        return

    for batch_idx in range(total_batches):
        # Build a fresh parameter dict per request instead of mutating the shared payload
        params = {**BASE_PAYLOAD, "begin": str(batch_idx * page_size)}
        # Randomized delays to avoid rate limiting
        if batch_idx > 0 and batch_idx % 10 == 0:
            # Save intermediate results and take a longer break
            if article_records:
                save_records(article_records, first_write)
                first_write = False
                article_records = []
            pause_duration = random.randint(70, 100)
            print(f"Pausing for {pause_duration}s before batch {batch_idx + 1}")
            time.sleep(pause_duration)
        else:
            time.sleep(random.randint(18, 28))

        try:
            batch_resp = requests.get(API_ENDPOINT, headers=REQUEST_HEADERS, params=params, timeout=10)
            batch_resp.raise_for_status()
            batch_data = batch_resp.json()
            if "app_msg_list" not in batch_data:
                print(f"No article list in batch {batch_idx + 1}, stopping early")
                break
            for post in batch_data["app_msg_list"]:
                formatted_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(post["create_time"]))
                article_records.append([post["title"], post["link"], formatted_time])
            print(f"Processed batch {batch_idx + 1}/{total_batches}")
        except Exception as e:
            print(f"Error processing batch {batch_idx + 1}: {e}")
            break

    # Save remaining records; this is also the first write when there were 10 or fewer batches
    if article_records:
        save_records(article_records, first_write)
        print(f"Saved final {len(article_records)} records")

if __name__ == "__main__":
    fetch_article_metadata()

Critical notes:

  • Replace PLACEHOLDER_COOKIE_FROM_DEVTOOLS, PLACEHOLDER_TOKEN_FROM_API_URL, and PLACEHOLDER_TARGET_ACCOUNT_FAKEID with actual values captured from your Network tab.
  • Cookies and tokens expire quickly. If requests start failing or the response no longer contains app_msg_list, capture fresh values from the Network tab and rerun the script.
  • Strictly adhere to WeChat’s terms of service and local regulations regarding web scraping and content usage.
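
One way to make the credential note above actionable is to fail with an explicit hint instead of a bare KeyError. The sketch below is an assumption-laden illustration: the require_article_list helper is hypothetical, and it assumes that a response without app_msg_list usually points to stale credentials or throttling, which this post does not confirm for every case.

```python
def require_article_list(payload):
    """Return app_msg_list, or raise with a hint when the key is absent."""
    if "app_msg_list" not in payload:
        # Assumption: a missing list usually means stale credentials or throttling
        other_keys = sorted(payload.keys())
        raise RuntimeError(
            "Response has no app_msg_list; refresh the cookie and token, then retry later. "
            f"Keys present: {other_keys}"
        )
    return payload["app_msg_list"]


# Made-up payloads, used only for illustration
print(len(require_article_list({"app_msg_list": [{"title": "Example"}]})))
try:
    require_article_list({"unexpected": "value"})
except RuntimeError as exc:
    print(exc)
```

Printing the keys that did come back gives a starting point for diagnosing what the server actually returned.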

Tags: python web scraping WeChat Official Account Data Collection API Integration

Posted on Mon, 08 Jun 2026 16:31:28 +0000 by spfoonnewb