Modern large language models have streamlined post-scraping text processing: instead of stripping tags and fixing formatting by hand, scraped text can be cleaned quickly and robustly. Beyond cleaning, these models can also extract core ideas and rephrase content for legitimate use cases.
Scraping web content requires identifying consistent, traversable resource URLs. Common strategies include inferring pagination patterns via query parameters or parsing embedded hyperlinks from source HTML. WeChat Official Account articles resist both approaches: individual post URLs contain complex, non-sequential encrypted parameters, and posts lack direct navigation to adjacent entries from the same account.
A reliable workaround exists using WeChat Public Platform’s internal tools. The process requires an active Official Account (registration is straightforward and free for personal use). After logging in, navigate to Drafts > Create New Graphic, click the Hyperlink button at the top, select Official Account Articles, and search for the target account. When the article list loads, open browser developer tools (F12) and switch to the Network tab. Clicking "Next Page" reveals a structured API request with predictable parameters.
Key API parameters include begin (starting index of the article batch), count (number of articles per request, capped at a low value by the platform), and fakeid (unique encrypted identifier for the target account). The response returns a JSON payload with article metadata.
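The relationship between begin and count can be sketched as follows: batch n starts at index n * count, and the number of batches is the total article count divided by count, rounded up. The field names `app_msg_cnt` and `app_msg_list` are the ones read by the script in this article; the sample payload is fabricated for illustration and `batch_offsets` is a hypothetical helper.

```python
import math


def batch_offsets(total_articles: int, count: int) -> list[int]:
    """Return the begin value for every batch needed to cover all articles."""
    total_batches = math.ceil(total_articles / count)
    return [batch_idx * count for batch_idx in range(total_batches)]


# Illustrative payload only; real responses carry more fields per article
sample_response = {
    "app_msg_cnt": 12,
    "app_msg_list": [
        {
            "title": "Example post",
            "link": "https://example.com/a",
            "create_time": 1700000000,
        },
    ],
}

offsets = batch_offsets(sample_response["app_msg_cnt"], count=5)
print(offsets)  # [0, 5, 10]
```

With 12 articles and 5 per request, three requests are needed, and the last one returns only the remaining two entries.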
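The following script targets Python 3.10+ and uses requests and pandas alongside the standard-library time, math, and random modules. It pages through the article list with randomized delays and writes each article's title, link, and publish time to a CSV file: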
```python
import math
import random
import time

import pandas as pd
import requests

# API endpoint for article list retrieval
API_ENDPOINT = "https://mp.weixin.qq.com/cgi-bin/appmsg"
OUTPUT_CSV = "wechat_article_links.csv"
CSV_COLUMNS = ["headline", "permalink", "publish_timestamp"]

# Replace with values captured from developer tools
REQUEST_HEADERS = {
    "Cookie": "PLACEHOLDER_COOKIE_FROM_DEVTOOLS",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}

BASE_PAYLOAD = {
    "token": "PLACEHOLDER_TOKEN_FROM_API_URL",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1",
    "action": "list_ex",
    "count": "5",
    "query": "",
    "fakeid": "PLACEHOLDER_TARGET_ACCOUNT_FAKEID",
    "type": "9",
}


def save_records(records, first_write):
    """Write records to the CSV; only the first write creates the file and header."""
    frame = pd.DataFrame(records, columns=CSV_COLUMNS)
    frame.to_csv(
        OUTPUT_CSV,
        mode="w" if first_write else "a",
        encoding="utf-8-sig",
        index=False,
        header=first_write,
    )


def fetch_article_metadata():
    article_records = []
    first_write = True
    # Work on a copy so the module-level template is never mutated
    payload = dict(BASE_PAYLOAD)
    batch_size = int(payload["count"])

    # Initial request to get total article count
    payload["begin"] = "0"
    try:
        initial_resp = requests.get(API_ENDPOINT, headers=REQUEST_HEADERS, params=payload, timeout=10)
        initial_resp.raise_for_status()
        initial_data = initial_resp.json()
        total_articles = int(initial_data.get("app_msg_cnt", 0))
        total_batches = math.ceil(total_articles / batch_size)
        print(f"Found {total_articles} articles across {total_batches} batches")
    except Exception as e:
        print(f"Failed to fetch initial metadata: {e}")
        return

    for batch_idx in range(total_batches):
        payload["begin"] = str(batch_idx * batch_size)

        # Randomized delays to avoid rate limiting
        if batch_idx > 0 and batch_idx % 10 == 0:
            # Save intermediate results and take a longer break
            save_records(article_records, first_write)
            first_write = False
            article_records = []
            pause_duration = random.randint(70, 100)
            print(f"Pausing for {pause_duration}s after batch {batch_idx}")
            time.sleep(pause_duration)
        else:
            time.sleep(random.randint(18, 28))

        try:
            batch_resp = requests.get(API_ENDPOINT, headers=REQUEST_HEADERS, params=payload, timeout=10)
            batch_resp.raise_for_status()
            batch_data = batch_resp.json()
            if "app_msg_list" not in batch_data:
                print(f"No article list in batch {batch_idx}, stopping early")
                break
            for post in batch_data["app_msg_list"]:
                formatted_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(post["create_time"]))
                article_records.append([post["title"], post["link"], formatted_time])
            print(f"Processed batch {batch_idx + 1}/{total_batches}")
        except Exception as e:
            print(f"Error processing batch {batch_idx}: {e}")
            break

    # Save remaining records; this is the first write when fewer than
    # eleven batches ran, so the file and header are still created
    if article_records:
        save_records(article_records, first_write)
        print(f"Saved final {len(article_records)} records")


if __name__ == "__main__":
    fetch_article_metadata()
```
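The pacing rule used in the script (a short random delay before most batches, a longer break every tenth batch) can be isolated into a small function, which makes it easy to check without waiting through real sleeps. This is a sketch rather than part of the original script: `pause_seconds` is a hypothetical name, and the delay ranges are simply copied from the code above.

```python
import random


def pause_seconds(batch_idx: int, rng: random.Random) -> int:
    """Pick a delay: a long break every tenth batch, a short one otherwise."""
    if batch_idx > 0 and batch_idx % 10 == 0:
        return rng.randint(70, 100)
    return rng.randint(18, 28)


# Seeded only so that the demonstration is repeatable
rng = random.Random(0)
delays = [pause_seconds(i, rng) for i in range(21)]
print("short delay before batch 0:", delays[0])
print("long breaks at batches 10 and 20:", delays[10], delays[20])
print("total wait for 21 batches (s):", sum(delays))
```

Summing the delays gives a rough estimate of how long a full run will take for a given number of batches.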
Critical notes:
- Replace PLACEHOLDER_COOKIE_FROM_DEVTOOLS, PLACEHOLDER_TOKEN_FROM_API_URL, and PLACEHOLDER_TARGET_ACCOUNT_FAKEID with actual values captured from your Network tab.
- Cookies and tokens expire quickly; refresh them if requests return errors.
- Strictly adhere to WeChat’s terms of service and local regulations regarding web scraping and content usage.
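Because the cookie and token are short-lived, one option is to read the three captured values from environment variables so they can be refreshed without editing the script. This is a minimal sketch under that assumption: the variable names `WECHAT_COOKIE`, `WECHAT_TOKEN`, and `WECHAT_FAKEID` and the helper `load_credentials` are hypothetical, not something the platform or the script defines.

```python
import os
from collections.abc import Mapping

# Hypothetical environment variable names for the three captured values
ENV_NAMES = {
    "cookie": "WECHAT_COOKIE",
    "token": "WECHAT_TOKEN",
    "fakeid": "WECHAT_FAKEID",
}


def load_credentials(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the captured values from env, failing loudly if any is missing."""
    missing = [var for var in ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in ENV_NAMES.items()}


# Demonstration with a plain dict standing in for the real environment
creds = load_credentials(
    {"WECHAT_COOKIE": "c", "WECHAT_TOKEN": "t", "WECHAT_FAKEID": "f"}
)
print(sorted(creds))
```

The returned values could then be copied into the request headers and payload in place of the hard-coded placeholders.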