Scrape WeChat Official Account Articles Using Sogou Search with Selenium and PhantomJS

WeChat official account articles can be accessed through two primary scraping methods: direct extraction of MP (mp.weixin.qq.com) article links, or indirect retrieval via Sogou's dedicated WeChat search engine (weixin.sogou.com). Direct MP links are challenging to obtain due to non-transparent URL patterns and access restrictions, so this implementation leverages Sogou's search interface to dynamically fetch, parse, and store article data locally.

Scraping Methodology

  1. Target Sogou WeChat search. Sogou's WeChat search allows precise lookup of official accounts. To avoid duplicate results (common with Chinese account names), search using the account's unique English identifier; this yields a single, accurate result and minimizes post-processing filtering. The search request URL is constructed with the URL-encoded account name: http://weixin.sogou.com/weixin?type=1&query={encoded_account_name}&ie=utf8&s_from=input&_sug_=n&_sug_type_=

  2. Fetch search results. Use the requests library with a persistent session and a custom User-Agent header to mimic browser traffic, reducing the risk of being blocked. The session also carries cookies and connection state across requests.

  3. Extract the official account homepage. Parse the search-result HTML with PyQuery (a jQuery-inspired HTML parser) to locate the link to the official account's homepage. The target selector is div.txt-box p.tit a, whose href attribute points to the account's dynamic content page. (A minimal sketch of steps 1-3 appears after this list.)

  4. Render dynamic content. Official account homepages load their article lists dynamically via JavaScript, so a static HTTP request cannot capture the full content. Use Selenium with PhantomJS (a headless WebKit browser) to load the page, wait for the JavaScript to execute, and extract the fully rendered DOM.

  5. Parse article metadata and content. Once the full DOM is obtained, use PyQuery to extract the individual article elements. For each article, extract the metadata (title, URL, summary, publication date, cover image), then fetch the full content by rendering the article's own dynamic page.

  6. Persist Data

  • Save each article's full content as an HTML file in a directory named after the official account.
  • Export article metadata to an Excel spreadsheet for tabular analysis, using xlwt (the Python 3-compatible successor to pyExcelerator, with the same Workbook API).
  • Store structured metadata (JSON format) for programmatic access to scraped data.
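
The first three steps reduce to a few lines of code. A minimal sketch, assuming the Sogou URL format and result markup described above (both are liable to change whenever Sogou updates its pages; the account identifier is a placeholder):

from urllib.parse import quote

import requests
from pyquery import PyQuery as pq

account_id = 'python6359'  # placeholder account identifier
search_url = ('http://weixin.sogou.com/weixin?type=1&query=' + quote(account_id)
              + '&ie=utf8&s_from=input&_sug_=n&_sug_type_=')

# A session reuses cookies and connections; the User-Agent mimics a desktop browser.
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
search_html = session.get(search_url, headers=headers, timeout=5).content

# The first search result's title link carries the account homepage URL.
homepage_url = pq(search_html)('div.txt-box p.tit a').attr('href')
print(homepage_url)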

Implementation Code

Basic Version (JSON + HTML Export)

#!/usr/bin/env python3
# coding: utf-8

from urllib.parse import quote
from pyquery import PyQuery as pq
from selenium import webdriver

import requests
import time
import re
import json
import os

class WeChatArticleScraper:
    def __init__(self, account_id):
        self.account_id = account_id
        self.sogou_search_url = f'http://weixin.sogou.com/weixin?type=1&query={quote(self.account_id)}&ie=utf8&s_from=input&_sug_=n&_sug_type_='
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        self.timeout = 5
        self.session = requests.Session()

    def fetch_sogou_search_page(self):
        self.log_with_timestamp(f"Search URL: {self.sogou_search_url}")
        return self.session.get(self.sogou_search_url, headers=self.headers, timeout=self.timeout).content

    def extract_official_account_url(self, search_html):
        doc = pq(search_html)
        return doc('div.txt-box p.tit a').attr('href')

    def render_dynamic_page(self, url):
        # PhantomJS must be on PATH; Selenium dropped PhantomJS support after 3.x.
        browser = webdriver.PhantomJS()
        browser.get(url)
        time.sleep(3)  # crude fixed wait for the JavaScript-rendered content
        full_html = browser.execute_script("return document.documentElement.outerHTML")
        browser.quit()
        return full_html

    def parse_article_list(self, rendered_html):
        doc = pq(rendered_html)
        return doc('div.weui_msg_card')

    def convert_articles_to_list(self, articles):
        articles_list = []
        count = 1
        if articles:
            for post in articles.items():
                self.log_with_timestamp(f"Processing article {count}/{len(articles)}")
                articles_list.append(self.parse_single_post(post))
                count += 1
        return articles_list

    def parse_single_post(self, post):
        post_element = post('.weui_media_box[id]')
        post_title = post_element('h4.weui_media_title').text()
        self.log_with_timestamp(f"Title: {post_title}")
        # The rendered page stores the article path in a non-standard 'hrefs'
        # attribute rather than 'href', so attr('hrefs') is intentional.
        post_url = 'http://mp.weixin.qq.com' + post_element('h4.weui_media_title').attr('hrefs')
        self.log_with_timestamp(f"URL: {post_url}")
        post_summary = post_element('.weui_media_desc').text()
        self.log_with_timestamp(f"Summary: {post_summary}")
        post_date = post_element('.weui_media_extra_info').text()
        self.log_with_timestamp(f"Published: {post_date}")
        cover_image = self.extract_cover_image(post_element)
        post_content = self.fetch_post_content(post_url).html() or ''  # '' if #js_content is missing
        
        self.save_post_as_html(post_title, post_date, post_content)
        return {
            'title': post_title,
            'url': post_url,
            'summary': post_summary,
            'date': post_date,
            'cover_image': cover_image,
            'content': post_content
        }

    def extract_cover_image(self, post_element):
        style_attr = post_element('.weui_media_hd').attr('style')
        pattern = re.compile(r'background-image:url\((.*?)\)')
        # attr() returns None when the element or attribute is absent.
        matches = pattern.findall(style_attr or '')
        image_url = matches[0] if matches else ''
        self.log_with_timestamp(f"Cover Image: {image_url}")
        return image_url

    def fetch_post_content(self, post_url):
        post_html = self.render_dynamic_page(post_url)
        return pq(post_html)('#js_content')

    def save_post_as_html(self, title, date, content):
        dir_path = self.account_id
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        safe_title = title.replace('/', '_')  # titles may contain path separators
        file_name = f"{dir_path}/{safe_title}_{date}.html"
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(content)

    def save_json_data(self, data):
        file_name = f"{self.account_id}/{self.account_id}_articles.json"
        with open(file_name, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)

    def log_with_timestamp(self, message):
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')}: {message}")

    def check_for_block(self, rendered_html):
        return pq(rendered_html)('#verify_change').text() != ''

    def run(self):
        dir_path = self.account_id
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        
        self.log_with_timestamp(f"Starting scrape for account: {self.account_id}")
        search_html = self.fetch_sogou_search_page()
        
        self.log_with_timestamp("Extracting official account URL...")
        official_url = self.extract_official_account_url(search_html)
        self.log_with_timestamp(f"Official Account URL: {official_url}")
        
        self.log_with_timestamp("Rendering dynamic page with PhantomJS...")
        rendered_html = self.render_dynamic_page(official_url)
        
        if self.check_for_block(rendered_html):
            self.log_with_timestamp("Scraper blocked by Sogou. Please try again later.")
            return
        
        self.log_with_timestamp("Parsing article list...")
        articles = self.parse_article_list(rendered_html)
        self.log_with_timestamp(f"Found {len(articles)} articles.")
        
        self.log_with_timestamp("Converting articles to structured data...")
        articles_data = self.convert_articles_to_list(articles)
        
        self.log_with_timestamp("Saving data to JSON file...")
        self.save_json_data(articles_data)
        
        self.log_with_timestamp("Scraping completed successfully.")

if __name__ == '__main__':
    account_name = input("Enter official account English name: ")
    if not account_name:
        account_name = 'python6359'
    scraper = WeChatArticleScraper(account_name)
    scraper.run()
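
The fixed time.sleep(3) in render_dynamic_page is fragile: too short on a slow connection, wasted time on a fast one. Below is a sketch of an explicit-wait variant, assuming the article cards carry the div.weui_msg_card class used above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def render_dynamic_page(url, timeout=10):
    # PhantomJS support requires Selenium 3.x and a phantomjs binary on PATH.
    browser = webdriver.PhantomJS()
    try:
        browser.get(url)
        # Block until the article list has actually rendered, up to `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'div.weui_msg_card'))
        )
        return browser.execute_script("return document.documentElement.outerHTML")
    finally:
        browser.quit()

The try/finally guarantees the PhantomJS process is shut down even if the wait times out.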

Enhanced Version (Excel Support)

#!/usr/bin/env python3
# coding: utf-8

from urllib.parse import quote
from pyquery import PyQuery as pq
from selenium import webdriver
from xlwt import Workbook  # xlwt succeeded pyExcelerator and keeps the same Workbook API

import requests
import time
import re
import json
import os

class WeChatArticleScraper:
    def __init__(self, account_id):
        self.account_id = account_id
        self.sogou_search_url = f'http://weixin.sogou.com/weixin?type=1&query={quote(self.account_id)}&ie=utf8&s_from=input&_sug_=n&_sug_type_='
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        self.timeout = 5
        self.session = requests.Session()
        
        self.excel_headers = ['ID', 'Publish Date', 'Title', 'URL', 'Summary']
        self.workbook = Workbook()
        self.sheet = self.workbook.add_sheet(time.strftime('%Y-%m-%d'))
        for col, header in enumerate(self.excel_headers):
            self.sheet.write(0, col, header)

    def fetch_sogou_search_page(self):
        self.log_with_timestamp(f"Search URL: {self.sogou_search_url}")
        return self.session.get(self.sogou_search_url, headers=self.headers, timeout=self.timeout).content

    def extract_official_account_url(self, search_html):
        doc = pq(search_html)
        return doc('div.txt-box p.tit a').attr('href')

    def render_dynamic_page(self, url):
        # PhantomJS must be on PATH; Selenium dropped PhantomJS support after 3.x.
        browser = webdriver.PhantomJS()
        browser.get(url)
        time.sleep(3)  # crude fixed wait for the JavaScript-rendered content
        full_html = browser.execute_script("return document.documentElement.outerHTML")
        browser.quit()
        return full_html

    def parse_article_list(self, rendered_html):
        doc = pq(rendered_html)
        # Match the outer message cards (as in the basic version) so that
        # parse_single_post can select the inner .weui_media_box[id] element.
        return doc('div.weui_msg_card')

    def convert_articles_to_list(self, articles):
        articles_list = []
        count = 1
        if articles:
            for post in articles.items():
                self.log_with_timestamp(f"Processing article {count}/{len(articles)}")
                articles_list.append(self.parse_single_post(post, count))
                count += 1
        self.workbook.save(f"{self.account_id}/{self.account_id}_articles.xls")
        return articles_list

    def parse_single_post(self, post, row_num):
        post_element = post('.weui_media_box[id]')
        post_title = post_element('h4.weui_media_title').text()
        self.log_with_timestamp(f"Title: {post_title}")
        # The rendered page stores the article path in a non-standard 'hrefs'
        # attribute rather than 'href', so attr('hrefs') is intentional.
        post_url = 'http://mp.weixin.qq.com' + post_element('h4.weui_media_title').attr('hrefs')
        self.log_with_timestamp(f"URL: {post_url}")
        post_summary = post_element('.weui_media_desc').text()
        self.log_with_timestamp(f"Summary: {post_summary}")
        post_date = post_element('.weui_media_extra_info').text()
        self.log_with_timestamp(f"Published: {post_date}")
        cover_image = self.extract_cover_image(post_element)
        post_content = self.fetch_post_content(post_url).html() or ''  # '' if #js_content is missing
        
        self.save_post_as_html(post_title, post_date, post_content)
        self.sheet.write(row_num, 0, row_num)
        self.sheet.write(row_num, 1, post_date)
        self.sheet.write(row_num, 2, post_title)
        self.sheet.write(row_num, 3, post_url)
        self.sheet.write(row_num, 4, post_summary)
        return {
            'title': post_title,
            'url': post_url,
            'summary': post_summary,
            'date': post_date,
            'cover_image': cover_image,
            'content': post_content
        }

    def extract_cover_image(self, post_element):
        style_attr = post_element('.weui_media_hd').attr('style')
        pattern = re.compile(r'background-image:url\((.*?)\)')
        # attr() returns None when the element or attribute is absent.
        matches = pattern.findall(style_attr or '')
        image_url = matches[0] if matches else ''
        self.log_with_timestamp(f"Cover Image: {image_url}")
        return image_url

    def fetch_post_content(self, post_url):
        post_html = self.render_dynamic_page(post_url)
        return pq(post_html)('#js_content')

    def save_post_as_html(self, title, date, content):
        dir_path = self.account_id
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        safe_title = title.replace('/', '_')  # titles may contain path separators
        file_name = f"{dir_path}/{safe_title}_{date}.html"
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(content)

    def save_json_data(self, data):
        file_name = f"{self.account_id}/{self.account_id}_articles.json"
        with open(file_name, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)

    def log_with_timestamp(self, message):
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')}: {message}")

    def check_for_block(self, rendered_html):
        return pq(rendered_html)('#verify_change').text() != ''

    def run(self):
        dir_path = self.account_id
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        
        self.log_with_timestamp(f"Starting scrape for account: {self.account_id}")
        search_html = self.fetch_sogou_search_page()
        
        self.log_with_timestamp("Extracting official account URL...")
        official_url = self.extract_official_account_url(search_html)
        self.log_with_timestamp(f"Official Account URL: {official_url}")
        
        self.log_with_timestamp("Rendering dynamic page with PhantomJS...")
        rendered_html = self.render_dynamic_page(official_url)
        
        if self.check_for_block(rendered_html):
            self.log_with_timestamp("Scraper blocked by Sogou. Please try again later.")
            return
        
        self.log_with_timestamp("Parsing article list...")
        articles = self.parse_article_list(rendered_html)
        self.log_with_timestamp(f"Found {len(articles)} articles.")
        
        self.log_with_timestamp("Converting articles to structured data...")
        articles_data = self.convert_articles_to_list(articles)
        
        self.log_with_timestamp("Saving data to JSON file...")
        self.save_json_data(articles_data)
        
        self.log_with_timestamp("Scraping completed successfully.")

if __name__ == '__main__':
    print("""
    ***************************************** 
    **    WeChat Official Account Scraper  ** 
    **      Created with Python 3          ** 
    *****************************************
    """)
    account_name = input("Enter official account English name: ")
    if not account_name:
        account_name = 'python6359'
    scraper = WeChatArticleScraper(account_name)
    scraper.run()
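
For readers unfamiliar with xlwt, here are the workbook calls used above in isolation (the output file name is arbitrary):

import time
from xlwt import Workbook

workbook = Workbook()
sheet = workbook.add_sheet(time.strftime('%Y-%m-%d'))  # one sheet named after today's date

# Header row at row 0, one sample data row at row 1; write(row, col, value).
for col, header in enumerate(['ID', 'Publish Date', 'Title', 'URL', 'Summary']):
    sheet.write(0, col, header)
sheet.write(1, 0, 1)
workbook.save('articles_demo.xls')  # xlwt writes legacy .xls workbooks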

Key Technical Concepts Covered

  • HTTP Requests: Using requests sessions with custom headers and timeouts for reliable data fetching.
  • HTML Parsing: PyQuery for efficient, jQuery-like traversal of HTML documents.
  • Dynamic Page Rendering: Selenium with PhantomJS to handle JavaScript-heavy content loading.
  • File System Operations: Directory creation and file I/O for storing HTML, Excel, and JSON data.
  • URL Encoding: urllib.parse.quote to safely encode special characters in search queries (see the short example after this list).
  • Logging: Timestamped logging for tracking scraper execution and debugging.
  • Data Serialization: JSON and Excel formats for structured data storage and analysis.
  • Regular Expressions: Extracting cover image URLs from inline CSS styles.
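
A quick illustration of the URL-encoding step (the example strings are arbitrary):

from urllib.parse import quote

print(quote('python6359'))   # ASCII identifiers pass through unchanged
print(quote('微信公众号'))    # non-ASCII text is percent-encoded as UTF-8 bytes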

Tags: Python, Web Scraping, Selenium, PhantomJS, WeChat Official Accounts, PyQuery
