Web Scraping with XPath: Extracting News Headlines from 36Kr

Having previously explored the powerful BeautifulSoup library for HTML parsing and techniques for capturing HTTP requests through online tools, we now turn our attention to another fundamental web scraping approach: XPath. XPath is a query language designed to navigate and select specific portions of XML documents. Although it was originally developed for XML processing, it is equally effective for analyzing HTML documents.

HTML and XML share numerous structural similarities, including tags and attributes, which makes XPath a good fit for identifying elements within HTML documents. A scraper can use XPath expressions to pinpoint where the data lives, then hand the HTML to an XPath-capable parser to extract the required information.
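As a minimal sketch of that idea, the snippet below runs two XPath expressions against a small HTML fragment using lxml. The markup, class names, and paths are invented for illustration and are not taken from 36Kr.

```python
from lxml import html

# Hypothetical HTML document, used only for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div id="news">
      <h2 class="headline">First story</h2>
      <a class="more" href="/p/1">Read more</a>
      <h2 class="headline">Second story</h2>
      <a class="more" href="/p/2">Read more</a>
    </div>
  </body>
</html>
"""

tree = html.fromstring(SAMPLE_HTML)

# text() selects the text nodes of the matched elements.
headlines = tree.xpath("//h2[@class='headline']/text()")

# @href selects the attribute value rather than the element itself.
links = tree.xpath("//div[@id='news']/a/@href")

print(headlines)
print(links)
```

The same expression syntax works whether the input is XML or HTML; only the parser that builds the tree differs.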

Let's dive directly into our scraping task: extracting trending news and performing news searches on 36Kr.

XPath-Based Web Scraping

Even without prior XPath experience, you'll quickly see that it serves the same purpose as BeautifulSoup; only the expression syntax and workflow differ. Before we start scraping, it helps to install an XPath browser extension. Previously, with BeautifulSoup, we inspected the HTML by hand to locate tags and write parsing code, which was time-consuming. A browser plugin lets you copy an XPath expression directly while selecting an element.

XPath Browser Extensions

Multiple browser extensions are available for XPath extraction. These tools make it easy to copy an expression while selecting an element. When you activate the plugin, a visual selector appears and highlights elements with red borders. After you select an element, the plugin presents several candidate XPath expressions so you can choose the most appropriate one.
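When choosing among candidate expressions, one reasonable heuristic is to prefer an attribute-based expression over an absolute path, since absolute paths tend to break when the layout shifts. The sketch below illustrates this with two hypothetical versions of a page; the markup and the helper name first_href are assumptions made for the example.

```python
from lxml import html

# Hypothetical page before and after an unrelated banner <div> is added.
BEFORE = (
    "<html><body><div><ul><li>"
    "<a class='headline' href='/p/1'>Story</a>"
    "</li></ul></div></body></html>"
)
AFTER = (
    "<html><body><div class='banner'>Ad</div><div><ul><li>"
    "<a class='headline' href='/p/1'>Story</a>"
    "</li></ul></div></body></html>"
)

ABSOLUTE = "/html/body/div[1]/ul/li[1]/a"   # depends on the exact layout
BY_ATTRIBUTE = "//a[@class='headline']"     # depends only on the class

def first_href(page_source, expression):
    """Return the href of the first match, or None when nothing matches."""
    matches = html.fromstring(page_source).xpath(expression)
    return matches[0].get("href") if matches else None

print(first_href(BEFORE, ABSOLUTE), first_href(AFTER, ABSOLUTE))
print(first_href(BEFORE, BY_ATTRIBUTE), first_href(AFTER, BY_ATTRIBUTE))
```

In the second page the absolute path points at the banner instead of the list, so it finds nothing, while the attribute-based expression still matches.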

Trending News Extraction

With the tooling in place, we can move on to fetching and parsing the page. First, install the lxml and requests dependencies (for example, with pip install lxml requests), then run the following script:

from lxml import html
import requests

trending_news_collection = []

request_headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}

def fetch_trending_news():
    source_url = 'https://36kr.com/hot-list/catalog'
    api_response = requests.get(url=source_url, headers=request_headers)
    
    # api_response.text is already a decoded string; parse it into an lxml element tree
    parsed_content = html.fromstring(api_response.text)

    # Locate elements using XPath
    titles = parsed_content.xpath("//a[@class='article-item-title weight-bold']")
    descriptions = parsed_content.xpath("//a[@class='article-item-description ellipsis-2']")

    if len(titles) == len(descriptions):
        for title_element, desc_element in zip(titles, descriptions):
            # Extract link and text content
            article_link = title_element.get('href')
            article_title = title_element.text_content().strip()
            article_description = desc_element.text_content().strip()
            
            trending_news_collection.append({
                "headline": article_title,
                "url": article_link,
                "summary": article_description
            })
    else:
        print(f"Count mismatch: {len(titles)} titles vs {len(descriptions)} descriptions")

fetch_trending_news()
print(trending_news_collection)

This implementation extracts article headlines, links, and summaries from 36Kr's trending section. The requests library handles the HTTP communication, lxml parses the HTML, and each article is appended as a dictionary to the module-level trending_news_collection list, which starts out empty.
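One possible refinement, sketched below under assumptions: the exact-match predicate @class='…' fails if the site reorders or adds class names, and pairing two separate lists with zip silently misaligns if one article lacks a summary. The variant matches a single class token and reads each article from its own container. The wrapping div with class article-item is hypothetical, as are the helper names; the real 36Kr markup may differ.

```python
from lxml import html

# Hypothetical markup; the real page structure may differ.
SAMPLE = """
<div class="list">
  <div class="article-item">
    <a class="article-item-title weight-bold" href="/p/1"> Title one </a>
    <a class="article-item-description ellipsis-2">Summary one</a>
  </div>
  <div class="article-item">
    <a class="weight-bold article-item-title" href="/p/2">Title two</a>
  </div>
</div>
"""

def has_class(name):
    """Build an XPath predicate that matches one class token in any position."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"

def parse_items(page_source):
    tree = html.fromstring(page_source)
    articles = []
    for item in tree.xpath(f"//div[{has_class('article-item')}]"):
        # The leading "." keeps each query relative to the current container.
        titles = item.xpath(f".//a[{has_class('article-item-title')}]")
        if not titles:
            continue
        summaries = item.xpath(f".//a[{has_class('article-item-description')}]")
        articles.append({
            "headline": titles[0].text_content().strip(),
            "url": titles[0].get("href"),
            "summary": summaries[0].text_content().strip() if summaries else "",
        })
    return articles

print(parse_items(SAMPLE))
```

Because each title and summary is looked up inside the same container, a missing summary yields an empty string for that article instead of shifting every later pair.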

Common Challenges Encountered

After extracting the trending article titles and links, the natural next step is to visit each URL and retrieve the full article content. However, when I fetched an individual article URL, the expected content was not in the response, for the following reason:

Web page data typically exists in one of two forms: static HTML content (like the recipe pages we parsed previously) or content loaded dynamically through Ajax requests (such as the Tencent Cloud community article lists). Usually, you can find the matching content by searching the page source or the captured requests. In this case, I spent an hour without locating the required information. I first suspected a page redirect and used packet capture tools, but found no relevant data.

On re-examining the static HTML, I noticed something unusual at the end of the page: certain sections appeared to be encrypted. This is likely an anti-scraping measure, so I decided not to attempt decryption.
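The static-versus-dynamic check described above can be sketched as a simple search of the raw HTML for text that is visible in the browser. The function name and both sample pages are hypothetical; in practice the page source would come from response.text.

```python
def appears_in_static_html(page_source, visible_text):
    """Return True when text seen in the browser is present in the raw HTML."""
    # Collapse runs of whitespace so line breaks in the markup do not hide a match.
    normalized_source = " ".join(page_source.split())
    normalized_text = " ".join(visible_text.split())
    return normalized_text in normalized_source

# Hypothetical pages: one server-rendered, one filled in later by a script.
static_page = "<html><body><h1>Market   update</h1></body></html>"
dynamic_page = (
    "<html><body><div id='app'></div>"
    "<script src='/bundle.js'></script></body></html>"
)

print(appears_in_static_html(static_page, "Market update"))
print(appears_in_static_html(dynamic_page, "Market update"))
```

A negative result suggests the content is loaded by a later request or embedded in some encoded form, which is consistent with what I observed on the article pages.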

News Search Functionality

Beyond trending articles, 36Kr offers news search. Let's look at how to implement it. The static search page is often enough for basic information, but retrieving more comprehensive data requires sending Ajax requests.

Here's the implementation:

from lxml import html
from urllib.parse import quote
import requests

def search_articles_by_keyword(search_term):
    # Percent-encode the term so spaces and non-ASCII characters are valid in the URL path
    encoded_term = quote(search_term, safe='')
    search_url = f'https://36kr.com/search/articles/{encoded_term}'
    # request_headers is the dictionary defined in the trending-news script above
    response = requests.get(url=search_url, headers=request_headers)
    
    parsed_doc = html.fromstring(response.text)
    article_elements = parsed_doc.xpath("//p[@class='title-wrapper ellipsis-2']//a")
    
    for link_element in article_elements:
        content_text = link_element.text_content().strip()
        link_address = link_element.get("href")
        print(f"Title: {content_text}")
        print(f"Link: {link_address}")

def advanced_search_api(search_term):
    api_headers = {
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Cookie': 'Hm_lvt_713123c60a0e86982326bae1a51083e1=1710743069; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2218b40a4b8576e0-0508814adc1724-745d5774-2073600-18b40a4b858109a%22%2C%22%24device_id%22%3A%2218b40a4b8576e0-0508814adc1724-745d5774-2073600-18b40a4b858109a%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; Hm_lvt_1684191ccae0314c6254306a8333d090=1710743069; aliyungf_tc=9f944307bb330cb7a00e123533aad0ee8a0e932e77510b0782e3ea63cddc99cf; Hm_lpvt_713123c60a0e86982326bae1a51083e1=1710750569; Hm_lpvt_1684191ccae0314c6254306a8333d090=1710750569',
        'Origin': 'https://36kr.com',
        'Referer': 'https://36kr.com/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }

    payload_data = {
        'partner_id': 'web',
        'timestamp': 1710751467467,
        'param': {
            'searchType': 'article',
            'searchWord': search_term,
            'sort': 'score',
            'pageSize': 20,
            'pageEvent': 1,
            'pageCallback': 'eyJmaXJzdElkIjo5NSwibGFzdElkIjo1MSwiZmlyc3RDcmVhdGVUaW1lIjo3NTU4MSwibGFzdENyZWF0ZVRpbWUiOjIzOTk3LCJsYXN0UGFyYW0iOiJ7XCJwcmVQYWdlXCI6MSxcIm5leHRQYWdlXCI6MixcInBhZ2VOb1wiOjEsXCJwYWdlU2l6ZVwiOjIwLFwidG90YWxQYWdlXCI6MTAsXCJ0b3RhbENvdW50XCI6MjAwfSJ9',
            'siteId': 1,
            'platformId': 2,
        },
    }

    api_response = requests.post(
        'https://gateway.36kr.com/api/mis/nav/search/resultbytype',
        headers=api_headers,
        json=payload_data,
    )
    
    response_data = api_response.json()
    for item in response_data['data']['itemList']:
        processed_title = item['widgetTitle'].replace('<em>', '').replace('</em>', '')
        print(processed_title)
        article_route = item['route']
        print(article_route)

search_articles_by_keyword('OpenAI')
advanced_search_api('artificial intelligence')

We have two functions, search_articles_by_keyword and advanced_search_api, both of which retrieve article information from 36Kr.

  1. search_articles_by_keyword(search_term):
    • URL-encodes the search term
    • Constructs the search URL and sends a GET request for the page content
    • Uses lxml's html module to parse the HTML content
    • Applies XPath to locate elements and extract article titles and URLs
  2. advanced_search_api(search_term):
    • Defines the request headers and JSON payload
    • Sends a POST request to the search API endpoint for article data
    • Processes the returned JSON data to extract article titles and routes
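Two of the steps above, URL-encoding the term and processing the JSON, can be tried without any network access. The sketch below uses only the standard library. The helper names build_search_url and extract_results are hypothetical, and the sample response only mirrors the keys the code above reads (data, itemList, widgetTitle, route); its values are invented.

```python
import re
from urllib.parse import quote

def build_search_url(search_term):
    # safe="" also escapes "/", which would otherwise split the URL path.
    return "https://36kr.com/search/articles/" + quote(search_term, safe="")

def extract_results(response_data):
    """Return (title, route) pairs, tolerating missing keys."""
    items = (response_data.get("data") or {}).get("itemList") or []
    results = []
    for item in items:
        # Remove the <em> highlight tags wrapped around the matched keyword.
        title = re.sub(r"</?em>", "", item.get("widgetTitle", ""))
        results.append((title, item.get("route", "")))
    return results

# Invented sample shaped like the fields accessed in advanced_search_api.
sample_response = {
    "data": {
        "itemList": [
            {"widgetTitle": "<em>OpenAI</em> releases a model", "route": "sample-route-1"},
        ]
    }
}

print(build_search_url("artificial intelligence"))
print(extract_results(sample_response))
print(extract_results({}))
```

Using .get with defaults means an error response without a data key produces an empty list rather than a KeyError.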

Tags: xpath web-scraping lxml python-requests html-parsing

Posted on Fri, 12 Jun 2026 18:32:30 +0000 by 9mm