Web Scraping with HTTP APIs: A Practical Guide to Data Extraction

While traditional web scraping often involves parsing HTML, there is often a more efficient approach: accessing the data directly through HTTP APIs. Many modern web services expose their data via RESTful APIs, typically in JSON format, so this tutorial shows how to extract data by making direct API calls. If you are unfamiliar with REST APIs or JSON, review those concepts first. We'll use the Tencent Cloud Developer Community as the practical example throughout this guide.
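
To make the JSON side concrete, here is a minimal sketch of parsing a response body with Python's standard json module. The payload is made up for illustration: its list and total keys only mirror the shape that the scripts later in this guide expect, and the values are hypothetical.

```python
import json

# Hypothetical response body, shaped like the payloads used later in this guide
raw_body = '{"total": 2, "list": [{"title": "First post"}, {"title": "Second post"}]}'

data = json.loads(raw_body)  # JSON text -> Python dict
titles = [item["title"] for item in data["list"]]

print(data["total"], titles)
```

When using the requests library, response.json() performs the same decoding step on the response body.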

API Endpoint Discovery

Finding the right API endpoints is usually straightforward. Navigate to the Tencent Cloud Community website, open the browser's developer tools (F12), and switch to the Network tab. Reload the page so the requests are recorded, then filter for XHR requests to isolate API calls from static resources. Identifying the correct endpoint may take some investigation, but well-designed APIs usually have descriptive names that indicate their purpose. Let's begin by locating the endpoint for homepage activity data.
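
When the request list is long, the recorded traffic can also be exported from the Network tab as a HAR file and filtered offline. The sketch below assumes the standard HAR layout (log.entries, each with request.url and response.content.mimeType). The function name find_json_endpoints and the sample data are illustrative only and do not come from the Tencent Cloud site.

```python
def find_json_endpoints(har):
    """Return sorted (method, url) pairs whose response was JSON."""
    found = set()
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" in mime.lower():
            request = entry["request"]
            found.add((request["method"], request["url"]))
    return sorted(found)

# Tiny hand-written HAR fragment (hypothetical URLs)
sample_har = {"log": {"entries": [
    {"request": {"method": "GET", "url": "https://example.com/static/app.js"},
     "response": {"content": {"mimeType": "application/javascript"}}},
    {"request": {"method": "POST", "url": "https://example.com/api/articles"},
     "response": {"content": {"mimeType": "application/json; charset=utf-8"}}},
]}}

endpoints = find_json_endpoints(sample_har)
for method, url in endpoints:
    print(method, url)
```

With a real export, the dictionary would come from json.load() on the saved file instead of the hand-written sample.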

Essential Development Tools

A highly useful tool for API scraping automation is https://curlconverter.com/. This utility significantly accelerates the scraping workflow by automatically converting browser requests into code. Instead of manually crafting requests with proper headers and cookies, you can capture the request directly from your browser and generate executable code in your preferred programming language.

To use this tool:

1. Identify the target API request in your browser's network tab
2. Right-click the request and choose Copy > Copy as cURL (bash) (in Edge browser)
3. Paste the copied cURL command into curlconverter.com
4. Select your programming language and generate the code

This approach eliminates the tedious process of manually configuring requests and allows you to focus on data processing.
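
Generated code tends to carry every header the browser sent. The sketch below shows one way to trim them, under the assumption that HTTP/2 pseudo-headers such as authority and browser hint headers starting with sec- can be dropped for a given API. That assumption is not guaranteed: whether an endpoint needs a particular header has to be checked by trial. The helper name trim_headers is hypothetical.

```python
# Headers that browsers add but that a script can often omit (an assumption;
# verify against the endpoint you are calling)
DROP_EXACT = {"authority", "method", "path", "scheme"}
DROP_PREFIXES = ("sec-",)

def trim_headers(headers):
    """Return a copy of headers without pseudo-headers and sec-* hints."""
    kept = {}
    for name, value in headers.items():
        lowered = name.lower().lstrip(":")
        if lowered in DROP_EXACT or lowered.startswith(DROP_PREFIXES):
            continue
        kept[name] = value
    return kept

# Example input resembling converter output (values are illustrative)
generated = {
    "authority": "cloud.tencent.com",
    "accept": "application/json, text/plain, */*",
    "sec-fetch-mode": "cors",
    "content-type": "application/json",
}
trimmed = trim_headers(generated)
print(trimmed)
```

If a request stops working after trimming, restore the removed headers one at a time to find the one the server depends on.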

Scraping Community Homepage

With this methodology, you can extract virtually any available data, provided you avoid triggering anti-bot measures through excessive requests. Let's demonstrate by analyzing popular article categories on the community homepage. Here's an implementation using Python:

import datetime
import requests
from collections import defaultdict

category_stats = defaultdict(int)
article_collection = []
total_articles = 0

def fetch_articles(page_number):
    global total_articles, article_collection
    
    request_headers = {
        'authority': 'cloud.tencent.com',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        'content-type': 'application/json',
        'origin': 'https://cloud.tencent.com',
        'referer': 'https://cloud.tencent.com/developer',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }

    payload = {
        'pageNumber': page_number,
        'pageSize': 50,
        'type': 'recommend'
    }

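    api_response = requests.post(
        'https://cloud.tencent.com/developer/api/home/article-list',
        headers=request_headers,
        json=payload,
        timeout=10  # avoid hanging forever on a stalled connection
    )
    api_response.raise_for_status()  # fail fast on HTTP errors

    data = api_response.json()
    entries = data.get('list', [])
    if not entries:
        return False  # empty page: nothing more to fetch

    for entry in entries:
        process_categories(entry)
        article_collection.append({
            'title': entry['title'],
            'publish_date': entry['createTime'],
            'abstract': entry['summary']
        })

    total_articles = data['total']
    cutoff_date = datetime.datetime(2023, 1, 1)
    cutoff_timestamp = int(cutoff_date.timestamp())

    # Keep going only while the last article on this page is newer than the
    # cutoff (assumes createTime is a Unix timestamp in seconds)
    return article_collection[-1]['publish_date'] >= cutoff_timestamp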

def process_categories(article):
    for tag in article.get('tags', []):
        category_stats[tag['tagName']] += 1

def display_top_categories(limit=10):
    sorted_categories = sorted(category_stats.items(), 
                              key=lambda item: item[1], 
                              reverse=True)
    
    for category, count in sorted_categories[:limit]:
        print(f"{category}: {count} articles")

page_counter = 1
while True:
    should_continue = fetch_articles(page_counter)
    page_counter += 1
    if not should_continue:
        break

display_top_categories()

This script retrieves articles from the API, processes category information, and generates a frequency analysis of popular topics. Additionally, we can extract promotional banner data for comprehensive analysis:

import requests

def retrieve_banners():
    request_headers = {
        'authority': 'cloud.tencent.com',
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'origin': 'https://cloud.tencent.com',
        'referer': 'https://cloud.tencent.com/developer'
    }

    payload = {
        'cate': 'cloud_banner',
        'preview': False
    }

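    response = requests.post(
        'https://cloud.tencent.com/developer/api/common/getAds',
        headers=request_headers,
        json=payload,
        timeout=10  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # fail fast on HTTP errors

    banner_data = response.json()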
    return [{
        'headline': item['content']['pcTitle'],
        'destination': item['content']['url']
    } for item in banner_data['list']]

active_banners = retrieve_banners()
for banner in active_banners:
    print(f"Banner: {banner['headline']} -> {banner['destination']}")
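
Since excessive requests can trigger anti-bot measures, it may help to pause between pages rather than calling the API back to back. The sketch below is one way to do that. The names fetch_all_pages, delay_seconds, max_pages, and the injectable sleep argument are illustrative choices, and fetch_page stands in for a function like fetch_articles that returns a truthy value while more pages remain.

```python
import time

def fetch_all_pages(fetch_page, delay_seconds=1.0, max_pages=100, sleep=time.sleep):
    """Call fetch_page(1), fetch_page(2), ... with a pause between calls.

    Stops when fetch_page returns a falsy value or max_pages is reached.
    Returns the number of pages requested.
    """
    pages_requested = 0
    for page_number in range(1, max_pages + 1):
        pages_requested += 1
        if not fetch_page(page_number):
            break
        sleep(delay_seconds)  # be polite: wait before the next request
    return pages_requested

# Stand-in fetcher for demonstration: pretends there are three pages
def fake_fetch(page_number):
    return page_number < 3

recorded_delays = []
count = fetch_all_pages(fake_fetch, delay_seconds=0.5, sleep=recorded_delays.append)
print(count, recorded_delays)
```

In the article script, a call such as fetch_all_pages(fetch_articles, delay_seconds=2) could replace the bare while loop. The max_pages cap is a safety net against an endpoint that never signals the end.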

Accessing Personal Content

Accessing personal articles requires authentication. While implementing a complete login flow programmatically is possible, it involves complex token management and security considerations. For learning purposes, we'll use session cookies obtained from an authenticated browser session.

The HTTP protocol is stateless, so cookies maintain session continuity across requests. Here's how to access your personal articles using session cookies:

import requests

personal_articles = []

def get_user_articles(page_num):
    request_headers = {
        'authority': 'cloud.tencent.com',
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'cookie': 'your_session_cookie_here',  # Replace with actual cookie
        'origin': 'https://cloud.tencent.com',
        'referer': 'https://cloud.tencent.com/developer/creator/article',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    request_payload = {
        'hostStatus': 0,
        'sortType': 'create',
        'page': page_num,
        'pagesize': 20
    }

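    response = requests.post(
        'https://cloud.tencent.com/developer/api/creator/articleList',
        headers=request_headers,
        json=request_payload,
        timeout=10  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # fail fast on HTTP errors

    data = response.json()
    articles = data.get('list', [])
    if not articles:
        return False  # empty page: stop instead of looping forever

    for article in articles:
        personal_articles.append({
            'title': article['title'],
            'date': article['createTime'],
            'excerpt': article['summary']
        })

    total_count = data['total']
    # Number of articles seen so far, given 20 articles per page
    current_index = (page_num - 1) * 20 + len(articles)

    # True while there are still articles left to fetch
    return current_index < total_count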

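# To obtain your session cookie:
# 1. Log in to the website
# 2. Open developer tools
# 3. Find the cookie in the network request headers
# 4. Copy and paste it into the script

# Fetch pages until the API reports no more articles
page_num = 1
while get_user_articles(page_num):
    page_num += 1

print(f"Fetched {len(personal_articles)} articles")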

Remember to replace 'your_session_cookie_here' with your actual session cookie from the authenticated browser session. This approach provides a balance between functionality and simplicity for educational purposes.
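
The value copied from developer tools is a single Cookie header string. If individual cookies are easier to work with, the string can be split into a dictionary, which requests accepts through its cookies= parameter. A minimal sketch follows; the function name and the cookie names in the sample are made up.

```python
def parse_cookie_header(header_value):
    """Split a raw Cookie header ('a=1; b=2') into a dict."""
    cookies = {}
    for pair in header_value.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies

# Hypothetical cookie string; real names and values will differ
sample = "session_id=abc123; theme=dark; token=a=b"
parsed = parse_cookie_header(sample)
print(parsed)
```

The resulting dictionary could then be passed as requests.post(url, cookies=parsed, ...) in place of the hard-coded 'cookie' header.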

Tags: web scraping API HTTP Requests python Requests Library

Posted on Sat, 04 Jul 2026 17:49:05 +0000 by xdentan
