While traditional web scraping usually means parsing HTML, it is often more efficient to call a site's HTTP API directly. Many modern web services expose their data through RESTful APIs that return JSON, and this tutorial shows how to extract data by calling those APIs yourself. If you are unfamiliar with REST APIs or JSON, review those concepts first. The Tencent Cloud Developer Community serves as the practical example throughout this guide.
API Endpoint Discovery
Finding the right API endpoint takes a little investigation but follows a fixed routine. Navigate to the Tencent Cloud Developer Community website, open the browser's developer tools (F12), switch to the Network tab, and filter by Fetch/XHR to separate API calls from static resources such as images and scripts. Reload the page so that its requests are recorded. Well-designed APIs usually have descriptive endpoint names that indicate their purpose, which narrows the search. Let's begin by locating the endpoint for homepage activity data.
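When several candidate requests appear in the Network tab, looking at the shape of each JSON response is often quicker than reading it raw. The following is a minimal sketch of that idea, assuming you have copied a response body out of the developer tools. The helper name `summarize_shape` and the sample payload are hypothetical and are not taken from the site.

```python
import json

def summarize_shape(value, path="$", depth=3):
    """Return lines describing the key paths and value types of a JSON value."""
    lines = []
    if isinstance(value, dict):
        lines.append(f"{path}: object with {len(value)} keys")
        if depth > 0:
            for key, child in value.items():
                lines.extend(summarize_shape(child, f"{path}.{key}", depth - 1))
    elif isinstance(value, list):
        lines.append(f"{path}: array of {len(value)} items")
        if value and depth > 0:
            # Inspect only the first element; API lists are usually homogeneous.
            lines.extend(summarize_shape(value[0], f"{path}[0]", depth - 1))
    else:
        lines.append(f"{path}: {type(value).__name__}")
    return lines

# Hypothetical sample body, shaped like a typical paginated list response.
sample_body = '{"total": 2, "list": [{"title": "A", "tags": [{"tagName": "x"}]}, {"title": "B", "tags": []}]}'
shape_lines = summarize_shape(json.loads(sample_body))
print("\n".join(shape_lines))
```

A response whose summary shows a `list` array next to a `total` count is a good candidate for a paginated data endpoint.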
Essential Development Tools
A highly useful tool for API scraping automation is https://curlconverter.com/. This utility significantly accelerates the scraping workflow by automatically converting browser requests into code. Instead of manually crafting requests with proper headers and cookies, you can capture the request directly from your browser and generate executable code in your preferred programming language.
To use this tool:
- Identify the target API request in your browser's network tab
- Right-click the request and choose Copy > Copy as cURL (this is the menu path in Edge; on Windows the entry reads "Copy as cURL (bash)")
- Paste the copied request into curlconverter.com
- Select your programming language and generate the code
This approach eliminates the tedious process of manually configuring requests and allows you to focus on data processing.
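To make the conversion less of a black box, here is a minimal sketch of the idea behind it: a copied cURL command already contains the URL, headers, and body, and a converter maps them onto request arguments. This is not how curlconverter.com is implemented. The function `curl_to_request_kwargs` is hypothetical and understands only the `-H`, `-d`/`--data`/`--data-raw`, and `-X` flags; other flags are skipped without their arguments, so it suits only simple commands.

```python
import shlex

def curl_to_request_kwargs(curl_command):
    """Parse a simple cURL command into keyword arguments for requests.request()."""
    # Join backslash-continued lines, then split using shell quoting rules.
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    kwargs = {"method": "GET", "url": None, "headers": {}, "data": None}
    explicit_method = False
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            kwargs["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-d", "--data", "--data-raw"):
            kwargs["data"] = tokens[i + 1]
            i += 2
        elif token in ("-X", "--request"):
            kwargs["method"] = tokens[i + 1].upper()
            explicit_method = True
            i += 2
        elif token.startswith("-"):
            i += 1  # Ignore flags this sketch does not understand.
        else:
            kwargs["url"] = token
            i += 1
    # curl switches to POST when a body is given without an explicit method.
    if kwargs["data"] is not None and not explicit_method:
        kwargs["method"] = "POST"
    return kwargs

# Placeholder command against example.com, in the format the browser copies.
example_command = """curl 'https://example.com/api/items' \\
  -H 'accept: application/json' \\
  -H 'content-type: application/json' \\
  --data-raw '{"page":1}'"""
parsed = curl_to_request_kwargs(example_command)
print(parsed)
```

The resulting dictionary can be passed on as `requests.request(**parsed)`. For real captures, the full converter remains the better choice because it also handles cookies, compression flags, and the many other options this sketch ignores.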
Scraping Community Homepage
With this methodology, you can extract virtually any available data, provided you avoid triggering anti-bot measures through excessive requests. Let's demonstrate by analyzing popular article categories on the community homepage. Here's an implementation using Python:
import datetime
from collections import defaultdict

import requests

# Stop paging once articles published before this date appear.
CUTOFF_TIMESTAMP = int(datetime.datetime(2023, 1, 1).timestamp())

category_stats = defaultdict(int)
article_collection = []
total_articles = 0


def fetch_articles(page_number):
    """Fetch one page of articles; return True if the next page should be fetched."""
    global total_articles
    request_headers = {
        'authority': 'cloud.tencent.com',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        'content-type': 'application/json',
        'origin': 'https://cloud.tencent.com',
        'referer': 'https://cloud.tencent.com/developer',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    payload = {
        'pageNumber': page_number,
        'pageSize': 50,
        'type': 'recommend'
    }
    api_response = requests.post(
        'https://cloud.tencent.com/developer/api/home/article-list',
        headers=request_headers,
        json=payload,
        timeout=10
    )
    api_response.raise_for_status()
    data = api_response.json()
    entries = data.get('list') or []
    for entry in entries:
        process_categories(entry)
        article_collection.append({
            'title': entry['title'],
            'publish_date': entry['createTime'],
            'abstract': entry['summary']
        })
    total_articles = data['total']
    # An empty page means there is nothing left to fetch.
    if not entries:
        return False
    # Assumes createTime is a Unix timestamp in seconds and that each page
    # ends with its oldest article.
    return entries[-1]['createTime'] >= CUTOFF_TIMESTAMP


def process_categories(article):
    """Count every tag attached to an article."""
    for tag in article.get('tags', []):
        category_stats[tag['tagName']] += 1


def display_top_categories(limit=10):
    """Print the most frequent tags in descending order."""
    sorted_categories = sorted(category_stats.items(),
                               key=lambda item: item[1],
                               reverse=True)
    for category, count in sorted_categories[:limit]:
        print(f"{category}: {count} articles")


# Fetch pages until the cutoff date is reached or a page comes back empty.
page_counter = 1
while fetch_articles(page_counter):
    page_counter += 1

display_top_categories()
This script retrieves articles from the API, processes category information, and generates a frequency analysis of popular topics. Additionally, we can extract promotional banner data for comprehensive analysis:
import requests


def retrieve_banners():
    """Return the headline and target URL of each active homepage banner."""
    request_headers = {
        'authority': 'cloud.tencent.com',
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'origin': 'https://cloud.tencent.com',
        'referer': 'https://cloud.tencent.com/developer'
    }
    payload = {
        'cate': 'cloud_banner',
        'preview': False
    }
    response = requests.post(
        'https://cloud.tencent.com/developer/api/common/getAds',
        headers=request_headers,
        json=payload,
        timeout=10
    )
    response.raise_for_status()
    banner_data = response.json()
    return [
        {
            'headline': item['content']['pcTitle'],
            'destination': item['content']['url']
        }
        for item in banner_data['list']
    ]


active_banners = retrieve_banners()
for banner in active_banners:
    print(f"Banner: {banner['headline']} -> {banner['destination']}")
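Both scripts above send their requests back to back, which is the pattern most likely to trigger the anti-bot measures mentioned earlier. Below is a minimal sketch of one way to space requests out, assuming a fixed minimum interval between calls is acceptable. The `Throttle` class and the one-second interval are illustrative choices, not limits documented by the site.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            elapsed = now - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay

# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]

def fake_sleep(seconds):
    fake_now[0] += seconds

throttle = Throttle(1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)
```

In a real script, create `Throttle(1.0)` once with the default clock and call `throttle.wait()` immediately before each `requests.post(...)`. The injectable clock exists only so the behavior can be checked without actually sleeping.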
Accessing Personal Content
Accessing your personal articles requires authentication. Implementing a complete login flow programmatically is possible, but it involves complex token management and security considerations. For learning purposes, we'll use session cookies obtained from an authenticated browser session.
HTTP is stateless, so the server relies on the cookies sent with each request to recognize your logged-in session. Here's how to access your personal articles using session cookies:
import requests

PAGE_SIZE = 20
personal_articles = []


def get_user_articles(page_num):
    """Fetch one page of your own articles; return True if more pages remain."""
    request_headers = {
        'authority': 'cloud.tencent.com',
        'accept': 'application/json, text/plain, */*',
        'content-type': 'application/json',
        'cookie': 'your_session_cookie_here',  # Replace with actual cookie
        'origin': 'https://cloud.tencent.com',
        'referer': 'https://cloud.tencent.com/developer/creator/article',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    request_payload = {
        'hostStatus': 0,
        'sortType': 'create',
        'page': page_num,
        'pagesize': PAGE_SIZE
    }
    response = requests.post(
        'https://cloud.tencent.com/developer/api/creator/articleList',
        headers=request_headers,
        json=request_payload,
        timeout=10
    )
    response.raise_for_status()
    data = response.json()
    articles = data.get('list') or []
    for article in articles:
        personal_articles.append({
            'title': article['title'],
            'date': article['createTime'],
            'excerpt': article['summary']
        })
    # Stop on an empty page so an inaccurate 'total' cannot cause an endless loop.
    if not articles:
        return False
    fetched_so_far = (page_num - 1) * PAGE_SIZE + len(articles)
    return fetched_so_far < data['total']


# To obtain your session cookie:
# 1. Log in to the website
# 2. Open developer tools
# 3. Find the cookie in the network request headers
# 4. Copy and paste it into the script

# Fetch pages until every article has been collected.
page_num = 1
while get_user_articles(page_num):
    page_num += 1
print(f"Fetched {len(personal_articles)} articles")
Remember to replace 'your_session_cookie_here' with your actual session cookie from the authenticated browser session. This approach provides a balance between functionality and simplicity for educational purposes.
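Pasting the cookie into the source file is convenient, but it ties the script to one session. A minimal sketch of an alternative, assuming the copied `cookie` header value is stored in an environment variable instead. The variable name `TENCENT_COOKIE` and the helper `parse_cookie_header` are hypothetical.

```python
import os

def parse_cookie_header(raw_header):
    """Split a 'name=value; name2=value2' cookie header into a dict."""
    cookies = {}
    for pair in raw_header.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, separator, value = pair.strip().partition("=")
        if name and separator:
            cookies[name] = value
    return cookies

# TENCENT_COOKIE is a hypothetical variable name; the fallback value is a
# made-up cookie string that keeps the demo runnable.
raw_cookie = os.environ.get("TENCENT_COOKIE", "sid=abc123; lang=en; token=a=b")
session_cookies = parse_cookie_header(raw_cookie)
print(sorted(session_cookies))
```

The resulting dictionary can be passed as `cookies=session_cookies` to `requests.post(...)` in place of the hard-coded `cookie` header.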