Automating 12306 Train Ticket Queries: A Comparative Study of HTTP Scraping and Selenium Automation
Automating routine tasks on web platforms often requires choosing between lightweight API interaction and robust browser automation. This article details two Python implementations for querying train availability on the 12306 railway system. The first approach leverages HTTP request simulation to interact directly with backend endpoints, while the second employs Selenium to orchestrate browser actions, providing a fallback for complex session handling.
1. API-Based Data Extraction with Requests
The most efficient method for retrieving structured data is to simulate HTTP requests to the platform's internal API. This technique bypasses the rendering overhead of the user interface but requires careful management of request headers and parameter encoding.
1.1. Environment and Data Preparation
The implementation relies on the requests library for network operations and prettytable for tabular output. A crucial prerequisite is a JSON configuration file mapping city names to their official three-letter station codes, which are required for constructing query URLs.
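The expected shape of that mapping file can be sketched as follows; the city names and telecodes here are illustrative examples, not a complete or verified list, and the file path is arbitrary:

```python
import json
import os
import tempfile

# Illustrative city-to-telecode mapping; real 12306 codes should be
# captured from the live site rather than assumed.
city_codes = {"Beijing": "BJP", "Shanghai": "SHH", "Guangzhou": "GZQ"}

# Write the file and reload it exactly as the query class expects.
path = os.path.join(tempfile.gettempdir(), "city_code.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(city_codes, f, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as f:
    station_codes = json.load(f)
```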
1.2. Request Construction and Response Handling
Analysis of the web interface reveals that ticket queries are routed to a specific endpoint returning JSON payloads. To avoid being blocked, the client must mimic a standard browser by including appropriate User-Agent and cookie headers. The query parameters include the travel date, origin code, destination code, and passenger type.
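As a minimal sketch, these parameters can be encoded with the standard library rather than by manual string concatenation; the `leftTicketDTO.*` names and `queryU` path are taken from observed traffic and may change without notice:

```python
from urllib.parse import urlencode

def build_query_url(base_url, travel_date, from_code, to_code):
    """Assemble a query URL with safely encoded parameters."""
    params = {
        "leftTicketDTO.train_date": travel_date,
        "leftTicketDTO.from_station": from_code,
        "leftTicketDTO.to_station": to_code,
        "purpose_codes": "ADULT",
    }
    return f"{base_url}?{urlencode(params)}"

url = build_query_url("https://kyfw.12306.cn/otn/leftTicket/queryU",
                      "2025-06-10", "BJP", "SHH")
```

Using `urlencode` guards against characters that would otherwise need manual escaping in the date or station codes.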
Upon receiving the response, the data is parsed from the JSON dictionary. The raw results contain pipe-delimited strings where specific indices correspond to distinct attributes such as train number, departure time, and seat availability. Mapping these indices to meaningful keys ensures maintainable code.
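The index-mapping idea can be demonstrated on a synthetic record; the hypothetical shortened record here stands in for the much longer real payload, and only three field positions are shown:

```python
# Subset of the index map; the real response has many more segments.
FIELD_MAP = {"train_code": 3, "dep_time": 8, "arr_time": 9}

def parse_record(raw: str) -> dict:
    """Split a pipe-delimited record and map indices to named fields."""
    fields = raw.split("|")
    return {name: fields[idx] for name, idx in FIELD_MAP.items()}

# Synthetic record with values placed at the mapped indices.
cells = [""] * 12
cells[3], cells[8], cells[9] = "G101", "08:00", "13:28"
parsed = parse_record("|".join(cells))
```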
1.3. Implementation Code
The following class-based structure encapsulates the query logic, separating configuration, network requests, and data formatting.
import requests
import json
from prettytable import PrettyTable


class TicketQueryAPI:
    def __init__(self, city_map_path):
        with open(city_map_path, 'r', encoding='utf-8') as f:
            self.station_codes = json.load(f)
        self.base_url = "https://kyfw.12306.cn/otn/leftTicket/queryU"
        self.session = requests.Session()
        # Configure headers to mimic a legitimate browser session
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Referer': 'https://kyfw.12306.cn/otn/leftTicket/init'
        })

    def build_query_url(self, travel_date, origin, destination):
        """Constructs the API URL with encoded station codes."""
        from_code = self.station_codes.get(origin)
        to_code = self.station_codes.get(destination)
        if not from_code or not to_code:
            raise ValueError("Invalid city name; check station code mapping.")
        params = [
            f"leftTicketDTO.train_date={travel_date}",
            f"leftTicketDTO.from_station={from_code}",
            f"leftTicketDTO.to_station={to_code}",
            "purpose_codes=ADULT"
        ]
        return f"{self.base_url}?{'&'.join(params)}"

    def fetch_schedule(self, date, origin, dest):
        """Retrieves and parses ticket data."""
        url = self.build_query_url(date, origin, dest)
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        raw_data = response.json()['data']['result']
        return self._parse_results(raw_data)

    def _parse_results(self, items):
        """Maps pipe-delimited strings to a structured list of records."""
        # Indices correspond to specific fields in the 12306 response
        FIELD_MAP = {
            'train_code': 3, 'dep_time': 8, 'arr_time': 9, 'duration': 10,
            'special_seat': 32, 'first_class': 31, 'second_class': 30,
            'soft_sleeper': 23, 'hard_sleeper': 28, 'hard_seat': 29, 'no_seat': 26
        }
        records = []
        for item in items:
            fields = item.split('|')
            record = {
                'Train': fields[FIELD_MAP['train_code']],
                'Dep': fields[FIELD_MAP['dep_time']],
                'Arr': fields[FIELD_MAP['arr_time']],
                'Dur': fields[FIELD_MAP['duration']],
                'Spl': fields[FIELD_MAP['special_seat']],
                '1st': fields[FIELD_MAP['first_class']],
                '2nd': fields[FIELD_MAP['second_class']],
                'SS': fields[FIELD_MAP['soft_sleeper']],
                'HS': fields[FIELD_MAP['hard_sleeper']],
                'HD': fields[FIELD_MAP['hard_seat']],
                'NS': fields[FIELD_MAP['no_seat']]
            }
            records.append(record)
        return records

    def display_results(self, records):
        """Outputs the parsed data using PrettyTable."""
        tb = PrettyTable()
        tb.field_names = ['Train', 'Dep', 'Arr', 'Dur', 'Spl', '1st', '2nd', 'SS', 'HS', 'HD', 'NS']
        for rec in records:
            tb.add_row([rec[k] for k in tb.field_names])
        print(tb)


# Usage Example
# api = TicketQueryAPI('city_code.json')
# data = api.fetch_schedule('2025-06-10', 'Beijing', 'Shanghai')
# api.display_results(data)
2. Browser Automation with Selenium
While API scraping is efficient, managing session tokens and cookies can become cumbersome if the server enforces strict authentication cycles. Selenium offers an alternative by automating a real browser instance. This approach interacts with the Document Object Model (DOM) directly, handling dynamic content and login flows more naturally.
2.1. WebDriver Configuration
The solution requires the Selenium library and a browser driver (e.g., Edge or Chrome). Key components include:
- WebDriverWait: implements explicit waits to handle asynchronous page loads, ensuring elements exist before interaction.
- Expected conditions: validate element states, such as clickability or visibility.
- Locators: target input fields and buttons via CSS selectors or IDs.
2.2. Automation Workflow
The script initializes the browser and navigates to the login portal. It locates the credential fields, inputs user data, and handles two-factor authentication prompts. Since SMS verification requires human intervention, the script pauses to accept input from the console. Post-login, the automation navigates to the ticket search interface, populates the origin, destination, and date fields, and submits the query.
2.3. Implementation Code
The following example demonstrates a modular automator class. Sensitive credentials are passed in as parameters to avoid hardcoding.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time


class TrainQueryBot:
    def __init__(self, browser_type='edge'):
        self.driver = webdriver.Edge() if browser_type == 'edge' else webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 20)
        self.login_url = "https://kyfw.12306.cn/otn/resources/login.html"

    def authenticate_session(self, username, password, card_suffix):
        """Handles the login sequence including manual verification."""
        self.driver.get(self.login_url)
        # Enter username
        user_input = self.wait.until(EC.presence_of_element_located((By.ID, 'J-userName')))
        user_input.send_keys(username)
        # Enter password and submit
        pwd_input = self.driver.find_element(By.ID, 'J-password')
        pwd_input.send_keys(password)
        self.driver.find_element(By.ID, 'J-login').click()
        # Handle card verification and SMS code
        card_input = self.wait.until(EC.presence_of_element_located((By.ID, 'id_card')))
        card_input.send_keys(card_suffix)
        sms_code = input("Enter SMS verification code: ")
        self.driver.find_element(By.ID, 'code').send_keys(sms_code)
        self.driver.find_element(By.ID, 'sureClick').click()
        # Wait for successful redirect to ticket page
        self.wait.until(EC.presence_of_element_located((By.ID, 'link_for_ticket')))
        print("Authentication successful.")
        time.sleep(2)

    def perform_search(self, origin, destination, travel_date):
        """Fills search form and executes query."""
        self.driver.find_element(By.ID, 'link_for_ticket').click()
        time.sleep(3)  # Allow page transition
        # Populate origin
        self._interact_with_field('fromStationText', origin)
        # Populate destination
        self._interact_with_field('toStationText', destination)
        # Populate date
        self._interact_with_field('train_date', travel_date)
        # Submit query
        self.driver.find_element(By.ID, 'query_ticket').click()
        print("Query submitted. Review results in browser.")

    def _interact_with_field(self, element_id, value):
        """Clears, inputs value, and confirms the selection via Enter key."""
        elem = self.driver.find_element(By.ID, element_id)
        elem.click()
        elem.clear()
        elem.send_keys(value)
        elem.send_keys(Keys.ENTER)
        time.sleep(1)  # Brief pause to allow dropdown/filter updates

    def close(self):
        """Terminates the browser session."""
        self.driver.quit()


# Usage Example
# bot = TrainQueryBot()
# try:
#     bot.authenticate_session('your_username', 'your_password', '1234')
#     bot.perform_search('Guangzhou', 'Chengdu', '2025-07-01')
#     input("Press Enter to close browser...")
# finally:
#     bot.close()
3. Technical Challenges and Resolutions
Implementing these solutions presented several technical hurdles:
- Anti-scraping mechanisms: Initial HTTP requests failed due to missing headers. The solution involved capturing valid Cookie and User-Agent strings from a live browser session and attaching them to the request object.
- Data parsing complexity: The API response does not return a clean JSON object for each seat type; instead, it returns concatenated pipe-delimited strings. Parsing required splitting these strings on the pipe delimiter and mapping array indices to specific seat categories, which called for a robust mapping dictionary to avoid errors.
- Browser lifecycle management: In the Selenium implementation, the browser window closed immediately upon script completion. This was resolved by adding explicit wait commands or blocking input prompts at the end of the execution flow, keeping the driver instance alive for debugging and result inspection.
- Dynamic element loading: Form fields in the query interface are populated dynamically, so implicit waits and hardcoded time.sleep calls proved unreliable. WebDriverWait with specific expected conditions ensured interactions occurred only once elements were fully rendered and clickable.
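One way to harden the index mapping against records shorter than expected is a small lookup helper; this is a defensive sketch under the assumption that malformed or truncated records can appear, and `safe_field` is a hypothetical helper, not part of the original implementation:

```python
def safe_field(fields, index, default="--"):
    """Return the cell at index, or a placeholder when the record is
    shorter than expected or the cell is empty."""
    if 0 <= index < len(fields) and fields[index]:
        return fields[index]
    return default

# A truncated record: only three segments instead of the full payload.
short_record = "G101|08:00|13:28".split("|")
train = safe_field(short_record, 0)   # present -> returned as-is
seat = safe_field(short_record, 30)   # out of range -> placeholder
```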