Session-Based Cookie Handling
When scraping user-specific data, traditional requests.get() calls often fail to retrieve the target information. Consider a scenario where you need to access a user's profile page on a social networking site. Without proper cookie management, you'll receive the login page instead of the authenticated profile data.
The cookie mechanism works as follows: after a successful login, the web server sends session data in a Set-Cookie response header, and the client returns that data in a Cookie header on later requests. HTTP itself is stateless: each request-response cycle completes independently. Cookies let the server recognize a returning client and keep a login session alive.
To access authenticated content, your requests must carry the authentication cookies obtained during login.
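The exchange described above can be sketched with the standard library alone. The header value and the cookie name `sessionid` are made up for illustration; a `requests.Session` object does the same bookkeeping automatically.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value a server might send after a successful login.
set_cookie_header = "sessionid=abc123; Path=/"

# Client side: parse the header and remember the name/value pair.
jar = SimpleCookie()
jar.load(set_cookie_header)

# On the next request the client sends the stored pairs back in a Cookie header.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```

Only the name/value pair travels back to the server; attributes such as `Path` stay on the client and decide which requests the cookie is attached to.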
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
if __name__ == "__main__":
    login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201873958471'
    # Session object automatically manages cookie storage and attachment
    session = requests.Session()
    user_headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    login_form = {
        'email': '17701256561',
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': '7b456e6c3eb6615b2e122a2942ef3845da1f91e3de075179079a3b84952508e4',
        'rkey': '44fd96c219c593f3c9612360c80310a3',
        'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dm7m_NSUp5Ri_ZrK5eNIpn_dMs48UAcvT-N_kmysWgYW%26wd%3D%26eqid%3Dba95daf5000065ce000000035b120219',
    }
    # Initial request stores authentication cookies in session
    session.post(url=login_url, data=login_form, headers=user_headers)
    # Subsequent requests automatically include stored cookies
    profile_url = 'http://www.renren.com/960481378/profile'
    response = session.get(url=profile_url, headers=user_headers)
    response.encoding = 'utf-8'
    with open('./profile.html', 'w', encoding='utf-8') as output:
        output.write(response.text)
Proxy Configuration
Proxy servers act as intermediaries between clients and target servers. In web scraping, proxies serve critical purposes:
- Bypass IP-based rate limiting: Many websites block suspicious IPs after detecting rapid access patterns. Rotating proxy IPs allows continuous scraping operations.
- Geographic access: Some resources are restricted to specific regions. Proxies with appropriate IP addresses can circumvent these limitations.
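Rotating through a pool and skipping proxies that fail might look like the sketch below. The helper name `fetch_with_rotation`, the injected `fetch` callable, and the proxy addresses are assumptions made for illustration; the fake fetch function stands in for a real network call so the example runs offline.

```python
from itertools import cycle


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3):
    """Call fetch(url, proxies) with successive proxies until one succeeds."""
    rotation = cycle(proxy_pool)
    last_error = None
    for _ in range(max_attempts):
        proxies = next(rotation)
        try:
            return fetch(url, proxies)
        except OSError as exc:  # requests' exceptions also derive from OSError
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


def fake_fetch(url, proxies):
    """Stand-in for a network call: the proxy on port 1111 is 'down'."""
    if proxies["http"].endswith(":1111"):
        raise ConnectionError("proxy unreachable")
    return f"fetched {url} via {proxies['http']}"


# Hypothetical proxy addresses.
pool = [{"http": "http://10.0.0.1:1111"}, {"http": "http://10.0.0.2:2222"}]
result = fetch_with_rotation("http://example.com/", pool, fake_fetch)
print(result)
```

With requests, `fetch` could be something like `lambda url, proxies: requests.get(url, proxies=proxies, timeout=10)`.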
Proxy Types
| Type | Direction | Purpose |
|---|---|---|
| Forward Proxy | Client → Proxy (client side) → Server | Hides the client's identity from the server |
| Reverse Proxy | Client → Proxy (server side) → Origin server | Load balancing, server protection |
Implementation
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
import random
if __name__ == "__main__":
    user_agents = [
        {"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
        {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
        {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    ]
    proxy_pool = [
        {"http": "112.115.57.20:3128"},
        {'http': '121.41.171.223:3128'}
    ]
    # Pick a random User-Agent and a random proxy for this request
    selected_ua = random.choice(user_agents)
    selected_proxy = random.choice(proxy_pool)
    target_url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
    response = requests.get(
        url=target_url,
        headers=selected_ua,
        proxies=selected_proxy
    )
    # Only affects response.text; response.content below is written as raw bytes
    response.encoding = 'utf-8'
    with open('result.html', 'wb') as f:
        f.write(response.content)
    # Revert to a direct connection: an empty value disables the proxy for http:// URLs
    requests.get(target_url, proxies={"http": ""})
Popular proxy aggregation sites include Goubanjia and Kuaidaili.
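requests chooses the proxy entry by the scheme of the target URL, so a pool entry that only has an "http" key is not used for https:// targets. A small helper along these lines could build entries that cover both schemes; the name `build_proxies` is hypothetical, and it assumes the proxy itself is reached over plain HTTP.

```python
def build_proxies(host, port, scheme="http"):
    """Return a requests-style proxies dict that covers http and https targets."""
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("112.115.57.20", 3128)
print(proxies)
```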
Concurrent Scraping with Thread Pools
Sequential scraping becomes impractical when there are many resources to fetch. Thread pools run the requests in parallel, which significantly reduces total execution time because most of that time is spent waiting on the network.
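The effect can be seen without touching the network. The sketch below replaces each download with a short `time.sleep` and uses `concurrent.futures.ThreadPoolExecutor` from the standard library, a different API from the `multiprocessing.dummy.Pool` used in this article; the task count and delay are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_download(task_id):
    """Pretend to wait on the network, then return a result."""
    time.sleep(0.1)
    return task_id * 2


tasks = list(range(8))

# One task after another: roughly 8 x 0.1 seconds in total.
start = time.perf_counter()
sequential_results = [simulated_download(t) for t in tasks]
sequential_seconds = time.perf_counter() - start

# All tasks wait at the same time: roughly 0.1 seconds in total.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel_results = list(pool.map(simulated_download, tasks))
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

`pool.map` returns results in the same order as the inputs, so both result lists are identical even though the threads finish independently.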
Standard Sequential Approach
%%time
import requests
import random
from lxml import etree
import re
from fake_useragent import UserAgent
source_url = 'http://www.pearvideo.com/category_1'
ua = UserAgent().random
headers = {'User-Agent': ua}
# Fetch main page
main_page = requests.get(url=source_url, headers=headers).text
# Extract video detail links
parser = etree.HTML(main_page)
video_items = parser.xpath('//div[@id="listvideoList"]/ul/li')
video_links = []
for item in video_items:
    link = 'http://www.pearvideo.com/' + item.xpath('./div/a/@href')[0]
    video_links.append(link)
# Sequential download
for link in video_links:
    page_content = requests.get(url=link, headers=headers).text
    video_src = re.findall('srcUrl="(.*?)"', page_content, re.S)[0]
    video_data = requests.get(url=video_src, headers=headers).content
    filename = f"{random.randint(1, 10000)}.mp4"
    with open(filename, 'wb') as f:
        f.write(video_data)
    print(f"Downloaded: {filename}")
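Two details of the script above are fragile: links are built by string concatenation, and random file names can collide and overwrite an earlier download. One possible alternative, sketched with hypothetical helper names and made-up example URLs, resolves links with `urljoin` and derives the file name from the video URL.

```python
import posixpath
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.pearvideo.com/"


def absolute_link(href):
    """Resolve a relative href against the site root."""
    return urljoin(BASE_URL, href)


def filename_from_url(video_url, default="video.mp4"):
    """Use the last path segment of the URL as the file name."""
    name = posixpath.basename(urlparse(video_url).path)
    return name or default


print(absolute_link("video_123"))
print(filename_from_url("http://example.com/mp4/clip-01.mp4?x=1"))
```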
Optimized Thread Pool Approach
%%time
import requests
import random
from lxml import etree
import re
from fake_useragent import UserAgent
from multiprocessing.dummy import Pool
# Initialize thread pool
worker_pool = Pool()
source_url = 'http://www.pearvideo.com/category_1'
ua = UserAgent().random
headers = {'User-Agent': ua}
# Fetch and parse main page
main_page = requests.get(url=source_url, headers=headers).text
parser = etree.HTML(main_page)
video_items = parser.xpath('//div[@id="listvideoList"]/ul/li')
# Collect detail URLs
detail_links = []
for item in video_items:
    link = 'http://www.pearvideo.com/' + item.xpath('./div/a/@href')[0]
    detail_links.append(link)
# Extract video source URLs
video_sources = []
for link in detail_links:
    page_content = requests.get(url=link, headers=headers).text
    video_src = re.findall('srcUrl="(.*?)"', page_content, re.S)[0]
    video_sources.append(video_src)
# Parallel fetch operations
fetch_func = lambda src: requests.get(url=src, headers=headers).content
downloaded_videos = worker_pool.map(fetch_func, video_sources)
# Parallel storage operations
def persist_video(data):
    filename = f"{random.randint(1, 10000)}.mp4"
    with open(filename, 'wb') as f:
        f.write(data)
    print(f"Saved: {filename}")
worker_pool.map(persist_video, downloaded_videos)
worker_pool.close()
worker_pool.join()
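The thread pool distributes the video downloads and the file writes across concurrent worker threads, which improves throughput when fetching many videos. The detail pages in this version are still requested one at a time; only the download and storage steps run in parallel.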
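In the thread pool script, the detail pages are still requested one after another before the pool is used. A possible refinement, sketched here under assumptions, moves both steps into a single per-video worker. The function names, the injected `fetch_text` and `fetch_bytes` callables, and the fake pages are all hypothetical, and the sketch uses `concurrent.futures.ThreadPoolExecutor` instead of `multiprocessing.dummy.Pool`.

```python
import re
from concurrent.futures import ThreadPoolExecutor

SRC_PATTERN = re.compile(r'srcUrl="(.*?)"')


def resolve_and_download(detail_url, fetch_text, fetch_bytes):
    """Fetch one detail page, find its video URL, and download the video."""
    page = fetch_text(detail_url)
    match = SRC_PATTERN.search(page)
    if match is None:
        return detail_url, None  # no video source found on this page
    return detail_url, fetch_bytes(match.group(1))


def download_all(detail_urls, fetch_text, fetch_bytes, max_workers=8):
    """Run the whole per-video pipeline in worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda url: resolve_and_download(url, fetch_text, fetch_bytes),
            detail_urls,
        ))


# Offline stand-ins for the network: dictionaries keyed by URL.
fake_pages = {
    "http://site/video_1": 'var x; srcUrl="http://cdn/1.mp4" ;',
    "http://site/video_2": "no source here",
}
fake_files = {"http://cdn/1.mp4": b"video-bytes"}

results = download_all(list(fake_pages), fake_pages.__getitem__, fake_files.__getitem__)
print(results)
```

With requests, the two callables could be `lambda u: requests.get(u, headers=headers).text` and `lambda u: requests.get(u, headers=headers).content`.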