Advanced Web Scraping with Python Requests: Session Management, Proxies, and Thread Pools

Session-Based Cookie Handling

When scraping user-specific data, a plain requests.get() call often fails to retrieve the target information because it carries no login state. Consider a scenario where you need to access a user's profile page on a social networking site. Without the cookies issued at login, the server returns the login page instead of the authenticated profile data.

The cookie mechanism works as follows: HTTP itself is stateless, so each request-response cycle completes independently of the others. To maintain state, the server sends the client a cookie (typically a session identifier) in its response, and the client returns that cookie with each later request. This lets the server recognize returning visitors and keep a login session alive.
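A minimal sketch of this behaviour, with no network access: a cookie is placed in a Session's jar by hand (standing in for what a Set-Cookie response header would do), and the prepared request is inspected to see what would be sent. The domains example.com and example.org and the cookie name sessionid are placeholders, not values from any real site.

```python
import requests

session = requests.Session()

# Stand-in for a server's Set-Cookie header: store a cookie scoped to one domain.
# "sessionid", "abc123" and "example.com" are made-up illustration values.
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# prepare_request() builds the outgoing request without sending it,
# merging in any cookies from the jar that match the URL.
same_site = session.prepare_request(
    requests.Request("GET", "http://example.com/profile")
)
other_site = session.prepare_request(
    requests.Request("GET", "http://example.org/")
)

same_site_cookie = same_site.headers.get("Cookie")
other_site_cookie = other_site.headers.get("Cookie")

print("example.com ->", same_site_cookie)
print("example.org ->", other_site_cookie)
```

The cookie is attached only to the request whose host matches the cookie's domain, which is why a login performed through one Session carries over to later requests to the same site.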

To access authenticated content, your requests must carry the authentication cookies obtained during login.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests

if __name__ == "__main__":
    login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201873958471'
    
    # Session object automatically manages cookie storage and attachment
    session = requests.Session()
    
    user_headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    
    login_form = {
        'email': '17701256561',
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': '7b456e6c3eb6615b2e122a2942ef3845da1f91e3de075179079a3b84952508e4',
        'rkey': '44fd96c219c593f3c9612360c80310a3',
        'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dm7m_NSUp5Ri_ZrK5eNIpn_dMs48UAcvT-N_kmysWgYW%26wd%3D%26eqid%3Dba95daf5000065ce000000035b120219',
    }
    
    # Initial request stores authentication cookies in session
    session.post(url=login_url, data=login_form, headers=user_headers)
    
    # Subsequent requests automatically include stored cookies
    profile_url = 'http://www.renren.com/960481378/profile'
    response = session.get(url=profile_url, headers=user_headers)
    response.encoding = 'utf-8'
    
    with open('./profile.html', 'w', encoding='utf-8') as output:
        output.write(response.text)
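
Two small refinements may be worth sketching here. Headers set on session.headers are sent with every request made through that session, so the User-Agent need not be repeated on each call. And because a failed login can still return HTTP 200, it helps to check that the jar actually received a cookie before requesting the profile. The helper looks_logged_in, the cookie name auth_token, and the example.com domain are all hypothetical; the real cookie name has to be read from the site's own responses.

```python
import requests


def looks_logged_in(session, cookie_name):
    """Return True if the session's jar holds a cookie with this name.

    cookie_name is site-specific; the caller has to supply it.
    """
    return any(cookie.name == cookie_name for cookie in session.cookies)


session = requests.Session()

# Session-level defaults are merged into every request made via this session.
session.headers.update({"User-Agent": "Mozilla/5.0 (example scraper)"})

before_login = looks_logged_in(session, "auth_token")

# Stand-in for a successful session.post(login_url, data=login_form):
# the placeholder cookie below is what the server's response would store.
session.cookies.set("auth_token", "xyz", domain="example.com", path="/")
after_login = looks_logged_in(session, "auth_token")

# Build (but do not send) a request to confirm the default header is applied.
prepared = session.prepare_request(
    requests.Request("GET", "http://example.com/profile")
)
print(prepared.headers["User-Agent"], before_login, after_login)
```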

Proxy Configuration

Proxy servers act as intermediaries between clients and target servers. In web scraping, proxies serve critical purposes:

  1. Bypass IP-based rate limiting: Many websites block suspicious IPs after detecting rapid access patterns. Rotating proxy IPs allows continuous scraping operations.
  2. Geographic access: Some resources are restricted to specific regions. Proxies with appropriate IP addresses can circumvent these limitations.
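
Rotation is only useful if a dead or banned proxy is skipped rather than allowed to fail the whole run. The sketch below is one possible way to do that, not a definitive implementation: fetch_with_rotation is a hypothetical helper that tries the proxies in random order and returns the first response that succeeds. The fetch parameter exists only so the logic can be exercised without a network.

```python
import random

import requests


def fetch_with_rotation(url, proxy_pool, fetch=requests.get, timeout=5):
    """Try the proxies in random order and return the first successful response.

    proxy_pool is a list of dicts in the form requests expects for proxies=.
    fetch is injectable so the logic can be exercised without network access.
    """
    candidates = list(proxy_pool)
    random.shuffle(candidates)
    last_error = None
    for proxy in candidates:
        try:
            response = fetch(url, proxies=proxy, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx (e.g. a ban page) as failure
            return response
        except requests.RequestException as error:
            last_error = error  # dead or blocked proxy: move on to the next one
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error
```

A call such as fetch_with_rotation(target_url, proxy_pool) would then replace a bare requests.get(..., proxies=random.choice(proxy_pool)).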

Proxy Types

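Type | Traffic path | Purpose
Forward proxy | Client → Proxy → Server (proxy chosen by the client) | Protects client anonymity by hiding the client's IP from the server
Reverse proxy | Client → Proxy → Server (proxy deployed in front of the server) | Load balancing, server protection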

Implementation

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
import random

if __name__ == "__main__":
    user_agents = [
        {"user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)"},
        {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
        {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    ]
    
    # Keys name the scheme of the target URL; values are full proxy URLs
    proxy_pool = [
        {"http": "http://112.115.57.20:3128"},
        {"http": "http://121.41.171.223:3128"}
    ]
    
    selected_ua = random.choice(user_agents)
    selected_proxy = random.choice(proxy_pool)
    
    target_url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
    
    response = requests.get(
        url=target_url,
        headers=selected_ua,
        proxies=selected_proxy
    )
    # Write the raw bytes as received; response.encoding only affects response.text
    with open('result.html', 'wb') as f:
        f.write(response.content)
    
    # Each requests.get() call takes its own proxies, so nothing needs to be reverted.
    # An empty value forces a direct connection for http:// URLs.
    requests.get(target_url, proxies={"http": ""})
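
The keys of a proxies dictionary are matched against the scheme of the target URL, so a pool that defines only "http" entries leaves https:// requests unproxied. The sketch below uses requests.utils.select_proxy to show which proxy would be chosen for a given URL; no request is sent. The HTTPS variant of the target URL is an illustrative assumption, and whether a particular proxy can actually tunnel HTTPS traffic depends on that proxy.

```python
from requests.utils import select_proxy

http_only = {"http": "http://112.115.57.20:3128"}
both_schemes = {
    "http": "http://112.115.57.20:3128",
    # Same proxy for HTTPS targets, assuming it supports tunnelling
    "https": "http://112.115.57.20:3128",
}

plain_url = "http://www.baidu.com/s?ie=UTF-8&wd=ip"
secure_url = "https://www.baidu.com/s?ie=UTF-8&wd=ip"  # assumed HTTPS variant

chosen_plain = select_proxy(plain_url, http_only)
chosen_secure = select_proxy(secure_url, http_only)  # no "https" key to match
chosen_secure_both = select_proxy(secure_url, both_schemes)

print(chosen_plain, chosen_secure, chosen_secure_both)
```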

Popular proxy aggregation sites include Goubanjia and Kuaidaili.

Concurrent Scraping with Thread Pools

Sequential scraping becomes impractical when dealing with numerous resources. Thread pools let the requests run in parallel, so the time spent waiting on one response overlaps with the others and the total execution time drops.
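
The effect can be seen without any network access. In the sketch below, fake_download is a hypothetical stand-in that simply sleeps, as a thread blocked on a socket would; the same eight "downloads" are run first in a loop and then through a thread-backed pool.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool


def fake_download(seconds):
    """Stand-in for a network request: just waits, as a blocked socket would."""
    time.sleep(seconds)
    return seconds


delays = [0.2] * 8

# One after another: the waits add up
start = time.perf_counter()
sequential_results = [fake_download(d) for d in delays]
sequential_time = time.perf_counter() - start

# Eight worker threads: the waits overlap
start = time.perf_counter()
with Pool(8) as pool:
    pooled_results = pool.map(fake_download, delays)
pooled_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, pooled: {pooled_time:.2f}s")
```

Threads help here because the work is waiting, not computing; CPU-bound work would not see the same gain from a thread pool.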

Standard Sequential Approach

%%time
import requests
import random
from lxml import etree
import re
from fake_useragent import UserAgent

source_url = 'http://www.pearvideo.com/category_1'

ua = UserAgent().random
headers = {'User-Agent': ua}

# Fetch main page
main_page = requests.get(url=source_url, headers=headers).text

# Extract video detail links
parser = etree.HTML(main_page)
video_items = parser.xpath('//div[@id="listvideoList"]/ul/li')

video_links = []
for item in video_items:
    link = 'http://www.pearvideo.com/' + item.xpath('./div/a/@href')[0]
    video_links.append(link)

# Sequential download
for link in video_links:
    page_content = requests.get(url=link, headers=headers).text
    video_src = re.findall('srcUrl="(.*?)"', page_content, re.S)[0]
    
    video_data = requests.get(url=video_src, headers=headers).content
    filename = f"{random.randint(1, 10000)}.mp4"
    
    with open(filename, 'wb') as f:
        f.write(video_data)
        print(f"Downloaded: {filename}")

Optimized Thread Pool Approach

%%time
import requests
import random
from lxml import etree
import re
from fake_useragent import UserAgent
from multiprocessing.dummy import Pool

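# Initialize thread pool: multiprocessing.dummy.Pool is backed by threads,
# and with no argument it sizes itself from the machine's CPU count
worker_pool = Pool()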

source_url = 'http://www.pearvideo.com/category_1'
ua = UserAgent().random
headers = {'User-Agent': ua}

# Fetch and parse main page
main_page = requests.get(url=source_url, headers=headers).text
parser = etree.HTML(main_page)
video_items = parser.xpath('//div[@id="listvideoList"]/ul/li')

# Collect detail URLs
detail_links = []
for item in video_items:
    link = 'http://www.pearvideo.com/' + item.xpath('./div/a/@href')[0]
    detail_links.append(link)

# Extract video source URLs
video_sources = []
for link in detail_links:
    page_content = requests.get(url=link, headers=headers).text
    video_src = re.findall('srcUrl="(.*?)"', page_content, re.S)[0]
    video_sources.append(video_src)

# Parallel fetch operations: each worker downloads one video into memory
def fetch_video(src):
    return requests.get(url=src, headers=headers).content

downloaded_videos = worker_pool.map(fetch_video, video_sources)

# Parallel storage operations
# Note: random names can collide; two videos that draw the same number
# overwrite each other
def persist_video(data):
    filename = f"{random.randint(1, 10000)}.mp4"
    with open(filename, 'wb') as f:
        f.write(data)
        print(f"Saved: {filename}")

worker_pool.map(persist_video, downloaded_videos)
worker_pool.close()
worker_pool.join()

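The thread pool spreads the video downloads across concurrent workers, so the time spent waiting on one server response overlaps with the others. In this version only the video downloads and the file writes run in the pool; the detail pages are still fetched one at a time, and every downloaded video is held in memory until it is written to disk.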
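
One possible variation, sketched under assumptions rather than taken from the original code: give each worker a whole video to handle, stream it to disk in chunks, and name the file after a hash of its URL instead of a random number. The helpers filename_for, stream_to_file, and download_all are hypothetical names, and the worker count, chunk size, and timeout are arbitrary illustration values.

```python
import hashlib
import os
from multiprocessing.dummy import Pool  # thread-backed pool

import requests


def filename_for(url):
    """Derive a stable file name from the video URL (same URL, same name)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return f"{digest}.mp4"


def stream_to_file(url, path, headers=None):
    """Download url to path in chunks so the whole video never sits in memory."""
    with requests.get(url, headers=headers, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    return path


def download_all(urls, out_dir, download=stream_to_file, workers=8):
    """Fetch and save every URL, one pool task per video."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, filename_for(u)) for u in urls]
    with Pool(workers) as pool:
        # starmap() unpacks each (url, path) pair and preserves input order
        return pool.starmap(download, zip(urls, paths))
```

With the names used earlier, a call would look like download_all(video_sources, "videos"). Passing headers would need a small wrapper around stream_to_file, which is omitted here.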

Tags: python web scraping Requests Session Management Cookies

Posted on Fri, 26 Jun 2026 17:27:49 +0000 by Genesis730