Fetching Web Pages with Python's urllib Library for GET Requests

urllib Module Overview

urllib is a package in Python's standard library for working with URLs and making HTTP requests. In Python 3, the two submodules used here are urllib.request, which opens URLs and sends requests, and urllib.parse, which parses and encodes URLs. Together they let a script fetch pages much as a browser would, which is the basis for simple data extraction tasks.

Practical Examples

Example 1: Retrieving Baidu Homepage Content

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

if __name__ == "__main__":
    target_url = 'http://www.baidu.com/'
    # urlopen sends a GET request when no data argument is supplied
    http_response = urllib.request.urlopen(url=target_url)
    # read() returns the response body as bytes
    raw_data = http_response.read()
    print(raw_data)

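Function Reference:

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT): Sends an HTTP request and returns a response object
- url: Target URL (a string or a Request object)
- data: Request payload as bytes; when supplied, the request is sent as POST instead of GET
- timeout: Optional timeout in seconds for blocking operations such as the connection attempt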

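Response Object Attributes and Methods:

response.headers: Response headers
response.getcode(): HTTP status code (for example, 200)
response.geturl(): URL of the resource actually retrieved, which may differ from the requested URL after a redirect
response.read(): Response body (bytes)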
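
The response attributes listed above can be tried out without touching the public internet. The following is a minimal sketch, assuming a throwaway server on 127.0.0.1 as a stand-in for a real site. HelloHandler and inspect_response are hypothetical names introduced only for this illustration, and the proxy bypass is there only so the loopback request is not routed elsewhere.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server so the example runs offline."""

    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def inspect_response(url, timeout=5):
    """Hypothetical helper: collect the commonly used response fields."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return {
            "status": response.getcode(),
            "final_url": response.geturl(),
            "content_type": response.headers.get("Content-Type"),
            "body": response.read(),
        }


# Ignore any proxy configured in the environment for this local request
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

info = inspect_response("http://127.0.0.1:%d/" % server.server_port)
server.shutdown()
server.server_close()

print(info["status"], info["content_type"], len(info["body"]))
```

Against a real site, only the URL passed to inspect_response would change. The timeout argument keeps the script from hanging indefinitely when a server does not answer.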

Example 2: Saving News Page to Local File

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

if __name__ == "__main__":
    target_url = 'http://news.baidu.com/'
    http_response = urllib.request.urlopen(url=target_url)
    # decode() with no argument assumes the body is UTF-8 encoded
    page_content = http_response.read().decode()
    
    with open('./news_data.html', 'w', encoding='utf-8') as file_handle:
        file_handle.write(page_content)
    print('File saved successfully')

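The decode() method converts the response bytes to a Python string. Called with no arguments it assumes UTF-8, so pass the encoding explicitly (for example, decode('gbk')) when the page uses a different one. The resulting string can then be written to a text file.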
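
To make the role of the encoding concrete, here is a small offline sketch. It assumes hand-made byte strings in place of a real response body, and decode_body is a hypothetical helper, not part of urllib.

```python
def decode_body(raw_bytes, declared_charset=None):
    """Hypothetical helper: decode with the declared charset, falling back to UTF-8."""
    return raw_bytes.decode(declared_charset or "utf-8")


sample_text = "周杰伦"
utf8_bytes = sample_text.encode("utf-8")  # 3 bytes per character for this text
gbk_bytes = sample_text.encode("gbk")     # 2 bytes per character for this text

utf8_ok = decode_body(utf8_bytes) == sample_text
gbk_ok = decode_body(gbk_bytes, "gbk") == sample_text

# Decoding GBK bytes as UTF-8 raises an error instead of returning wrong text
try:
    decode_body(gbk_bytes)
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(utf8_bytes), len(gbk_bytes), utf8_ok, gbk_ok, wrong_codec_failed)
```

In a live script, the charset declared by the server can usually be read with http_response.headers.get_content_charset(), which returns None when no charset is declared.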

Example 3: Downloading Binary Image Data

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import ssl

# Disable SSL certificate verification for HTTPS requests.
# This relies on a private ssl attribute and is insecure; use it only for testing.
ssl._create_default_https_context = ssl._create_unverified_context

if __name__ == "__main__":
    image_url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1536918978042&di=172c5a4583ca1d17a1a49dba2914cfb9&imgtype=0&src=http%3A%2F%2Fimgsrc.baidu.com%2Fimgad%2Fpic%2Fitem%2F0dd7912397dda144f04b5d9cb9b7d0a20cf48659.jpg'
    http_response = urllib.request.urlopen(url=image_url)
    binary_data = http_response.read()
    
    with open('./downloaded_image.jpg', 'wb') as file_handle:
        file_handle.write(binary_data)
    print('Image downloaded successfully')

Binary data such as images must be written in 'wb' mode exactly as received, without calling decode().
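
For larger files it may be preferable to copy the body to disk in chunks rather than reading it all into memory first. The sketch below assumes an io.BytesIO object as an offline stand-in for the object returned by urlopen (both expose a read() method). save_binary and the chunk size are hypothetical choices for illustration.

```python
import io
import os
import shutil
import tempfile


def save_binary(source, destination_path, chunk_size=64 * 1024):
    """Hypothetical helper: copy a binary stream to disk chunk by chunk."""
    with open(destination_path, "wb") as file_handle:
        shutil.copyfileobj(source, file_handle, chunk_size)


# Made-up bytes standing in for image data; no decoding is applied anywhere
fake_image = bytes(range(256)) * 4
destination = os.path.join(tempfile.mkdtemp(), "downloaded_image.jpg")
save_binary(io.BytesIO(fake_image), destination)

# Reading the file back in 'rb' mode should give identical bytes
with open(destination, "rb") as file_handle:
    round_trip_ok = file_handle.read() == fake_image
print(round_trip_ok)
```

In a real download, the object returned by urllib.request.urlopen(image_url) would be passed as the source argument.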

Example 4: Handling Non-ASCII Characters in URLs

URLs may contain only ASCII characters. Non-ASCII characters, and reserved characters that appear inside parameter values, must be percent-encoded before the URL is used.

Task: Search Baidu for the query term '周杰伦' (Jay Chou) and save the results.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse

if __name__ == "__main__":
    base_url = 'http://www.baidu.com/s?'
    query_params = {
        'ie': 'utf-8',
        'wd': '周杰伦'
    }
    encoded_params = urllib.parse.urlencode(query_params)
    full_url = base_url + encoded_params
    
    print(full_url)
    
    http_response = urllib.request.urlopen(url=full_url)
    result_data = http_response.read()
    
    with open('./search_results.html', 'wb') as file_handle:
        file_handle.write(result_data)
    print('Results saved successfully')

The urllib.parse.urlencode() function converts a dictionary of parameters into a query string, percent-encoding non-ASCII characters as their UTF-8 bytes.
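
The encoding step can be checked on its own, without sending a request. This is a short sketch using the same parameters as the example above; the 'Jay Chou' value is added here only to show how spaces are handled.

```python
import urllib.parse

query_params = {'ie': 'utf-8', 'wd': '周杰伦'}

# urlencode() builds a complete query string from a dictionary
encoded_params = urllib.parse.urlencode(query_params)
print(encoded_params)  # ie=utf-8&wd=%E5%91%A8%E6%9D%B0%E4%BC%A6

# quote() percent-encodes a single value
quoted_value = urllib.parse.quote('周杰伦')
print(quoted_value)  # %E5%91%A8%E6%9D%B0%E4%BC%A6

# Spaces become '+' in urlencode() and '%20' in quote()
space_in_query = urllib.parse.urlencode({'wd': 'Jay Chou'})
space_in_value = urllib.parse.quote('Jay Chou')
print(space_in_query, space_in_value)  # wd=Jay+Chou Jay%20Chou

# parse_qs() reverses the encoding; each value comes back as a list
decoded_params = urllib.parse.parse_qs(encoded_params)
```

Each Chinese character here occupies three bytes in UTF-8, which is why it expands to three %XX groups.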

Example 5: Custom Request Objects with User-Agent Spoofing

Many websites implement basic bot detection by examining the User-Agent header. By default, urllib identifies itself as Python-urllib/3.x (where 3.x is the running Python version), which servers can easily detect and block.
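
The default identifier can be seen without sending anything. As a minimal sketch, the headers that a default opener attaches to every request are available on its addheaders attribute:

```python
import sys
import urllib.request

# build_opener() returns the same kind of opener that urlopen() uses internally
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value is 'Python-urllib/' followed by the major.minor Python version
print(default_headers)
```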

Creating a custom Request object allows header customization to mimic legitimate browser requests.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

if __name__ == "__main__":
    base_url = 'http://www.baidu.com/s?'
    query_params = {
        'ie': 'utf-8',
        'wd': '周杰伦'
    }
    encoded_params = urllib.parse.urlencode(query_params)
    full_url = base_url + encoded_params
    
    # Browser User-Agent string obtained from developer tools
    request_headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    
    # Construct custom request with headers
    custom_request = urllib.request.Request(url=full_url, headers=request_headers)
    
    # Send the disguised request
    http_response = urllib.request.urlopen(custom_request)
    result_data = http_response.read()
    
    with open('./search_results.html', 'wb') as file_handle:
        file_handle.write(result_data)
    print('Data retrieval complete')

The Request constructor accepts url, headers, and an optional data parameter, which allows more control over the request than passing a bare URL to urlopen.
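
A Request object can be examined before it is sent, which is a convenient way to confirm what will go over the wire. The sketch below builds the same request as Example 5 and, as an assumption for illustration only, a second one with a data payload to show how the method changes. Nothing is transmitted.

```python
import urllib.parse
import urllib.request

full_url = 'http://www.baidu.com/s?' + urllib.parse.urlencode({'ie': 'utf-8', 'wd': '周杰伦'})
request_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

# Without data, the request method is GET
get_request = urllib.request.Request(url=full_url, headers=request_headers)
print(get_request.get_method())  # GET
print(get_request.full_url)

# Request stores header names in capitalized form, so look up 'User-agent'
print(get_request.get_header('User-agent'))

# With data (which must be bytes), the request method becomes POST
form_data = urllib.parse.urlencode({'wd': '周杰伦'}).encode('ascii')
post_request = urllib.request.Request(
    url='http://www.baidu.com/s', data=form_data, headers=request_headers
)
print(post_request.get_method())  # POST
```

Passing either object to urllib.request.urlopen() would send it, as shown in Example 5.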

Posted on Sun, 14 Jun 2026 16:53:18 +0000 by kosmidd
