urllib Module Overview
The urllib package is part of Python's standard library and provides tools for opening and parsing URLs. In Python 3, the two submodules used most often are urllib.request, which sends requests, and urllib.parse, which encodes and decodes URLs. Together they let a script imitate a browser request in order to extract data from web pages.
Practical Examples
Example 1: Retrieving Baidu Homepage Content
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
if __name__ == "__main__":
    target_url = 'http://www.baidu.com/'
    # Send a GET request and read the whole response body as bytes
    http_response = urllib.request.urlopen(url=target_url)
    raw_data = http_response.read()
    print(raw_data)
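Function Reference:
urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT): sends an HTTP request and returns a response object.
url: target URL, given as a string or a Request object
data: request payload as bytes; when supplied, the request is sent as POST instead of GET
timeout: number of seconds to wait for blocking operations such as the connection attempt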
Response Object Methods:
response.headers: response headers
response.getcode(): HTTP status code
response.geturl(): URL of the resource actually retrieved (may differ from the requested URL after a redirect)
response.read(): response body as bytes
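The response attributes listed above can be tried without depending on an external site. The following is a minimal sketch, assuming a throwaway local server: HelloHandler and fetch_summary are hypothetical names introduced only for illustration, and the server binds to any free port on 127.0.0.1. An opener with an empty ProxyHandler is used so the loopback request is not routed through a system proxy; its open() method returns the same kind of response object as urlopen().

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one small UTF-8 HTML page."""

    def do_GET(self):
        body = '<html><body>hello</body></html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the demo quiet by suppressing per-request log lines
        pass


def fetch_summary():
    """Start a local server, fetch its page, and summarize the response."""
    server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Empty ProxyHandler: never route this loopback request through a proxy
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = 'http://127.0.0.1:%d/' % server.server_port
        with opener.open(url, timeout=5) as response:
            return {
                'status': response.getcode(),
                'final_url': response.geturl(),
                'content_type': response.headers['Content-Type'],
                'body': response.read(),
            }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for name, value in fetch_summary().items():
        print(name, '=', value)
```

Opening the response in a with block closes the connection once the body has been read, which is a reasonable habit for the remote examples as well.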
Example 2: Saving News Page to Local File
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
if __name__ == "__main__":
    target_url = 'http://news.baidu.com/'
    http_response = urllib.request.urlopen(url=target_url)
    # decode() turns the response bytes into a str (UTF-8 by default)
    page_content = http_response.read().decode()
    with open('./news_data.html', 'w', encoding='utf-8') as file_handle:
        file_handle.write(page_content)
    print('File saved successfully')
The decode() method converts the response bytes into a str, using UTF-8 unless another encoding is specified, so the content can be written to a file opened in text mode.
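Not every page is served as UTF-8, so a fixed decode() call can fail or produce garbled text. The sketch below is one possible way to handle that, not part of the original example: decode_body is a hypothetical helper that prefers a declared charset and falls back to UTF-8, and the sample pages are made-up byte strings rather than real responses.

```python
def decode_body(raw_bytes, declared_charset=None):
    """Decode response bytes, preferring the charset the server declared."""
    # Fall back to UTF-8 (the default of bytes.decode()) when none is declared
    charset = declared_charset or 'utf-8'
    # errors='replace' substitutes U+FFFD for undecodable bytes instead of raising
    return raw_bytes.decode(charset, errors='replace')


if __name__ == "__main__":
    utf8_page = '<title>新闻</title>'.encode('utf-8')
    gbk_page = '<title>新闻</title>'.encode('gbk')
    print(decode_body(utf8_page))        # no declared charset: decoded as UTF-8
    print(decode_body(gbk_page, 'gbk'))  # declared charset is honored
```

In a real fetch, the declared charset could be read with http_response.headers.get_content_charset(), which returns None when the server does not specify one.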
Example 3: Downloading Binary Image Data
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
import ssl
# Bypass SSL certificate verification for HTTPS requests (insecure: use only for testing)
ssl._create_default_https_context = ssl._create_unverified_context
if __name__ == "__main__":
    image_url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1536918978042&di=172c5a4583ca1d17a1a49dba2914cfb9&imgtype=0&src=http%3A%2F%2Fimgsrc.baidu.com%2Fimgad%2Fpic%2Fitem%2F0dd7912397dda144f04b5d9cb9b7d0a20cf48659.jpg'
    http_response = urllib.request.urlopen(url=image_url)
    # Keep the body as raw bytes; image data must not be decoded
    binary_data = http_response.read()
    with open('./downloaded_image.jpg', 'wb') as file_handle:
        file_handle.write(binary_data)
    print('Image downloaded successfully')
Binary data such as images must be written in 'wb' mode exactly as received, without calling decode().
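For large files, reading the whole body into memory with read() may be wasteful. As a hedged alternative to the example above, the sketch below copies a binary stream to disk in chunks. save_stream is a hypothetical helper, and io.BytesIO stands in for http_response so the sketch runs without network access; the response object is file-like and could be passed in the same way.

```python
import io
import os
import shutil
import tempfile


def save_stream(source, destination_path, chunk_size=64 * 1024):
    """Copy a binary file-like object to disk in chunks; return bytes written."""
    with open(destination_path, 'wb') as file_handle:
        # copyfileobj reads chunk_size bytes at a time until the source is exhausted
        shutil.copyfileobj(source, file_handle, chunk_size)
    return os.path.getsize(destination_path)


if __name__ == "__main__":
    # io.BytesIO stands in for http_response here; both expose read()
    fake_image = io.BytesIO(b'\xff\xd8\xff\xe0' + b'\x00' * 100_000)
    target = os.path.join(tempfile.mkdtemp(), 'downloaded_image.jpg')
    print(save_stream(fake_image, target), 'bytes written')
```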
Example 4: Handling Non-ASCII Characters in URLs
A URL may contain only ASCII characters, so non-ASCII text such as Chinese must be percent-encoded before it is placed in the URL.
Task: Search Baidu for '周杰伦' (Jay Chou) and save the results page.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
if __name__ == "__main__":
    base_url = 'http://www.baidu.com/s?'
    query_params = {
        'ie': 'utf-8',
        'wd': '周杰伦'
    }
    # Percent-encode the parameters and append them to the base URL
    encoded_params = urllib.parse.urlencode(query_params)
    full_url = base_url + encoded_params
    print(full_url)
    http_response = urllib.request.urlopen(url=full_url)
    result_data = http_response.read()
    with open('./search_results.html', 'wb') as file_handle:
        file_handle.write(result_data)
    print('Results saved successfully')
The urllib.parse.urlencode() function converts the dictionary into a query string, percent-encoding non-ASCII characters as UTF-8 bytes by default.
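To see what the encoding step produces without sending a request, the following short sketch runs the same parameters through urlencode() and also shows quote(), unquote(), and parse_qs(), which are standard urllib.parse functions not used in the original example.

```python
import urllib.parse

query_params = {'ie': 'utf-8', 'wd': '周杰伦'}

# urlencode() joins key=value pairs with '&' and percent-encodes each part
encoded_params = urllib.parse.urlencode(query_params)

# quote() percent-encodes a single string using UTF-8 bytes by default
encoded_word = urllib.parse.quote('周杰伦')

# parse_qs() reverses urlencode(), returning a list of values per key
decoded_params = urllib.parse.parse_qs(encoded_params)

# unquote() reverses quote()
decoded_word = urllib.parse.unquote(encoded_word)

if __name__ == "__main__":
    print(encoded_params)  # ie=utf-8&wd=%E5%91%A8%E6%9D%B0%E4%BC%A6
    print(encoded_word)    # %E5%91%A8%E6%9D%B0%E4%BC%A6
    print(decoded_params)
    print(decoded_word)
```

Each Chinese character becomes three percent-encoded bytes because it occupies three bytes in UTF-8.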
Example 5: Custom Request Objects with User-Agent Spoofing
Many websites implement basic bot detection by examining the User-Agent header. By default, urllib identifies itself with a User-Agent of the form Python-urllib/3.x, which servers can easily detect and block.
Creating a custom Request object allows header customization to mimic legitimate browser requests.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
import ssl
# Bypass SSL certificate verification for HTTPS requests
ssl._create_default_https_context = ssl._create_unverified_context
if __name__ == "__main__":
    base_url = 'http://www.baidu.com/s?'
    query_params = {
        'ie': 'utf-8',
        'wd': '周杰伦'
    }
    encoded_params = urllib.parse.urlencode(query_params)
    full_url = base_url + encoded_params
    # Browser User-Agent string obtained from developer tools
    request_headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    # Construct custom request with headers
    custom_request = urllib.request.Request(url=full_url, headers=request_headers)
    # Send the disguised request
    http_response = urllib.request.urlopen(custom_request)
    result_data = http_response.read()
    with open('./search_results.html', 'wb') as file_handle:
        file_handle.write(result_data)
    print('Data retrieval complete')
The Request constructor takes a url plus optional headers and data parameters, allowing more sophisticated request handling.
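A Request object can be inspected before it is sent, which helps confirm that the headers and method are what was intended. The sketch below builds two requests without contacting any server. The search term 'urllib' and the http://example.com/search endpoint are assumptions chosen only for illustration.

```python
import urllib.parse
import urllib.request

# Default identifier that urllib sends when no User-Agent is supplied
default_headers = dict(urllib.request.build_opener().addheaders)

browser_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) '
                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                 'Chrome/68.0.3440.106 Safari/537.36')

# Without data, the request is sent with the GET method
get_request = urllib.request.Request(
    url='http://www.baidu.com/s?' + urllib.parse.urlencode({'wd': 'urllib'}),
    headers={'User-Agent': browser_agent},
)

# data must be bytes; supplying it switches the method to POST.
# The URL below is a hypothetical endpoint used only for illustration.
form_bytes = urllib.parse.urlencode({'wd': 'urllib'}).encode('ascii')
post_request = urllib.request.Request(
    url='http://example.com/search',
    data=form_bytes,
    headers={'User-Agent': browser_agent},
)

if __name__ == "__main__":
    print(default_headers['User-agent'])
    # Request stores header names capitalized, e.g. 'User-agent'
    print(get_request.get_header('User-agent'))
    print(get_request.get_method(), get_request.full_url)
    print(post_request.get_method(), post_request.data)
```

Note that get_header() expects the capitalized form 'User-agent', because Request normalizes header names when they are added.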