This article explores the fundamentals of web scraping using Python, covering aspects from basic browser automation to the powerful Scrapy framework.
Web Scraping with Selenium
Selenium is a popular tool for browser automation, enabling the simulation of user interactions with web pages. The following example demonstrates how to use Selenium with Chrome to log into Douban.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# Initialize the Chrome WebDriver
driver = webdriver.Chrome()
# Navigate to the Douban login page
driver.get("https://www.douban.com")
# Enter username and password
# (the find_element_by_* helpers were removed in recent Selenium 4 releases; use find_element with By)
driver.find_element(By.ID, "form_email").send_keys('your_username')  # Replace with actual username
driver.find_element(By.ID, "form_password").send_keys('your_password')  # Replace with actual password
# Wait for potential CAPTCHA or manual login steps
time.sleep(20)
# Click the login button
driver.find_element(By.CLASS_NAME, "bn-submit").click()
# Keep the browser open for a while
time.sleep(10)
# Close the browser
driver.quit()
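The fixed time.sleep calls in the script pause for the full duration even when the page is ready sooner. A common alternative is to poll for a condition until it holds or a timeout expires, which is the idea behind Selenium's WebDriverWait. The helper below is a minimal standard-library sketch of that polling pattern; wait_until and its parameters are hypothetical names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration; not a Selenium API.
    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for "is the element present yet?": becomes true after about 0.2 s.
    start = time.monotonic()
    ready = wait_until(lambda: time.monotonic() - start > 0.2,
                       timeout=2.0, poll_interval=0.05)
    print("condition met:", ready)
```

With a real driver, the condition could be a lambda that calls driver.find_elements(By.ID, "form_email"), which returns an empty (falsy) list until the element exists.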
Configuring Logging
Effective logging is crucial for debugging and monitoring web scraping tasks. A simple logging setup can be achieved by creating a dedicated module.
Create a file named base_logger.py:
import logging
# Configure basic logging settings
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s'
)
# Get a logger instance
logger = logging.getLogger(__name__)
if __name__ == '__main__':
    # Example usage
    logger.info("This is a message from base_logger.")
Then, import and use this logger in other Python files.
from base_logger import logger
if __name__ == '__main__':
    logger.warning("This is a warning from another script.")
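The format string in base_logger.py decides what each log line looks like. To inspect the result without touching the console or the root logger, the sketch below attaches the same format to an in-memory stream. The logger name "demo", the build_logger helper, and the use of StringIO are assumptions made for illustration.

```python
import io
import logging

# Same format string as in base_logger.py
LOG_FORMAT = '%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s'


def build_logger(stream, name="demo"):
    """Return a logger that writes records in LOG_FORMAT to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers = [handler]   # replace handlers so reruns do not duplicate output
    logger.propagate = False      # keep these records out of the root logger
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.warning("This is a warning from another script.")
    # Prints one line: timestamp, file name and line number, level, message.
    print(buffer.getvalue(), end="")
```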
Understanding the Scrapy Framework
Scrapy is a high-level, Python-based web crawling framework designed for extracting structured data from websites. It is built upon the Twisted asynchronous networking library, which enhances performance by handling network communications efficiently.
Core Components of Scrapy:
Scrapy Engine: Orchestrates the entire process, managing communication between spiders, pipelines, downloaders, and schedulers.
Scheduler: Manages the queue of requests to be processed, deciding the order in which they are fetched.
Downloader: Fetches web page content based on requests from the engine.
Spiders: Custom Python classes responsible for parsing responses and extracting data.
Item Pipeline: Processes extracted data, performing tasks like validation, filtering, and storage.
Downloader Middlewares: Allow for customization of the downloader's behavior.
Spider Middlewares: Enable custom logic to be applied to the communication between the engine and spiders.
Scrapy Workflow:
The Engine receives a request from a Spider.
The Request is added to the Scheduler.
The Engine requests a Request from the Scheduler.
The Downloader fetches the content for the Request, potentially using Downloader Middlewares.
If a download fails, the Request can be rescheduled for a retry (Scrapy's built-in retry downloader middleware handles this).
If successful, the Response is passed to the Spider via the Engine, typically through the parse method.
The Spider processes the Response, extracts data (creating Items), and can yield new Requests for subsequent pages.
Extracted Items are sent to the Item Pipeline for processing, while new Requests are sent back to the Engine to be scheduled.
This cycle repeats until the Scheduler has no more Requests.
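The cycle described in these steps can be mimicked with a few plain functions. The following is a toy, synchronous model for intuition only, not how Scrapy is implemented (Scrapy is asynchronous and far more elaborate). The FAKE_SITE data and every function name here are invented for the example.

```python
from collections import deque

# Fake "web": URL -> (page title, links found on that page). Invented data.
FAKE_SITE = {
    "page1": ("First page", ["page2", "page3"]),
    "page2": ("Second page", ["page3"]),
    "page3": ("Third page", []),
}


def download(url):
    """Downloader stand-in: return a response dict, or None if the URL is unknown."""
    if url not in FAKE_SITE:
        return None
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider stand-in: yield one item (a dict) and follow-up requests (URL strings)."""
    yield {"url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield link


def run_engine(start_urls):
    """Engine stand-in: move requests and items between the other components."""
    scheduler = deque(start_urls)   # queue of pending requests
    seen = set(start_urls)          # simple duplicate filter
    collected = []                  # plays the role of the item pipeline
    while scheduler:                # stop when no requests remain
        url = scheduler.popleft()
        response = download(url)
        if response is None:
            continue                # a real crawler might retry here
        for result in parse(response):
            if isinstance(result, dict):
                collected.append(result)     # item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)     # new request goes to the scheduler
    return collected


if __name__ == "__main__":
    for item in run_engine(["page1"]):
        print(item)
```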
Creating a Scrapy Project
- Project Initialization: Navigate to your desired directory and run the following command:
scrapy startproject mySpider
This command creates a standard Scrapy project structure:
scrapy.cfg: Project configuration file.
mySpider/: The main Python package for your project.
mySpider/items.py: Defines the data structure for extracted items.
mySpider/pipelines.py: Handles item processing after extraction.
mySpider/settings.py: Contains project-specific settings.
mySpider/spiders/: Directory for spider code.
- Generating a Spider: To create a spider for the target domain, use the genspider command:
cd mySpider
scrapy genspider itcast "itcast.cn"
This generates a basic spider file (e.g., mySpider/spiders/itcast.py) with the following structure:
# -*- coding: utf-8 -*-
import scrapy
class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']  # Initial URL(s)
    def parse(self, response):
        # Default parsing method
        pass
- Defining Spider Properties:
name: A unique identifier for the spider.
allowed_domains: A list of domains the spider is allowed to crawl.
start_urls: A list or tuple of URLs from which the spider begins crawling.
parse(self, response): The callback method used to process downloaded responses. It extracts data and can yield new requests.
Modify the spider to target a specific URL and implement the parsing logic. For example, to crawl the Itcast teachers page:
# -*- coding: utf-8 -*-
import scrapy
class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]  # Target URL
    def parse(self, response):
        # Extract teacher names using XPath
        teacher_names = response.xpath('//div[@class="tea_con"]//h3/text()').extract()
        print(teacher_names)
- Running the Spider: Execute the spider from the project's root directory:
scrapy crawl itcast
- Controlling Log Verbosity: To reduce the amount of log output, set the LOG_LEVEL in settings.py:
LOG_LEVEL = "WARNING"
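- Extracting Specific Data: Use XPath or CSS selectors to extract the desired data. The .extract() method returns a list of strings, one per matching node, while .extract_first() returns the first match as a string, or None if nothing matches. Newer Scrapy versions also offer .getall() and .get() as equivalent names.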
Refined parsing logic:
# -*- coding: utf-8 -*-
import scrapy
class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]
    def parse(self, response):
        # Select one <li> per teacher, then query relative to each node
        li_list = response.xpath("//div[@class='tea_con']/div/ul/li")
        for li in li_list:
            item = {}
            item["name"] = li.xpath('.//h3/text()').extract_first()
            item["title"] = li.xpath(".//h4/text()").extract_first()
            item["profile"] = li.xpath(".//p/text()").extract_first()
            print(item)
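The difference between "every match as a list" and "first match or None" can be tried without running a spider. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in; it supports only a subset of XPath and is not what Scrapy uses internally. The markup and the helper names extract_all and extract_first_text are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup loosely shaped like a teacher list.
MARKUP = """
<div class="tea_con">
  <ul>
    <li><h3>Teacher A</h3><h4>Senior Lecturer</h4></li>
    <li><h3>Teacher B</h3></li>
  </ul>
</div>
"""


def extract_all(node, path):
    """Return the text of every match as a list (empty list if none match)."""
    return [element.text for element in node.findall(path)]


def extract_first_text(node, path):
    """Return the text of the first match, or None if nothing matches."""
    return node.findtext(path)


if __name__ == "__main__":
    root = ET.fromstring(MARKUP)
    print(extract_all(root, ".//h3"))
    for li in root.findall("./ul/li"):
        # The second <li> has no <h4>, so its title comes back as None.
        print(extract_first_text(li, "./h3"), extract_first_text(li, "./h4"))
```

Because a missing field yields None rather than raising an error, code that consumes the items should be prepared to handle None values.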
- Using Item Pipelines: Items are typically passed to pipelines for further processing.
Modify the parse method to yield items:
# -*- coding: utf-8 -*-
import scrapy
class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]
    def parse(self, response):
        li_list = response.xpath("//div[@class='tea_con']/div/ul/li")
        for li in li_list:
            item = {}
            item["name"] = li.xpath('.//h3/text()').extract_first()
            item["title"] = li.xpath(".//h4/text()").extract_first()
            item["profile"] = li.xpath(".//p/text()").extract_first()
            yield item  # Yield the item to the pipeline
Implement the item processing logic in pipelines.py:
class ItcastPipeline(object):
    def process_item(self, item, spider):
        # Called once for every item the spider yields
        print(item)
        return item
Enable pipelines in settings.py by uncommenting and configuring ITEM_PIPELINES.
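Printing confirms that items arrive, but a pipeline more commonly stores them. The sketch below writes each item as one JSON line. It relies on the open_spider and close_spider hooks that Scrapy calls on a pipeline when a spider starts and finishes. The class name JsonLinesPipeline and the default file name teachers.jsonl are assumptions, and the class would still need its own entry in ITEM_PIPELINES.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write every item to a file as one JSON object per line (illustrative)."""

    def __init__(self, path="teachers.jsonl"):
        self.path = path   # assumed output location
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the output file once.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese names) readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline


if __name__ == "__main__":
    # Exercise the pipeline by hand, without Scrapy, in a temporary directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "teachers.jsonl")
    pipeline = JsonLinesPipeline(demo_path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"name": "Teacher A", "title": "Lecturer"}, spider=None)
    pipeline.close_spider(spider=None)
    with open(demo_path, encoding="utf-8") as fh:
        print(fh.read(), end="")
```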
- Prioritizing Pipelines: Multiple pipelines can be used; they run in order of their priority value (lower numbers execute first).
Add another pipeline and configure priorities in settings.py:
# pipelines.py
class ItcastPipeline(object):
    def process_item(self, item, spider):
        item['processed_by'] = 'Pipeline1'
        return item
class ItcastPipeline2(object):
    def process_item(self, item, spider):
        print(item)  # Processed item from previous pipeline
        return item
# settings.py
ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastPipeline': 300,
    'mySpider.pipelines.ItcastPipeline2': 600,
}
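To make the ordering concrete, here is a small stand-alone model of what "lower numbers execute first" implies: the pipelines are sorted by priority and each one receives the item returned by the one before it. This is a sketch for intuition, not Scrapy's actual pipeline manager; run_pipelines, AddTag, and Recorder are hypothetical names.

```python
class AddTag:
    """Mimics the first pipeline: marks the item."""

    def process_item(self, item, spider):
        item["processed_by"] = "Pipeline1"
        return item


class Recorder:
    """Mimics the second pipeline: records what it receives."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines_with_priority, spider=None):
    """Pass `item` through the pipelines in ascending priority order."""
    ordered = sorted(pipelines_with_priority, key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
    return item


if __name__ == "__main__":
    recorder = Recorder()
    # Listed out of order on purpose: priority, not position, decides the sequence.
    result = run_pipelines({"name": "Teacher A"}, [(recorder, 600), (AddTag(), 300)])
    print(recorder.seen)
    print(result)
```

Because the recorder has the higher number, it sees the item only after the tag has been added.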
- Logging to a File: Configure Scrapy to log output to a file by setting LOG_FILE in settings.py:
LOG_FILE = './log.log'
Add logging statements within the spider's parse method:
# -*- coding: utf-8 -*-
import scrapy
import logging
logger = logging.getLogger(__name__)
class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']
    def parse(self, response):
        li_list = response.xpath("//div[@class='tea_con']/div/ul/li")
        for li in li_list:
            item = {}
            item["name"] = li.xpath('.//h3/text()').extract_first()
            item["title"] = li.xpath(".//h4/text()").extract_first()
            item["profile"] = li.xpath(".//p/text()").extract_first()
            logger.warning(item)  # Log the extracted item
            yield item
Running the spider will create a log.log file in the project directory containing the logged warnings.
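Combined with LOG_LEVEL = "WARNING", the practical effect is that only records at WARNING or above end up in the file. The sketch below reproduces that effect with the standard logging module alone, so it can be run without a Scrapy project. It is not Scrapy's own logging setup; make_file_logger, the logger name, and the temporary path are assumptions for illustration.

```python
import logging
import os
import tempfile


def make_file_logger(path, level=logging.WARNING, name="file_demo"):
    """Return (logger, handler); the handler writes records >= `level` to `path`."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setLevel(level)           # threshold, comparable in spirit to LOG_LEVEL
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)    # let the handler do the filtering
    logger.handlers = [handler]       # avoid duplicate handlers on reruns
    logger.propagate = False          # keep output out of the console
    return logger, handler


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "log.log")
    logger, handler = make_file_logger(log_path)
    logger.info("below the threshold, so not written")
    logger.warning({"name": "example item"})
    handler.close()
    with open(log_path, encoding="utf-8") as fh:
        print(fh.read(), end="")
```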