Automating Duplicate File Management with Python
When managing large collections of files, duplicates can accumulate and consume unnecessary storage space. This solution demonstrates how to use Python to automatically identify and relocate duplicate files to a designated directory.
Implementation Approach
The following Python script implements a duplicate file detection system that:
- Traverses specified directories recursively
- Calculates file hashes for comparison
- Identifies duplicate files based on content
- Relocates duplicates to a dedicated folder
Required Modules
Before implementing the solution, ensure you have the following Python modules available. All of them ship with the standard library (note that the script uses the `:=` walrus operator, so Python 3.8 or newer is required):

- os - For file system operations
- hashlib - For generating file hashes
- shutil - For file movement operations
- pathlib - For object-oriented path handling
Code Implementation
```python
import os
import hashlib
import shutil
from pathlib import Path


def relocate_duplicate_files(base_directory):
    """
    Scans a directory for duplicate files and moves them to a duplicates subfolder.

    Args:
        base_directory (str): Path to the directory to scan for duplicates
    """
    file_hash_map = {}
    duplicates_container = Path(base_directory) / "duplicate_files"
    duplicates_container.mkdir(exist_ok=True)

    for current_dir, _, files in os.walk(base_directory):
        for file in files:
            file_path = Path(current_dir) / file

            # Skip files already in the duplicates directory
            if duplicates_container in file_path.parents:
                continue

            file_hash = calculate_file_hash(file_path)

            if file_hash in file_hash_map:
                # File is a duplicate, relocate it
                destination = duplicates_container / file_path.name
                move_file_with_unique_name(file_path, destination)
            else:
                # First occurrence of this file
                file_hash_map[file_hash] = file_path


def calculate_file_hash(file_path, chunk_size=8192):
    """
    Calculates the MD5 hash of a file.

    Args:
        file_path (Path): Path to the file
        chunk_size (int): Size of chunks to read for hashing

    Returns:
        str: Hexadecimal MD5 digest of the file's contents
    """
    hasher = hashlib.md5()
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def move_file_with_unique_name(source, destination):
    """
    Moves a file to destination, appending a number if a file with the
    same name already exists there.

    Args:
        source (Path): Source file path
        destination (Path): Destination file path (including the file name)
    """
    if not destination.exists():
        shutil.move(str(source), str(destination))
        return

    # Handle filename conflicts by appending a number
    counter = 1
    base_name = destination.stem
    extension = destination.suffix
    while True:
        new_name = f"{base_name}_{counter}{extension}"
        new_path = destination.with_name(new_name)
        if not new_path.exists():
            shutil.move(str(source), str(new_path))
            break
        counter += 1


# Example usage
if __name__ == "__main__":
    target_directory = input("Enter the directory path to scan for duplicates: ")
    relocate_duplicate_files(target_directory)
    print("Duplicate file relocation completed.")
```
Code Explanation
The solution consists of three main functions:
- relocate_duplicate_files: The primary function that orchestrates the duplicate detection and relocation process. It creates a hash map to track unique files and identifies duplicates based on content hash matches.
- calculate_file_hash: Computes the MD5 hash of files in chunks to efficiently handle large files without excessive memory usage.
- move_file_with_unique_name: Relocates duplicate files to the designated folder while handling filename conflicts by appending numeric suffixes.
Testing the Implementation
To test this solution:
- Create a test directory with various files, including intentional duplicates
- Run the script and provide the directory path when prompted
- Verify that all duplicate files have been moved to the "duplicate_files" subdirectory
- Check that original files remain in their original locations
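These steps can also be scripted. The following self-contained sketch builds a throwaway test tree with one intentional duplicate and runs a condensed copy of the relocation logic (inlined here so the snippet runs on its own; in practice you would import `relocate_duplicate_files` from your script, and the `dup_test_` prefix is arbitrary):

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


# Condensed inline copy of the article's relocation logic
def relocate_duplicate_files(base_directory):
    file_hash_map = {}
    duplicates_container = Path(base_directory) / "duplicate_files"
    duplicates_container.mkdir(exist_ok=True)
    for current_dir, _, files in os.walk(base_directory):
        for name in files:
            file_path = Path(current_dir) / name
            if duplicates_container in file_path.parents:
                continue
            digest = hashlib.md5(file_path.read_bytes()).hexdigest()
            if digest in file_hash_map:
                shutil.move(str(file_path), str(duplicates_container / file_path.name))
            else:
                file_hash_map[digest] = file_path


# Build a throwaway test tree with one intentional duplicate
test_dir = Path(tempfile.mkdtemp(prefix="dup_test_"))
(test_dir / "a.txt").write_text("same content")
(test_dir / "b.txt").write_text("same content")   # duplicate of a.txt
(test_dir / "c.txt").write_text("unique content")

relocate_duplicate_files(test_dir)

moved = [p.name for p in (test_dir / "duplicate_files").iterdir()]
print(len(moved))  # → 1: one of the two identical files was relocated
```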
Optimization Considerations
For enhanced performance with large file collections:
- Consider implementing file size filtering before hash calculation
- Add parallel processing for multi-core systems
- Implement progress tracking for large directories
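The first suggestion follows from a simple observation: files with different sizes cannot be identical, so grouping by size first lets you skip hashing anything whose size is unique. A minimal sketch of that pre-pass (the helper names here are illustrative, not part of the script above):

```python
import os
from collections import defaultdict
from pathlib import Path


def group_files_by_size(base_directory):
    """Group file paths by byte size; only same-size files can be duplicates."""
    size_map = defaultdict(list)
    for current_dir, _, files in os.walk(base_directory):
        for name in files:
            path = Path(current_dir) / name
            size_map[path.stat().st_size].append(path)
    return size_map


def candidate_duplicates(base_directory):
    """Yield only files worth hashing: those sharing a size with another file."""
    for paths in group_files_by_size(base_directory).values():
        if len(paths) > 1:
            yield from paths
```

Feeding only `candidate_duplicates` into the hashing step avoids reading the contents of files that cannot possibly have a twin.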
Customization Options
This solution can be adapted to various use cases:
- Modify the hash algorithm (SHA-1, SHA-256) for different collision probabilities
- Change the relocation behavior (copy instead of move)
- Add file type filtering to process only specific extensions
- Implement logging for tracking duplicate detection activities
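The first and third options combine naturally: `hashlib.new` accepts an algorithm name, and a suffix check handles extension filtering. A hedged sketch (the `algorithm` and `allowed_extensions` parameters are additions for illustration, not part of the original script):

```python
import hashlib
from pathlib import Path


def calculate_file_hash(file_path, algorithm="sha256", chunk_size=8192):
    """Hash a file with a configurable algorithm, e.g. 'md5', 'sha1', 'sha256'."""
    hasher = hashlib.new(algorithm)
    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def should_process(file_path, allowed_extensions=(".jpg", ".png", ".pdf")):
    """File-type filter: only consider paths whose suffix is in the whitelist."""
    return Path(file_path).suffix.lower() in allowed_extensions
```

In `relocate_duplicate_files`, guarding the hashing step with `should_process(file_path)` would then restrict the scan to the listed extensions.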