Introduction
This article discusses using Python to extract five common archive formats:
- .gz
- .tar
- .tgz
- .zip
- .rar
Format Overview
gz (gzip): Typically compresses a single file. Often used with tar to first bundle files, then compress.
tar: A bundling tool on Linux systems that packages files without compressing them.
tgz (tar.gz): Created by first bundling with tar, then compressing with gzip.
zip: Unlike gzip, it can bundle and compress multiple files, but it compresses each file individually. Its compression ratio is generally lower than that of a gzip-compressed tar archive (tar.gz).
rar: A bundling and compression format originally for DOS, now primarily used on Windows. Offers higher compression than zip but slower processing and random access speeds.
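The difference between per-file compression (zip) and whole-stream compression (tar.gz) can be checked with a small experiment. The following is a minimal sketch, not a benchmark: the function name compare_archive_sizes and the sample data are hypothetical, and the result depends heavily on how similar the files are to each other.

```python
import io
import tarfile
import zipfile


def compare_archive_sizes(file_count=50):
    """Return (zip_size, tar_gz_size) in bytes for the same set of small files."""
    # Hypothetical sample data: many small files with near-identical content.
    payload = b"timestamp,level,message\n" + b"2024-01-01,INFO,service started\n" * 10
    files = {f"log_{i:03d}.csv": payload for i in range(file_count)}

    # zip: each member is deflated on its own.
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)

    # tar.gz: members are bundled first, then compressed as one stream.
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))

    return len(zip_buffer.getvalue()), len(tar_buffer.getvalue())


zip_size, tar_gz_size = compare_archive_sizes()
print(f"zip: {zip_size} bytes, tar.gz: {tar_gz_size} bytes")
```

With many small, similar files, the tar.gz result should come out smaller, because gzip can reuse patterns across file boundaries while zip cannot.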
Extracting gz Files
Since gz typically contains a single file, extraction involves reading that file:
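import gzip
import shutil

def extract_gz(archive_path):
    """Extract a gzip-compressed file next to the archive"""
    # Strip only a trailing ".gz"; str.replace would also remove ".gz" elsewhere in the path
    if archive_path.endswith(".gz"):
        output_name = archive_path[:-3]
    else:
        # Assumed fallback name for inputs without a ".gz" suffix
        output_name = archive_path + ".out"
    with gzip.open(archive_path, 'rb') as compressed_file:
        with open(output_name, 'wb') as output_file:
            # Copy in chunks so large files are not loaded into memory at once
            shutil.copyfileobj(compressed_file, output_file)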
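When the goal is only to read the data, writing an extracted copy to disk is not strictly necessary, since gzip.open can also open the stream in text mode. A minimal sketch, assuming the compressed file is UTF-8 text (the function name count_lines_in_gz is hypothetical):

```python
import gzip


def count_lines_in_gz(archive_path, encoding="utf-8"):
    """Count lines in a gzip-compressed text file without extracting it."""
    line_count = 0
    # 'rt' opens the stream in text mode and decompresses as it reads.
    with gzip.open(archive_path, "rt", encoding=encoding) as text_file:
        for _ in text_file:
            line_count += 1
    return line_count
```

The same loop can process each line instead of counting it, which keeps memory use flat for large files.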
Extracting tar Files
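Decompressing a .tar.gz file yields a .tar archive, which still has to be unpacked. Note: .tgz is equivalent to .tar.gz. Because tarfile.open in 'r' mode detects compression transparently, the function below also accepts .tar.gz and .tgz files directly.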
import tarfile
import os

def extract_tar(archive_path):
    """Extract a tar archive"""
    output_dir = archive_path + "_extracted"
    # exist_ok=True replaces the separate existence check
    os.makedirs(output_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r') as archive:
        # Only extract archives from trusted sources: member paths can point outside output_dir
        archive.extractall(path=output_dir)
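Because extractall writes whatever paths the archive contains, an archive from an untrusted source may try to place files outside the output directory. The following is a rough sketch of a pre-check, not a complete defense: the function name find_unsafe_members is hypothetical, and it inspects member names only, so link members would need separate handling. Recent Python versions (3.12 and later) also accept a filter argument on extractall, such as filter="data", which is likely the better option where available.

```python
import os
import tarfile


def find_unsafe_members(archive_path, output_dir):
    """Return names of members that would be written outside output_dir."""
    base = os.path.realpath(output_dir)
    unsafe = []
    with tarfile.open(archive_path, "r") as archive:
        for member in archive.getmembers():
            # Resolve where the member would land; absolute names and ".." escape base.
            target = os.path.realpath(os.path.join(base, member.name))
            if os.path.commonpath([base, target]) != base:
                unsafe.append(member.name)
    return unsafe
```

A caller could refuse to extract when the returned list is not empty.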
Extracting zip Files
Zip extraction follows the same pattern as tar extraction, using the zipfile module:
import zipfile
import os

def extract_zip(archive_path):
    """Extract a zip archive"""
    output_dir = archive_path + "_extracted"
    # exist_ok=True replaces the separate existence check
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path, 'r') as archive:
        archive.extractall(path=output_dir)
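Since zip stores each file separately, individual members can be listed and extracted without unpacking everything. A minimal sketch, assuming members are selected by file name suffix (the function name extract_matching and the default suffix are hypothetical):

```python
import zipfile


def extract_matching(archive_path, output_dir, suffix=".txt"):
    """Extract only members whose names end with suffix; return their names."""
    with zipfile.ZipFile(archive_path, "r") as archive:
        # namelist() returns the stored member names in archive order.
        wanted = [name for name in archive.namelist() if name.endswith(suffix)]
        archive.extractall(path=output_dir, members=wanted)
    return wanted
```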
Extracting rar Files
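RAR extraction requires the third-party rarfile package, which in turn relies on an external extraction tool such as unrar being installed on the system. Install the package via pip: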
pip install rarfile
import rarfile
import os

def extract_rar(archive_path):
    """Extract a rar archive"""
    output_dir = archive_path + "_extracted"
    # exist_ok=True replaces the separate existence check
    os.makedirs(output_dir, exist_ok=True)
    with rarfile.RarFile(archive_path) as archive:
        archive.extractall(path=output_dir)
Creating tar Archives
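When adding files to a tar archive, the arcname parameter sets the name stored in the archive, so the full filesystem path does not have to be preserved. Storing only the base file name would make files with the same name in different subdirectories collide, so this example stores each path relative to source_dir:
import tarfile
import os
import time

def create_tar_archive(source_dir, output_path):
    """Create a tar archive from a directory"""
    start_time = time.time()
    with tarfile.open(output_path, 'w') as archive:
        for root, dirs, files in os.walk(source_dir):
            for file in files:
                full_path = os.path.join(root, file)
                # Store the path relative to source_dir instead of the full path
                archive.add(full_path, arcname=os.path.relpath(full_path, source_dir))
    elapsed = time.time() - start_time
    print(f"Archive created in {elapsed:.2f} seconds")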
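To see what arcname actually stores, the member names can be read back from the archive. The following sketch builds the archive in memory and compares two naming choices. The function name stored_names and its flatten flag are hypothetical and exist only for this illustration.

```python
import io
import os
import tarfile


def stored_names(source_dir, flatten):
    """Archive source_dir in memory and return the member names that were stored."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for root, _dirs, files in os.walk(source_dir):
            for file_name in files:
                full_path = os.path.join(root, file_name)
                if flatten:
                    arcname = file_name  # base name only
                else:
                    arcname = os.path.relpath(full_path, source_dir)
                archive.add(full_path, arcname=arcname)
    # Rewind and read the archive back to inspect the stored names.
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r") as archive:
        return sorted(archive.getnames())
```

For a directory containing a/notes.txt and b/notes.txt, the flattened variant stores two members that are both named notes.txt, so one overwrites the other on extraction, while the relative variant keeps them apart.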
To create a compressed tar archive (e.g., gzip):
with tarfile.open('/path/to/archive.tar.gz', 'w:gz') as archive:
    pass  # Add files here with archive.add(...)
Common tarfile modes:
| Mode | Description |
|---|---|
| 'r' or 'r:*' | Read with transparent compression (recommended) |
| 'r:' | Read without compression |
| 'r:gz' | Read with gzip compression |
| 'r:bz2' | Read with bzip2 compression |
| 'a' or 'a:' | Append without compression (creates if missing) |
| 'w' or 'w:' | Write without compression |
| 'w:gz' | Write with gzip compression |
| 'w:bz2' | Write with bzip2 compression |
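The practical difference between the read modes shows up when the mode does not match the file. A small sketch, assuming a gzip-compressed archive as input (the helper name can_read is hypothetical):

```python
import tarfile


def can_read(archive_path, mode):
    """Return True if tarfile can open archive_path in the given read mode."""
    try:
        with tarfile.open(archive_path, mode):
            return True
    except tarfile.ReadError:
        # Raised when the file does not match the format the mode demands.
        return False
```

For a .tar.gz file, 'r:*' and 'r:gz' should succeed, while 'r:' should fail because it expects an uncompressed archive.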
Extracting tar Archives with Different Compression
import tarfile
import time

def extract_tar_archive(archive_path, extract_dir):
    """Extract a tar archive to specified directory"""
    start_time = time.time()
    # 'r:*' detects the compression automatically; 'r:' would reject compressed archives
    with tarfile.open(archive_path, 'r:*') as archive:
        archive.extractall(path=extract_dir)
    elapsed = time.time() - start_time
    print(f"Extraction completed in {elapsed:.2f} seconds")
For processing files individually (useful for large archives):
with tarfile.open('archive.tar.gz', 'r:gz') as archive:
    for member in archive:
        file_obj = archive.extractfile(member)
        if file_obj is None:
            continue  # Directories and other non-file members have no data
        # Process file_obj as needed
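Note: When processing archives with many files individually, be mindful of memory usage. The TarFile object keeps the metadata of every member it has read, and calling read() on an extracted member loads that member's full contents, so read large members in chunks.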
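One way to keep memory use bounded is to read each member in fixed-size chunks. The following is a minimal sketch that computes a SHA-256 digest per file. The function name hash_members and the 64 KiB chunk size are assumptions chosen for illustration.

```python
import hashlib
import tarfile


def hash_members(archive_path, chunk_size=64 * 1024):
    """Return {member name: SHA-256 hex digest} for regular files in the archive."""
    digests = {}
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file members
            file_obj = archive.extractfile(member)
            digest = hashlib.sha256()
            # Read in fixed-size chunks so a large member is never fully in memory.
            for chunk in iter(lambda: file_obj.read(chunk_size), b""):
                digest.update(chunk)
            digests[member.name] = digest.hexdigest()
    return digests
```

The digest is only an example of per-member processing; any streaming operation can take its place in the inner loop.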