File Discovery via Pathlib
To locate all GeoTIFF datasets scattered across nested directories, the pathlib module provides an efficient recursive search mechanism. By leveraging the rglob method, we can aggregate paths for every .tif file contained within the root directory and its descendants.
from pathlib import Path
from osgeo import gdal
import numpy as np
SOURCE_DIR = Path('/data/source_imagery')
DEST_DIR = Path('/data/cropped_output')
MAX_PIXELS = 7680
BAND_COUNT = 3
# Discover all TIF files recursively
target_images = list(SOURCE_DIR.rglob('*.tif'))
Raster Extraction and Subsetting
Clipping a raster subset manually requires reading the pixel matrix from the source dataset, initializing a new GeoTIFF, and transferring the spatial metadata alongside the clipped array. This operation is executed per file to extract a square region up to the defined maximum dimension.
Dataset Access and Array Reading
Open the source dataset to extract spatial properties—such as the affine transformation matrix and coordinate reference system—along with the band data type. Calculate the target dimensions by constraining the original size to the maximum clipping threshold, then pull the upper-left rectangular subset using ReadAsArray.
def process_geotiff(src_filepath):
src_ds = gdal.Open(str(src_filepath))
if not src_ds:
return
affine_matrix = src_ds.GetGeoTransform()
crs_info = src_ds.GetProjection()
x_pixels = src_ds.RasterXSize
y_pixels = src_ds.RasterYSize
band_dtype = src_ds.GetRasterBand(1).DataType
# Determine crop boundaries
crop_x = min(x_pixels, MAX_PIXELS)
crop_y = min(y_pixels, MAX_PIXELS)
# Extract pixel block from origin (0,0)
pixel_block = src_ds.ReadAsArray(0, 0, crop_x, crop_y)
Destination Dataset Generation
Instantiate the output GeoTIFF using the GTiff driver, ensuring the created file matches the clipped width, height, band count, and original data type. It is critical to apply the unchanged affine matrix and projection to preserve geographic alignment.
# Initialize destination raster
dest_filename = src_filepath.name
dest_filepath = DEST_DIR / dest_filename
dest_ds = gdal.GetDriverByName('GTiff').Create(
str(dest_filepath), crop_x, crop_y, BAND_COUNT, band_dtype
)
# Preserve spatial referencing
dest_ds.SetGeoTransform(affine_matrix)
dest_ds.SetProjection(crs_info)
Band Data Transfer
Iterate through the designated number of bands to write the sliced array data into the newly generated dataset. Dimensions of the injected array must precisely correspond to the destination file's resolution to avoid runtime exceptions. Concluding the flush process and nullifying the dataset variable ensures data is written to disk.
# Commit band data
for band_idx in range(1, BAND_COUNT + 1):
dest_ds.GetRasterBand(band_idx).WriteArray(pixel_block[band_idx - 1, :, :])
dest_ds.FlushCache()
dest_ds = None
# Execute batch cropping
for img_path in target_images:
process_geotiff(img_path)