Processing a 2010 US Census spreadsheet containing thousands of rows—where each row represents a single census tract—requires calculating the total population and tract count for every county. Performing this manually across thousands of entries is inefficient and time-consuming. A Python script can automate these calculations in seconds by reading the Excel data, aggregating the statistics per county, and exporting the results.
The automation workflow involves:
- Loading the Excel file using the
openpyxllibrary. - Iterating through each row to extract state, county, and population figures.
- Accumulating tract counts and population totals into a nested dictionary.
- Writing the aggregated dictionary to a Python module file using
pprintfor easy reuse.
import openpyxl
import pprint
def process_census_data(filepath):
print(f'Loading workbook: {filepath}')
workbook = openpyxl.load_workbook(filepath)
active_sheet = workbook.active
population_stats = {}
for row in active_sheet.iter_rows(min_row=2, values_only=True):
state_name = row[1] # Column B
county_name = row[2] # Column C
tract_pop = row[3] # Column D
# Initialize state and county entries if they don't exist
population_stats.setdefault(state_name, {})
population_stats[state_name].setdefault(county_name, {'total_pop': 0, 'tract_count': 0})
# Increment the tract count and add to the county population
population_stats[state_name][county_name]['tract_count'] += 1
population_stats[state_name][county_name]['total_pop'] += int(tract_pop)
print('Writing aggregated data to module...')
with open('census_aggregation.py', 'w') as output_file:
output_file.write('aggregated_data = ' + pprint.pformat(population_stats))
return population_stats
process_census_data('censuspopdata.xlsx')Exporting the data structure as a formatted Python dictionary into a .py file using pprint.pformat creates a persistent module. This approach avoids reading the Excel file again in future computations. The generated file can be imported directly into other scripts:
import census_aggregation
print(census_aggregation.aggregated_data['AK'])