Requirements Analysis
Excel data processing typically involves:
- Converting Excel serial date numbers to standard date formats (e.g., 44567 → 2022-01-06)
- Removing duplicates based on order IDs (SOID), keeping only the most recent records by date
- Converting date strings with English month names to numeric formats (e.g., 06/Jan/2022 12:27 → 2022-01-06)
Additionally, multiple Excel files may correspond to a single database table. The mapping between files and tables needs to be defined explicitly.
Installing Dependencies
pip3 install sqlalchemy pymssql pandas xlrd xlwt
Import the required packages:
import pandas as pd
from datetime import date, timedelta, datetime
import time
import os
from sqlalchemy import create_engine
import pymssql
Reading Excel Data
The pandas library handles Excel reading straightforwardly:
def read_excel_file(filepath):
    return pd.read_excel(filepath)
Data Transformation
Converting Excel Serial Numbers to Dates
Excel stores dates as sequential day counts starting from a reference date. To determine this reference date, reverse-calculate using a known value:
from datetime import date, timedelta
date_value = 44567
delta = timedelta(date_value)
result_date = date(2022, 1, 6)
reference = result_date - delta
print(reference)
# Output: 1899-12-30
With the reference date identified, implement the conversion function:
def serial_to_date(serial_num):
    if pd.isna(serial_num):
        return None
    base_date = date(1899, 12, 30)
    delta = timedelta(serial_num)
    return base_date + delta
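As a sanity check, the known value from the reverse calculation round-trips correctly (the function is restated here so the snippet runs on its own):

```python
import pandas as pd
from datetime import date, timedelta

def serial_to_date(serial_num):
    # Excel serial day count, measured from the 1899-12-30 epoch
    if pd.isna(serial_num):
        return None
    return date(1899, 12, 30) + timedelta(serial_num)

print(serial_to_date(44567))        # 2022-01-06
print(serial_to_date(float('nan'))) # None for missing cells
```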
Converting English Month Dates
Python's datetime module can parse English month names directly:
def format_datetime(date_str):
    if pd.isna(date_str):
        return None
    # Input: 06/Jan/2022 12:27
    parsed = datetime.strptime(date_str, '%d/%b/%Y %H:%M')
    return parsed.strftime('%Y-%m-%d')
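A quick check with the sample input (function restated so the snippet is self-contained; note that `%b` matches abbreviated month names under the default English/C locale):

```python
import pandas as pd
from datetime import datetime

def format_datetime(date_str):
    if pd.isna(date_str):
        return None
    # '%d/%b/%Y %H:%M' matches inputs like '06/Jan/2022 12:27'
    parsed = datetime.strptime(date_str, '%d/%b/%Y %H:%M')
    return parsed.strftime('%Y-%m-%d')

print(format_datetime('06/Jan/2022 12:27'))  # 2022-01-06
```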
Removing Duplicates
Sort by date in ascending order, then remove duplicates keeping the last occurrence:
def remove_duplicates(df):
    df.sort_values(by=['Receipt Date'], inplace=True)
    df.drop_duplicates(subset=['Order ID'], keep='last', inplace=True)
    return df
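A small in-memory DataFrame illustrates the keep-last behavior, using the same column names as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Order ID': ['S001', 'S001', 'S002'],
    'Receipt Date': ['2022-01-04', '2022-01-06', '2022-01-05'],
})

# After sorting, the newest row for each Order ID comes last,
# so keep='last' retains the most recent record
df.sort_values(by=['Receipt Date'], inplace=True)
df.drop_duplicates(subset=['Order ID'], keep='last', inplace=True)
print(df)  # S001 keeps only its 2022-01-06 row; S002 is untouched
```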
Consolidating Multiple Excel Files
When several Excel files map to one database table:
def combine_excel_files(filenames, base_path):
    datasets = [read_excel_file(base_path + fname) for fname in filenames]
    return pd.concat(datasets, ignore_index=True)
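The concatenation step itself, shown with in-memory frames instead of files:

```python
import pandas as pd

part1 = pd.DataFrame({'Order ID': ['S001'], 'Status': ['Open']})
part2 = pd.DataFrame({'Order ID': ['S002'], 'Status': ['Closed']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating
# each source file's own row index
combined = pd.concat([part1, part2], ignore_index=True)
print(combined)
```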
Writing to SQL Server
Initialize the database connection with explicit encoding:
engine = create_engine("mssql+pymssql://sa:password@localhost/study?charset=GBK")
Insert data into the target table:
def write_to_database(df, table_name, column_list):
    selected_df = df[column_list]
    start = time.time()
    print(f'Insert started at: {start}')
    selected_df.to_sql(table_name, engine, index=False, if_exists='append')
    elapsed = time.time() - start
    print(f'Inserted {len(selected_df)} rows in {elapsed:.5f} seconds')
Note: if the target SQL Server database uses a Chinese (GBK) collation, include the charset parameter in the connection string to avoid encoding issues.
Main Processing Loop
Define the file-to-table mapping and process each group:
base_path = './data/'
backup_path = './process/'

file_mapping = {
    'table_a': ['test1.xls', 'test2.xls'],
    'table_b': ['test3.xls'],
    'table_c': ['test4.xls']
}

# target_columns[i] lists the columns inserted for the i-th table above
# (dicts preserve insertion order in Python 3.7+, so the two stay aligned)
target_columns = [
    ['Serial', 'Order ID', 'Status', 'Entry Date'],
    ['Part Number', 'Category', 'Description'],
    ['Source Location', 'Original Code']
]

for idx, (table, files) in enumerate(file_mapping.items()):
    df = combine_excel_files(files, base_path)
    # Deduplicate by Order ID where the column exists; otherwise drop exact duplicates
    if 'Order ID' in df.columns:
        df = remove_duplicates(df)
    else:
        df.drop_duplicates(inplace=True)
    # Save a cleaned backup copy before inserting into the database
    df.to_excel(backup_path + table + '.xls', index=False)
    write_to_database(df, table, target_columns[idx])
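The loop above does not yet invoke the conversion helpers defined earlier. One way to wire them in is to apply a converter element-wise to the relevant column before deduplication; the column name 'Entry Date' here is an assumption based on the earlier examples, and the helper is restated so the snippet runs on its own:

```python
import pandas as pd
from datetime import date, timedelta

def serial_to_date(serial_num):
    # Excel serial day count, measured from the 1899-12-30 epoch
    if pd.isna(serial_num):
        return None
    return date(1899, 12, 30) + timedelta(serial_num)

df = pd.DataFrame({'Order ID': ['S001'], 'Entry Date': [44567]})

# Series.apply runs the converter on each cell, passing NaN through as None
df['Entry Date'] = df['Entry Date'].apply(serial_to_date)
print(df['Entry Date'].iloc[0])  # 2022-01-06
```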