Batch Processing Excel Files and Importing Data into SQL Server with Python

Requirements Analysis

Excel data processing typically involves:

  • Converting Excel serial date numbers to standard date formats (e.g., 44567 → 2022/1/6)
  • Removing duplicates based on order IDs (SOID), keeping only the most recent records by date
  • Converting date strings with English month names to numeric formats (e.g., 06/Jan/2022 12:27 → 2022-01-06)

Additionally, multiple Excel files may correspond to a single database table. The mapping between files and tables needs to be defined explicitly.

Installing Dependencies

pip3 install sqlalchemy pymssql pandas xlrd xlwt

Import the required packages:

import pandas as pd
from datetime import date, timedelta, datetime
import time
import os
from sqlalchemy import create_engine
import pymssql

Reading Excel Data

The pandas library handles Excel reading straightforwardly:

def read_excel_file(filepath):
    return pd.read_excel(filepath)

Data Transformation

Converting Excel Serial Numbers to Dates

Excel stores dates as sequential day counts starting from a reference date. To determine this reference date, reverse-calculate using a known value:

from datetime import date, timedelta

date_value = 44567
delta = timedelta(date_value)
result_date = date(2022, 1, 6)
reference = result_date - delta
print(reference)
# Output: 1899-12-30

With the reference date identified, implement the conversion function:

def serial_to_date(serial_num):
    if pd.isna(serial_num):
        return None
    
    base_date = date(1899, 12, 30)
    delta = timedelta(serial_num)
    return base_date + delta
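A quick sanity check of the conversion, using the same serial number as the reverse calculation above:

```python
from datetime import date, timedelta

# 44567 days after the 1899-12-30 reference date
converted = date(1899, 12, 30) + timedelta(44567)
print(converted)  # 2022-01-06
```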

Converting English Month Dates

Python's datetime module can parse English month names directly:

def format_datetime(date_str):
    if pd.isna(date_str):
        return None
    
    # Input: 06/Jan/2022 12:27
    parsed = datetime.strptime(date_str, '%d/%b/%Y %H:%M')
    return datetime.strftime(parsed, '%Y-%m-%d')
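The parsing step in isolation looks like this. Note that %b matches abbreviated month names according to the current locale; under the default C/English locale, 'Jan' parses without issue, but a non-English locale could break this.

```python
from datetime import datetime

# Parse the day/abbreviated-month/year format, then reformat as ISO date
parsed = datetime.strptime('06/Jan/2022 12:27', '%d/%b/%Y %H:%M')
print(parsed.strftime('%Y-%m-%d'))  # 2022-01-06
```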

Removing Duplicates

Sort by date in ascending order, then remove duplicates keeping the last occurrence:

def remove_duplicates(df):
    df.sort_values(by=['Receipt Date'], inplace=True)
    df.drop_duplicates(subset=['Order ID'], keep='last', inplace=True)
    return df
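To illustrate with hypothetical sample data (the column names match those assumed by remove_duplicates), the same two steps applied to a small DataFrame:

```python
import pandas as pd

# Hypothetical sample: order A1 appears twice with different receipt dates
df = pd.DataFrame({
    'Order ID': ['A1', 'A1', 'B2'],
    'Receipt Date': ['2022-01-05', '2022-01-06', '2022-01-04'],
    'Status': ['open', 'closed', 'open'],
})

# Same two steps as remove_duplicates: sort ascending, then keep the last
# occurrence of each Order ID (i.e., the most recent record)
df.sort_values(by=['Receipt Date'], inplace=True)
df.drop_duplicates(subset=['Order ID'], keep='last', inplace=True)
# A1 now keeps only its most recent ('closed') record
```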

Consolidating Multiple Excel Files

When several Excel files map to one database table:

def combine_excel_files(filenames, base_path):
    datasets = [read_excel_file(base_path + fname) for fname in filenames]
    return pd.concat(datasets, ignore_index=True)
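The concat step can be sketched without actual files by using two in-memory DataFrames as stand-ins:

```python
import pandas as pd

# Stand-ins for two Excel files already read into DataFrames
part1 = pd.DataFrame({'Order ID': ['A1'], 'Status': ['open']})
part2 = pd.DataFrame({'Order ID': ['B2'], 'Status': ['closed']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index
combined = pd.concat([part1, part2], ignore_index=True)
```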

Writing to SQL Server

Initialize the database connection with explicit encoding:

engine = create_engine("mssql+pymssql://sa:password@localhost/study?charset=GBK")

Insert data into the target table:

def write_to_database(df, table_name, column_list):
    selected_df = df[column_list]
    
    start = time.time()
    print(f'Insert started at: {start}')
    
    selected_df.to_sql(table_name, engine, index=False, if_exists='append')
    
    elapsed = time.time() - start
    print(f'Inserted {len(selected_df)} rows in {elapsed:.5f} seconds')

Note: The SQL Server instance in this example uses a GBK collation. Include the charset parameter in the connection string to avoid encoding issues.
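The to_sql mechanics can be demonstrated without a running SQL Server instance; here an in-memory SQLite engine stands in for the mssql+pymssql one (the table name and data are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the SQL Server engine; to_sql is used the same way
engine = create_engine('sqlite://')

df = pd.DataFrame({'Serial': [1, 2], 'Order ID': ['A1', 'B2']})
df.to_sql('table_a', engine, index=False, if_exists='append')

# Read the rows back to confirm the insert
back = pd.read_sql('SELECT * FROM table_a', engine)
print(len(back))  # 2
```

For large inserts, to_sql also accepts a chunksize parameter to write rows in batches.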

Main Processing Loop

Define the file-to-table mapping and process each group:

base_path = './data/'
backup_path = './process/'

file_mapping = {
    'table_a': ['test1.xls', 'test2.xls'],
    'table_b': ['test3.xls'],
    'table_c': ['test4.xls']
}

target_columns = [
    ['Serial', 'Order ID', 'Status', 'Entry Date'],
    ['Part Number', 'Category', 'Description'],
    ['Source Location', 'Original Code']
]

for (table, files), columns in zip(file_mapping.items(), target_columns):
    df = combine_excel_files(files, base_path)
    
    if 'Order ID' in df.columns:
        df = remove_duplicates(df)
    else:
        df.drop_duplicates(inplace=True)
    
    df.to_excel(backup_path + table + '.xls', index=False)
    write_to_database(df, table, columns)

Tags: python Excel SQL Server Pandas Data Processing

Posted on Fri, 08 May 2026 09:32:27 +0000 by m2babaey