Practical Data Preprocessing with Pandas for Stroke Risk Analysis

Sample Dataset: Patient Demographics and Clinical Metrics

ID Gender Hypertension Married Occupation Residence BMI SmokingHistory Stroke
9046 Male No Yes Private Urban 36.6 FormerSmoker Yes
51676 Female No Yes SelfEmployed Rural NaN NeverSmoked Yes
31112 Male No Yes Private Rural 32.5 NeverSmoked Yes
60182 Female No Yes Private Urban 34.4 CurrentSmoker Yes
1665 Female Yes Yes SelfEmployed Rural 24.0 NeverSmoked Yes
56669 Male No Yes Private Urban 29.0 FormerSmoker Yes
53882 Male Yes Yes Private Rural 27.4 NeverSmoked Yes

Supplementary Clinical Measurements

ID Age AvgGlucose
9046 67 228.69
51676 61 202.21
31112 80 105.92
60182 49 171.23
1665 79 174.12
56669 81 186.21
53882 74 70.09
10434 69 94.39
27419 59 76.15
60491 78 58.57

Task 1: Integratign Patient Records Using Key-Based Joins

Load both Excel files and inspect their structure:

import pandas as pd

stroke_data = pd.read_excel('healthcare-dataset-stroke.xlsx')
glucose_age_data = pd.read_excel('healthcare-dataset-age-abs.xlsx')

print("Stroke dataset (first 5 rows):")
print(stroke_data.head(5))
print("\nAge-glucose dataset (first 5 rows):")
print(glucose_age_data.head(5))

Perform vertical concatenation using pd.concat, comparing outer and inner alignment:

# Outer join along index axis (row-wise stacking)
stacked_outer = pd.concat([stroke_data, glucose_age_data], axis=0, join='outer', ignore_index=True)
print("\nTop 5 rows of outer-joined stack:")
print(stacked_outer.head(5))

# Inner join — retains only columns common to both DataFrames
stacked_inner = pd.concat([stroke_data, glucose_age_data], axis=0, join='inner', ignore_index=True)
print("\nTop 5 rows of inner-joined stack:")
print(stacked_inner.head(5))

Merge datasets on the shared identifier 'ID':

merged_df = pd.merge(stroke_data, glucose_age_data, on='ID', how='inner')
print("\nMerged dataset (first 5 rows):")
print(merged_df.head(5))

Task 2: Cleaning Inconsistent and Missing Values

Remove duplicate combinations of 'Occupation' and 'SmokingHistory' from the stroke dataset:

stroke_cleaned = stroke_data.drop_duplicates(subset=['Occupation', 'SmokingHistory'])

Assess missingness in the age-glucose dataset:

missing_summary = glucose_age_data[['Age', 'AvgGlucose']].isnull().sum()
print("Missing value counts:\n", missing_summary)

Impute missing values: fill Age with its mean, interpolate AvgGlucose using cubic polynomial interpolation:

filled_age = glucose_age_data['Age'].fillna(glucose_age_data['Age'].mean())
interpolated_glucose = glucose_age_data['AvgGlucose'].interpolate(method='polynomial', order=3)

glucose_age_data['Age'] = filled_age
glucose_age_data['AvgGlucose'] = interpolated_glucose

Detect outliers in Age and AvgGlucose using the 3-sigma rule:

from scipy import stats

def count_outliers(series):
    z_scores = np.abs(stats.zscore(series.dropna()))
    return (z_scores > 3).sum()

age_outliers = count_outliers(glucose_age_data['Age'])
glucose_outliers = count_outliers(glucose_age_data['AvgGlucose'])
print(f"Age outliers: {age_outliers}, Glucose outliers: {glucose_outliers}")

Task 3: Feature Scaling Techniques on Glucose Levels

Apply three distinct normalization strategies to 'AvgGlucose':

Min-Max Scaling:

glucose_age_data['AvgGlucose_MinMax'] = (
    (glucose_age_data['AvgGlucose'] - glucose_age_data['AvgGlucose'].min()) /
    (glucose_age_data['AvgGlucose'].max() - glucose_age_data['AvgGlucose'].min())
)

Z-Score Standardization:

glucose_age_data['AvgGlucose_ZScore'] = (
    (glucose_age_data['AvgGlucose'] - glucose_age_data['AvgGlucose'].mean()) /
    glucose_age_data['AvgGlucose'].std()
)

Decimal Scaling:

def decimal_scale(x):
    if x == 0:
        return 0
    magnitude = 10 ** len(str(abs(int(x))))
    return x / magnitude

glucose_age_data['AvgGlucose_DecimalScale'] = glucose_age_data['AvgGlucose'].apply(decimal_scale)

Task 4: Encoding Categorical Features and Discretizing Numerical Variables

Convert 'SmokingHistory' into binary indicator columns:

smoking_dummies = pd.get_dummies(stroke_data['SmokingHistory'], prefix='SmokeStatus')

Discretize 'Age' into five equal-width bins and five quantile-based bins:

age_data = glucose_age_data[['Age']].copy()
age_data['Age_EqualWidth'] = pd.cut(age_data['Age'], bins=5, labels=False, include_lowest=True)
age_data['Age_Quantile'] = pd.qcut(age_data['Age'], q=5, labels=False, duplicates='drop')

Tags: Pandas data-cleaning feature-engineering data-integration Normalization

Posted on Sun, 14 Jun 2026 17:26:31 +0000 by soccerstar_23