Sample Dataset: Patient Demographics and Clinical Metrics
| ID | Gender | Hypertension | Married | Occupation | Residence | BMI | SmokingHistory | Stroke |
|---|---|---|---|---|---|---|---|---|
| 9046 | Male | No | Yes | Private | Urban | 36.6 | FormerSmoker | Yes |
| 51676 | Female | No | Yes | SelfEmployed | Rural | NaN | NeverSmoked | Yes |
| 31112 | Male | No | Yes | Private | Rural | 32.5 | NeverSmoked | Yes |
| 60182 | Female | No | Yes | Private | Urban | 34.4 | CurrentSmoker | Yes |
| 1665 | Female | Yes | Yes | SelfEmployed | Rural | 24.0 | NeverSmoked | Yes |
| 56669 | Male | No | Yes | Private | Urban | 29.0 | FormerSmoker | Yes |
| 53882 | Male | Yes | Yes | Private | Rural | 27.4 | NeverSmoked | Yes |
Supplementary Clinical Measurements
| ID | Age | AvgGlucose |
|---|---|---|
| 9046 | 67 | 228.69 |
| 51676 | 61 | 202.21 |
| 31112 | 80 | 105.92 |
| 60182 | 49 | 171.23 |
| 1665 | 79 | 174.12 |
| 56669 | 81 | 186.21 |
| 53882 | 74 | 70.09 |
| 10434 | 69 | 94.39 |
| 27419 | 59 | 76.15 |
| 60491 | 78 | 58.57 |
Task 1: Integratign Patient Records Using Key-Based Joins
Load both Excel files and inspect their structure:
import pandas as pd
stroke_data = pd.read_excel('healthcare-dataset-stroke.xlsx')
glucose_age_data = pd.read_excel('healthcare-dataset-age-abs.xlsx')
print("Stroke dataset (first 5 rows):")
print(stroke_data.head(5))
print("\nAge-glucose dataset (first 5 rows):")
print(glucose_age_data.head(5))
Perform vertical concatenation using pd.concat, comparing outer and inner alignment:
# Outer join along index axis (row-wise stacking)
stacked_outer = pd.concat([stroke_data, glucose_age_data], axis=0, join='outer', ignore_index=True)
print("\nTop 5 rows of outer-joined stack:")
print(stacked_outer.head(5))
# Inner join — retains only columns common to both DataFrames
stacked_inner = pd.concat([stroke_data, glucose_age_data], axis=0, join='inner', ignore_index=True)
print("\nTop 5 rows of inner-joined stack:")
print(stacked_inner.head(5))
Merge datasets on the shared identifier 'ID':
merged_df = pd.merge(stroke_data, glucose_age_data, on='ID', how='inner')
print("\nMerged dataset (first 5 rows):")
print(merged_df.head(5))
Task 2: Cleaning Inconsistent and Missing Values
Remove duplicate combinations of 'Occupation' and 'SmokingHistory' from the stroke dataset:
stroke_cleaned = stroke_data.drop_duplicates(subset=['Occupation', 'SmokingHistory'])
Assess missingness in the age-glucose dataset:
missing_summary = glucose_age_data[['Age', 'AvgGlucose']].isnull().sum()
print("Missing value counts:\n", missing_summary)
Impute missing values: fill Age with its mean, interpolate AvgGlucose using cubic polynomial interpolation:
filled_age = glucose_age_data['Age'].fillna(glucose_age_data['Age'].mean())
interpolated_glucose = glucose_age_data['AvgGlucose'].interpolate(method='polynomial', order=3)
glucose_age_data['Age'] = filled_age
glucose_age_data['AvgGlucose'] = interpolated_glucose
Detect outliers in Age and AvgGlucose using the 3-sigma rule:
from scipy import stats
def count_outliers(series):
z_scores = np.abs(stats.zscore(series.dropna()))
return (z_scores > 3).sum()
age_outliers = count_outliers(glucose_age_data['Age'])
glucose_outliers = count_outliers(glucose_age_data['AvgGlucose'])
print(f"Age outliers: {age_outliers}, Glucose outliers: {glucose_outliers}")
Task 3: Feature Scaling Techniques on Glucose Levels
Apply three distinct normalization strategies to 'AvgGlucose':
Min-Max Scaling:
glucose_age_data['AvgGlucose_MinMax'] = (
(glucose_age_data['AvgGlucose'] - glucose_age_data['AvgGlucose'].min()) /
(glucose_age_data['AvgGlucose'].max() - glucose_age_data['AvgGlucose'].min())
)
Z-Score Standardization:
glucose_age_data['AvgGlucose_ZScore'] = (
(glucose_age_data['AvgGlucose'] - glucose_age_data['AvgGlucose'].mean()) /
glucose_age_data['AvgGlucose'].std()
)
Decimal Scaling:
def decimal_scale(x):
if x == 0:
return 0
magnitude = 10 ** len(str(abs(int(x))))
return x / magnitude
glucose_age_data['AvgGlucose_DecimalScale'] = glucose_age_data['AvgGlucose'].apply(decimal_scale)
Task 4: Encoding Categorical Features and Discretizing Numerical Variables
Convert 'SmokingHistory' into binary indicator columns:
smoking_dummies = pd.get_dummies(stroke_data['SmokingHistory'], prefix='SmokeStatus')
Discretize 'Age' into five equal-width bins and five quantile-based bins:
age_data = glucose_age_data[['Age']].copy()
age_data['Age_EqualWidth'] = pd.cut(age_data['Age'], bins=5, labels=False, include_lowest=True)
age_data['Age_Quantile'] = pd.qcut(age_data['Age'], q=5, labels=False, duplicates='drop')