Raw datasets often arrive with gaps, irregularities, and outliers that interfere with meaningful analysis or model training. Python's data ecosystem, especially pandas, NumPy, and scikit-learn, lets practitioners prepare data systematically before feeding it into downstream processes.
Exploring Dataset Structure
Load the dataset and examine its shape, column types, and summary metrics to reveal potential flaws early.
import pandas as pd
raw = pd.read_csv("dataset.csv")
print(raw.shape)
print(raw.dtypes)
print(raw.describe(include='all'))
print(raw.head(3))
Profiling Data Quality
Identify missing cells, duplicated rows, and cardinality issues before deciding on cleaning actions.
missing_counts = raw.isna().sum()
missing_pct = raw.isna().mean() * 100
quality_report = pd.DataFrame({'missing': missing_counts, 'percent': missing_pct})
print(quality_report[quality_report['missing'] > 0])
print(f"Duplicate rows: {raw.duplicated().sum()}")
print(f"Unique values per column:\n{raw.nunique()}")
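The checks above can be bundled into one reusable report. The following is a minimal sketch, not a definitive implementation: the helper name profile_quality and the small sample frame are assumptions introduced for illustration, standing in for dataset.csv.

```python
import numpy as np
import pandas as pd


def profile_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with missing-value and cardinality stats."""
    report = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": frame.isna().mean() * 100,
        "unique": frame.nunique(),  # NaN is not counted as a distinct value
    })
    # Columns with the highest share of missing cells come first
    return report.sort_values("percent", ascending=False)


# Illustrative data only; not taken from any real dataset
sample = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "city": ["Oslo", "Lima", "Lima", None],
    "score": [0.5, 0.7, 0.7, np.nan],
})

report = profile_quality(sample)
n_duplicates = int(sample.duplicated().sum())
print(report)
print(f"Duplicate rows: {n_duplicates}")
```

Keeping the report as a DataFrame makes it easy to filter, for example to list only columns whose missing percentage exceeds a threshold you choose.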
Core Cleaning Operations
Addressing Missing Values
Choose an imputation strategy based on context:
# Option 1: fill with a constant (best suited to text columns)
filled = raw.fillna("UNKNOWN")
# Option 2: forward-fill sequential data such as time series
filled = raw.ffill()
# Option 3: drop rows where critical columns are empty
critical_cols = ['id', 'timestamp']
cleaned = raw.dropna(subset=critical_cols)
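When a single constant does not fit every column, one common alternative is to choose the fill value per column type. The sketch below assumes median imputation for numeric columns and most-frequent imputation for everything else; the frame and its column names are hypothetical, and the right strategy still depends on your data.

```python
import numpy as np
import pandas as pd

# Illustrative frame; column names are hypothetical
people = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Oslo", "Lima", None, "Lima"],
})

imputed = people.copy()

# Numeric columns: the median is less sensitive to extreme values than the mean
numeric_cols = imputed.select_dtypes(include="number").columns
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

# Non-numeric columns: fall back to the most frequent value
for col in imputed.select_dtypes(exclude="number").columns:
    most_frequent = imputed[col].mode().iloc[0]
    imputed[col] = imputed[col].fillna(most_frequent)

print(imputed)
```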
Detecting and Handling Outliers
Flag extreme values using the interquartile range approach:
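import numpy as np
num_cols = ['age', 'income', 'score']
cleaned = raw.copy()  # work on a copy so the original frame stays intact
for c in num_cols:
    q1, q3 = cleaned[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    mask = (cleaned[c] < lower) | (cleaned[c] > upper)
    cleaned[c] = cleaned[c].mask(mask)  # mark outliers as NaN before imputation
# Interpolate only the numeric columns that were masked
cleaned[num_cols] = cleaned[num_cols].interpolate()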
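Replacing outliers with NaN is only one option. As a hedged alternative, the sketch below wraps the same IQR rule in a helper (the name iqr_bounds is an assumption for illustration) and caps extreme values at the fences instead of discarding them. The income values are made up.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k IQRs beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative values; 400 is an obvious outlier
income = pd.Series([30, 32, 35, 36, 38, 40, 400], dtype="float64")

lower, upper = iqr_bounds(income)
outlier_mask = (income < lower) | (income > upper)

# Alternative to NaN replacement: cap values at the fences
capped = income.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}]")
print(f"Outliers flagged: {int(outlier_mask.sum())}")
```

Capping keeps every row, which may matter when the dataset is small; masking and interpolating is usually preferable when the extreme values are likely to be recording errors.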
Standardizing Formats
Cast columns to appropriate types and apply custom transformations:
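# Continue from the cleaned frame produced above
df = cleaned.copy()
# Convert and clean
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df['username'] = df['username'].str.lower().str.strip()
# Map categories; normalize case first so 'm' and 'f' are also matched
mapping = {'M': 'Male', 'F': 'Female'}
df['gender'] = df['gender'].str.strip().str.upper().map(mapping)  # unmapped values become NaN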
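Both errors='coerce' and Series.map fail silently: bad dates become NaT and unrecognized categories become NaN. A small sketch, using made-up records, of how those silent conversions might be counted so they can be reviewed rather than lost:

```python
import pandas as pd

# Illustrative records; values are made up
users = pd.DataFrame({
    "signup_date": ["2023-01-15", "not a date", "2023-03-02"],
    "gender": ["M", "f", "x"],
})

# Unparseable strings become NaT instead of raising an exception
users["signup_date"] = pd.to_datetime(users["signup_date"], errors="coerce")
bad_dates = int(users["signup_date"].isna().sum())

# Series.map returns NaN for keys missing from the mapping,
# so compare against the normalized input to find unexpected codes
mapping = {"M": "Male", "F": "Female"}
normalized = users["gender"].str.strip().str.upper()
mapped = normalized.map(mapping)
unmapped = sorted(normalized[mapped.isna() & normalized.notna()].unique())

print(f"Unparseable dates: {bad_dates}")
print(f"Unmapped gender codes: {unmapped}")
```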
Feature Scaling
Normalize numeric features before applying distance-based learning algorithms, so that no column dominates simply because of its units:
from sklearn.preprocessing import MinMaxScaler
features = ['salary', 'years_exp']
scaler = MinMaxScaler()
dataset_scaled = scaler.fit_transform(df[features])
df[features] = dataset_scaled
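If the scaled data will be split for model evaluation, a common practice is to fit the scaler on the training rows only and reuse its statistics on the held-out rows. The sketch below assumes synthetic data standing in for the salary and years_exp columns and an arbitrary 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the salary / years_exp columns
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(30_000, 120_000, size=100),  # salary
    rng.uniform(0, 40, size=100),            # years_exp
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics unchanged

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the test rows are scaled with the training minimum and maximum, their values can fall slightly outside the [0, 1] range. That is expected and avoids leaking test-set information into preprocessing.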
Post-cleaning Validation
Automate checks to confirm the dataset is ready for modeling:
assert df.isna().sum().sum() == 0, "Missing values remain"
assert df.duplicated().sum() == 0, "Duplicates still present"
print("Dataset verified for downstream use.")
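Bare assertions stop at the first failure. As a sketch of one alternative, the hypothetical helper below (the name validate and its required parameter are assumptions) collects every problem it finds, and the example also shows duplicates and missing rows being dropped before the checks run.

```python
import numpy as np
import pandas as pd


def validate(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Collect every problem found instead of stopping at the first one."""
    problems = []
    absent = [col for col in required if col not in frame.columns]
    if absent:
        problems.append(f"Missing columns: {absent}")
    n_missing = int(frame.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values remain")
    n_dupes = int(frame.duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicate rows present")
    return problems


# Illustrative frames; not real data
dirty = pd.DataFrame({"id": [1, 1, 3], "score": [0.2, 0.2, np.nan]})
clean = dirty.drop_duplicates().dropna()

dirty_problems = validate(dirty, required=["id", "timestamp"])
clean_problems = validate(clean, required=["id", "score"])
print(dirty_problems)
print(clean_problems)
```

An empty list means the frame passed every check, so a pipeline can log the full list and fail once, rather than being rerun for each individual issue.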
Advanced Preparation Techniques
- Feature construction: Derive new columns such as age_group from birth_year, or interaction terms from multiple numeric fields.
- Text processing: Tokenize, remove stop words with nltk, and apply stemming or lemmatization before vectorization.
- Time series steps: Resample, apply rolling windows, and create lag features for forecasting.
- Dimension reduction: Use PCA or TruncatedSVD to compress high-dimensional data while retaining variance.
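from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# PCA is scale-sensitive, so standardize the numeric columns first
standardized_data = StandardScaler().fit_transform(df.select_dtypes(include='number'))
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
reduced = pca.fit_transform(standardized_data)
print(f"Reduced shape: {reduced.shape}")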
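The feature construction and time series items in the list above can be sketched in a few lines of pandas. Everything here is illustrative: the reference_year value, the bin edges, the group labels, and the sales figures are assumptions rather than recommendations.

```python
import pandas as pd

# Illustrative records; reference_year is an assumed parameter
people = pd.DataFrame({"birth_year": [2010, 1995, 1980, 1950]})
reference_year = 2024
age = reference_year - people["birth_year"]

# Bin ages into labeled groups; intervals are right-closed by default
people["age_group"] = pd.cut(
    age,
    bins=[0, 17, 39, 64, 120],
    labels=["minor", "young_adult", "middle_aged", "senior"],
)

# Time series: lag and rolling-mean features on a daily series
sales = pd.DataFrame(
    {"units": [10, 12, 9, 15, 14]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
sales["units_lag1"] = sales["units"].shift(1)            # previous day's value
sales["units_roll3"] = sales["units"].rolling(3).mean()  # mean of the current and two prior days

print(people)
print(sales)
```

Lag and rolling features leave NaN in the first rows, so those rows typically need to be dropped or imputed before the validation checks are run.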
Layering robust cleaning with thoughtful preprocessing directly improves model reliability and analysis accuracy.