Data and features define the upper bound of machine learning performance; models and algorithms merely approach this limit.
- Feature Selection
Feature selection aims to identify the most relevant subset of input variables to improve model interpretability, reduce overfitting, and enhance computational efficiency—especially critical for high-dimensional datasets.
1.1 Traditional Feature Selection Methods
Wrapper Methods
These methods evaluate feature subsets based on the predictive performance of a specific learning algorithm. While effective, they are computationally expensive due to the exponential search space (O(2d) for d features) and prone to overfitting.
- Sequential Forward Selection (SFS): Starts with an empty set and iteratively adds the feature that maximizes a scoring function J.
- Sequential Backward Selection (SBS): Begins with all features and removes the least useful one at each step.
- Sequential Floating Forward Selection (SFFS): Extends SFS with conditional backward steps to correct greeedy choices.
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
sfs = SequentialFeatureSelector(
knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=4,
n_jobs=-1
)
sfs.fit(X, y)
print("Selected feature indices:", sfs.k_feature_idx_)
print("Cross-validated score:", sfs.k_score_)
Filter Methods
These rank features using statistical measures independant of any learning algorithm, making them fast and scalable.
- Variance Threshold: Removes low-variance features.
- Correlation Coefficients (Pearson/Spearman): Measure linear/monotonic relationships with the target.
- Mutual Information: Quantifies dependency via information gain: I(X;Y) = H(Y) - H(Y|X).
- Chi-Square Test: Assesses independence between categorical features and labels.
- ANOVA F-test: Evaluates differences in means across classes.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.16) # Keeps features with var > 0.16
X_reduced = selector.fit_transform(X)
Embedded Methods
Integrate feature selection into the model training process (e.g., Lasso regression with L1 regularization, decision trees with impurity-based importance).
1.2 Advanced Feature Selection Techniques
Laplacian Score (Unsupervised)
This method selects features that preserve local manifold structure by minimizing the Laplacian score:
- Construct a k-NN graph.
- Compute affinity matrix S using heat kernel: S(i,j) = exp(-||x_i - x_j||² / t).
- Derive degree matrix D and Laplacian L = D - S.
- Select features with smallest Lr.
Fisher Score (Supervised)
Maximizes between-class scatter while minimizing within-class scatter:
Unlike feature selection (which discards features), dimensionality reduction transforms features into a lower-dimensional space via projection.
Principle Component Analysis (PCA)
PCA identifies orthogonal directions (principal components) of maximum variance in the data.
Algorithm Steps:
- Center the data: X ← X - mean(X).
- Compute covariance matrix: C = XᵀX.
- Eigendecompose C and select top k eigenvectors.
- Project data: Y = XW, where W contains selected eigenvectors.
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)