Filter-based feature selection evaluates features prior to model training using statistical metrics or dependency measures between features and the target variable. It ranks features by relevance and selects a subset expected to improve generalization and reduce overfitting.
Workflow
- Data Acquisition: Gather a dataset containing feature columns and a target variable. Preprocess to handle missing values and outliers.
- Feature Scoring: Apply a chosen metric (mutual information, chi-square, Pearson correlation, information gain) to quantify each feature's association with the target. Unsupervised criteria such as variance score each feature on its own, without reference to the target.
- Ranking: Sort features by score, most relevant first.
- Thresholding (Optional): Define a cutoff, such as keeping only the top N features or those exceeding a score threshold.
- Subset Construction: Retain the qualifying features for downstream modeling.
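The workflow steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `filter_select` and `abs_pearson`, the synthetic data, and the choice of absolute Pearson correlation as the score are all assumptions made for the example.

```python
import numpy as np

def filter_select(X, y, score_fn, top_n=None, min_score=None):
    """Score each column, rank descending, then apply an optional cutoff."""
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]            # ranking: best score first
    if min_score is not None:                   # cutoff on the score itself
        order = order[scores[order] >= min_score]
    if top_n is not None:                       # or keep only the top N
        order = order[:top_n]
    return X[:, order], order, scores

def abs_pearson(col, y):
    # Hypothetical scoring function: absolute Pearson correlation with the target.
    return abs(np.corrcoef(col, y)[0, 1])

# Synthetic data (assumed for illustration): only column 2 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

X_sub, kept, scores = filter_select(X, y, abs_pearson, top_n=1)
print("Scores:", np.round(scores, 3))
print("Kept column indices:", kept)
```

Any of the metrics discussed below could be passed as `score_fn`, provided it returns one number per feature where larger means more relevant.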
Advantages include independence from learning algorithms, low computational cost, ease of interpretation, and suitability for high-dimensional data.
Variance Thresholding
Removes features with low variance, assuming they carry little discriminative information.
Procedure
- Prepare cleaned data.
- Compute variance for each feature.
- Choose a variance limit.
- Discard features below the limit.
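Effective for continuous variables; discrete ones may require encoding. Variance is scale-dependent, so a single limit is meaningful only when features share comparable units or have been rescaled to a common range (standardizing to unit variance would defeat the filter).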
Example
import numpy as np
from sklearn.feature_selection import VarianceThreshold
matrix = np.array([[0, 2, 0, 3],
                   [0, 1, 4, 3],
                   [0, 1, 1, 3]])  # columns 0 and 3 are constant
vt = VarianceThreshold(threshold=0.6)  # keep columns whose variance exceeds 0.6
filtered = vt.fit_transform(matrix)
print("Initial matrix:\n", matrix)
print("Filtered matrix:\n", filtered)
print("Kept column indices:", vt.get_support(indices=True))
print("Variances:", vt.variances_)
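The four procedure steps can also be written out by hand, which makes the filter's logic explicit. The sketch below assumes the same toy matrix and the same limit of 0.6, and uses the population variance (`ddof=0`); the variable names are illustrative only.

```python
import numpy as np

matrix = np.array([[0, 2, 0, 3],
                   [0, 1, 4, 3],
                   [0, 1, 1, 3]], dtype=float)

limit = 0.6                                  # assumed variance limit
variances = matrix.var(axis=0)               # population variance per column
keep = np.flatnonzero(variances > limit)     # indices of columns above the limit
filtered = matrix[:, keep]

print("Variances:", np.round(variances, 3))
print("Kept column indices:", keep)
```

Only the third column survives here: the first and last are constant, and the second varies too little to clear the limit.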
Mutual Information
Measures shared information between a feature and target, capturing linear and nonlinear dependencies.
Procedure
- Clean and prepare data.
- Calculate mutual information for each feature-target pair.
- Rank features by mutual information score.
- Select top-ranked features.
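Works for both continuous and categorical data. Categorical features need a numeric encoding; continuous features need either discretization or a nearest-neighbor estimator such as the one scikit-learn uses.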
Example
from functools import partial
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
data = load_iris()
features, labels = data.data, data.target
score_fn = partial(mutual_info_classif, random_state=0)  # fix the seed: the estimate is randomized
mi_selector = SelectKBest(score_fn, k=2)  # keep the two highest-scoring features
reduced_features = mi_selector.fit_transform(features, labels)
print("Original shape:", features.shape)
print("Reduced shape:", reduced_features.shape)
print("Chosen indices:", mi_selector.get_support(indices=True))
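To illustrate the claim that mutual information captures nonlinear dependencies, the following sketch compares it with Pearson correlation on a synthetic quadratic relationship. The data are made up for the example, and `mutual_info_regression` is used instead of `mutual_info_classif` because the assumed target here is continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (assumed): y depends on x only through x squared.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = x ** 2 + rng.normal(scale=0.05, size=1000)

r = np.corrcoef(x, y)[0, 1]                                   # linear association
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print("Pearson r:", round(r, 3))
print("Mutual information (nats):", round(mi, 3))
```

Because the relationship is symmetric around zero, the correlation coefficient should sit near zero while the mutual information estimate stays clearly positive.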
Chi-Square Test
Statistical test for independence between categorical variables, commonly used in classification tasks.
Procedure
- Obtain categorical feature and target data.
- Build contingency tables for each feature-target pair.
- Compute expected frequencies and chi-square statistic.
- Determine degrees of freedom and compare with critical value at chosen significance level.
- Retain features with significant association.
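Best suited to categorical or count-valued features; continuous features should be discretized first. scikit-learn's chi2 additionally requires non-negative feature values.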
Example
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
data = load_iris()
features, labels = data.data, data.target  # continuous but non-negative, so chi2 runs; scores are illustrative
chi_selector = SelectKBest(chi2, k=2)  # keep the two features with the largest chi-square statistic
reduced_features = chi_selector.fit_transform(features, labels)
print("Original shape:", features.shape)
print("Reduced shape:", reduced_features.shape)
print("Chosen indices:", chi_selector.get_support(indices=True))
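The procedure steps (contingency table, expected frequencies, statistic, degrees of freedom, comparison with a critical value) can be sketched by hand as follows. The helper `chi_square_stat` and the toy columns `colour` and `size` are hypothetical, no continuity correction is applied, and the critical value 3.841 corresponds to a 0.05 significance level with one degree of freedom.

```python
import numpy as np

def chi_square_stat(feature, target):
    """Return (statistic, degrees of freedom) for two categorical arrays."""
    f_vals, f_idx = np.unique(feature, return_inverse=True)
    t_vals, t_idx = np.unique(target, return_inverse=True)
    observed = np.zeros((len(f_vals), len(t_vals)))
    np.add.at(observed, (f_idx, t_idx), 1)                # contingency table
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0, keepdims=True)) / observed.sum()
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (len(f_vals) - 1) * (len(t_vals) - 1)
    return stat, dof

# Toy data (made up): 'colour' tracks the label, 'size' does not.
target = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
colour = np.array(["r", "r", "r", "r", "r", "b", "b", "b", "b", "b", "b", "r"])
size = np.array(["s", "l", "s", "l", "s", "l", "s", "l", "s", "l", "s", "l"])

CRITICAL_05_DOF1 = 3.841   # critical value for alpha = 0.05, 1 degree of freedom
for name, col in [("colour", colour), ("size", size)]:
    stat, dof = chi_square_stat(col, target)
    decision = "keep" if stat > CRITICAL_05_DOF1 else "drop"
    print(name, round(stat, 3), dof, decision)
```

With samples this small the expected cell counts fall below the usual rule of thumb of five, so the comparison is only meant to show the mechanics.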
Pearson Correlation
Quantifies linear relationship between continuous features and target.
Procedure
- Supply cleaned continuous data.
- Compute Pearson correlation coefficient for each feature-target pair.
- Rank by absolute coefficient magnitude.
- Select features above a set threshold or top k.
Insensitive to nonlinear patterns; sensitive to outliers.
Example
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
np.random.seed(0)
samples = np.random.rand(100, 5)
response = samples[:, 0] + 2 * samples[:, 1] + np.random.normal(0, 0.1, 100)  # only columns 0 and 1 matter
corr_selector = SelectKBest(f_regression, k=2)  # f_regression's F-score grows with r squared, so this ranks by |r|
reduced_samples = corr_selector.fit_transform(samples, response)
print("Original shape:", samples.shape)
print("Reduced shape:", reduced_samples.shape)
print("Chosen indices:", corr_selector.get_support(indices=True))
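For readers who want the coefficient itself rather than a derived score, the procedure can be sketched directly in NumPy. The helper `pearson_scores` is a hypothetical name, and the synthetic data mirror the example above but with 500 rows (an assumption chosen to make the ranking stable).

```python
import numpy as np

def pearson_scores(X, y):
    """Pearson r between each column of X and y, computed from centred data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

# Synthetic data (assumed): only columns 0 and 1 influence the response.
rng = np.random.default_rng(0)
samples = rng.random((500, 5))
response = samples[:, 0] + 2 * samples[:, 1] + rng.normal(0, 0.1, 500)

r = pearson_scores(samples, response)
ranking = np.argsort(np.abs(r))[::-1]        # rank by |r|, strongest first
top_k = ranking[:2]
print("r per feature:", np.round(r, 3))
print("Top-2 indices:", top_k)
```

Ranking on the absolute value matters: a strongly negative coefficient is as informative as a strongly positive one.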
Information Gain
Measures the reduction in entropy of the target variable when a feature's value is known; this quantity equals the mutual information between the feature and the target. Popular in classification.
Procedure
- Encode features numerically; discretize if necessary.
- Compute initial entropy of target.
- For each feature, calculate conditional entropy given feature; derive information gain as difference.
- Rank features by gain; select top candidates.
Effective for categorical targets; less natural for continuous features.
Example
import numpy as np
from sklearn.feature_selection import mutual_info_classif
np.random.seed(0)
feat_matrix = np.random.rand(100, 5)
bin_target = np.random.randint(2, size=100)  # random labels, so no feature is truly informative here
gains = []
for col in range(feat_matrix.shape[1]):
    column = feat_matrix[:, col].reshape(-1, 1)  # scikit-learn expects a 2-D array
    mi_score = mutual_info_classif(column, bin_target, random_state=0)[0]
    gains.append(mi_score)
top_two = np.argsort(gains)[::-1][:2]  # indices of the two largest scores
selected_data = feat_matrix[:, top_two]
print("Original shape:", feat_matrix.shape)
print("Reduced shape:", selected_data.shape)
print("Chosen indices:", top_two)
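The entropy-based procedure can also be computed directly for discrete features, which shows where the gain comes from. The sketch below assumes already-discretized features; the helpers `entropy` and `information_gain` and the toy columns `outlook` and `windy` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, target):
    """H(target) minus the weighted entropy of target within each feature value."""
    values, counts = np.unique(feature, return_counts=True)
    conditional = sum((c / len(feature)) * entropy(target[feature == v])
                      for v, c in zip(values, counts))
    return entropy(target) - conditional

# Toy data (made up): 'outlook' determines the label, 'windy' is uninformative.
target = np.array([1, 1, 1, 1, 0, 0, 0, 0])
outlook = np.array([0, 0, 0, 0, 1, 1, 1, 1])
windy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

gains = {"outlook": information_gain(outlook, target),
         "windy": information_gain(windy, target)}
print(gains)
```

A feature that splits the target perfectly recovers the full entropy of the target (one bit for a balanced binary label), while a feature independent of the target yields a gain of zero.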
Comparative Overview
| Method | Strengths | Limitations | Typical Use Case |
|---|---|---|---|
| Variance Threshold | Fast, easy, reduces dimensionality | Ignores relation to target; does not detect redundant (correlated) features; scale-dependent | Sparse or near-constant features |
| Mutual Information | Captures linear & nonlinear dependencies | Computationally heavier; unstable on small datasets | Nonlinear relationships, given enough samples for stable estimates |
| Chi-Square | Effective for categorical associations | Needs non-negative categorical or count features; continuous features must be discretized | Purely categorical classification problems |
| Pearson Correlation | Simple, fast linear measure | Misses nonlinear trends; outlier-sensitive | Linear relationships with clean continuous data |
| Information Gain | Strong for categorical targets; intuitive in trees | Needs discretization; costly for many features | Classification with tree-based models |