Filter-Based Feature Selection Techniques in Machine Learning

Filter-based feature selection evaluates features prior to model training using statistical metrics or dependency measures between features and the target variable. It ranks features by relevance and selects a subset expected to improve generalization and reduce overfitting.

Workflow

  1. Data Acquisition — Gather a dataset containing feature columns and a target variable. Preprocess to handle missing values and outliers.
  2. Feature Scoring — Apply a chosen metric (mutual information, chi-square, Pearson correlation, information gain) to quantify each feature's association with the target. Variance is the exception: it scores a feature on its own spread, without reference to the target.
  3. Ranking — Sort features by their scores in descending order of relevance.
  4. Thresholding (Optional) — Define a cutoff, such as keeping only the top N features or those exceeding a score threshold.
  5. Subset Construction — Retain the qualifying features for downstream modeling.
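
The five steps above can be sketched as one small function. This is a minimal illustration on synthetic data, not a reference implementation: the names filter_select and abs_pearson, the cutoff parameters, and the data itself are assumptions made for the example.

```python
import numpy as np

def filter_select(X, y, score_fn, top_n=None, min_score=None):
    """Score every column, rank by score, apply optional cutoffs.

    Returns the kept column indices (best first) and all scores.
    """
    # Step 2: one score per feature column
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    # Step 3: column indices sorted by descending score
    order = np.argsort(scores)[::-1]
    # Step 4: optional cutoffs (score threshold, then top N)
    if min_score is not None:
        order = order[scores[order] >= min_score]
    if top_n is not None:
        order = order[:top_n]
    return order, scores

def abs_pearson(x, y):
    """Illustrative scoring function: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])

# Step 1: synthetic data in which only column 2 drives the target
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 3 * X[:, 2] + rng.normal(0, 0.1, 200)

kept, scores = filter_select(X, y, abs_pearson, top_n=2)
X_subset = X[:, kept]  # Step 5: retain the qualifying columns

print("Scores:", np.round(scores, 3))
print("Kept columns:", kept)
```

Any of the metrics described in the following sections could be passed as score_fn, provided it returns one number per feature where larger means more relevant.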

Advantages include independence from learning algorithms, low computational cost, ease of interpretation, and suitability for high-dimensional data.


Variance Thresholding

Removes features with low variance, assuming they carry little discriminative information.

Procedure

  • Prepare cleaned data.
  • Compute variance for each feature.
  • Choose a variance limit.
  • Discard features below the limit.

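Works on any numeric feature; categorical features must be encoded first. Because variance depends on a feature's scale, the limit is only meaningful when features are measured in comparable units.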
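
For intuition, the procedure can be carried out by hand with NumPy. This sketch assumes the population variance (ddof=0) and a strict "greater than" comparison against the limit; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]], dtype=float)

# Population variance of each column (ddof=0)
variances = X.var(axis=0)

# Keep only the columns whose variance exceeds the limit
limit = 0.6
keep = variances > limit
X_kept = X[:, keep]

print("Variances:", np.round(variances, 3))
print("Kept column mask:", keep)
```

Columns 0 and 3 are constant, so their variance is zero; column 1 varies only slightly and also falls below the limit.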

Example

import numpy as np
from sklearn.feature_selection import VarianceThreshold

matrix = np.array([[0, 2, 0, 3],
                   [0, 1, 4, 3],
                   [0, 1, 1, 3]])

vt = VarianceThreshold(threshold=0.6)
filtered = vt.fit_transform(matrix)

print("Initial matrix:\n", matrix)
print("Filtered matrix:\n", filtered)
print("Kept column indices:", vt.get_support(indices=True))
print("Variances:", vt.variances_)

Mutual Information

Measures shared information between a feature and target, capturing linear and nonlinear dependencies.

Procedure

  • Clean and prepare data.
  • Calculate mutual information for each feature-target pair.
  • Rank features by mutual information score.
  • Select top-ranked features.

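Works for both continuous and categorical data. Continuous variables need either discretization or a nearest-neighbor estimator; scikit-learn's mutual_info_classif uses the latter.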
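
When continuous and categorical columns are mixed, mutual_info_classif can be told which columns are discrete through its discrete_features parameter. The sketch below uses synthetic data; the column construction and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)

# Column 0: continuous and dependent on the class
cont_informative = y + rng.normal(0, 0.5, n)
# Column 1: categorical code that agrees with the class about 90% of the time
cat_informative = np.where(rng.random(n) < 0.9, y, 1 - y)
# Column 2: continuous noise, unrelated to the class
cont_noise = rng.random(n)

X = np.column_stack([cont_informative, cat_informative, cont_noise])

# Mark column 1 as discrete; the others are treated as continuous
scores = mutual_info_classif(X, y, discrete_features=[1], random_state=0)
ranking = np.argsort(scores)[::-1]

print("MI scores:", np.round(scores, 3))
print("Ranking (best first):", ranking)
```

The noise column should land at the bottom of the ranking, with a score close to zero.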

Example

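from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = load_iris()
features, labels = data.data, data.target

# mutual_info_classif adds small random noise to continuous features,
# so fix random_state to get repeatable scores.
mi_selector = SelectKBest(partial(mutual_info_classif, random_state=0), k=2)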
reduced_features = mi_selector.fit_transform(features, labels)

print("Original shape:", features.shape)
print("Reduced shape:", reduced_features.shape)
print("Chosen indices:", mi_selector.get_support(indices=True))

Chi-Square Test

Statistical test for independence between categorical variables, commonly used in classification tasks.

Procedure

  • Obtain categorical feature and target data.
  • Build contingency tables for each feature-target pair.
  • Compute expected frequencies and chi-square statistic.
  • Determine degrees of freedom and compare with critical value at chosen significance level.
  • Retain features with significant association.

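Best suited to categorical or count-valued features paired with a categorical target. scikit-learn's chi2 requires non-negative feature values and treats them as frequencies.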
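
The procedure can also be followed step by step for a single categorical feature. The sketch below builds the contingency table with NumPy and takes the critical value from scipy.stats.chi2; the function name chi_square_test and the toy data are assumptions made for illustration, and no continuity correction is applied.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_test(feature, target, alpha=0.05):
    """Chi-square test of independence for one categorical feature."""
    f_vals, f_idx = np.unique(feature, return_inverse=True)
    t_vals, t_idx = np.unique(target, return_inverse=True)

    # Contingency table: rows are feature levels, columns are classes
    observed = np.zeros((len(f_vals), len(t_vals)))
    np.add.at(observed, (f_idx, t_idx), 1)

    # Expected counts under independence: row total * column total / n
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()

    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (len(f_vals) - 1) * (len(t_vals) - 1)
    critical = chi2.ppf(1 - alpha, dof)
    return stat, dof, critical, bool(stat > critical)

target = np.array([0, 0, 0, 0, 1, 1, 1, 1] * 5)
associated = np.array(["a", "a", "a", "b", "b", "b", "b", "a"] * 5)
independent = np.array(["x", "y", "x", "y", "x", "y", "x", "y"] * 5)

stat_a, dof_a, crit_a, keep_a = chi_square_test(associated, target)
stat_i, dof_i, crit_i, keep_i = chi_square_test(independent, target)

print("associated: ", round(stat_a, 3), dof_a, round(crit_a, 3), keep_a)
print("independent:", round(stat_i, 3), dof_i, round(crit_i, 3), keep_i)
```

Only the first feature exceeds the critical value at the 5% level, so it is the one retained.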

Example

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()
features, labels = data.data, data.target

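# The iris measurements are continuous but non-negative, so chi2 runs here;
# it treats each value as a frequency. Encode real categorical data
# (for example, one-hot) before scoring.
chi_selector = SelectKBest(chi2, k=2)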
reduced_features = chi_selector.fit_transform(features, labels)

print("Original shape:", features.shape)
print("Reduced shape:", reduced_features.shape)
print("Chosen indices:", chi_selector.get_support(indices=True))

Pearson Correlation

Quantifies linear relationship between continuous features and target.

Procedure

  • Supply cleaned continuous data.
  • Compute Pearson correlation coefficient for each feature-target pair.
  • Rank by absolute coefficient magnitude.
  • Select features above a set threshold or top k.

Insensitive to nonlinear patterns; sensitive to outliers.
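
The coefficient itself is simple to compute directly. The sketch below evaluates Pearson's r for every column from its definition and ranks by absolute value; the synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 500)

# Pearson r = covariance / (std of feature * std of target), per column
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# Rank by absolute magnitude, strongest first, and keep the top two
ranking = np.argsort(-np.abs(r))
top_k = ranking[:2]

print("Pearson r per column:", np.round(r, 3))
print("Top two columns:", top_k)
```

Ranking on the absolute value matters because a strong negative correlation is as useful to a linear model as a strong positive one.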

Example

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

np.random.seed(0)
samples = np.random.rand(100, 5)
response = samples[:, 0] + 2 * samples[:, 1] + np.random.normal(0, 0.1, 100)

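# f_regression converts each feature's Pearson correlation with the target
# into an F-statistic, so ranking by it matches ranking by |r|.
corr_selector = SelectKBest(f_regression, k=2)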
reduced_samples = corr_selector.fit_transform(samples, response)

print("Original shape:", samples.shape)
print("Reduced shape:", reduced_samples.shape)
print("Chosen indices:", corr_selector.get_support(indices=True))

Information Gain

Evaluates reduction in uncertainty of target variable when a feature is known. Popular in classification.

Procedure

  • Encode features numerically; discretize if necessary.
  • Compute initial entropy of target.
  • For each feature, calculate conditional entropy given feature; derive information gain as difference.
  • Rank features by gain; select top candidates.

Effective for categorical targets; less natural for continuous features.
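
For discrete features, the entropy calculation in the procedure can be written out directly. This is a minimal sketch measured in bits (log base 2); the function names and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, target):
    """H(target) minus the weighted entropy of target within each feature value."""
    values, counts = np.unique(feature, return_counts=True)
    conditional = sum(
        (count / len(feature)) * entropy(target[feature == value])
        for value, count in zip(values, counts)
    )
    return entropy(target) - conditional

target = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # determines the target exactly
partial = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # agrees on 6 of 8 samples
useless = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # independent of the target

gain_perfect = information_gain(perfect, target)
gain_partial = information_gain(partial, target)
gain_useless = information_gain(useless, target)

print("perfect:", round(gain_perfect, 4))
print("partial:", round(gain_partial, 4))
print("useless:", round(gain_useless, 4))
```

A feature that fully determines a balanced binary target gains the full 1 bit of target entropy, while an independent feature gains nothing.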

Example

import numpy as np
from sklearn.feature_selection import mutual_info_classif

np.random.seed(0)
feat_matrix = np.random.rand(100, 5)
bin_target = np.random.randint(2, size=100)

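# Information gain is the mutual information between a feature and the target;
# mutual_info_classif estimates it for all columns in one call.
# bin_target is random here, so the scores only illustrate the mechanics.
gains = mutual_info_classif(feat_matrix, bin_target, random_state=0)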

top_two = np.argsort(gains)[::-1][:2]
selected_data = feat_matrix[:, top_two]

print("Original shape:", feat_matrix.shape)
print("Reduced shape:", selected_data.shape)
print("Chosen indices:", top_two)

Comparative Overview

Method | Strengths | Limitations | Typical Use Case
------ | --------- | ----------- | ----------------
Variance Threshold | Fast, easy, reduces dimensionality | Ignores relation to target; ignores correlations between features | Sparse or near-constant features
Mutual Information | Captures linear and nonlinear dependencies | Computationally heavier; unstable on small datasets | Complex relationships in modest-sized data
Chi-Square | Effective for categorical associations | Unsuitable for continuous features | Purely categorical classification problems
Pearson Correlation | Simple, fast linear measure | Misses nonlinear trends; outlier-sensitive | Linear relationships with clean continuous data
Information Gain | Strong for categorical targets; intuitive in trees | Needs discretization; costly with many features | Classification with tree-based models

Tags: Machine Learning, feature selection, filter methods, variance threshold, mutual information

Posted on Mon, 15 Jun 2026 17:41:18 +0000 by linuxdoniv