Bilingual Language Identification Using Hybrid MFCC and Linear Prediction Features with SVM Classification

Language identification systems distinguish between spoken languages using acoustic features extracted from speech signals. This implementation focuses on binary classification—Chinese versus English—using a hybrid feature vector combining Mel-frequency cepstral coefficients (MFCCs) and linear prediction coefficients (LPCs), followed by support vector machine (SVM) decision logic.

The preprocessing pipeline begins with audio loading and voice activity detection to isolate speech segments. A fixed-duration segment (e.g., 5.3 seconds at 16 kHz, or 84,800 samples) is selected so that every utterance yields features of the same size. The signal is pre-emphasized with coefficient 0.97, split into 25-ms frames with a 10-ms shift, and Hamming-windowed. Each frame is transformed into the frequency domain via FFT, and the resulting spectrum is passed through a triangular mel-scaled filterbank (20 channels spanning 300–3700 Hz). From the log filterbank energies, 13 MFCCs are computed and optionally liftered using a sine lifter with parameter 22.
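The front-end steps above can be sketched in a few lines of base MATLAB. This is a minimal illustration under stated assumptions, not the code used in this post: the input is random noise standing in for speech, only complete frames are counted (the mfcc function used later may pad the last frame, so its frame count can differ), and the mel formula is the common 2595·log10(1 + f/700) variant.

```matlab
% Sketch: pre-emphasis, framing, windowing, and mel-spaced filter edges.
fs = 16000;                 % sampling rate (Hz)
Tw_ms = 25; Ts_ms = 10;     % frame length and shift (ms)
alpha = 0.97;               % pre-emphasis coefficient

x = randn(round(5.3 * fs), 1);          % placeholder signal, NOT real speech

% Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
y = filter([1, -alpha], 1, x);

Nw = round(Tw_ms * 1e-3 * fs);          % samples per frame
Ns = round(Ts_ms * 1e-3 * fs);          % samples per shift
num_frames = floor((numel(y) - Nw) / Ns) + 1;   % complete frames only

% Index matrix: column k holds the sample indices of frame k
% (uses implicit expansion, available from R2016b)
idx = (1:Nw).' + (0:num_frames - 1) * Ns;
n = (0:Nw - 1).';
win = 0.54 - 0.46 * cos(2 * pi * n / Nw);   % periodic Hamming window
frames = y(idx) .* win;                      % Nw-by-num_frames

% Mel warping used to place the M + 2 filterbank edges between 300 and 3700 Hz
hz2mel = @(hz) 2595 * log10(1 + hz / 700);
mel2hz = @(mel) 700 * (10 .^ (mel / 2595) - 1);
edges_hz = mel2hz(linspace(hz2mel(300), hz2mel(3700), 20 + 2));
```

With these settings each frame holds 400 samples and consecutive frames start 160 samples apart, so neighboring frames overlap by 60%.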

In parallel, linear prediction analysis estimates an all-pole model of the vocal tract. A 17th-order LPC model is fitted using the autocorrelation method with Levinson-Durbin recursion on frames of 256 samples. The leading coefficient is discarded: with MATLAB's lpc it is always 1 by normalization and carries no information (the prediction gain is returned as a separate output). The next 16 coefficients form the LPC-derived subvector for each frame.
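To make the recursion concrete, here is a minimal sketch of the autocorrelation method with Levinson-Durbin on a single 256-sample frame. It is an illustration only: the frame is a synthetic AR(1) signal rather than speech, and the variable names are chosen for this example. In practice the toolbox function lpc does the same job.

```matlab
% Sketch: order-17 LPC on one frame via Levinson-Durbin.
p = 17;                                  % model order from the text
N = 256;                                 % frame length from the text
rng(0);
frame = filter(1, [1, -0.9], randn(N, 1));   % placeholder AR(1) signal, NOT real speech

% Biased autocorrelation for lags 0..p
r = zeros(p + 1, 1);
for k = 0:p
    r(k + 1) = sum(frame(1:N-k) .* frame(1+k:N)) / N;
end

a = 1;                                   % prediction polynomial; a(1) stays 1
E = r(1);                                % prediction error power
for m = 1:p
    % Reflection coefficient for order m
    k_m = -(r(m+1:-1:2).' * a(:)) / E;
    % Order update: a_new(i) = a(i) + k_m * a(m - i)
    a = [a; 0] + k_m * [0; flipud(a)];
    E = E * (1 - k_m^2);
end

lpc_subvector = a(2:17);                 % 16 coefficients kept per frame
```

The result has 18 entries, of which the first is the fixed 1. Keeping entries 2 through 17 gives the 16 values per frame described above and leaves out the last coefficient of the order-17 model.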

The final feature vector concatenates a 1000-element slice of the MFCC matrix (taken around linear indices 1000 to 2000; an inclusive range 1000:2000 would give 1001 values, so one endpoint has to be dropped) with the flattened 16×125 LPC coefficient matrix (2000 elements), giving a 3000-dimensional column vector per utterance.

Training uses a labeled dataset of 60 utterances, 30 each for Chinese and English, with class labels encoded as [0, ..., 0, 1, ..., 1]. An SVM classifier is trained using fitcsvm (which replaces the legacy svmtrain/svmclassify pair) with the default kernel (linear) and default hyperparameters. At inference, the constructed feature vector is passed to predict as a single row, and the returned label has the same type as the training labels (here, numeric 0 or 1).
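The training step itself is short. The sketch below shows its shape with placeholder data: the feature matrix is two random Gaussian clouds, not real speech features, and the mapping 0 = Chinese, 1 = English is an assumption based on the label order given above. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: train a two-class SVM on 60 utterance-level vectors and classify one.
rng(1);
n_per_class = 30;
dim = 3000;

% Placeholder features: one utterance per row (NOT real speech features)
X = [randn(n_per_class, dim); randn(n_per_class, dim) + 0.5];
y = [zeros(n_per_class, 1); ones(n_per_class, 1)];   % 0 = Chinese, 1 = English (assumed)

model = fitcsvm(X, y);             % defaults: linear kernel, no standardization

x_new = randn(1, dim) + 0.5;       % a new observation as a 1-by-3000 row
label = predict(model, x_new);     % numeric label, same type as y
```

With only 60 training vectors in 3000 dimensions the classes are almost always linearly separable, so training accuracy says little. Holding out utterances or using crossval on the model gives a more honest estimate.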

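% Feature extraction block
fs = 16000;                   % nominal sampling rate (Hz); overwritten by audioread
N_sec = 5.3;                  % analysis segment length (s)
Tw_ms = 25; Ts_ms = 10;       % frame length and shift (ms)
alpha = 0.97;                 % pre-emphasis coefficient
R_hz = [300, 3700];           % filterbank frequency range (Hz)
M_filters = 20;               % number of mel filterbank channels
C_mfcc = 13;                  % number of cepstral coefficients
L_lifter = 22;                % sine lifter parameter

% Load and trim speech (pathname and filename are supplied by the caller)
[speech, fs] = audioread(fullfile(pathname, filename));
voice_segment = extractvoice_simple(speech, -30, -20, 0.2);  % custom VAD helper
n_keep = round(N_sec * fs);
assert(numel(voice_segment) >= n_keep, ...
    'Not enough voiced audio for a %.1f s segment.', N_sec);
segment = voice_segment(1:n_keep);
segment = segment(:);         % force a column vector

% Compute MFCCs. This is the third-party mfcc with HTK-style arguments,
% not the Audio Toolbox function of the same name.
[mfcc_mat, ~, ~] = mfcc(segment, fs, Tw_ms, Ts_ms, alpha, ...
    @(n) hamming(n, 'periodic'), R_hz, M_filters, C_mfcc, L_lifter);

% Extract a 1000-element MFCC slice by linear index.
% Assumption: elements 1001-2000 (the range 1000:2000 would hold 1001 values).
mfcc_part = mfcc_mat(1001:2000);

% Compute frame-wise LPCs. lpc on a vector returns one 1x18 row, so the
% segment is split into 256-sample frames first (or use lpces).
% Assumption: the first 125 frames are the ones kept.
frames = buffer(segment, 256);
lpc_coeffs = lpc(frames(:, 1:125), 17);   % 125x18, one row per frame
lpc_part = lpc_coeffs(:, 2:17).';         % 16x125; column 1 is always 1

% Reshape and concatenate into a 3000x1 column vector
feature_vector = [mfcc_part(:); lpc_part(:)];

% Classification (predict expects one observation per row)
predicted_label = predict(trained_svm_model, feature_vector.');
language_names = ["Chinese", "English"];  % assumes 0 = Chinese, 1 = English
language_name = language_names(predicted_label + 1);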

The triangular filterbank function trifbank maps linear frequency bins to warped domains (e.g., mel scale) using custom forward/backward warping functions. It constructs overlapping triangular filters whose centers and cutoffs are uniformly spaced in the warped domain, ensuring perceptually relevant frequency resolution. Each row of the output matrix represents a filter’s magnitude response across FFT bins.
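The construction described above can be sketched as follows. This is not the trifbank source, only a minimal version of the same idea under stated assumptions: a 512-point FFT (the next power of two above a 400-sample frame), the 2595·log10(1 + f/700) mel formula, and unit-height triangles.

```matlab
% Sketch: M triangular filters with edges uniformly spaced on the mel scale.
M = 20;                       % number of filters
fs = 16000;                   % sampling rate (Hz)
nfft = 512;                   % FFT length (assumed)
R = [300, 3700];              % frequency range covered by the filterbank (Hz)
K = nfft / 2 + 1;             % number of unique FFT bins

hz2mel = @(hz) 2595 * log10(1 + hz / 700);       % forward warping
mel2hz = @(mel) 700 * (10 .^ (mel / 2595) - 1);  % backward warping

f = linspace(0, fs / 2, K);   % center frequency of each FFT bin (Hz)
% M + 2 points uniformly spaced in mel, mapped back to Hz:
% filter m rises from c(m) to c(m+1) and falls from c(m+1) to c(m+2)
c = mel2hz(linspace(hz2mel(R(1)), hz2mel(R(2)), M + 2));

H = zeros(M, K);              % row m is the magnitude response of filter m
for m = 1:M
    rising  = (f - c(m)) / (c(m+1) - c(m));
    falling = (c(m+2) - f) / (c(m+2) - c(m+1));
    H(m, :) = max(0, min(rising, falling));
end

% Filterbank energies for one frame's magnitude spectrum (placeholder values)
spectrum = abs(randn(K, 1));
fbe = H * spectrum;           % M-by-1 vector, one energy per channel
```

Because the edges are equally spaced in mel, the filters are narrow at low frequencies and widen toward 3700 Hz, and each filter overlaps its neighbors by half its width in the warped domain.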

Tags: speech-processing language-identification mfcc linear-prediction SVM

Posted on Sun, 24 May 2026 18:39:23 +0000 by amavadia