Binary and Multiclass Classification Techniques

1. Classification Fundamentals
------------------------------

### 1.1 What is Classification?

Classification involves predicting categorical outcomes from given inputs. Common applications include email filtering (spam versus legitimate), medical diagnosis (malignant versus benign tumors), and fraud detection in financial transactions. Consider a dataset of tumor samples where the goal is to predict malignancy. The target variable takes only two values: 0 (benign) or 1 (malignant). A naive approach might apply linear regression to this binary classification problem, thresholding predictions at 0.5: predict class 1 when the hypothesis output is ≥ 0.5 and class 0 otherwise. While this may work acceptably in some cases, adding a single outlier training example can significantly degrade performance: linear regression fits a straight line, and one extreme data point can shift the entire decision boundary inappropriately. The deeper issue is that linear regression can output values greater than 1 or less than 0, which has no meaningful interpretation when labels are strictly 0 or 1. This makes linear regression unsuitable for classification tasks.
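
As a rough, self-contained illustration (with made-up tumor-size data), the following Octave sketch fits ordinary least squares to binary labels, finds where the line crosses 0.5, and shows how a single extreme example shifts that cutoff:

```
% Hypothetical data: tumor sizes and binary labels (0 = benign, 1 = malignant)
x = [1; 2; 3; 8; 9; 10];
y = [0; 0; 0; 1; 1; 1];
X = [ones(size(x)) x];                    % add intercept column
theta = pinv(X) * y;                      % least-squares fit
cutoff = (0.5 - theta(1)) / theta(2);     % size where the fitted line crosses 0.5

% Add one extreme (but correctly labeled) malignant example and refit
x2 = [x; 50];  y2 = [y; 1];
X2 = [ones(size(x2)) x2];
theta2 = pinv(X2) * y2;
cutoff2 = (0.5 - theta2(1)) / theta2(2);

printf('0.5-threshold without outlier: %.2f, with outlier: %.2f\n', cutoff, cutoff2);
```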

### 1.2 Hypothesis Representation

The solution involves transforming the hypothesis so that its outputs always fall between 0 and 1. This is achieved by passing the linear combination of features through the sigmoid (logistic) function. The hypothesis takes the following form:

    g(z) = 1 / (1 + e^(-z))
    h(x) = g(θ^T * x)

The sigmoid function maps any real number to the open interval (0, 1), making it suitable for probability estimation. The hypothesis h(x) represents the probability that the output equals 1 given input x and parameters θ:

    h(x) = P(y = 1 | x; θ)

Since the two probabilities must sum to 1, we have:

    P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1
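
A minimal Octave sketch of this hypothesis, vectorized so it accepts a whole design matrix, with an illustrative parameter vector:

```
% Sigmoid maps any real number into (0, 1)
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothesis: estimated probability that y = 1 given features x and parameters theta
hypothesis = @(theta, X) sigmoid(X * theta);

% Example: theta = [-3; 1; 1], one example with x1 = 2, x2 = 2 (intercept term first)
theta = [-3; 1; 1];
x = [1 2 2];
p = hypothesis(theta, x)    % ≈ 0.73, so predict y = 1
```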

### 1.3 Decision Boundary

The decision boundary determines how predictions are made. For the sigmoid function:

    h(x) ≥ 0.5 → y = 1
    h(x) < 0.5 → y = 0

Since g(z) ≥ 0.5 exactly when z ≥ 0, this is equivalent to:

    θ^T * x ≥ 0 → y = 1
    θ^T * x < 0 → y = 0

The decision boundary is the line (or curve) where θ^T * x = 0, separating the region where y = 1 is predicted from the region where y = 0 is predicted. For example, with parameters [-3, 1, 1], the boundary becomes x1 + x2 = 3, a straight line through (3, 0) and (0, 3) on the coordinate plane. The region where x1 + x2 ≥ 3 corresponds to y = 1 predictions.

For non-linear boundaries, polynomial or other higher-order features can be incorporated. With parameters [-1, 0, 0, 1, 1] and features including x1² and x2², the model predicts y = 1 when:

    -1 + x1² + x2² ≥ 0

This creates a circular decision boundary where x1² + x2² = 1, predicting y = 1 outside the unit circle and y = 0 inside it.
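
A small sketch of the circular case, assuming features ordered as [1, x1, x2, x1², x2²]; the helpers phi and predict below are illustrative names, not from the original notes:

```
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-1; 0; 0; 1; 1];                    % parameters from the example above

% Feature mapping: [1, x1, x2, x1^2, x2^2]
phi = @(x1, x2) [1, x1, x2, x1.^2, x2.^2];

predict = @(x1, x2) sigmoid(phi(x1, x2) * theta) >= 0.5;

predict(0.5, 0.5)   % inside the unit circle  -> 0
predict(2, 1)       % outside the unit circle -> 1
```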

2. Logistic Regression Model
----------------------------

### 2.1 Cost Function

Using the mean squared error from linear regression would produce a non-convex cost function with multiple local minima for logistic regression. Instead, a specially designed convex cost function is used:

    J(θ) = (1/m) * Σ Cost(h(x^(i)), y^(i))

    Cost(h(x), y) = -log(h(x))      if y = 1
    Cost(h(x), y) = -log(1 - h(x))  if y = 0

Key properties of this cost function include: zero cost when the prediction matches the actual value, and cost approaching infinity as the prediction approaches the wrong value.
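
A quick numeric check of the two branches (h = 0.9 is just an example value):

```
% Cost for a confident prediction, under each possible true label
h = 0.9;                       % hypothesis output P(y = 1 | x)
cost_if_y1 = -log(h)           % y = 1: ≈ 0.105 (small penalty, prediction was right)
cost_if_y0 = -log(1 - h)       % y = 0: ≈ 2.303 (large penalty, prediction was wrong)
```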

### 2.2 Simplified Cost Function and Gradient Descent

The two cases can be unified into a single expression:

    Cost(h(x), y) = -y * log(h(x)) - (1 - y) * log(1 - h(x))

The complete cost function becomes:

    J(θ) = -(1/m) * Σ[y^(i) * log(h(x^(i))) + (1 - y^(i)) * log(1 - h(x^(i)))]

This formulation can be derived from maximum likelihood estimation principles and guarantees convexity. A vectorized implementation:

    h = sigmoid(X * θ)
    J(θ) = (1/m) * (-y^T * log(h) - (1 - y)^T * log(1 - h))

Gradient descent update rule:

    θ_j := θ_j - (α/m) * Σ(h(x^(i)) - y^(i)) * x_j^(i)

Vectorized form:

    θ := θ - (α/m) * X^T * (sigmoid(X * θ) - y)

Note: despite the mathematical similarity to linear regression gradient descent, the hypothesis definition differs (sigmoid function versus linear).
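
As a minimal sketch, a batch gradient descent loop built from that vectorized update; the design matrix X (with an intercept column), the labels y, the learning rate, and the iteration count are all assumed values here:

```
sigmoid = @(z) 1 ./ (1 + exp(-z));

alpha = 0.1;                      % learning rate (chosen for illustration)
iterations = 400;                 % iteration budget (chosen for illustration)
theta = zeros(size(X, 2), 1);     % X and y are assumed to be defined already

for iter = 1:iterations
  % Vectorized update: theta := theta - (alpha/m) * X' * (h - y)
  h = sigmoid(X * theta);
  theta = theta - (alpha / length(y)) * X' * (h - y);
end
```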

### 2.3 Advanced Optimization

Beyond gradient descent, several more sophisticated optimization algorithms can minimize the cost function more efficiently: conjugate gradient, BFGS, and L-BFGS. These algorithms offer advantages including automatic learning rate selection and typically faster convergence, and they require only functions that compute J(θ) and its partial derivatives. Implementation in Octave:

```
% Returns the logistic regression cost and its gradient for fminunc.
% Assumes a sigmoid() helper is available (e.g. the anonymous function above).
function [cost, gradient] = computeCost(params, X, y)
  m = length(y);
  h = sigmoid(X * params);
  cost = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
  gradient = (1/m) * X' * (h - y);
end

% Tell fminunc that computeCost also returns the gradient, and cap iterations at 100
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialParams = zeros(size(X, 2), 1);
[optimalParams, cost, exitFlag] = ...
    fminunc(@(p) computeCost(p, X, y), initialParams, options);
```


3. Multiclass Classification
----------------------------

### 3.1 One-vs-All Classification

When dealing with more than two classes (y ∈ {0, 1, ..., n}), the one-vs-all (one-vs-rest) strategy is employed: train n+1 binary classifiers, where classifier h^(i)(x) estimates P(y = i | x; θ) for i = 0 to n. For prediction on new data, select the class with the maximum probability:

    prediction = argmax_i h^(i)(x)

This approach divides the multiclass problem into multiple binary classification problems, training a separate classifier for each class against all other classes.
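
A compact sketch of this strategy, reusing the computeCost function from Section 2.3 and assuming class labels run from 1 to K for convenience (the function names here are illustrative):

```
% Train one binary classifier per class: class k vs. everything else.
% Assumes computeCost (Section 2.3) and a sigmoid() helper are on the path.
function allTheta = oneVsAll(X, y, K)
  n = size(X, 2);
  allTheta = zeros(K, n);
  options = optimset('GradObj', 'on', 'MaxIter', 100);
  for k = 1:K
    yk = double(y == k);                  % 1 for class k, 0 for all other classes
    theta0 = zeros(n, 1);
    allTheta(k, :) = fminunc(@(p) computeCost(p, X, yk), theta0, options)';
  end
end

% Predict by picking the class whose classifier reports the highest probability
function p = predictOneVsAll(allTheta, X)
  [~, p] = max(sigmoid(X * allTheta'), [], 2);   % argmax over the K classifiers
end
```
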
4. Addressing Overfitting
-------------------------

### 4.1 Understanding Overfitting

Overfitting occurs when a model fits the training data too closely, including its noise, and fails to generalize to new data. Underfitting (high bias) occurs when the model is too simple to capture the patterns in the data. Options to combat overfitting include reducing the number of features (manually or algorithmically) and regularization, which penalizes large parameter values.

### 4.2 Regularization Cost Function

Regularization adds a penalty term to the cost function:

    J(θ) = (1/2m) * Σ(h(x^(i)) - y^(i))² + (λ/2m) * Σθ_j²   (penalty summed over j = 1 to n)

The regularization parameter λ controls the tradeoff between fitting the training data well and keeping the parameter values small. If λ is too large, the model underfits; if it is too small, overfitting persists.
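
Transcribed into Octave as a sketch (regularizedLinearCost is an illustrative name; theta(1) plays the role of θ_0 and is not penalized):

```
function J = regularizedLinearCost(theta, X, y, lambda)
  m = length(y);
  h = X * theta;                                           % linear hypothesis
  penalty = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);   % skip theta_0
  J = (1 / (2 * m)) * sum((h - y) .^ 2) + penalty;
end
```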

### 4.3 Regularized Linear Regression

Gradient descent with regularization:

    θ_0 := θ_0 - α * (1/m) * Σ(h(x^(i)) - y^(i)) * x_0^(i)
    θ_j := θ_j - α * [(1/m) * Σ(h(x^(i)) - y^(i)) * x_j^(i) + (λ/m) * θ_j]   for j ≥ 1

The update for j ≥ 1 can be rewritten as:

    θ_j := θ_j * (1 - α*λ/m) - α * (1/m) * Σ(h(x^(i)) - y^(i)) * x_j^(i)

Using the normal equation with regularization:

    θ = (X^T * X + λ*L)^(-1) * X^T * y

where L is an (n+1)×(n+1) diagonal matrix with a zero in the first diagonal entry (corresponding to θ_0) and ones on the rest of the diagonal. This modification also ensures invertibility even when m < n.
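
A corresponding sketch of the regularized normal equation (regularizedNormalEqn is an illustrative name; the backslash operator solves the linear system rather than forming an explicit inverse):

```
function theta = regularizedNormalEqn(X, y, lambda)
  n = size(X, 2);              % number of columns, including the intercept
  L = eye(n);
  L(1, 1) = 0;                 % do not regularize theta_0
  theta = (X' * X + lambda * L) \ (X' * y);
end
```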

### 4.4 Regularized Logistic Regression

The regularized cost function for logistic regression:

    J(θ) = -(1/m) * Σ[y^(i) * log(h(x^(i))) + (1 - y^(i)) * log(1 - h(x^(i)))] + (λ/2m) * Σθ_j²   (penalty summed over j = 1 to n)

The gradient descent update mirrors the linear regression case:

    θ_0 := θ_0 - α * (1/m) * Σ(h(x^(i)) - y^(i)) * x_0^(i)
    θ_j := θ_j - α * [(1/m) * Σ(h(x^(i)) - y^(i)) * x_j^(i) + (λ/m) * θ_j]   for j ≥ 1

Regularization effectively prevents overfitting even with numerous features, allowing the model to generalize better to unseen data.
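
Putting the regularized cost and gradient together in the form fminunc expects (a sketch with illustrative names; a sigmoid() helper as defined earlier is assumed):

```
function [J, grad] = regularizedLogisticCost(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);

  % Unregularized cost plus a penalty on every parameter except theta_0
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);

  % Gradient: the regularization term is added for j >= 1 only
  grad = (1/m) * X' * (h - y);
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
end
```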
