Training Neural Networks: Cost Function and Backpropagation Explained

Cost Function for Neural Networks

The cost function for a neural network extends the logistic regression cost to handle multiple output units. Define:

  • (L): total number of layers
  • (s_l): number of units (excluding bias) in layer (l)
  • (K): number of output units (classes)

For binary classification (K=1), the hypothesis (h_\Theta(x)) is a scalar. For multi-class problems (K≥3), (h_\Theta(x) \in \mathbb{R}^K) and (y \in \mathbb{R}^K). The cost function is:

[J(\Theta) = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K\left[ y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right) + (1-y_k^{(i)})\log\left(1-(h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2]

The double sum inside the brackets accumulates the logistic regression cost across all output units. The triple sum is the regularization term, which sums the squares of all weights in every layer while skipping those attached to bias units (the bias index (i=0) is sometimes included, but excluding it is the common convention).
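
As a minimal illustration, this cost could be computed in Octave roughly as follows. The names here are assumptions for the sketch, not code from the original post: H is the (m \times K) matrix whose (i)-th row is (h_\Theta(x^{(i)})), Y is the (m \times K) matrix of one-hot labels, and Theta1/Theta2 are the weight matrices of a three-layer network.

% Hypothetical names: H (m-by-K output activations), Y (m-by-K one-hot labels),
% Theta1/Theta2 (weight matrices), lambda (regularization strength).
m = size(Y, 1);

% Unregularized cost: logistic cost summed over every example and output unit
J = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));

% Regularization: squared weights, skipping the bias column (first column)
J = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));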

Backpropagation Algorithm

Backpropagation computes the gradient (\frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta)). For a training set of (m) examples:

  1. Initialize (\Delta_{i,j}^{(l)} = 0) for all (l, i, j).
  2. For each example (t = 1,\dots,m):
    • Set (a^{(1)} = x^{(t)}).
    • Perform forward propagation to compute activations (a^{(l)}) for (l=2,\dots,L).
    • Compute the output error: (\delta^{(L)} = a^{(L)} - y^{(t)}).
    • For layers (l = L-1, L-2, \dots, 2), compute: [\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \circ a^{(l)} \circ (1 - a^{(l)})] where (\circ) denotes element-wise multiplication and (a^{(l)} \circ (1 - a^{(l)}) = g'(z^{(l)})) is the derivative of the sigmoid activation.
    • Accumulate: (\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T).
  3. Compute the final gradient:
    • (D_{i,j}^{(l)} = \frac{1}{m}(\Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)})) if (j \neq 0), and (D_{i,j}^{(l)} = \frac{1}{m}\Delta_{i,j}^{(l)}) if (j = 0).

The resulting (D_{i,j}^{(l)}) equals (\frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta)).
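
Written against the steps above, a minimal Octave sketch of one full pass for a three-layer network (one hidden layer) might look like the following. The names sigmoid, X, Y, Theta1, Theta2, and lambda are assumptions for this sketch, not code from the original post.

% Assumes: X is m-by-n, Y is m-by-K (one-hot), Theta1 is s2-by-(n+1),
% Theta2 is K-by-(s2+1), and sigmoid(z) = 1 ./ (1 + exp(-z)).
m = size(X, 1);
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for t = 1:m
    % Forward propagation for example t (prepend the bias unit at each layer)
    a1 = [1; X(t, :)'];
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;
    a3 = sigmoid(z3);

    % Output-layer error
    d3 = a3 - Y(t, :)';

    % Hidden-layer error; drop the bias component before accumulating
    d2 = (Theta2' * d3) .* a2 .* (1 - a2);
    d2 = d2(2:end);

    % Accumulate the gradient contributions
    Delta1 = Delta1 + d2 * a1';
    Delta2 = Delta2 + d3 * a2';
end

% Average and regularize; the bias column (j = 0 above, column 1 in Octave) is not regularized
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + (lambda/m) * Theta2(:, 2:end);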

Backpropagation Intuition

The term (\delta_j^{(l)}) represents the "error" of unit (j) in layer (l), defined as the derivative of the cost for a single example with respect to the weighted input (z_j^{(l)}): [\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \text{cost}(t)] For a single output unit (K=1), ignoring regularization, the per-example cost is: [\text{cost}(t) = -\left[y^{(t)}\log(h_\Theta(x^{(t)})) + (1-y^{(t)})\log(1-h_\Theta(x^{(t)}))\right]] Delta values propagate errors backward through the network: (\delta_j^{(l)}) is a weighted sum of the deltas from the next layer, multiplied by the activation function's derivative.
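
For instance, if layer 3 has two units, applying the layer formula above to unit 2 of layer 2 (with the sigmoid derivative written out) gives: [\delta_2^{(2)} = \left(\Theta_{1,2}^{(2)}\,\delta_1^{(3)} + \Theta_{2,2}^{(2)}\,\delta_2^{(3)}\right) a_2^{(2)}\left(1 - a_2^{(2)}\right)]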

Practical Implementation

Unrolling Parameters

Optimization functions like fminunc require a single parameter vector. Unroll matrices into a long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ];
deltaVector = [ D1(:); D2(:); D3(:); ];

Reshape back after optimization (the indices below assume an example architecture in which Theta1 and Theta2 are 10×11 and Theta3 is 1×11):

Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);
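
Putting the two halves together, the typical pattern looks roughly like the sketch below. Here nnCostFunction is a placeholder name: it must reshape the incoming vector back into Theta1–Theta3 internally and return [J, grad] with grad unrolled the same way; X, y, and lambda are assumed to be in scope.

% Sketch: drive fminunc with unrolled parameters (placeholder names, not from the original post)
initialTheta = [Theta1(:); Theta2(:); Theta3(:)];

options = optimset('GradObj', 'on', 'MaxIter', 100);
costFunc = @(p) nnCostFunction(p, X, y, lambda);
[optTheta, finalCost] = fminunc(costFunc, initialTheta, options);

% Recover the weight matrices using the same dimensions used when unrolling
Theta1 = reshape(optTheta(1:110), 10, 11);
Theta2 = reshape(optTheta(111:220), 10, 11);
Theta3 = reshape(optTheta(221:231), 1, 11);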

Gradient Checking

Numerically approximate the gradient to verify backpropagation: [\frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1,\dots,\Theta_j+\epsilon,\dots)-J(\Theta_1,\dots,\Theta_j-\epsilon,\dots)}{2\epsilon}] Set (\epsilon = 10^{-4}). In code:

epsilon = 1e-4;
for i = 1:n
    thetaPlus = theta;  thetaPlus(i) += epsilon;    % nudge the i-th parameter up
    thetaMinus = theta; thetaMinus(i) -= epsilon;   % and down
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);   % two-sided difference
end

Compare gradApprox with the unrolled backpropagation gradient; once the two agree, disable gradient checking for training, since the numerical approximation is far too slow to run on every iteration.
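
One way to quantify the agreement is a relative difference, which should be very small (on the order of 1e-9) when backpropagation is correct. In this sketch, grad stands for the unrolled backpropagation gradient, a name not used in the original code:

% grad: unrolled gradient from backpropagation; gradApprox: numerical estimate
relDiff = norm(gradApprox - grad) / norm(gradApprox + grad);
fprintf('Relative difference: %g\n', relDiff);   % expect roughly 1e-9 or smaller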

Random Initialization

Initializing all weights to zero fails because every hidden unit in a layer receives identical gradient updates and keeps computing the same function, so the network never breaks symmetry and loses its representational capacity. Instead, initialize each weight randomly within a small range:

INIT_EPSILON = 0.12;
Theta1 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2*INIT_EPSILON) - INIT_EPSILON;
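
The value 0.12 is not special. One common heuristic (stated here as an assumption, not a requirement of the original post) scales the range with the number of units on either side of each weight matrix; the example layer sizes below are hypothetical:

% Heuristic: scale the initialization range with the fan-in/fan-out of the matrix
L_in = 400; L_out = 25;                        % hypothetical layer sizes
epsilon_init = sqrt(6) / sqrt(L_in + L_out);   % roughly 0.12 for these sizes
Theta1 = rand(L_out, L_in + 1) * (2 * epsilon_init) - epsilon_init;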

Training Steps

  1. Choose an architecture: input units = feature dimension, output units = number of classes, and the number of hidden layers (a reasonable default is one; if using more, a common choice is the same number of units in every hidden layer).
  2. Randomly initialize weights.
  3. Implement forward propagation to compute (h_\Theta(x^{(i)})).
  4. Implement the cost function.
  5. Implement backpropagation for gradients.
  6. Perform gradient checking to validate backpropagation, then disable it.
  7. Use an optimization algorithm (gradient descent, fminunc, etc.) to minimize (J(\Theta)).

Neural network cost functions are non-convex, so optimization may find a local minimum. In practice, this often yields good results.

Autonomous Driving Example

ALVINN (Autonomous Land Vehicle In a Neural Network) was a 1992 system that learned to steer by observing a human driver. A 30×32 pixel video image was fed into a three-layer network, which used backpropagation to mimic the steering direction. After about two minutes of training, the network could drive on the road it was trained on. During operation, multiple networks specialized for different road types ran in parallel; the most confident network controlled the vehicle. This demonstrated that neural networks with backpropagation can learn complex real-world tasks like driving.
