Forward vs. Reverse Mode Automatic Differentiation: When to Use Which

Automatic differentiation (AD) computes derivatives that are exact up to floating-point precision by applying the chain rule to each elementary operation as a program executes. Two primary strategies exist: forward mode and reverse mode. Which one is cheaper depends on the shape of the function being differentiated, specifically the number of inputs versus the number of outputs.

Intuitive Analogy: Manufacturing Workflow

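Consider a production line where raw materials undergo sequential transformations to yield a final product. Computing derivatives is like asking how each step influences the total cost. Forward mode follows a change in one raw material down the line to see what it does at every later step. Reverse mode starts from the final cost and traces back through the steps to find how much each one contributed.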

Forward Mode AD

In forward mode, derivative information propagates alongside the primal computation from inputs to outputs. At each operation, we track how a small change in one chosen input affects the intermediate value.

This approach is efficient when the number of inputs is small relative to the number of outputs, because each pass yields the derivatives of every output with respect to a single input. For example, if you're analyzing how two design parameters affect three performance metrics, forward mode computes all six partial derivatives in just two passes, one per parameter.

Example: Compute the derivative of \( f(x, y) = x \cdot y + \sin(x) \) with respect to \( x \) at \( (2, 3) \):

from math import sin, cos

# Initialize primal and tangent values
x = 2;  dx_dx = 1.0  # seed for ∂/∂x
y = 3;  dy_dx = 0.0  # y independent of x

# Compute v1 = x * y
v1 = x * y
dv1_dx = y * dx_dx + x * dy_dx  # = 3*1 + 2*0 = 3

# Compute v2 = sin(x)
v2 = sin(x)
dv2_dx = cos(x) * dx_dx         # ≈ -0.416

# Final output
f = v1 + v2
df_dx = dv1_dx + dv2_dx         # ≈ 3 - 0.416 = 2.584
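
The hand-written trace above can be automated with dual numbers. What follows is a minimal sketch, not a production implementation: the `Dual` class and the `dsin` helper are hypothetical names introduced here for illustration, and only addition, multiplication, and sine are supported.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A value paired with its derivative (tangent) w.r.t. one chosen input."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        # Sum rule: tangents add
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def dsin(a):
    # Chain rule for sine: (sin u)' = cos(u) * u'
    return Dual(math.sin(a.val), math.cos(a.val) * a.tan)


def f(x, y):
    return x * y + dsin(x)


# Seeding x with tangent 1 gives df/dx; a second pass seeding y gives df/dy.
df_dx = f(Dual(2.0, 1.0), Dual(3.0, 0.0)).tan
df_dy = f(Dual(2.0, 0.0), Dual(3.0, 1.0)).tan
print(df_dx, df_dy)
```

Note that getting both partial derivatives takes two calls to `f`, one per input, which is the cost pattern that makes forward mode expensive when there are many inputs.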

Reverse Mode AD

Reverse mode first executes the full forward pass to compute the output, then performs a backward sweep to accumulate gradients from output back to inputs. It leverages the fact that the gradient of a scalar-valued function can be computed in one reverse pass regardless of input dimensionality.

This makes it ideal for functions with many inputs but few outputs—such as loss functions in deep learning, which map millions of parameters to a single scalar loss.

Example: Same function \( f(x, y) = x \cdot y + \sin(x) \), now computing both \( \partial f/\partial x \) and \( \partial f/\partial y \):

from math import sin, cos

# Forward pass (store intermediates)
x = 2
y = 3
v1 = x * y        # = 6
v2 = sin(x)       # ≈ 0.909
f = v1 + v2       # ≈ 6.909

# Backward pass (initialize adjoints)
bar_f = 1.0       # ∂f/∂f = 1

bar_v1 = bar_f * 1.0   # ∂f/∂v1 = 1
bar_v2 = bar_f * 1.0   # ∂f/∂v2 = 1

bar_x = bar_v1 * y + bar_v2 * cos(x)  # = 1*3 + 1*(-0.416) ≈ 2.584
bar_y = bar_v1 * x                    # = 1*2 = 2
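
The same backward sweep can be driven by a recorded computation graph. The sketch below assumes a tiny hypothetical `Var` class and `vsin` helper (neither comes from any particular library). Each node remembers its parents together with the local partial derivatives, and `backward` walks the graph in reverse topological order.

```python
import math


class Var:
    """A graph node: value, accumulated adjoint, and links to parents."""

    def __init__(self, val, parents=()):
        self.val = val
        self.grad = 0.0
        # Each entry is (parent_node, local_partial_derivative)
        self.parents = parents

    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val,
                   ((self, other.val), (other, self.val)))


def vsin(a):
    return Var(math.sin(a.val), ((a, math.cos(a.val)),))


def backward(out):
    """Accumulate adjoints from `out` back to the inputs."""
    order, seen = [], set()

    def visit(node):
        # Depth-first post-order gives a topological ordering
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(out)
    out.grad = 1.0  # seed: d(out)/d(out) = 1
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x, y = Var(2.0), Var(3.0)
f = x * y + vsin(x)
backward(f)
print(x.grad, y.grad)
```

A single call to `backward` fills in the adjoints of both inputs, at the price of keeping every intermediate node alive until the sweep finishes.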

Key Comparison

| Property | Forward Mode | Reverse Mode |
| --- | --- | --- |
| Propagation Direction | Input → Output | Output → Input |
| Optimal When | Few inputs, many outputs (\( \mathbb{R}^n \to \mathbb{R}^m \), \( n \ll m \)) | Many inputs, few outputs (\( \mathbb{R}^n \to \mathbb{R}^m \), \( n \gg m \)) |
| Computational Cost | \( O(n) \) passes for the full Jacobian | \( O(m) \) passes for the full Jacobian |
| Memory Usage | Low (no need to store the full computation graph) | High (must retain all intermediate values for the backward pass) |
| Common Use Cases | Jacobian computation, sensitivity analysis | Gradient descent, neural network training (e.g., PyTorch, TensorFlow) |
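
The cost figures in the comparison follow from what a single pass produces. Writing \( J \) for the \( m \times n \) Jacobian of \( f : \mathbb{R}^n \to \mathbb{R}^m \) and \( e_k \) for the \( k \)-th standard basis vector (notation assumed here for illustration), one pass of each mode gives:

\[
\text{forward, seed } e_j \in \mathbb{R}^n: \quad J e_j = \text{column } j \text{ of } J,
\qquad
\text{reverse, seed } e_i \in \mathbb{R}^m: \quad e_i^{\top} J = \text{row } i \text{ of } J .
\]

Recovering all of \( J \) therefore takes \( n \) forward passes or \( m \) reverse passes. For a scalar loss, \( m = 1 \) and the single row \( e_1^{\top} J \) is the gradient.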

In practice, deep learning frameworks rely primarily on reverse mode for training because models typically have vast parameter spaces (inputs) but produce a single loss value (output). Conversely, forward mode shines in simulations with few control variables but multiple observables.
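
As a rough illustration of the few-inputs, many-outputs case, the sketch below builds a full Jacobian with one forward pass per input. The model `g` is a made-up example with two parameters and three outputs, and dual numbers are represented as plain `(value, tangent)` tuples to keep the code short.

```python
def add(p, q):
    # Sum rule on (value, tangent) pairs
    return (p[0] + q[0], p[1] + q[1])


def mul(p, q):
    # Product rule on (value, tangent) pairs
    return (p[0] * q[0], p[1] * q[0] + p[0] * q[1])


def g(a, b):
    """Hypothetical model: two design parameters in, three metrics out."""
    return [mul(a, b), add(a, b), mul(a, a)]


def jacobian_forward(func, point):
    """Build the Jacobian column by column, one forward pass per input."""
    columns = []
    for j in range(len(point)):
        # Seed input j with tangent 1 and every other input with tangent 0
        args = [(v, 1.0 if i == j else 0.0) for i, v in enumerate(point)]
        columns.append([out[1] for out in func(*args)])
    # Transpose the list of columns into a list of rows
    return [list(row) for row in zip(*columns)]


J = jacobian_forward(g, [2.0, 3.0])
print(J)
```

Here `func` is called twice regardless of how many outputs it returns. A reverse-mode approach to the same Jacobian would need three backward sweeps, one per output.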

Tags: automatic-differentiation forward-mode reverse-mode deep-learning gradient-computation

Posted on Sat, 23 May 2026 19:33:53 +0000 by Beyond Reality