Automatic differentiation (AD) is a fundamental technique in deep learning frameworks. This article demonstrates how to implement forward-mode automatic differentiation using Python operator overloading. The implementation is remarkably concise: a working system fits in fewer than a hundred lines of code.
Understanding Forward-Mode Automatic Differentiation
Forward-mode automatic differentiation computes derivatives by traversing the computational graph from inputs to outputs. At each node, it calculates both the function value and its derivative with respect to the input variables. This approach corresponds to applying the chain rule from the innermost to the outermost function when differentiating composite functions.
The core idea can be summarized in three steps:
- Decompose the program into elementary operations with known derivative rules
- Apply the corresponding derivative rules to each elementary expression
- Combine derivative results using the chain rule based on data dependencies between expressions
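As a rough illustration of these three steps, the sketch below traces the function tested later in this article, f(x₁, x₂) = ln(x₁) + x₁ * x₂ - sin(x₂), by hand at the point (2, 5). The intermediate names v1 to v4 are labels chosen for this example only; each line pairs a value with its derivative with respect to x₁.

```python
import math

# Hand-written forward trace of f(x1, x2) = ln(x1) + x1 * x2 - sin(x2)
# at (x1, x2) = (2, 5), differentiating with respect to x1.
# Each v_i is one elementary operation; each dv_i is its derivative w.r.t. x1.
x1, dx1 = 2.0, 1.0   # d(x1)/d(x1) = 1
x2, dx2 = 5.0, 0.0   # d(x2)/d(x1) = 0

v1, dv1 = math.log(x1), dx1 / x1              # log rule
v2, dv2 = x1 * x2, dx1 * x2 + x1 * dx2        # product rule
v3, dv3 = math.sin(x2), math.cos(x2) * dx2    # sine rule
v4, dv4 = v1 + v2, dv1 + dv2                  # sum rule
f, df = v4 - v3, dv4 - dv3                    # difference rule

print(f"{f:.4f} {df:.4f}")  # 11.6521 5.5000
```

Every derivative on the right-hand side depends only on values already computed, which is why the value and the derivative can be carried forward together in a single pass.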
In Python, operator overloading lets us attach these derivative rules directly to the arithmetic operators, so an ordinary-looking expression computes its value and its derivative at the same time.
Implementation
First, we import numpy for numerical computations:
import numpy as np
We create a class called ADValue that holds two pieces of information: the actual value and its derivative. The derivative (grad) is the derivative of this value with respect to the input variable we have chosen to differentiate against.
class ADValue:
    """Forward-mode automatic differentiation value container."""

    def __init__(self, val, grad):
        self.val = val
        self.grad = grad

    def __repr__(self):
        return f'value:{self.val:.4f}, derivative:{self.grad:.4f}'
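To make the role of grad concrete, here is a minimal sketch of how inputs are typically initialized. The helpers variable and constant are hypothetical names introduced for this example and are not part of the class above: the input being differentiated starts with derivative 1, and anything that does not depend on it starts with derivative 0.

```python
class ADValue:
    """Value paired with its derivative (same fields as in the article)."""

    def __init__(self, val, grad):
        self.val = val
        self.grad = grad

    def __repr__(self):
        return f'value:{self.val:.4f}, derivative:{self.grad:.4f}'


def variable(val):
    """Hypothetical helper: the input we differentiate with respect to."""
    return ADValue(val, 1.0)


def constant(val):
    """Hypothetical helper: a quantity that does not depend on that input."""
    return ADValue(val, 0.0)


x = variable(2.0)
c = constant(5.0)
print(x)  # value:2.0000, derivative:1.0000
print(c)  # value:5.0000, derivative:0.0000
```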
Now we implement operator overloading for basic arithmetic operations. The key insight is that each operation must also propagate the derivative using standard calculus rules.
    def __add__(self, other):
        if isinstance(other, ADValue):
            new_val = self.val + other.val
            new_grad = self.grad + other.grad
        elif isinstance(other, (int, float)):
            new_val = self.val + other
            new_grad = self.grad
        else:
            return NotImplemented
        return ADValue(new_val, new_grad)

    def __radd__(self, other):
        return self.__add__(other)

    def __sub__(self, other):
        if isinstance(other, ADValue):
            new_val = self.val - other.val
            new_grad = self.grad - other.grad
        elif isinstance(other, (int, float)):
            new_val = self.val - other
            new_grad = self.grad
        else:
            return NotImplemented
        return ADValue(new_val, new_grad)

    def __mul__(self, other):
        if isinstance(other, ADValue):
            # Product rule: (f*g)' = f'*g + f*g'
            new_val = self.val * other.val
            new_grad = self.grad * other.val + self.val * other.grad
        elif isinstance(other, (int, float)):
            new_val = self.val * other
            new_grad = self.grad * other
        else:
            return NotImplemented
        return ADValue(new_val, new_grad)

    def __rmul__(self, other):
        return self.__mul__(other)
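The class as written defines no __rsub__, so an expression such as 3 - x with a plain number on the left is not supported, and division is not covered either. The sketch below shows one way these gaps might be filled. It uses a standalone class named Dual and a helper _lift, both hypothetical names chosen to keep the example self-contained; the same methods could be added to ADValue.

```python
class Dual:
    """Standalone stand-in for ADValue (hypothetical name for this sketch)."""

    def __init__(self, val, grad=0.0):
        self.val = val
        self.grad = grad

    @staticmethod
    def _lift(other):
        """Treat plain numbers as constants, i.e. values with derivative 0."""
        if isinstance(other, Dual):
            return other
        if isinstance(other, (int, float)):
            return Dual(other, 0.0)
        return None

    def __neg__(self):
        return Dual(-self.val, -self.grad)

    def __sub__(self, other):
        other = self._lift(other)
        if other is None:
            return NotImplemented
        return Dual(self.val - other.val, self.grad - other.grad)

    def __rsub__(self, other):
        # Called for `other - self` when `other` is a plain number.
        other = self._lift(other)
        if other is None:
            return NotImplemented
        return other - self

    def __truediv__(self, other):
        other = self._lift(other)
        if other is None:
            return NotImplemented
        # Quotient rule: (f/g)' = (f'*g - f*g') / g^2
        new_val = self.val / other.val
        new_grad = (self.grad * other.val - self.val * other.grad) / other.val ** 2
        return Dual(new_val, new_grad)

    def __rtruediv__(self, other):
        # Called for `other / self` when `other` is a plain number.
        other = self._lift(other)
        if other is None:
            return NotImplemented
        return other / self


x = Dual(2.0, 1.0)
a = 3 - x   # value 1.0, derivative -1.0
b = 1 / x   # value 0.5, derivative -0.25
print(a.val, a.grad)
print(b.val, b.grad)
```

Lifting plain numbers into constants first is a design choice that avoids repeating the isinstance branches in every operator, at the cost of one extra object per operation.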
We also need common mathematical functions, implemented here as methods that apply the chain rule:
    def log(self):
        # d/dx log(x) = 1/x
        new_val = np.log(self.val)
        new_grad = (1.0 / self.val) * self.grad
        return ADValue(new_val, new_grad)

    def sin(self):
        # d/dx sin(x) = cos(x)
        new_val = np.sin(self.val)
        new_grad = self.grad * np.cos(self.val)
        return ADValue(new_val, new_grad)
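Every elementary function follows the same pattern: evaluate the function, then multiply its derivative at the current value by the incoming grad. As a sketch of that pattern, the hypothetical helper apply_unary below takes a function and its derivative as arguments. The tuple representation and all names here are assumptions made for the example, not part of ADValue.

```python
import math


def apply_unary(fn, dfn, val, grad):
    """Chain rule for one elementary function: (fn(val), dfn(val) * grad)."""
    return fn(val), dfn(val) * grad


# exp: d/dx exp(x) = exp(x)
exp_val, exp_grad = apply_unary(math.exp, math.exp, 0.0, 1.0)

# cos: d/dx cos(x) = -sin(x)
cos_val, cos_grad = apply_unary(math.cos, lambda v: -math.sin(v), 0.0, 1.0)

# Composition sin(log(x)) at x = 1: the output of the inner step
# (value and derivative) becomes the input of the outer step.
inner_val, inner_grad = apply_unary(math.log, lambda v: 1.0 / v, 1.0, 1.0)
outer_val, outer_grad = apply_unary(math.sin, math.cos, inner_val, inner_grad)

print(exp_val, exp_grad)      # 1.0 1.0
print(outer_val, outer_grad)  # 0.0 1.0
```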
Testing the Implementation
Let's verify our implementation with the following function:
f(x₁, x₂) = ln(x₁) + x₁ * x₂ - sin(x₂)
We initialize the input variables. To compute the partial derivative with respect to x₁, we set its derivative to 1 and x₂'s derivative to 0:
x1 = ADValue(val=2.0, grad=1.0)
x2 = ADValue(val=5.0, grad=0.0)
# Compute f(x1, x2) = ln(x1) + x1 * x2 - sin(x2)
result = x1.log() + x1 * x2 - x2.sin()
print(f"Function value: {result.val:.4f}")
print(f"Partial derivative df/dx1: {result.grad:.4f}")
Output:
Function value: 11.6521
Partial derivative df/dx1: 5.5000
To obtain the derivative with respect to x₂, we swap the initial grad values and run the computation again:
x1 = ADValue(val=2.0, grad=0.0)
x2 = ADValue(val=5.0, grad=1.0)
result = x1.log() + x1 * x2 - x2.sin()
print(f"Partial derivative df/dx2: {result.grad:.4f}")
Output:
Partial derivative df/dx2: 1.7163
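Each forward pass yields the derivative with respect to one input, so a function of n inputs needs n passes for its full gradient. The sketch below wraps that loop in a hypothetical helper named gradient. To stay self-contained it uses a trimmed-down stand-in class, MiniADValue (also a hypothetical name), that supports only the operations this test function needs.

```python
import math


class MiniADValue:
    """Trimmed-down stand-in for ADValue (hypothetical, for this sketch only)."""

    def __init__(self, val, grad):
        self.val = val
        self.grad = grad

    def __add__(self, other):
        return MiniADValue(self.val + other.val, self.grad + other.grad)

    def __sub__(self, other):
        return MiniADValue(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        # Product rule: (f*g)' = f'*g + f*g'
        return MiniADValue(self.val * other.val,
                           self.grad * other.val + self.val * other.grad)

    def log(self):
        return MiniADValue(math.log(self.val), self.grad / self.val)

    def sin(self):
        return MiniADValue(math.sin(self.val), self.grad * math.cos(self.val))


def gradient(func, point):
    """Run one forward pass per input, giving that input derivative 1."""
    partials = []
    for i in range(len(point)):
        args = [MiniADValue(v, 1.0 if j == i else 0.0)
                for j, v in enumerate(point)]
        partials.append(func(*args).grad)
    return partials


def f(x1, x2):
    return x1.log() + x1 * x2 - x2.sin()


grads = gradient(f, [2.0, 5.0])
print([f"{g:.4f}" for g in grads])  # ['5.5000', '1.7163']
```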
Verification with Deep Learning Frameworks
We can verify our results against PyTorch:
import torch
x1 = torch.tensor([2.0], requires_grad=True)
x2 = torch.tensor([5.0], requires_grad=True)
f = torch.log(x1) + x1 * x2 - torch.sin(x2)
f.backward()
print(f"PyTorch - df/dx1: {x1.grad.item():.4f}")
print(f"PyTorch - df/dx2: {x2.grad.item():.4f}")
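PyTorch's backward() computes gradients with reverse-mode differentiation, so this is an independent check rather than a rerun of the same algorithm. Its results match ours to the four decimal places printed, which supports the correctness of the forward-mode implementation.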
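If PyTorch is not available, a central finite-difference approximation offers a lightweight sanity check. The sketch below uses only the standard library; the step size h = 1e-6 and the helper name central_diff are choices made for this example, and the comparison is approximate by nature.

```python
import math


def f(x1, x2):
    """The test function from this article, written with plain floats."""
    return math.log(x1) + x1 * x2 - math.sin(x2)


def central_diff(func, point, index, h=1e-6):
    """Approximate the partial derivative of func at point along one input."""
    plus = list(point)
    minus = list(point)
    plus[index] += h
    minus[index] -= h
    return (func(*plus) - func(*minus)) / (2 * h)


df_dx1 = central_diff(f, [2.0, 5.0], 0)
df_dx2 = central_diff(f, [2.0, 5.0], 1)
print(f"{df_dx1:.4f} {df_dx2:.4f}")  # 5.5000 1.7163
```

Unlike automatic differentiation, this approach carries truncation and rounding error that depend on h, so it is useful for checking results but not as a replacement.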