Computation Graphs and Automatic Differentiation in Deep Learning Frameworks

Modern deep-learning stacks rely on a computation graph to represent a neural network as a directed acyclic graph (DAG) whose nodes are tensor operations and whose edges carry multi-dimensional arrays (tensors). This abstraction allows the framework to reason about the model as a whole, generate the backward operations, schedule kernels, and reclaim memory without burdening the user.

Why a graph?

Training a large model raises several practical questions:

  • How can we automatically obtain gradients for millions of parameters?
  • How can we fuse, reorder, or eliminate operations at compile time?
  • How do we map arithmetic kernels to GPUs, TPUs, or NPUs?
  • How do we pre-allocate intermediate buffers produced by back-propagation?

A unified graph representation answers all of these in a single data structure. The user writes Python; the framework converts the imperative description into a DAG, and the runtime executes an optimized plan.

From math to DAG

Consider the scalar expression

f(x₁, x₂) = ln(x₁) + x₁·x₂ − sin(x₂)

The corresponding DAG contains:

  • Leaf nodes: x₁, x₂
  • Internal nodes: ln, mul, sin, add, sub
  • Root node: f

Every edge shows data flow; every node is an elementary operation.
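As a rough illustration of how the expression above maps onto a graph, the sketch below stores each node as an (operation, inputs) pair and evaluates the nodes in topological order. The dictionary layout, the intermediate names v1 to v4, and the evaluate helper are invented for this example; they are not any framework's internal format.

```python
import math

# Each node: name -> (operation, list of input node names).
# Leaves have no operation; their values are supplied at run time.
GRAPH = {
    "x1": (None, []),
    "x2": (None, []),
    "v1": ("ln",  ["x1"]),        # ln(x1)
    "v2": ("mul", ["x1", "x2"]),  # x1 * x2
    "v3": ("sin", ["x2"]),        # sin(x2)
    "v4": ("add", ["v1", "v2"]),  # ln(x1) + x1*x2
    "f":  ("sub", ["v4", "v3"]),  # sub node; its output is the root value f
}

OPS = {
    "ln":  lambda a: math.log(a),
    "sin": lambda a: math.sin(a),
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
}

def evaluate(graph, leaves):
    """Evaluate every node. The dict above is written in topological
    order, so a single pass in insertion order is enough here."""
    values = dict(leaves)
    for name, (op, inputs) in graph.items():
        if op is not None:
            values[name] = OPS[op](*(values[i] for i in inputs))
    return values

values = evaluate(GRAPH, {"x1": 2.0, "x2": 5.0})
print(values["f"])
```

A real framework would sort the nodes itself rather than rely on insertion order, but the data flow is the same: each node reads the values on its incoming edges and writes one value on its outgoing edge.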

Tensors and operators

Data containers

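Tensors are the data containers: multi-dimensional arrays that travel along the edges of the graph. Operators consume and produce tensors. Examples range from simple algebra (add, mul) to domain-specific kernels (conv2d, batch_norm). Each operator has: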

  • A forward function that computes outputs from inputs.
  • A backward function that computes vector-Jacobian products for autodiff.
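A minimal sketch of this forward/backward pairing, in plain Python with scalars standing in for tensors. The class names Mul and Sin and the numeric_grad helper are illustrative assumptions, not a framework API. For scalars, the vector-Jacobian product reduces to multiplying the upstream gradient by the local derivative.

```python
import math

class Mul:
    @staticmethod
    def forward(a, b):
        return a * b

    @staticmethod
    def backward(a, b, grad_out):
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream gradient
        return grad_out * b, grad_out * a

class Sin:
    @staticmethod
    def forward(a):
        return math.sin(a)

    @staticmethod
    def backward(a, grad_out):
        # d(sin a)/da = cos a
        return (grad_out * math.cos(a),)

def numeric_grad(fn, args, i, eps=1e-6):
    """Central finite difference of fn with respect to args[i]."""
    hi = list(args)
    lo = list(args)
    hi[i] += eps
    lo[i] -= eps
    return (fn(*hi) - fn(*lo)) / (2 * eps)

a, b = 3.0, 4.0
grad_a, grad_b = Mul.backward(a, b, grad_out=1.0)
(grad_sin,) = Sin.backward(a, grad_out=1.0)

# The analytic backward functions should agree with finite differences.
print(grad_a, numeric_grad(Mul.forward, (a, b), 0))
print(grad_sin, numeric_grad(Sin.forward, (a,), 0))
```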

Forward and backward graphs

For the mini-network

ŷ = ReLU(Conv2d(x, W, b))

the framework builds

  1. Forward DAG: two nodes, Conv2d followed by ReLU.
  2. Backward DAG: automatically generated by reversing edges and inserting gradient operators. The loss node becomes the new root, and gradients flow back to parameters W and b.
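The mini-network can be reproduced in PyTorch to see both graphs at work. This is only a sketch: the tensor shapes are arbitrary, and summing the output is a stand-in for a real loss, which the text does not specify.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)                       # one 3-channel 8x8 input
W = torch.randn(4, 3, 3, 3, requires_grad=True)   # 4 output channels, 3x3 kernels
b = torch.randn(4, requires_grad=True)            # one bias per output channel

y_hat = F.relu(F.conv2d(x, W, b))   # forward DAG: Conv2d -> ReLU
loss = y_hat.sum()                  # stand-in scalar loss (an assumption)

# Each result remembers the backward node that produced it.
print(type(y_hat.grad_fn).__name__)

loss.backward()                     # walk the backward DAG from the loss
print(W.grad.shape, b.grad.shape)   # gradients have the same shapes as W and b
```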

PyTorch’s dynamic approach

Eager execution

PyTorch creates the graph on the fly. Every tensor operation immediately performs its forward computation and appends a node to a tape:

import torch

W = torch.randn(5, 3, requires_grad=True)   # weight: 5 outputs, 3 inputs
b = torch.randn(5, requires_grad=True)      # bias of shape (5,) broadcasts over the batch
x = torch.randn(10, 3)                      # batch of 10 samples

y = torch.mm(x, W.t()) + b        # forward executed right here; y has shape (10, 5)
loss = y.pow(2).mean()
print(loss.item())               # numeric value already available
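One consequence of building the graph during execution is that ordinary Python control flow can change which operations are recorded from one call to the next. The function branchy below is a made-up example to show this; it is not part of PyTorch.

```python
import torch

def branchy(x):
    # Plain Python branching: the path taken decides which
    # operations are recorded for this particular call.
    if x.sum() > 0:
        return (2 * x).sum()
    return (x ** 2).sum()

pos = torch.tensor([1.0, 2.0], requires_grad=True)
neg = torch.tensor([-1.0, -2.0], requires_grad=True)

branchy(pos).backward()   # this call recorded mul -> sum
branchy(neg).backward()   # this call recorded pow -> sum

print(pos.grad)   # gradient of 2*x is 2 for every element
print(neg.grad)   # gradient of x**2 is 2*x
```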

Gradient tape disposal

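After loss.backward() the buffers saved on the tape are freed to reclaim memory. To run backward() a second time over the same graph, pass retain_graph=True. Higher-order derivatives additionally need create_graph=True, which records the backward pass itself so that it can be differentiated.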
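A small sketch of how this behaves in practice, using a scalar cubic chosen only for illustration. It shows three cases: a second backward() call failing by default, retain_graph=True allowing a repeat call, and create_graph=True, which is the flag that makes the gradient itself differentiable.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# 1) Default: saved buffers are freed by the first backward pass.
y = x ** 3
y.backward()
try:
    y.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

# 2) retain_graph=True keeps the buffers, so backward can run again.
x.grad = None                    # clear the gradient from step 1
y = x ** 3
y.backward(retain_graph=True)
y.backward()                     # gradients accumulate across calls
accumulated = x.grad.item()      # 3*x^2 added twice

# 3) create_graph=True records the backward pass so it can be differentiated.
y = x ** 3
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)   # 3*x^2
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)                # 6*x

print(second_call_failed, accumulated, dy_dx.item(), d2y_dx2.item())
```

Note that .grad accumulates across backward() calls, which is why step 2 resets it first.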

Custom autograd functions

Users can define new primitives by subclassing torch.autograd.Function:

class Swish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z):
        ctx.save_for_backward(z)
        return z * torch.sigmoid(z)

    @staticmethod
    def backward(ctx, grad_out):
        z, = ctx.saved_tensors
        sig = torch.sigmoid(z)
        return grad_out * (sig * (1 + z * (1 - sig)))

swish = Swish.apply
out = swish(torch.mm(x, W.t()))   # the input must depend on W for gradients to reach it
out.sum().backward()
print(W.grad.shape)   # torch.Size([5, 3])

Built-in operators follow the same forward/backward pattern internally, ensuring that the dynamic graph can grow arbitrarily complex while still supporting automatic differentiation end-to-end.
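A hand-written backward is easy to get wrong, so it is worth checking numerically. The sketch below repeats the Swish definition to stay self-contained, then uses torch.autograd.gradcheck to compare the analytic gradient with finite differences and compares the forward output with the built-in torch.nn.functional.silu. The tolerances and the input shape are arbitrary choices for this example.

```python
import torch
import torch.nn.functional as F

class Swish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z):
        ctx.save_for_backward(z)
        return z * torch.sigmoid(z)

    @staticmethod
    def backward(ctx, grad_out):
        (z,) = ctx.saved_tensors
        sig = torch.sigmoid(z)
        return grad_out * (sig * (1 + z * (1 - sig)))

torch.manual_seed(0)
# gradcheck relies on finite differences, so it expects float64 inputs.
z = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

# True if the analytic backward matches the numerical Jacobian.
analytic_ok = torch.autograd.gradcheck(Swish.apply, (z,), eps=1e-6, atol=1e-4)

# The forward pass should agree with PyTorch's built-in SiLU (same function).
matches_builtin = torch.allclose(Swish.apply(z), F.silu(z))

print(analytic_ok, matches_builtin)
```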

Tags: computation-graph automatic-differentiation pytorch tensor autograd

Posted on Thu, 04 Jun 2026 19:06:43 +0000 by CoreyR