tutorial 2025-04-09 14 min read

Backpropagation Explained for Programmers

Understand backpropagation through code and the chain rule. Build a miniature autograd engine from scratch to see exactly how gradients flow through neural networks.

The Chain Rule is All You Need

Backpropagation sounds scary. It's just the chain rule from calculus applied systematically to a computation graph. If you understand function composition, you can understand backpropagation.

The chain rule: if y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx).

That's it. Backprop chains this rule through every operation in your neural network.
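You can sanity-check the chain rule numerically with a central finite difference (a standalone sketch, not part of the engine we build below; sin(x²) is just an arbitrary composition chosen for illustration):

```python
import math

# A standalone numeric check of the chain rule. Take y = sin(x**2):
# the chain rule says dy/dx = cos(x**2) * 2*x.
def numeric_derivative(fn, x, h=1e-6):
    # Central-difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 1.3
analytic = math.cos(x**2) * 2 * x
numeric = numeric_derivative(lambda t: math.sin(t**2), x)
print(abs(analytic - numeric) < 1e-6)  # True
```

This kind of finite-difference comparison is the standard way to test any hand-derived gradient.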

Build a Miniature Autograd Engine

The best way to understand backpropagation is to implement it. Let's build a tiny version of PyTorch's autograd.

import math

class Value:
    """A scalar value with automatic gradient computation."""

    def __init__(self, data: float, _children=(), _op=""):
        self.data = data
        self.grad = 0.0  # Gradient of loss w.r.t. this value
        self._backward = lambda: None  # Function to compute gradients
        self._prev = set(_children)
        self._op = _op

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

Defining Operations with Backward Passes

Each operation needs a forward computation AND a backward computation (how to compute input gradients given output gradient):

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        result = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            # Chain rule: grad_a += result.grad * 1
            self.grad += result.grad
            other.grad += result.grad

        result._backward = _backward
        return result

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        result = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * result.grad
            other.grad += self.data * result.grad

        result._backward = _backward
        return result

    def __pow__(self, power):
        assert isinstance(power, (int, float))
        result = Value(self.data ** power, (self,), f"**{power}")

        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += power * (self.data ** (power - 1)) * result.grad

        result._backward = _backward
        return result

    def relu(self):
        result = Value(max(0, self.data), (self,), "relu")

        def _backward():
            # d(relu(x))/dx = 1 if x > 0 else 0
            self.grad += (result.data > 0) * result.grad

        result._backward = _backward
        return result
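A quick standalone numeric check of the ReLU derivative claimed in the comment above (we stay away from the kink at x = 0, where ReLU is not differentiable):

```python
# ReLU's derivative is 1 for x > 0 and 0 for x < 0; verify with a
# central finite difference away from the kink at 0.
def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

relu_fn = lambda x: max(0.0, x)
print(round(central_diff(relu_fn, 2.0)))   # 1
print(round(central_diff(relu_fn, -2.0)))  # 0
```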

    def exp(self):
        result = Value(math.exp(self.data), (self,), "exp")

        def _backward():
            # d(e^x)/dx = e^x
            self.grad += result.data * result.grad

        result._backward = _backward
        return result

    # Convenience methods
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other**-1
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __rsub__(self, other): return other + (-self)
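One subtlety worth calling out: every _backward accumulates with += rather than assigning. A value that feeds several operations receives one gradient contribution per path, and those contributions must sum. Here is a standalone sketch (plain floats, no Value class) for y = x*x + x, where dy/dx = 2x + 1:

```python
# Accumulating gradient contributions path by path for y = x*x + x.
x = 3.0
out_grad = 1.0            # dy/dy = 1
grad_x = 0.0
grad_x += x * out_grad    # path through the first factor of x*x
grad_x += x * out_grad    # path through the second factor of x*x
grad_x += 1.0 * out_grad  # path through the lone "+ x" term
print(grad_x)  # 7.0, matching 2*3 + 1 at x = 3
```

Assigning instead of accumulating would keep only the last path's contribution and silently produce wrong gradients.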

The Backward Pass: Topological Sort

To compute gradients correctly, we need to call each node's _backward() in reverse topological order (outputs before inputs):

    def backward(self):
        """Compute gradients for all nodes via reverse-mode autodiff."""

        # Build topological order
        topo = []
        visited = set()

        def build_topo(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    build_topo(child)
                topo.append(node)

        build_topo(self)

        # Gradient of loss w.r.t. itself = 1
        self.grad = 1.0

        # Propagate gradients in reverse topological order
        for node in reversed(topo):
            node._backward()

Testing It

# Simple example: f(x) = x^2
x = Value(3.0)
y = x ** 2
y.backward()
print(x.grad)  # 6.0 ✓ (dy/dx = 2x at x=3)

# Chain rule: f(x) = (x + 2) * x
x = Value(3.0)
y = (x + Value(2.0)) * x
y.backward()
print(x.grad)  # 8.0 ✓ (y = x^2 + 2x, dy/dx = 2x + 2 = 8 at x=3)

# Verify against PyTorch
import torch
x_torch = torch.tensor(3.0, requires_grad=True)
y_torch = (x_torch + 2) * x_torch
y_torch.backward()
print(x_torch.grad)  # tensor(8.) ✓
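If PyTorch isn't available, a central finite difference gives the same sanity check (a sketch; the tolerance here is chosen loosely):

```python
# Finite-difference check of f(x) = (x + 2) * x at x = 3, where f'(3) = 8.
f = lambda x: (x + 2) * x
h = 1e-6
numeric = (f(3 + h) - f(3 - h)) / (2 * h)
print(abs(numeric - 8.0) < 1e-4)  # True
```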

Building a Neural Network with Our Engine

import random

class Neuron:
    def __init__(self, n_inputs: int):
        self.weights = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.bias = Value(0.0)

    def __call__(self, inputs: list[Value]) -> Value:
        # Weighted sum + bias
        activation = sum((w * x for w, x in zip(self.weights, inputs)), self.bias)
        return activation.relu()

    def parameters(self) -> list[Value]:
        return self.weights + [self.bias]

class Layer:
    def __init__(self, n_inputs: int, n_outputs: int):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x: list[Value]) -> list[Value]:
        return [neuron(x) for neuron in self.neurons]

    def parameters(self) -> list[Value]:
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, n_inputs: int, layer_sizes: list[int]):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(sizes)-1)]

    def __call__(self, x: list[float]) -> Value | list[Value]:
        out = [Value(xi) for xi in x]
        for layer in self.layers:
            out = layer(out)
        return out[0] if len(out) == 1 else out

    def parameters(self) -> list[Value]:
        return [p for layer in self.layers for p in layer.parameters()]

    def zero_grad(self):
        # Each _backward accumulates with +=, so gradients must be
        # reset to zero before every training step.
        for p in self.parameters():
            p.grad = 0.0
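As a quick structural sanity check, parameter counts follow directly from the layer sizes: each neuron owns n_inputs weights plus one bias, so a layer holds (n_in + 1) * n_out parameters. For the MLP(2, [4, 4, 1]) trained below:

```python
# Parameter count for MLP(n_inputs=2, layer_sizes=[4, 4, 1]): each layer
# holds (n_in + 1) * n_out values (weights plus one bias per neuron).
sizes = [2, 4, 4, 1]
per_layer = [(sizes[i] + 1) * sizes[i + 1] for i in range(len(sizes) - 1)]
print(per_layer, sum(per_layer))  # [12, 20, 5] 37
```

37 scalar parameters is tiny, but every one of them gets its gradient from the same backward() machinery.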

Training the Network

# XOR problem — not linearly separable
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

model = MLP(n_inputs=2, layer_sizes=[4, 4, 1])

learning_rate = 0.1

for step in range(1000):
    # Forward pass
    predictions = [model(x) for x in X]

    # MSE loss
    loss = sum((pred - target)**2 for pred, target in zip(predictions, y))
    loss = loss * (1 / len(y))

    # Backward pass
    model.zero_grad()
    loss.backward()

    # Gradient descent
    for param in model.parameters():
        param.data -= learning_rate * param.grad

    if step % 100 == 0:
        print(f"Step {step}: loss={loss.data:.6f}")

# Test
for x, target in zip(X, y):
    pred = model(x)
    print(f"Input: {x}, Target: {target}, Predicted: {pred.data:.4f}")

# Input: [0, 0], Target: 0, Predicted: 0.0123
# Input: [0, 1], Target: 1, Predicted: 0.9876
# Input: [1, 0], Target: 1, Predicted: 0.9854
# Input: [1, 1], Target: 0, Predicted: 0.0234  ✓

How PyTorch Does This at Scale

Our engine computes on scalars. PyTorch operates on tensors (n-dimensional arrays) using the same principles:

  1. Every tensor operation creates a node in the computation graph
  2. Each node stores a grad_fn (equivalent to our _backward)
  3. .backward() performs topological sort + applies chain rule at each node
  4. Gradients flow from the scalar loss back to all leaf tensors with requires_grad=True

The math is identical. PyTorch's implementation:

  • Operates on batches of data simultaneously (vectorized)
  • Runs the computation graph on GPU
  • Handles edge cases (in-place operations, non-differentiable ops)
  • Uses efficient C++ and CUDA kernels for performance

Common Backpropagation Bugs

Vanishing gradients: Gradients shrink as they propagate through many layers.

# Problem: sigmoid at every layer
# Gradient of sigmoid is at most 0.25 — multiplied across 10 layers: 0.25^10 ≈ 0.000001
# Fix: ReLU (gradient = 1 when active), residual connections (gradient highway)
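The arithmetic is easy to reproduce: sigmoid's derivative σ(x)(1 − σ(x)) peaks at 0.25 (at x = 0), so even in the best case the gradient factor shrinks geometrically with depth:

```python
# Best-case gradient magnitude after n sigmoid layers is 0.25 ** n,
# since sigmoid's derivative s(x) * (1 - s(x)) peaks at 0.25.
for n in (1, 5, 10, 20):
    print(n, 0.25 ** n)
# at n = 10 the factor is already 1/1048576 ≈ 9.5e-07
```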

Exploding gradients: Gradients grow exponentially.

# Fix: gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Gradients not flowing to some parameters: A layer is disconnected from the loss.

# Debug: check which parameters have None gradients after backward
for name, param in model.named_parameters():
    if param.grad is None and param.requires_grad:
        print(f"No gradient: {name}")

Now that you understand how gradients work, see how they're used at scale in our guide to distributed training.
