Tutorial · 2025-03-03 · 11 min read

Gradient Descent Explained for Programmers

Understand gradient descent through code, not calculus. A programmer-first explanation of how neural networks learn, with concrete Python examples and intuition for learning rates and optimization.

Tags: gradient descent, optimization, neural networks, Python, deep learning basics

The Core Idea in One Sentence

Gradient descent finds the minimum of a function by repeatedly stepping in the direction that decreases it fastest.

That's it. Everything else — Adam, momentum, learning rate schedules — is an optimization on top of this idea.

Start with the Problem

You have a model with parameters (weights). You have a loss function that measures how wrong the model is. You want to adjust the parameters to reduce the loss.

import numpy as np

# Simple example: fit y = wx to data
# True relationship: y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Start with a deliberately wrong initial weight
w = 0.0

def predict(x, w):
    return w * x

def loss(y_pred, y_true):
    # Mean Squared Error
    return np.mean((y_pred - y_true) ** 2)

# Current loss is terrible
print(loss(predict(x, w), y))  # 44.0

We need to find the w that minimizes the loss. The answer is w = 2.0, but in real ML we can't solve for it analytically — we have millions of parameters.
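As a sanity check, this particular one-parameter problem does have a closed-form solution (ordinary least squares), so we can verify the target before descending toward it. The point is that this shortcut disappears once the model has millions of parameters:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Minimizing mean((w*x - y)^2): set the derivative to zero and solve,
# which gives w = sum(x*y) / sum(x*x)
w_closed_form = np.sum(x * y) / np.sum(x * x)
print(w_closed_form)  # 2.0
```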

The Gradient

The gradient is the derivative of the loss with respect to each parameter. For our simple case:

Loss = mean((wx - y)^2)
dLoss/dw = mean(2(wx - y) * x)
         = 2 * mean(x * (wx - y))

In code:

def gradient(x, y, w):
    predictions = predict(x, w)
    errors = predictions - y
    # dLoss/dw
    return 2 * np.mean(x * errors)

print(gradient(x, y, w=0.0))  # Negative: loss decreases as w increases

The gradient tells us: if we increase w, does the loss go up or down, and by how much?
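One way to trust an analytic gradient is a finite-difference check: nudge w slightly and measure how the loss changes. This is a standard debugging trick for hand-written gradients; a sketch using central differences:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def loss_at(w):
    return np.mean((w * x - y) ** 2)

def analytic_gradient(w):
    return 2 * np.mean(x * (w * x - y))

# Central finite difference: (loss(w+h) - loss(w-h)) / (2h)
def numeric_gradient(w, h=1e-6):
    return (loss_at(w + h) - loss_at(w - h)) / (2 * h)

print(analytic_gradient(0.0))  # -44.0
print(numeric_gradient(0.0))   # ≈ -44.0
```

If the two numbers disagree by more than a tiny amount, the analytic gradient has a bug.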

The Update Rule

Gradient descent steps opposite to the gradient (because we want to decrease loss):

learning_rate = 0.01

w = 0.0
for step in range(100):
    grad = gradient(x, y, w)
    w = w - learning_rate * grad  # Step opposite to gradient

    if step % 10 == 0:
        current_loss = loss(predict(x, w), y)
        print(f"Step {step}: w={w:.4f}, loss={current_loss:.4f}")

# Step 0: w=0.4400, loss=26.7696
# Step 10: w=1.8700, loss=0.1860
# Step 50: w=2.0000, loss=0.0000
# Step 90: w=2.0000, loss=0.0000

After 100 steps, w ≈ 2.0. The algorithm found the answer without ever being told what it was.
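A fixed step count is arbitrary. A common variation (one sketch among many possible stopping rules) is to iterate until the gradient is nearly zero, which signals we are at or very near a minimum:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def gradient(x, y, w):
    return 2 * np.mean(x * (w * x - y))

w = 0.0
learning_rate = 0.01
for step in range(10_000):
    grad = gradient(x, y, w)
    if abs(grad) < 1e-6:  # Near-zero gradient: at (or very near) a minimum
        break
    w = w - learning_rate * grad

print(f"w={w:.4f} after {step} steps")  # w=2.0000, well under 100 steps
```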

Visualizing the Loss Landscape

Think of the loss function as a hilly terrain. Your current weight position is a ball on this terrain. Gradient descent rolls the ball downhill:

import matplotlib.pyplot as plt

w_values = np.linspace(-1, 5, 100)
loss_values = [loss(predict(x, w_val), y) for w_val in w_values]

plt.plot(w_values, loss_values)
plt.xlabel("w")
plt.ylabel("Loss")
plt.title("Loss landscape — minimum at w=2")
plt.show()

The gradient at any point tells you the slope. You move downhill by stepping opposite to it.

The Learning Rate

The learning rate controls step size. This is the most important hyperparameter in training:

# Too small: slow convergence
w = 0.0
for step in range(1000):
    w = w - 0.0001 * gradient(x, y, w)
print(f"After 1000 steps: w={w:.4f}")  # w≈1.78 — not converged

# Just right
w = 0.0
for step in range(100):
    w = w - 0.01 * gradient(x, y, w)
print(f"After 100 steps: w={w:.4f}")  # w≈2.0 — converged

# Too large: diverges
w = 0.0
for step in range(20):
    w = w - 1.0 * gradient(x, y, w)
    print(f"w={w:.4f}")  # Oscillates and blows up

Rule of thumb: start with lr=1e-3 for Adam, lr=0.01 for SGD. Use a learning rate finder if you're unsure.
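A crude stand-in for a learning rate finder is a sweep: run the same short training at several rates and compare final losses. (This is only a sketch of the idea; real finders ramp the rate within a single run.) Note how lr=0.1 diverges on this problem:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def loss_at(w):
    return np.mean((w * x - y) ** 2)

def gradient(w):
    return 2 * np.mean(x * (w * x - y))

results = {}
for lr in [1e-4, 1e-3, 1e-2, 1e-1]:
    w = 0.0
    for _ in range(100):
        w = w - lr * gradient(w)
    results[lr] = loss_at(w)  # lr=1e-2 wins; lr=1e-1 blows up
    print(f"lr={lr}: final loss after 100 steps = {results[lr]:.6f}")
```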

Stochastic vs. Batch Gradient Descent

Batch GD: Compute gradient over entire dataset. Accurate but slow for large data.

Stochastic GD (SGD): Compute gradient on one sample at a time. Noisy but fast.

Mini-batch GD: Compute gradient on a small batch (32–512 samples). Best of both.

# Mini-batch gradient descent
batch_size = 2
w = 0.0

for epoch in range(50):
    # Shuffle data
    indices = np.random.permutation(len(x))
    x_shuffled, y_shuffled = x[indices], y[indices]

    for i in range(0, len(x), batch_size):
        x_batch = x_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]

        grad = gradient(x_batch, y_batch, w)
        w = w - 0.01 * grad

print(f"w={w:.4f}")  # ≈ 2.0

Modern deep learning almost always uses mini-batch GD. What PyTorch calls "SGD" applies the plain update to whatever batch you pass it, so in practice it is mini-batch GD.
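Momentum, mentioned at the top, is the simplest refinement of the plain update: keep an exponentially decaying sum of past gradients and step along that, which smooths out noise and accelerates movement in consistent directions. A minimal sketch (using the PyTorch-style formulation v = beta*v + grad):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def gradient(x, y, w):
    return 2 * np.mean(x * (w * x - y))

w = 0.0
velocity = 0.0
lr = 0.01
beta = 0.9  # Fraction of past velocity that persists each step

for step in range(200):
    grad = gradient(x, y, w)
    velocity = beta * velocity + grad  # Decaying sum of past gradients
    w = w - lr * velocity

print(f"w={w:.4f}")  # ≈ 2.0
```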

Adam: SGD with Memory

Adam (Adaptive Moment Estimation) is the default optimizer for deep learning. It maintains running averages of both the gradient and its square:

# Adam pseudo-implementation
m = 0  # First moment (mean of gradients)
v = 0  # Second moment (variance of gradients)
beta1, beta2 = 0.9, 0.999
epsilon = 1e-8
lr = 0.001
t = 0

w = 0.0
for step in range(5000):
    t += 1
    grad = gradient(x, y, w)

    # Update moments
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2

    # Bias correction
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Update weight
    w = w - lr * m_hat / (np.sqrt(v_hat) + epsilon)

print(f"w={w:.4f}")  # ≈ 2.0 (each Adam step is at most about lr, hence the extra iterations)

The key insight: Adam scales the learning rate per parameter based on gradient history. Parameters with small, consistent gradients get larger effective learning rates.
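You can see this per-parameter scaling directly. For a roughly constant gradient g, Adam's step settles at about lr * g / |g|, i.e. close to lr regardless of the gradient's magnitude. A sketch with two artificial gradient streams:

```python
import numpy as np

beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 0.001

def adam_step_size(grads):
    """Magnitude of the Adam update after a stream of gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

small = adam_step_size([0.001] * 100)  # Tiny but consistent gradients
large = adam_step_size([10.0] * 100)   # Gradients 10,000x larger
print(small, large)  # Both ≈ lr = 0.001
```

Plain SGD would move the first parameter 10,000 times more slowly than the second; Adam moves both at roughly the same speed.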

In PyTorch

All of this is abstracted in PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0], [10.0]])

model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(200):
    pred = model(x)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()  # Clear old gradients
    loss.backward()        # Compute new gradients
    optimizer.step()       # Update weights

print(model.weight)  # ≈ tensor([[2.0]])

The training loop you see here is exactly the one used for GPT, ResNet, and every other neural network — just with more parameters and a more complex model.

Key Intuitions to Carry Forward

  1. Gradient = slope in parameter space. A negative gradient means the loss falls as the parameter grows, so you step in the positive direction.
  2. Learning rate = step size. Too big overshoots; too small crawls.
  3. Mini-batches inject noise. This noise actually helps escape local minima.
  4. Adam adapts per-parameter. It's rarely the wrong choice to start with.
  5. Loss landscapes for deep networks are high-dimensional. Local minima are rarely a problem in practice — saddle points are more common.
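The learning rate schedules mentioned at the start follow directly from intuition 2: take large steps early, small steps late. A minimal sketch using exponential decay on the same toy problem (the base rate 0.04 and decay factor 0.97 are arbitrary choices for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def gradient(x, y, w):
    return 2 * np.mean(x * (w * x - y))

w = 0.0
base_lr = 0.04
for step in range(100):
    lr = base_lr * 0.97 ** step  # Shrink the step size every iteration
    w = w - lr * gradient(x, y, w)

print(f"w={w:.4f}")  # ≈ 2.0
```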

Next: understand how gradients flow through layers with backpropagation explained for programmers.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.