The Core Idea in One Sentence
Gradient descent finds the minimum of a function by repeatedly stepping in the direction that decreases it fastest.
That's it. Everything else — Adam, momentum, learning rate schedules — is an optimization on top of this idea.
Start with the Problem
You have a model with parameters (weights). You have a loss function that measures how wrong the model is. You want to adjust the parameters to reduce the loss.
import numpy as np
# Simple example: fit y = wx to data
# True relationship: y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
# Start with a random weight
w = 0.0
def predict(x, w):
    return w * x

def loss(y_pred, y_true):
    # Mean Squared Error
    return np.mean((y_pred - y_true) ** 2)
# Current loss is terrible
print(loss(predict(x, w), y)) # 44.0
We need to find the w that minimizes the loss. For this toy problem the answer, w = 2.0, could even be found analytically, but real models have millions of parameters and no closed-form solution, so we need an iterative method.
The Gradient
The gradient is the derivative of the loss with respect to each parameter. For our simple case:
Loss = mean((wx - y)^2)
dLoss/dw = mean(2(wx - y) * x)
= 2 * mean(x * (wx - y))
In code:
def gradient(x, y, w):
    predictions = predict(x, w)
    errors = predictions - y
    # dLoss/dw
    return 2 * np.mean(x * errors)
print(gradient(x, y, w=0.0))  # -44.0: negative, so the loss decreases as w increases
The gradient tells us: if we increase w, does the loss go up or down, and by how much?
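Before trusting a hand-derived gradient, it is worth checking it against a numerical estimate. A minimal finite-difference sanity check on the toy setup above (the helper names are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def loss_at(w):
    return np.mean((w * x - y) ** 2)

def analytic_gradient(w):
    return 2 * np.mean(x * (w * x - y))

def numerical_gradient(w, h=1e-6):
    # Central difference: (L(w+h) - L(w-h)) / 2h
    return (loss_at(w + h) - loss_at(w - h)) / (2 * h)

for w in [0.0, 1.0, 3.0]:
    # The two estimates agree to several decimal places
    print(w, analytic_gradient(w), numerical_gradient(w))
```

This trick, known as gradient checking, is how autograd implementations are tested, and it is worth doing whenever you derive a gradient by hand.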
The Update Rule
Gradient descent steps opposite to the gradient (because we want to decrease loss):
learning_rate = 0.01
w = 0.0
for step in range(100):
    grad = gradient(x, y, w)
    w = w - learning_rate * grad  # Step opposite to the gradient
    if step % 10 == 0:
        current_loss = loss(predict(x, w), y)
        print(f"Step {step}: w={w:.4f}, loss={current_loss:.4f}")
# Step 0: w=0.4400, loss=26.7696
# Step 10: w=1.8700, loss=0.1860
# Step 20: w=1.9892, loss=0.0013
# Step 90: w=2.0000, loss=0.0000
After 100 steps, w ≈ 2.0. The algorithm found the answer without ever being told it.
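For this single-parameter least-squares problem we can verify the result in closed form: setting dLoss/dw = 0 and solving gives w* = mean(x·y) / mean(x²). A quick check:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Solve 2 * mean(x * (w*x - y)) = 0 for w
w_star = np.mean(x * y) / np.mean(x ** 2)
print(w_star)  # 2.0, the same value gradient descent converged to
```

Gradient descent earns its keep precisely in the cases where no such closed form exists.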
Visualizing the Loss Landscape
Think of the loss function as a hilly terrain. Your current weight position is a ball on this terrain. Gradient descent rolls the ball downhill:
import matplotlib.pyplot as plt
w_values = np.linspace(-1, 5, 100)
loss_values = [loss(predict(x, w_val), y) for w_val in w_values]
plt.plot(w_values, loss_values)
plt.xlabel("w")
plt.ylabel("Loss")
plt.title("Loss landscape — minimum at w=2")
plt.show()
The gradient at any point tells you the slope. You move downhill by stepping opposite to it.
The Learning Rate
The learning rate controls step size. This is the most important hyperparameter in training:
# Too small: slow convergence
w = 0.0
for step in range(1000):
    w = w - 0.0001 * gradient(x, y, w)
print(f"After 1000 steps: w={w:.4f}")  # w=1.7789 — not converged
# Just right
w = 0.0
for step in range(100):
    w = w - 0.01 * gradient(x, y, w)
print(f"After 100 steps: w={w:.4f}")  # w=2.0000 — converged
# Too large: diverges
w = 0.0
for step in range(20):
    w = w - 1.0 * gradient(x, y, w)
    print(f"w={w:.4f}")  # Oscillates and blows up
Rule of thumb: start with lr=1e-3 for Adam, lr=0.01 for SGD. Use a learning rate finder if you're unsure.
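A crude stand-in for a learning rate finder, using the toy problem above, is to sweep a few candidate rates and compare final losses. The `final_loss` helper is illustrative, not a library function:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def gradient(w):
    return 2 * np.mean(x * (w * x - y))

def final_loss(lr, steps=100):
    # Run plain gradient descent and report the loss at the end
    w = 0.0
    for _ in range(steps):
        w = w - lr * gradient(w)
    return np.mean((w * x - y) ** 2)

for lr in [1e-4, 1e-3, 1e-2, 5e-2]:
    print(f"lr={lr}: final loss {final_loss(lr):.6f}")
```

Real learning rate finders instead ramp lr up during a single run and watch where the loss starts to explode, but the underlying idea is the same.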
Stochastic vs. Batch Gradient Descent
Batch GD: Compute gradient over entire dataset. Accurate but slow for large data.
Stochastic GD (SGD): Compute gradient on one sample at a time. Noisy but fast.
Mini-batch GD: Compute gradient on a small batch (32–512 samples). Best of both.
# Mini-batch gradient descent
batch_size = 2
w = 0.0
for epoch in range(50):
    # Shuffle data each epoch
    indices = np.random.permutation(len(x))
    x_shuffled, y_shuffled = x[indices], y[indices]
    for i in range(0, len(x), batch_size):
        x_batch = x_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]
        grad = gradient(x_batch, y_batch, w)
        w = w - 0.01 * grad
print(f"w={w:.4f}")  # ≈ 2.0
Modern deep learning always uses mini-batch GD. "SGD" in PyTorch is actually mini-batch GD.
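Momentum, mentioned at the top, sits between plain SGD and Adam: it accumulates an exponentially decaying average of past gradients, so consistent directions speed up and oscillations damp out. A minimal sketch on the toy problem, using the same formulation as PyTorch's SGD momentum:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def gradient(w):
    return 2 * np.mean(x * (w * x - y))

w = 0.0
velocity = 0.0
lr, beta = 0.01, 0.9

for step in range(200):
    # velocity remembers past gradients; beta controls how quickly it forgets
    velocity = beta * velocity + gradient(w)
    w = w - lr * velocity

print(f"w={w:.4f}")  # ≈ 2.0
```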
Adam: SGD with Memory
Adam (Adaptive Moment Estimation) is the default optimizer for deep learning. It maintains running averages of both the gradient and the squared gradient:
# Adam pseudo-implementation
m = 0.0  # First moment (running mean of gradients)
v = 0.0  # Second moment (running mean of squared gradients)
beta1, beta2 = 0.9, 0.999
epsilon = 1e-8
lr = 0.1  # larger than the usual 1e-3 default so this tiny problem converges in 200 steps
t = 0
w = 0.0
for step in range(200):
    t += 1
    grad = gradient(x, y, w)
    # Update moments
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Update weight
    w = w - lr * m_hat / (np.sqrt(v_hat) + epsilon)
print(f"w={w:.4f}")  # ≈ 2.0
The key insight: Adam scales the learning rate per parameter based on gradient history. Parameters with small, consistent gradients get larger effective learning rates.
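To make the per-parameter scaling concrete, feed the Adam update two constant gradients of very different magnitudes: once the moments settle, the step size comes out near lr in both cases. The `adam_step_size` helper is purely illustrative:

```python
import numpy as np

beta1, beta2, epsilon, lr = 0.9, 0.999, 1e-8, 0.001

def adam_step_size(grad, t=100):
    # Apply t updates with a constant gradient; return the final step magnitude
    m = v = 0.0
    for i in range(1, t + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**i)
        v_hat = v / (1 - beta2**i)
    return lr * m_hat / (np.sqrt(v_hat) + epsilon)

print(adam_step_size(100.0))   # ≈ 0.001
print(adam_step_size(0.001))   # ≈ 0.001: same effective step despite a 100,000x gradient gap
```

This is why Adam is far less sensitive to gradient scale than plain SGD, which would take steps differing by five orders of magnitude here.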
In PyTorch
All of this is abstracted in PyTorch:
import torch
import torch.nn as nn
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0], [10.0]])
model = nn.Linear(1, 1, bias=False)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
for epoch in range(200):
    pred = model(x)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()  # Clear old gradients
    loss.backward()        # Compute new gradients
    optimizer.step()       # Update weights
print(model.weight)  # ≈ tensor([[2.0]])
The training loop you see here is essentially the one used for GPT, ResNet, and every other neural network — just with more parameters and a more complex model.
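The learning rate schedules mentioned at the start decay lr over the course of training. A minimal cosine-decay sketch on the earlier numpy problem (the schedule logic itself, not a framework API):

```python
import math
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def gradient(w):
    return 2 * np.mean(x * (w * x - y))

w = 0.0
base_lr, total_steps = 0.01, 100

for step in range(total_steps):
    # Cosine decay from base_lr at step 0 down to ~0 at the end
    lr = base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))
    w = w - lr * gradient(w)

print(f"w={w:.4f}")  # ≈ 2.0
```

PyTorch ships this as torch.optim.lr_scheduler.CosineAnnealingLR; the idea is to take big steps early and refine with small ones late.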
Key Intuitions to Carry Forward
- Gradient = slope in parameter space. Stepping opposite the gradient locally decreases the loss, so a negative gradient pushes the parameter up.
- Learning rate = step size. Too big overshoots; too small crawls.
- Mini-batches inject noise. This noise can actually help training escape saddle points and poor local minima.
- Adam adapts per-parameter. It's rarely the wrong choice to start with.
- Loss landscapes for deep networks are high-dimensional. Local minima are rarely a problem in practice — saddle points are more common.
Next: understand how gradients flow through layers with backpropagation explained for programmers.