Why Transformers Won
Before 2017, sequential models (RNNs, LSTMs) dominated NLP. They processed text word-by-word, maintaining a hidden state. The problem: information about early tokens had to survive many steps to influence later tokens — it often got lost.
The transformer's key insight: process all tokens simultaneously, with every token attending to every other token. No sequential processing. No information bottleneck. And it parallelizes perfectly on GPUs.
The Architecture in One Diagram
Input: "The cat sat"
│
▼
[Token Embeddings] "The"→[0.2, -0.1, ...], "cat"→[...], "sat"→[...]
│
▼
[Positional Encoding] Add position information (tokens are now position-aware)
│
▼
[Multi-Head Attention] Each token attends to all other tokens
│
▼
[Feed-Forward Network] Process each position independently
│
▼
[Layer Norm + Residual] (Repeated N times — 12 layers for BERT-base)
│
▼
Output representations (one vector per token)
Attention: The Core Mechanism
Attention answers: for each token, how much should it "look at" each other token when building its representation?
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """
    query: (seq_len, d_k) — "what am I looking for?"
    key:   (seq_len, d_k) — "what do I contain?"
    value: (seq_len, d_v) — "what do I return if you attend to me?"
    """
    d_k = query.size(-1)

    # Compute similarity scores (dot product)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)
    # scores shape: (seq_len, seq_len)
    # scores[i, j] = how much token i should attend to token j

    # Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    output = torch.matmul(attention_weights, value)
    # output shape: (seq_len, d_v)
    return output, attention_weights

# Concrete example
seq_len = 4   # "The cat sat on"
d_model = 8

# Random Q, K, V for illustration
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

out, weights = attention(Q, K, V)
print(weights.shape)  # torch.Size([4, 4]) — each token attends to all 4 tokens
print(weights[1])     # Attention weights for "cat" over all tokens
The Query-Key-Value Analogy
Think of it like a database lookup:
- Query: what you're searching for
- Key: the index of each stored item
- Value: the content of each stored item
Attention computes similarity between your query and all keys, then returns a weighted average of the values.
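The lookup analogy can be made concrete with toy numbers (chosen purely for illustration, not taken from any trained model): one query scored against two key/value pairs.

```python
import torch
import torch.nn.functional as F

query = torch.tensor([1.0, 0.0])            # "what am I looking for?"
keys = torch.tensor([[1.0, 0.0],            # key 0: aligned with the query
                     [0.0, 1.0]])           # key 1: orthogonal to it
values = torch.tensor([[10.0], [20.0]])     # content behind each key

scores = keys @ query / (2 ** 0.5)          # dot-product similarity, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=0)          # similarities become probabilities
output = weights @ values                   # weighted average of the values

print(weights)  # key 0 gets more weight than key 1
print(output)   # pulled toward 10.0, the value behind the matching key
```

Note that the output is a blend of both values, not a hard lookup: softmax gives every key some weight, which is what makes attention differentiable.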
Multi-Head Attention
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, heads, seq_len, d_k)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project and split into heads
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        # Scaled dot-product attention (per head)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = F.softmax(scores, dim=-1)
        attended = torch.matmul(weights, V)

        # Concatenate heads
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, -1, self.d_model)
        return self.W_o(attended)
Why multiple heads? Each head learns to attend to different types of relationships simultaneously:
- Head 1 might focus on syntactic relationships (subject-verb)
- Head 2 might focus on coreference (pronoun → noun)
- Head 3 might focus on positional proximity
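The module above mirrors what PyTorch ships as `nn.MultiheadAttention`, so a quick shape check with the built-in version shows the contract (the dimensions here are toy values chosen only for illustration):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)        # (batch, seq_len, d_model)
out, weights = mha(x, x, x)      # self-attention: Q = K = V = x

print(out.shape)      # torch.Size([2, 5, 16]), one vector per token
print(weights.shape)  # torch.Size([2, 5, 5]), averaged over heads by default
```

Passing the same tensor as query, key, and value is exactly what "self-attention" means; cross-attention (as in encoder-decoder models) passes different tensors for the query vs. the key/value.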
Positional Encoding
Attention has no built-in notion of token order: shuffle the input tokens and the outputs shuffle right along with them. Positional encodings add the missing order information:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        # Wavelengths form a geometric progression from 2*pi to 10000*2*pi
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            -(torch.log(torch.tensor(10000.0)) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # Buffer: saved with the model, but not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[:, :x.size(1)]
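Computing the same sinusoidal table outside the class shows the two properties that make it work: every position gets a distinct pattern, and nearby positions get more similar patterns than distant ones (small toy dimensions here, for illustration only):

```python
import torch
from torch.nn.functional import cosine_similarity as sim

d_model, max_len = 8, 50
position = torch.arange(max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float()
                     * -(torch.log(torch.tensor(10000.0)) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Every position's encoding is distinct...
print(torch.equal(pe[3], pe[7]))  # False

# ...and neighbours are more alike than far-apart positions:
print(sim(pe[3], pe[4], dim=0) > sim(pe[3], pe[40], dim=0))  # tensor(True)
```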
Modern models such as LLaMA use Rotary Position Embeddings (RoPE) instead, which generalize better to longer sequences, but the idea is the same: inject token order into an otherwise order-blind mechanism.
BERT vs GPT: Two Architectures, Two Use Cases
BERT (Encoder-only)
- Sees the full sequence (bidirectional attention)
- Pre-trained with: Masked Language Modeling (predict missing tokens)
- Use for: classification, NER, Q&A, embeddings
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tokenizer("This movie was great!", return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.logits # Shape: (1, 2)
predicted_class = logits.argmax().item()
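One caveat: the classification head loaded above is freshly initialized, so its logits are arbitrary until you fine-tune. Turning logits into class probabilities is just a softmax; the logits below are made-up toy values, not the output of a trained sentiment model:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.2, 2.3]])    # shape (1, 2), one row per input
probs = F.softmax(logits, dim=-1)       # normalize to probabilities

print(probs)                            # tensor([[0.0293, 0.9707]])
print(probs.argmax(dim=-1).item())      # 1
```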
GPT (Decoder-only)
- Only sees past tokens (causal/masked attention)
- Pre-trained with: Next token prediction
- Use for: text generation, completion, chat
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer.encode("The transformer architecture", return_tensors="pt")
outputs = model.generate(
    input_ids,
    max_new_tokens=50,
    temperature=0.8,
    do_sample=True,
)
print(tokenizer.decode(outputs[0]))
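The "only sees past tokens" rule is enforced with a causal mask: a lower-triangular matrix that blocks attention to future positions. Passed as the `mask` argument to the `MultiHeadAttention` class above, positions where the mask is 0 are filled with -1e9 before the softmax, so token i can never attend to token j > i.

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = may attend, 0 = blocked
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```

Row i is the set of positions token i may attend to: token 0 sees only itself, the final token sees everything before it.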
Fine-Tuning a Pretrained Model
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer,
)
from datasets import Dataset

# Load pretrained model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Prepare data (texts: list of strings, labels: list of ints)
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

dataset = Dataset.from_dict({"text": texts, "label": labels})
tokenized = dataset.map(tokenize, batched=True)
splits = tokenized.train_test_split(test_size=0.2)
train_ds, val_ds = splits["train"], splits["test"]

# Training config
args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
This fine-tunes all ~66M DistilBERT parameters. For larger models, use LoRA (freeze base model, train small adapter layers).
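The core of LoRA fits in a few lines of plain PyTorch. This is a conceptual sketch, not the `peft` library's implementation: freeze the pretrained weight W and learn a low-rank update B @ A on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        # A starts small and random, B starts at zero, so the adapted
        # layer initially behaves exactly like the pretrained one
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x):
        return self.linear(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # 12288 of ~600k parameters
```

Wrapping every attention projection this way trains only a few percent of the model, which is why LoRA makes fine-tuning multi-billion-parameter models feasible on a single GPU. In practice you would use the `peft` library rather than hand-rolling this.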
When to Use Transformers vs. Simpler Models
| Use Transformer when | Use simpler model when |
|---|---|
| You have text/sequence data | You have tabular data |
| Pre-trained models exist for your domain | No pretrained model fits and training from scratch is too costly |
| Semantic understanding matters | You need fast, interpretable inference |
| Your data is unstructured | Features are well-understood |
For tabular data with engineered features, gradient boosting (XGBoost) usually beats transformers with far less compute.
Ready to go deeper? Read about attention mechanisms in large language models and how they're optimized for production serving.