Why Transformers Won
Before 2017, sequential models (RNNs, LSTMs) dominated NLP. They processed text word-by-word, maintaining a hidden state. The problem: information about early tokens had to survive many steps to influence later tokens — it often got lost.
The transformer's key insight: process all tokens simultaneously, with every token attending to every other token. No sequential processing. No information bottleneck. And it parallelizes perfectly on GPUs.
The Architecture in One Diagram
Input: "The cat sat"
│
▼
[Token Embeddings] "The"→[0.2, -0.1, ...], "cat"→[...], "sat"→[...]
│
▼
[Positional Encoding] Add position information (tokens are now position-aware)
│
▼
[Multi-Head Attention] Each token attends to all other tokens
│
▼
[Feed-Forward Network] Process each position independently
│
▼
[Layer Norm + Residual] (Repeated N times — 12 layers for BERT-base)
│
▼
Output representations (one vector per token)
Attention: The Core Mechanism
Attention answers: for each token, how much should it "look at" each other token when building its representation?
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """
    query: (seq_len, d_k) — "what am I looking for?"
    key:   (seq_len, d_k) — "what do I contain?"
    value: (seq_len, d_v) — "what do I return if you attend to me?"
    """
    d_k = query.size(-1)

    # Compute similarity scores (dot product)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)
    # scores shape: (seq_len, seq_len)
    # scores[i, j] = how much token i should attend to token j

    # Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    output = torch.matmul(attention_weights, value)
    # output shape: (seq_len, d_v)
    return output, attention_weights

# Concrete example
seq_len = 4   # "The cat sat on"
d_model = 8

# Random Q, K, V for illustration
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

out, weights = attention(Q, K, V)
print(weights.shape)  # torch.Size([4, 4]) — each token attends to all 4 tokens
print(weights[1])     # Attention weights for "cat" over all tokens
The Query-Key-Value Analogy
Think of it like a database lookup:
- Query: what you're searching for
- Key: the index of each stored item
- Value: the content of each stored item
Attention computes similarity between your query and all keys, then returns a weighted average of the values.
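The lookup analogy can be made concrete with toy numbers (chosen purely for illustration, not taken from any trained model): one query scored against two key/value pairs.

```python
import torch
import torch.nn.functional as F

query = torch.tensor([1.0, 0.0])            # "what am I looking for?"
keys = torch.tensor([[1.0, 0.0],            # key 0: aligned with the query
                     [0.0, 1.0]])           # key 1: orthogonal to it
values = torch.tensor([[10.0], [20.0]])     # content behind each key

scores = keys @ query / (2 ** 0.5)          # dot-product similarity, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=0)          # similarities become probabilities
output = weights @ values                   # weighted average of the values

print(weights)  # key 0 gets more weight than key 1
print(output)   # pulled toward 10.0, the value behind the matching key
```

Note that the output is a blend of both values, not a hard lookup: softmax gives every key some weight, which is what makes attention differentiable.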
Multi-Head Attention
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, heads, seq_len, d_k)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project and split into heads
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        # Scaled dot-product attention (per head)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = F.softmax(scores, dim=-1)
        attended = torch.matmul(weights, V)

        # Concatenate heads
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, -1, self.d_model)
        return self.W_o(attended)
Why multiple heads? Each head learns to attend to different types of relationships simultaneously:
- Head 1 might focus on syntactic relationships (subject-verb)
- Head 2 might focus on coreference (pronoun → noun)
- Head 3 might focus on positional proximity
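The module above mirrors what PyTorch ships as `nn.MultiheadAttention`, so a quick shape check with the built-in version shows the contract (the dimensions here are toy values chosen only for illustration):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)        # (batch, seq_len, d_model)
out, weights = mha(x, x, x)      # self-attention: Q = K = V = x

print(out.shape)      # torch.Size([2, 5, 16]), one vector per token
print(weights.shape)  # torch.Size([2, 5, 5]), averaged over heads by default
```

Passing the same tensor as query, key, and value is exactly what "self-attention" means; cross-attention (as in encoder-decoder models) passes different tensors for the query vs. the key/value.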
Positional Encoding
Attention has no built-in notion of token order: shuffle the input tokens and the outputs shuffle right along with them. Positional encodings add the missing order information:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        # Wavelengths form a geometric progression from 2*pi to 10000*2*pi
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            -(torch.log(torch.tensor(10000.0)) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # Buffer: saved with the model, but not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[:, :x.size(1)]
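Computing the same sinusoidal table outside the class shows the two properties that make it work: every position gets a distinct pattern, and nearby positions get more similar patterns than distant ones (small toy dimensions here, for illustration only):

```python
import torch
from torch.nn.functional import cosine_similarity as sim

d_model, max_len = 8, 50
position = torch.arange(max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float()
                     * -(torch.log(torch.tensor(10000.0)) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Every position's encoding is distinct...
print(torch.equal(pe[3], pe[7]))  # False

# ...and neighbours are more alike than far-apart positions:
print(sim(pe[3], pe[4], dim=0) > sim(pe[3], pe[40], dim=0))  # tensor(True)
```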
Modern models such as LLaMA use Rotary Position Embeddings (RoPE) instead, which generalize better to longer sequences, but the idea is the same: inject token order into an otherwise order-blind mechanism.
BERT vs GPT: Two Architectures, Two Use Cases
BERT (Encoder-only)
- Sees the full sequence (bidirectional attention)
- Pre-trained with: Masked Language Modeling (predict missing tokens)
- Use for: classification, NER, Q&A, embeddings
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tokenizer("This movie was great!", return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.logits # Shape: (1, 2)
predicted_class = logits.argmax().item()
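One caveat: the classification head loaded above is freshly initialized, so its logits are arbitrary until you fine-tune. Turning logits into class probabilities is just a softmax; the logits below are made-up toy values, not the output of a trained sentiment model:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.2, 2.3]])    # shape (1, 2), one row per input
probs = F.softmax(logits, dim=-1)       # normalize to probabilities

print(probs)                            # tensor([[0.0293, 0.9707]])
print(probs.argmax(dim=-1).item())      # 1
```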
GPT (Decoder-only)
- Only sees past tokens (causal/masked attention)
- Pre-trained with: Next token prediction
- Use for: text generation, completion, chat
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer.encode("The transformer architecture", return_tensors="pt")
outputs = model.generate(
    input_ids,
    max_new_tokens=50,
    temperature=0.8,
    do_sample=True,
)
print(tokenizer.decode(outputs[0]))
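The "only sees past tokens" rule is enforced with a causal mask: a lower-triangular matrix that blocks attention to future positions. Passed as the `mask` argument to the `MultiHeadAttention` class above, positions where the mask is 0 are filled with -1e9 before the softmax, so token i can never attend to token j > i.

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = may attend, 0 = blocked
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```

Row i is the set of positions token i may attend to: token 0 sees only itself, the final token sees everything before it.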
Fine-Tuning a Pretrained Model
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer,
)
from datasets import Dataset

# Load pretrained model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Prepare data (texts: list of strings, labels: list of ints)
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

dataset = Dataset.from_dict({"text": texts, "label": labels})
tokenized = dataset.map(tokenize, batched=True)
splits = tokenized.train_test_split(test_size=0.2)
train_ds, val_ds = splits["train"], splits["test"]

# Training config
args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
This fine-tunes all ~66M DistilBERT parameters. For larger models, use LoRA (freeze base model, train small adapter layers).
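The core of LoRA fits in a few lines of plain PyTorch. This is a conceptual sketch, not the `peft` library's implementation: freeze the pretrained weight W and learn a low-rank update B @ A on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        # A starts small and random, B starts at zero, so the adapted
        # layer initially behaves exactly like the pretrained one
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x):
        return self.linear(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # 12288 of ~600k parameters
```

Wrapping every attention projection this way trains only a few percent of the model, which is why LoRA makes fine-tuning multi-billion-parameter models feasible on a single GPU. In practice you would use the `peft` library rather than hand-rolling this.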
When to Use Transformers vs. Simpler Models
| Use Transformer when | Use simpler model when |
|---|---|
| You have text/sequence data | You have tabular data |
| Pre-trained models exist for your domain | No pretrained model fits and training from scratch is too costly |
| Semantic understanding matters | You need fast, interpretable inference |
| Your data is unstructured | Features are well-understood |
For tabular data with engineered features, gradient boosting (XGBoost) usually beats transformers with far less compute.
Ready to go deeper? Read about attention mechanisms in large language models and how they're optimized for production serving.