## Introduction

Two-tower models have become the standard architecture for large-scale retrieval systems. This guide covers everything from theory to production implementation.
## Architecture Overview

### Basic Structure

```
   Query Tower                Item Tower
        |                         |
  Query Features            Item Features
        |                         |
  [Dense Layers]            [Dense Layers]
        |                         |
 Query Embedding           Item Embedding
        |                         |
        +------- dot(q, i) ------+
                    |
               Similarity
```
### Why Two Towers?

- **Decomposability:** item embeddings can be computed offline, independently of any query
- **Scalability:** retrieval reduces to approximate nearest-neighbor (ANN) search over precomputed item embeddings
- **Flexibility:** each tower can use its own architecture and feature set
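As a concrete (if toy) illustration of this decomposition, here is a minimal NumPy sketch. The layer sizes and random features are made up, and each tower is a single ReLU layer, but the offline/online split is the same one production systems rely on:

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, weights):
    # One ReLU dense layer, then L2-normalize the embedding.
    h = np.maximum(x @ weights, 0.0)
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-9)

# Hypothetical sizes: 32-d query features, 48-d item features, 16-d embeddings.
W_q = rng.normal(size=(32, 16))
W_i = rng.normal(size=(48, 16))

item_embs = tower(rng.normal(size=(1000, 48)), W_i)  # offline, whole catalog
query_emb = tower(rng.normal(size=(32,)), W_q)       # online, per request
scores = item_embs @ query_emb                       # dot-product similarity
top5 = np.argsort(-scores)[:5]                       # retrieval = top-k by score
```

Because the item side never sees the query, the `item_embs` matrix can be built once in a batch job and served from an ANN index.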
## Training

### Loss Functions

#### Contrastive Loss (InfoNCE)

The loss pulls the query embedding toward its positive item and pushes it away from negatives:

```python
import numpy as np

def contrastive_loss(query_emb, pos_item_emb, neg_item_embs, temperature=0.1):
    # Scores for the positive and each negative, scaled by temperature
    logits = np.concatenate([[query_emb @ pos_item_emb],
                             neg_item_embs @ query_emb]) / temperature
    logits -= logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))  # -log softmax(positive)
```
### Batch Negatives

- Use the other items in the batch as negatives for each query
- Efficient GPU utilization: one matrix multiply scores every query against every item
- Requires large batches (1000+) to supply enough informative negatives
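With in-batch negatives, the loss above collapses into a single matrix multiply over the batch. A minimal NumPy sketch, assuming row i of the item matrix is the positive for query i:

```python
import numpy as np

def in_batch_infonce(query_embs, item_embs, temperature=0.1):
    # Rows are aligned: item i is the positive for query i; every other
    # item in the batch serves as that query's negative.
    logits = query_embs @ item_embs.T / temperature      # (B, B) score matrix
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on diagonal
```

Each query gets B - 1 negatives essentially for free, which is why large batches matter: they directly determine how many negatives the model sees per step.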
### Hard Negative Mining

Easy negatives provide a weak training signal; hard negatives, items the model currently scores highly but that are not actually relevant, teach it finer distinctions. A common approach is to mine them from the model's own retrieval index:

```python
def mine_hard_negatives(query, all_items, num_hard=10):
    # Retrieve the query's current approximate nearest neighbors:
    # the items the model already scores highly.
    candidates = ann_search(query, all_items, top_k=100)
    # Drop true positives so we never train against relevant items.
    negatives = [c for c in candidates if not is_positive(c)]
    return negatives[:num_hard]
```

(`ann_search` and `is_positive` are stand-ins for your retrieval index and label lookup.)
## Tower Design

### Query Tower

Features:

- Query text (embedded)
- User history (aggregated)
- Context (time, device)

Architecture:

- BERT/transformer for text
- Pooling layers for sequences
- MLP for final projection
### Item Tower

Features:

- Item text/title
- Item attributes
- Engagement statistics

Architecture:

- Similar to the query tower
- Can be asymmetric: the towers may differ in depth and width as long as they output embeddings of the same dimension
## Serving Architecture

### Offline Pipeline

```
All Items -> Item Tower -> Item Embeddings -> ANN Index
```

### Online Pipeline

```
User Query -> Query Tower -> Query Embedding -> ANN Search -> Results
```
### Index Types

| Index Type  | Build Time | Query Time | Memory |
|-------------|------------|------------|--------|
| Brute Force | O(1)       | O(n)       | Low    |
| IVF         | O(n)       | O(sqrt(n)) | Medium |
| HNSW        | O(n log n) | O(log n)   | High   |
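To make the IVF row concrete, here is a toy NumPy sketch of inverted-file search. Real IVF implementations (e.g., FAISS) learn the coarse centroids with k-means; this sketch samples them randomly for brevity, but the sublinear query cost comes from the same idea, scoring only a few cells instead of the whole catalog:

```python
import numpy as np

rng = np.random.default_rng(1)
items = rng.normal(size=(2000, 16)).astype(np.float32)

# "Training": pick roughly sqrt(n) coarse centroids (random here; real IVF
# runs k-means) and assign every item to its most similar cell.
k = 45
centroids = items[rng.choice(len(items), size=k, replace=False)]
cell_of = np.argmax(items @ centroids.T, axis=1)

def ivf_search(query, nprobe=4, top_k=10):
    # Probe only the nprobe most similar cells instead of scanning all n items.
    cells = np.argsort(-(centroids @ query))[:nprobe]
    idx = np.flatnonzero(np.isin(cell_of, cells))
    scores = items[idx] @ query
    return idx[np.argsort(-scores)[:top_k]]
```

Increasing `nprobe` trades query time for recall, which is the knob behind the "ANN recall vs. exact" monitoring discussed later.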
## Optimization Techniques

### Embedding Compression

- Quantization: FP32 -> INT8
- Dimensionality reduction: PCA
- Product quantization: split the vector into subvectors and quantize each separately
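A minimal sketch of the first bullet, symmetric FP32 -> INT8 quantization in NumPy. A single per-matrix scale keeps the example short; production systems often use per-row or per-block scales for lower error:

```python
import numpy as np

def quantize_int8(embs):
    # Symmetric per-matrix quantization: ~4x smaller than FP32.
    scale = np.abs(embs).max() / 127.0
    return np.round(embs / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

embs = np.random.default_rng(2).normal(size=(1000, 64)).astype(np.float32)
q, scale = quantize_int8(embs)
max_err = np.abs(dequantize(q, scale) - embs).max()  # bounded by scale / 2
```

The reconstruction error is bounded by half the quantization step, so the scale directly controls the accuracy/memory trade-off.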
### Training Improvements

- Temperature scheduling: start high, anneal down
- Curriculum learning: progress from easy to hard negatives
- Multi-task training: optimize multiple objectives jointly
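Temperature scheduling can be as simple as a linear anneal. The start and end values below are illustrative defaults, not recommendations:

```python
def temperature_schedule(step, total_steps, t_start=0.3, t_end=0.05):
    # Linearly anneal the softmax temperature from t_start down to t_end.
    # High temperature early softens the loss; low temperature late
    # sharpens it, forcing finer distinctions between candidates.
    frac = min(step / total_steps, 1.0)
    return t_start + frac * (t_end - t_start)
```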
## Production Considerations

### Embedding Freshness

- Real-time vs. batch updates
- Incremental index building
- Versioning strategy

### Monitoring

- Embedding drift
- ANN recall vs. exact search
- Latency distributions
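A sketch of the second monitoring metric, ANN recall against exact search. Here `ann_ids` and `exact_ids` are assumed to be the top-k id lists from the ANN index and from a brute-force scan over the same embeddings:

```python
def recall_at_k(ann_ids, exact_ids):
    # Fraction of the exact top-k that the ANN index also returned.
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)
```

Computed periodically on a sample of queries, a drop in this number signals either index degradation or embedding drift between the model and the index.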
## Common Pitfalls

- **Batch size too small:** in-batch negatives need batches in the thousands to be effective
- **No hard negatives:** the model never learns fine-grained distinctions
- **Dimension mismatch:** both towers must output embeddings of the same size
- **Stale index:** item embeddings drift out of date between index rebuilds
Master two-tower models in our Recommendation Systems at Scale course.