Introduction
Late interaction models represent a middle ground between expensive cross-encoders and efficient bi-encoders. This guide explores ColBERT, ColPali, and related architectures.
The Retrieval Spectrum
Bi-Encoders (Two-Tower)
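- Encode query and document independently, one vector each
- Fast: a single dot product per query-document pair
- Quality: Limited interaction, since each text is compressed into one vector before the two are compared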
Cross-Encoders
- Joint encoding of query-document pairs
- Slow: The model must run once per query-document pair at query time
- Quality: Rich interaction
Late Interaction (ColBERT)
- Encode query and document independently, keeping one embedding per token
- Interact at token level at scoring time
- Balance of speed and quality
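To make the contrast concrete, the sketch below scores the same toy query and document two ways: once with a single pooled vector per side, as a bi-encoder would, and once by letting each query token pick its best-matching document token. The 2-D vectors are made up, and mean pooling is only one common pooling choice, assumed here for illustration.

```python
# Toy comparison of bi-encoder and late-interaction scoring.
# All vectors are made-up 2-D token embeddings, not real model output.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_pool(token_embs):
    """Collapse token embeddings into one vector (assumed pooling choice)."""
    n = len(token_embs)
    return [sum(col) / n for col in zip(*token_embs)]

def bi_encoder_score(query_embs, doc_embs):
    # One vector per side, one dot product.
    return dot(mean_pool(query_embs), mean_pool(doc_embs))

def late_interaction_score(query_embs, doc_embs):
    # Each query token keeps its best-matching document token.
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

bi = bi_encoder_score(query, doc)          # 0.0: pooling cancels the matches out
late = late_interaction_score(query, doc)  # 2.0: both query tokens find an exact match
```

In this contrived case the document contains both query tokens exactly, yet the pooled score is zero because the third token cancels the other two. Real encoders are trained to avoid such extremes, but the example shows the kind of detail that token-level scoring preserves.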
ColBERT Architecture
Encoding Phase
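# Schematic: `bert` stands in for the token-level encoder
# Query encoding (at query time): one embedding per query token
query_embeddings = bert(query)  # [num_query_tokens, dim]
# Document encoding (offline, at indexing time): one embedding per document token
doc_embeddings = bert(document)  # [num_doc_tokens, dim]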
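One detail the schematic leaves implicit: in the published ColBERT design, each token embedding is L2-normalized, so the dot products used for scoring behave as cosine similarities. A minimal sketch of that normalization, in plain Python with lists standing in for tensors (the helper names are illustrative, not part of any library):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def normalize_tokens(token_embs):
    """Normalize every token embedding independently."""
    return [l2_normalize(v) for v in token_embs]

raw = [[3.0, 4.0], [0.0, 2.0]]  # made-up raw token embeddings
unit = normalize_tokens(raw)    # each row now has length 1
```

With unit-length embeddings, every token-to-token similarity lies in [-1, 1], so a query with Q tokens can score at most Q against any document.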
MaxSim Scoring
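import torch

def maxsim(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction score for one query [Q, dim] against one document [D, dim]."""
    # Similarity of every query token to every document token
    similarities = query_embs @ doc_embs.T  # [Q, D]
    # Best-matching doc token per query token; .max(dim=1) returns (values, indices)
    max_per_query = similarities.max(dim=1).values  # [Q]
    return max_per_query.sum()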
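Written as a formula, the score computed by `maxsim` is a sum over query tokens of the best dot product against any document token. The notation is chosen here for illustration: E_{q_i} is the embedding of the i-th query token, E_{d_j} the embedding of the j-th document token, and Q and D are the token counts. The similarity values in the worked example are made up.

```latex
S(q, d) = \sum_{i=1}^{Q} \max_{1 \le j \le D} E_{q_i} \cdot E_{d_j}

% Worked example: Q = 2 query tokens (rows), D = 3 document tokens (columns)
\begin{pmatrix} 0.9 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.8 \end{pmatrix}
\quad\Rightarrow\quad S(q, d) = 0.9 + 0.8 = 1.7
```

Each row contributes only its maximum, so adding more document tokens can never lower the score for a given query.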
ColPali: Visual Documents
ColPali extends late interaction to visual document retrieval:
- Process document pages as images
- Vision encoder for patch embeddings
- Late interaction for relevance
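The same scoring carries over when document tokens are replaced by image patches. The sketch below is a simplified illustration, not ColPali's actual code: it assumes patch embeddings arrive in row-major order over a grid with `grid_cols` columns (a hypothetical parameter), and it also reports which patch each query token matched, which is one way to see where on the page the relevance comes from.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_page(query_embs, patch_embs, grid_cols):
    """Return the MaxSim score and, per query token, the (row, col) of its best patch."""
    total = 0.0
    best_patches = []
    for q in query_embs:
        sims = [dot(q, p) for p in patch_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        total += sims[best]
        # Convert the flat patch index into grid coordinates.
        best_patches.append(divmod(best, grid_cols))
    return total, best_patches

# Hypothetical 2x2 page: four made-up patch embeddings in row-major order.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
query = [[0.0, 1.0], [1.0, 0.0]]
score, hits = score_page(query, patches, grid_cols=2)
# score == 2.0, hits == [(0, 1), (0, 0)]
```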
Production Deployment
Index Building
- Pre-compute document token embeddings offline
- Store them in a specialized index such as PLAID
- Compress with quantization to reduce index size
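As an illustration of the compression step, here is a sketch of plain symmetric INT8 quantization applied to one embedding. This is a generic scheme chosen for simplicity, not a description of PLAID's own compression, and the function names are illustrative.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization: floats -> ints in [-127, 127] plus one scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

vec = [0.4, -1.0, 0.25]  # made-up embedding
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

Each value is stored in one byte instead of four (for float32), at the cost of a rounding error of at most half a quantization step (`scale / 2`) per component.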
Query Time
- Encode query tokens
- Approximate retrieval of candidate documents
- Re-rank candidates with full MaxSim
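The query-time steps can be sketched as a two-stage search. This is a toy version under stated assumptions: the index is a plain dict of made-up token embeddings, and a brute-force scan stands in for the approximate nearest-neighbour lookup a real index would use. The function names and the `tokens_per_query` parameter are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_embs, doc_embs):
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

def candidate_docs(query_embs, index, tokens_per_query=2):
    """Stage 1: collect the docs that own the nearest tokens to each query token.
    Brute force here; a production index would use approximate search."""
    all_tokens = [(doc_id, emb) for doc_id, embs in index.items() for emb in embs]
    candidates = set()
    for q in query_embs:
        nearest = sorted(all_tokens, key=lambda t: dot(q, t[1]), reverse=True)
        candidates.update(doc_id for doc_id, _ in nearest[:tokens_per_query])
    return candidates

def search(query_embs, index, top_k=2):
    """Stage 2: re-rank only the candidates with full MaxSim."""
    scored = [(maxsim(query_embs, index[d]), d) for d in candidate_docs(query_embs, index)]
    return [d for _, d in sorted(scored, reverse=True)[:top_k]]

index = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.9, 0.1], [0.1, 0.2]],
    "doc_c": [[-1.0, 0.0], [0.0, -1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]
results = search(query, index)  # ["doc_a", "doc_b"]; doc_c is never fully scored
```

The point of the first stage is that full MaxSim runs only on the small candidate set, so its cost does not grow with the size of the whole collection.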
When to Use
| Method | Relative latency | Quality | Use Case |
|---|---|---|---|
| Bi-encoder | 1x | Good | First stage |
| ColBERT | 5-10x | Better | Re-ranking |
| Cross-encoder | 100x | Best | Top-k only |
Best Practices
- Hybrid retrieval: Bi-encoder + ColBERT re-rank
- Token pruning: Reduce the number of token embeddings stored per document
- Quantization: INT8 for a smaller index and faster scoring
- Batching: Amortize query encoding cost across multiple queries
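As one simple form of token pruning, the sketch below drops punctuation and stopword tokens before their embeddings are stored. It is a rule-based illustration only: the stopword set is hypothetical, and other pruning methods choose tokens by learned or score-based criteria instead.

```python
import string

def prune_tokens(tokens, embeddings, stopwords=frozenset()):
    """Keep only embeddings whose token is neither punctuation nor a stopword."""
    kept = [
        (tok, emb)
        for tok, emb in zip(tokens, embeddings)
        if tok not in stopwords and not all(ch in string.punctuation for ch in tok)
    ]
    return [t for t, _ in kept], [e for _, e in kept]

tokens = ["late", "interaction", ",", "the", "model", "."]
embeddings = [[0.1 * i, 1.0] for i in range(len(tokens))]  # made-up embeddings
kept_tokens, kept_embs = prune_tokens(tokens, embeddings, stopwords={"the"})
# kept_tokens == ["late", "interaction", "model"]
```

Here half of the token embeddings are removed, which halves storage for this document. Whether that hurts ranking quality depends on the data and should be measured.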