
Late Interaction Retrieval Methods: ColBERT and ColPali Explained

Understanding late interaction retrieval methods including ColBERT and ColPali for efficient semantic search.


Introduction

Late interaction models represent a middle ground between expensive cross-encoders and efficient bi-encoders. This guide explores ColBERT, ColPali, and related architectures.

The Retrieval Spectrum

Bi-Encoders (Two-Tower)

  • Encode query and document independently, each into a single vector
  • Fast: scoring is a single dot product
  • Quality: limited, since the two sides never interact during encoding

Cross-Encoders

  • Jointly encode each query-document pair
  • Slow: the full model must run once per pair
  • Quality: rich token-level interaction

Late Interaction (ColBERT)

  • Encode query and document independently, one embedding per token
  • Interact at the token level at scoring time
  • Balances bi-encoder speed with near cross-encoder quality
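To make the contrast concrete, here is a minimal sketch of the two scoring regimes that can precompute document representations, with random tensors standing in for real encoder outputs:

import torch

torch.manual_seed(0)
dim = 128

# Bi-encoder: one pooled vector per text; scoring is a single dot product.
q_vec, d_vec = torch.randn(dim), torch.randn(dim)
score_bi = q_vec @ d_vec                                     # scalar

# Late interaction: one vector per token; tokens interact only at scoring time.
q_toks, d_toks = torch.randn(8, dim), torch.randn(300, dim)  # [Q, dim], [D, dim]
score_late = (q_toks @ d_toks.T).max(dim=1).values.sum()     # MaxSim, defined below

# A cross-encoder would need a forward pass over the concatenated pair,
# so nothing on the document side could be precomputed.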

ColBERT Architecture

Encoding Phase

# Query encoding (online, at search time)
query_embeddings = bert(query)    # [num_query_tokens, dim]

# Document encoding (offline, at indexing time)
doc_embeddings = bert(document)   # [num_doc_tokens, dim]
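Here `bert` is shorthand. A hedged, runnable approximation using a vanilla HuggingFace BERT is shown below; a real ColBERT checkpoint additionally applies a learned linear projection and query/document marker tokens, both omitted here.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(text: str) -> torch.Tensor:
    # One embedding per token, L2-normalized so dot products are cosine similarities.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # [num_tokens, 768]
    return torch.nn.functional.normalize(hidden, dim=-1)

query_embeddings = encode("what is late interaction retrieval")
doc_embeddings = encode("ColBERT scores a query against a document token by token.")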

MaxSim Scoring

import torch

def maxsim(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    # For each query token, find its max similarity to any document token,
    # then sum those maxima over the query tokens.
    similarities = query_embs @ doc_embs.T           # [Q, D]
    max_per_query = similarities.max(dim=1).values   # [Q]; .max returns (values, indices)
    return max_per_query.sum()
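With the per-token embeddings from the encoding sketch above, scoring a pair is a single call:

score = maxsim(query_embeddings, doc_embeddings)   # higher means more relevant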

ColPali: Visual Documents

ColPali extends late interaction to visual document retrieval:

  • Processes document pages directly as images, with no OCR or text extraction
  • A vision encoder produces one embedding per image patch
  • The same late-interaction (MaxSim) scoring determines relevance, as sketched below
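Because scoring is unchanged, only the document side differs: token embeddings become patch embeddings. A minimal sketch, with random tensors standing in for a real ColPali vision encoder:

import torch

page_patches = torch.randn(1024, 128)   # [num_patches, dim]; stand-in for vision-encoder output
query_tokens = torch.randn(8, 128)      # [num_query_tokens, dim]

# Same MaxSim as the text case: each query token matches its best patch.
score = (query_tokens @ page_patches.T).max(dim=1).values.sum()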

Production Deployment

Index Building

  • Pre-compute token embeddings for every document
  • Store them in a specialized index such as PLAID
  • Compress embeddings with quantization (a simplified sketch follows)
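As a hedged illustration of the compression step, the sketch below applies symmetric per-token INT8 quantization. PLAID itself uses residual compression against centroids; this simpler scheme only shows the memory/accuracy trade-off.

import torch

def quantize_int8(embs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Per-token symmetric quantization: int8 codes plus one float scale per token.
    scales = embs.abs().amax(dim=1, keepdim=True) / 127.0   # [num_tokens, 1]
    codes = torch.round(embs / scales).to(torch.int8)       # [num_tokens, dim]
    return codes, scales

def dequantize(codes: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return codes.float() * scales   # approximate reconstruction for scoring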

Query Time

  1. Encode the query into token embeddings
  2. Retrieve candidate documents approximately (e.g., ANN search over token embeddings)
  3. Re-rank the candidates with full MaxSim, as sketched below
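Putting the steps together, a minimal sketch of the re-ranking stage; candidate retrieval is stubbed out here as a dictionary of pre-computed document embeddings:

import torch

def rerank(query_embs: torch.Tensor,
           candidates: dict[str, torch.Tensor],
           k: int = 10) -> list[tuple[str, float]]:
    # candidates maps doc_id -> pre-computed token embeddings [D_i, dim].
    scored = [
        (doc_id, (query_embs @ doc_embs.T).max(dim=1).values.sum().item())
        for doc_id, doc_embs in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]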

When to Use

Method          Latency (relative)   Quality   Use Case
Bi-encoder      1x                   Good      First-stage retrieval
ColBERT         5-10x                Better    Re-ranking
Cross-encoder   100x                 Best      Top-k only

Best Practices

  1. Hybrid retrieval: retrieve with a bi-encoder, re-rank with ColBERT
  2. Token pruning: reduce the number of stored document tokens (one heuristic is sketched below)
  3. Quantization: store embeddings in INT8 for memory efficiency
  4. Batching: amortize query encoding across concurrent queries
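Token-pruning heuristics vary; one illustrative (hypothetical) choice is to drop the tokens most redundant with the rest of the document:

import torch

def prune_tokens(doc_embs: torch.Tensor, keep: int) -> torch.Tensor:
    # Score each token by its max similarity to the other tokens in the document;
    # highly redundant tokens are pruned first. One heuristic among many.
    sims = doc_embs @ doc_embs.T
    sims.fill_diagonal_(float("-inf"))      # ignore self-similarity
    redundancy = sims.max(dim=1).values     # [num_tokens]
    idx = redundancy.topk(keep, largest=False).indices
    return doc_embs[idx]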

Master retrieval in our RAG Systems at Scale course.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.