Introduction
Evaluating ranking models is tricky: offline metrics don't always correlate with online success. This guide covers both offline and online evaluation and how to bridge the gap between them.
Offline Metrics
Precision and Recall @ K
def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved items that are relevant."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = len(set(retrieved_k) & set(relevant))
    return relevant_retrieved / k

def recall_at_k(relevant, retrieved, k):
    """Fraction of all relevant items that appear in the top k."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = len(set(retrieved_k) & set(relevant))
    return relevant_retrieved / len(relevant)
NDCG (Normalized Discounted Cumulative Gain)
import numpy as np

def ndcg_at_k(relevance_scores, k):
    """NDCG@k for a single ranked list of graded relevance scores."""
    dcg = sum(
        (2**rel - 1) / np.log2(i + 2)
        for i, rel in enumerate(relevance_scores[:k])
    )
    ideal = sorted(relevance_scores, reverse=True)
    idcg = sum(
        (2**rel - 1) / np.log2(i + 2)
        for i, rel in enumerate(ideal[:k])
    )
    return dcg / idcg if idcg > 0 else 0.0
MRR (Mean Reciprocal Rank)
def mrr(queries_results):
    """Mean reciprocal rank; each query's results are (item, is_relevant) pairs."""
    rr_sum = 0.0
    for results in queries_results:
        for i, (_, relevant) in enumerate(results):
            if relevant:
                rr_sum += 1 / (i + 1)
                break
    return rr_sum / len(queries_results)
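A toy run tying these metrics together (the item IDs and relevance grades below are made up for illustration):

relevant = ["a", "b", "c"]
retrieved = ["a", "d", "b", "e", "f"]
print(precision_at_k(relevant, retrieved, 5))  # 0.4
print(recall_at_k(relevant, retrieved, 5))     # ~0.667 (2 of the 3 relevant items found)
print(ndcg_at_k([3, 0, 2, 0, 0], 5))           # ~0.956
print(mrr([[("a", True), ("d", False)],
           [("d", False), ("b", True)]]))      # (1 + 0.5) / 2 = 0.75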
Online Metrics
Engagement Metrics
- Click-through rate (CTR): Clicks / Impressions (see the sketch after this list)
- Session duration: Time spent on platform
- Items consumed: Videos watched, articles read
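CTR drops straight out of an impression log. A minimal sketch, assuming hypothetical log records with a boolean clicked field:

def click_through_rate(impressions):
    """CTR = clicks / impressions over a list of logged impression dicts."""
    return sum(r["clicked"] for r in impressions) / len(impressions)

print(click_through_rate([{"clicked": True}, {"clicked": False}, {"clicked": False}]))  # ~0.333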
Business Metrics
- Conversion rate: Purchases / Sessions
- Revenue per user: Total revenue / Users
- Retention: Users returning
Quality Metrics
- Diversity: Unique categories in results (see the sketch after this list)
- Novelty: Share of recommended items the user has not seen before
- Fairness: Equal exposure across groups
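Diversity is often operationalized as the number of distinct categories in the top k. A sketch, assuming each result carries a category field (the field name is illustrative):

def diversity_at_k(results, k=10):
    """Count of unique categories among the top-k ranked results."""
    return len({item["category"] for item in results[:k]})

print(diversity_at_k([{"category": "news"}, {"category": "sports"}, {"category": "news"}]))  # 2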
The Offline-Online Gap
Why Metrics Don't Align
- Position bias: Users click top-ranked results more, regardless of relevance (see the toy simulation after this list)
- Selection bias: Logged feedback only covers items the previous ranker chose to show
- Delayed feedback: Long-term effects are missed in short observation windows
- Context missing: Offline data lacks the session and real-time context available at serving time
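To see how position bias alone skews naive click counts, here is a toy examination click model; all probabilities are invented for illustration:

import random

def simulate_clicks(relevance, examine_prob, n_users=100_000):
    """A click requires the user to both examine a position and find the item relevant."""
    clicks = [0] * len(relevance)
    for _ in range(n_users):
        for pos, (rel, exam) in enumerate(zip(relevance, examine_prob)):
            if random.random() < exam and random.random() < rel:
                clicks[pos] += 1
    return clicks

# Identical relevance at every position still yields far more clicks at the top.
print(simulate_clicks([0.5, 0.5, 0.5], [0.9, 0.5, 0.2]))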
Bridging Strategies
Counterfactual Evaluation
def ips_estimator(new_policy, logged_data):
    """Inverse propensity scoring (IPS) estimate of a new policy's value.

    logged_data holds (context, action, reward, propensity) tuples collected
    under the logging policy; new_policy(context, action) returns the
    probability that the new policy would take that action.
    """
    total = 0.0
    for context, action, reward, propensity in logged_data:
        new_prob = new_policy(context, action)
        total += (new_prob / propensity) * reward
    return total / len(logged_data)
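A toy usage example with a made-up log and a new policy that always recommends item "a" (all names and numbers are illustrative; note that IPS estimates get noisy when propensities are small):

# (context, action, reward, propensity under the logging policy)
logged_data = [
    ({"user": 1}, "a", 1.0, 0.5),
    ({"user": 2}, "b", 0.0, 0.5),
    ({"user": 3}, "a", 1.0, 0.25),
]

def always_a(context, action):
    return 1.0 if action == "a" else 0.0

print(ips_estimator(always_a, logged_data))  # (2.0 + 0.0 + 4.0) / 3 = 2.0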
Interleaving
Interleaving mixes results from two rankers into a single list and credits each click to the ranker that contributed the clicked item:
- Team Draft interleaving (sketched below)
- Balanced interleaving
- Detects a ranker preference with far less traffic than a full A/B test
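A minimal sketch of the Team Draft variant, under its usual description: each round a coin flip decides which ranker picks first, each ranker contributes its best not-yet-used item, and clicks are later credited via the returned per-slot assignment.

import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Interleave two rankings; returns the merged list and per-slot credit ('A' or 'B')."""
    a, b = list(ranking_a), list(ranking_b)
    interleaved, credit, used = [], [], set()
    while len(interleaved) < k and (a or b):
        # Coin flip decides which ranker drafts first this round.
        order = [("A", a), ("B", b)] if random.random() < 0.5 else [("B", b), ("A", a)]
        for team, ranking in order:
            # Skip anything the other ranker already placed.
            while ranking and ranking[0] in used:
                ranking.pop(0)
            if ranking and len(interleaved) < k:
                item = ranking.pop(0)
                used.add(item)
                interleaved.append(item)
                credit.append(team)
    return interleaved, credit

results, credit = team_draft_interleave(["x", "y", "z"], ["y", "w", "z"], k=4)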
A/B Testing Best Practices
Experiment Design
- Clear hypothesis
- Primary and guardrail metrics
- Sufficient sample size (see the sizing sketch after this list)
- Duration long enough to cover full weekly cycles
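For a proportion metric like CTR, the per-variant sample size can be approximated with a standard two-proportion power calculation. A sketch; the baseline rate and minimum detectable lift are values you would choose for your own experiment:

from scipy.stats import norm

def samples_per_variant(p_baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# e.g. 5% baseline CTR, detecting a 2% relative lift
print(samples_per_variant(0.05, 0.02))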
Common Pitfalls
- Peeking: Stopping the test as soon as the metric looks significant inflates false positives
- Multiple comparisons: Testing many variants or metrics without correction (a Bonferroni adjustment, sketched below, is one simple guard)
- Network effects: Treated and control users influence each other, biasing estimates
- Novelty effects: New != better long-term; an early lift can fade as users adjust
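A Bonferroni correction is a deliberately conservative way to handle multiple comparisons; shown here purely as an illustration:

def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance threshold given an overall alpha."""
    return alpha / num_comparisons

# Four variants tested against control at an overall alpha of 0.05
print(bonferroni_alpha(0.05, 4))  # 0.0125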
Evaluation Framework
Idea (fast) -> Offline Eval (medium) -> Interleaving (quick signal) -> A/B Test (full significance) -> Ship (monitor)
Master evaluation in our Recommendation Systems at Scale course.