Introduction
Evaluating LLMs is one of the hardest unsolved problems in production ML. Unlike traditional ML, where a held-out test set gives you reliable performance estimates, LLM outputs are open-ended, context-dependent, and often require domain expertise to assess.
Yet teams are shipping LLM-powered products and need to know: is version 2 better than version 1? Did this prompt change regress quality? Is this fine-tuned model ready for production?
This post covers practical evaluation frameworks that work at scale.
Why Standard Benchmarks Are Insufficient
MMLU, HellaSwag, HumanEval, GSM8K — these benchmarks are useful for comparing model families, but they don't answer the question that matters for production teams: does this model perform better on my task?
Problems with generic benchmarks:
- Distribution mismatch: Your users don't ask questions like benchmark questions
- Metric mismatch: Benchmark accuracy doesn't capture fluency, tone, or safety
- Leakage: Popular benchmarks are in model training data, inflating scores
- Saturation: Top models cluster near 90%+ — hard to distinguish
You need task-specific, production-aligned evaluation.
The Evaluation Pyramid
             A/B testing              (slowest, most reliable)
           ────────────────
             Human eval               (slow, expensive)
         ────────────────────
            LLM-as-judge              (fast, scalable)
       ────────────────────────
        Automated unit tests          (fastest, narrow coverage)
     ────────────────────────────
Use all layers. Automated tests run on every change; A/B tests run for significant updates.
Automated Unit Tests for LLMs
Exact match tests
For deterministic outputs (extraction, classification):
def test_entity_extraction():
    result = llm.extract_entities("Apple acquired Beats in 2014")
    assert "Apple" in result.organizations
    assert "Beats" in result.organizations
    assert result.events[0].year == 2014
Assertion-based tests
For semi-structured outputs:
def test_summarization():
    summary = llm.summarize(LONG_ARTICLE, max_words=100)
    assert len(summary.split()) <= 120  # some tolerance
    assert "key_entity" in summary.lower()
    assert not contains_pii(summary)
Behavioral tests (invariance)
Test that the model's behavior is consistent across equivalent inputs:
def test_classification_invariance():
    result_1 = classifier("Please classify this email as spam/not spam: ...")
    result_2 = classifier("Is this email spam or not spam? ...")
    assert result_1.label == result_2.label  # same answer, different phrasing
These catch prompt sensitivity — fragile models that change behavior with minor rephrasing.
LLM-as-Judge
Use a capable LLM to evaluate the output of another LLM. This is now the standard approach for scalable quality evaluation.
Basic Implementation
JUDGE_PROMPT = """
You are evaluating the quality of an AI assistant's response.

User question: {question}
AI response: {response}

Rate the response on the following dimensions (1-5 each):
1. Accuracy: Is the information correct?
2. Completeness: Does it fully address the question?
3. Clarity: Is it easy to understand?
4. Safety: Does it avoid harmful content?

Return a JSON object: {{"accuracy": X, "completeness": X, "clarity": X, "safety": X, "reasoning": "..."}}
"""
def evaluate_with_llm(question, response, judge_model):
    # judge_model is a client for a strong judge model (e.g. claude-opus-4-6),
    # not a model-name string — we call .generate() on it below.
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    scores = judge_model.generate(prompt, output_format="json")
    return scores
Pairwise Comparison (Preferred)
Pairwise comparison is more reliable than absolute scoring:
PAIRWISE_PROMPT = """
Compare two AI assistant responses to this question:

Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better, and why? Consider accuracy, helpfulness, and clarity.

Output: {{"winner": "A" | "B" | "tie", "reasoning": "..."}}
"""
This avoids the "numerical calibration" problem where judges give different absolute scores for the same quality level.
Reducing Judge Bias
Position bias: LLM judges tend to favor whichever response comes first. Fix by running each comparison twice with A/B order swapped.
Verbosity bias: Judges often rate longer responses higher, even when shorter is better. Explicitly instruct the judge to penalize unnecessary length.
Self-enhancement bias: A model judging its own outputs scores them higher. Use a different model as judge, or use a specialized evaluator.
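The order-swapping fix for position bias can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical `judge(question, first, second)` callable that returns "A", "B", or "tie" for whatever sits in each slot:

```python
# Position-debiased pairwise judging: run the comparison in both orders and
# only trust verdicts that agree once the second run is mapped back.
def debiased_compare(judge, question, response_a, response_b):
    verdict_1 = judge(question, response_a, response_b)  # A in the first slot
    verdict_2 = judge(question, response_b, response_a)  # B in the first slot
    # Map the second run's slot labels back to the original A/B naming.
    flipped = {"A": "B", "B": "A", "tie": "tie"}[verdict_2]
    if verdict_1 == flipped:
        return verdict_1  # consistent across both orders
    return "tie"          # disagreement suggests position bias; call it a tie
```

A judge that always favors the first slot will contradict itself across the two runs and get downgraded to a tie, which is exactly the behavior you want.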
Calibration: Before deploying a judge, compare it against human judgments on 200-500 examples. Target >80% agreement with humans.
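The calibration check itself is simple once you have parallel verdicts. A minimal sketch, with illustrative labels (any consistent label set works):

```python
# Fraction of examples where the judge's verdict matches the human verdict.
def agreement_rate(judge_labels, human_labels):
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

judge_verdicts = ["A", "B", "tie", "A", "B"]
human_verdicts = ["A", "B", "A",   "A", "B"]
print(agreement_rate(judge_verdicts, human_verdicts))  # 0.8
```

If the rate falls short of your target, inspect the disagreements: they usually reveal either a judge bias or an ambiguity in your rating rubric.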
Building an Evaluation Dataset
Dataset Composition
A good eval dataset should have:
- Diverse difficulty: Easy, medium, hard examples
- Edge cases: Inputs where the model commonly fails
- Adversarial examples: Intentionally tricky or ambiguous inputs
- Representative samples: Distribution should match production traffic
Data Sources
- Production logs: Sample real user queries (with PII removed)
- Manual crafting: Domain experts write targeted examples
- Synthesis via LLM: Generate variations of existing examples
- Failure mining: Collect examples where the current model fails
Dataset Size
Rule of thumb:
- Unit test suite: 50-500 examples (fast feedback)
- Evaluation dataset: 500-5,000 examples (statistical significance)
- Human evaluation pool: 200-1,000 examples (expensive, high-quality)
For detecting a 2% quality change with 95% confidence, you need ~2,500 examples.
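That figure is roughly what the standard normal-approximation margin-of-error formula for a proportion gives in the worst case (p = 0.5, 95% confidence, ±2% margin). A sketch of the arithmetic — note that detecting a *difference between two versions* with adequate statistical power typically requires more than this estimation bound suggests:

```python
import math

# Sample size to estimate a pass-rate within +/- delta at a given confidence
# level, using the normal approximation n = z^2 * p(1-p) / delta^2.
# Worst case is p = 0.5; z = 1.96 corresponds to 95% confidence.
def required_n(delta, z=1.96, p=0.5):
    return math.ceil(z**2 * p * (1 - p) / delta**2)

print(required_n(0.02))  # 2401 -> the "~2,500 examples" ballpark
```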
Regression Testing Framework
Version Comparison Pipeline
class LLMEvaluator:
    def __init__(self, eval_dataset, judge, metrics):
        self.dataset = eval_dataset
        self.judge = judge
        self.metrics = metrics

    def evaluate_version(self, model_version):
        results = []
        for example in self.dataset:
            response = model_version.generate(example.prompt)
            scores = self.judge.evaluate(example.prompt, response)
            results.append({"example_id": example.id, **scores})
        return EvaluationReport(results)

    def compare_versions(self, v1, v2):
        report_v1 = self.evaluate_version(v1)
        report_v2 = self.evaluate_version(v2)
        return StatisticalComparison(report_v1, report_v2)
Statistical Significance
Don't claim improvement without statistical testing:
from scipy import stats

def is_improvement_significant(scores_v1, scores_v2, alpha=0.05):
    # Paired, non-parametric test over per-example scores (NumPy arrays);
    # Wilcoxon signed-rank avoids assuming normally distributed scores.
    statistic, p_value = stats.wilcoxon(scores_v1, scores_v2)
    effect_size = (scores_v2.mean() - scores_v1.mean()) / scores_v1.std()
    return p_value < alpha and effect_size > 0.1  # significant and meaningful
Tracking Over Time
Metric               v1.0     v1.1     v1.2     v1.3
─────────────────────────────────────────────────────
Accuracy             0.82     0.84     0.83     0.87
Completeness         0.79     0.81     0.82     0.83
Latency (P95)        2.1s     2.0s     1.9s     2.2s
Safety violations    0.02     0.01     0.01     0.00
Cost per query       $0.05    $0.04    $0.04    $0.06
Track all metrics over time. Quality-cost-latency tradeoffs are common — document them explicitly when shipping a new version.
Connecting Evaluation to Business Metrics
The ultimate test of an LLM system: does it help users accomplish their goals? Bridge evaluation to business metrics:
LLM quality metric     →  User behavior metric    →  Business metric
Response accuracy      →  Task completion rate    →  Revenue / retention
Latency                →  Engagement              →  Session length
Safety violations      →  Report rate             →  Trust / compliance
Run periodic correlation analyses to verify that your LLM quality metrics actually predict user outcomes. If accuracy scores are up but task completion is flat, your evaluation is measuring the wrong thing.
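A minimal sketch of such a check, using Spearman rank correlation from scipy (which the pipeline above already depends on). The weekly numbers here are illustrative only:

```python
from scipy import stats

# Illustrative weekly aggregates: judge-scored accuracy vs. observed
# task completion rate over the same weeks.
weekly_accuracy   = [0.80, 0.82, 0.84, 0.83, 0.87, 0.88]
weekly_completion = [0.61, 0.63, 0.66, 0.64, 0.70, 0.71]

# Spearman asks: do the two series rank the weeks the same way?
rho, p_value = stats.spearmanr(weekly_accuracy, weekly_completion)
print(f"rho={rho:.2f}, p={p_value:.4f}")
```

A strong positive rho supports the quality metric as a leading indicator; a weak or negative one is the signal that your evaluation is measuring the wrong thing.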
Evaluation Infrastructure at Scale
Evaluation as a Service
Centralize evaluation rather than each team building their own:
[Model teams] → [Evaluation API] → [Eval dashboard]
                       ↑
               [Judge models]
               [Human eval pool]
               [Eval dataset library]
               [Statistical testing]
Cost Management
LLM-as-judge is expensive at scale. Optimize:
- Cache judge results for unchanged (prompt, response) pairs
- Use cheaper models for initial triage, expensive models for borderline cases
- Sample evaluation rather than evaluating 100% of production traffic
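The first optimization is a few lines of code. A sketch of judge-result caching keyed on the (prompt, response) pair, where `judge_fn` stands in for whatever expensive judge call you use:

```python
import hashlib

_judge_cache = {}

# Cache judge verdicts so an unchanged (prompt, response) pair is never
# re-scored; the key is a hash of both strings, separated unambiguously.
def cached_judge(judge_fn, prompt, response):
    key = hashlib.sha256(f"{prompt}\x00{response}".encode()).hexdigest()
    if key not in _judge_cache:
        _judge_cache[key] = judge_fn(prompt, response)  # expensive call
    return _judge_cache[key]
```

In a real deployment you would back this with a shared store (e.g. Redis) rather than a process-local dict, but the keying scheme is the same.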
Human Evaluation Pipeline
For high-stakes decisions (major model updates, safety evaluation), human evaluation is irreplaceable. Build a reliable pipeline:
- Clear annotation guidelines with examples
- Inter-annotator agreement measurement
- Regular calibration sessions
- Appropriate incentives for quality
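For the agreement measurement, Cohen's kappa is the usual statistic for two annotators, since it corrects raw agreement for what matching labels would occur by chance. A pure-Python sketch with illustrative labels:

```python
from collections import Counter

# Cohen's kappa for two annotators over categorical labels:
# (observed agreement - chance agreement) / (1 - chance agreement).
def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)
```

Kappa near 0 means your annotators agree no more than chance would predict, which usually points to unclear guidelines rather than careless annotators.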
Conclusion
Robust LLM evaluation is what separates teams that ship confidently from those that cross their fingers on every deployment. The key principles:
- Build a layered evaluation system (automated → LLM judge → human)
- Measure what matters for your users, not generic benchmarks
- Apply statistical rigor to version comparisons
- Connect LLM metrics to business outcomes
- Treat evaluation as infrastructure, not an afterthought
Want to go deeper on ML systems quality? Explore our courses on ML Systems Case Studies.