tutorial · 2025-02-01 · 12 min read

Test-Time Compute Scaling: The New Dimension of AI Performance

Why spending more compute at inference time — not just training time — is becoming the dominant path to better AI performance, and how to engineer systems that do it effectively.

Tags: test-time compute, chain of thought, best-of-N, MCTS, reasoning, scaling laws, inference

Introduction

For years, the AI scaling story was simple: bigger models trained on more data meant better performance. But in 2024-2025, a new dimension of scaling emerged: test-time compute.

The insight: instead of spending all the compute budget during training, spend some of it during inference — letting the model "think longer" about hard problems. OpenAI's o1, DeepSeek-R1, and Google's Gemini Thinking all embody this shift.

This post explains the techniques, the tradeoffs, and how to build systems that leverage test-time compute effectively.

Why Test-Time Compute Works

The Intuition

When a human faces a hard math problem, they don't just blurt out an answer. They work through it step by step, check their work, and try alternative approaches when stuck. The quality of the answer scales with the time spent thinking.

LLMs trained with RL-based reasoning (like R1 and o1) have learned a similar behavior: generating longer, more careful reasoning traces leads to more correct final answers.

The Scaling Curve

DeepSeek and Anthropic both published evidence that for hard tasks (math competition problems, code debugging, scientific reasoning):

Performance = f(training_compute, test_time_compute)

Crucially, a smaller model with more test-time compute can outperform a larger model with less. This has massive implications for cost optimization.

Techniques for Test-Time Compute

1. Chain-of-Thought (CoT) Prompting

The simplest form: ask the model to reason step by step before answering.

Prompt: "Solve this problem. Think step by step..."
→ Model generates: [reasoning steps] → [final answer]

Performance scales with the number of reasoning tokens generated. This is "free" test-time compute in that it doesn't require multiple model calls.

Engineering note: Token streaming with CoT requires clients to handle reasoning tokens differently from output tokens — displaying or hiding them depending on UX requirements.
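As a concrete sketch of that engineering note, a streaming client can separate reasoning tokens from answer tokens. This illustrative helper assumes a hypothetical model that wraps its chain of thought in `<think>...</think>` markers (as R1-style models do); the tag names and the `split_reasoning` function are assumptions, not a standard API:

```python
def split_reasoning(stream, open_tag="<think>", close_tag="</think>"):
    # Partition a token stream into (reasoning, answer) so the client can
    # hide or display the reasoning trace depending on UX requirements.
    reasoning, answer, in_reasoning = [], [], False
    for token in stream:
        if token == open_tag:
            in_reasoning = True
        elif token == close_tag:
            in_reasoning = False
        elif in_reasoning:
            reasoning.append(token)
        else:
            answer.append(token)
    return "".join(reasoning), "".join(answer)
```

A real client would apply the same split incrementally as tokens arrive rather than buffering the whole stream.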

2. Best-of-N Sampling

Generate N independent solutions, then select the best one:

def best_of_n(prompt, n, verifier):
    # Sample n independent candidates, score each, return the highest-scoring one
    candidates = [model.generate(prompt) for _ in range(n)]
    scores = [verifier.score(c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]

Verifier types:

  • Outcome reward model (ORM): Scores the final answer quality
  • Process reward model (PRM): Scores each reasoning step
  • Self-consistency: Most frequent answer among N samples (no separate verifier)
  • Code execution: For code tasks, run and check output
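Of the verifier types above, the execution-based one is the easiest to sketch concretely. This is a minimal, hypothetical implementation that scores a candidate Python program by the fraction of known test cases it passes; the class name and interface are illustrative, not a standard API:

```python
import subprocess
import sys

class CodeExecutionVerifier:
    """Score a candidate Python program by running it against known test cases."""

    def __init__(self, test_cases):
        self.test_cases = test_cases  # list of (stdin, expected_stdout) pairs

    def score(self, candidate_code):
        passed = 0
        for stdin, expected in self.test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, "-c", candidate_code],
                    input=stdin, capture_output=True, text=True, timeout=5,
                )
                if result.stdout.strip() == expected.strip():
                    passed += 1
            except subprocess.TimeoutExpired:
                pass  # treat hangs as failures
        return passed / len(self.test_cases)
```

In production you would sandbox the subprocess; running untrusted model-generated code directly is unsafe.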

Performance with best-of-N scales logarithmically in N for most tasks. Going from N=1 to N=16 often closes half the gap to N=∞.

Cost: N times the generation cost. Suitable when accuracy matters more than latency/cost.

3. Self-Consistency

A special case of best-of-N that doesn't require a verifier:

from collections import Counter

def self_consistency(prompt, n, temperature=0.7):
    answers = []
    for _ in range(n):
        response = model.generate(prompt, temperature=temperature)
        answers.append(extract_final_answer(response))  # task-specific parser
    return Counter(answers).most_common(1)[0][0]  # majority vote

Works remarkably well for factual questions and math. For factual recall, even N=5 provides significant accuracy gains.

4. Sequential Refinement

Generate an initial response, then iteratively refine it:

Pass 1: Generate initial solution
Pass 2: Critique the solution → identify errors
Pass 3: Fix identified errors → improved solution
Pass 4: Verify the fix → final answer

Each pass uses the full model, and each builds on the previous. Works well for:

  • Code debugging (generate → test → fix loop)
  • Long-form writing (draft → critique → revise)
  • Complex analysis (initial reasoning → gap identification → deeper analysis)
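The critique-and-fix passes above can be sketched as a loop. This is an illustrative version, assuming a `generate` callable that wraps the model; the critique and rewrite prompts are our own invention, and a production loop would use task-specific prompts and a stronger stopping criterion:

```python
def sequential_refine(generate, prompt, n_rounds=3):
    # Pass 1: initial solution
    solution = generate(prompt)
    for _ in range(n_rounds):
        # Pass 2: critique the current solution
        critique = generate(f"Critique this solution, listing errors:\n{solution}")
        if "no errors" in critique.lower():
            break  # stop early when the critic finds nothing to fix
        # Pass 3: rewrite, conditioned on the critique
        solution = generate(
            f"Problem: {prompt}\nSolution: {solution}\n"
            f"Critique: {critique}\nRewrite the solution fixing these errors."
        )
    return solution
```

For code debugging, the critique step is often replaced by actually running the tests and feeding the failure output back in.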

5. Tree Search (MCTS for LLMs)

Monte Carlo Tree Search applied to token generation:

Root: [prompt]
     ├── Branch A: [path 1]
     │       ├── A1: [continuation]
     │       └── A2: [alternative]
     └── Branch B: [path 2]
             └── B1: [continuation]

The model explores a tree of partial solutions, using a value function (PRM or ORM) to estimate solution quality at intermediate states. High-value branches are expanded further.

Why it's powerful: Can discover solutions that no single chain of thought would find. AlphaProof used this to solve IMO-level math problems.

Why it's expensive: O(tree_width × tree_depth) model calls. Practical for offline/batch settings, challenging for real-time serving.
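Full MCTS (with rollouts and backpropagation) is beyond a short example, but the core idea of expanding high-value branches can be sketched as a best-first beam over partial solutions. Here `generate_step` and `value_fn` stand in for the model's continuation sampler and the PRM/ORM value function; both names are assumptions:

```python
import heapq

def best_first_search(generate_step, value_fn, prompt, width=3, depth=4):
    # Expand each frontier node, score children with value_fn, and keep
    # only the `width` highest-value partial solutions per level.
    frontier = [(-value_fn(prompt), prompt)]  # negate values: heapq is a min-heap
    best_value, best_node = float("-inf"), prompt
    for _ in range(depth):
        children = []
        for _neg_v, prefix in frontier:
            for continuation in generate_step(prefix):
                node = prefix + continuation
                v = value_fn(node)
                children.append((-v, node))
                if v > best_value:
                    best_value, best_node = v, node
        # Smallest negated values = highest estimated quality
        frontier = heapq.nsmallest(width, children)
        if not frontier:
            break
    return best_node
```

The cost matches the O(tree_width × tree_depth) bound above: each level costs roughly `width` expansions.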

Engineering Systems for Test-Time Compute

Architecture for Best-of-N at Scale

Client → [Request Router]
              ↓
    [Inference Queue]
              ↓
    ┌─────────────────┐
    │  N parallel     │
    │  generation     │
    │  workers        │
    └────────┬────────┘
             ↓
    [Verifier / Scorer]
             ↓
    [Best response → Client]

Key considerations:

  • Batching: N generations for one request can share a batch, improving GPU utilization
  • Cancellation: Once verification begins, you can cancel remaining generations early if a high-confidence winner is found
  • Cost tracking: Attribute compute costs to requests accurately
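The cancellation idea above can be sketched with asyncio: launch N generations concurrently and cancel the stragglers as soon as one candidate clears a confidence threshold. The function names and the threshold value are illustrative:

```python
import asyncio

async def best_of_n_with_early_exit(generate, verify, prompt, n=8, threshold=0.95):
    # Launch n generations concurrently; as results arrive, score them and
    # cancel the remaining tasks once a candidate clears the threshold.
    tasks = [asyncio.create_task(generate(prompt)) for _ in range(n)]
    best_score, best = float("-inf"), None
    for finished in asyncio.as_completed(tasks):
        candidate = await finished
        score = verify(candidate)
        if score > best_score:
            best_score, best = score, candidate
        if score >= threshold:
            for t in tasks:
                t.cancel()  # no-op for tasks that already completed
            break
    return best
```

Cancelling early both cuts cost and frees GPU capacity for queued requests.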

Dynamic Compute Allocation

Not every request needs maximum test-time compute. A routing layer can classify request difficulty and allocate compute accordingly:

def adaptive_inference(prompt, budget, verifier):
    difficulty = classifier.estimate_difficulty(prompt)

    if difficulty < 0.3:
        return model.generate(prompt)                     # easy: single pass
    elif difficulty < 0.7:
        return best_of_n(prompt, n=4, verifier=verifier)  # medium
    else:
        return tree_search(prompt, budget)                # hard

This is "compute routing" — similar to mixture of experts but at the request level.

Latency vs. Accuracy Tradeoffs

Method            Latency           Accuracy    Cost
Single pass       1x                baseline    1x
CoT               2-4x              +10-20%     2-4x tokens
Best-of-4         ~1x (parallel)    +15-25%     4x
Best-of-16        ~1x (parallel)    +25-40%     16x
MCTS (depth 4)    10-100x           +40-60%     high

For latency-sensitive applications: Best-of-N with parallel generation is most practical. For accuracy-critical offline tasks: sequential refinement or MCTS.

The Economics of Test-Time Compute

When It Makes Sense

Test-time compute is economically justified when:

  • High task value: The cost of an error exceeds the compute cost
  • Soft real-time: Latency requirements are seconds, not milliseconds
  • Hard verification: You can objectively verify correctness (code execution, math checking)

Model Selection for Test-Time Scaling

An important finding: test-time compute scales better on stronger base models.

A 7B model with 16x compute may not reach the performance of a 70B model with 1x compute, but a 70B model with 16x compute substantially outperforms a 70B model with 1x.

This creates an interesting cost optimization: compare {smaller model + more inference compute} vs. {larger model + less inference compute} for your target quality level.
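A back-of-envelope version of that comparison, using made-up per-token prices purely for illustration (substitute your provider's actual rates):

```python
# Hypothetical prices in $ per 1M output tokens; these numbers are assumptions.
price_7b, price_70b = 0.10, 1.00
tokens_per_answer = 2000

cost_small_bo16 = 16 * tokens_per_answer / 1e6 * price_7b    # 7B, best-of-16
cost_large_single = 1 * tokens_per_answer / 1e6 * price_70b  # 70B, single pass

# With these assumed prices, best-of-16 on the small model ($0.0032/request)
# actually costs more than a single pass of the large model ($0.0020/request),
# so the comparison only favors the small model if it hits the quality bar.
print(f"7B best-of-16:   ${cost_small_bo16:.4f} per request")
print(f"70B single pass: ${cost_large_single:.4f} per request")
```

The decision also depends on latency and on whether the small model plus verifier actually reaches the target quality, which must be measured per task.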

Conclusion

Test-time compute scaling represents a fundamental shift in how we think about AI system performance. Training compute is fixed at deployment; test-time compute is a dial you can turn up based on the task at hand.

The teams that build reliable, cost-effective systems on top of this paradigm — with smart routing, efficient parallel generation, and good verifiers — will have a significant advantage in the next wave of AI applications.

