Design Pattern · 2024-08-05 · 11 min read

RLHF with Rubrics as Rewards: A Practical Approach

How to use structured rubrics instead of human preferences for more consistent and interpretable RLHF.

Tags: RLHF, rubrics, rewards, LLM alignment

Introduction

Traditional RLHF relies on human preference comparisons, which can be noisy and inconsistent. Rubric-based rewards offer a more structured alternative for aligning language models.

The Problem with Preference-Based RLHF

Inconsistency

  • Different annotators prefer different things
  • Same annotator varies over time
  • Hard to define what "better" means

Opacity

  • Why was response A preferred over B?
  • What aspects were compared?
  • How to improve systematically?

Rubrics as Rewards

What is a Rubric?

A rubric is a structured set of evaluation criteria, typically organized as weighted dimensions. For example, in YAML:

rubric:
  helpfulness:
    weight: 0.4
    criteria:
      - Directly addresses the question
      - Provides actionable information
      - Appropriate level of detail
  accuracy:
    weight: 0.3
    criteria:
      - Factually correct
      - No hallucinations
      - Acknowledges uncertainty
  safety:
    weight: 0.3
    criteria:
      - No harmful content
      - Appropriate refusals
      - Privacy-preserving

Advantages

  1. Explicit criteria: Clear what matters
  2. Consistent scoring: Same standards applied
  3. Interpretable feedback: Know what to improve
  4. Flexible weighting: Adjust importance

Implementation

Rubric Evaluation

def evaluate_with_rubric(response, rubric):
    """Score a response on each rubric dimension, then combine by weight."""
    scores = {}
    for dimension, config in rubric.items():
        # Use an LLM judge to score each criterion, then average per dimension
        dimension_score = 0.0
        for criterion in config['criteria']:
            dimension_score += llm_judge(response, criterion)
        scores[dimension] = dimension_score / len(config['criteria'])

    # Weighted combination across dimensions
    final_score = sum(
        scores[d] * rubric[d]['weight']
        for d in rubric
    )
    return final_score
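As a quick sanity check, the function can be exercised end to end with a stub judge. Everything here is illustrative: a real `llm_judge` would call a model, and the one-criterion rubric below is just a compact stand-in for the fuller rubric above.

```python
def llm_judge(response, criterion):
    # Hypothetical stub standing in for a real LLM judge call;
    # always returns 4 so the result is deterministic
    return 4

def evaluate_with_rubric(response, rubric):
    """Score a response on each rubric dimension, then combine by weight."""
    scores = {}
    for dimension, config in rubric.items():
        dimension_score = 0.0
        for criterion in config['criteria']:
            dimension_score += llm_judge(response, criterion)
        scores[dimension] = dimension_score / len(config['criteria'])
    return sum(scores[d] * rubric[d]['weight'] for d in rubric)

rubric = {
    'helpfulness': {'weight': 0.4, 'criteria': ['Directly addresses the question']},
    'accuracy':    {'weight': 0.3, 'criteria': ['Factually correct']},
    'safety':      {'weight': 0.3, 'criteria': ['No harmful content']},
}
score = evaluate_with_rubric("Example answer.", rubric)
print(round(score, 6))  # 4.0: every criterion scored 4, so the weighted average is 4
```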

Training Loop

Generate Response -> Evaluate with Rubric -> Compute Reward -> PPO Update
       |                     |                    |               |
   (sampling)           (LLM judge)          (aggregate)     (optimize)
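In code, one pass over this loop has roughly the following shape. All three helpers below are hypothetical stand-ins: in a real setup, `sample_response` would decode from the policy model and `ppo_step` would be your RL framework's update call.

```python
def sample_response(policy, prompt):
    # Stand-in for sampling a completion from the policy LM
    return f"{policy} answers: {prompt}"

def rubric_reward(response):
    # Stand-in for evaluate_with_rubric(); returns a score in [1, 5]
    return 3.5

def ppo_step(policy, prompt, response, reward):
    # Stand-in for the actual PPO update on (prompt, response, reward)
    return policy

policy = "toy-policy"
prompts = ["What is RLHF?", "Explain PPO."]
rewards = []
for prompt in prompts:
    response = sample_response(policy, prompt)            # sampling
    reward = rubric_reward(response)                      # LLM judge + aggregate
    policy = ppo_step(policy, prompt, response, reward)   # optimize
    rewards.append(reward)

print(len(rewards))  # one reward per prompt
```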

LLM-as-Judge

Prompt Design

You are evaluating a response on the following criterion:
{criterion}

Response to evaluate:
{response}

Score from 1-5 with explanation:
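A minimal helper for filling this template and parsing the judge's reply might look like the following. The regex-based parsing is an assumption: reply formats vary by model, so adapt it to whatever your judge actually emits.

```python
import re

JUDGE_TEMPLATE = """You are evaluating a response on the following criterion:
{criterion}

Response to evaluate:
{response}

Score from 1-5 with explanation:"""

def build_judge_prompt(response, criterion):
    """Fill the judge prompt template for one (response, criterion) pair."""
    return JUDGE_TEMPLATE.format(criterion=criterion, response=response)

def parse_judge_score(reply):
    """Extract the first standalone 1-5 digit from the reply; None if absent."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt("Paris is the capital of France.", "Factually correct")
print(parse_judge_score("Score: 5. The response is accurate."))  # 5
```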

Calibration

  • Use few-shot examples
  • Include edge cases
  • Validate against human judgment
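Validating against human judgment can be as simple as correlating judge scores with human scores on a shared held-out set. A sketch using Pearson correlation in plain Python; the score arrays below are made up for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical calibration data: judge vs. human scores on the same responses
judge_scores = [5, 4, 2, 3, 1]
human_scores = [5, 5, 2, 3, 1]
r = pearson(judge_scores, human_scores)
print(round(r, 3))  # close to 1.0 indicates a well-calibrated judge
```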

Multi-Judge

  • Use multiple prompts/models
  • Aggregate scores
  • Flag disagreements
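The aggregate-and-flag step can be sketched as follows; the mean-plus-spread rule and the threshold of 2 points are arbitrary choices, not a prescribed method:

```python
def aggregate_judges(scores, disagreement_threshold=2):
    """Average scores from several judges; flag if they spread too far apart."""
    mean = sum(scores) / len(scores)
    spread = max(scores) - min(scores)
    flagged = spread >= disagreement_threshold
    return mean, flagged

print(aggregate_judges([4, 4, 5]))  # judges agree: not flagged
print(aggregate_judges([2, 5, 4]))  # 3-point spread: flagged for review
```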

Practical Considerations

Rubric Design

  1. Start broad: Major quality dimensions
  2. Refine iteratively: Add specificity where needed
  3. Validate empirically: Check rubric predicts user satisfaction

Computational Cost

  • LLM judges are expensive
  • Cache common evaluations
  • Sample-based training
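Because the same (response, criterion) pair can recur many times during training, caching judge calls is an easy win. A minimal sketch with `functools.lru_cache`; the judge here is a counting stub (not a real model call) so the cache's effect is visible:

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=100_000)
def cached_judge(response, criterion):
    # Stand-in for a real LLM judge call; counts invocations to show caching
    calls["n"] += 1
    return 4  # hypothetical fixed score

cached_judge("Same response", "Factually correct")
cached_judge("Same response", "Factually correct")  # served from the cache
print(calls["n"])  # 1: the second call never reached the judge
```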

Reward Hacking

  • Models may game specific criteria
  • Diversify rubrics
  • Human oversight for edge cases

Results

Compared to traditional RLHF:

  • More consistent rewards across evaluators
  • Faster iteration on reward criteria
  • Better interpretability of model changes
  • Similar or better final model quality

Learn advanced alignment techniques in our LLM courses.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.