LLM-as-Judge

Use language models to evaluate AI outputs. LLM judges provide nuanced, context-aware evaluation that heuristic checks such as exact match or regex cannot.

How It Works

An LLM evaluator receives:

  1. The original input/prompt
  2. The model's response
  3. Evaluation criteria (rubric)
  4. Optional: expected output or reference

The judge model then scores the response and provides reasoning.
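
Conceptually, each case is assembled into a single grading prompt for the judge. A minimal sketch of that payload (the field names are illustrative, not GateFlow's internal schema):

python
# Illustrative only: what a judge conceptually receives per case.
judge_input = {
    "input": "Explain quantum computing to a 10-year-old",  # original prompt
    "response": "...",                                      # model output under test
    "rubric": ["age-appropriate", "accurate", "engaging"],  # evaluation criteria
    "reference": None,                                      # optional gold answer
}
# The judge returns a score plus free-text reasoning, e.g.:
# {"score": 87, "reasoning": "Good analogy; one minor simplification error."}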

Basic Usage

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.evaluate(
    model="gpt-4o",
    cases=[
        {
            "input": "Explain quantum computing to a 10-year-old",
            "expected_criteria": ["age-appropriate", "accurate", "engaging"]
        }
    ],
    evaluator="llm_judge"
)

for result in results:
    print(f"Score: {result.score}/100")
    print(f"Reasoning: {result.reasoning}")

Configuring the Judge

Judge Model Selection

python
results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "judge_model": "claude-opus-4-5",  # Which model judges
        "temperature": 0.0,                 # Deterministic judging
        "max_tokens": 500                   # Reasoning length
    }
)

Available Judge Models

Model              Cost   Best For
gpt-4o             $$     General quality evaluation
claude-opus-4-5    $$$    Nuanced reasoning, safety
claude-sonnet-4    $      Cost-effective judging
gpt-4o-mini        $      High-volume, simpler evals

Rubric-Based Judging

Define explicit scoring criteria:

python
rubric = {
    "criteria": [
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Information is factually correct",
            "scale": {
                "5": "Completely accurate, no errors",
                "3": "Mostly accurate, minor errors",
                "1": "Significant inaccuracies"
            }
        },
        {
            "name": "clarity",
            "weight": 0.3,
            "description": "Response is clear and understandable"
        },
        {
            "name": "completeness",
            "weight": 0.3,
            "description": "Addresses all parts of the question"
        }
    ]
}

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="rubric_judge",
    rubric=rubric
)

# Get per-criterion scores
for result in results:
    print(f"Accuracy: {result.criterion_scores['accuracy']}")
    print(f"Clarity: {result.criterion_scores['clarity']}")
    print(f"Completeness: {result.criterion_scores['completeness']}")

Pairwise Comparison

Compare two model outputs directly:

python
results = client.compare(
    prompt="Write a haiku about technology",
    responses={
        "model_a": "Silicon dreams flow\nThrough circuits of pure light\nFuture awakens",
        "model_b": "Computers are cool\nThey help us do many things\nTechnology rules"
    },
    evaluator="pairwise_judge",
    criteria=["creativity", "adherence_to_form", "imagery"]
)

print(f"Winner: {results.winner}")  # "model_a"
print(f"Reasoning: {results.reasoning}")

Reference-Based Judging

Compare against a gold-standard answer:

python
results = client.evaluate(
    model="gpt-4o",
    cases=[
        {
            "input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France."
        }
    ],
    evaluator="reference_judge",
    config={
        "match_type": "semantic",  # Not just exact match
        "partial_credit": True
    }
)
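
Semantic matching means the judge compares meaning rather than raw strings; a common way to approximate this is embedding similarity. A rough sketch of the idea, not GateFlow's actual implementation (embed() is a placeholder for any sentence-embedding model):

python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_match(response, reference, embed, threshold=0.85):
    # embed() is a hypothetical stand-in for any sentence-embedding model.
    similarity = cosine(embed(response), embed(reference))
    # With partial_credit, the similarity itself can act as a graded score;
    # without it, threshold into pass/fail.
    return similarity, similarity >= threshold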

Judge Prompt Customization

Override the default judge prompts:

python
custom_prompt = """
You are evaluating an AI assistant's response for our customer service application.

INPUT: {input}
RESPONSE: {response}

Evaluate on these criteria:
1. Professionalism (1-5)
2. Accuracy (1-5)
3. Empathy (1-5)

Provide your scores and brief reasoning.
"""

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "prompt_template": custom_prompt
    }
)
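
The {input} and {response} placeholders must appear verbatim in the template; presumably each case's fields are substituted in the style of Python's str.format:

python
# Illustration of the presumed placeholder substitution (not a GateFlow API).
filled = custom_prompt.format(
    input="Where is my order?",
    response="Your order shipped yesterday and should arrive by Friday.",
)
print(filled)  # the fully rendered prompt sent to the judge model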

Consistency and Calibration

Position Bias Mitigation

LLM judges tend to favor whichever response appears first in the prompt, a failure mode known as position bias. GateFlow mitigates this:

python
results = client.compare(
    prompt=prompt,
    responses={"a": resp_a, "b": resp_b},
    evaluator="pairwise_judge",
    config={
        "position_debias": True,  # Run both orderings
        "require_agreement": True  # Only count if both agree
    }
)
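
The idea behind position_debias is simple: run the comparison in both orderings and only count a winner when they agree. A standalone sketch, where judge_once() stands in for a single hypothetical comparison call:

python
def debiased_compare(prompt, resp_a, resp_b, judge_once):
    # judge_once(prompt, first, second) -> "first" or "second" (hypothetical).
    # Ordering 1: A shown first.
    verdict_1 = "a" if judge_once(prompt, resp_a, resp_b) == "first" else "b"
    # Ordering 2: positions swapped, B shown first.
    verdict_2 = "b" if judge_once(prompt, resp_b, resp_a) == "first" else "a"
    # require_agreement: declare a winner only if both orderings agree.
    return verdict_1 if verdict_1 == verdict_2 else "tie"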

Multi-Judge Ensemble

Use multiple judges for higher reliability:

python
results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "ensemble": ["gpt-4o", "claude-opus-4-5"],
        "aggregation": "majority"  # or "average", "unanimous"
    }
)
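
The aggregation modes map onto simple combination rules. A hedged sketch of what each plausibly computes (pass/fail verdicts for "majority" and "unanimous", numeric scores for "average"):

python
from collections import Counter

def aggregate(verdicts, mode):
    # verdicts: one entry per judge; bools for pass/fail, floats for scores.
    if mode == "majority":
        return Counter(verdicts).most_common(1)[0][0]  # most common verdict wins
    if mode == "unanimous":
        return all(verdicts)                           # pass only if every judge passes
    if mode == "average":
        return sum(verdicts) / len(verdicts)           # mean of numeric scores
    raise ValueError(f"unknown aggregation mode: {mode}")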

Cost Optimization

See Tiered Cost Optimization for how to reduce LLM judge costs by 97%.

Quick Tips

python
# Use cheaper models for clear-cut cases
judge_config = {
    "tiered": True,
    "tier_1_model": "gpt-4o-mini",      # Fast, cheap
    "tier_2_model": "claude-opus-4-5",  # Expensive, for ambiguous cases
    "escalation_threshold": 0.3,        # Escalate if confidence < 30%
}
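
The escalation flow behind that config looks roughly like this (judge() is a hypothetical helper returning a score and the judge's confidence):

python
def tiered_judge(case, judge):
    # Tier 1: the cheap, fast model handles clear-cut cases.
    score, confidence = judge(case, model="gpt-4o-mini")
    if confidence >= 0.3:  # escalation_threshold
        return score
    # Tier 2: ambiguous cases escalate to the stronger, pricier judge.
    score, _ = judge(case, model="claude-opus-4-5")
    return score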

Best Practices

  1. Be specific in rubrics - Vague criteria lead to inconsistent scores
  2. Use examples - Few-shot prompting improves judge accuracy (see the sketch after this list)
  3. Calibrate with humans - Compare judge scores to human ratings
  4. Monitor judge drift - Judge behavior can change with model updates
  5. Version your rubrics - Track changes for reproducibility
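
For tip 2, a few-shot judge prompt embeds one or two pre-graded examples ahead of the case under evaluation; it can then be supplied via judge_config={"prompt_template": ...} as shown earlier. A sketch (the examples and wording are illustrative):

python
# Illustrative few-shot judge prompt; graded examples anchor the scale.
few_shot_prompt = """
You grade answers for accuracy on a 1-5 scale.

EXAMPLE
INPUT: What is 2 + 2?
RESPONSE: 5
SCORE: 1 (factually wrong)

EXAMPLE
INPUT: What is the boiling point of water at sea level?
RESPONSE: 100 degrees Celsius.
SCORE: 5 (correct and complete)

Now grade the following.
INPUT: {input}
RESPONSE: {response}
SCORE:"""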
