LLM-as-Judge

Use language models to evaluate AI outputs. LLM judges provide nuanced, context-aware evaluation that heuristic checks such as exact match or regex cannot.

How It Works

An LLM evaluator receives:

  1. The original input/prompt
  2. The model's response
  3. Evaluation criteria (rubric)
  4. Optional: expected output or reference

The judge model then scores the response and provides reasoning.
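
Conceptually, each case is assembled into a single grading prompt for the judge. A minimal sketch of that payload (the field names are illustrative, not GateFlow's internal schema):

python
# Illustrative only: what a judge conceptually receives per case.
judge_input = {
    "input": "Explain quantum computing to a 10-year-old",  # original prompt
    "response": "...",                                      # model output under test
    "rubric": ["age-appropriate", "accurate", "engaging"],  # evaluation criteria
    "reference": None,                                      # optional gold answer
}
# The judge returns a score plus free-text reasoning, e.g.:
# {"score": 87, "reasoning": "Good analogy; one minor simplification error."}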

Basic Usage

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.evaluate(
    model="gpt-4o",
    cases=[
        {
            "input": "Explain quantum computing to a 10-year-old",
            "expected_criteria": ["age-appropriate", "accurate", "engaging"]
        }
    ],
    evaluator="llm_judge"
)

for result in results:
    print(f"Score: {result.score}/100")
    print(f"Reasoning: {result.reasoning}")

Configuring the Judge

Judge Model Selection

python
results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "judge_model": "claude-opus-4-5",  # Which model judges
        "temperature": 0.0,                 # Deterministic judging
        "max_tokens": 500                   # Reasoning length
    }
)

Available Judge Models

Model              Cost   Best For
gpt-4o             $$     General quality evaluation
claude-opus-4-5    $$$    Nuanced reasoning, safety
claude-sonnet-4    $      Cost-effective judging
gpt-4o-mini        $      High-volume, simpler evals

Rubric-Based Judging

Define explicit scoring criteria:

python
rubric = {
    "criteria": [
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Information is factually correct",
            "scale": {
                "5": "Completely accurate, no errors",
                "3": "Mostly accurate, minor errors",
                "1": "Significant inaccuracies"
            }
        },
        {
            "name": "clarity",
            "weight": 0.3,
            "description": "Response is clear and understandable"
        },
        {
            "name": "completeness",
            "weight": 0.3,
            "description": "Addresses all parts of the question"
        }
    ]
}

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="rubric_judge",
    rubric=rubric
)

# Get per-criterion scores
for result in results:
    print(f"Accuracy: {result.criterion_scores['accuracy']}")
    print(f"Clarity: {result.criterion_scores['clarity']}")
    print(f"Completeness: {result.criterion_scores['completeness']}")

Pairwise Comparison

Compare two model outputs directly:

python
results = client.compare(
    prompt="Write a haiku about technology",
    responses={
        "model_a": "Silicon dreams flow\nThrough circuits of pure light\nFuture awakens",
        "model_b": "Computers are cool\nThey help us do many things\nTechnology rules"
    },
    evaluator="pairwise_judge",
    criteria=["creativity", "adherence_to_form", "imagery"]
)

print(f"Winner: {results.winner}")  # "model_a"
print(f"Reasoning: {results.reasoning}")

Reference-Based Judging

Compare against a gold-standard answer:

python
results = client.evaluate(
    model="gpt-4o",
    cases=[
        {
            "input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France."
        }
    ],
    evaluator="reference_judge",
    config={
        "match_type": "semantic",  # Not just exact match
        "partial_credit": True
    }
)
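
Semantic matching means the judge compares meaning rather than raw strings; a common way to approximate this is embedding similarity. A rough sketch of the idea, not GateFlow's actual implementation (embed() is a placeholder for any sentence-embedding model):

python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_match(response, reference, embed, threshold=0.85):
    # embed() is a hypothetical stand-in for any sentence-embedding model.
    similarity = cosine(embed(response), embed(reference))
    # With partial_credit, the similarity itself can act as a graded score;
    # without it, threshold into pass/fail.
    return similarity, similarity >= threshold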

Judge Prompt Customization

Override the default judge prompts:

python
custom_prompt = """
You are evaluating an AI assistant's response for our customer service application.

INPUT: {input}
RESPONSE: {response}

Evaluate on these criteria:
1. Professionalism (1-5)
2. Accuracy (1-5)
3. Empathy (1-5)

Provide your scores and brief reasoning.
"""

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "prompt_template": custom_prompt
    }
)
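
The {input} and {response} placeholders must appear verbatim in the template; presumably each case's fields are substituted in the style of Python's str.format:

python
# Illustration of the presumed placeholder substitution (not a GateFlow API).
filled = custom_prompt.format(
    input="Where is my order?",
    response="Your order shipped yesterday and should arrive by Friday.",
)
print(filled)  # the fully rendered prompt sent to the judge model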

Consistency and Calibration

Position Bias Mitigation

LLM judges tend to favor whichever response appears first in the prompt, a failure mode known as position bias. GateFlow mitigates this:

python
results = client.compare(
    prompt=prompt,
    responses={"a": resp_a, "b": resp_b},
    evaluator="pairwise_judge",
    config={
        "position_debias": True,  # Run both orderings
        "require_agreement": True  # Only count if both agree
    }
)
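
The idea behind position_debias is simple: run the comparison in both orderings and only count a winner when they agree. A standalone sketch, where judge_once() stands in for a single hypothetical comparison call:

python
def debiased_compare(prompt, resp_a, resp_b, judge_once):
    # judge_once(prompt, first, second) -> "first" or "second" (hypothetical).
    # Ordering 1: A shown first.
    verdict_1 = "a" if judge_once(prompt, resp_a, resp_b) == "first" else "b"
    # Ordering 2: positions swapped, B shown first.
    verdict_2 = "b" if judge_once(prompt, resp_b, resp_a) == "first" else "a"
    # require_agreement: declare a winner only if both orderings agree.
    return verdict_1 if verdict_1 == verdict_2 else "tie"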

Multi-Judge Ensemble

Use multiple judges for higher reliability:

python
results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "ensemble": ["gpt-4o", "claude-opus-4-5"],
        "aggregation": "majority"  # or "average", "unanimous"
    }
)
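
The aggregation modes map onto simple combination rules. A hedged sketch of what each plausibly computes (pass/fail verdicts for "majority" and "unanimous", numeric scores for "average"):

python
from collections import Counter

def aggregate(verdicts, mode):
    # verdicts: one entry per judge; bools for pass/fail, floats for scores.
    if mode == "majority":
        return Counter(verdicts).most_common(1)[0][0]  # most common verdict wins
    if mode == "unanimous":
        return all(verdicts)                           # pass only if every judge passes
    if mode == "average":
        return sum(verdicts) / len(verdicts)           # mean of numeric scores
    raise ValueError(f"unknown aggregation mode: {mode}")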

Cost Optimization

See Tiered Cost Optimization for how to reduce LLM judge costs by 97%.

Quick Tips

python
# Use cheaper models for clear-cut cases
judge_config = {
    "tiered": True,
    "tier_1_model": "gpt-4o-mini",      # Fast, cheap
    "tier_2_model": "claude-opus-4-5",  # Expensive, for ambiguous cases
    "escalation_threshold": 0.3,        # Escalate if confidence < 30%
}
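
The escalation flow behind that config looks roughly like this (judge() is a hypothetical helper returning a score and the judge's confidence):

python
def tiered_judge(case, judge):
    # Tier 1: the cheap, fast model handles clear-cut cases.
    score, confidence = judge(case, model="gpt-4o-mini")
    if confidence >= 0.3:  # escalation_threshold
        return score
    # Tier 2: ambiguous cases escalate to the stronger, pricier judge.
    score, _ = judge(case, model="claude-opus-4-5")
    return score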

Best Practices

  1. Be specific in rubrics - Vague criteria lead to inconsistent scores
  2. Use examples - Few-shot prompting improves judge accuracy (see the sketch after this list)
  3. Calibrate with humans - Compare judge scores to human ratings
  4. Monitor judge drift - Judge behavior can change with model updates
  5. Version your rubrics - Track changes for reproducibility
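
For tip 2, a few-shot judge prompt embeds one or two pre-graded examples ahead of the case under evaluation; it can then be supplied via judge_config={"prompt_template": ...} as shown earlier. A sketch (the examples and wording are illustrative):

python
# Illustrative few-shot judge prompt; graded examples anchor the scale.
few_shot_prompt = """
You grade answers for accuracy on a 1-5 scale.

EXAMPLE
INPUT: What is 2 + 2?
RESPONSE: 5
SCORE: 1 (factually wrong)

EXAMPLE
INPUT: What is the boiling point of water at sea level?
RESPONSE: 100 degrees Celsius.
SCORE: 5 (correct and complete)

Now grade the following.
INPUT: {input}
RESPONSE: {response}
SCORE:"""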
