# LLM-as-Judge
Use language models to evaluate AI outputs. LLM judges provide nuanced, context-aware evaluation that rule-based heuristics cannot match.
## How It Works
An LLM evaluator receives:
- The original input/prompt
- The model's response
- Evaluation criteria (rubric)
- Optional: expected output or reference
The judge model then scores the response and provides reasoning.
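Conceptually, these pieces are assembled into a single evaluation prompt for the judge model. A minimal sketch of that assembly (illustrative only — `build_judge_prompt` and its format are not GateFlow internals):

```python
def build_judge_prompt(user_input, response, criteria, reference=None):
    """Assemble the inputs an LLM judge sees into one evaluation prompt."""
    lines = [
        "You are an impartial judge. Score the response below.",
        f"INPUT: {user_input}",
        f"RESPONSE: {response}",
        "CRITERIA: " + ", ".join(criteria),
    ]
    if reference is not None:
        # Optional gold-standard answer for reference-based judging.
        lines.append(f"REFERENCE ANSWER: {reference}")
    lines.append("Return a score from 1-100 and a short justification.")
    return "\n".join(lines)

prompt = build_judge_prompt(
    "What is 2 + 2?",
    "2 + 2 equals 4.",
    ["accuracy", "clarity"],
    reference="4",
)
print(prompt)
```

The judge's reply is then parsed back into a score plus free-text reasoning.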
## Basic Usage
```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.evaluate(
    model="gpt-4o",
    cases=[
        {
            "input": "Explain quantum computing to a 10-year-old",
            "expected_criteria": ["age-appropriate", "accurate", "engaging"]
        }
    ],
    evaluator="llm_judge"
)

for result in results:
    print(f"Score: {result.score}/100")
    print(f"Reasoning: {result.reasoning}")
```

## Configuring the Judge
### Judge Model Selection
```python
results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "judge_model": "claude-opus-4-5",  # Which model judges
        "temperature": 0.0,                # Deterministic judging
        "max_tokens": 500                  # Reasoning length
    }
)
```

### Available Judge Models
| Model | Cost | Best For |
|---|---|---|
| `gpt-4o` | $$ | General quality evaluation |
| `claude-opus-4-5` | $$$ | Nuanced reasoning, safety |
| `claude-sonnet-4` | $ | Cost-effective judging |
| `gpt-4o-mini` | $ | High-volume, simpler evals |
## Rubric-Based Judging
Define explicit scoring criteria:
```python
rubric = {
    "criteria": [
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Information is factually correct",
            "scale": {
                "5": "Completely accurate, no errors",
                "3": "Mostly accurate, minor errors",
                "1": "Significant inaccuracies"
            }
        },
        {
            "name": "clarity",
            "weight": 0.3,
            "description": "Response is clear and understandable"
        },
        {
            "name": "completeness",
            "weight": 0.3,
            "description": "Addresses all parts of the question"
        }
    ]
}

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="rubric_judge",
    rubric=rubric
)

# Get per-criterion scores
for result in results:
    print(f"Accuracy: {result.criterion_scores['accuracy']}")
    print(f"Clarity: {result.criterion_scores['clarity']}")
    print(f"Completeness: {result.criterion_scores['completeness']}")
```

## Pairwise Comparison
Compare two model outputs directly:
```python
results = client.compare(
    prompt="Write a haiku about technology",
    responses={
        "model_a": "Silicon dreams flow\nThrough circuits of pure light\nFuture awakens",
        "model_b": "Computers are cool\nThey help us do many things\nTechnology rules"
    },
    evaluator="pairwise_judge",
    criteria=["creativity", "adherence_to_form", "imagery"]
)

print(f"Winner: {results.winner}")  # "model_a"
print(f"Reasoning: {results.reasoning}")
```

## Reference-Based Judging
Compare against a gold-standard answer:
```python
results = client.evaluate(
    model="gpt-4o",
    cases=[
        {
            "input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France."
        }
    ],
    evaluator="reference_judge",
    config={
        "match_type": "semantic",  # Not just exact match
        "partial_credit": True
    }
)
```

## Judge Prompt Customization
Override the default judge prompts:
```python
custom_prompt = """
You are evaluating an AI assistant's response for our customer service application.

INPUT: {input}
RESPONSE: {response}

Evaluate on these criteria:
1. Professionalism (1-5)
2. Accuracy (1-5)
3. Empathy (1-5)

Provide your scores and brief reasoning.
"""

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "prompt_template": custom_prompt
    }
)
```

## Consistency and Calibration
### Position Bias Mitigation
LLM judges can exhibit position bias, systematically favoring whichever response appears first. GateFlow mitigates this:
```python
results = client.compare(
    prompt=prompt,
    responses={"a": resp_a, "b": resp_b},
    evaluator="pairwise_judge",
    config={
        "position_debias": True,   # Run both orderings
        "require_agreement": True  # Only count if both agree
    }
)
```

### Multi-Judge Ensemble
Use multiple judges for higher reliability:
```python
results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="llm_judge",
    judge_config={
        "ensemble": ["gpt-4o", "claude-opus-4-5"],
        "aggregation": "majority"  # or "average", "unanimous"
    }
)
```

## Cost Optimization
See Tiered Cost Optimization for how to reduce LLM judge costs by 97%.
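The mechanics behind that saving are easy to sketch: a cheap judge scores every case first, and only verdicts below a confidence threshold are escalated to the expensive judge. A minimal illustration (the judge functions here are stand-ins, not GateFlow APIs):

```python
def cheap_judge(case):
    """Stand-in tier-1 judge: confident only on easy cases."""
    return (80, 0.9) if "easy" in case else (50, 0.2)

def strong_judge(case):
    """Stand-in tier-2 judge: slower and pricier, but confident."""
    return (65, 0.95)

def tiered_judge(case, tier_1, tier_2, escalation_threshold=0.3):
    """Score with the cheap judge first; escalate low-confidence cases."""
    score, confidence = tier_1(case)
    if confidence < escalation_threshold:
        # Ambiguous case: pay for the stronger judge.
        score, confidence = tier_2(case)
    return score, confidence

print(tiered_judge("easy case", cheap_judge, strong_judge))  # cheap verdict kept
print(tiered_judge("hard case", cheap_judge, strong_judge))  # escalated to tier 2
```

Because clear-cut cases never reach the expensive tier, most of the judging cost falls away while ambiguous cases still get a strong judge.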
### Quick Tips
```python
# Use cheaper models for clear-cut cases
judge_config = {
    "tiered": True,
    "tier_1_model": "gpt-4o-mini",      # Fast, cheap
    "tier_2_model": "claude-opus-4-5",  # Expensive, for ambiguous cases
    "escalation_threshold": 0.3         # Escalate if confidence < 30%
}
```

## Best Practices
- **Be specific in rubrics** - Vague criteria lead to inconsistent scores
- **Use examples** - Few-shot prompting improves judge accuracy
- **Calibrate with humans** - Compare judge scores to human ratings
- **Monitor judge drift** - Judge behavior can change with model updates
- **Version your rubrics** - Track changes for reproducibility
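For the calibration point above, a simple check is to correlate judge scores with human ratings on a shared sample of cases. A sketch using only the standard library (the scores below are made-up illustrative data):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Same cases scored by the LLM judge and by human raters.
judge_scores = [90, 70, 40, 85, 60]
human_scores = [88, 65, 35, 90, 55]

# A value near 1.0 suggests the judge tracks human judgment.
print(f"judge-human correlation: {pearson(judge_scores, human_scores):.2f}")
```

Spearman rank correlation is a common alternative when the judge and humans use different scales, since it only compares orderings.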
## Next Steps
- Heuristic Evaluators - Fast, rule-based checks
- Tiered Approach - 97% cost reduction
- Core Concepts - Understand the eval model