# Tiered Cost Optimization

Cut evaluation costs by roughly 97% relative to GPT-4-class judges by using intelligent tiered evaluation.
## The Problem
LLM-as-Judge evaluation is powerful but expensive:
| Approach | Cost per 1000 evals | Quality |
|---|---|---|
| GPT-4o for all | ~$150 | Excellent |
| Claude Opus for all | ~$200 | Excellent |
| Tiered approach | ~$5 | Excellent |
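The arithmetic behind the tiered figure is simple: the cheap tiers absorb most of the volume. A quick sketch, using the illustrative resolution rates (70/25/5) and per-eval prices ($0.01 for the small LLM, $0.05 for the large LLM) that appear elsewhere on this page; none of these are official pricing:

```python
# Blended cost per N evals under the tiered resolution mix.
TIER_RATE = {"tier_1": 0.70, "tier_2": 0.25, "tier_3": 0.05}   # share resolved
TIER_PRICE = {"tier_1": 0.00, "tier_2": 0.01, "tier_3": 0.05}  # $ per eval

def blended_cost(n_evals: int) -> float:
    """Expected cost when each tier resolves its share of the cases."""
    return sum(n_evals * TIER_RATE[t] * TIER_PRICE[t] for t in TIER_RATE)

print(f"${blended_cost(1000):.2f}")  # → $5.00, vs ~$150 for all-GPT-4o
```

The same math scales linearly: at 10,000 evals/month the blended cost is $50, matching the cost breakdown later on this page.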
## How Tiered Evaluation Works
```
┌────────────────────┐
│   Input + Output   │
└─────────┬──────────┘
          │
┌─────────▼──────────┐
│ Tier 1: Heuristics │  ← Free, <1ms
│   (70% resolved)   │
└─────────┬──────────┘
          │ Ambiguous
┌─────────▼──────────┐
│ Tier 2: Small LLM  │  ← Cheap, ~200ms
│   (25% resolved)   │
└─────────┬──────────┘
          │ Still unclear
┌─────────▼──────────┐
│ Tier 3: Large LLM  │  ← Expensive, ~2s
│   (5% resolved)    │
└────────────────────┘
```

## Configuring Tiered Evaluation
```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="tiered",
    config={
        # Tier 1: Heuristics (free)
        "tier_1": {
            "checks": [
                "json_valid",
                "length_check",
                "contains_required",
                "pii_check"
            ],
            "fail_fast": True  # If heuristics fail, don't escalate
        },
        # Tier 2: Small/fast LLM (cheap)
        "tier_2": {
            "model": "gpt-4o-mini",  # or "claude-haiku"
            "confidence_threshold": 0.7,  # Escalate if confidence < 70%
            "temperature": 0.0
        },
        # Tier 3: Large LLM (expensive, only for ambiguous cases)
        "tier_3": {
            "model": "claude-opus-4-5",
            "temperature": 0.0
        }
    }
)
```

## Tier Resolution Statistics
After running tiered evaluation, inspect where cases were resolved:
```python
print(results.tier_stats)
# {
#   "tier_1_resolved": 712,   # 71.2%
#   "tier_2_resolved": 243,   # 24.3%
#   "tier_3_resolved": 45,    # 4.5%
#   "total_cases": 1000,
#   "estimated_cost": "$4.82",
#   "full_llm_cost": "$152.00",
#   "savings": "96.8%"
# }
```

## Heuristic Pre-Filtering
The first tier catches obvious cases:
### Pass-Through Heuristics
Cases that clearly pass without LLM evaluation:
- Response matches expected format exactly
- All required elements present
- No policy violations detected
### Fail-Fast Heuristics
Cases that clearly fail:
- Invalid JSON when JSON required
- Contains PII when prohibited
- Exceeds length limits
- Contains profanity/blocked content
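A minimal sketch of what checks like these can look like under the hood; the function names and the pass/fail/escalate verdict logic here are ours for illustration, not the GateFlow API:

```python
import json
import re

def json_valid(text: str) -> bool:
    """Pass-through check: response parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def length_ok(text: str, min_len: int = 50, max_len: int = 500) -> bool:
    """Pass-through check: response length within bounds."""
    return min_len <= len(text) <= max_len

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def contains_pii(text: str) -> bool:
    """Fail-fast check: crude PII detection (emails only, for illustration)."""
    return bool(EMAIL_RE.search(text))

def tier_1_verdict(text: str) -> str:
    """Return 'pass', 'fail', or 'escalate' (ambiguous -> next tier)."""
    if contains_pii(text) or len(text) <= 10:
        return "fail"
    if json_valid(text) and length_ok(text):
        return "pass"
    return "escalate"
```

Anything neither clearly passing nor clearly failing returns `"escalate"`, which is exactly the population Tier 2 exists to handle.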
```python
# Example: ~70% of cases resolved at Tier 1
tier_1_config = {
    "pass_if": [
        {"type": "json_valid"},
        {"type": "length_check", "min": 50, "max": 500},
        {"type": "semantic_similarity", "threshold": 0.9}  # Embedding-based
    ],
    "fail_if": [
        {"type": "pii_check", "action": "fail"},
        {"type": "profanity_check", "action": "fail"},
        {"type": "length_check", "max": 10, "action": "fail"}  # Too short
    ]
}
```

## Confidence-Based Escalation
Tier 2 uses a small LLM with confidence estimation:
```python
# Internal prompt for the Tier 2 judge
TIER_2_PROMPT = """
Evaluate this response and provide:
1. Score (1-5)
2. Confidence (0-100%)
3. Brief reasoning
If you're uncertain about the score, say so.
"""

# Escalation logic
if tier_2_confidence < 0.7:
    escalate_to_tier_3()
elif tier_2_score in [2, 3, 4]:  # Middle scores often need review
    escalate_to_tier_3()
else:
    use_tier_2_result()
```

## Cost Comparison
Scenario: 10,000 evaluations/month
| Approach | Monthly Cost | Quality |
|---|---|---|
| All GPT-4o | $1,500 | Baseline |
| All Claude Opus | $2,000 | +2% vs baseline |
| All GPT-4o-mini | $150 | -8% vs baseline |
| Tiered | $50 | -1% vs baseline |
### Cost Breakdown (Tiered)

```
Tier 1 (7,000 cases):  $0   (heuristics)
Tier 2 (2,500 cases):  $25  (gpt-4o-mini)
Tier 3 (500 cases):    $25  (claude-opus-4-5)
─────────────────────────────────────────────
Total:                 $50
```

## Quality Validation
Tiered evaluation maintains quality through calibration:
```python
# Run calibration to verify tier accuracy
calibration = client.calibrate_tiers(
    cases=labeled_cases,  # Cases with human labels
    config=tier_config
)

print(calibration.results)
# {
#   "tier_1_accuracy": 0.98,   # Heuristics agree with humans 98%
#   "tier_2_accuracy": 0.94,   # Small LLM agrees 94%
#   "tier_3_accuracy": 0.97,   # Large LLM agrees 97%
#   "overall_accuracy": 0.96,  # Weighted by resolution rate
#   "cost_savings": "97.2%"
# }
```

## Tuning Thresholds
Find optimal escalation thresholds:
```python
# Grid search for the best cost/quality tradeoff
tuning = client.tune_tiers(
    cases=validation_cases,
    target_accuracy=0.95,  # Minimum acceptable accuracy
    optimize_for="cost"    # or "accuracy", "latency"
)

print(tuning.recommended_config)
# {
#   "tier_2_confidence_threshold": 0.72,
#   "tier_2_model": "gpt-4o-mini",
#   "expected_accuracy": 0.954,
#   "expected_cost_per_1000": "$4.20"
# }
```

## Best Practices
### 1. Start Conservative

Begin with a high confidence threshold (so more Tier 2 results escalate to Tier 3), then lower it as calibration confirms Tier 2 quality.
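Since Tier 2 escalates whenever its confidence falls below the threshold, a higher threshold is the cautious setting. An illustrative sketch (the threshold values are ours, not recommendations):

```python
# Higher threshold => more Tier 2 results fall below it => more
# escalation to Tier 3. Start high, then lower it as calibration
# builds confidence in Tier 2.
conservative_config = {"tier_2": {"confidence_threshold": 0.85}}  # initial
tuned_config = {"tier_2": {"confidence_threshold": 0.72}}         # after tuning

def escalates(confidence: float, config: dict) -> bool:
    """True if a Tier 2 result at this confidence would escalate."""
    return confidence < config["tier_2"]["confidence_threshold"]

print(escalates(0.8, conservative_config), escalates(0.8, tuned_config))
# → True False
```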
### 2. Monitor Tier Distribution

If Tier 3 usage spikes, your Tier 2 model may need updating.
```python
# Alert if the tier distribution shifts
client.configure_alerts(
    metric="tier_3_percentage",
    threshold=10,  # Alert if >10% of cases reach Tier 3
    window="1h"
)
```

### 3. Regularly Recalibrate
Model behavior changes over time. Recalibrate monthly.
### 4. Use Domain-Specific Heuristics
The more cases resolved at Tier 1, the more you save.
```python
# Add custom heuristics for your domain
client.register_heuristic(
    name="has_required_sections",
    function=check_sections,
    tier=1
)
```

## Next Steps
- LLM-as-Judge - Configure judge models
- Heuristic Evaluators - Build effective Tier 1
- Traffic Sampling - Apply tiered eval at scale