Tiered Cost Optimization

Achieve 97% cost reduction compared to GPT-4-class judges by using intelligent tiered evaluation.

The Problem

LLM-as-Judge evaluation is powerful but expensive:

Approach              Cost per 1,000 evals   Quality
GPT-4o for all        ~$150                  Excellent
Claude Opus for all   ~$200                  Excellent
Tiered approach       ~$5                    Excellent

How Tiered Evaluation Works

                    ┌─────────────────────┐
                    │   Input + Output    │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
                    │ Tier 1: Heuristics  │  ← Free, <1ms
                    │   (70% resolved)    │
                    └──────────┬──────────┘
                               │ Ambiguous
                    ┌──────────▼──────────┐
                    │ Tier 2: Small LLM   │  ← Cheap, ~200ms
                    │   (25% resolved)    │
                    └──────────┬──────────┘
                               │ Still unclear
                    ┌──────────▼──────────┐
                    │ Tier 3: Large LLM   │  ← Expensive, ~2s
                    │   (5% resolved)     │
                    └─────────────────────┘
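
The flow above can be sketched as a plain dispatch loop. This is a minimal illustration, not the gateflow internals: `tier_1`, `evaluate`, and the stub judges are hypothetical names, and the heuristics shown are deliberately simple.

```python
# Sketch of the three-tier dispatch: each tier either returns a verdict
# or defers the case to the next, more expensive tier.

def tier_1(case):
    """Free heuristic check; returns 'pass'/'fail', or None to escalate."""
    text = case["output"]
    if len(text) < 10:
        return "fail"           # clearly too short to be a real answer
    if case.get("expected") == text:
        return "pass"           # exact match with the expected output
    return None                 # ambiguous: hand off to tier 2

def evaluate(case, small_llm_judge, large_llm_judge, confidence_threshold=0.7):
    verdict = tier_1(case)
    if verdict is not None:
        return verdict, 1                        # resolved at tier 1, $0
    verdict, confidence = small_llm_judge(case)  # tier 2: cheap LLM judge
    if confidence >= confidence_threshold:
        return verdict, 2
    return large_llm_judge(case), 3              # tier 3: expensive LLM judge

# Stub judges standing in for real LLM calls:
small = lambda case: ("pass", 0.9)
large = lambda case: "pass"
print(evaluate({"output": "short", "expected": None}, small, large))
# -> ('fail', 1): fails the length heuristic without any LLM call
```
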

Configuring Tiered Evaluation

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.evaluate(
    model="gpt-4o",
    cases=cases,
    evaluator="tiered",
    config={
        # Tier 1: Heuristics (free)
        "tier_1": {
            "checks": [
                "json_valid",
                "length_check",
                "contains_required",
                "pii_check"
            ],
            "fail_fast": True  # If heuristics fail, don't escalate
        },

        # Tier 2: Small/fast LLM (cheap)
        "tier_2": {
            "model": "gpt-4o-mini",  # or "claude-haiku"
            "confidence_threshold": 0.7,  # Escalate if confidence < 70%
            "temperature": 0.0
        },

        # Tier 3: Large LLM (expensive, only for ambiguous cases)
        "tier_3": {
            "model": "claude-opus-4-5",
            "temperature": 0.0
        }
    }
)

Tier Resolution Statistics

After running tiered evaluation, inspect where each case was resolved:

python
print(results.tier_stats)
# {
#   "tier_1_resolved": 712,    # 71.2%
#   "tier_2_resolved": 243,    # 24.3%
#   "tier_3_resolved": 45,     # 4.5%
#   "total_cases": 1000,
#   "estimated_cost": "$4.82",
#   "full_llm_cost": "$152.00",
#   "savings": "96.8%"
# }

Heuristic Pre-Filtering

The first tier catches obvious cases:

Pass-Through Heuristics

Cases that clearly pass without LLM evaluation:

  • Response matches expected format exactly
  • All required elements present
  • No policy violations detected

Fail-Fast Heuristics

Cases that clearly fail:

  • Invalid JSON when JSON required
  • Contains PII when prohibited
  • Exceeds length limits
  • Contains profanity/blocked content

python
# Example: 70% of cases resolved at Tier 1
tier_1_config = {
    "pass_if": [
        {"type": "json_valid"},
        {"type": "length_check", "min": 50, "max": 500},
        {"type": "semantic_similarity", "threshold": 0.9}  # Embedding-based
    ],
    "fail_if": [
        {"type": "pii_check", "action": "fail"},
        {"type": "profanity_check", "action": "fail"},
        {"type": "length_check", "max": 10, "action": "fail"}  # Too short
    ]
}
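
To make the config above concrete, here is a runnable sketch of the fail-fast/pass-through logic using simple stand-ins for the built-in checks; `tier_1_verdict` and the naive email-based PII detector are illustrative, not gateflow's actual implementations.

```python
import json
import re

def json_valid(text):
    """Stand-in for the json_valid check."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def length_check(text, min_len=0, max_len=float("inf")):
    """Stand-in for the length_check heuristic."""
    return min_len <= len(text) <= max_len

PII_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # naive email detector

def tier_1_verdict(text):
    """Return 'fail', 'pass', or None (escalate), mirroring the config above."""
    # Fail-fast rules run first: a hit short-circuits without escalation
    if PII_PATTERN.search(text):
        return "fail"
    if len(text) <= 10:                       # too short to be a real answer
        return "fail"
    # Pass-through rules: all must hold to resolve as a pass
    if json_valid(text) and length_check(text, 50, 500):
        return "pass"
    return None                               # ambiguous: escalate to tier 2
```
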

Confidence-Based Escalation

Tier 2 uses a small LLM with confidence estimation:

python
# Internal prompt for Tier 2 judge
"""
Evaluate this response and provide:
1. Score (1-5)
2. Confidence (0-100%)
3. Brief reasoning

If you're uncertain about the score, say so.
"""

# Escalation logic
if tier_2_confidence < 0.7:
    escalate_to_tier_3()
elif tier_2_score in [2, 3, 4]:  # Middle scores often need review
    escalate_to_tier_3()
else:
    use_tier_2_result()
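
The escalation rules above can be written as a pure function (a sketch; `should_escalate` is a hypothetical name, and the score-based rule reflects the observation that middle-of-the-scale verdicts benefit most from a second opinion):

```python
def should_escalate(score, confidence, threshold=0.7):
    """True if a tier-2 verdict should be re-judged by tier 3."""
    if confidence < threshold:
        return True             # the judge itself is unsure
    return score in (2, 3, 4)   # borderline scores get reviewed regardless

print(should_escalate(5, 0.9))  # False: confident, extreme score
print(should_escalate(3, 0.9))  # True: middle score, always reviewed
```
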

Cost Comparison

Scenario: 10,000 evaluations/month

Approach           Monthly Cost   Quality
All GPT-4o         $1,500         Baseline
All Claude Opus    $2,000         +2% vs baseline
All GPT-4o-mini    $150           -8% vs baseline
Tiered             $50            -1% vs baseline

Cost Breakdown (Tiered)

Tier 1 (7,000 cases): $0 (heuristics)
Tier 2 (2,500 cases): $25 (gpt-4o-mini)
Tier 3 (500 cases):   $25 (claude-opus-4-5)
────────────────────────────────────
Total:                $50
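
The breakdown reduces to simple per-tier arithmetic. This sketch assumes ~$0.01/case for the tier-2 judge and ~$0.05/case for the tier-3 judge, which are the implied (illustrative) prices, not published rates:

```python
def tiered_cost(counts, per_case_cost):
    """Total cost: cases resolved at each tier times that tier's unit cost."""
    return sum(counts[tier] * per_case_cost[tier] for tier in counts)

counts = {"tier_1": 7000, "tier_2": 2500, "tier_3": 500}
cost   = {"tier_1": 0.0,  "tier_2": 0.01, "tier_3": 0.05}
print(f"${tiered_cost(counts, cost):.0f}")  # -> $50
```
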

Quality Validation

Tiered evaluation maintains quality through calibration:

python
# Run calibration to verify tier accuracy
calibration = client.calibrate_tiers(
    cases=labeled_cases,  # Cases with human labels
    config=tier_config
)

print(calibration.results)
# {
#   "tier_1_accuracy": 0.98,  # Heuristics agree with humans 98%
#   "tier_2_accuracy": 0.94,  # Small LLM agrees 94%
#   "tier_3_accuracy": 0.97,  # Large LLM agrees 97%
#   "overall_accuracy": 0.96, # Weighted by resolution rate
#   "cost_savings": "97.2%"
# }
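
The overall figure is a resolution-rate-weighted average of the per-tier accuracies. A sketch of that computation, using the 70/25/5 split from earlier as assumed weights:

```python
def overall_accuracy(resolution_rates, tier_accuracies):
    """Weighted average: each tier's accuracy times the share of cases it resolves."""
    return sum(r * a for r, a in zip(resolution_rates, tier_accuracies))

rates = [0.70, 0.25, 0.05]   # tier 1 / 2 / 3 resolution shares
accs  = [0.98, 0.94, 0.97]   # per-tier agreement with human labels
print(overall_accuracy(rates, accs))  # roughly 0.97
```
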

Tuning Thresholds

Find optimal escalation thresholds:

python
# Grid search for best cost/quality tradeoff
tuning = client.tune_tiers(
    cases=validation_cases,
    target_accuracy=0.95,  # Minimum acceptable accuracy
    optimize_for="cost"    # or "accuracy", "latency"
)

print(tuning.recommended_config)
# {
#   "tier_2_confidence_threshold": 0.72,
#   "tier_2_model": "gpt-4o-mini",
#   "expected_accuracy": 0.954,
#   "expected_cost_per_1000": "$4.20"
# }
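
A toy version of what such a threshold search does: sweep tier-2 confidence cutoffs over labeled cases, estimate accuracy and cost for each, and keep the cheapest configuration that meets the accuracy target. Everything here is an assumption for illustration: cases are `(tier2_confidence, tier2_correct, tier3_correct)` tuples and the per-case prices are made up.

```python
def tune_threshold(cases, target_accuracy, tier2_cost=0.01, tier3_cost=0.05):
    """Grid-search the tier-2 confidence cutoff; return (threshold, cost, accuracy)."""
    best = None
    for threshold in [x / 20 for x in range(21)]:  # cutoffs 0.00 .. 1.00
        correct = cost = 0
        for confidence, t2_ok, t3_ok in cases:
            cost += tier2_cost                     # tier 2 always runs here
            if confidence >= threshold:
                correct += t2_ok                   # accept the tier-2 verdict
            else:
                cost += tier3_cost                 # escalate to tier 3
                correct += t3_ok
        accuracy = correct / len(cases)
        if accuracy >= target_accuracy and (best is None or cost < best[1]):
            best = (threshold, cost, accuracy)
    return best

# Confident tier-2 calls are right; unconfident ones are wrong until escalated
cases = [(0.9, 1, 1)] * 8 + [(0.3, 0, 1)] * 2
print(tune_threshold(cases, target_accuracy=0.95))
```
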

Best Practices

1. Start Conservative

Begin with lower confidence thresholds (more escalation), then tune up.

2. Monitor Tier Distribution

If Tier 3 usage spikes, your Tier 2 model may need updating.

python
# Alert if tier distribution shifts
client.configure_alerts(
    metric="tier_3_percentage",
    threshold=10,  # Alert if >10% goes to Tier 3
    window="1h"
)


3. Regularly Recalibrate

Model behavior changes over time. Recalibrate monthly.

4. Use Domain-Specific Heuristics

The more cases resolved at Tier 1, the more you save.

python
# Add custom heuristics for your domain
client.register_heuristic(
    name="has_required_sections",
    function=check_sections,
    tier=1
)
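
Underneath, a heuristic registry can be as simple as named check functions keyed by tier. This is a sketch of that idea, not gateflow's internals; `register_heuristic`, `run_tier`, and the section check are stand-ins:

```python
HEURISTICS = {}

def register_heuristic(name, function, tier):
    """Attach a named check function to a tier."""
    HEURISTICS.setdefault(tier, {})[name] = function

def run_tier(tier, text):
    """Run every check registered for a tier; False means the check flagged it."""
    return {name: fn(text) for name, fn in HEURISTICS.get(tier, {}).items()}

# A domain-specific check: require "Summary" and "Details" sections
def check_sections(text):
    return "Summary" in text and "Details" in text

register_heuristic("has_required_sections", check_sections, tier=1)
print(run_tier(1, "Summary: ok\nDetails: ..."))
# -> {'has_required_sections': True}
```
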
