
Heuristic Evaluators

Fast, rule-based evaluation methods that don't require LLM calls. Use heuristics for clear-cut checks and as the first layer in tiered evaluation.

Why Heuristics?

| Method | Speed | Cost | Best For |
| --- | --- | --- | --- |
| Heuristic | <1ms | Free | Format checks, exact matches |
| LLM Judge | ~2s | $$ | Nuanced quality assessment |

Heuristics are ideal for:

  • Binary pass/fail checks
  • Format validation
  • Constraint verification
  • Pre-filtering before expensive LLM evaluation

Available Heuristics

Text Matching

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Exact match
results = client.evaluate(
    cases=[{"input": "What is 2+2?", "response": "4", "expected": "4"}],
    evaluator="exact_match"
)

# Case-insensitive match
results = client.evaluate(
    cases=[{"response": "PARIS", "expected": "paris"}],
    evaluator="exact_match",
    config={"case_sensitive": False}
)

# Contains check
results = client.evaluate(
    cases=[{"response": "The capital is Paris", "expected": "Paris"}],
    evaluator="contains"
)

# Regex match
results = client.evaluate(
    cases=[{"response": "Order #12345", "pattern": r"Order #\d{5}"}],
    evaluator="regex_match"
)

Format Validation

python
# JSON validity
results = client.evaluate(
    cases=[{"response": '{"name": "test", "value": 123}'}],
    evaluator="json_valid"
)

# JSON schema compliance
results = client.evaluate(
    cases=[{"response": '{"name": "test"}'}],
    evaluator="json_schema",
    config={
        "schema": {
            "type": "object",
            "required": ["name", "value"],
            "properties": {
                "name": {"type": "string"},
                "value": {"type": "number"}
            }
        }
    }
)

# Markdown structure
results = client.evaluate(
    cases=[{"response": "# Title\n\nParagraph..."}],
    evaluator="markdown_valid"
)

Length Constraints

python
# Character count
results = client.evaluate(
    cases=[{"response": "Short answer"}],
    evaluator="length_check",
    config={"min_chars": 10, "max_chars": 100}
)

# Word count
results = client.evaluate(
    cases=[{"response": "This sentence has five words."}],
    evaluator="word_count",
    config={"min_words": 3, "max_words": 10}
)

# Sentence count
results = client.evaluate(
    cases=[{"response": "First. Second. Third."}],
    evaluator="sentence_count",
    config={"exact": 3}
)

Content Checks

python
# PII detection
results = client.evaluate(
    cases=[{"response": "Contact john@email.com for help"}],
    evaluator="pii_check",
    config={"types": ["email", "phone", "ssn"]}
)

# Profanity filter
results = client.evaluate(
    cases=[{"response": "This is a clean response"}],
    evaluator="profanity_check"
)

# Language detection
results = client.evaluate(
    cases=[{"response": "Bonjour, comment allez-vous?"}],
    evaluator="language_check",
    config={"expected": "fr", "allow": ["en", "fr"]}
)

List and Structure Checks

python
# Bullet point count
results = client.evaluate(
    cases=[{"response": "• Item 1\n• Item 2\n• Item 3"}],
    evaluator="bullet_count",
    config={"min": 3, "max": 5}
)

# Numbered list validation
results = client.evaluate(
    cases=[{"response": "1. First\n2. Second\n3. Third"}],
    evaluator="numbered_list",
    config={"sequential": True}
)

# Section headers
results = client.evaluate(
    cases=[{"response": "## Overview\n...\n## Details\n..."}],
    evaluator="has_sections",
    config={"required": ["Overview", "Details"]}
)

Combining Heuristics

All Must Pass

python
results = client.evaluate(
    cases=cases,
    evaluator="composite",
    config={
        "mode": "all",  # All must pass
        "checks": [
            {"type": "json_valid"},
            {"type": "length_check", "min_chars": 100},
            {"type": "contains", "substring": "conclusion"}
        ]
    }
)

Any Must Pass

python
results = client.evaluate(
    cases=cases,
    evaluator="composite",
    config={
        "mode": "any",  # At least one must pass
        "checks": [
            {"type": "exact_match", "expected": "N/A"},
            {"type": "length_check", "min_chars": 50}
        ]
    }
)

Weighted Scoring

python
results = client.evaluate(
    cases=cases,
    evaluator="composite",
    config={
        "mode": "weighted",
        "checks": [
            {"type": "json_valid", "weight": 0.3},
            {"type": "length_check", "min_chars": 100, "weight": 0.3},
            {"type": "contains", "substring": "summary", "weight": 0.4}
        ]
    }
)
# Score = sum of (passed * weight) * 100
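
The weighted score can be reproduced by hand. A minimal standalone sketch of the arithmetic (no API call; the check names and weights mirror the config above):

```python
# Weighted composite scoring: score = sum of (passed * weight) * 100.
# Here json_valid and contains pass while length_check fails, so the
# passing weight is 0.3 + 0.4 = 0.7.
checks = [
    {"type": "json_valid", "weight": 0.3, "passed": True},
    {"type": "length_check", "weight": 0.3, "passed": False},
    {"type": "contains", "weight": 0.4, "passed": True},
]

score = round(sum(c["weight"] for c in checks if c["passed"]) * 100, 1)
print(score)  # 70.0
```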

Custom Heuristics

Define your own heuristic functions:

python
import re

def check_citation_format(response: str) -> bool:
    """Check if citations follow [Author, Year] format"""
    citations = re.findall(r'\[([^\]]+)\]', response)
    pattern = r'^[A-Z][a-z]+, \d{4}$'
    return all(re.match(pattern, c) for c in citations)

# Register custom heuristic
client.register_heuristic(
    name="citation_format",
    function=check_citation_format,
    description="Validates [Author, Year] citation format"
)

# Use it
results = client.evaluate(
    cases=cases,
    evaluator="citation_format"
)
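
Because custom heuristics are plain Python functions, you can sanity-check them locally before registering. A standalone sketch (the function is redefined here so the snippet runs on its own):

```python
import re

def check_citation_format(response: str) -> bool:
    """Check if citations follow [Author, Year] format"""
    citations = re.findall(r'\[([^\]]+)\]', response)
    pattern = r'^[A-Z][a-z]+, \d{4}$'
    return all(re.match(pattern, c) for c in citations)

# Well-formed citation passes; lowercase author with no comma fails.
print(check_citation_format("As shown in [Smith, 2021]."))  # True
print(check_citation_format("As shown in [smith 2021]."))   # False
```

Note that a response with no bracketed citations passes trivially, since `all()` over an empty sequence is `True`; pair this with a `regex_match` check if at least one citation is required.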

Performance Benchmarks

| Heuristic | Throughput | Memory |
| --- | --- | --- |
| exact_match | 100k/sec | <1MB |
| contains | 80k/sec | <1MB |
| regex_match | 50k/sec | <1MB |
| json_valid | 30k/sec | <1MB |
| json_schema | 20k/sec | <5MB |
| pii_check | 10k/sec | <10MB |

Best Practices

  1. Use heuristics first - Check format before semantic quality
  2. Fail fast - If heuristics fail, skip expensive LLM evals
  3. Be specific - Narrow checks are more reliable than broad ones
  4. Combine strategically - Use composite evaluators for complex requirements
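
The first two practices amount to a tiered pipeline: cheap checks gate which cases ever reach an LLM judge. A minimal sketch of that triage step, using stand-in check functions rather than the gateflow API:

```python
# Fail-fast tiering sketch: run free, sub-millisecond checks first and
# only forward the survivors to an expensive LLM judge. The check logic
# below is illustrative, not part of the gateflow SDK.
import json

def cheap_checks(response: str) -> bool:
    """Layer 1: format gates (JSON validity, minimum length)."""
    try:
        json.loads(response)
    except ValueError:
        return False
    return len(response) >= 20

def triage(cases):
    """Split cases into heuristic failures and LLM-judge candidates."""
    failed, to_judge = [], []
    for case in cases:
        (to_judge if cheap_checks(case["response"]) else failed).append(case)
    return failed, to_judge

cases = [
    {"response": '{"answer": "Paris", "confidence": 0.9}'},
    {"response": "not json at all"},
]
failed, to_judge = triage(cases)
print(len(failed), len(to_judge))  # 1 1
```

Only the cases in `to_judge` would then be passed to an LLM-based evaluator, so heuristic failures never incur judge latency or cost.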
