
Heuristic Evaluators

Fast, rule-based evaluation methods that don't require LLM calls. Use heuristics for clear-cut checks and as the first layer in tiered evaluation.

Why Heuristics?

| Method | Speed | Cost | Best For |
| --- | --- | --- | --- |
| Heuristic | <1ms | Free | Format checks, exact matches |
| LLM Judge | ~2s | $$ | Nuanced quality assessment |

Heuristics are ideal for:

  • Binary pass/fail checks
  • Format validation
  • Constraint verification
  • Pre-filtering before expensive LLM evaluation

Available Heuristics

Text Matching

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Exact match
results = client.evaluate(
    cases=[{"input": "What is 2+2?", "response": "4", "expected": "4"}],
    evaluator="exact_match"
)

# Case-insensitive match
results = client.evaluate(
    cases=[{"response": "PARIS", "expected": "paris"}],
    evaluator="exact_match",
    config={"case_sensitive": False}
)

# Contains check
results = client.evaluate(
    cases=[{"response": "The capital is Paris", "expected": "Paris"}],
    evaluator="contains"
)

# Regex match
results = client.evaluate(
    cases=[{"response": "Order #12345", "pattern": r"Order #\d{5}"}],
    evaluator="regex_match"
)

Format Validation

python
# JSON validity
results = client.evaluate(
    cases=[{"response": '{"name": "test", "value": 123}'}],
    evaluator="json_valid"
)

# JSON schema compliance
results = client.evaluate(
    cases=[{"response": '{"name": "test"}'}],
    evaluator="json_schema",
    config={
        "schema": {
            "type": "object",
            "required": ["name", "value"],
            "properties": {
                "name": {"type": "string"},
                "value": {"type": "number"}
            }
        }
    }
)

# Markdown structure
results = client.evaluate(
    cases=[{"response": "# Title\n\nParagraph..."}],
    evaluator="markdown_valid"
)

Length Constraints

python
# Character count
results = client.evaluate(
    cases=[{"response": "Short answer"}],
    evaluator="length_check",
    config={"min_chars": 10, "max_chars": 100}
)

# Word count
results = client.evaluate(
    cases=[{"response": "This sentence has five words."}],
    evaluator="word_count",
    config={"min_words": 3, "max_words": 10}
)

# Sentence count
results = client.evaluate(
    cases=[{"response": "First. Second. Third."}],
    evaluator="sentence_count",
    config={"exact": 3}
)

Content Checks

python
# PII detection
results = client.evaluate(
    cases=[{"response": "Contact john@email.com for help"}],
    evaluator="pii_check",
    config={"types": ["email", "phone", "ssn"]}
)

# Profanity filter
results = client.evaluate(
    cases=[{"response": "This is a clean response"}],
    evaluator="profanity_check"
)

# Language detection
results = client.evaluate(
    cases=[{"response": "Bonjour, comment allez-vous?"}],
    evaluator="language_check",
    config={"expected": "fr", "allow": ["en", "fr"]}
)

List and Structure Checks

python
# Bullet point count
results = client.evaluate(
    cases=[{"response": "• Item 1\n• Item 2\n• Item 3"}],
    evaluator="bullet_count",
    config={"min": 3, "max": 5}
)

# Numbered list validation
results = client.evaluate(
    cases=[{"response": "1. First\n2. Second\n3. Third"}],
    evaluator="numbered_list",
    config={"sequential": True}
)

# Section headers
results = client.evaluate(
    cases=[{"response": "## Overview\n...\n## Details\n..."}],
    evaluator="has_sections",
    config={"required": ["Overview", "Details"]}
)

Combining Heuristics

All Must Pass

python
results = client.evaluate(
    cases=cases,
    evaluator="composite",
    config={
        "mode": "all",  # All must pass
        "checks": [
            {"type": "json_valid"},
            {"type": "length_check", "min_chars": 100},
            {"type": "contains", "substring": "conclusion"}
        ]
    }
)

Any Must Pass

python
results = client.evaluate(
    cases=cases,
    evaluator="composite",
    config={
        "mode": "any",  # At least one must pass
        "checks": [
            {"type": "exact_match", "expected": "N/A"},
            {"type": "length_check", "min_chars": 50}
        ]
    }
)

Weighted Scoring

python
results = client.evaluate(
    cases=cases,
    evaluator="composite",
    config={
        "mode": "weighted",
        "checks": [
            {"type": "json_valid", "weight": 0.3},
            {"type": "length_check", "min_chars": 100, "weight": 0.3},
            {"type": "contains", "substring": "summary", "weight": 0.4}
        ]
    }
)
# Score = sum of (passed * weight) * 100
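
The weighted score can be reproduced by hand. A minimal standalone sketch of the arithmetic (no API call; the check names and weights mirror the config above):

```python
# Weighted composite scoring: score = sum of (passed * weight) * 100.
# Here json_valid and contains pass while length_check fails, so the
# passing weight is 0.3 + 0.4 = 0.7.
checks = [
    {"type": "json_valid", "weight": 0.3, "passed": True},
    {"type": "length_check", "weight": 0.3, "passed": False},
    {"type": "contains", "weight": 0.4, "passed": True},
]

score = round(sum(c["weight"] for c in checks if c["passed"]) * 100, 1)
print(score)  # 70.0
```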

Custom Heuristics

Define your own heuristic functions:

python
import re

def check_citation_format(response: str) -> bool:
    """Check if citations follow [Author, Year] format"""
    citations = re.findall(r'\[([^\]]+)\]', response)
    pattern = r'^[A-Z][a-z]+, \d{4}$'
    return all(re.match(pattern, c) for c in citations)

# Register custom heuristic
client.register_heuristic(
    name="citation_format",
    function=check_citation_format,
    description="Validates [Author, Year] citation format"
)

# Use it
results = client.evaluate(
    cases=cases,
    evaluator="citation_format"
)
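
Because custom heuristics are plain Python functions, you can sanity-check them locally before registering. A standalone sketch (the function is redefined here so the snippet runs on its own):

```python
import re

def check_citation_format(response: str) -> bool:
    """Check if citations follow [Author, Year] format"""
    citations = re.findall(r'\[([^\]]+)\]', response)
    pattern = r'^[A-Z][a-z]+, \d{4}$'
    return all(re.match(pattern, c) for c in citations)

# Well-formed citation passes; lowercase author with no comma fails.
print(check_citation_format("As shown in [Smith, 2021]."))  # True
print(check_citation_format("As shown in [smith 2021]."))   # False
```

Note that a response with no bracketed citations passes trivially, since `all()` over an empty sequence is `True`; pair this with a `regex_match` check if at least one citation is required.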

Performance Benchmarks

| Heuristic | Throughput | Memory |
| --- | --- | --- |
| exact_match | 100k/sec | <1MB |
| contains | 80k/sec | <1MB |
| regex_match | 50k/sec | <1MB |
| json_valid | 30k/sec | <1MB |
| json_schema | 20k/sec | <5MB |
| pii_check | 10k/sec | <10MB |

Best Practices

  1. Use heuristics first - Check format before semantic quality
  2. Fail fast - If heuristics fail, skip expensive LLM evals
  3. Be specific - Narrow checks are more reliable than broad ones
  4. Combine strategically - Use composite evaluators for complex requirements
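
The first two practices amount to a tiered pipeline: cheap checks gate which cases ever reach an LLM judge. A minimal sketch of that triage step, using stand-in check functions rather than the gateflow API:

```python
# Fail-fast tiering sketch: run free, sub-millisecond checks first and
# only forward the survivors to an expensive LLM judge. The check logic
# below is illustrative, not part of the gateflow SDK.
import json

def cheap_checks(response: str) -> bool:
    """Layer 1: format gates (JSON validity, minimum length)."""
    try:
        json.loads(response)
    except ValueError:
        return False
    return len(response) >= 20

def triage(cases):
    """Split cases into heuristic failures and LLM-judge candidates."""
    failed, to_judge = [], []
    for case in cases:
        (to_judge if cheap_checks(case["response"]) else failed).append(case)
    return failed, to_judge

cases = [
    {"response": '{"answer": "Paris", "confidence": 0.9}'},
    {"response": "not json at all"},
]
failed, to_judge = triage(cases)
print(len(failed), len(to_judge))  # 1 1
```

Only the cases in `to_judge` would then be passed to an LLM-based evaluator, so heuristic failures never incur judge latency or cost.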
