
Core Concepts

Understanding the fundamental building blocks of GateFlow Eval.

Eval Suites

A suite is a collection of related evaluation cases with shared configuration. Suites can be:

  • Curated - Pre-built by GateFlow, covering common use cases
  • Custom - Created by you for your specific application
python
# Using a curated suite
results = client.run_suite(suite="safety-core", model="gpt-4o")

# Creating a custom suite
client.create_suite(
    name="my-product-evals",
    description="Product-specific quality checks",
    cases=[...],
    evaluators=["llm_judge", "exact_match"]
)

Eval Cases

A case is a single test instance with:

  • Input - The prompt or context to evaluate
  • Expected output - What a correct response should contain (optional)
  • Metadata - Tags, categories, difficulty levels
python
case = {
    "id": "case_001",
    "input": "What are the side effects of aspirin?",
    "expected": "mentions headache, stomach issues, bleeding risk",
    "tags": ["medical", "safety-critical"],
    "difficulty": "medium"
}

Evaluators

Evaluators score model outputs. GateFlow supports three types:

Heuristic Evaluators

Fast, rule-based checks:

  • exact_match - Output matches expected exactly
  • contains - Output contains expected substring
  • regex - Output matches pattern
  • json_valid - Output is valid JSON
  • length_check - Output within length bounds
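
The heuristic checks above can be sketched as plain functions. These implementations are illustrative only; in GateFlow you select evaluators by name rather than writing them yourself.

```python
import json
import re

# Illustrative versions of the built-in heuristic checks.
# GateFlow's actual evaluators are selected by name in suite config.

def exact_match(output: str, expected: str) -> bool:
    # Whitespace-trimmed equality.
    return output.strip() == expected.strip()

def contains(output: str, expected: str) -> bool:
    return expected in output

def regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def length_check(output: str, min_len: int = 0, max_len: int = 10_000) -> bool:
    return min_len <= len(output) <= max_len
```

Because they run locally with no model calls, heuristic evaluators are effectively free and are a good first gate before the slower semantic and LLM-based checks.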

Semantic Evaluators

Embedding-based similarity:

  • semantic_similarity - Cosine similarity to expected output
  • entailment - Logical entailment check
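
The metric behind semantic_similarity is cosine similarity between embedding vectors. A minimal sketch, assuming both strings have already been embedded (a real evaluator would call an embedding model first):

```python
import math

# Cosine similarity between two embedding vectors, the core of
# semantic_similarity. Embedding the strings is assumed done upstream.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A score near 1.0 means the output and expected text point in nearly the same direction in embedding space, even if their wording differs.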

LLM-as-Judge Evaluators

Model-based assessment:

  • llm_judge - General purpose LLM evaluation
  • rubric_judge - Score against specific criteria
  • pairwise_judge - Compare two outputs
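
To make "model-based assessment" concrete, here is a purely illustrative sketch of how an llm_judge-style evaluator might assemble its grading prompt. The prompt format and function name are assumptions for illustration, not GateFlow's actual internals.

```python
# Purely illustrative: the shape of a grading prompt an LLM-as-judge
# evaluator might send. GateFlow's real prompt is not documented here.

def build_judge_prompt(question: str, answer: str, criteria: list[str]) -> str:
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are an impartial grader. Score the answer from 1 to 5.\n"
        f"Criteria:\n{bullet_list}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )
```

The judge model's integer reply would then be parsed and normalized into the 0-100 case score described under Aggregate Scores.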

Rubrics

A rubric defines scoring criteria for LLM judges:

python
rubric = {
    "criteria": [
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Response is factually correct"
        },
        {
            "name": "completeness",
            "weight": 0.3,
            "description": "Response addresses all parts of the question"
        },
        {
            "name": "clarity",
            "weight": 0.3,
            "description": "Response is clear and well-organized"
        }
    ],
    "scale": [1, 2, 3, 4, 5]
}
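
One way to see how the weights and scale interact: combine per-criterion judge scores as a weighted average, then normalize onto 0-100. This is a sketch of a plausible aggregation; the exact formula GateFlow applies is not specified here.

```python
# Illustrative aggregation of per-criterion judge scores into one
# rubric score. Assumes scores on the rubric's 1-5 scale, weights
# summing to 1.0, and normalization to 0-100 (an assumption).

def rubric_score(scores: dict[str, int], rubric: dict) -> float:
    lo, hi = rubric["scale"][0], rubric["scale"][-1]
    weighted = sum(c["weight"] * scores[c["name"]] for c in rubric["criteria"])
    return 100 * (weighted - lo) / (hi - lo)
```

With the rubric above, scores of 5 for accuracy, 4 for completeness, and 3 for clarity give 0.4·5 + 0.3·4 + 0.3·3 = 4.1 on the 1-5 scale, which normalizes to 77.5.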

Auto-Generated Rubrics

GateFlow can generate rubrics from examples:

python
rubric = client.generate_rubric(
    examples=[
        {"input": "...", "good_output": "...", "bad_output": "..."},
        # ... 30 examples recommended
    ]
)
# Generated in ~60 seconds

Eval Runs

A run is a single execution of a suite against a model:

python
run = client.run_suite(suite="safety-core", model="gpt-4o")

print(run.id)           # "run_abc123"
print(run.status)       # "completed"
print(run.aggregate_score)  # 94.5
print(run.passed)       # 189
print(run.failed)       # 11
print(run.duration_ms)  # 4521

Aggregate Scores

Scores are computed hierarchically:

  1. Case score - Individual evaluator result (0-100)
  2. Suite score - Weighted average of case scores
  3. Model score - Rolling average across recent runs

Model Score (gpt-4o): 92.3%
├── safety-core: 100%
├── quality-general: 89.1%
└── rag-faithfulness: 87.8%

Tags and Filtering

Organize cases with tags for filtered analysis:

python
# Run only high-priority cases
results = client.run_suite(
    suite="my-suite",
    model="gpt-4o",
    filters={"tags": ["critical", "safety"]}
)

# Analyze results by tag
breakdown = client.get_breakdown(
    run_id="run_abc123",
    group_by="tags"
)
