
Core Concepts

Understanding the fundamental building blocks of GateFlow Eval.

Eval Suites

A suite is a collection of related evaluation cases with shared configuration. Suites can be:

  • Curated - Pre-built by GateFlow, covering common use cases
  • Custom - Created by you for your specific application
python
# Using a curated suite
results = client.run_suite(suite="safety-core", model="gpt-4o")

# Creating a custom suite
client.create_suite(
    name="my-product-evals",
    description="Product-specific quality checks",
    cases=[...],
    evaluators=["llm_judge", "exact_match"]
)

Eval Cases

A case is a single test instance with:

  • Input - The prompt or context to evaluate
  • Expected output - What a correct response should contain (optional)
  • Metadata - Tags, categories, difficulty levels
python
case = {
    "id": "case_001",
    "input": "What are the side effects of aspirin?",
    "expected": "mentions headache, stomach issues, bleeding risk",
    "tags": ["medical", "safety-critical"],
    "difficulty": "medium"
}

Evaluators

Evaluators score model outputs. GateFlow supports three types:

Heuristic Evaluators

Fast, rule-based checks:

  • exact_match - Output matches expected exactly
  • contains - Output contains expected substring
  • regex - Output matches pattern
  • json_valid - Output is valid JSON
  • length_check - Output within length bounds
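
The heuristic checks above can be sketched as plain functions. These implementations are illustrative only; in GateFlow you select evaluators by name rather than writing them yourself.

```python
import json
import re

# Illustrative versions of the built-in heuristic checks.
# GateFlow's actual evaluators are selected by name in suite config.

def exact_match(output: str, expected: str) -> bool:
    # Whitespace-trimmed equality.
    return output.strip() == expected.strip()

def contains(output: str, expected: str) -> bool:
    return expected in output

def regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def length_check(output: str, min_len: int = 0, max_len: int = 10_000) -> bool:
    return min_len <= len(output) <= max_len
```

Because they run locally with no model calls, heuristic evaluators are effectively free and are a good first gate before the slower semantic and LLM-based checks.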

Semantic Evaluators

Embedding-based similarity:

  • semantic_similarity - Cosine similarity to expected output
  • entailment - Logical entailment check
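
The metric behind semantic_similarity is cosine similarity between embedding vectors. A minimal sketch, assuming both strings have already been embedded (a real evaluator would call an embedding model first):

```python
import math

# Cosine similarity between two embedding vectors, the core of
# semantic_similarity. Embedding the strings is assumed done upstream.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A score near 1.0 means the output and expected text point in nearly the same direction in embedding space, even if their wording differs.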

LLM-as-Judge Evaluators

Model-based assessment:

  • llm_judge - General purpose LLM evaluation
  • rubric_judge - Score against specific criteria
  • pairwise_judge - Compare two outputs
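
To make "model-based assessment" concrete, here is a purely illustrative sketch of how an llm_judge-style evaluator might assemble its grading prompt. The prompt format and function name are assumptions for illustration, not GateFlow's actual internals.

```python
# Purely illustrative: the shape of a grading prompt an LLM-as-judge
# evaluator might send. GateFlow's real prompt is not documented here.

def build_judge_prompt(question: str, answer: str, criteria: list[str]) -> str:
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are an impartial grader. Score the answer from 1 to 5.\n"
        f"Criteria:\n{bullet_list}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )
```

The judge model's integer reply would then be parsed and normalized into the 0-100 case score described under Aggregate Scores.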

Rubrics

A rubric defines scoring criteria for LLM judges:

python
rubric = {
    "criteria": [
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Response is factually correct"
        },
        {
            "name": "completeness",
            "weight": 0.3,
            "description": "Response addresses all parts of the question"
        },
        {
            "name": "clarity",
            "weight": 0.3,
            "description": "Response is clear and well-organized"
        }
    ],
    "scale": [1, 2, 3, 4, 5]
}
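
One way to see how the weights and scale interact: combine per-criterion judge scores as a weighted average, then normalize onto 0-100. This is a sketch of a plausible aggregation; the exact formula GateFlow applies is not specified here.

```python
# Illustrative aggregation of per-criterion judge scores into one
# rubric score. Assumes scores on the rubric's 1-5 scale, weights
# summing to 1.0, and normalization to 0-100 (an assumption).

def rubric_score(scores: dict[str, int], rubric: dict) -> float:
    lo, hi = rubric["scale"][0], rubric["scale"][-1]
    weighted = sum(c["weight"] * scores[c["name"]] for c in rubric["criteria"])
    return 100 * (weighted - lo) / (hi - lo)
```

With the rubric above, scores of 5 for accuracy, 4 for completeness, and 3 for clarity give 0.4·5 + 0.3·4 + 0.3·3 = 4.1 on the 1-5 scale, which normalizes to 77.5.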

Auto-Generated Rubrics

GateFlow can generate rubrics from examples:

python
rubric = client.generate_rubric(
    examples=[
        {"input": "...", "good_output": "...", "bad_output": "..."},
        # ... 30 examples recommended
    ]
)
# Generated in ~60 seconds

Eval Runs

A run is a single execution of a suite against a model:

python
run = client.run_suite(suite="safety-core", model="gpt-4o")

print(run.id)           # "run_abc123"
print(run.status)       # "completed"
print(run.aggregate_score)  # 94.5
print(run.passed)       # 189
print(run.failed)       # 11
print(run.duration_ms)  # 4521

Aggregate Scores

Scores are computed hierarchically:

  1. Case score - Individual evaluator result (0-100)
  2. Suite score - Weighted average of case scores
  3. Model score - Rolling average across recent runs

Model Score (gpt-4o): 92.3%
├── safety-core: 100%
├── quality-general: 89.1%
└── rag-faithfulness: 87.8%

Tags and Filtering

Organize cases with tags for filtered analysis:

python
# Run only high-priority cases
results = client.run_suite(
    suite="my-suite",
    model="gpt-4o",
    filters={"tags": ["critical", "safety"]}
)

# Analyze results by tag
breakdown = client.get_breakdown(
    run_id="run_abc123",
    group_by="tags"
)
