Core Concepts
Understanding the fundamental building blocks of GateFlow Eval.
Eval Suites
A suite is a collection of related evaluation cases with shared configuration. Suites can be:
- Curated - Pre-built by GateFlow, covering common use cases
- Custom - Created by you for your specific application
```python
# Using a curated suite
results = client.run_suite(suite="safety-core", model="gpt-4o")

# Creating a custom suite
client.create_suite(
    name="my-product-evals",
    description="Product-specific quality checks",
    cases=[...],
    evaluators=["llm_judge", "exact_match"]
)
```

Eval Cases
A case is a single test instance with:
- Input - The prompt or context to evaluate
- Expected output - What a correct response should contain (optional)
- Metadata - Tags, categories, difficulty levels
```python
case = {
    "id": "case_001",
    "input": "What are the side effects of aspirin?",
    "expected": "mentions headache, stomach issues, bleeding risk",
    "tags": ["medical", "safety-critical"],
    "difficulty": "medium"
}
```

Evaluators
Evaluators score model outputs. GateFlow supports three types:
Heuristic Evaluators
Fast, rule-based checks:
- `exact_match` - Output matches expected exactly
- `contains` - Output contains expected substring
- `regex` - Output matches pattern
- `json_valid` - Output is valid JSON
- `length_check` - Output within length bounds
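To make the behavior of these checks concrete, here is a minimal sketch of how such heuristic evaluators could be implemented (illustrative only, not GateFlow's internal code):

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    # Strict string equality.
    return output == expected

def contains(output: str, expected: str) -> bool:
    # Substring check.
    return expected in output

def regex(output: str, pattern: str) -> bool:
    # True if the pattern matches anywhere in the output.
    return re.search(pattern, output) is not None

def json_valid(output: str) -> bool:
    # True if the output parses as JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def length_check(output: str, min_len: int = 0, max_len: int = 10_000) -> bool:
    # True if the output length falls within the bounds.
    return min_len <= len(output) <= max_len
```

Because these checks are pure string operations, they run in microseconds and are a good first gate before more expensive evaluators.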
Semantic Evaluators
Embedding-based similarity:
- `semantic_similarity` - Cosine similarity to expected output
- `entailment` - Logical entailment check
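The cosine-similarity scoring behind `semantic_similarity` can be sketched as follows; the vectors here are made up for illustration, whereas in practice they come from an embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot(a, b) / (|a| * |b|), ranging over [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of the model output and the expected text.
output_vec = [0.1, 0.9, 0.2]
expected_vec = [0.15, 0.85, 0.25]
score = cosine_similarity(output_vec, expected_vec)
```

Scores near 1.0 indicate the output and expected text point in nearly the same direction in embedding space; a threshold (e.g. 0.8) typically decides pass/fail.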
LLM-as-Judge Evaluators
Model-based assessment:
- `llm_judge` - General purpose LLM evaluation
- `rubric_judge` - Score against specific criteria
- `pairwise_judge` - Compare two outputs
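As an illustration of the judge pattern (not GateFlow's actual prompt template), a pairwise judge typically wraps the two candidate outputs in a comparison prompt and asks the judge model for a verdict:

```python
def build_pairwise_prompt(question: str, output_a: str, output_b: str) -> str:
    # Hypothetical pairwise-judge prompt; the real template is internal to GateFlow.
    return (
        "You are an impartial judge. Compare the two responses below.\n"
        f"Question: {question}\n"
        f"Response A: {output_a}\n"
        f"Response B: {output_b}\n"
        "Answer with exactly one of: 'A', 'B', or 'tie'."
    )
```

Constraining the judge to a small closed answer set ('A', 'B', 'tie') makes the verdict easy to parse and score programmatically.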
Rubrics
A rubric defines scoring criteria for LLM judges:
```python
rubric = {
    "criteria": [
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Response is factually correct"
        },
        {
            "name": "completeness",
            "weight": 0.3,
            "description": "Response addresses all parts of the question"
        },
        {
            "name": "clarity",
            "weight": 0.3,
            "description": "Response is clear and well-organized"
        }
    ],
    "scale": [1, 2, 3, 4, 5]
}
```

Auto-Generated Rubrics
GateFlow can generate rubrics from examples:
```python
rubric = client.generate_rubric(
    examples=[
        {"input": "...", "good_output": "...", "bad_output": "..."},
        # ... 30 examples recommended
    ]
)
# Generated in ~60 seconds
```

Eval Runs
A run is a single execution of a suite against a model:
```python
run = client.run_suite(suite="safety-core", model="gpt-4o")
print(run.id)              # "run_abc123"
print(run.status)          # "completed"
print(run.aggregate_score) # 94.5
print(run.passed)          # 189
print(run.failed)          # 11
print(run.duration_ms)     # 4521
```

Aggregate Scores
Scores are computed hierarchically:
- Case score - Individual evaluator result (0-100)
- Suite score - Weighted average of case scores
- Model score - Rolling average across recent runs
```
Model Score (gpt-4o): 92.3%
├── safety-core: 100%
├── quality-general: 89.1%
└── rag-faithfulness: 87.8%
```

Tags and Filtering
Organize cases with tags for filtered analysis:
```python
# Run only high-priority cases
results = client.run_suite(
    suite="my-suite",
    model="gpt-4o",
    filters={"tags": ["critical", "safety"]}
)

# Analyze results by tag
breakdown = client.get_breakdown(
    run_id="run_abc123",
    group_by="tags"
)
```

Next Steps
- Curated Suites - Explore pre-built evaluations
- LLM-as-Judge - Configure LLM evaluators
- Tiered Approach - Understand cost optimization