Quality Suites
Evaluate the general quality of AI responses, including coherence, relevance, and instruction following.
quality-general
Comprehensive quality assessment for conversational AI.
What it Tests
- Coherence - Logical flow and consistency
- Relevance - Addresses the actual question
- Helpfulness - Provides actionable, useful information
- Completeness - Covers all aspects of the query
- Conciseness - No unnecessary verbosity
Scoring Rubric
Each dimension is scored 1-5:
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets expectations |
| 3 | Acceptable - Minor issues |
| 2 | Below average - Notable problems |
| 1 | Poor - Fails to meet basic requirements |
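The per-dimension scores reported by the API (shown in the example below) are averages on this 1-5 scale. How they combine into a single suite score is not spelled out here; a minimal sketch, assuming an unweighted mean normalized to a percentage:

```python
# Illustrative roll-up only: an unweighted mean of the 1-5 dimension scores,
# normalized to a percentage. GateFlow's actual aggregation is not documented here.
dimension_scores = {
    "coherence": 4.2,
    "relevance": 4.5,
    "helpfulness": 4.1,
    "completeness": 3.9,
    "conciseness": 4.3,
}

mean_score = sum(dimension_scores.values()) / len(dimension_scores)
print(f"{mean_score:.2f} / 5 = {mean_score / 5:.0%}")  # 4.20 / 5 = 84%
```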
Example Usage
```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.run_suite(
    suite="quality-general",
    model="gpt-4o"
)

# Get dimension breakdown
print(results.dimension_scores)
# {
#   "coherence": 4.2,
#   "relevance": 4.5,
#   "helpfulness": 4.1,
#   "completeness": 3.9,
#   "conciseness": 4.3
# }
```
quality-instruction
Tests how well models follow specific instructions.
Test Categories
- Format compliance - "Respond in JSON format", "Use bullet points"
- Constraint adherence - "In 3 sentences or less", "Don't mention X"
- Multi-step instructions - "First do A, then B, finally C"
- Conditional logic - "If X, then Y; otherwise Z"
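The cases below reference named checks such as `has_3_bullets` and `not_contains:particle`. These checks are evaluated by GateFlow itself; the sketch below only illustrates assumed semantics for two of them and is not the library's implementation:

```python
# Assumed semantics for two checks referenced in the cases below; the real
# instruction_check / constraint_check evaluators are GateFlow's, not these.
def has_3_bullets(text: str) -> bool:
    bullets = [line for line in text.splitlines() if line.lstrip().startswith(("-", "*"))]
    return len(bullets) == 3

def not_contains(text: str, forbidden: str) -> bool:
    return forbidden.lower() not in text.lower()

response = "- Better sleep\n- More energy\n- Stronger bones"
print(has_3_bullets(response))             # True
print(not_contains(response, "particle"))  # True
```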
Example Cases
```python
cases = [
    {
        "input": "List 3 benefits of exercise. Use exactly 3 bullet points.",
        "checks": ["has_3_bullets", "on_topic"],
        "evaluator": "instruction_check"
    },
    {
        "input": "Explain quantum computing without using the word 'particle'",
        "checks": ["not_contains:particle", "is_coherent"],
        "evaluator": "constraint_check"
    }
]
```
Scoring
- Pass/Fail for binary instructions (format, constraints)
- Partial credit for multi-step instructions (% of steps completed), as sketched below
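A minimal sketch of the partial-credit rule, assuming it is literally the fraction of required steps that appear in the response:

```python
# Partial credit as "% of steps completed" (assumption based on the bullet above).
def partial_credit(expected_steps: list[str], completed_steps: list[str]) -> float:
    if not expected_steps:
        return 1.0
    done = sum(1 for step in expected_steps if step in completed_steps)
    return done / len(expected_steps)

# "First do A, then B, finally C" with only A and C done -> 2/3 credit.
print(partial_credit(["A", "B", "C"], ["A", "C"]))  # 0.666...
```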
quality-reasoning
Evaluates logical reasoning and analytical capabilities.
Reasoning Types
- Deductive - General rules to specific conclusions
- Inductive - Specific observations to general patterns
- Analogical - Reasoning by similarity
- Causal - Understanding cause and effect
- Mathematical - Numerical and logical problem-solving
Example
```python
results = client.run_suite(
    suite="quality-reasoning",
    model="gpt-4o",
    config={
        "categories": ["deductive", "mathematical"]
    }
)

# See where reasoning breaks down
for failure in results.failures:
    print(f"Reasoning error: {failure.reasoning_type}")
    print(f"Expected chain: {failure.expected_steps}")
    print(f"Actual chain: {failure.actual_steps}")
```
Custom Quality Evaluation
Using Your Own Rubrics
```python
# Define custom quality criteria
rubric = {
    "criteria": [
        {
            "name": "brand_voice",
            "weight": 0.3,
            "description": "Matches our friendly, professional tone"
        },
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Information is factually correct"
        },
        {
            "name": "actionability",
            "weight": 0.3,
            "description": "User can act on the response"
        }
    ]
}

results = client.evaluate(
    model="gpt-4o",
    cases=my_cases,
    rubric=rubric
)
```
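The weights above sum to 1.0, which suggests a weighted average over the criteria. How GateFlow actually combines them is not shown in this section, so the roll-up below is an assumption:

```python
# Assumed roll-up: weighted average over per-criterion scores (weights sum to 1.0).
# Uses the `rubric` defined above; the per-criterion scores here are hypothetical.
weights = {c["name"]: c["weight"] for c in rubric["criteria"]}
criterion_scores = {"brand_voice": 4.0, "accuracy": 4.5, "actionability": 3.5}

overall = sum(weights[name] * score for name, score in criterion_scores.items())
print(f"Weighted score: {overall:.2f} / 5")  # 0.3*4.0 + 0.4*4.5 + 0.3*3.5 = 4.05
```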
Auto-Generated Rubrics
Let GateFlow learn your quality standards:
```python
# Provide examples of good and bad responses
rubric = client.generate_rubric(
    examples=[
        {
            "input": "How do I reset my password?",
            "good_output": "Click Settings > Security > Reset Password...",
            "bad_output": "You should contact support or something."
        },
        # ... 30 examples recommended
    ]
)

# Rubric generated in ~60 seconds
print(rubric.criteria)
# Auto-detected: clarity, step_accuracy, completeness, tone
```
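Presumably a generated rubric is used the same way as a hand-written one. Whether it can be passed straight to client.evaluate is not stated in this section, so treat the call below as a sketch:

```python
# Assumption: the generated rubric is accepted wherever a hand-written rubric
# is (see "Using Your Own Rubrics" above); the exact return fields for custom
# rubrics are not shown in this section.
results = client.evaluate(
    model="gpt-4o",
    cases=my_cases,
    rubric=rubric
)
```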
Comparing Models
Use quality suites to compare model variants:
```python
models = ["gpt-4o", "claude-opus-4-5", "gemini-2.5-pro"]

comparison = client.compare_models(
    suite="quality-general",
    models=models
)

print(comparison.rankings)
# 1. claude-opus-4-5: 92.3%
# 2. gpt-4o: 91.8%
# 3. gemini-2.5-pro: 90.1%

print(comparison.statistical_significance)
# claude-opus-4-5 vs gpt-4o: p=0.23 (not significant)
# gpt-4o vs gemini-2.5-pro: p=0.01 (significant)
```
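GateFlow's significance test is not specified here. As rough intuition for why a sub-point gap in suite scores can fail to reach significance, here is a naive unpaired two-proportion z-test; GateFlow's own test is likely paired per test case, so its p-values will differ from this sketch:

```python
from math import sqrt, erf

# Illustration only, not GateFlow's significance test: a naive two-proportion
# z-test asking whether a 92.3% vs 91.8% pass-rate gap stands out from noise
# when each model is scored on n cases.
def two_sided_p(p1: float, p2: float, n: int) -> float:
    pooled = (p1 + p2) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(round(two_sided_p(0.923, 0.918, 500), 2))  # ~0.77: not significant at n=500
```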
Integration with Routing
Quality scores inform intelligent routing:
```python
# Configure quality-aware routing
client.configure_routing(
    quality_threshold=85,  # Minimum quality score
    prefer_models=["claude-opus-4-5", "gpt-4o"],
    fallback_on_quality_drop=True
)
```
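How the router applies these settings is only summarized above. Conceptually (an assumption about the semantics, not GateFlow's implementation), the preference-plus-threshold behaviour amounts to something like:

```python
# Conceptual sketch of quality-aware routing (assumed semantics): prefer models
# in order, skipping any whose current quality score is below the threshold.
def pick_model(quality_scores: dict[str, float], prefer: list[str], threshold: float) -> str | None:
    for model in prefer:
        if quality_scores.get(model, 0.0) >= threshold:
            return model
    return None  # nothing meets the bar; fall back / alert

print(pick_model({"claude-opus-4-5": 83.0, "gpt-4o": 91.8},
                 prefer=["claude-opus-4-5", "gpt-4o"], threshold=85))  # gpt-4o
```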
Next Steps
- RAG Suites - Evaluate retrieval quality
- Tiered Approach - Cost-efficient evaluation
- Routing Feedback - Closed-loop routing