
Quality Suites

Evaluate the general quality of AI responses, including coherence, relevance, and instruction following.

quality-general

Comprehensive quality assessment for conversational AI.

What it Tests

  • Coherence - Logical flow and consistency
  • Relevance - Addresses the actual question
  • Helpfulness - Provides actionable, useful information
  • Completeness - Covers all aspects of the query
  • Conciseness - No unnecessary verbosity

Scoring Rubric

Each dimension is scored 1-5:

Score  Meaning
5      Excellent - Exceeds expectations
4      Good - Meets expectations
3      Acceptable - Minor issues
2      Below average - Notable problems
1      Poor - Fails to meet basic requirements

Example Usage

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

results = client.run_suite(
    suite="quality-general",
    model="gpt-4o"
)

# Get dimension breakdown
print(results.dimension_scores)
# {
#   "coherence": 4.2,
#   "relevance": 4.5,
#   "helpfulness": 4.1,
#   "completeness": 3.9,
#   "conciseness": 4.3
# }
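
If you want a single headline number from this breakdown, one option is to average the dimension scores and normalize to a percentage. This is only an illustrative roll-up, not GateFlow's built-in suite score, which may weight dimensions differently.

python
# Illustrative roll-up only -- GateFlow's built-in aggregation may differ.
dimension_scores = {
    "coherence": 4.2,
    "relevance": 4.5,
    "helpfulness": 4.1,
    "completeness": 3.9,
    "conciseness": 4.3,
}

# Simple mean on the 1-5 scale, then normalized to a percentage
mean_score = sum(dimension_scores.values()) / len(dimension_scores)
print(f"Overall: {mean_score:.2f}/5 ({mean_score / 5 * 100:.0f}%)")
# Overall: 4.20/5 (84%)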

quality-instruction

Tests how well models follow specific instructions.

Test Categories

  1. Format compliance - "Respond in JSON format", "Use bullet points"
  2. Constraint adherence - "In 3 sentences or less", "Don't mention X"
  3. Multi-step instructions - "First do A, then B, finally C"
  4. Conditional logic - "If X, then Y; otherwise Z"

Example Cases

python
cases = [
    {
        "input": "List 3 benefits of exercise. Use exactly 3 bullet points.",
        "checks": ["has_3_bullets", "on_topic"],
        "evaluator": "instruction_check"
    },
    {
        "input": "Explain quantum computing without using the word 'particle'",
        "checks": ["not_contains:particle", "is_coherent"],
        "evaluator": "constraint_check"
    }
]

Scoring

  • Pass/Fail for binary instructions (format, constraints)
  • Partial credit for multi-step instructions (% of steps completed) - see the sketch below
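
To make the partial-credit rule concrete, the snippet below scores a multi-step instruction as the fraction of required steps the response completed. The boolean step results are hand-written for illustration; in a real run they would come from the suite's per-step checks.

python
# Illustrative only -- in a real run these booleans come from the suite's
# per-step checks, not hand-coded values.
step_results = {
    "first_do_a": True,
    "then_do_b": True,
    "finally_do_c": False,
}

# Partial credit: fraction of required steps the response satisfied
partial_credit = sum(step_results.values()) / len(step_results)
print(f"Partial credit: {partial_credit:.0%}")
# Partial credit: 67%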

quality-reasoning

Evaluates logical reasoning and analytical capabilities.

Reasoning Types

  • Deductive - General rules to specific conclusions
  • Inductive - Specific observations to general patterns
  • Analogical - Reasoning by similarity
  • Causal - Understanding cause and effect
  • Mathematical - Numerical and logical problem-solving

Example

python
results = client.run_suite(
    suite="quality-reasoning",
    model="gpt-4o",
    config={
        "categories": ["deductive", "mathematical"]
    }
)

# See where reasoning breaks down
for failure in results.failures:
    print(f"Reasoning error: {failure.reasoning_type}")
    print(f"Expected chain: {failure.expected_steps}")
    print(f"Actual chain: {failure.actual_steps}")

Custom Quality Evaluation

Using Your Own Rubrics

python
# Define custom quality criteria
rubric = {
    "criteria": [
        {
            "name": "brand_voice",
            "weight": 0.3,
            "description": "Matches our friendly, professional tone"
        },
        {
            "name": "accuracy",
            "weight": 0.4,
            "description": "Information is factually correct"
        },
        {
            "name": "actionability",
            "weight": 0.3,
            "description": "User can act on the response"
        }
    ]
}

results = client.evaluate(
    model="gpt-4o",
    cases=my_cases,
    rubric=rubric
)

Auto-Generated Rubrics

Let GateFlow learn your quality standards:

python
# Provide examples of good and bad responses
rubric = client.generate_rubric(
    examples=[
        {
            "input": "How do I reset my password?",
            "good_output": "Click Settings > Security > Reset Password...",
            "bad_output": "You should contact support or something."
        },
        # ... 30 examples recommended
    ]
)

# Rubric generated in ~60 seconds
print(rubric.criteria)
# Auto-detected: clarity, step_accuracy, completeness, tone
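
A generated rubric can then be used like a hand-written one. The call below assumes generate_rubric() returns an object that evaluate() accepts directly, mirroring the custom-rubric example above.

python
# Assumes the generated rubric is accepted by evaluate() just like a
# hand-written rubric (see the custom rubric example above)
results = client.evaluate(
    model="gpt-4o",
    cases=my_cases,
    rubric=rubric
)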

Comparing Models

Use quality suites to compare model variants:

python
models = ["gpt-4o", "claude-opus-4-5", "gemini-2.5-pro"]

comparison = client.compare_models(
    suite="quality-general",
    models=models
)

print(comparison.rankings)
# 1. claude-opus-4-5: 92.3%
# 2. gpt-4o: 91.8%
# 3. gemini-2.5-pro: 90.1%

print(comparison.statistical_significance)
# claude-opus-4-5 vs gpt-4o: p=0.23 (not significant)
# gpt-4o vs gemini-2.5-pro: p=0.01 (significant)

Integration with Routing

Quality scores inform intelligent routing:

python
# Configure quality-aware routing
client.configure_routing(
    quality_threshold=85,  # Minimum quality score
    prefer_models=["claude-opus-4-5", "gpt-4o"],
    fallback_on_quality_drop=True
)
