Curated Eval Suites
GateFlow provides 10+ pre-built evaluation suites covering common AI quality and safety requirements. These suites are maintained by our team and updated regularly.
Available Suites
Safety & Trust
| Suite | Cases | Description |
|---|---|---|
| safety-core | 200+ | Core safety checks: toxicity, harmful content, PII leakage |
| safety-jailbreak | 150+ | Jailbreak and prompt injection resistance |
| safety-bias | 100+ | Bias detection across demographic categories |
Quality
| Suite | Cases | Description |
|---|---|---|
| quality-general | 300+ | General response quality: coherence, relevance, helpfulness |
| quality-instruction | 150+ | Instruction following accuracy |
| quality-reasoning | 100+ | Logical reasoning and consistency |
RAG & Retrieval
| Suite | Cases | Description |
|---|---|---|
| rag-faithfulness | 200+ | Faithfulness to source documents |
| rag-groundedness | 150+ | Claims grounded in provided context |
| rag-citation | 100+ | Citation accuracy and attribution |
Compliance
| Suite | Cases | Description |
|---|---|---|
| compliance-eu-ai-act | 100+ | EU AI Act alignment checks |
| compliance-medical | 150+ | Medical disclaimer and safety requirements |
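
Suite names share a category prefix (`safety-`, `quality-`, `rag-`, `compliance-`), which is what makes a pattern such as `safety-*` useful when selecting suites. The sketch below illustrates how such a pattern could expand against the catalog listed above. It assumes shell-style glob semantics and uses only Python's standard `fnmatch` module; `CURATED_SUITES` and `expand_suite_patterns` are hypothetical names for illustration, not part of the GateFlow SDK, and the service's actual matching rules may differ.

```python
from fnmatch import fnmatchcase

# Suite names copied from the tables above.
CURATED_SUITES = [
    "safety-core", "safety-jailbreak", "safety-bias",
    "quality-general", "quality-instruction", "quality-reasoning",
    "rag-faithfulness", "rag-groundedness", "rag-citation",
    "compliance-eu-ai-act", "compliance-medical",
]


def expand_suite_patterns(patterns, catalog=CURATED_SUITES):
    """Hypothetical helper: expand glob patterns into concrete suite names.

    Assumes shell-style globbing; results keep catalog order per pattern
    and skip names that an earlier pattern already selected.
    """
    selected = []
    for pattern in patterns:
        for name in catalog:
            if fnmatchcase(name, pattern) and name not in selected:
                selected.append(name)
    return selected


print(expand_suite_patterns(["safety-*"]))
# ['safety-core', 'safety-jailbreak', 'safety-bias']
print(expand_suite_patterns(["rag-*", "safety-core"]))
# ['rag-faithfulness', 'rag-groundedness', 'rag-citation', 'safety-core']
```

Expanding patterns locally like this can be a handy way to preview which suites a wildcard would select before starting a run.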
Using Curated Suites
```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Run a single suite
results = client.run_suite(
    suite="safety-core",
    model="gpt-4o",
)

# Run multiple suites
results = client.run_suites(
    suites=["safety-core", "quality-general", "rag-faithfulness"],
    model="gpt-4o",
)

# Run all safety suites
results = client.run_suites(
    suites=["safety-*"],  # Wildcard matching
    model="gpt-4o",
)
```
Suite Configuration
Customize how suites run:
```python
results = client.run_suite(
    suite="safety-core",
    model="gpt-4o",
    config={
        "temperature": 0.0,      # Deterministic outputs
        "max_tokens": 500,       # Limit response length
        "timeout_ms": 30000,     # Per-case timeout
        "retry_on_error": True,  # Retry failed cases
        "parallel": 10,          # Concurrent evaluations
    },
)
```
Viewing Suite Contents
Inspect cases before running:
```python
suite = client.get_suite("safety-core")
print(f"Suite: {suite.name}")
print(f"Cases: {suite.case_count}")
print(f"Last updated: {suite.updated_at}")

# Preview cases
for case in suite.cases[:5]:
    print(f"  - {case.input[:50]}...")
```
Suite Versioning
Curated suites are versioned for reproducibility:
```python
# Run specific version
results = client.run_suite(
    suite="safety-core@v2.1",
    model="gpt-4o",
)

# List available versions
versions = client.list_suite_versions("safety-core")
# ["v1.0", "v1.1", "v2.0", "v2.1"]
```
Next Steps
- Safety Suites - Deep dive into safety evaluation
- Quality Suites - Explore quality metrics
- RAG Suites - RAG-specific evaluation