
Quickstart

Run your first evaluation in 5 minutes.

Prerequisites

- A GateFlow API key (the `gf-...` value used in the examples below)
- A Python or Node.js environment, depending on which SDK you use

Installation

Python:

```bash
pip install gateflow
```

JavaScript:

```bash
npm install @gateflow/sdk
```

Run Your First Eval

Option 1: Use a Curated Suite

The fastest way to start is with a pre-built evaluation suite:

Python:

```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Run the safety-core suite against your model
results = client.run_suite(
    suite="safety-core",
    model="gpt-4o"
)

print(f"Overall score: {results.aggregate_score}%")
print(f"Cases passed: {results.passed}/{results.total}")
```
JavaScript:

```javascript
import { EvalClient } from '@gateflow/sdk';

const client = new EvalClient({ apiKey: 'gf-...' });

const results = await client.runSuite({
  suite: 'safety-core',
  model: 'gpt-4o'
});

console.log(`Overall score: ${results.aggregateScore}%`);
console.log(`Cases passed: ${results.passed}/${results.total}`);
```
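
If the relationship between `aggregate_score`, `passed`, and `total` is not obvious, here is a minimal sketch, assuming the aggregate score is simply the pass rate expressed as a percentage (real suites may weight cases differently):

```python
def aggregate_score(passed: int, total: int) -> float:
    """Pass rate as a percentage (illustrative assumption,
    not GateFlow's actual scoring formula)."""
    if total == 0:
        return 0.0
    return round(100 * passed / total, 1)

print(f"Overall score: {aggregate_score(47, 50)}%")  # Overall score: 94.0%
```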

Option 2: Evaluate Custom Cases

Test your own inputs and expected outputs:

Python:

```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

# Define your test cases
cases = [
    {
        "input": "What's the capital of France?",
        "expected": "Paris",
        "evaluator": "exact_match"
    },
    {
        "input": "Summarize this article...",
        "expected_criteria": ["concise", "accurate", "neutral"],
        "evaluator": "llm_judge"
    }
]

results = client.evaluate(
    model="gpt-4o",
    cases=cases
)

for result in results:
    print(f"{result.case_id}: {result.score} - {result.reasoning}")
```
JavaScript:

```javascript
import { EvalClient } from '@gateflow/sdk';

const client = new EvalClient({ apiKey: 'gf-...' });

const cases = [
  {
    input: "What's the capital of France?",
    expected: "Paris",
    evaluator: "exact_match"
  },
  {
    input: "Summarize this article...",
    expectedCriteria: ["concise", "accurate", "neutral"],
    evaluator: "llm_judge"
  }
];

const results = await client.evaluate({
  model: 'gpt-4o',
  cases
});

results.forEach(r => console.log(`${r.caseId}: ${r.score} - ${r.reasoning}`));
```
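
For `exact_match` cases you can sanity-check your expectations locally before spending API calls. This is an illustrative sketch, not GateFlow's implementation; the normalization (whitespace stripping, case folding) is an assumption, and the model output below is a stand-in:

```python
def exact_match(output: str, expected: str) -> bool:
    # Assumed normalization: trim whitespace and fold case before comparing.
    return output.strip().casefold() == expected.strip().casefold()

# Hypothetical model output for the first case above
case = {"input": "What's the capital of France?", "expected": "Paris"}
model_output = "Paris"

score = exact_match(model_output, case["expected"])
print(score)  # True
```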

Enable Production Sampling

Add automatic evaluation of live traffic:

Python:

```python
from openai import OpenAI

# Just use GateFlow as your base URL - sampling is automatic
client = OpenAI(
    api_key="gf-...",
    base_url="https://api.gateflow.ai/v1"
)

# Normal inference - 2.5% of requests are automatically sampled for eval
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Configure sampling rate in your dashboard or via API:

Python:

```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

client.configure_sampling(
    rate=0.025,  # 2.5% of traffic
    suites=["safety-core", "quality-general"]
)
```
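
A rate of 0.025 means each request has an independent 2.5% chance of being routed to evaluation. A minimal sketch of that decision, assuming simple per-request Bernoulli sampling (the gateway's actual logic may differ, e.g. deterministic or stratified sampling):

```python
import random

def should_sample(rate: float, rng: random.Random) -> bool:
    """Bernoulli draw: True for roughly `rate` of all calls."""
    return rng.random() < rate

rng = random.Random(0)  # seeded for reproducibility
sampled = sum(should_sample(0.025, rng) for _ in range(100_000))
print(sampled / 100_000)  # ≈ 0.025
```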

View Results

Results are available in the dashboard or via API:

Python:

```python
# Get recent eval runs
runs = client.list_runs(limit=10)

for run in runs:
    print(f"{run.suite}: {run.score}% ({run.timestamp})")

# Get detailed results for a specific run
details = client.get_run(run_id="run_abc123")
for case in details.cases:
    print(f"  {case.input[:50]}... → {case.score}")
```

Next Steps
