
Traffic Sampling

Continuously evaluate production traffic without adding latency or meaningfully increasing cost.

How It Works

GateFlow samples a configurable percentage of production requests for async evaluation:

Production Request → Gateway → Model Response → User
                        └─→ Sample? ─→ Async Eval Queue ─→ Results
                            (2.5%)
  • Zero latency impact - Evaluation happens asynchronously
  • Configurable rate - Sample 0.1% to 10% of traffic
  • Automatic storage - Inputs/outputs stored for evaluation
  • Multiple suites - Run different evals on sampled traffic
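The sampling decision itself can be sketched as a simple probability check that never blocks the response path. This is a minimal illustration of the idea, not GateFlow's internal implementation:

```python
import random

def should_sample(rate: float = 0.025) -> bool:
    """Decide whether to enqueue a copy of this request for async evaluation."""
    return random.random() < rate

# The response is returned to the user regardless of the outcome;
# only the copy placed on the eval queue is affected.
random.seed(0)
sampled = sum(should_sample(0.025) for _ in range(100_000))
print(sampled)  # roughly 2,500 of 100,000 requests
```

Because the check is independent per request, the sampled count fluctuates around `rate × requests` rather than hitting it exactly.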

Configuration

Basic Setup

python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

client.configure_sampling(
    rate=0.025,  # 2.5% of traffic
    suites=["safety-core", "quality-general"]
)

Advanced Configuration

python
client.configure_sampling(
    rate=0.025,

    # Which suites to run
    suites=["safety-core", "quality-general", "rag-faithfulness"],

    # Sampling strategy
    strategy="random",  # or "systematic", "stratified"

    # Filter which requests to sample
    filters={
        "models": ["gpt-4o", "claude-opus-4-5"],  # Only these models
        "endpoints": ["/v1/chat/completions"],
        "min_tokens": 50,  # Skip very short responses
        "exclude_cached": True  # Don't eval cache hits
    },

    # Metadata to capture
    capture={
        "input": True,
        "output": True,
        "model": True,
        "latency": True,
        "tokens": True,
        "custom_headers": ["x-session-id", "x-user-tier"]
    },

    # Retention
    retention_days=90
)

Sampling Strategies

Random Sampling

Each request has equal probability of being sampled.

python
config = {"strategy": "random", "rate": 0.025}

Systematic Sampling

Sample every Nth request (e.g., every 40th request = 2.5%).

python
config = {"strategy": "systematic", "rate": 0.025}
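Conceptually, a systematic sampler keeps a request counter and picks every Nth request, with N derived from the rate. A minimal sketch (illustrative only, not GateFlow internals):

```python
from itertools import count

def systematic_sampler(rate: float):
    """Yield True for every Nth request, where N = round(1 / rate)."""
    n = round(1 / rate)  # rate 0.025 -> every 40th request
    for i in count(1):
        yield i % n == 0

sampler = systematic_sampler(0.025)
picks = [i for i, take in zip(range(1, 121), sampler) if take]
print(picks)  # [40, 80, 120]
```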

Stratified Sampling

Ensure balanced sampling across dimensions.

python
config = {
    "strategy": "stratified",
    "rate": 0.025,
    "strata": {
        "model": ["gpt-4o", "claude-opus-4-5"],
        "user_tier": ["free", "paid", "enterprise"]
    }
}
# Each stratum gets proportional samples
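"Proportional" here means each stratum is sampled at the same effective rate, so high-traffic strata contribute more samples in absolute terms. A sketch of that allocation, with illustrative traffic numbers:

```python
def proportional_allocation(traffic_by_stratum: dict, rate: float) -> dict:
    """Allocate sample counts so every stratum is sampled at the same rate."""
    return {stratum: round(n * rate) for stratum, n in traffic_by_stratum.items()}

# Hypothetical daily traffic per (model, user_tier) stratum
traffic = {
    ("gpt-4o", "free"): 40_000,
    ("gpt-4o", "paid"): 30_000,
    ("claude-opus-4-5", "enterprise"): 30_000,
}
alloc = proportional_allocation(traffic, 0.025)
print(alloc)
# {('gpt-4o', 'free'): 1000, ('gpt-4o', 'paid'): 750, ('claude-opus-4-5', 'enterprise'): 750}
```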

Adaptive Sampling

Increase sampling when quality drops.

python
config = {
    "strategy": "adaptive",
    "base_rate": 0.01,
    "max_rate": 0.10,
    "triggers": {
        "score_below": 90,    # Increase sampling if score < 90%
        "drift_detected": True # Increase on drift
    }
}
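One way to picture the adaptive policy: stay at the base rate while quality is healthy, and jump toward the maximum rate when a trigger fires. The step-to-max logic below is an illustrative simplification; a real implementation might ramp gradually:

```python
def next_rate(score: float, drift: bool,
              base_rate: float = 0.01, max_rate: float = 0.10,
              score_threshold: float = 90.0) -> float:
    """Raise the sampling rate when a quality trigger fires; otherwise use the base rate."""
    if score < score_threshold or drift:
        return max_rate
    return base_rate

print(next_rate(score=95.0, drift=False))  # 0.01 -- healthy, stay at base rate
print(next_rate(score=87.5, drift=False))  # 0.1 -- low score, sample more
```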

Viewing Sampled Data

Dashboard

Navigate to Eval → Samples to see:

  • Recent samples with inputs/outputs
  • Eval scores per sample
  • Filters by model, score, and time range

API

python
# Get recent samples
samples = client.list_samples(
    limit=100,
    filters={
        "suite": "safety-core",
        "score_below": 80,
        "time_range": "24h"
    }
)

for sample in samples:
    print(f"Model: {sample.model}")
    print(f"Input: {sample.input[:100]}...")
    print(f"Output: {sample.output[:100]}...")
    print(f"Score: {sample.scores}")
    print("---")

Running Evals on Samples

Samples are automatically evaluated, but you can also run additional evals:

python
# Run a new suite on existing samples
results = client.evaluate_samples(
    sample_ids=["sample_abc", "sample_def"],
    suite="my-custom-suite"
)

# Or evaluate all samples from a time range
results = client.evaluate_samples(
    time_range={"start": "2024-01-01", "end": "2024-01-07"},
    suite="quality-general"
)

Cost Considerations

Sampling Cost

Rate   Requests/day   Samples/day   Eval Cost*
1%     100,000        1,000         ~$5
2.5%   100,000        2,500         ~$12
5%     100,000        5,000         ~$25
10%    100,000        10,000        ~$50

*Using tiered evaluation approach
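The table's figures follow from straightforward arithmetic, assuming roughly $0.005 per evaluated sample under the tiered approach (an assumed unit cost for illustration; your actual per-sample cost depends on the suites and tier models you configure):

```python
def daily_eval_cost(requests_per_day: int, rate: float,
                    cost_per_sample: float = 0.005) -> float:
    """Estimated eval spend per day: samples/day x cost per sample."""
    samples_per_day = requests_per_day * rate
    return samples_per_day * cost_per_sample

for rate in (0.01, 0.025, 0.05, 0.10):
    print(f"{rate:.1%}: ~${daily_eval_cost(100_000, rate):.0f}/day")
```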

Optimizing Cost

python
# Use tiered evaluation for samples
client.configure_sampling(
    rate=0.025,
    suites=["safety-core"],
    evaluator_config={
        "tiered": True,
        "tier_1_checks": ["pii_check", "length_check"],
        "tier_2_model": "gpt-4o-mini",
        "tier_3_model": "claude-opus-4-5"
    }
)

Storage and Retention

Data Stored

Per sample:

  • Request input (prompt, messages)
  • Model output (response)
  • Metadata (model, latency, tokens, timestamp)
  • Eval results (scores, reasoning)
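Putting those four parts together, a stored sample might look like the following record. The field names and values are illustrative, not the exact storage schema:

```python
import json

sample = {
    "id": "sample_abc",
    "input": {"messages": [{"role": "user", "content": "Summarize this article."}]},
    "output": "The article argues that...",
    "metadata": {
        "model": "gpt-4o",
        "latency_ms": 420,
        "tokens": {"prompt": 812, "completion": 96},
        "timestamp": "2024-01-05T14:32:11Z",
    },
    "evals": {
        "safety-core": {"score": 98, "reasoning": "No policy violations found."},
    },
}
print(json.dumps(sample, indent=2))
```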

Retention Policies

python
client.configure_sampling(
    retention={
        "default_days": 90,
        "failed_samples_days": 365,  # Keep failures longer
        "compliance_hold": True       # Respect litigation holds
    }
)

Data Access

python
# Export samples for analysis
export = client.export_samples(
    time_range="last_30d",
    format="jsonl",  # or "csv", "parquet"
    include_evals=True
)

# Download URL valid for 24h
print(export.download_url)

Privacy Considerations

PII Handling

python
client.configure_sampling(
    privacy={
        "redact_pii": True,           # Redact before storage
        "pii_types": ["email", "phone", "ssn", "name"],
        "hash_user_ids": True,        # Pseudonymize user IDs
        "exclude_fields": ["password", "api_key"]
    }
)

Access Control

python
# Limit who can view samples
client.configure_sampling(
    access={
        "view_samples": ["admin", "eval_team"],
        "export_samples": ["admin"],
        "delete_samples": ["admin"]
    }
)
