# Traffic Sampling
Continuously evaluate production traffic without impacting performance or cost.
## How It Works
GateFlow samples a configurable percentage of production requests for async evaluation:
```
Production Request → Gateway → Model Response → User
                        │
                        └─→ Sample? ─→ Async Eval Queue ─→ Results
                            (2.5%)
```

- Zero latency impact - Evaluation happens asynchronously
- Configurable rate - Sample 0.1% to 10% of traffic
- Automatic storage - Inputs/outputs stored for evaluation
- Multiple suites - Run different evals on sampled traffic
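The flow above can be sketched in a few lines. This is a hypothetical illustration of the fire-and-forget pattern, not GateFlow's actual internals: the gateway draws a random number per request and, on a hit, enqueues a copy of the exchange for evaluation without blocking the response path.

```python
import random
import queue

# Consumed by a separate async eval worker; the request path only enqueues.
eval_queue = queue.Queue()

def maybe_sample(request, response, rate=0.025):
    """Fire-and-forget sampling: enqueue for async eval, never block the user."""
    if random.random() < rate:
        eval_queue.put({"input": request, "output": response})
    return response  # returned to the user regardless of the sampling decision
```

Because the only per-request work is one random draw and (rarely) a queue put, the user-facing latency is effectively unchanged.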
## Configuration

### Basic Setup
```python
from gateflow import EvalClient

client = EvalClient(api_key="gf-...")

client.configure_sampling(
    rate=0.025,  # 2.5% of traffic
    suites=["safety-core", "quality-general"]
)
```

### Advanced Configuration
```python
client.configure_sampling(
    rate=0.025,

    # Which suites to run
    suites=["safety-core", "quality-general", "rag-faithfulness"],

    # Sampling strategy
    strategy="random",  # or "systematic", "stratified"

    # Filter which requests to sample
    filters={
        "models": ["gpt-4o", "claude-opus-4-5"],  # Only these models
        "endpoints": ["/v1/chat/completions"],
        "min_tokens": 50,        # Skip very short responses
        "exclude_cached": True   # Don't eval cache hits
    },

    # Metadata to capture
    capture={
        "input": True,
        "output": True,
        "model": True,
        "latency": True,
        "tokens": True,
        "custom_headers": ["x-session-id", "x-user-tier"]
    },

    # Retention
    retention_days=90
)
```

## Sampling Strategies
### Random Sampling
Each request has equal probability of being sampled.
```python
config = {"strategy": "random", "rate": 0.025}
```

### Systematic Sampling
Sample every Nth request (e.g., every 40th request = 2.5%).
```python
config = {"strategy": "systematic", "rate": 0.025}
```

### Stratified Sampling
Ensure balanced sampling across dimensions.
```python
config = {
    "strategy": "stratified",
    "rate": 0.025,
    "strata": {
        "model": ["gpt-4o", "claude-opus-4-5"],
        "user_tier": ["free", "paid", "enterprise"]
    }
}
# Each stratum gets proportional samples
```

### Adaptive Sampling
Automatically increase the sampling rate when quality drops.
```python
config = {
    "strategy": "adaptive",
    "base_rate": 0.01,
    "max_rate": 0.10,
    "triggers": {
        "score_below": 90,       # Increase sampling if score < 90%
        "drift_detected": True   # Increase on drift
    }
}
```

## Viewing Sampled Data
### Dashboard
Navigate to Eval → Samples to see:
- Recent samples with inputs/outputs
- Eval scores per sample
- Filtering by model, score, time
### API
```python
# Get recent samples
samples = client.list_samples(
    limit=100,
    filters={
        "suite": "safety-core",
        "score_below": 80,
        "time_range": "24h"
    }
)

for sample in samples:
    print(f"Model: {sample.model}")
    print(f"Input: {sample.input[:100]}...")
    print(f"Output: {sample.output[:100]}...")
    print(f"Score: {sample.scores}")
    print("---")
```

## Running Evals on Samples
Samples are automatically evaluated, but you can also run additional evals:
```python
# Run a new suite on existing samples
results = client.evaluate_samples(
    sample_ids=["sample_abc", "sample_def"],
    suite="my-custom-suite"
)

# Or evaluate all samples from a time range
results = client.evaluate_samples(
    time_range={"start": "2024-01-01", "end": "2024-01-07"},
    suite="quality-general"
)
```

## Cost Considerations
### Sampling Cost
| Rate | Requests/day | Samples/day | Eval Cost* |
|---|---|---|---|
| 1% | 100,000 | 1,000 | ~$5 |
| 2.5% | 100,000 | 2,500 | ~$12 |
| 5% | 100,000 | 5,000 | ~$25 |
| 10% | 100,000 | 10,000 | ~$50 |
*Using the tiered evaluation approach
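The table's arithmetic is linear in both directions: samples per day scale with the rate, and cost scales with samples. A minimal sketch, assuming a per-sample eval cost of roughly $0.005 (an illustrative figure derived from the table, not a published price):

```python
def daily_eval_cost(requests_per_day, rate, cost_per_sample=0.005):
    """Return (samples/day, eval cost/day) for a given sampling rate."""
    samples = requests_per_day * rate
    return samples, samples * cost_per_sample

samples, cost = daily_eval_cost(100_000, 0.025)
# ≈ 2,500 samples/day at ~$0.005 each → ≈ $12.50/day
```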
### Optimizing Cost
```python
# Use tiered evaluation for samples
client.configure_sampling(
    rate=0.025,
    suites=["safety-core"],
    evaluator_config={
        "tiered": True,
        "tier_1_checks": ["pii_check", "length_check"],
        "tier_2_model": "gpt-4o-mini",
        "tier_3_model": "claude-opus-4-5"
    }
)
```

## Storage and Retention
### Data Stored
Per sample:
- Request input (prompt, messages)
- Model output (response)
- Metadata (model, latency, tokens, timestamp)
- Eval results (scores, reasoning)
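Putting those four pieces together, a stored sample might be shaped like the following. The field names and values here are illustrative, not GateFlow's actual schema:

```python
import json

# Hypothetical shape of one stored sample: input, output, metadata, eval results.
sample_record = {
    "id": "sample_abc",
    "input": [{"role": "user", "content": "Summarize this article..."}],
    "output": "The article argues that...",
    "metadata": {
        "model": "gpt-4o",
        "latency_ms": 820,
        "tokens": {"prompt": 412, "completion": 96},
        "timestamp": "2024-01-05T14:02:11Z",
    },
    "evals": {
        "safety-core": {"score": 98, "reasoning": "No policy violations found."},
    },
}

print(json.dumps(sample_record, indent=2))
```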
### Retention Policies
```python
client.configure_sampling(
    retention={
        "default_days": 90,
        "failed_samples_days": 365,  # Keep failures longer
        "compliance_hold": True      # Respect litigation holds
    }
)
```

### Data Access
```python
# Export samples for analysis
export = client.export_samples(
    time_range="last_30d",
    format="jsonl",  # or "csv", "parquet"
    include_evals=True
)

# Download URL valid for 24h
print(export.download_url)
```

## Privacy Considerations
### PII Handling
```python
client.configure_sampling(
    privacy={
        "redact_pii": True,  # Redact before storage
        "pii_types": ["email", "phone", "ssn", "name"],
        "hash_user_ids": True,  # Pseudonymize user IDs
        "exclude_fields": ["password", "api_key"]
    }
)
```

### Access Control
```python
# Limit who can view samples
client.configure_sampling(
    access={
        "view_samples": ["admin", "eval_team"],
        "export_samples": ["admin"],
        "delete_samples": ["admin"]
    }
)
```

## Next Steps
- Routing Feedback Loop - Use scores for routing
- Drift Detection - Monitor quality over time
- Compliance Reports - Generate audit artifacts