# Cost vs Performance Optimization
GateFlow helps you balance cost and performance for your AI workloads. This guide covers optimization strategies.
## Optimization Modes
### Cost Mode

Minimize spend while meeting quality requirements:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "cost"
        }
    }
)
```

GateFlow selects the cheapest model that can handle the task.
### Latency Mode

Minimize response time:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "latency"
        }
    }
)
```

GateFlow selects based on current provider latency and model speed.
### Quality Mode

Maximize output quality:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "quality"
        }
    }
)
```

Routes to premium models like GPT-4o and Claude Opus.
### Balanced Mode (Default)

Balance all factors:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...]
    # Default optimization is "balanced"
)
```

## Model Pricing
Current pricing per 1M tokens:
| Model | Input | Output | Speed |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Medium |
| GPT-4o-mini | $0.15 | $0.60 | Fast |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Medium |
| Claude 3 Haiku | $0.25 | $1.25 | Fast |
| Gemini 1.5 Pro | $1.25 | $5.00 | Medium |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fast |
| Mistral Large | $2.00 | $6.00 | Medium |
| Mistral Small | $0.20 | $0.60 | Fast |
See the Pricing page for the complete list.
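As a quick sanity check, the rates in the table above can be turned into a small cost estimator. The prices are the per-1M-token figures from the table; the helper itself is illustrative and not part of the GateFlow SDK:

```python
# Per-1M-token prices (USD) from the table above: (input, output)
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
    "mistral-large": (2.00, 6.00),
    "mistral-small": (0.20, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from token counts."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 1,000-input / 500-output request on GPT-4o vs GPT-4o-mini:
print(estimate_cost("gpt-4o", 1000, 500))       # 0.0075
print(estimate_cost("gpt-4o-mini", 1000, 500))  # 0.00045
```

The same request is roughly 17x cheaper on GPT-4o-mini, which is why right-sizing models (below) matters.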
## Cost Reduction Strategies

### 1. Use Semantic Caching

Enable caching to avoid repeat API calls:

```python
# First request: hits the provider, response is cached
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}]
)

# Semantically similar request: served from the cache, free
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain Python to me"}]
)
```

Typical cache hit rates are 20-60%, depending on use case.
### 2. Right-Size Your Models
Use smaller models for simpler tasks:
| Task | Recommended Model | Cost Savings vs GPT-4o |
|---|---|---|
| Simple Q&A | GPT-4o-mini | 94% |
| Summarization | Claude 3 Haiku | 92% |
| Translation | Gemini Flash | 97% |
| Classification | Mistral Small | 92% |
| Complex reasoning | GPT-4o | Baseline |
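If you classify tasks yourself, the table above can be applied client-side with a simple lookup. The task labels and mapping here are illustrative assumptions; in practice, `model="auto"` lets GateFlow make this decision for you:

```python
# Illustrative task-to-model mapping based on the table above
MODEL_FOR_TASK = {
    "simple_qa": "gpt-4o-mini",
    "summarization": "claude-3-haiku",
    "translation": "gemini-1.5-flash",
    "classification": "mistral-small",
    "complex_reasoning": "gpt-4o",
}

def pick_model(task: str) -> str:
    """Fall back to the premium model when the task type is unknown."""
    return MODEL_FOR_TASK.get(task, "gpt-4o")

print(pick_model("translation"))  # gemini-1.5-flash
print(pick_model("unknown"))      # gpt-4o
```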
### 3. Optimize Prompts

Shorter prompts = lower costs:

```python
# Expensive: ~150 tokens
messages = [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Can you please help me by..."}
]

# Cheaper: ~30 tokens
messages = [
    {"role": "user", "content": "Summarize: [text]"}
]
```

### 4. Limit Output Tokens
Set `max_tokens` to prevent runaway responses:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    max_tokens=500  # Cap output length
)
```

### 5. Use Streaming Wisely
Streaming doesn't reduce cost by itself, but it lets you abort a response early:

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if should_abort(content):
        break  # Stop early; tokens generated so far are still billed
```

## Cost Limits

### Per-Key Limits
```bash
curl -X PATCH https://api.gateflow.ai/v1/management/api-keys/key_123 \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "permissions": {
      "cost_limit_daily": 100.00,
      "cost_limit_monthly": 2000.00
    }
  }'
```

### Per-Organization Limits
```bash
curl -X PATCH https://api.gateflow.ai/v1/management/organization/settings \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "cost_limits": {
      "daily": 500.00,
      "monthly": 10000.00,
      "alert_threshold": 0.8
    }
  }'
```

## Cost Alerts
Get notified before hitting limits:

```bash
curl -X POST https://api.gateflow.ai/v1/management/alerts \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Daily cost warning",
    "condition": {
      "metric": "cost_daily",
      "operator": "gt",
      "threshold": 80.00
    },
    "notify": {
      "channels": ["slack", "email"]
    }
  }'
```

## Latency Optimization
### Monitor Provider Latency

```bash
curl https://api.gateflow.ai/v1/management/providers/health \
  -H "Authorization: Bearer gw_prod_admin_key"
```

The response includes latency percentiles:

```json
{
  "providers": [
    {
      "provider": "openai",
      "latency_p50_ms": 230,
      "latency_p95_ms": 560,
      "latency_p99_ms": 890
    }
  ]
}
```

### Use Faster Models
| Speed Tier | Models | Typical Latency |
|---|---|---|
| Fastest | GPT-4o-mini, Gemini Flash, Haiku | 200-400ms |
| Fast | GPT-4o, Claude Sonnet | 400-800ms |
| Standard | Claude Opus, Gemini Pro | 800-1500ms |
### Geographic Routing
GateFlow routes to the nearest provider region when available.
## Analytics Dashboard
Track optimization metrics:

- **Cost per request** - average and percentiles
- **Cost by model** - which models drive spend
- **Cache hit rate** - savings from caching
- **Latency distribution** - response-time percentiles
- **Token usage** - input vs. output tokens
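The cache hit rate and cost-per-request metrics combine into a back-of-the-envelope savings estimate. This helper is illustrative, not a GateFlow API; the traffic numbers are example assumptions:

```python
def estimated_cache_savings(requests: int, hit_rate: float,
                            avg_cost_per_request: float) -> float:
    """Cached hits cost nothing, so savings = hits * average provider cost."""
    return requests * hit_rate * avg_cost_per_request

# Example: 100k requests/month, 40% hit rate, $0.002 average provider cost
print(round(estimated_cache_savings(100_000, 0.40, 0.002), 2))  # 80.0
```

Around $80/month saved under these assumptions; plug in your own dashboard numbers.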
## Optimization Presets

Save common configurations:

```bash
curl -X POST https://api.gateflow.ai/v1/management/presets \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cost-conscious",
    "settings": {
      "optimization": "cost",
      "cache_enabled": true,
      "fallbacks": ["gpt-4o-mini", "claude-3-haiku"],
      "max_tokens_default": 500
    }
  }'
```

Apply to requests:
```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "preset": "cost-conscious"
        }
    }
)
```

## Next Steps
- Semantic Caching - Deep dive into caching
- Cost Analytics - Detailed cost tracking