# Cost vs Performance Optimization
GateFlow helps you balance cost and performance for your AI workloads. This guide covers optimization strategies.
## Optimization Modes
### Cost Mode

Minimize spend while meeting quality requirements:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "cost"
        }
    }
)
```

GateFlow selects the cheapest model that can handle the task.
### Latency Mode

Minimize response time:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "latency"
        }
    }
)
```

GateFlow selects based on current provider latency and model speed.
### Quality Mode

Maximize output quality:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "quality"
        }
    }
)
```

Routes to premium models like GPT-4o and Claude Opus.
### Balanced Mode (Default)

Balance all factors:

```python
response = client.chat.completions.create(
    model="auto",
    messages=[...]
    # Default optimization is "balanced"
)
```

## Model Pricing
Current pricing per 1M tokens:
| Model | Input | Output | Speed |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Medium |
| GPT-4o-mini | $0.15 | $0.60 | Fast |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Medium |
| Claude 3 Haiku | $0.25 | $1.25 | Fast |
| Gemini 1.5 Pro | $1.25 | $5.00 | Medium |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fast |
| Mistral Large | $2.00 | $6.00 | Medium |
| Mistral Small | $0.20 | $0.60 | Fast |
See the Pricing page for the complete list.
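As a quick sanity check, the rates in the table above can be turned into a small cost estimator. The prices are the per-1M-token figures from the table; the helper itself is illustrative and not part of the GateFlow SDK:

```python
# Per-1M-token prices (USD) from the table above: (input, output)
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
    "mistral-large": (2.00, 6.00),
    "mistral-small": (0.20, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from token counts."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 1,000-input / 500-output request on GPT-4o vs GPT-4o-mini:
print(estimate_cost("gpt-4o", 1000, 500))       # 0.0075
print(estimate_cost("gpt-4o-mini", 1000, 500))  # 0.00045
```

The same request is roughly 17x cheaper on GPT-4o-mini, which is why right-sizing models (below) matters.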
## Cost Reduction Strategies

### 1. Use Semantic Caching

Enable caching to avoid repeat API calls:

```python
# First request: hits the provider, response is cached
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}]
)

# Semantically similar request: served from the cache, free
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain Python to me"}]
)
```

Typical cache hit rates are 20-60%, depending on use case.
### 2. Right-Size Your Models
Use smaller models for simpler tasks:
| Task | Recommended Model | Cost Savings vs GPT-4o |
|---|---|---|
| Simple Q&A | GPT-4o-mini | 94% |
| Summarization | Claude 3 Haiku | 92% |
| Translation | Gemini Flash | 97% |
| Classification | Mistral Small | 92% |
| Complex reasoning | GPT-4o | Baseline |
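If you classify tasks yourself, the table above can be applied client-side with a simple lookup. The task labels and mapping here are illustrative assumptions; in practice, `model="auto"` lets GateFlow make this decision for you:

```python
# Illustrative task-to-model mapping based on the table above
MODEL_FOR_TASK = {
    "simple_qa": "gpt-4o-mini",
    "summarization": "claude-3-haiku",
    "translation": "gemini-1.5-flash",
    "classification": "mistral-small",
    "complex_reasoning": "gpt-4o",
}

def pick_model(task: str) -> str:
    """Fall back to the premium model when the task type is unknown."""
    return MODEL_FOR_TASK.get(task, "gpt-4o")

print(pick_model("translation"))  # gemini-1.5-flash
print(pick_model("unknown"))      # gpt-4o
```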
### 3. Optimize Prompts

Shorter prompts = lower costs:

```python
# Expensive: ~150 tokens
messages = [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Can you please help me by..."}
]

# Cheaper: ~30 tokens
messages = [
    {"role": "user", "content": "Summarize: [text]"}
]
```

### 4. Limit Output Tokens
Set `max_tokens` to prevent runaway responses:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    max_tokens=500  # Cap output length
)
```

### 5. Use Streaming Wisely
Streaming doesn't reduce cost by itself, but it lets you abort a response early:

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if should_abort(content):
        break  # Stop early; tokens generated so far are still billed
```

## Cost Limits

### Per-Key Limits
```bash
curl -X PATCH https://api.gateflow.ai/v1/management/api-keys/key_123 \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "permissions": {
      "cost_limit_daily": 100.00,
      "cost_limit_monthly": 2000.00
    }
  }'
```

### Per-Organization Limits
```bash
curl -X PATCH https://api.gateflow.ai/v1/management/organization/settings \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "cost_limits": {
      "daily": 500.00,
      "monthly": 10000.00,
      "alert_threshold": 0.8
    }
  }'
```

## Cost Alerts
Get notified before hitting limits:

```bash
curl -X POST https://api.gateflow.ai/v1/management/alerts \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Daily cost warning",
    "condition": {
      "metric": "cost_daily",
      "operator": "gt",
      "threshold": 80.00
    },
    "notify": {
      "channels": ["slack", "email"]
    }
  }'
```

## Latency Optimization
### Monitor Provider Latency

```bash
curl https://api.gateflow.ai/v1/management/providers/health \
  -H "Authorization: Bearer gw_prod_admin_key"
```

The response includes latency percentiles:

```json
{
  "providers": [
    {
      "provider": "openai",
      "latency_p50_ms": 230,
      "latency_p95_ms": 560,
      "latency_p99_ms": 890
    }
  ]
}
```

### Use Faster Models
| Speed Tier | Models | Typical Latency |
|---|---|---|
| Fastest | GPT-4o-mini, Gemini Flash, Haiku | 200-400ms |
| Fast | GPT-4o, Claude Sonnet | 400-800ms |
| Standard | Claude Opus, Gemini Pro | 800-1500ms |
### Geographic Routing
GateFlow routes to the nearest provider region when available.
## Analytics Dashboard
Track optimization metrics:

- **Cost per request** - average and percentiles
- **Cost by model** - which models drive spend
- **Cache hit rate** - savings from caching
- **Latency distribution** - response-time percentiles
- **Token usage** - input vs. output tokens
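The cache hit rate and cost-per-request metrics combine into a back-of-the-envelope savings estimate. This helper is illustrative, not a GateFlow API; the traffic numbers are example assumptions:

```python
def estimated_cache_savings(requests: int, hit_rate: float,
                            avg_cost_per_request: float) -> float:
    """Cached hits cost nothing, so savings = hits * average provider cost."""
    return requests * hit_rate * avg_cost_per_request

# Example: 100k requests/month, 40% hit rate, $0.002 average provider cost
print(round(estimated_cache_savings(100_000, 0.40, 0.002), 2))  # 80.0
```

Around $80/month saved under these assumptions; plug in your own dashboard numbers.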
## Optimization Presets

Save common configurations:

```bash
curl -X POST https://api.gateflow.ai/v1/management/presets \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cost-conscious",
    "settings": {
      "optimization": "cost",
      "cache_enabled": true,
      "fallbacks": ["gpt-4o-mini", "claude-3-haiku"],
      "max_tokens_default": 500
    }
  }'
```

Apply to requests:
```python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "preset": "cost-conscious"
        }
    }
)
```

## Next Steps
- Semantic Caching - Deep dive into caching
- Cost Analytics - Detailed cost tracking