
Cost vs Performance Optimization

GateFlow helps you balance cost and performance for your AI workloads. This guide covers optimization strategies.

Optimization Modes

Cost Mode

Minimize spend while meeting quality requirements:

python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "cost"
        }
    }
)

GateFlow selects the cheapest model that can handle the task.

Latency Mode

Minimize response time:

python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "latency"
        }
    }
)

GateFlow selects based on current provider latency and model speed.

Quality Mode

Maximize output quality:

python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "optimization": "quality"
        }
    }
)

GateFlow routes to premium models such as GPT-4o and Claude Opus.

Balanced Mode (Default)

Balance all factors:

python
response = client.chat.completions.create(
    model="auto",
    messages=[...]
    # Default optimization is "balanced"
)

Model Pricing

Current pricing per 1M tokens:

| Model | Input | Output | Speed |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Medium |
| GPT-4o-mini | $0.15 | $0.60 | Fast |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Medium |
| Claude 3 Haiku | $0.25 | $1.25 | Fast |
| Gemini 1.5 Pro | $1.25 | $5.00 | Medium |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fast |
| Mistral Large | $2.00 | $6.00 | Medium |
| Mistral Small | $0.20 | $0.60 | Fast |

See Pricing for the complete list.
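
As a rough sketch, per-request cost can be estimated from the pricing table above (prices are per 1M tokens). The helper below is illustrative only, not part of the GateFlow SDK:

```python
# Per-1M-token prices (USD) from the pricing table above.
PRICING = {
    "gpt-4o":            {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini":       {"input": 0.15,  "output": 0.60},
    "claude-3.5-sonnet": {"input": 3.00,  "output": 15.00},
    "claude-3-haiku":    {"input": 0.25,  "output": 1.25},
    "gemini-1.5-pro":    {"input": 1.25,  "output": 5.00},
    "gemini-1.5-flash":  {"input": 0.075, "output": 0.30},
    "mistral-large":     {"input": 2.00,  "output": 6.00},
    "mistral-small":     {"input": 0.20,  "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request from its token counts."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with 1,000 input and 300 output tokens on GPT-4o:
print(round(estimate_cost("gpt-4o", 1000, 300), 4))  # → 0.0055
```

Token counts for a completed request are available on the usage field of the response, so the same arithmetic works for post-hoc accounting.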

Cost Reduction Strategies

1. Use Semantic Caching

Enable caching to avoid repeat API calls:

python
# First request: hits provider, cached
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}]
)

# Similar request: hits cache, free
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain Python to me"}]
)

Typical cache hit rates: 20-60% depending on use case.

2. Right-Size Your Models

Use smaller models for simpler tasks:

| Task | Recommended Model | Cost Savings vs GPT-4o |
|---|---|---|
| Simple Q&A | GPT-4o-mini | 94% |
| Summarization | Claude 3 Haiku | 92% |
| Translation | Gemini Flash | 97% |
| Classification | Mistral Small | 92% |
| Complex reasoning | GPT-4o | Baseline |
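
The savings figures above follow from the pricing table; for example, assuming a hypothetical request with 1,000 input and 300 output tokens, the GPT-4o-mini figure works out like this:

```python
def cost(input_price: float, output_price: float,
         input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Per-1M-token prices from the pricing table.
gpt_4o = cost(2.50, 10.00, 1000, 300)      # 0.0055
gpt_4o_mini = cost(0.15, 0.60, 1000, 300)  # 0.00033

savings = 1 - gpt_4o_mini / gpt_4o
print(f"{savings:.0%}")  # → 94%
```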

3. Optimize Prompts

Shorter prompts = lower costs:

python
# Expensive: 150 tokens
messages=[
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Can you please help me by..."}
]

# Cheaper: 30 tokens
messages=[
    {"role": "user", "content": "Summarize: [text]"}
]

4. Limit Output Tokens

Set max_tokens to prevent runaway responses:

python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    max_tokens=500  # Cap output
)

5. Use Streaming Wisely

Streaming doesn't save money, but lets you abort early:

python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if should_abort(content):
        break  # Stop processing, but tokens already used

Cost Limits

Per-Key Limits

bash
curl -X PATCH https://api.gateflow.ai/v1/management/api-keys/key_123 \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "permissions": {
      "cost_limit_daily": 100.00,
      "cost_limit_monthly": 2000.00
    }
  }'

Per-Organization Limits

bash
curl -X PATCH https://api.gateflow.ai/v1/management/organization/settings \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "cost_limits": {
      "daily": 500.00,
      "monthly": 10000.00,
      "alert_threshold": 0.8
    }
  }'

Cost Alerts

Get notified before hitting limits:

bash
curl -X POST https://api.gateflow.ai/v1/management/alerts \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Daily cost warning",
    "condition": {
      "metric": "cost_daily",
      "operator": "gt",
      "threshold": 80.00
    },
    "notify": {
      "channels": ["slack", "email"]
    }
  }'

Latency Optimization

Monitor Provider Latency

bash
curl https://api.gateflow.ai/v1/management/providers/health \
  -H "Authorization: Bearer gw_prod_admin_key"

Response includes latency percentiles:

json
{
  "providers": [
    {
      "provider": "openai",
      "latency_p50_ms": 230,
      "latency_p95_ms": 560,
      "latency_p99_ms": 890
    }
  ]
}
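
Given a response of this shape, you can compare providers yourself, for example by picking the one with the lowest p95 latency. The second provider entry below is hypothetical, added only to make the comparison meaningful:

```python
# A providers/health payload like the one above (second entry hypothetical).
health = {
    "providers": [
        {"provider": "openai", "latency_p50_ms": 230,
         "latency_p95_ms": 560, "latency_p99_ms": 890},
        {"provider": "anthropic", "latency_p50_ms": 310,
         "latency_p95_ms": 720, "latency_p99_ms": 1100},
    ]
}

# Pick the provider with the lowest p95 latency.
fastest = min(health["providers"], key=lambda p: p["latency_p95_ms"])
print(fastest["provider"])  # → openai
```

p95 is usually a better sorting key than p50 here, since tail latency is what users notice.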

Use Faster Models

| Speed Tier | Models | Typical Latency |
|---|---|---|
| Fastest | GPT-4o-mini, Gemini Flash, Haiku | 200-400ms |
| Fast | GPT-4o, Claude Sonnet | 400-800ms |
| Standard | Claude Opus, Gemini Pro | 800-1500ms |

Geographic Routing

GateFlow routes to the nearest provider region when available.

Analytics Dashboard

Track optimization metrics:

  • Cost per request - Average and percentiles
  • Cost by model - Which models drive spend
  • Cache hit rate - Savings from caching
  • Latency distribution - Response time percentiles
  • Token usage - Input vs output tokens
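
These metrics compose directly into savings estimates: since cached requests cost nothing, savings from caching are roughly your gross spend times the cache hit rate. The figures below are hypothetical:

```python
# Hypothetical monthly figures, read off the dashboard metrics above.
requests_per_month = 1_000_000
avg_cost_per_request = 0.002   # USD, from "cost per request"
cache_hit_rate = 0.40          # from "cache hit rate"

gross_cost = requests_per_month * avg_cost_per_request
savings = gross_cost * cache_hit_rate  # cached requests cost nothing
print(f"${gross_cost:,.0f} gross, ${savings:,.0f} saved")  # → $2,000 gross, $800 saved
```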

Optimization Presets

Save common configurations:

bash
curl -X POST https://api.gateflow.ai/v1/management/presets \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cost-conscious",
    "settings": {
      "optimization": "cost",
      "cache_enabled": true,
      "fallbacks": ["gpt-4o-mini", "claude-3-haiku"],
      "max_tokens_default": 500
    }
  }'

Apply to requests:

python
response = client.chat.completions.create(
    model="auto",
    messages=[...],
    extra_body={
        "gateflow": {
            "preset": "cost-conscious"
        }
    }
)
