# Cache Configuration

Advanced configuration options for GateFlow's semantic caching system.
## Configuration Overview

```json
{
  "semantic_cache": {
    "enabled": true,
    "threshold": 0.95,
    "ttl_seconds": 86400,
    "max_entries": 1000000,
    "embedding_model": "text-embedding-3-small",
    "scope": {
      "include_model": true,
      "include_temperature": true,
      "include_system_prompt": true
    },
    "exclusions": {
      "models": [],
      "paths": [],
      "headers": {}
    }
  }
}
```

## Threshold Tuning
### Finding the Right Threshold

```bash
# Analyze query similarity distribution
curl https://api.gateflow.ai/v1/management/cache/similarity-analysis \
  -H "Authorization: Bearer gw_prod_admin_key"
```

Response:

```json
{
  "distribution": {
    "0.99-1.00": 0.15,
    "0.95-0.99": 0.25,
    "0.90-0.95": 0.20,
    "0.85-0.90": 0.15,
    "below_0.85": 0.25
  },
  "recommendation": {
    "threshold": 0.95,
    "expected_hit_rate": 0.40,
    "quality_score": "high"
  }
}
```

### Threshold Impact
| Threshold | Hit Rate | Quality Risk |
|---|---|---|
| 0.99 | ~15% | Very Low |
| 0.95 | ~40% | Low |
| 0.90 | ~60% | Medium |
| 0.85 | ~75% | High |
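The trade-off in the table can be pictured as a simple gate: a cached entry is reused only when the cosine similarity between the incoming query's embedding and a stored entry's embedding clears the threshold. A minimal sketch of that decision (toy two-dimensional vectors stand in for real embeddings; this is not GateFlow's internal implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, entries, threshold=0.95):
    """Return the closest cached response at or above the threshold, else None."""
    best_sim, best_entry = -1.0, None
    for entry in entries:
        sim = cosine_similarity(query_embedding, entry["embedding"])
        if sim > best_sim:
            best_sim, best_entry = sim, entry
    return best_entry["response"] if best_entry and best_sim >= threshold else None

entries = [{"embedding": [1.0, 0.0], "response": "cached answer"}]
hit = lookup([0.99, 0.14], entries)   # similarity ~0.99 -> returns the cached response
miss = lookup([0.5, 0.87], entries)   # similarity ~0.50 -> returns None
```

Lowering the threshold widens the gate: more near-matches count as hits, at the cost of occasionally serving an answer to a subtly different question.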
### Per-Model Thresholds

Different models may need different thresholds:

```json
{
  "semantic_cache": {
    "model_thresholds": {
      "gpt-4o": 0.95,
      "gpt-4o-mini": 0.92,
      "claude-3-haiku": 0.93
    }
  }
}
```

## TTL Strategies
### Static TTL

All entries expire after the same duration:

```json
{
  "ttl_seconds": 86400 // 24 hours
}
```

### Dynamic TTL
Vary the TTL based on query characteristics:

```json
{
  "ttl_strategy": "dynamic",
  "ttl_rules": [
    {
      "condition": {"contains": ["today", "now", "current"]},
      "ttl_seconds": 3600 // 1 hour
    },
    {
      "condition": {"model": "gpt-4o-mini"},
      "ttl_seconds": 604800 // 1 week
    },
    {
      "condition": {"default": true},
      "ttl_seconds": 86400 // 1 day
    }
  ]
}
```

### Sliding TTL
Reset the TTL on each cache hit:

```json
{
  "ttl_strategy": "sliding",
  "ttl_seconds": 86400,
  "max_age_seconds": 604800 // Maximum 1 week regardless
}
```

## Cache Scope
### Model Scoping

Cache per model (default: true):

```json
{
  "scope": {
    "include_model": true
  }
}
```

If false, queries across models share the cache:

- Pro: Higher hit rate
- Con: May return the wrong model's style
### Temperature Scoping

Cache per temperature setting:

```json
{
  "scope": {
    "include_temperature": true
  }
}
```

Considerations:

- `temperature: 0` → deterministic, safe to share
- `temperature: 0.7+` → variable, consider scoping
### System Prompt Scoping

Cache per system prompt:

```json
{
  "scope": {
    "include_system_prompt": true
  }
}
```

If false, the same query with different system prompts shares the cache.
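Taken together, the scope settings decide which request attributes become part of the exact-match portion of the cache key. The sketch below shows one way such a key could be derived; the field names mirror the config above, but the hashing scheme is illustrative, not GateFlow's actual implementation:

```python
import hashlib
import json

def cache_key(request: dict, scope: dict) -> str:
    """Build a cache key from the request attributes selected by the scope config."""
    parts = {}
    if scope.get("include_model", True):
        parts["model"] = request["model"]
    if scope.get("include_temperature", True):
        parts["temperature"] = request.get("temperature", 1.0)
    if scope.get("include_system_prompt", True):
        system = [m["content"] for m in request["messages"] if m["role"] == "system"]
        parts["system_prompt"] = "\n".join(system)
    # Stable serialization so identical scopes always hash identically
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

req = {
    "model": "gpt-4o",
    "temperature": 0,
    "messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "What is Python?"},
    ],
}
scope = {"include_model": True, "include_temperature": True, "include_system_prompt": True}
key = cache_key(req, scope)
```

Flipping `include_model` to false drops the model from the key, so requests to different models collapse into one bucket, which is exactly the higher-hit-rate versus style-bleed trade-off noted under Model Scoping.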
### Custom Scope Keys

Add custom dimensions to the cache key:

```json
{
  "scope": {
    "custom_keys": ["user_locale", "app_version"]
  }
}
```

Pass them in the request:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache_scope": {
                "user_locale": "en-US",
                "app_version": "2.1"
            }
        }
    }
)
```

## Exclusions
### Exclude Models

```json
{
  "exclusions": {
    "models": ["gpt-4-vision", "whisper-1"]
  }
}
```

### Exclude by Header
```json
{
  "exclusions": {
    "headers": {
      "X-No-Cache": "true",
      "X-User-Type": "premium"
    }
  }
}
```

### Exclude by Pattern
```json
{
  "exclusions": {
    "message_patterns": [
      "order.*status",
      "account.*balance",
      "\\$[0-9]+"
    ]
  }
}
```

## Storage Configuration
Maximum Entries
json
{
"max_entries": 1000000
}When limit is reached, oldest entries are evicted (LRU).
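The eviction behaviour can be modelled with an ordered map: reads move an entry to the recently-used end, and a write that exceeds `max_entries` drops the entry at the stale end. A toy model, not GateFlow's storage engine:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU store: evicts the least recently used entry past max_entries."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key: str, value: str):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(max_entries=2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")       # "a" is now the most recently used
cache.put("c", "3")  # evicts "b", the least recently used
```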
### Entry Size Limits

```json
{
  "max_entry_size_tokens": 8000,
  "max_response_size_tokens": 4000
}
```

Responses larger than these limits are not cached.
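Enforcing these limits amounts to a pre-write check on token counts. A minimal sketch, with whitespace splitting as a crude stand-in for the model's real tokenizer:

```python
MAX_ENTRY_SIZE_TOKENS = 8000
MAX_RESPONSE_SIZE_TOKENS = 4000

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer such as the model's own
    return len(text.split())

def should_cache(prompt: str, response: str) -> bool:
    """Skip caching when either side exceeds its configured token limit."""
    return (count_tokens(prompt) <= MAX_ENTRY_SIZE_TOKENS
            and count_tokens(response) <= MAX_RESPONSE_SIZE_TOKENS)

ok = should_cache("What is Python?", "A programming language.")  # True
```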
## Embedding Configuration

### Embedding Model

```json
{
  "embedding_model": "text-embedding-3-small"
}
```

Options:

- `text-embedding-3-small` - Fast, good for most cases
- `text-embedding-3-large` - Higher-quality similarity
- `text-embedding-ada-002` - Legacy, for compatibility
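A practical note on comparing these vectors: embedding models of this family typically return unit-length vectors, in which case cosine similarity reduces to a plain dot product and the norm computation can be skipped during lookup (toy unit vectors below; verify the normalization guarantee against your embedding provider's documentation):

```python
def dot(a, b):
    # For unit-length embedding vectors, the dot product
    # equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

a = [0.6, 0.8]  # toy unit vectors standing in for real embeddings
b = [0.8, 0.6]
similarity = dot(a, b)  # 0.96
```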
### Embedding Scope

What to embed for similarity:

```json
{
  "embedding_scope": {
    "include_system": false, // Don't embed the system prompt
    "include_history": true, // Embed conversation history
    "max_history_turns": 3 // Only the last 3 turns
  }
}
```

## Cache Warming
### Automatic Warming

Pre-cache popular queries:

```json
{
  "warming": {
    "enabled": true,
    "sources": ["popular_queries", "predefined_list"],
    "schedule": "0 0 * * *" // Daily at midnight
  }
}
```

### Manual Warming
```bash
curl -X POST https://api.gateflow.ai/v1/management/cache/warm \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "file",
    "url": "s3://bucket/popular-queries.jsonl",
    "model": "gpt-4o"
  }'
```

## Monitoring
### Cache Metrics

```bash
curl https://api.gateflow.ai/v1/management/cache/metrics \
  -H "Authorization: Bearer gw_prod_admin_key"
```

Response:

```json
{
  "entries": 456789,
  "storage_mb": 1234,
  "hit_rate_1h": 0.42,
  "hit_rate_24h": 0.38,
  "avg_similarity_hit": 0.97,
  "evictions_24h": 1234
}
```

### Alerts
```json
{
  "alerts": [
    {
      "name": "Low cache hit rate",
      "condition": {"metric": "hit_rate_1h", "lt": 0.2},
      "notify": ["slack"]
    },
    {
      "name": "Cache storage high",
      "condition": {"metric": "storage_percentage", "gt": 0.9},
      "notify": ["email"]
    }
  ]
}
```

## Debugging
### Test Similarity

```bash
curl -X POST https://api.gateflow.ai/v1/management/cache/test-similarity \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "query_a": "What is Python?",
    "query_b": "Explain Python programming"
  }'
```

Response:

```json
{
  "similarity": 0.94,
  "would_hit_cache": false,
  "threshold": 0.95
}
```

### Cache Debug Mode
Enable detailed cache logging:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache_debug": True
        }
    }
)
```

The response includes:

```json
{
  "gateflow": {
    "cache_debug": {
      "query_embedding": "[0.123, -0.456, ...]",
      "nearest_matches": [
        {"similarity": 0.93, "query": "Tell me about Python"},
        {"similarity": 0.89, "query": "Python overview"}
      ],
      "decision": "miss",
      "reason": "no_match_above_threshold"
    }
  }
}
```

## Next Steps
- Cost Analytics - Track cache savings
- Semantic Caching - Caching overview