# Cache Configuration

Advanced configuration options for GateFlow's semantic caching system.
## Configuration Overview

```json
{
  "semantic_cache": {
    "enabled": true,
    "threshold": 0.95,
    "ttl_seconds": 86400,
    "max_entries": 1000000,
    "embedding_model": "text-embedding-3-small",
    "scope": {
      "include_model": true,
      "include_temperature": true,
      "include_system_prompt": true
    },
    "exclusions": {
      "models": [],
      "paths": [],
      "headers": {}
    }
  }
}
```

## Threshold Tuning
### Finding the Right Threshold

```bash
# Analyze query similarity distribution
curl https://api.gateflow.ai/v1/management/cache/similarity-analysis \
  -H "Authorization: Bearer gw_prod_admin_key"
```

Response:

```json
{
  "distribution": {
    "0.99-1.00": 0.15,
    "0.95-0.99": 0.25,
    "0.90-0.95": 0.20,
    "0.85-0.90": 0.15,
    "below_0.85": 0.25
  },
  "recommendation": {
    "threshold": 0.95,
    "expected_hit_rate": 0.40,
    "quality_score": "high"
  }
}
```

### Threshold Impact
| Threshold | Hit Rate | Quality Risk |
|---|---|---|
| 0.99 | ~15% | Very Low |
| 0.95 | ~40% | Low |
| 0.90 | ~60% | Medium |
| 0.85 | ~75% | High |
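The trade-off in the table can be pictured as a simple gate: a cached entry is reused only when the cosine similarity between the incoming query's embedding and a stored entry's embedding clears the threshold. A minimal sketch of that decision (toy two-dimensional vectors stand in for real embeddings; this is not GateFlow's internal implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, entries, threshold=0.95):
    """Return the closest cached response at or above the threshold, else None."""
    best_sim, best_entry = -1.0, None
    for entry in entries:
        sim = cosine_similarity(query_embedding, entry["embedding"])
        if sim > best_sim:
            best_sim, best_entry = sim, entry
    return best_entry["response"] if best_entry and best_sim >= threshold else None

entries = [{"embedding": [1.0, 0.0], "response": "cached answer"}]
hit = lookup([0.99, 0.14], entries)   # similarity ~0.99 -> returns the cached response
miss = lookup([0.5, 0.87], entries)   # similarity ~0.50 -> returns None
```

Lowering the threshold widens the gate: more near-matches count as hits, at the cost of occasionally serving an answer to a subtly different question.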
### Per-Model Thresholds

Different models may need different thresholds:

```json
{
  "semantic_cache": {
    "model_thresholds": {
      "gpt-4o": 0.95,
      "gpt-4o-mini": 0.92,
      "claude-3-haiku": 0.93
    }
  }
}
```

## TTL Strategies
### Static TTL

All entries expire after the same duration:

```json
{
  "ttl_seconds": 86400 // 24 hours
}
```

### Dynamic TTL
Vary the TTL based on query characteristics:

```json
{
  "ttl_strategy": "dynamic",
  "ttl_rules": [
    {
      "condition": {"contains": ["today", "now", "current"]},
      "ttl_seconds": 3600 // 1 hour
    },
    {
      "condition": {"model": "gpt-4o-mini"},
      "ttl_seconds": 604800 // 1 week
    },
    {
      "condition": {"default": true},
      "ttl_seconds": 86400 // 1 day
    }
  ]
}
```

### Sliding TTL
Reset the TTL on each cache hit:

```json
{
  "ttl_strategy": "sliding",
  "ttl_seconds": 86400,
  "max_age_seconds": 604800 // Maximum 1 week regardless
}
```

## Cache Scope
### Model Scoping

Cache per model (default: true):

```json
{
  "scope": {
    "include_model": true
  }
}
```

If false, queries across models share the cache:

- Pro: Higher hit rate
- Con: May return the wrong model's style
### Temperature Scoping

Cache per temperature setting:

```json
{
  "scope": {
    "include_temperature": true
  }
}
```

Considerations:

- `temperature: 0` → deterministic, safe to share
- `temperature: 0.7+` → variable, consider scoping
### System Prompt Scoping

Cache per system prompt:

```json
{
  "scope": {
    "include_system_prompt": true
  }
}
```

If false, the same query with different system prompts shares the cache.
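Taken together, the scope settings decide which request attributes become part of the exact-match portion of the cache key. The sketch below shows one way such a key could be derived; the field names mirror the config above, but the hashing scheme is illustrative, not GateFlow's actual implementation:

```python
import hashlib
import json

def cache_key(request: dict, scope: dict) -> str:
    """Build a cache key from the request attributes selected by the scope config."""
    parts = {}
    if scope.get("include_model", True):
        parts["model"] = request["model"]
    if scope.get("include_temperature", True):
        parts["temperature"] = request.get("temperature", 1.0)
    if scope.get("include_system_prompt", True):
        system = [m["content"] for m in request["messages"] if m["role"] == "system"]
        parts["system_prompt"] = "\n".join(system)
    # Stable serialization so identical scopes always hash identically
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

req = {
    "model": "gpt-4o",
    "temperature": 0,
    "messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "What is Python?"},
    ],
}
scope = {"include_model": True, "include_temperature": True, "include_system_prompt": True}
key = cache_key(req, scope)
```

Flipping `include_model` to false drops the model from the key, so requests to different models collapse into one bucket, which is exactly the higher-hit-rate versus style-bleed trade-off noted under Model Scoping.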
### Custom Scope Keys

Add custom dimensions to the cache key:

```json
{
  "scope": {
    "custom_keys": ["user_locale", "app_version"]
  }
}
```

Pass them in the request:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache_scope": {
                "user_locale": "en-US",
                "app_version": "2.1"
            }
        }
    }
)
```

## Exclusions
### Exclude Models

```json
{
  "exclusions": {
    "models": ["gpt-4-vision", "whisper-1"]
  }
}
```

### Exclude by Header
```json
{
  "exclusions": {
    "headers": {
      "X-No-Cache": "true",
      "X-User-Type": "premium"
    }
  }
}
```

### Exclude by Pattern
```json
{
  "exclusions": {
    "message_patterns": [
      "order.*status",
      "account.*balance",
      "\\$[0-9]+"
    ]
  }
}
```

## Storage Configuration
Maximum Entries
json
{
"max_entries": 1000000
}When limit is reached, oldest entries are evicted (LRU).
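The eviction behaviour can be modelled with an ordered map: reads move an entry to the recently-used end, and a write that exceeds `max_entries` drops the entry at the stale end. A toy model, not GateFlow's storage engine:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU store: evicts the least recently used entry past max_entries."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key: str, value: str):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(max_entries=2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")       # "a" is now the most recently used
cache.put("c", "3")  # evicts "b", the least recently used
```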
### Entry Size Limits

```json
{
  "max_entry_size_tokens": 8000,
  "max_response_size_tokens": 4000
}
```

Responses larger than these limits are not cached.
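Enforcing these limits amounts to a pre-write check on token counts. A minimal sketch, with whitespace splitting as a crude stand-in for the model's real tokenizer:

```python
MAX_ENTRY_SIZE_TOKENS = 8000
MAX_RESPONSE_SIZE_TOKENS = 4000

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer such as the model's own
    return len(text.split())

def should_cache(prompt: str, response: str) -> bool:
    """Skip caching when either side exceeds its configured token limit."""
    return (count_tokens(prompt) <= MAX_ENTRY_SIZE_TOKENS
            and count_tokens(response) <= MAX_RESPONSE_SIZE_TOKENS)

ok = should_cache("What is Python?", "A programming language.")  # True
```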
## Embedding Configuration

### Embedding Model

```json
{
  "embedding_model": "text-embedding-3-small"
}
```

Options:

- `text-embedding-3-small` - Fast, good for most cases
- `text-embedding-3-large` - Higher-quality similarity
- `text-embedding-ada-002` - Legacy, for compatibility
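A practical note on comparing these vectors: embedding models of this family typically return unit-length vectors, in which case cosine similarity reduces to a plain dot product and the norm computation can be skipped during lookup (toy unit vectors below; verify the normalization guarantee against your embedding provider's documentation):

```python
def dot(a, b):
    # For unit-length embedding vectors, the dot product
    # equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

a = [0.6, 0.8]  # toy unit vectors standing in for real embeddings
b = [0.8, 0.6]
similarity = dot(a, b)  # 0.96
```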
### Embedding Scope

What to embed for similarity:

```json
{
  "embedding_scope": {
    "include_system": false, // Don't embed the system prompt
    "include_history": true, // Embed conversation history
    "max_history_turns": 3 // Only the last 3 turns
  }
}
```

## Cache Warming
### Automatic Warming

Pre-cache popular queries:

```json
{
  "warming": {
    "enabled": true,
    "sources": ["popular_queries", "predefined_list"],
    "schedule": "0 0 * * *" // Daily at midnight
  }
}
```

### Manual Warming
```bash
curl -X POST https://api.gateflow.ai/v1/management/cache/warm \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "file",
    "url": "s3://bucket/popular-queries.jsonl",
    "model": "gpt-4o"
  }'
```

## Monitoring
### Cache Metrics

```bash
curl https://api.gateflow.ai/v1/management/cache/metrics \
  -H "Authorization: Bearer gw_prod_admin_key"
```

Response:

```json
{
  "entries": 456789,
  "storage_mb": 1234,
  "hit_rate_1h": 0.42,
  "hit_rate_24h": 0.38,
  "avg_similarity_hit": 0.97,
  "evictions_24h": 1234
}
```

### Alerts
```json
{
  "alerts": [
    {
      "name": "Low cache hit rate",
      "condition": {"metric": "hit_rate_1h", "lt": 0.2},
      "notify": ["slack"]
    },
    {
      "name": "Cache storage high",
      "condition": {"metric": "storage_percentage", "gt": 0.9},
      "notify": ["email"]
    }
  ]
}
```

## Debugging
### Test Similarity

```bash
curl -X POST https://api.gateflow.ai/v1/management/cache/test-similarity \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "query_a": "What is Python?",
    "query_b": "Explain Python programming"
  }'
```

Response:

```json
{
  "similarity": 0.94,
  "would_hit_cache": false,
  "threshold": 0.95
}
```

### Cache Debug Mode
Enable detailed cache logging:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache_debug": True
        }
    }
)
```

The response includes:

```json
{
  "gateflow": {
    "cache_debug": {
      "query_embedding": "[0.123, -0.456, ...]",
      "nearest_matches": [
        {"similarity": 0.93, "query": "Tell me about Python"},
        {"similarity": 0.89, "query": "Python overview"}
      ],
      "decision": "miss",
      "reason": "no_match_above_threshold"
    }
  }
}
```

## Next Steps
- Cost Analytics - Track cache savings
- Semantic Caching - Caching overview