# Semantic Caching

Semantic caching stores AI responses and returns cached results for semantically similar queries, cutting costs by up to 90% on highly repetitive workloads.
## How It Works

Unlike traditional caching, which requires an exact match, semantic caching matches queries by meaning:
| Query | Traditional Cache | Semantic Cache |
|---|---|---|
| "What is Python?" | HIT | HIT |
| "what is python" | MISS | HIT |
| "Explain Python" | MISS | HIT |
| "Tell me about Python programming" | MISS | HIT |
| "What is JavaScript?" | MISS | MISS |
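The lookup logic behind the table can be sketched as follows. This is a toy model, not GateFlow's implementation: a bag-of-words vector stands in for the neural embedding a real semantic cache would use, so it reproduces only some of the hit/miss behavior above (the case-insensitive hit and the JavaScript miss).

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a word-count vector. A real semantic cache
    would call a neural embedding model here instead."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.entries = []  # list of (embedding, query, response)

    def lookup(self, query: str):
        """Return the response stored for the most similar query,
        or None if nothing clears the similarity threshold."""
        emb = embed(query)
        best = max(self.entries, key=lambda e: cosine(emb, e[0]), default=None)
        if best and cosine(emb, best[0]) >= self.threshold:
            return best[2]
        return None

    def store(self, query: str, response: str):
        self.entries.append((embed(query), query, response))

cache = SemanticCache()
cache.store("What is Python?", "Python is a programming language...")
print(cache.lookup("what is python"))       # hit: identical after normalization
print(cache.lookup("What is JavaScript?"))  # miss: similarity below threshold -> None
```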
## Enabling Caching

### Global Setting

```bash
curl -X PATCH https://api.gateflow.ai/v1/management/settings \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "semantic_cache": {
      "enabled": true,
      "default_threshold": 0.95
    }
  }'
```

### Per-Request Control
Enable or disable for specific requests:
```python
# Use cache (default when enabled globally)
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...]
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache": "skip"
        }
    }
)

# Force cache refresh
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache": "refresh"  # Bypass cache, store new response
        }
    }
)
```

## Cache Configuration
### Similarity Threshold

How similar a query must be to a cached entry to count as a hit (0.0 to 1.0):
```json
{
  "semantic_cache": {
    "threshold": 0.95
  }
}
```

| Threshold | Behavior |
|---|---|
| 0.99 | Very strict, nearly exact matches only |
| 0.95 | Recommended, catches rephrased queries |
| 0.90 | Looser, more cache hits, risk of wrong matches |
| 0.85 | Very loose, high cache hit rate, quality risk |
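The trade-off in the table comes down to a single comparison. The similarity scores below are illustrative values for one incoming query against its closest cached entry, not numbers produced by GateFlow:

```python
# Hypothetical similarity scores against the cached query "What is Python?"
candidates = {
    "what is python": 0.99,                    # case change only
    "Explain Python": 0.96,                    # rephrased
    "Tell me about Python programming": 0.93,  # looser rephrase
    "What is JavaScript?": 0.71,               # different topic
}

def is_hit(similarity: float, threshold: float) -> bool:
    """A query hits the cache when its similarity clears the threshold."""
    return similarity >= threshold

# Lowering the threshold admits progressively looser matches
for threshold in (0.99, 0.95, 0.90):
    hits = [q for q, s in candidates.items() if is_hit(s, threshold)]
    print(f"threshold={threshold}: {len(hits)} hits")
```

At 0.90 the loose rephrase becomes a hit as well, which is exactly where the risk of wrong matches noted in the table starts to appear.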
### TTL (Time to Live)
How long cached responses remain valid:
```json
{
  "semantic_cache": {
    "ttl_seconds": 86400
  }
}
```

Considerations:
- Factual queries: Longer TTL (days/weeks)
- Time-sensitive queries: Shorter TTL (hours)
- Real-time data: Disable caching
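The freshness check itself is simple: an entry is served only while its age is within the TTL. A minimal sketch (the function name and the fixed timestamps are illustrative):

```python
import time

def is_fresh(cached_at, ttl_seconds, now=None):
    """Serve a cached entry only while its age is within the TTL."""
    if now is None:
        now = time.time()
    return (now - cached_at) < ttl_seconds

ttl = 86_400  # 24 hours, matching the config above

# An entry cached one hour ago is still fresh; one cached ~25 hours ago is not
print(is_fresh(0.0, ttl, now=3_600.0))   # True
print(is_fresh(0.0, ttl, now=90_000.0))  # False
```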
### Cache Scope
Control what's included in cache key:
```json
{
  "semantic_cache": {
    "scope": {
      "include_model": true,
      "include_temperature": true,
      "include_system_prompt": true
    }
  }
}
```

## Cache Key Components
The cache key is computed from:
- Messages - Embedded for semantic similarity
- Model - Different models cached separately
- Temperature - Different temperatures cached separately
- Organization - Isolated per organization
Parameters that don't affect the cache key: `max_tokens`, `stream`, `user`
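The exact-match part of such a key could be built by hashing the scoped fields, with the messages handled separately via embeddings. The hashing scheme below is illustrative, not GateFlow's actual implementation:

```python
import hashlib
import json

def exact_key(model: str, temperature: float, org: str, system_prompt: str) -> str:
    """Non-semantic part of the cache key: entries are only comparable
    when these fields match exactly. max_tokens, stream, and user are
    deliberately left out, mirroring the list above."""
    payload = json.dumps(
        {"model": model, "temperature": temperature, "org": org, "system": system_prompt},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Same scoped fields -> same bucket, regardless of max_tokens or stream
k1 = exact_key("gpt-5.2", 0.7, "org_123", "You are helpful.")
k2 = exact_key("gpt-5.2", 0.7, "org_123", "You are helpful.")
# Different model -> separate bucket
k3 = exact_key("gpt-4.1", 0.7, "org_123", "You are helpful.")
print(k1 == k2, k1 == k3)  # True False
```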
## Cache Responses

### Cache Hit Response

```json
{
  "id": "chatcmpl-cached-123",
  "model": "gpt-5.2",
  "choices": [...],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  },
  "gateflow": {
    "cache": {
      "hit": true,
      "similarity": 0.97,
      "original_query": "What is Python?",
      "cached_at": "2026-01-15T10:30:00Z"
    }
  }
}
```

### Cache Miss Response
```json
{
  "id": "chatcmpl-456",
  "model": "gpt-5.2",
  "choices": [...],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 150,
    "total_tokens": 165
  },
  "gateflow": {
    "cache": {
      "hit": false,
      "stored": true
    }
  }
}
```

## Cost Savings
### Example Calculation

```text
Monthly requests:           1,000,000
Average tokens per request: 500
Cache hit rate:             40%

Without cache:
  1,000,000 requests x 500 tokens x $0.00001/token = $5,000

With cache:
  600,000 provider requests x 500 tokens x $0.00001/token = $3,000
  400,000 cache hits x $0                                 = $0
  Total:   $3,000
  Savings: $2,000 (40%)
```

### Typical Hit Rates by Use Case
| Use Case | Typical Hit Rate |
|---|---|
| Customer support FAQ | 60-80% |
| Code documentation | 40-60% |
| Content generation | 20-30% |
| Unique queries | 5-15% |
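The savings arithmetic can be checked with a short script. The function name and the per-token price are taken from the example above, not from a GateFlow API:

```python
def cache_savings(requests: int, avg_tokens: int, hit_rate: float, price_per_token: float):
    """Reproduce the cost arithmetic from the example calculation."""
    without_cache = requests * avg_tokens * price_per_token
    provider_requests = round(requests * (1 - hit_rate))  # only misses reach the provider
    with_cache = provider_requests * avg_tokens * price_per_token
    return without_cache, with_cache, without_cache - with_cache

without, with_, saved = cache_savings(1_000_000, 500, 0.40, 0.00001)
print(f"without=${without:,.0f} with=${with_:,.0f} saved=${saved:,.0f}")
# without=$5,000 with=$3,000 saved=$2,000
```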
## Cache Analytics

### View Cache Statistics

```bash
curl https://api.gateflow.ai/v1/management/cache/stats \
  -H "Authorization: Bearer gw_prod_admin_key"
```

Response:
```json
{
  "period": "last_24h",
  "stats": {
    "total_requests": 50000,
    "cache_hits": 22500,
    "cache_misses": 27500,
    "hit_rate": 0.45,
    "tokens_saved": 11250000,
    "cost_saved": 112.50
  }
}
```

### Cache Entry Details
```bash
curl https://api.gateflow.ai/v1/management/cache/entries \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -G -d "limit=10"
```

## Cache Management
### Clear Cache

```bash
# Clear all cache
curl -X DELETE https://api.gateflow.ai/v1/management/cache \
  -H "Authorization: Bearer gw_prod_admin_key"

# Clear cache for specific model
curl -X DELETE https://api.gateflow.ai/v1/management/cache \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -G -d "model=gpt-5.2"
```

### Warm Cache
Pre-populate cache with known queries:
```bash
curl -X POST https://api.gateflow.ai/v1/management/cache/warm \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {"messages": [{"role": "user", "content": "What is GateFlow?"}]},
      {"messages": [{"role": "user", "content": "How do I get started?"}]}
    ],
    "model": "gpt-5.2"
  }'
```

## When to Disable Caching
Disable caching for:
- Personalized responses - User-specific content
- Real-time data - Stock prices, weather
- Creative generation - Want variety in outputs
- Sensitive operations - Each request must be fresh
```python
# Always skip cache for these
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": f"What's {user_name}'s order status?"}
    ],
    extra_body={
        "gateflow": {
            "cache": "skip"
        }
    }
)
```

## Embedding Model
GateFlow uses `text-embedding-3-small` by default for cache embeddings. You can configure this:
```json
{
  "semantic_cache": {
    "embedding_model": "text-embedding-3-large"
  }
}
```

## Next Steps
- Cache Configuration - Advanced settings
- Cost Analytics - Track savings