
Semantic Caching

Semantic caching stores AI responses and returns cached results for semantically similar queries, which can cut costs by up to 90% on highly repetitive workloads.

How It Works

Unlike traditional caching (exact match), semantic caching finds similar queries:

Query                                 Traditional Cache   Semantic Cache
"What is Python?"                     HIT                 HIT
"what is python"                      MISS                HIT
"Explain Python"                      MISS                HIT
"Tell me about Python programming"    MISS                HIT
"What is JavaScript?"                 MISS                MISS

Enabling Caching

Global Setting

bash
curl -X PATCH https://api.gateflow.ai/v1/management/settings \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "semantic_cache": {
      "enabled": true,
      "default_threshold": 0.95
    }
  }'

Per-Request Control

Enable or disable for specific requests:

python
# Use cache (default when enabled globally)
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...]
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache": "skip"
        }
    }
)

# Force cache refresh
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache": "refresh"  # Bypass cache, store new response
        }
    }
)

Cache Configuration

Similarity Threshold

How similar a query must be to a cached entry to count as a hit (0.0 to 1.0):

json
{
  "semantic_cache": {
    "threshold": 0.95
  }
}

Threshold   Behavior
0.99        Very strict; nearly exact matches only
0.95        Recommended; catches rephrased queries
0.90        Looser; more cache hits, some risk of wrong matches
0.85        Very loose; high cache hit rate, real quality risk

TTL (Time to Live)

How long cached responses remain valid:

json
{
  "semantic_cache": {
    "ttl_seconds": 86400
  }
}

Considerations:

  • Factual queries: Longer TTL (days/weeks)
  • Time-sensitive queries: Shorter TTL (hours)
  • Real-time data: Disable caching
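The TTL rule itself is a simple age check. A sketch of what a cache might run on each lookup (the `cached_at` field name is illustrative):

```python
import time

def is_fresh(entry, ttl_seconds=86400):
    # Serve an entry only while its age is under the configured TTL.
    return (time.time() - entry["cached_at"]) < ttl_seconds

entry = {"cached_at": time.time() - 3600}   # cached one hour ago
print(is_fresh(entry, ttl_seconds=86400))   # True under a 24h TTL
print(is_fresh(entry, ttl_seconds=1800))    # False under a 30-minute TTL
```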

Cache Scope

Control which request attributes are included in the cache key:

json
{
  "semantic_cache": {
    "scope": {
      "include_model": true,
      "include_temperature": true,
      "include_system_prompt": true
    }
  }
}

Cache Key Components

The cache key is computed from:

  1. Messages - Embedded for semantic similarity
  2. Model - Different models cached separately
  3. Temperature - Different temperatures cached separately
  4. Organization - Isolated per organization

Parameters that don't affect the cache key:

  • max_tokens
  • stream
  • user
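One way to picture how these components combine (an illustrative sketch, not GateFlow's actual implementation; in particular, the messages component is really a vector-similarity lookup rather than an exact hash):

```python
import hashlib
import json

def cache_key(embedding_id, model, temperature, org_id):
    # Only the listed components participate; max_tokens, stream, and user
    # are deliberately excluded so they never fragment the cache.
    parts = {"embedding": embedding_id, "model": model,
             "temperature": temperature, "org": org_id}
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

k1 = cache_key("emb-123", "gpt-5.2", 0.7, "org-a")
k2 = cache_key("emb-123", "gpt-5.2", 0.2, "org-a")   # temperature differs
k3 = cache_key("emb-123", "gpt-5.2", 0.7, "org-a")   # identical inputs
print(k1 != k2)  # True: cached separately
print(k1 == k3)  # True: same entry
```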

Cache Responses

Cache Hit Response

json
{
  "id": "chatcmpl-cached-123",
  "model": "gpt-5.2",
  "choices": [...],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  },
  "gateflow": {
    "cache": {
      "hit": true,
      "similarity": 0.97,
      "original_query": "What is Python?",
      "cached_at": "2026-01-15T10:30:00Z"
    }
  }
}

Cache Miss Response

json
{
  "id": "chatcmpl-456",
  "model": "gpt-5.2",
  "choices": [...],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 150,
    "total_tokens": 165
  },
  "gateflow": {
    "cache": {
      "hit": false,
      "stored": true
    }
  }
}
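Because the cache metadata rides along in the response body, client code can branch on it. A small helper, assuming the response has been parsed into a dict (the helper itself is ours, not part of any SDK):

```python
def cache_info(body):
    # Extract GateFlow's cache metadata from a parsed response body.
    info = body.get("gateflow", {}).get("cache", {})
    if info.get("hit"):
        return f"hit (similarity {info['similarity']:.2f})"
    return "miss (stored)" if info.get("stored") else "miss"

hit = {"gateflow": {"cache": {"hit": True, "similarity": 0.97}}}
miss = {"gateflow": {"cache": {"hit": False, "stored": True}}}
print(cache_info(hit))   # hit (similarity 0.97)
print(cache_info(miss))  # miss (stored)
```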

Cost Savings

Example Calculation

Monthly requests: 1,000,000
Average tokens per request: 500
Cache hit rate: 40%

Without cache:
  1,000,000 x 500 tokens x $0.00001 = $5,000

With cache:
  600,000 requests to provider x 500 tokens x $0.00001 = $3,000
  400,000 cache hits x $0 = $0
  Total: $3,000

Savings: $2,000 (40%)
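The arithmetic above generalizes to any volume and hit rate; a quick helper to rerun it with your own numbers:

```python
def cache_savings(requests, avg_tokens, price_per_token, hit_rate):
    # Cache hits cost nothing, so spend scales with the miss rate.
    baseline = requests * avg_tokens * price_per_token
    with_cache = baseline * (1 - hit_rate)
    return baseline, with_cache, baseline - with_cache

baseline, with_cache, saved = cache_savings(1_000_000, 500, 0.00001, 0.40)
print(f"${baseline:,.0f} -> ${with_cache:,.0f} (saved ${saved:,.0f})")
# $5,000 -> $3,000 (saved $2,000)
```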

Typical Hit Rates by Use Case

Use Case                Typical Hit Rate
Customer support FAQ    60-80%
Code documentation      40-60%
Content generation      20-30%
Unique queries          5-15%

Cache Analytics

View Cache Statistics

bash
curl https://api.gateflow.ai/v1/management/cache/stats \
  -H "Authorization: Bearer gw_prod_admin_key"

Response:

json
{
  "period": "last_24h",
  "stats": {
    "total_requests": 50000,
    "cache_hits": 22500,
    "cache_misses": 27500,
    "hit_rate": 0.45,
    "tokens_saved": 11250000,
    "cost_saved": 112.50
  }
}
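The derived fields are internally consistent, which makes them easy to sanity-check or recompute client-side:

```python
stats = {"total_requests": 50000, "cache_hits": 22500,
         "cache_misses": 27500, "tokens_saved": 11_250_000}

hit_rate = stats["cache_hits"] / stats["total_requests"]
print(hit_rate)  # 0.45

# tokens_saved / cache_hits gives the average response size served from cache
print(stats["tokens_saved"] / stats["cache_hits"])  # 500.0
```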

Cache Entry Details

bash
curl https://api.gateflow.ai/v1/management/cache/entries \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -G -d "limit=10"

Cache Management

Clear Cache

bash
# Clear all cache
curl -X DELETE https://api.gateflow.ai/v1/management/cache \
  -H "Authorization: Bearer gw_prod_admin_key"

# Clear cache for specific model
curl -X DELETE https://api.gateflow.ai/v1/management/cache \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -G -d "model=gpt-5.2"

Warm Cache

Pre-populate cache with known queries:

bash
curl -X POST https://api.gateflow.ai/v1/management/cache/warm \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {"messages": [{"role": "user", "content": "What is GateFlow?"}]},
      {"messages": [{"role": "user", "content": "How do I get started?"}]}
    ],
    "model": "gpt-5.2"
  }'
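The warm request is also easy to build programmatically. A small helper that assembles the body for this endpoint from plain question strings (the `warm_payload` name is ours, not part of GateFlow):

```python
import json

def warm_payload(questions, model="gpt-5.2"):
    # Wrap each question in the chat-message shape the warm endpoint expects.
    return {
        "queries": [{"messages": [{"role": "user", "content": q}]} for q in questions],
        "model": model,
    }

body = warm_payload(["What is GateFlow?", "How do I get started?"])
print(json.dumps(body, indent=2))
```

POST the serialized body to the /v1/management/cache/warm endpoint with the same headers as the curl example above.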

When to Disable Caching

Disable caching for:

  • Personalized responses - User-specific content
  • Real-time data - Stock prices, weather
  • Creative generation - Want variety in outputs
  • Sensitive operations - Each request must be fresh

python
# Always skip cache for these
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": f"What's {user_name}'s order status?"}
    ],
    extra_body={
        "gateflow": {
            "cache": "skip"
        }
    }
)

Embedding Model

GateFlow uses text-embedding-3-small by default for cache embeddings. You can configure this:

json
{
  "semantic_cache": {
    "embedding_model": "text-embedding-3-large"
  }
}
