
Semantic Caching

Semantic caching stores AI responses and returns cached results for semantically similar queries, which can cut costs by up to 90% on highly repetitive workloads.

How It Works

Unlike traditional caching (exact match), semantic caching finds similar queries:

Query                                 Traditional Cache   Semantic Cache
"What is Python?"                     HIT                 HIT
"what is python"                      MISS                HIT
"Explain Python"                      MISS                HIT
"Tell me about Python programming"    MISS                HIT
"What is JavaScript?"                 MISS                MISS

Enabling Caching

Global Setting

bash
curl -X PATCH https://api.gateflow.ai/v1/management/settings \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "semantic_cache": {
      "enabled": true,
      "default_threshold": 0.95
    }
  }'

Per-Request Control

Enable or disable for specific requests:

python
# Use cache (default when enabled globally)
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...]
)

# Skip cache
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache": "skip"
        }
    }
)

# Force cache refresh
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[...],
    extra_body={
        "gateflow": {
            "cache": "refresh"  # Bypass cache, store new response
        }
    }
)

Cache Configuration

Similarity Threshold

How similar a query must be to a cached entry to count as a hit (0.0 to 1.0):

json
{
  "semantic_cache": {
    "threshold": 0.95
  }
}

Threshold   Behavior
0.99        Very strict; nearly exact matches only
0.95        Recommended; catches rephrased queries
0.90        Looser; more cache hits, some risk of wrong matches
0.85        Very loose; high cache hit rate, real quality risk

TTL (Time to Live)

How long cached responses remain valid:

json
{
  "semantic_cache": {
    "ttl_seconds": 86400
  }
}

Considerations:

  • Factual queries: Longer TTL (days/weeks)
  • Time-sensitive queries: Shorter TTL (hours)
  • Real-time data: Disable caching
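The TTL rule itself is a simple age check. A sketch of what a cache might run on each lookup (the `cached_at` field name is illustrative):

```python
import time

def is_fresh(entry, ttl_seconds=86400):
    # Serve an entry only while its age is under the configured TTL.
    return (time.time() - entry["cached_at"]) < ttl_seconds

entry = {"cached_at": time.time() - 3600}   # cached one hour ago
print(is_fresh(entry, ttl_seconds=86400))   # True under a 24h TTL
print(is_fresh(entry, ttl_seconds=1800))    # False under a 30-minute TTL
```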

Cache Scope

Control which request attributes are included in the cache key:

json
{
  "semantic_cache": {
    "scope": {
      "include_model": true,
      "include_temperature": true,
      "include_system_prompt": true
    }
  }
}

Cache Key Components

The cache key is computed from:

  1. Messages - Embedded for semantic similarity
  2. Model - Different models cached separately
  3. Temperature - Different temperatures cached separately
  4. Organization - Isolated per organization

Parameters that don't affect the cache key:

  • max_tokens
  • stream
  • user
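One way to picture how these components combine (an illustrative sketch, not GateFlow's actual implementation; in particular, the messages component is really a vector-similarity lookup rather than an exact hash):

```python
import hashlib
import json

def cache_key(embedding_id, model, temperature, org_id):
    # Only the listed components participate; max_tokens, stream, and user
    # are deliberately excluded so they never fragment the cache.
    parts = {"embedding": embedding_id, "model": model,
             "temperature": temperature, "org": org_id}
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

k1 = cache_key("emb-123", "gpt-5.2", 0.7, "org-a")
k2 = cache_key("emb-123", "gpt-5.2", 0.2, "org-a")   # temperature differs
k3 = cache_key("emb-123", "gpt-5.2", 0.7, "org-a")   # identical inputs
print(k1 != k2)  # True: cached separately
print(k1 == k3)  # True: same entry
```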

Cache Responses

Cache Hit Response

json
{
  "id": "chatcmpl-cached-123",
  "model": "gpt-5.2",
  "choices": [...],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  },
  "gateflow": {
    "cache": {
      "hit": true,
      "similarity": 0.97,
      "original_query": "What is Python?",
      "cached_at": "2026-01-15T10:30:00Z"
    }
  }
}

Cache Miss Response

json
{
  "id": "chatcmpl-456",
  "model": "gpt-5.2",
  "choices": [...],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 150,
    "total_tokens": 165
  },
  "gateflow": {
    "cache": {
      "hit": false,
      "stored": true
    }
  }
}
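Because the cache metadata rides along in the response body, client code can branch on it. A small helper, assuming the response has been parsed into a dict (the helper itself is ours, not part of any SDK):

```python
def cache_info(body):
    # Extract GateFlow's cache metadata from a parsed response body.
    info = body.get("gateflow", {}).get("cache", {})
    if info.get("hit"):
        return f"hit (similarity {info['similarity']:.2f})"
    return "miss (stored)" if info.get("stored") else "miss"

hit = {"gateflow": {"cache": {"hit": True, "similarity": 0.97}}}
miss = {"gateflow": {"cache": {"hit": False, "stored": True}}}
print(cache_info(hit))   # hit (similarity 0.97)
print(cache_info(miss))  # miss (stored)
```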

Cost Savings

Example Calculation

Monthly requests: 1,000,000
Average tokens per request: 500
Cache hit rate: 40%

Without cache:
  1,000,000 x 500 tokens x $0.00001 = $5,000

With cache:
  600,000 requests to provider x 500 tokens x $0.00001 = $3,000
  400,000 cache hits x $0 = $0
  Total: $3,000

Savings: $2,000 (40%)
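The arithmetic above generalizes to any volume and hit rate; a quick helper to rerun it with your own numbers:

```python
def cache_savings(requests, avg_tokens, price_per_token, hit_rate):
    # Cache hits cost nothing, so spend scales with the miss rate.
    baseline = requests * avg_tokens * price_per_token
    with_cache = baseline * (1 - hit_rate)
    return baseline, with_cache, baseline - with_cache

baseline, with_cache, saved = cache_savings(1_000_000, 500, 0.00001, 0.40)
print(f"${baseline:,.0f} -> ${with_cache:,.0f} (saved ${saved:,.0f})")
# $5,000 -> $3,000 (saved $2,000)
```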

Typical Hit Rates by Use Case

Use Case                Typical Hit Rate
Customer support FAQ    60-80%
Code documentation      40-60%
Content generation      20-30%
Unique queries          5-15%

Cache Analytics

View Cache Statistics

bash
curl https://api.gateflow.ai/v1/management/cache/stats \
  -H "Authorization: Bearer gw_prod_admin_key"

Response:

json
{
  "period": "last_24h",
  "stats": {
    "total_requests": 50000,
    "cache_hits": 22500,
    "cache_misses": 27500,
    "hit_rate": 0.45,
    "tokens_saved": 11250000,
    "cost_saved": 112.50
  }
}
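The derived fields are internally consistent, which makes them easy to sanity-check or recompute client-side:

```python
stats = {"total_requests": 50000, "cache_hits": 22500,
         "cache_misses": 27500, "tokens_saved": 11_250_000}

hit_rate = stats["cache_hits"] / stats["total_requests"]
print(hit_rate)  # 0.45

# tokens_saved / cache_hits gives the average response size served from cache
print(stats["tokens_saved"] / stats["cache_hits"])  # 500.0
```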

Cache Entry Details

bash
curl https://api.gateflow.ai/v1/management/cache/entries \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -G -d "limit=10"

Cache Management

Clear Cache

bash
# Clear all cache
curl -X DELETE https://api.gateflow.ai/v1/management/cache \
  -H "Authorization: Bearer gw_prod_admin_key"

# Clear cache for specific model
curl -X DELETE https://api.gateflow.ai/v1/management/cache \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -G -d "model=gpt-5.2"

Warm Cache

Pre-populate cache with known queries:

bash
curl -X POST https://api.gateflow.ai/v1/management/cache/warm \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {"messages": [{"role": "user", "content": "What is GateFlow?"}]},
      {"messages": [{"role": "user", "content": "How do I get started?"}]}
    ],
    "model": "gpt-5.2"
  }'
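The warm request is also easy to build programmatically. A small helper that assembles the body for this endpoint from plain question strings (the `warm_payload` name is ours, not part of GateFlow):

```python
import json

def warm_payload(questions, model="gpt-5.2"):
    # Wrap each question in the chat-message shape the warm endpoint expects.
    return {
        "queries": [{"messages": [{"role": "user", "content": q}]} for q in questions],
        "model": model,
    }

body = warm_payload(["What is GateFlow?", "How do I get started?"])
print(json.dumps(body, indent=2))
```

POST the serialized body to the /v1/management/cache/warm endpoint with the same headers as the curl example above.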

When to Disable Caching

Disable caching for:

  • Personalized responses - User-specific content
  • Real-time data - Stock prices, weather
  • Creative generation - Want variety in outputs
  • Sensitive operations - Each request must be fresh

python
# Always skip cache for these
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": f"What's {user_name}'s order status?"}
    ],
    extra_body={
        "gateflow": {
            "cache": "skip"
        }
    }
)

Embedding Model

GateFlow uses text-embedding-3-small by default for cache embeddings. You can configure this:

json
{
  "semantic_cache": {
    "embedding_model": "text-embedding-3-large"
  }
}
