Architecture
This page describes how GateFlow works under the hood.
System Overview
Components
API Gateway
The entry point for all requests. Handles:
- Authentication: Validates API keys
- Rate Limiting: Enforces per-key and per-org limits
- Request Parsing: Normalizes different client formats
- Response Streaming: Supports SSE for streaming responses
Technologies: Python FastAPI, async I/O
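Request parsing, for example, has to map several client formats onto one internal shape. A minimal sketch of what that normalization might look like (the bare-`prompt` fallback format and field names are illustrative assumptions, not GateFlow's actual schema):

```python
def normalize_request(payload: dict) -> dict:
    """Normalize different client request formats into one internal shape."""
    if "messages" in payload:
        # OpenAI-style chat format: pass through as-is
        messages = payload["messages"]
    elif "prompt" in payload:
        # Bare-prompt format: wrap it as a single user message
        messages = [{"role": "user", "content": payload["prompt"]}]
    else:
        raise ValueError("Request must contain 'messages' or 'prompt'")
    return {"model": payload["model"], "messages": messages}
```

Downstream components then only ever see the normalized `messages` shape, regardless of which client format arrived at the gateway.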
Routing Engine
Determines which provider and model handles each request:
- Direct Routing: Model specified in request
- Fallback Chains: Sequential attempts if primary fails
- Task Classification: ML model classifies task type
- Load Balancing: Distributes across providers
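As a sketch, a fallback chain can be represented as an ordered list of (provider, model) candidates that the engine walks until one is available. Names and the configuration shape below are illustrative, not GateFlow's actual schema:

```python
# Illustrative fallback chain: primary first, cross-provider candidates after
FALLBACK_CHAINS = {
    "gpt-4o": [
        ("openai", "gpt-4o"),
        ("azure", "gpt-4o"),
        ("anthropic", "claude-sonnet"),  # cross-provider fallback
    ],
}

def first_available(model: str, healthy: set[str]) -> tuple[str, str]:
    """Walk the chain and return the first candidate whose provider is healthy."""
    for provider, candidate in FALLBACK_CHAINS[model]:
        if provider in healthy:
            return provider, candidate
    raise RuntimeError("All providers exhausted")
```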
Decision Flow:
1. Parse model from request
2. Check if model is an alias → resolve
3. Check if model is deprecated → use replacement
4. Get fallback chain for model
5. For each model in chain:
a. Check provider health
b. Check rate limits
c. If available, route request
   d. If failed, try next in chain

Cache Layer
Semantic caching with pgvector:
- Embedding Generation: Creates vector for each prompt
- Similarity Search: Finds cached responses above threshold
- Cache Storage: PostgreSQL with pgvector extension
- TTL Management: Automatic expiration
Cache Key Components:
- Prompt embedding (vector)
- Model name
- Temperature setting
- Organization ID
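The non-vector components can be grouped into an exact-match key, with the prompt embedding handled separately by similarity search. A sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheKey:
    """Exact-match part of the cache key; the embedding is matched by similarity."""
    model: str
    temperature: float
    org_id: str

# Frozen dataclasses are hashable, so a key can index an in-memory map
# or serve as a filter alongside the embedding in pgvector.
key = CacheKey(model="gpt-4o", temperature=0.7, org_id="org_123")
```

Keeping the organization ID in the key is what prevents one tenant's cached responses from being served to another.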
Provider Connectors
Adapters for each AI provider:
```python
# Simplified connector interface
class ProviderConnector:
    async def chat_completion(self, request) -> Response: ...
    async def embedding(self, request) -> Response: ...
    async def health_check(self) -> bool: ...
```

Each connector handles:
- Request format translation
- Response normalization
- Error mapping
- Retry logic
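Error mapping, for instance, means translating each provider's failure modes into one common hierarchy so the routing engine can react uniformly. A sketch with an illustrative status-code mapping (all names here are assumptions, not GateFlow's real error types):

```python
class ProviderError(Exception): ...
class ProviderRateLimited(ProviderError): ...
class ProviderUnavailable(ProviderError): ...

# Illustrative mapping from provider HTTP status codes to common errors
STATUS_TO_ERROR = {429: ProviderRateLimited, 503: ProviderUnavailable}

def map_error(status: int) -> ProviderError:
    """Translate a provider status code into the gateway's error hierarchy."""
    return STATUS_TO_ERROR.get(status, ProviderError)(f"status {status}")
```

Because every connector raises the same hierarchy, the routing engine can treat a 429 from any provider identically when deciding whether to retry or fall back.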
Data Layer
PostgreSQL - Primary database:
- Organizations, users, API keys
- Provider configurations
- Routing rules
- Request logs
pgvector - Vector similarity:
- Semantic cache storage
- Document embeddings (Data Pillar)
Redis - Fast operations:
- Rate limit counters
- Session cache
- Real-time metrics
Request Lifecycle
1. Request Received
```http
POST /v1/chat/completions HTTP/1.1
Host: api.gateflow.ai
Authorization: Bearer gw_prod_xxx
Content-Type: application/json

{"model": "gpt-4o", "messages": [...]}
```

2. Authentication
```python
async def authenticate(request):
    api_key = extract_api_key(request)
    key_data = await db.get_api_key(api_key)
    if not key_data:
        raise AuthenticationError("Invalid API key")
    if key_data.revoked:
        raise AuthenticationError("API key revoked")
    return key_data.organization_id
```

3. Rate Limiting
```python
async def check_rate_limit(org_id, key_id):
    # Check organization limits
    org_count = await redis.incr(f"ratelimit:org:{org_id}")
    if org_count == 1:
        # First request in this window: start the window timer
        await redis.expire(f"ratelimit:org:{org_id}", WINDOW_SECONDS)
    if org_count > ORG_LIMIT:
        raise RateLimitError("Organization rate limit exceeded")
    # Check key limits
    key_count = await redis.incr(f"ratelimit:key:{key_id}")
    if key_count == 1:
        await redis.expire(f"ratelimit:key:{key_id}", WINDOW_SECONDS)
    if key_count > KEY_LIMIT:
        raise RateLimitError("API key rate limit exceeded")
```

4. Cache Lookup
```python
async def check_cache(request, org_id):
    # Generate embedding for the incoming prompt
    embedding = await embed(request.messages)
    # Search for similar cached responses
    results = await pgvector.similarity_search(
        embedding,
        threshold=0.95,
        filters={"model": request.model, "org_id": org_id}
    )
    if results:
        return CacheHit(results[0].response)
    return CacheMiss()
```

5. Provider Selection
```python
async def select_provider(model, org_id):
    # Get fallback chain
    chain = await get_fallback_chain(model, org_id)
    for candidate in chain:
        provider = get_provider(candidate.provider)
        # Check health
        if not await provider.is_healthy():
            continue
        # Check rate limits
        if await provider.is_rate_limited(org_id):
            continue
        return provider, candidate.model
    raise NoAvailableProvider("All providers exhausted")
```

6. Provider Request
```python
async def call_provider(provider, model, request):
    try:
        response = await provider.chat_completion(
            model=model,
            messages=request.messages,
            **request.parameters
        )
        return response
    except ProviderError:
        # Record failure for circuit breaker
        await record_failure(provider)
        raise
```

7. Response Processing
```python
async def process_response(response, request):
    # Calculate cost
    cost = calculate_cost(
        model=response.model,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )
    # Cache response
    await cache_response(request, response)
    # Log for analytics
    await log_request(request, response, cost)
    return response
```

Reliability
Circuit Breaker
Prevents cascading failures when a provider is down:
```
State: CLOSED (normal)
  ↓ failures > threshold
State: OPEN (failing fast)
  ↓ timeout expires
State: HALF-OPEN (testing)
  ↓ success
State: CLOSED
```

Retry Logic
Exponential backoff with jitter:
```python
retry_delays = [1, 2, 4, 8]  # seconds, with ±25% jitter applied to each
```

Retries on:
- Network timeouts
- 5xx errors
- Rate limit errors (after delay)
Does not retry:
- Authentication errors
- Invalid request errors
- Content policy violations
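The schedule above can be generated rather than hard-coded. A sketch, assuming ±25% jitter applied uniformly to each exponential delay:

```python
import random

def backoff_delays(attempts: int = 4, base: float = 1.0, jitter: float = 0.25):
    """Yield exponential backoff delays (base * 2^n seconds) with ±jitter."""
    for n in range(attempts):
        delay = base * (2 ** n)
        yield delay * random.uniform(1 - jitter, 1 + jitter)
```

The jitter spreads retries out in time, so many clients that failed at the same moment do not all hammer the recovering provider in lockstep.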
Health Checks
Continuous monitoring of provider status:
```python
# Every 30 seconds per provider
async def health_check_loop():
    while True:
        for provider in providers:
            try:
                await provider.ping()
                provider.mark_healthy()
            except Exception:
                provider.mark_unhealthy()
        await sleep(30)
```

Scalability
Horizontal Scaling
API Gateway scales horizontally behind a load balancer:
```
Load Balancer
      │
      ├── API Instance 1
      ├── API Instance 2
      ├── API Instance 3
      └── API Instance N
```

Database Scaling
- Read Replicas: Analytics queries on replicas
- Connection Pooling: PgBouncer for efficient connections
- Partitioning: Request logs partitioned by date
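Date partitioning can be as simple as routing each log row to a monthly table. A sketch of such a naming scheme (the table names are illustrative, not GateFlow's actual layout):

```python
from datetime import date

def log_partition(day: date) -> str:
    """Return the name of the monthly partition a request log row belongs to."""
    return f"request_logs_{day.year:04d}_{day.month:02d}"
```

Monthly partitions keep individual tables small and let old log data be dropped by detaching a partition rather than running a bulk DELETE.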
Caching Tiers
```
L1: In-memory (hot keys)
L2: Redis (warm keys)
L3: PostgreSQL/pgvector (all keys)
```

Security
Data Encryption
- In Transit: TLS 1.3 for all connections
- At Rest: AES-256 for stored credentials
- Provider Keys: Encrypted with per-org keys
Isolation
- Multi-tenant: Row-level security in PostgreSQL
- Network: Provider requests from isolated workers
- Secrets: HSM for master encryption keys
Next Steps
- Provider Configuration - Set up providers
- Intelligent Routing - Configure routing
- Data & Compliance - Security and compliance features