Architecture

This page describes how GateFlow works under the hood.

System Overview

Components

API Gateway

The entry point for all requests. Handles:

  • Authentication: Validates API keys
  • Rate Limiting: Enforces per-key and per-org limits
  • Request Parsing: Normalizes different client formats
  • Response Streaming: Supports SSE for streaming responses

Technologies: Python FastAPI, async I/O

Routing Engine

Determines which provider and model handles each request:

  • Direct Routing: Model specified in request
  • Fallback Chains: Sequential attempts if primary fails
  • Task Classification: ML model classifies task type
  • Load Balancing: Distributes across providers

Decision Flow:

1. Parse model from request
2. Check if model is an alias → resolve
3. Check if model is deprecated → use replacement
4. Get fallback chain for model
5. For each model in chain:
   a. Check provider health
   b. Check rate limits
   c. If available, route request
   d. If failed, try next in chain
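
The decision flow above can be sketched as a small resolver. The table names (`ALIASES`, `DEPRECATED`, `FALLBACK_CHAINS`) and model names are illustrative, not GateFlow internals:

```python
# Hypothetical lookup tables; in practice these come from the database.
ALIASES = {"gpt-4-latest": "gpt-4o"}
DEPRECATED = {"gpt-3.5-turbo-0301": "gpt-3.5-turbo"}
FALLBACK_CHAINS = {"gpt-4o": ["gpt-4o", "claude-sonnet", "gemini-pro"]}

def resolve_model(requested: str) -> str:
    # Step 2: resolve aliases to canonical model names
    model = ALIASES.get(requested, requested)
    # Step 3: substitute replacements for deprecated models
    return DEPRECATED.get(model, model)

def route(requested: str, healthy: set, rate_limited: set) -> str:
    model = resolve_model(requested)
    # Steps 4-5: walk the fallback chain until a usable candidate is found
    for candidate in FALLBACK_CHAINS.get(model, [model]):
        if candidate in healthy and candidate not in rate_limited:
            return candidate
    raise RuntimeError("All providers exhausted")
```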

Cache Layer

Semantic caching with pgvector:

  • Embedding Generation: Creates vector for each prompt
  • Similarity Search: Finds cached responses above threshold
  • Cache Storage: PostgreSQL with pgvector extension
  • TTL Management: Automatic expiration

Cache Key Components:

  • Prompt embedding (vector)
  • Model name
  • Temperature setting
  • Organization ID
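
One way to picture how these components combine (a sketch, not GateFlow's actual storage schema): the scalar fields act as exact-match filters, while the embedding is compared by similarity rather than equality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheKey:
    embedding: tuple     # prompt vector, matched by similarity search
    model: str           # exact match
    temperature: float   # exact match
    org_id: str          # exact match (tenant isolation)

    def filters(self) -> dict:
        # Fields that must match exactly before similarity is considered
        return {"model": self.model,
                "temperature": self.temperature,
                "org_id": self.org_id}
```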

Provider Connectors

Adapters for each AI provider:

python
# Simplified connector interface
class ProviderConnector:
    async def chat_completion(self, request) -> Response: ...
    async def embedding(self, request) -> Response: ...
    async def health_check(self) -> bool: ...

Each connector handles:

  • Request format translation
  • Response normalization
  • Error mapping
  • Retry logic

Data Layer

PostgreSQL - Primary database:

  • Organizations, users, API keys
  • Provider configurations
  • Routing rules
  • Request logs

pgvector - Vector similarity:

  • Semantic cache storage
  • Document embeddings (Data Pillar)

Redis - Fast operations:

  • Rate limit counters
  • Session cache
  • Real-time metrics

Request Lifecycle

1. Request Received

http
POST /v1/chat/completions HTTP/1.1
Host: api.gateflow.ai
Authorization: Bearer gw_prod_xxx
Content-Type: application/json

{"model": "gpt-4o", "messages": [...]}

2. Authentication

python
async def authenticate(request):
    api_key = extract_api_key(request)
    key_data = await db.get_api_key(api_key)

    if not key_data:
        raise AuthenticationError("Invalid API key")

    if key_data.revoked:
        raise AuthenticationError("API key revoked")

    return key_data.organization_id

3. Rate Limiting

python
async def check_rate_limit(org_id, key_id):
    # Check organization limits (counters reset each window via TTL)
    org_count = await redis.incr(f"ratelimit:org:{org_id}")
    if org_count == 1:
        await redis.expire(f"ratelimit:org:{org_id}", WINDOW_SECONDS)
    if org_count > ORG_LIMIT:
        raise RateLimitError("Organization rate limit exceeded")

    # Check key limits
    key_count = await redis.incr(f"ratelimit:key:{key_id}")
    if key_count == 1:
        await redis.expire(f"ratelimit:key:{key_id}", WINDOW_SECONDS)
    if key_count > KEY_LIMIT:
        raise RateLimitError("API key rate limit exceeded")

4. Cache Lookup

python
async def check_cache(request, org_id):
    # Generate embedding
    embedding = await embed(request.messages)

    # Search for similar cached responses
    results = await pgvector.similarity_search(
        embedding,
        threshold=0.95,
        filters={"model": request.model, "org_id": org_id}
    )

    if results:
        return CacheHit(results[0].response)

    return CacheMiss()

5. Provider Selection

python
async def select_provider(model, org_id):
    # Get fallback chain
    chain = await get_fallback_chain(model, org_id)

    for candidate in chain:
        provider = get_provider(candidate.provider)

        # Check health
        if not await provider.is_healthy():
            continue

        # Check rate limits
        if await provider.is_rate_limited(org_id):
            continue

        return provider, candidate.model

    raise NoAvailableProvider("All providers exhausted")

6. Provider Request

python
async def call_provider(provider, model, request):
    try:
        response = await provider.chat_completion(
            model=model,
            messages=request.messages,
            **request.parameters
        )
        return response
    except ProviderError:
        # Record failure for circuit breaker
        await record_failure(provider)
        raise

7. Response Processing

python
async def process_response(response, request):
    # Calculate cost
    cost = calculate_cost(
        model=response.model,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )

    # Cache response
    await cache_response(request, response)

    # Log for analytics
    await log_request(request, response, cost)

    return response

Reliability

Circuit Breaker

Prevents cascading failures when a provider is down:

State: CLOSED (normal)
  ↓ failures > threshold
State: OPEN (failing fast)
  ↓ timeout expires
State: HALF-OPEN (testing)
  ↓ success
State: CLOSED
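
A minimal sketch of this state machine (threshold and timeout values are illustrative; the clock is injectable so the transitions can be exercised deterministically):

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.timeout = timeout
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            # After the timeout expires, let one probe through (HALF-OPEN)
            if self.clock() - self.opened_at >= self.timeout:
                self.state = "HALF-OPEN"
                return True
            return False  # still failing fast
        return True

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        # A failed probe in HALF-OPEN reopens immediately
        if self.state == "HALF-OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```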

Retry Logic

Exponential backoff with jitter:

python
# Delays in seconds; ±25% jitter is applied to each
retry_delays = [1, 2, 4, 8]

Retries on:

  • Network timeouts
  • 5xx errors
  • Rate limit errors (after delay)

Does not retry:

  • Authentication errors
  • Invalid request errors
  • Content policy violations
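
The retry policy above can be sketched as a pair of helpers. The retryable status set and the delay cap are assumptions for illustration:

```python
import random

# Transient conditions worth retrying: timeouts, rate limits, 5xx errors
RETRYABLE = {408, 429, 500, 502, 503, 504}

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 8.0) -> float:
    """Delay before retry `attempt` (0-based): 1s, 2s, 4s, 8s with ±25% jitter."""
    delay = min(base * (2 ** attempt), cap)
    jitter = random.uniform(-0.25, 0.25) * delay
    return delay + jitter

def should_retry(status: int) -> bool:
    # Other 4xx errors (auth, invalid request, content policy) are permanent
    # failures and are never retried.
    return status in RETRYABLE
```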

Health Checks

Continuous monitoring of provider status:

python
# Every 30 seconds per provider
async def health_check_loop():
    while True:
        for provider in providers:
            try:
                await provider.ping()
                provider.mark_healthy()
            except Exception:
                provider.mark_unhealthy()
        await asyncio.sleep(30)

Scalability

Horizontal Scaling

API Gateway scales horizontally behind a load balancer:

Load Balancer
     ├── API Instance 1
     ├── API Instance 2
     ├── API Instance 3
     └── API Instance N

Database Scaling

  • Read Replicas: Analytics queries on replicas
  • Connection Pooling: PgBouncer for efficient connections
  • Partitioning: Request logs partitioned by date

Caching Tiers

L1: In-memory (hot keys)
L2: Redis (warm keys)
L3: PostgreSQL/pgvector (all keys)
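
A sketch of the tiered lookup: each tier is checked in order, and hits are promoted into the faster tiers. The dict-backed tiers stand in for the real in-memory/Redis/PostgreSQL stores:

```python
class TieredCache:
    def __init__(self):
        self.l1 = {}   # in-memory (hot keys)
        self.l2 = {}   # Redis stand-in (warm keys)
        self.l3 = {}   # PostgreSQL/pgvector stand-in (all keys)

    def get(self, key):
        for tier in (self.l1, self.l2, self.l3):
            if key in tier:
                value = tier[key]
                # Promote so subsequent reads hit the fastest tier
                self.l1[key] = value
                return value
        return None

    def put(self, key, value):
        # Writes land in the authoritative tier; upper tiers fill on read
        self.l3[key] = value
```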

Security

Data Encryption

  • In Transit: TLS 1.3 for all connections
  • At Rest: AES-256 for stored credentials
  • Provider Keys: Encrypted with per-org keys

Isolation

  • Multi-tenant: Row-level security in PostgreSQL
  • Network: Provider requests from isolated workers
  • Secrets: HSM for master encryption keys

Next Steps

Built with reliability in mind.