
Request Queuing

Manage request queues for handling traffic spikes and rate limits.

Overview

Request queuing buffers requests when providers are rate limited or under heavy load, then releases them as capacity becomes available instead of failing immediately.
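The queuing model can be pictured as a bounded queue that orders requests by priority and rejects new work when full. A minimal sketch of that behavior (class and constant names are ours, not GateFlow internals):

```python
import heapq
import itertools

# Lower rank = served first; mirrors the gateway's priority levels.
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "low": 3}

class RequestQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []                # (rank, seq, request)
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def enqueue(self, request, priority="normal"):
        if len(self._heap) >= self.max_size:
            return False               # full: maps to an immediate 503
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._seq), request))
        return True

    def dequeue(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = RequestQueue(max_size=2)
q.enqueue("batch-job", priority="low")
q.enqueue("user-chat", priority="critical")
accepted = q.enqueue("overflow", priority="high")  # queue is already full
```

Note how the critical request jumps ahead of the earlier low-priority one, while the overflow request is rejected rather than queued.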

Queue Configuration

Enable Queuing

bash
curl -X POST https://api.gateflow.ai/v1/management/queue-config \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "max_queue_size": 1000,
    "max_wait_ms": 60000,
    "priority_levels": ["critical", "high", "normal", "low"],
    "default_priority": "normal"
  }'

Per-Request Configuration

python
import openai

client = openai.OpenAI(
    base_url="https://api.gateflow.ai/v1",
    api_key="gw_prod_..."
)

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "gateflow": {
            "queue": {
                "enabled": True,
                "priority": "high",
                "max_wait_ms": 30000
            }
        }
    }
)

Priority Levels

Priority Queue Order

Priority | Use Case | Max Wait
critical | Real-time user interactions | 5s
high | Interactive features | 15s
normal | Background processing | 60s
low | Batch jobs | 300s
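If requests don't set `max_wait_ms` explicitly, the table's defaults can be applied client-side when building the `gateflow.queue` block. A small helper to that effect (illustrative, not part of any SDK):

```python
# Default max-wait per priority, mirroring the table above (milliseconds).
DEFAULT_MAX_WAIT_MS = {
    "critical": 5_000,
    "high": 15_000,
    "normal": 60_000,
    "low": 300_000,
}

def queue_options(priority="normal", max_wait_ms=None):
    """Build a gateflow.queue block, falling back to the table defaults."""
    if priority not in DEFAULT_MAX_WAIT_MS:
        raise ValueError(f"unknown priority: {priority}")
    return {
        "priority": priority,
        "max_wait_ms": max_wait_ms or DEFAULT_MAX_WAIT_MS[priority],
    }

opts = queue_options("high")
```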

Setting Priority

Per Request:

python
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "gateflow": {
            "queue": {"priority": "critical"}
        }
    }
)

Per API Key:

bash
curl -X POST https://api.gateflow.ai/v1/management/api-keys \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "realtime-key",
    "default_queue_priority": "high"
  }'

Queue Behavior

When Queue is Full

python
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "gateflow": {
            "queue": {
                "enabled": True,
                "on_full": "reject"  # or "drop_lowest"
            }
        }
    }
)

Behavior | Description
reject | Return 503 immediately
drop_lowest | Remove the lowest-priority request
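With `on_full: "reject"`, clients see an immediate 503 and typically retry with exponential backoff. A deterministic backoff schedule might look like this (jitter omitted for clarity; names and defaults are illustrative):

```python
def backoff_schedule(attempts, base_ms=250, cap_ms=8_000):
    """Exponential backoff delays for retrying 503 queue rejections."""
    return [min(base_ms * (2 ** i), cap_ms) for i in range(attempts)]

delays = backoff_schedule(6)
# 250, 500, 1000, 2000, 4000, 8000
```

In production you would add random jitter to each delay so that many rejected clients don't retry in lockstep.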

Timeout Handling

python
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "gateflow": {
            "queue": {
                "max_wait_ms": 30000,
                "on_timeout": "error"  # or "fallback"
            }
        }
    }
)

Queue Status

Check Queue Status

bash
curl https://api.gateflow.ai/v1/management/queue/status \
  -H "Authorization: Bearer gw_prod_..."

Response:

json
{
  "queue_enabled": true,
  "current_size": 45,
  "max_size": 1000,
  "by_priority": {
    "critical": 2,
    "high": 8,
    "normal": 30,
    "low": 5
  },
  "avg_wait_ms": 2500,
  "processing_rate_per_min": 120
}

Queue Position in Response

json
{
  "id": "chatcmpl-abc123",
  "choices": [...],
  "usage": {...},
  "gateflow": {
    "queue": {
      "was_queued": true,
      "queue_wait_ms": 1500,
      "initial_position": 12
    }
  }
}
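The `gateflow` block is a gateway extension, so typed SDK response objects won't surface it directly; reading it from the raw response body is straightforward. A sketch using the payload shape above:

```python
def queue_stats(response: dict):
    """Extract queue metadata from a GateFlow response body, if present."""
    queue = response.get("gateflow", {}).get("queue", {})
    if not queue.get("was_queued"):
        return None
    return {
        "wait_ms": queue.get("queue_wait_ms"),
        "initial_position": queue.get("initial_position"),
    }

body = {
    "id": "chatcmpl-abc123",
    "gateflow": {
        "queue": {"was_queued": True, "queue_wait_ms": 1500, "initial_position": 12}
    },
}
stats = queue_stats(body)
```

This is useful for logging queue wait times alongside model latency when diagnosing slow responses.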

Fair Queuing

Prevent single consumers from monopolizing the queue:

bash
curl -X POST https://api.gateflow.ai/v1/management/queue-config \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "fair_queuing": {
      "enabled": true,
      "max_per_key": 50,
      "max_per_ip": 100
    }
  }'

Batch Queue

For batch processing with lower priority:

python
# Queue multiple requests
responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "gateflow": {
                "queue": {
                    "priority": "low",
                    "max_wait_ms": 300000,  # 5 minutes
                    "batch_id": "batch_abc123"
                }
            }
        }
    )
    responses.append(response)

Batch Status

bash
curl https://api.gateflow.ai/v1/management/queue/batch/batch_abc123 \
  -H "Authorization: Bearer gw_prod_..."

Response:

json
{
  "batch_id": "batch_abc123",
  "total_requests": 100,
  "completed": 45,
  "queued": 50,
  "failed": 5,
  "avg_wait_ms": 8500
}
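Batch completion can be tracked by polling the status endpoint until no requests remain queued. A sketch of that loop, with the HTTP fetch injected as a callable so the logic runs offline (status shape taken from the response above):

```python
def wait_for_batch(fetch_status, poll=lambda: None):
    """Poll until every request in the batch has completed or failed.

    fetch_status: callable returning a dict shaped like the batch-status response.
    poll: hook invoked between polls (e.g. lambda: time.sleep(5)).
    """
    while True:
        status = fetch_status()
        if status["queued"] == 0:
            return status
        poll()

# Simulated status snapshots standing in for real HTTP calls.
snapshots = iter([
    {"batch_id": "batch_abc123", "completed": 45, "queued": 50, "failed": 5},
    {"batch_id": "batch_abc123", "completed": 95, "queued": 0, "failed": 5},
])
final = wait_for_batch(lambda: next(snapshots))
```

In real use, `fetch_status` would GET the batch endpoint shown above and `poll` would sleep between requests.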

Queue Webhooks

Get notified of queue events:

bash
curl -X POST https://api.gateflow.ai/v1/management/webhooks \
  -H "Authorization: Bearer gw_prod_admin_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-app.com/queue-webhook",
    "events": ["queue.high_usage", "queue.timeout", "queue.full"]
  }'

Webhook Payload:

json
{
  "event": "queue.high_usage",
  "timestamp": "2026-02-16T10:30:00Z",
  "data": {
    "current_size": 850,
    "max_size": 1000,
    "utilization": 0.85
  }
}
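A webhook receiver only needs to branch on `event` and, for `queue.high_usage`, compare utilization against a threshold. A framework-free sketch of that handler logic (the threshold value is our choice, not a gateway default):

```python
def handle_queue_event(payload: dict, alert_threshold=0.9):
    """Map a queue webhook payload to an action: 'alert', 'warn', or 'ok'."""
    event = payload.get("event")
    if event == "queue.full":
        return "alert"
    if event == "queue.high_usage":
        utilization = payload["data"]["utilization"]
        return "alert" if utilization >= alert_threshold else "warn"
    return "ok"

action = handle_queue_event({
    "event": "queue.high_usage",
    "data": {"current_size": 850, "max_size": 1000, "utilization": 0.85},
})
```

The same branching would sit behind whatever web framework serves `https://your-app.com/queue-webhook`.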

Monitoring

Queue Metrics

bash
curl https://api.gateflow.ai/v1/management/analytics/queue \
  -H "Authorization: Bearer gw_prod_..." \
  -G -d "period=1h"

Response:

json
{
  "period": "1h",
  "requests_queued": 1250,
  "requests_processed": 1200,
  "requests_timed_out": 30,
  "requests_rejected": 20,
  "avg_wait_ms": 3500,
  "p95_wait_ms": 12000,
  "p99_wait_ms": 25000,
  "peak_queue_size": 450
}
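The metrics above reduce to a couple of ratios worth alerting on, such as the share of queued requests that timed out or were rejected. A small helper for that derivation (field names from the response above):

```python
def queue_health(metrics: dict):
    """Derive failure ratios from the queue analytics response."""
    queued = metrics["requests_queued"]
    return {
        "timeout_rate": metrics["requests_timed_out"] / queued,
        "reject_rate": metrics["requests_rejected"] / queued,
    }

health = queue_health({
    "requests_queued": 1250,
    "requests_timed_out": 30,
    "requests_rejected": 20,
})
```

For the sample period above this gives a 2.4% timeout rate and a 1.6% reject rate, which you can trend against whatever thresholds fit your traffic.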

Best Practices

  1. Set appropriate timeouts - Match max_wait_ms to your use case
  2. Use priority wisely - Reserve "critical" for truly real-time needs
  3. Monitor queue depth - Set alerts before queue fills
  4. Enable fair queuing - Prevent monopolization in multi-tenant setups
  5. Combine with fallbacks - Queue + fallbacks provides best reliability
