
Overview

Visca AI Gateway implements flexible rate limiting to protect infrastructure, ensure fair usage, and help you manage costs. Rate limits can be configured per API key, user, or model.

Default Limits

Requests per Minute

  • Free Tier: 60 RPM
  • Pro: 600 RPM
  • Enterprise: Custom

Tokens per Minute

  • Free Tier: 40,000 TPM
  • Pro: 400,000 TPM
  • Enterprise: Custom

Concurrent Requests

  • Free Tier: 5
  • Pro: 20
  • Enterprise: Unlimited

Daily Request Limit

  • Free Tier: 5,000
  • Pro: Unlimited
  • Enterprise: Unlimited

Model-Specific Limits

Different models have different rate limits based on provider constraints:
| Model             | Requests/Min | Tokens/Min | Notes                             |
| ----------------- | ------------ | ---------- | --------------------------------- |
| GPT-4 Turbo       | 500          | 300,000    | Shared across all GPT-4 variants  |
| GPT-4             | 200          | 40,000     | Lower limit due to capacity       |
| GPT-3.5 Turbo     | 3,500        | 350,000    | Highest throughput                |
| Claude 3.5 Sonnet | 1,000        | 400,000    | High capacity model               |
| Claude 3 Opus     | 200          | 40,000     | Premium model, lower limits       |
| Gemini 1.5 Pro    | 360          | 4,000,000  | Highest token limit               |
| DALL-E 3          | 50           | N/A        | Image generation only             |
| Embeddings        | 3,000        | 1,000,000  | High throughput for RAG           |

Rate Limit Headers

Every API response includes rate limit information:
HTTP/1.1 200 OK
X-RateLimit-Limit-Requests: 600
X-RateLimit-Remaining-Requests: 599
X-RateLimit-Reset-Requests: 2024-01-15T10:00:00Z
X-RateLimit-Limit-Tokens: 400000
X-RateLimit-Remaining-Tokens: 395432
X-RateLimit-Reset-Tokens: 2024-01-15T10:00:00Z
Retry-After: 60

Header Meanings

  • X-RateLimit-Limit-Requests: Maximum requests per minute
  • X-RateLimit-Remaining-Requests: Requests left in current window
  • X-RateLimit-Reset-Requests: When request limit resets
  • X-RateLimit-Limit-Tokens: Maximum tokens per minute
  • X-RateLimit-Remaining-Tokens: Tokens left in current window
  • X-RateLimit-Reset-Tokens: When token limit resets
  • Retry-After: Seconds to wait before retrying (on 429 errors)
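
For example, a minimal sketch of reading these headers with the requests library (the /v1/chat/completions path and the alert threshold below are illustrative assumptions, not part of the gateway spec):

import requests

GATEWAY_URL = "https://gateway.visca.ai/v1/chat/completions"  # assumed OpenAI-compatible path

def check_request_budget(api_key: str) -> None:
    # Any normal call works; the rate limit headers ride along on every response.
    response = requests.post(
        GATEWAY_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]},
    )

    limit = int(response.headers.get("X-RateLimit-Limit-Requests", "0"))
    remaining = int(response.headers.get("X-RateLimit-Remaining-Requests", "0"))
    reset_at = response.headers.get("X-RateLimit-Reset-Requests")

    # Illustrative threshold: warn when less than 10% of the per-minute budget is left.
    if limit and remaining < limit * 0.1:
        print(f"Warning: {remaining}/{limit} requests remaining until {reset_at}")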

Handling Rate Limits

Exponential Backoff

Implement exponential backoff when you hit limits:
import time
import openai
from openai import RateLimitError

def call_with_retry(client, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": "Hello"}]
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff: 2^attempt seconds
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)

Request Queuing

Queue requests to stay within limits:
import asyncio
import time

import openai

class RateLimitedClient:
    def __init__(self, client, requests_per_minute=60):
        self.client = client
        self.requests_per_minute = requests_per_minute
        self.semaphore = asyncio.Semaphore(requests_per_minute)
        self.request_times = []

    async def chat_completion(self, **kwargs):
        async with self.semaphore:
            # Drop requests that have fallen out of the 1-minute sliding window
            now = time.time()
            self.request_times = [t for t in self.request_times if now - t < 60]

            # If at the limit, wait until the oldest request leaves the window
            if len(self.request_times) >= self.requests_per_minute:
                wait_time = 60 - (now - self.request_times[0])
                await asyncio.sleep(wait_time)

            self.request_times.append(time.time())
            return await self.client.chat.completions.create(**kwargs)

# Usage
client = RateLimitedClient(openai.AsyncOpenAI(), requests_per_minute=60)
response = await client.chat_completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

Configuring Custom Limits

Per API Key

Set limits when creating API keys:
curl https://gateway.visca.ai/v1/api-keys \
  -H "Authorization: Bearer your-admin-key" \
  -d '{
    "name": "Production Key",
    "rate_limits": {
      "requests_per_minute": 1000,
      "tokens_per_minute": 500000,
      "max_concurrent_requests": 50,
      "daily_request_limit": 100000
    }
  }'

Per User

Set user-specific limits:
import requests

# Create user with custom limits
response = requests.post(
    "https://gateway.visca.ai/v1/users",
    headers={"Authorization": "Bearer your-admin-key"},
    json={
        "user_id": "user-123",
        "rate_limits": {
            "requests_per_minute": 100,
            "tokens_per_minute": 100000
        }
    }
)

Per Model

Override default model limits:
# gateway-config.yaml
rate_limits:
  models:
    gpt-4:
      requests_per_minute: 300
      tokens_per_minute: 50000
    gpt-3.5-turbo:
      requests_per_minute: 2000
      tokens_per_minute: 300000

Monitoring Usage

Dashboard

View real-time usage in the dashboard:
  • Current RPM and TPM
  • Historical usage graphs
  • Top consumers
  • Rate limit violations

API

Query usage programmatically:
curl https://gateway.visca.ai/v1/usage \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "start_date": "2024-01-01",
    "end_date": "2024-01-31",
    "group_by": ["model", "user_id"]
  }'

Best Practices

Check Headers

Check the rate limit headers on every response so you can throttle before hitting the limit

Implement Backoff

Use exponential backoff for automatic retry

Batch Requests

Combine multiple operations when possible
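
For example, a sketch of batching embedding inputs into a single request rather than one request per text (the model name and base URL below are illustrative; list-valued input is standard in OpenAI-compatible SDKs):

import openai

# Assumes the gateway exposes an OpenAI-compatible API at this base URL.
client = openai.OpenAI(base_url="https://gateway.visca.ai/v1", api_key="your-api-key")

texts = ["first document", "second document", "third document"]

# One request consumes a single slot of the requests-per-minute budget,
# instead of len(texts) separate calls.
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
embeddings = [item.embedding for item in response.data]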

Cache Responses

Cache common responses to reduce API calls
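
A minimal in-memory cache keyed on the request payload is often enough; the sketch below is illustrative and assumes reusing a previous completion for an identical request is acceptable:

import hashlib
import json

_cache = {}

def cached_completion(client, **kwargs):
    # Key on the full request payload so only identical requests share a result.
    key = hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(**kwargs)
    return _cache[key]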

Use Streaming

Streaming does not reduce your rate limit consumption, but it improves perceived latency
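
With an OpenAI-compatible client, streaming looks like the sketch below (the base URL is an assumption); tokens still count against your TPM budget, but the first tokens arrive sooner:

import openai

# Assumes the gateway exposes an OpenAI-compatible API at this base URL.
client = openai.OpenAI(base_url="https://gateway.visca.ai/v1", api_key="your-api-key")

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)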

Monitor Usage

Set up alerts for approaching rate limits

Common Errors

Rate limit exceeded (429)

Cause: Exceeded requests per minute or tokens per minute
Solution: Implement exponential backoff and check the Retry-After header
{
  "error": {
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "type": "rate_limit_error",
    "code": 429
  }
}
Daily request limit reached

Cause: Exceeded daily request limit on free tier
Solution: Upgrade to Pro plan or wait until the limit resets
Concurrent request limit reached

Cause: All concurrent request slots in use
Solution: Wait for in-flight requests to complete or upgrade plan

Enterprise Features

Unlimited Requests

No rate limits on number of requests

Custom Token Limits

Set token limits based on your needs

Priority Queue

Your requests processed first during high load

Dedicated Capacity

Reserved infrastructure for your workload

Request Prioritization

Set request priority to control ordering:
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Urgent request"}],
    extra_body={
        "metadata": {
            "priority": "high"  # low, normal, high
        }
    }
)

Troubleshooting

Debug Rate Limiting Issues

Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)

# Logs will show rate limit headers
response = client.chat.completions.create(...)

Check Current Limits

Query your current limits:
curl https://gateway.visca.ai/v1/api-keys/current \
  -H "Authorization: Bearer your-api-key"
Response:
{
  "key_id": "key_abc123",
  "rate_limits": {
    "requests_per_minute": 600,
    "tokens_per_minute": 400000,
    "max_concurrent_requests": 20,
    "daily_request_limit": null
  }
}

Next Steps