
Overview

Visca AI Gateway’s intelligent routing automatically selects the best provider for each request based on your specified strategy. This ensures optimal cost, latency, and reliability for your AI applications.

Cost Optimization

Save up to 90% by routing to the cheapest provider

Low Latency

Route to the fastest provider for your region

High Availability

Automatic failover when providers are down

Load Balancing

Distribute load across multiple providers

Routing Strategies

Cost-Optimized Routing

Automatically routes requests to the cheapest provider that supports the requested model capabilities.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.visca.ai/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="gpt-4o",  # Will route to cheapest GPT-4o equivalent
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Routing-Strategy": "cost-optimized"
    }
)
How it works:
  1. Analyzes model requirements (context length, capabilities)
  2. Finds equivalent models across providers
  3. Compares pricing per 1M tokens
  4. Routes to the cheapest option
Savings example:
  • OpenAI GPT-4o: $5.00 / 1M input tokens
  • Alternative provider: $0.50 / 1M input tokens
  • Savings: 90%
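The per-1M-token comparison in steps 3-4 can be sketched as follows. The provider names and prices below are illustrative placeholders, not Visca's actual pricing table:

```python
# Illustrative pricing per 1M input tokens (USD); not actual Visca data.
PRICING = {
    "openai/gpt-4o": 5.00,
    "provider-b/gpt-4o-equivalent": 0.50,
    "provider-c/gpt-4o-equivalent": 0.90,
}

def cheapest_provider(pricing: dict[str, float]) -> tuple[str, float]:
    """Return the provider with the lowest price per 1M input tokens."""
    provider = min(pricing, key=pricing.get)
    return provider, pricing[provider]

provider, price = cheapest_provider(PRICING)
savings = 1 - price / max(PRICING.values())
print(f"{provider}: ${price:.2f}/1M tokens ({savings:.0%} cheaper than the priciest)")
```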

Latency-Optimized Routing

Routes to the provider with the lowest latency for your geographic region.
response = client.chat.completions.create(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Quick response needed"}],
    extra_headers={
        "X-Routing-Strategy": "latency-optimized",
        "X-User-Region": "us-west"  # Optional: specify region
    }
)
How it works:
  1. Continuously monitors provider latency
  2. Considers geographic distance
  3. Routes to fastest responding provider
  4. Adapts to real-time performance
Typical latency improvements:
  • Standard routing: 500-1000ms
  • Latency-optimized: 150-300ms
  • Improvement: up to 3x faster
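One common way the "adapts to real-time performance" step can work is an exponentially weighted moving average (EWMA) of observed latencies per provider. This is a sketch of that idea, not Visca's actual algorithm; the latency samples are invented:

```python
# Sketch: pick the provider with the lowest smoothed latency (EWMA).
class LatencyTracker:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha               # weight given to the newest sample
        self.ewma: dict[str, float] = {}

    def record(self, provider: str, latency_ms: float) -> None:
        prev = self.ewma.get(provider)
        self.ewma[provider] = (
            latency_ms if prev is None
            else self.alpha * latency_ms + (1 - self.alpha) * prev
        )

    def fastest(self) -> str:
        return min(self.ewma, key=self.ewma.get)

tracker = LatencyTracker()
for provider, ms in [("a", 500), ("b", 200), ("a", 450), ("b", 800)]:
    tracker.record(provider, ms)
print(tracker.fastest())
```

Smoothing matters here: provider "b" stays preferred despite one slow sample, because the average damps outliers rather than reacting to every spike.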

Priority-Based Routing

Define a custom provider preference order with automatic failover.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Routing-Strategy": "priority-based",
        "X-Provider-Priority": "openai,anthropic,google"
    }
)
Use cases:
  • Compliance requirements (prefer specific providers)
  • Contract commitments (use specific quotas first)
  • Quality preferences (prioritize certain providers)

Load-Balanced Routing

Distribute requests evenly across multiple providers to maximize throughput and reliability.
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Routing-Strategy": "load-balanced",
        "X-Load-Balance-Providers": "groq,together,fireworks"
    }
)
Benefits:
  • Avoid rate limits on individual providers
  • Better resilience during high traffic
  • Improved overall throughput
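The simplest distribution scheme over the providers named in `X-Load-Balance-Providers` is round-robin; a weighted or least-loaded scheme would follow the same shape. A sketch:

```python
import itertools

# Sketch: round-robin across the load-balanced provider pool.
providers = ["groq", "together", "fireworks"]
rotation = itertools.cycle(providers)

# Each incoming request takes the next provider in the cycle,
# so every provider receives every third request.
assignments = [next(rotation) for _ in range(6)]
print(assignments)
```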

Automatic Failover

If a provider is unavailable or returns an error, requests automatically fail over to a backup provider.
  1. Primary provider fails: the request sent to the primary provider returns a 503 or times out.
  2. Automatic retry: the gateway immediately retries with the next available provider.
  3. Seamless response: the user receives a response without knowing a failover occurred.
  4. Health monitoring: the failed provider is marked as unhealthy, and health checks resume.

Configuring Failover

Failover is enabled by default with these settings:
  • Max retries: 3
  • Timeout: 30 seconds
  • Backoff: Exponential (1s, 2s, 4s)
  • Fallback providers: Automatic selection
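The defaults above (3 retries, exponential backoff of 1s, 2s, 4s) correspond to a retry loop like this client-side sketch; `call_provider` is a hypothetical stand-in for the actual request, not a Visca API:

```python
import time

def with_failover(call_provider, providers, max_retries=3, base_delay=1.0):
    """Try providers in order, backing off exponentially between attempts."""
    last_error = None
    for attempt, provider in enumerate(providers[: max_retries + 1]):
        try:
            return call_provider(provider)
        except Exception as err:  # 503, timeout, rate limit, ...
            last_error = err
            if attempt < max_retries:
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s
    raise last_error

# Example: the first provider always fails, the second succeeds.
def call_provider(name):
    if name == "openai":
        raise RuntimeError("503 Service Unavailable")
    return f"response from {name}"

print(with_failover(call_provider, ["openai", "anthropic"], base_delay=0.01))
```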

Failover Scenarios

Scenario: OpenAI experiences an outage
Behavior:
  1. Request fails with 503 error
  2. Gateway routes to Anthropic Claude
  3. Response returned seamlessly
  4. OpenAI marked unhealthy for 5 minutes
Scenario: Hit rate limit on primary provider
Behavior:
  1. Receives a 429 rate limit error
  2. Immediately routes to a backup provider
  3. Original provider recovers after the rate limit window
Scenario: Provider takes too long to respond
Behavior:
  1. Request times out after 30 seconds
  2. Gateway cancels and retries with a faster provider
  3. Slow provider's latency is tracked for future routing
Scenario: Provider returns a malformed response
Behavior:
  1. Gateway detects invalid JSON/format
  2. Automatically retries with another provider
  3. Logs error for investigation

Model Equivalency

Gateway automatically maps model requests to equivalent models across providers:
Original Request    Alternative Providers
gpt-4o              Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro
gpt-4o-mini         Claude 3 Haiku, Gemini 1.5 Flash
gpt-3.5-turbo       Llama 3.1 70B, Mixtral 8x7B
claude-3-opus       GPT-4 Turbo, Gemini 1.5 Pro
Model equivalency considers:
  • Context window size
  • Capabilities (vision, function calling)
  • Performance characteristics
  • Output quality
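The table above amounts to a lookup keyed by the requested model. A sketch of that mapping (model identifiers here are shortened for readability; selection among the equivalents would then follow the active routing strategy):

```python
# Equivalency map mirroring the table above.
EQUIVALENTS = {
    "gpt-4o": ["claude-3-5-sonnet", "gemini-1.5-pro"],
    "gpt-4o-mini": ["claude-3-haiku", "gemini-1.5-flash"],
    "gpt-3.5-turbo": ["llama-3.1-70b", "mixtral-8x7b"],
    "claude-3-opus": ["gpt-4-turbo", "gemini-1.5-pro"],
}

def candidate_models(requested: str) -> list[str]:
    """The requested model first, then its cross-provider equivalents."""
    return [requested, *EQUIVALENTS.get(requested, [])]

print(candidate_models("gpt-4o"))
```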

Advanced Routing Rules

Conditional Routing

Route based on request characteristics:
import json

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    extra_headers={
        "X-Routing-Rules": json.dumps({
            "if_tokens_over": 50000,
            "use_provider": "anthropic",  # Claude has 200K context
            "else_strategy": "cost-optimized"
        })
    }
)

Time-Based Routing

Route differently based on time of day or day of week:
import json

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Routing-Schedule": json.dumps({
            "weekday_9to5": "openai",  # Business hours: premium
            "other": "cost-optimized"   # Off-hours: save cost
        })
    }
)

Budget-Based Routing

Set spending caps and automatically switch to cheaper providers:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Budget-Limit": "100.00",  # USD
        "X-Budget-Period": "monthly",
        "X-Budget-Exceeded-Strategy": "cheapest"
    }
)

Monitoring Routing Performance

View Routing Analytics

Access your dashboard to see:
  • Cost savings from intelligent routing
  • Latency improvements by strategy
  • Failover statistics and success rates
  • Provider health and uptime metrics

Routing Metrics API

curl https://api.visca.ai/v1/analytics/routing \
  -H "Authorization: Bearer $VISCA_API_KEY"
Response:
{
	"period": "last_30_days",
	"total_requests": 1000000,
	"routing_breakdown": {
		"cost_optimized": 650000,
		"latency_optimized": 250000,
		"priority_based": 75000,
		"load_balanced": 25000
	},
	"cost_savings": {
		"total_saved": 45000.0,
		"percentage": 72
	},
	"average_latency": {
		"standard": 487,
		"latency_optimized": 189
	},
	"failover_stats": {
		"total_failovers": 1250,
		"success_rate": 0.998,
		"most_common_reason": "rate_limit"
	}
}
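A quick way to sanity-check the numbers in a response like the one above; the payload here is the sample shown, hard-coded rather than fetched from the API:

```python
# Summarize the sample /v1/analytics/routing payload shown above.
stats = {
    "total_requests": 1_000_000,
    "routing_breakdown": {
        "cost_optimized": 650_000,
        "latency_optimized": 250_000,
        "priority_based": 75_000,
        "load_balanced": 25_000,
    },
    "failover_stats": {"total_failovers": 1250, "success_rate": 0.998},
}

share = {
    strategy: count / stats["total_requests"]
    for strategy, count in stats["routing_breakdown"].items()
}
failover_rate = stats["failover_stats"]["total_failovers"] / stats["total_requests"]
print(f"cost-optimized share: {share['cost_optimized']:.0%}")  # 65%
print(f"failover rate: {failover_rate:.3%}")                   # 0.125%
```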

Best Practices

  • Use cost-optimized for batch processing
  • Set budget limits to prevent overruns
  • Monitor savings in dashboard
  • Consider model equivalency tradeoffs
  • Use latency-optimized for user-facing features
  • Specify user regions for better routing
  • Enable caching for repeated queries
  • Monitor latency metrics
  • Always enable failover for production
  • Use load-balanced for high-traffic applications
  • Set appropriate timeout values
  • Monitor failover rates
import time

# Test different strategies
strategies = ["cost-optimized", "latency-optimized", "load-balanced"]

for strategy in strategies:
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Test"}],
        extra_headers={"X-Routing-Strategy": strategy}
    )
    latency = time.time() - start
    print(f"{strategy}: {latency:.2f}s")

Configuration Examples

Startup Cost Optimization

import os

from openai import OpenAI

# Route everything to cheapest providers
client = OpenAI(
    base_url="https://api.visca.ai/v1",
    api_key=os.environ.get("VISCA_API_KEY"),
    default_headers={
        "X-Routing-Strategy": "cost-optimized",
        "X-Budget-Limit": "500.00",
        "X-Budget-Period": "monthly"
    }
)

Enterprise High Availability

# Maximum reliability with priority failover
client = OpenAI(
    base_url="https://api.visca.ai/v1",
    api_key=os.environ.get("VISCA_API_KEY"),
    default_headers={
        "X-Routing-Strategy": "priority-based",
        "X-Provider-Priority": "openai,anthropic,google",
        "X-Failover-Enabled": "true",
        "X-Failover-Max-Retries": "5"
    }
)

Global Application

# Latency-optimized with regional routing
def get_client(user_region):
    return OpenAI(
        base_url="https://api.visca.ai/v1",
        api_key=os.environ.get("VISCA_API_KEY"),
        default_headers={
            "X-Routing-Strategy": "latency-optimized",
            "X-User-Region": user_region,
            "X-Failover-Enabled": "true"
        }
    )

Troubleshooting

Check response headers for routing information. The OpenAI SDK exposes HTTP headers through its raw-response accessor:
raw = client.chat.completions.with_raw_response.create(...)
print(raw.headers.get("X-Provider-Used"))
print(raw.headers.get("X-Routing-Strategy-Applied"))
  • Verify the cost-optimized strategy is enabled
  • Check if model equivalents are available
  • Review provider pricing in the dashboard
  • Ensure failover is not overriding the strategy
  • Switch to the latency-optimized strategy
  • Check provider health status
  • Verify the user region is correct
  • Consider geographic proximity to providers
  • Check provider health dashboard
  • Increase timeout values
  • Review error logs for patterns
  • Consider different provider priority
