Overview

Extend Visca AI Gateway by integrating your own AI models, self-hosted providers, or custom endpoints alongside managed providers like OpenAI and Anthropic.

  • Bring Your Own Models: host your own fine-tuned or custom models
  • Unified Interface: access all models through one API
  • Intelligent Routing: route between managed and custom providers
  • Cost Control: use custom models for cost-sensitive workloads

Supported Custom Providers

  • Self-hosted OpenAI-compatible APIs (vLLM, Text Generation Inference)
  • Local models (Ollama, LocalAI, LM Studio)
  • Custom endpoints (Your own model API)
  • Other cloud providers (Together AI, Replicate, Hugging Face)
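
Anything that speaks the OpenAI chat completions API can be plugged in. For example, a local Ollama server exposes an OpenAI-compatible endpoint that you can call directly before registering it as a provider (a quick sketch; llama3 stands in for whatever model you have pulled locally):
import openai

# Ollama serves an OpenAI-compatible API on localhost:11434.
# The API key is unused by Ollama, but the client requires a value.
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3",  # any model you have pulled locally
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)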

Adding a Custom Provider

1. Navigate to Providers: go to Settings → Custom Providers in your dashboard.
2. Add Provider: click “Add Custom Provider” and enter the provider details.
3. Configure Endpoint: enter the base URL, authentication, and model mappings.
4. Test Connection: run a test request to verify the configuration (see the sketch after this list).
5. Enable Routing: add the custom models to your routing configuration.
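
If you prefer to verify the endpoint outside the dashboard, here is a minimal sketch that points the OpenAI client straight at the provider, using the example values from the Configuration section below:
import openai

# Talk to the custom provider directly (not through the gateway)
# to confirm it accepts requests before enabling routing.
provider = openai.OpenAI(
    base_url="https://my-vllm.example.com/v1",  # example base_url from the config below
    api_key="your-api-key",
)

response = provider.chat.completions.create(
    model="llama-3-70b",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(response.choices[0].message.content)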

Configuration

A custom provider configuration looks like the following:
{
  "name": "my-vllm-server",
  "type": "openai-compatible",
  "base_url": "https://my-vllm.example.com/v1",
  "api_key": "your-api-key",
  "models": [
    {
      "id": "llama-3-70b",
      "name": "Llama 3 70B",
      "context_window": 8192,
      "cost_per_1k_input": 0.0001,
      "cost_per_1k_output": 0.0002
    }
  ]
}
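
The cost_per_1k_input and cost_per_1k_output fields drive cost tracking and cost-optimized routing. Assuming they are USD per 1,000 tokens (an assumption, not a documented guarantee), a request's cost works out roughly as in this sketch:
def estimate_cost(input_tokens: int, output_tokens: int,
                  cost_per_1k_input: float = 0.0001,
                  cost_per_1k_output: float = 0.0002) -> float:
    """Rough per-request cost, assuming cost_per_1k_* are USD per 1,000 tokens."""
    return (input_tokens / 1000) * cost_per_1k_input \
         + (output_tokens / 1000) * cost_per_1k_output

print(round(estimate_cost(1200, 300), 6))  # 0.00012 input + 0.00006 output = 0.00018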

Using Custom Models

Once configured, use custom models like any other:
import openai

client = openai.OpenAI(
    base_url="https://gateway.visca.ai/v1",
    api_key="your-api-key"
)

# Use your custom model
response = client.chat.completions.create(
    model="llama-3-70b",  # Your custom model ID
    messages=[{"role": "user", "content": "Hello!"}]
)

Routing with Custom Models

Mix custom and managed providers:
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Analyze this data..."}],
    extra_body={
        "routing": {
            "strategy": "cost-optimized",
            "providers": ["my-vllm-server", "openai", "anthropic"],
            "fallback": True
        }
    }
)

Request/Response Transforms

Transform requests for non-OpenAI-compatible APIs:
def transform_request(openai_request: dict) -> dict:
    """Map an OpenAI-style chat request to the provider's native payload."""
    return {
        # Use the latest message's content as the prompt for a completion-style API
        "prompt": openai_request["messages"][-1]["content"],
        "max_tokens": openai_request.get("max_tokens", 100),
        "temperature": openai_request.get("temperature", 0.7)
    }
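
The response direction is the mirror image: map whatever the provider returns into the OpenAI chat completion shape the gateway hands back to clients. A sketch, where the provider-side text and usage fields are assumptions for illustration:
def transform_response(provider_response: dict) -> dict:
    """Map a provider-native completion into an OpenAI-style chat completion."""
    return {
        "object": "chat.completion",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    # "text" is a placeholder for wherever the provider puts its output
                    "content": provider_response["text"],
                },
                "finish_reason": "stop",
            }
        ],
        "usage": provider_response.get("usage", {}),
    }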

Authentication Methods

Bearer token authentication is configured like this:
{
  "auth_type": "bearer",
  "api_key": "${API_KEY}"
}
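
With auth_type set to bearer, the resolved key is sent as a standard Authorization header on each outbound request. A minimal sketch, assuming ${API_KEY} maps to an environment variable of the same name:
import os

api_key = os.environ["API_KEY"]  # value substituted for the ${API_KEY} placeholder

# Bearer auth is just a standard Authorization header on each request to the provider.
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}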

Health Checks

Configure automatic health monitoring:
{
  "health_check": {
    "enabled": true,
    "endpoint": "/health",
    "interval_seconds": 60,
    "timeout_seconds": 5,
    "failure_threshold": 3
  }
}
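
These settings describe a simple polling loop: probe the endpoint every interval_seconds, treat timeouts and non-200 responses as failures, and mark the provider unhealthy after failure_threshold consecutive failures. A rough sketch of that logic (not the gateway's actual implementation):
import time
import requests

def monitor(base_url: str, endpoint: str = "/health",
            interval_seconds: int = 60, timeout_seconds: int = 5,
            failure_threshold: int = 3) -> None:
    """Poll a provider's health endpoint and flag it when failures cross the threshold."""
    consecutive_failures = 0
    while True:
        try:
            resp = requests.get(base_url + endpoint, timeout=timeout_seconds)
            healthy = resp.status_code == 200
        except requests.RequestException:
            healthy = False

        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            print("provider unhealthy; traffic should fail over")
        time.sleep(interval_seconds)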

Load Balancing

Distribute load across multiple custom endpoints:
{
  "name": "llama-3-cluster",
  "type": "openai-compatible",
  "endpoints": [
    {
      "base_url": "https://llama-1.example.com/v1",
      "weight": 1
    },
    {
      "base_url": "https://llama-2.example.com/v1",
      "weight": 1
    },
    {
      "base_url": "https://llama-3.example.com/v1",
      "weight": 2
    }
  ]
}
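
Assuming weights are proportional traffic shares, the third endpoint above receives about half of the requests and the other two a quarter each. A minimal sketch of weighted selection:
import random

endpoints = [
    {"base_url": "https://llama-1.example.com/v1", "weight": 1},
    {"base_url": "https://llama-2.example.com/v1", "weight": 1},
    {"base_url": "https://llama-3.example.com/v1", "weight": 2},
]

def pick_endpoint() -> str:
    """Choose an endpoint with probability proportional to its weight."""
    weights = [e["weight"] for e in endpoints]
    chosen = random.choices(endpoints, weights=weights, k=1)[0]
    return chosen["base_url"]

print(pick_endpoint())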

Best Practices

  • Enable health checks for all custom providers
  • Set appropriate timeouts (custom models may be slower)
  • Monitor error rates and latency
  • Set up alerts for provider downtime
  • Configure accurate cost_per_1k_input and cost_per_1k_output values for billing
  • Use custom models for high-volume, cost-sensitive workloads
  • Fall back to managed providers for mission-critical requests
  • Monitor actual costs against configured costs
  • Store API keys in environment variables
  • Use HTTPS for all custom endpoints
  • Implement rate limiting on custom providers
  • Rotate credentials regularly
  • Deploy custom models close to the gateway for low latency
  • Use connection pooling for custom endpoints
  • Configure appropriate timeout values (see the client sketch after this list)
  • Load balance across multiple instances
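
On the client side, timeouts and retries can be set directly on the OpenAI SDK client; the values below are illustrative, not recommendations:
import openai

# Longer timeout for self-hosted models, with a couple of automatic retries.
client = openai.OpenAI(
    base_url="https://gateway.visca.ai/v1",
    api_key="your-api-key",
    timeout=60.0,     # seconds; custom models may respond more slowly than managed ones
    max_retries=2,    # the SDK retries connection errors and certain 5xx responses
)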

Example Integrations

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --host 0.0.0.0 \
  --port 8000

Register the server as a custom provider:

{
  "name": "vllm-llama2",
  "type": "openai-compatible",
  "base_url": "http://localhost:8000/v1",
  "models": [{"id": "meta-llama/Llama-2-70b-hf"}]
}

Troubleshooting

  • Verify the base_url is accessible from the gateway
  • Check firewall rules and network connectivity
  • Ensure the custom provider is running
  • Test the endpoint directly with curl or Postman first (see the sketch after this list)
  • Verify the API key is correct
  • Check that the header format matches the provider's requirements
  • Ensure OAuth tokens are not expired
  • Review the provider-specific authentication documentation
  • Configure the correct response transform
  • Check that the provider returns the expected format
  • Enable debug logging to inspect raw responses
  • Validate responses against the OpenAI API spec
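
For connection and authentication issues, a direct request to the provider's model listing endpoint (bypassing the gateway) usually isolates the problem. A sketch using the example endpoint from the Configuration section, assuming an OpenAI-compatible server:
import requests

base_url = "https://my-vllm.example.com/v1"  # the provider's base_url, not the gateway
headers = {"Authorization": "Bearer your-api-key"}

# OpenAI-compatible servers expose GET /models; a 200 here means the endpoint
# is reachable and the credentials are accepted.
resp = requests.get(f"{base_url}/models", headers=headers, timeout=10)
print(resp.status_code)
print(resp.text)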

Next Steps

  • Model Routing: configure intelligent routing with custom models
  • Load Balancing: distribute load across providers
  • Monitoring: monitor custom provider performance
  • Cost Tracking: track costs for custom models