
Overview

Extend Visca AI Gateway by integrating your own AI models, self-hosted providers, or custom endpoints alongside managed providers like OpenAI and Anthropic.

  • Bring Your Own Models: Host your own fine-tuned or custom models.
  • Unified Interface: Access all models through one API.
  • Intelligent Routing: Route between managed and custom providers.
  • Cost Control: Use custom models for cost-sensitive workloads.

Supported Custom Providers

  • Self-hosted OpenAI-compatible APIs (vLLM, Text Generation Inference)
  • Local models (Ollama, LocalAI, LM Studio)
  • Custom endpoints (your own model API)
  • Other cloud providers (Together AI, Replicate, Hugging Face)

Adding a Custom Provider

1. Navigate to Providers: Go to Settings → Custom Providers in your dashboard.
2. Add Provider: Click “Add Custom Provider” and enter the provider details.
3. Configure Endpoint: Enter the base URL, authentication, and model mappings.
4. Test Connection: Run a test request to verify the configuration (see the sketch below).
5. Enable Routing: Add the custom models to your routing configuration.
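
The connection test in step 4 can also be run by hand before enabling routing. A minimal sketch, assuming the endpoint, API key, and model ID from the configuration example in the next section:

import openai

# Point the client directly at the custom provider (not the gateway)
# to confirm the endpoint is reachable and OpenAI-compatible.
client = openai.OpenAI(
    base_url="https://my-vllm.example.com/v1",  # your provider's base URL
    api_key="your-provider-api-key",
)

response = client.chat.completions.create(
    model="llama-3-70b",  # a model ID the provider actually serves
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(response.choices[0].message.content)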

Configuration

Each provider definition specifies a name, type, base URL, authentication, and the models it exposes:
{
  "name": "my-vllm-server",
  "type": "openai-compatible",
  "base_url": "https://my-vllm.example.com/v1",
  "api_key": "your-api-key",
  "models": [
    {
      "id": "llama-3-70b",
      "name": "Llama 3 70B",
      "context_window": 8192,
      "cost_per_1k_input": 0.0001,
      "cost_per_1k_output": 0.0002
    }
  ]
}
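
Before pasting a definition into the dashboard, it can help to sanity-check the JSON locally. A small, hypothetical helper (not part of the gateway API) that verifies the fields used above:

import json

REQUIRED_PROVIDER_KEYS = {"name", "type", "base_url", "models"}

def validate_provider_config(path: str) -> None:
    # Load a provider definition and check the top-level fields shown
    # in the example configuration above.
    with open(path) as f:
        config = json.load(f)

    missing = REQUIRED_PROVIDER_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for model in config["models"]:
        if "id" not in model:
            raise ValueError("every model entry needs an 'id'")

validate_provider_config("my-vllm-server.json")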

Using Custom Models

Once configured, use custom models like any other:
import openai

client = openai.OpenAI(
    base_url="https://gateway.visca.ai/v1",
    api_key="your-api-key"
)

# Use your custom model
response = client.chat.completions.create(
    model="llama-3-70b",  # Your custom model ID
    messages=[{"role": "user", "content": "Hello!"}]
)
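
Streaming works the same way, assuming the gateway proxies OpenAI-style streaming for custom models unchanged:

# Continues the example above: stream tokens from the custom model
stream = client.chat.completions.create(
    model="llama-3-70b",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)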

Routing with Custom Models

Mix custom and managed providers:
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Analyze this data..."}],
    extra_body={
        "routing": {
            "strategy": "cost-optimized",
            "providers": ["my-vllm-server", "openai", "anthropic"],
            "fallback": True
        }
    }
)

Request/Response Transforms

Transform requests for non-OpenAI-compatible APIs:
def transform_request(openai_request):
    # Map an OpenAI-style chat request onto a simple prompt-based API:
    # send only the latest message and carry over common parameters.
    return {
        "prompt": openai_request["messages"][-1]["content"],
        "max_tokens": openai_request.get("max_tokens", 100),
        "temperature": openai_request.get("temperature", 0.7)
    }
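
The reverse direction is handled the same way. A hedged sketch of a response transform (the upstream field names and the transform_response hook are assumptions for illustration) that maps a provider's reply back into the OpenAI chat-completion shape clients expect:

import time
import uuid

def transform_response(provider_response, model_id):
    # Map a hypothetical {"text": ..., "usage": {...}} reply back into
    # the OpenAI chat-completion shape returned to clients.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": provider_response["text"]},
            "finish_reason": "stop",
        }],
        "usage": provider_response.get("usage", {}),
    }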

Authentication Methods

Custom providers support three authentication methods:

  • API Key (bearer token)
  • Custom Headers
  • OAuth 2.0

For example, bearer-token authentication:
{
  "auth_type": "bearer",
  "api_key": "${API_KEY}"
}
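
For reference, a rough sketch of the headers each method typically adds to the gateway's outbound requests (everything except the standard Authorization bearer header is an illustrative assumption; check your provider's requirements):

import os

def build_auth_headers(auth_type: str) -> dict:
    # Illustrative only: the kind of header each auth method attaches
    # to requests sent from the gateway to your provider.
    if auth_type == "bearer":
        return {"Authorization": f"Bearer {os.environ['API_KEY']}"}
    if auth_type == "custom-header":
        return {"X-Custom-Key": os.environ["API_KEY"]}  # hypothetical header name
    if auth_type == "oauth2":
        # OAuth 2.0 access tokens are usually sent as bearer tokens too,
        # after being obtained from the provider's token endpoint.
        return {"Authorization": f"Bearer {os.environ['OAUTH_ACCESS_TOKEN']}"}
    raise ValueError(f"unknown auth_type: {auth_type}")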

Health Checks

Configure automatic health monitoring:
{
  "health_check": {
    "enabled": true,
    "endpoint": "/health",
    "interval_seconds": 60,
    "timeout_seconds": 5,
    "failure_threshold": 3
  }
}
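
Conceptually, this makes the gateway run a loop like the following against each provider; a simplified sketch of the behavior, not the gateway's actual implementation:

import time
import requests

def monitor(base_url, endpoint="/health", interval=60, timeout=5, failure_threshold=3):
    # Poll the provider's health endpoint; after `failure_threshold`
    # consecutive failures the provider is taken out of routing.
    failures = 0
    while True:
        try:
            resp = requests.get(base_url + endpoint, timeout=timeout)
            failures = 0 if resp.ok else failures + 1
        except requests.RequestException:
            failures += 1
        if failures >= failure_threshold:
            print("provider unhealthy, removing from routing")
            return
        time.sleep(interval)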

Load Balancing

Distribute load across multiple custom endpoints:
{
  "name": "llama-3-cluster",
  "type": "openai-compatible",
  "endpoints": [
    {
      "base_url": "https://llama-1.example.com/v1",
      "weight": 1
    },
    {
      "base_url": "https://llama-2.example.com/v1",
      "weight": 1
    },
    {
      "base_url": "https://llama-3.example.com/v1",
      "weight": 2
    }
  ]
}
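
With these weights, the third endpoint receives roughly half of the traffic. A sketch of weighted random selection to illustrate the effect (the gateway's actual balancing algorithm may differ):

import random

endpoints = [
    {"base_url": "https://llama-1.example.com/v1", "weight": 1},
    {"base_url": "https://llama-2.example.com/v1", "weight": 1},
    {"base_url": "https://llama-3.example.com/v1", "weight": 2},
]

def pick_endpoint():
    # Weighted random choice: llama-3 (weight 2) gets ~50% of requests,
    # the other two endpoints ~25% each.
    weights = [e["weight"] for e in endpoints]
    return random.choices(endpoints, weights=weights, k=1)[0]

print(pick_endpoint()["base_url"])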

Best Practices

  • Enable health checks for all custom providers, set appropriate timeouts (custom models may be slower), monitor error rates and latency, and set up alerts for provider downtime.
  • Configure accurate cost_per_1k_input and cost_per_1k_output values for billing, use custom models for high-volume, cost-sensitive workloads, fall back to managed providers for mission-critical requests, and monitor actual costs against configured costs.
  • Store API keys in environment variables (see the sketch after this list), use HTTPS for all custom endpoints, implement rate limiting on custom providers, and rotate credentials regularly.
  • Deploy custom models close to the gateway for low latency, use connection pooling for custom endpoints, configure appropriate timeout values, and load balance across multiple instances.
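
As referenced in the list above, a minimal sketch of a client that reads its key from the environment and allows extra time for slower self-hosted models (the variable name and timeout are assumptions; tune them to your deployment):

import os
import openai

# Read the gateway key from the environment instead of hard-coding it,
# and allow extra time for slower self-hosted models.
client = openai.OpenAI(
    base_url="https://gateway.visca.ai/v1",
    api_key=os.environ["VISCA_API_KEY"],  # assumed variable name
    timeout=120.0,                        # seconds; tune to your models
    max_retries=2,
)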

Example Integrations

  • vLLM
  • Ollama
  • Hugging Face

For example, to integrate a self-hosted vLLM server, start its OpenAI-compatible API:
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --host 0.0.0.0 \
  --port 8000

Then register it as a custom provider:
{
  "name": "vllm-llama2",
  "type": "openai-compatible",
  "base_url": "http://localhost:8000/v1",
  "models": [{"id": "meta-llama/Llama-2-70b-hf"}]
}
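
Before registering the server, you can confirm the endpoint responds and lists the expected model (vLLM accepts any placeholder API key unless one was set with --api-key):

import openai

# Talk to the vLLM server directly to confirm it is up and serving
# the expected model before adding it as a custom provider.
vllm = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

for model in vllm.models.list():
    print(model.id)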

Troubleshooting

  • Connection failures: verify the base_url is reachable from the gateway, check firewall rules and network connectivity, make sure the custom provider is running, and test with curl or Postman first.
  • Authentication errors: verify the API key is correct, check that the header format matches the provider's requirements, ensure OAuth tokens have not expired, and review the provider's auth documentation.
  • Unexpected responses: configure the correct response transform, check that the provider returns the expected format, enable debug logging or inspect the raw response (see the sketch below), and validate against the OpenAI API spec.
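
For the raw-response debugging mentioned above, one option is to bypass the SDK and call the gateway's chat completions endpoint directly, which shows exactly what came back before any client-side parsing (VISCA_API_KEY is an assumed variable name):

import os
import httpx

# Inspect the raw JSON the gateway returns for a custom model; format
# and transform problems are easy to spot in the unparsed body.
resp = httpx.post(
    "https://gateway.visca.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['VISCA_API_KEY']}"},
    json={
        "model": "llama-3-70b",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)
print(resp.status_code)
print(resp.text)  # raw body, before any client-side parsing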
