# LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.
## Why Use a Gateway?

Without a gateway:

- Vendor lock-in (one provider)
- No fallbacks (provider down = app down)
- Hard to A/B test models
- Scattered API keys and configs

With a gateway:

- Single API for 400+ models
- Automatic fallbacks
- Easy model switching
- Unified cost tracking

## Quick Decision

| Need | Solution |
|------|----------|
| Fastest setup, multi-model | OpenRouter |
| Full control, self-hosted | LiteLLM |
| Observability + routing | Helicone |
| Enterprise, guardrails | Portkey |

## OpenRouter (Recommended)

### Why OpenRouter

- **400+ models**: OpenAI, Anthropic, Google, Meta, Mistral, and more
- **Single API**: One key for all providers
- **Automatic fallbacks**: Built-in reliability
- **A/B testing**: Easy model comparison
- **Cost tracking**: Unified billing dashboard
- **Free credits**: $1 free to start

### Setup
1. Sign up at openrouter.ai
2. Get API key from dashboard
3. Add to `.env`:

```bash
OPENROUTER_API_KEY=sk-or-v1-...
```
### Basic Usage

```typescript
// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
```
### With Vercel AI SDK (Recommended)

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
```
### Model IDs

```typescript
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",

  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",

  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",

  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",

  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};
```
### Fallback Chains

```typescript
// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet", // Primary
  "openai/gpt-4o",               // Fallback 1
  "google/gemini-pro-1.5",       // Fallback 2
];

async function callWithFallback(messages: Message[]) {
  for (const model of modelChain) {
    try {
      return await openrouter.chat({ model, messages });
    } catch (error) {
      console.log(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
```
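One gap in the loop above: it only advances when a provider *errors*. If a provider hangs, the chain stalls on it indefinitely. A minimal sketch of adding a per-model timeout — `ChatFn`, `withTimeout`, and `tryModels` are illustrative names, not part of any SDK:

```typescript
// Sketch: fallback chain with a per-model timeout, so a hung provider
// does not stall the whole chain.
type ChatFn = (model: string) => Promise<string>;

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}

async function tryModels(
  models: string[],
  chat: ChatFn,
  timeoutMs = 10_000
): Promise<string> {
  for (const model of models) {
    try {
      return await withTimeout(chat(model), timeoutMs);
    } catch {
      // Timeout or provider error: move on to the next model
    }
  }
  throw new Error("All models failed");
}
```

Pick the timeout to match your latency budget; a timeout that is too tight will burn through the chain on slow-but-healthy providers.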
### Cost Routing

```typescript
// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);

  if (complexity === "simple") {
    // Simple queries → cheap model
    return "openai/gpt-4o-mini"; // ~$0.15/1M tokens
  } else if (complexity === "medium") {
    // Medium → balanced
    return "google/gemini-flash-1.5"; // ~$0.075/1M tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet"; // ~$3/1M tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics
  if (query.length < 50) return "simple";
  if (query.includes("explain") || query.includes("analyze")) return "complex";
  return "medium";
}
```
### A/B Testing

```typescript
// Random assignment
function getModel(userId: string): string {
  const hash = userId.charCodeAt(0) % 100;

  if (hash < 50) {
    return "anthropic/claude-3-5-sonnet"; // 50%
  } else {
    return "openai/gpt-4o"; // 50%
  }
}

// Track which model was used
const model = getModel(userId);
const response = await openrouter.chat({ model, messages });
await analytics.track("llm_call", { model, userId, latency, cost });
```
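Note that `charCodeAt(0)` buckets on the first character only, so IDs like `user-1` and `user-2` always land in the same arm. A sketch of hashing the whole ID instead — FNV-1a is just one simple choice, and `bucket`/`assignModel` are illustrative names:

```typescript
// Sketch: bucket on a hash of the whole user ID so assignment is
// evenly spread yet still sticky per user.
function bucket(userId: string, buckets = 100): number {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return h % buckets;
}

function assignModel(userId: string): string {
  // Same 50/50 split as above, but stable across sessions
  return bucket(userId) < 50
    ? "anthropic/claude-3-5-sonnet"
    : "openai/gpt-4o";
}
```

Stable assignment keeps per-user metrics clean: a given user sees one model for the whole experiment.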
## LiteLLM (Self-Hosted)

### Why LiteLLM

- **Self-hosted**: Full control over data
- **100+ providers**: Same coverage as OpenRouter
- **Load balancing**: Distribute across providers
- **Cost tracking**: Built-in spend management
- **Caching**: Redis or in-memory
- **Rate limiting**: Per-user limits

### Setup

```bash
# Install
pip install litellm[proxy]

# Run proxy
litellm --config config.yaml

# Use as OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:4000
```
### Configuration

```yaml
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced: two entries share one model_name, and requests
  # to "balanced" are distributed across both providers
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...
```
```yaml
# Routing
router_settings:
  routing_strategy: simple-shuffle # or latency-based-routing
  num_retries: 3
  timeout: 30

# Budget limits
litellm_settings:
  max_budget: 100 # $100/month
  budget_duration: monthly
```
### Fallbacks in LiteLLM

```yaml
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
    fallbacks:
      - model_name: fallback-1
        litellm_params:
          model: openai/gpt-4o
      - model_name: fallback-2
        litellm_params:
          model: google/gemini-pro
```
### Usage

```typescript
// Use like the OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet", // Maps to configured model
  messages: [{ role: "user", content: "Hello!" }],
});
```
## Routing Strategies

### 1. Cost-Based Routing

```typescript
const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  return models[Math.floor(Math.random() * models.length)];
}
```
### 2. Latency-Based Routing

```typescript
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);

  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
```
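A caveat on averaging: one slow outlier can drag a model's mean up for many calls. Ranking on a high percentile is often more robust. A minimal sketch using the nearest-rank method — `percentile` is an illustrative helper, not a library call:

```typescript
// Sketch: nearest-rank percentile over recorded latency samples
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return Infinity; // no data: rank last
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(
    sorted.length - 1,
    Math.floor((p / 100) * sorted.length)
  );
  return sorted[idx];
}

// In routeByLatency, rank on e.g. percentile(times, 95) instead of the mean
```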
### 3. Task-Based Routing

```typescript
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",   // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}
```
### 4. Hybrid Routing

```typescript
interface RoutingConfig {
  task: string;
  maxCost: number;
  maxLatency: number;
}

// Assumes a `models` catalog of { id, cost, avgLatency } entries and a
// getTaskScore(modelId, task) lookup defined elsewhere
function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter(m => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter(m => m.avgLatency <= config.maxLatency);

  // Select best for task
  const taskScores = fast.map(m => ({
    model: m.id,
    score: getTaskScore(m.id, config.task),
  }));

  return taskScores.sort((a, b) => b.score - a.score)[0].model;
}
```
## Best Practices

### 1. Always Have Fallbacks

```typescript
// Bad: Single point of failure
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: Fallback chain
const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
for (const model of models) {
  try {
    return await gateway.chat({ model, messages });
  } catch (e) {
    continue;
  }
}
```
### 2. Pin Model Versions

```typescript
// Bad: Model alias can change underneath you
const model = "gpt-4";

// Good: Pinned version
const model = "openai/gpt-4-0125-preview";
```
### 3. Track Costs

```typescript
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
```
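A `calculateCost` helper might look like the sketch below: a per-model price table weighted by input vs output tokens. The prices are illustrative placeholders, not current rates — check each provider's pricing page, as they change often:

```typescript
// Sketch: per-1M-token price lookup (illustrative numbers only)
const pricing: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number }
): number {
  const p = pricing[model];
  if (!p) return 0; // Unknown model: report zero rather than guess
  return (
    (usage.prompt_tokens / 1_000_000) * p.input +
    (usage.completion_tokens / 1_000_000) * p.output
  );
}
```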
### 4. Set Token Limits

```typescript
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500, // Limit output length
});
```
### 5. Use Caching

```yaml
# LiteLLM caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600 # 1 hour
```
## References

- `references/openrouter-guide.md` - OpenRouter deep dive
- `references/litellm-guide.md` - LiteLLM self-hosting
- `references/routing-strategies.md` - Advanced routing patterns
- `references/alternatives.md` - Helicone, Portkey, etc.

## Templates

- `templates/openrouter-config.ts` - TypeScript OpenRouter setup
- `templates/litellm-config.yaml` - LiteLLM proxy config
- `templates/fallback-chain.ts` - Fallback implementation