Your app calls OpenAI. It works. Then a customer asks for Claude. Then the PM wants Gemini for the cheap tier. Now you're maintaining three SDK integrations, three sets of API keys, three retry policies, three billing dashboards, and a switch statement that grows every quarter. An LLM gateway collapses this into a single API layer between your application and every provider you use. This guide covers what a gateway does, how it compares to proxies and routers, the capabilities that matter, and where intelligent model routing fits into the stack.
The Multi-Provider Problem
Most AI applications start with a single provider. You install the OpenAI SDK, hardcode the model name, and ship. The integration takes an afternoon. Then the requirements change.
A customer in a regulated industry needs Anthropic because their security team approved it. Your cost analysis shows that 70% of requests are simple enough for a smaller model, and you're burning money sending them to GPT-4. OpenAI has a 4-hour outage and your entire product goes down because there's no fallback.
Each new provider means a new SDK, new authentication, new error handling, new rate limit logic, and new billing reconciliation. The integration code grows linearly with providers. The operational complexity grows faster than that, because providers interact: when OpenAI's rate limiter kicks in, you want to overflow to Anthropic, but Anthropic has its own rate limits, and now you're writing distributed queue logic in your application layer.
This is the multi-provider problem. Every team that uses LLMs in production hits it eventually. The only question is whether you solve it before or after the first outage.
The single-provider risk
In March 2025, OpenAI experienced a multi-hour API outage. Applications with no fallback provider were completely down. Applications with gateway-level failover to Anthropic or Google continued serving requests within seconds of detecting the failure. The infrastructure cost of running a gateway is trivial compared to the revenue cost of a multi-hour outage.
What an LLM Gateway Does
An LLM gateway sits between your application and LLM providers. Your app sends requests to the gateway. The gateway translates them to the appropriate provider format, sends them, receives the response, and translates it back to a unified format your app understands. One integration point, regardless of how many providers you use.
The translation layer is table stakes. The real value is in what the gateway does between receiving your request and forwarding it to the provider.
Request lifecycle through a gateway
A request hits the gateway. The gateway authenticates it against your API key. It checks rate limits. It checks the cache for an identical recent request. If cached, it returns the cached response. If not, it selects the target provider based on routing rules (model mapping, cost preferences, latency requirements). It formats the request for that provider's API. It sends the request. If the provider returns a rate limit error (429), the gateway retries with backoff or fails over to an alternate provider. The response comes back. The gateway logs it, tracks the cost, caches it if appropriate, and returns it to your app in the unified format.
Your application code sees none of this. It sends a request and gets a response. The failover, retry, caching, logging, and cost tracking happen inside the gateway.
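The lifecycle above can be sketched as a single handler. This is a minimal illustration, not any real gateway's implementation; all names are hypothetical, and retry, failover, logging, and cost tracking are reduced to comments or stubs:

```typescript
// Sketch of the request lifecycle described above. All names are hypothetical.
type GwRequest = { apiKey: string; model: string; prompt: string };
type GwResponse = { text: string; provider: string; cached: boolean };

const cache = new Map<string, GwResponse>();

function selectProvider(model: string): string {
  // Rule-based routing: model name -> provider endpoint
  return model.startsWith("gpt") ? "openai" : "anthropic";
}

async function callProvider(provider: string, req: GwRequest): Promise<GwResponse> {
  // Stand-in for the real HTTP call; a real gateway would translate the
  // request to the provider's schema here and retry/fail over on errors.
  return { text: `echo:${req.prompt}`, provider, cached: false };
}

async function handle(req: GwRequest): Promise<GwResponse> {
  if (!req.apiKey) throw new Error("unauthenticated"); // 1. authenticate
  const key = `${req.model}:${req.prompt}`;            // 2. cache lookup
  const hit = cache.get(key);
  if (hit) return { ...hit, cached: true };            //    cache hit: skip provider
  const provider = selectProvider(req.model);          // 3. routing rules
  const resp = await callProvider(provider, req);      // 4. send to provider
  cache.set(key, resp);                                // 5. log, track cost, cache
  return resp;
}
```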
Before and after: direct integration vs gateway
// BEFORE: Direct provider integration
// Each provider needs its own SDK, error handling, retry logic
import OpenAI from "openai"
import Anthropic from "@anthropic-ai/sdk"
const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY })
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_KEY })
async function chat(prompt: string, provider: string) {
  if (provider === "openai") {
    try {
      return await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: prompt }]
      })
    } catch (e) {
      // Manual failover to Anthropic
      const msg = await anthropic.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 4096,
        messages: [{ role: "user", content: prompt }]
      })
      // ... translate Anthropic format to OpenAI format
      return translateResponse(msg)
    }
  }
  // ... repeat for every provider
}
// AFTER: Gateway integration
// One SDK, one format, failover handled by the gateway
import OpenAI from "openai"
const client = new OpenAI({
  apiKey: process.env.GATEWAY_KEY,
  baseURL: "https://your-gateway.com/v1"
})
async function chat(prompt: string) {
  return await client.chat.completions.create({
    model: "gpt-4o", // Gateway handles provider routing
    messages: [{ role: "user", content: prompt }]
  })
}

The "after" code is half the size and handles more failure modes. The gateway manages the failover logic, retry policies, and provider translation that the "before" code does manually and incompletely.
Gateway vs Proxy vs Router
These three terms get used interchangeably, but they describe different layers with different responsibilities. Understanding the distinction matters because most production setups need all three, and confusing them leads to gaps in your stack.
| Layer | What it does | Intelligence level | Example |
|---|---|---|---|
| Proxy | Forwards requests, maybe caches. Thin pass-through layer. | None. Mechanical forwarding. | Nginx reverse proxy, simple API relay |
| Gateway | Routing logic, failover, rate limiting, auth, observability, cost tracking. | Rule-based. If model=X, route to provider Y. | LiteLLM, Portkey, custom gateway |
| Router | Classifies prompt difficulty, picks the optimal model tier. | ML-based. Analyzes the prompt to select the model. | Morph Router, custom classifier |
Proxy: the thin layer
A proxy gives you a single endpoint. Your app sends requests to the proxy, the proxy forwards them to the provider. It might cache responses, rewrite headers, or handle TLS termination. It does not make decisions about where to route or what to do when the provider fails. An Nginx reverse proxy in front of OpenAI's API is a proxy. It's useful for centralizing credentials and adding basic caching, but it's not enough for multi-provider production use.
Gateway: the operational layer
A gateway adds operational intelligence. It knows about multiple providers and can route between them. When OpenAI returns a 429 (rate limited), the gateway retries or fails over to Anthropic. It tracks costs across providers in a single dashboard. It enforces rate limits per user or per team. It logs every request for debugging and compliance. The routing is rule-based: you configure that model "gpt-4o" goes to OpenAI, with Anthropic as the fallback.
Router: the intelligence layer
A router classifies the prompt itself and picks the right model tier. A gateway routes by model name. A router routes by prompt difficulty. "What is 2+2" goes to a small, fast, cheap model. "Design a distributed transaction system with exactly-once semantics" goes to a frontier model. The router makes this decision per-request, based on the content of the prompt, not a static rule.
This is the difference between infrastructure routing and intelligence routing. A gateway decides where to send a request (which provider endpoint). A router decides what should handle it (which model tier). You can use a gateway without a router (and many teams do). But adding a router on top of a gateway is where the cost savings compound, because you stop sending simple prompts to expensive models.
They compose, they don't compete
A proxy, gateway, and router are three layers of the same stack. Your app calls the router, which classifies the prompt and picks a model. The router passes the request to the gateway, which handles failover, rate limits, and logging. The gateway sends it through the proxy layer to the provider. Each layer solves a different problem. Deploying one does not replace the need for the others.
Key Gateway Capabilities
Not every gateway does everything. Some focus on routing, others on observability, others on cost management. These are the capabilities that matter most in production, ranked by how quickly teams need them after going multi-provider.
Provider failover
When your primary provider returns an error (429 rate limit, 500 server error, timeout), the gateway retries with exponential backoff, then fails over to a secondary provider. This is the single most important gateway capability. Without it, your application's uptime is capped by your least reliable provider.
Good failover is not just "try the next provider." It needs to translate the request format (OpenAI and Anthropic have different message schemas), handle model mapping (gpt-4o fails over to claude-sonnet-4-20250514, not claude-haiku), and respect the fallback provider's rate limits so you don't trigger a cascade failure.
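A sketch of what this looks like: a model-mapping table plus a retry-then-failover wrapper. The mapping entries and retry counts here are illustrative assumptions, not any specific gateway's defaults:

```typescript
// Failover with model mapping, as described above. The mapping pairs
// equivalent tiers (frontier -> frontier), never frontier -> small model.
const fallbackModel: Record<string, string> = {
  "gpt-4o": "claude-sonnet-4-20250514", // assumed pairing, not haiku
  "gpt-4o-mini": "claude-haiku",
};

async function withFailover<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
  retries = 2
): Promise<T> {
  for (let i = 0; i <= retries; i++) {
    try {
      return await primary();
    } catch (e) {
      // Exponential backoff before retrying the primary (429, 500, timeout)
      await new Promise(r => setTimeout(r, 100 * 2 ** i));
    }
  }
  // Primary exhausted: fail over to the mapped fallback model/provider.
  // A real gateway also checks the fallback's own rate limits here.
  return fallback();
}
```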
Rate limiting and queuing
Each provider has rate limits: requests per minute, tokens per minute, concurrent requests. A gateway tracks usage across all your API keys for a provider and queues requests when approaching limits instead of throwing errors. This is especially important when multiple services in your infrastructure share the same provider account.
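One common way to implement "queue instead of erroring" is a token bucket per provider: callers wait for a token rather than receiving a 429. This is a generic sketch of the technique, not any gateway's actual implementation, and the capacities are illustrative:

```typescript
// Token bucket per provider: requests wait for capacity instead of failing.
class TokenBucket {
  private tokens: number;
  private last = Date.now();
  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }
  private refill() {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.last) / 1000) * this.refillPerSec
    );
    this.last = now;
  }
  tryAcquire(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
  async acquire(): Promise<void> {
    // Queue: poll until a token is available rather than throwing a 429
    while (!this.tryAcquire()) {
      await new Promise(r => setTimeout(r, 50));
    }
  }
}
```

A gateway would keep one bucket per provider (and often per token-per-minute limit as well), shared across every service that uses that provider account.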
Cost tracking and budgets
LLM costs are notoriously hard to predict: input tokens, output tokens, cached tokens, and a different price for each model. A gateway logs the token counts and model for every request, giving you a real-time cost dashboard across all providers. Some gateways support budget limits: stop sending requests when a team or project exceeds its monthly budget.
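Per-request cost logging reduces to a pricing table lookup. The rates below are assumptions for illustration; a production gateway loads current prices from the provider or a maintained pricing table:

```typescript
// Per-request cost from token counts. Prices in $/million tokens (assumed).
const pricePerMTok: Record<string, { input: number; output: number }> = {
  "gpt-4o":       { input: 2.5,  output: 10 },
  "claude-haiku": { input: 0.25, output: 1.25 },
};

function requestCost(model: string, inputTok: number, outputTok: number): number {
  const p = pricePerMTok[model];
  if (!p) throw new Error(`no pricing for ${model}`);
  return (inputTok * p.input + outputTok * p.output) / 1_000_000;
}
```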
Caching
Identical prompts return identical responses (at temperature 0). A gateway can cache responses keyed by the request hash and return cached results for duplicate requests. For applications with repetitive queries (chatbots with common questions, code completion with common patterns), caching can reduce costs by 20-40% and cut latency to near-zero for cache hits.
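The cache key is typically a hash over the full request body, so any change in model, messages, or sampling parameters misses the cache. A minimal sketch of that keying (the temperature-0-only rule follows the text above; field names are illustrative):

```typescript
// Cache key = hash of the whole request. Only deterministic requests
// (temperature 0) are cacheable; anything else returns null (no caching).
import { createHash } from "node:crypto";

type ChatRequest = { model: string; temperature: number; messages: unknown[] };

function cacheKey(req: ChatRequest): string | null {
  if (req.temperature !== 0) return null; // non-deterministic: don't cache
  return createHash("sha256").update(JSON.stringify(req)).digest("hex");
}
```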
Logging and observability
Every request through the gateway is logged: prompt, response, model, tokens, latency, cost, status code. This is your audit trail. When a user reports a bad response, you can trace the exact request. When costs spike, you can identify which service and which model caused it. When latency increases, you can see which provider is slow.
Request routing
Beyond simple model-to-provider mapping, gateways support routing by cost (prefer the cheapest provider for a given model), by latency (route to the provider with the lowest p50 latency this hour), by region (EU requests go to EU endpoints), or by load (distribute across providers to stay under each one's rate limits).
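Cost- and latency-based routing boil down to picking the best candidate by one metric. A generic sketch, with made-up candidate data; a real gateway refreshes latency stats continuously rather than hardcoding them:

```typescript
// Rule-based routing: pick the provider candidate that minimizes the
// chosen metric. Candidate numbers are illustrative.
type Candidate = { provider: string; costPerMTok: number; p50LatencyMs: number };

function pick(candidates: Candidate[], strategy: "cost" | "latency"): Candidate {
  const metric = strategy === "cost"
    ? (c: Candidate) => c.costPerMTok
    : (c: Candidate) => c.p50LatencyMs;
  return candidates.reduce((best, c) => (metric(c) < metric(best) ? c : best));
}
```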
Provider failover
OpenAI returns 429? Route to Anthropic within milliseconds. No code changes, no deploys, no downtime. The gateway handles model mapping and request translation automatically.
Unified cost tracking
One dashboard for all providers. Per-request cost logging with model, token count, and provider. Budget limits per team or project. No more reconciling three billing portals.
Request caching
Identical prompts at temperature 0 get cached responses. 20-40% cost reduction for applications with repetitive queries. Near-zero latency on cache hits.
Open Source LLM Gateways
You don't need to build a gateway from scratch. Several open source and managed options exist, each with different strengths. The right choice depends on whether you need a self-hosted proxy, a managed service, or primarily analytics.
| Gateway | Primary strength | Provider support | Deployment |
|---|---|---|---|
| LiteLLM | Unified API for 100+ providers. OpenAI-compatible endpoint. | 100+ (OpenAI, Anthropic, Google, Azure, Bedrock, etc.) | Self-hosted (Python) or managed cloud |
| Portkey | Gateway-as-a-service with built-in observability and caching. | Major providers (OpenAI, Anthropic, Google, Azure) | Managed cloud, with self-hosted option |
| Helicone | Logging, cost tracking, and analytics. Proxy-first approach. | Major providers via proxy passthrough | Managed cloud, self-hosted available |
LiteLLM
LiteLLM is the most widely adopted open source LLM gateway. It provides an OpenAI-compatible endpoint that translates requests to 100+ providers. You point your existing OpenAI SDK at LiteLLM's URL and it handles provider translation, failover, and load balancing. It supports virtual keys for team-level rate limiting and cost tracking. Self-hosted via a Python proxy server, or available as a managed service.
Portkey
Portkey is a managed AI gateway with built-in request caching, automatic retries, and an observability dashboard. It provides a unified API with SDKs in multiple languages and supports conditional routing (route by model, cost, or custom metadata). The managed service means less operational overhead but less control over the infrastructure.
Helicone
Helicone focuses on the observability and analytics side of the gateway stack. It operates as a proxy that logs every request and provides dashboards for cost, latency, and usage patterns. If your primary need is understanding what your LLM usage looks like (which models, which costs, which latency percentiles), Helicone is the lightest-weight option.
Each of these covers a different slice of the gateway stack. LiteLLM gives you the broadest provider support and the most mature failover logic. Portkey gives you managed infrastructure with less operational work. Helicone gives you analytics with minimal integration effort. Many teams combine them: LiteLLM for the unified API, Helicone for analytics on top.
Where Intelligent Routing Fits In
Every gateway listed above routes by rules. You configure: "model gpt-4o goes to OpenAI, with Anthropic as fallback." The gateway follows the rule. It does not look at the prompt. It does not consider whether the prompt actually needs GPT-4o or whether a smaller model would produce the same quality answer.
This is the gap between infrastructure routing and intelligence routing. Infrastructure routing answers "which provider handles this model?" Intelligence routing answers "which model should handle this prompt?"
Consider a coding agent. Most of its prompts are simple: read a file, search for a string, list directory contents. Maybe 20-30% of prompts require real reasoning: architectural decisions, complex refactors, debugging subtle race conditions. If you send every prompt to a frontier model, you're paying frontier prices for routine work. If you send everything to a cheap model, the hard prompts fail.
An intelligent router classifies each prompt into difficulty tiers (easy, medium, hard, needs_info) and picks the appropriate model. The easy prompts go to a fast, cheap model. The hard prompts go to a frontier model. The classification itself costs a fraction of the price difference between tiers.
Morph Router: intelligence layer on top of any gateway
import Morph from "morphllm"
import Anthropic from "@anthropic-ai/sdk"
const morph = new Morph({ apiKey: process.env.MORPH_API_KEY })
const anthropic = new Anthropic()
// Step 1: Morph Router classifies the prompt, picks the model tier
const { model } = await morph.routers.anthropic.selectModel({
  input: userQuery,
  mode: "balanced" // or "aggressive" for max cost savings
})
// Returns: "claude-haiku" for easy, "claude-sonnet" for medium,
// "claude-opus" for hard prompts
// Step 2: Send to your gateway (or directly to provider)
// The gateway handles failover, rate limits, logging
const response = await anthropic.messages.create({
  model,
  max_tokens: 4096,
  messages: [{ role: "user", content: userQuery }]
})

The router adds ~430ms of latency and costs $0.001 per classification. On a workload where 60-70% of prompts are easy or medium, the cost savings from downgrading those prompts to cheaper models far exceed the routing cost. The router is trained on millions of coding prompts and classifies into four tiers: easy (single-fact lookups), medium (multi-step but straightforward), hard (complex reasoning), and needs_info (insufficient context to answer).
Router + gateway: the full stack
The router and gateway compose cleanly. Your app calls the Morph Router to get the optimal model. Then it sends the request through your gateway (LiteLLM, Portkey, or custom). The gateway handles failover, rate limits, caching, and logging. The router handles model selection. Neither needs to know about the other.
Full stack: Morph Router + LiteLLM gateway
import Morph from "morphllm"
import OpenAI from "openai"
const morph = new Morph({ apiKey: process.env.MORPH_API_KEY })
// LiteLLM gateway as the OpenAI-compatible endpoint
const gateway = new OpenAI({
  apiKey: process.env.LITELLM_KEY,
  baseURL: "https://your-litellm-proxy.com/v1"
})
async function routedCompletion(prompt: string) {
  // Intelligence layer: classify prompt, pick model
  const { model } = await morph.routers.anthropic.selectModel({
    input: prompt,
    mode: "balanced"
  })
  // Infrastructure layer: failover, rate limits, logging
  const response = await gateway.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }]
  })
  return response
}

Infrastructure routing (gateway)
Rule-based. If model=X, use provider Y. Handles failover, rate limits, load balancing. Does not look at the prompt content.
Intelligence routing (router)
ML-based. Classifies the prompt into difficulty tiers and picks the right model. ~430ms latency, $0.001/request. Trained on millions of coding prompts.
Building a Production Gateway Stack
A production LLM stack has three layers. The application layer sends requests. The intelligence layer (router) picks the model. The infrastructure layer (gateway) delivers the request with failover, caching, and observability. You can adopt these incrementally.
Stage 1: Gateway only
Start with a gateway when you add your second provider. Point your app at LiteLLM (or your gateway of choice), configure your providers, and set up failover rules. This gives you a unified API, automatic failover, and cost tracking. Most teams should start here.
Stage 2: Add observability
Once traffic flows through the gateway, add logging and analytics. Helicone or Portkey's dashboard shows you cost per model, latency percentiles, error rates, and usage patterns. This data tells you where to optimize next.
Stage 3: Add intelligent routing
Once you see your cost breakdown (which models, which prompts, which volumes), you can identify the optimization opportunity. If 60% of your requests go to a frontier model but only 30% actually need it, a router saves you money on the other 30%. Plug in Morph's model router for the intelligence layer. It classifies each prompt and picks the right model tier before the request hits your gateway.
| Stage | What you add | What you get |
|---|---|---|
| Stage 1 | Gateway (LiteLLM, Portkey) | Unified API, failover, rate limiting, cost tracking |
| Stage 2 | Observability (Helicone, built-in logging) | Cost dashboards, latency monitoring, usage analytics |
| Stage 3 | Intelligent routing (Morph Router) | Per-prompt model selection, 40-60% cost savings on eligible traffic |
What to avoid
Do not build your own gateway from scratch unless you have a team to maintain it. Provider APIs change, rate limit behaviors shift, new models launch monthly. Open source gateways amortize this maintenance across the community. Your time is better spent on your application logic.
Do not over-engineer routing rules. Start with simple failover (primary + fallback provider) and add complexity only when your observability data shows the need. A gateway with 15 conditional routing rules is harder to debug than one with 3.
Frequently Asked Questions
What is an LLM gateway?
A unified API layer between your application and multiple LLM providers. It handles request routing, provider failover, rate limiting, cost tracking, caching, and observability through a single endpoint. Your app calls the gateway instead of individual providers. The gateway translates requests, manages provider-specific quirks, and returns responses in a unified format.
What is the difference between an LLM gateway and an LLM proxy?
A proxy forwards requests. It might cache, rewrite headers, or handle TLS, but it does not make routing decisions or handle failover. A gateway adds operational intelligence: routing logic, provider failover, rate limiting, authentication, cost tracking, and observability. A proxy is a pipe. A gateway is a control plane.
What is the difference between an LLM gateway and an LLM router?
A gateway handles infrastructure: failover, rate limits, cost tracking. It routes by rules (if model=X, use provider Y). A router handles intelligence: it classifies the prompt content and picks the right model tier. A gateway decides where to send a request. A router decides what should handle it. Both are useful. They compose as separate layers in the same stack.
Do I need an LLM gateway if I only use one provider?
Even with one provider, a gateway adds rate limit management, request queuing, cost tracking, caching, and observability. But the main value is future-proofing. When you add a second provider (and you will, because single-provider dependency is a single point of failure), the gateway is already in place and your application code does not change.
What are the best open source LLM gateways?
LiteLLM for the broadest provider support (100+) and most mature unified API. Portkey for managed infrastructure with built-in observability. Helicone for logging and cost analytics with minimal integration effort. Most teams use LiteLLM as the core gateway and add Helicone for analytics. See the comparison table above for details.
How does intelligent routing differ from gateway routing?
Gateway routing is rule-based: model X goes to provider Y. Intelligent routing is content-based: the router analyzes the prompt and picks the optimal model tier. A simple question goes to a fast, cheap model. A complex reasoning task goes to a frontier model. Morph's router does this classification in ~430ms at $0.001/request, trained on millions of coding prompts.
Can I use an LLM gateway with Morph's model router?
Yes. They are separate layers that compose cleanly. The router classifies the prompt and picks the model. The gateway delivers the request with failover, rate limits, and logging. Call the Morph Router first, get the model name, then send the request through your gateway. Neither layer needs to know about the other.
Add Intelligent Routing to Your Gateway
Morph's model router classifies each prompt and picks the right model tier. $0.001/request, ~430ms latency, four difficulty tiers (easy/medium/hard/needs_info). Trained on millions of coding prompts. Works with any gateway: LiteLLM, Portkey, or custom. One API call before your gateway does the rest.