LLM Marketplaces with Automatic Fallbacks: How They Work and What They Cost

Matthew Diakonov · 13 min read


Yes, several LLM marketplaces and API gateways handle automatic fallbacks when your chosen model goes down. The pricing varies from pure pass-through (you pay only what the upstream provider charges) to margin-based markups and flat platform fees. Here is a practical breakdown of what exists, how fallback routing actually works under the hood, and what each option costs.

Why Fallback Routing Matters

When you send a request to an LLM and the provider returns a 500, 503, or times out, your application is stuck. If you built directly against one provider's API, you need to write retry logic, maintain credentials for backup providers, map model capabilities across vendors, and handle the response format differences yourself.

LLM marketplaces solve this by sitting between your application and multiple providers. You send one API call, and the marketplace routes it to your preferred model. If that model is unavailable, the marketplace automatically retries with a fallback model you configured (or one it selects based on capability matching). Your application code stays the same regardless of which provider actually served the response.

Diagram: your app makes a single API call to the LLM gateway (which handles health checks, fallback routing, model mapping, unified billing, and usage logging). Provider A returns 503 (down), Provider B returns 200 (serves the fallback), and Provider C sits on standby.
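The retry loop a gateway runs can be sketched in a few lines. This is an illustrative sketch, not any platform's actual implementation; the provider functions and the error type are stand-ins:

```python
class ProviderError(Exception):
    """Stand-in for a 5xx/429 response or a timeout."""

def call_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append((name, exc))  # a real gateway would log and emit metrics
    raise RuntimeError(f"all providers failed: {errors}")

# Simulate provider A being down while provider B serves the request.
def provider_a(prompt):
    raise ProviderError("503 Service Unavailable")

def provider_b(prompt):
    return f"response to {prompt!r}"

served_by, answer = call_with_fallback("Hello", [("A", provider_a), ("B", provider_b)])
# served_by == "B": the caller never had to know A was down.
```

The caller sees one function and one response shape; which provider answered is an implementation detail surfaced only in logs.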

The Major LLM Marketplaces Compared

Here is a direct comparison of the platforms that offer automatic fallback routing today:

| Platform | Fallback Mechanism | Pricing Model | Markup Over Provider | Self-Hostable |
|---|---|---|---|---|
| OpenRouter | Ordered fallback list per request, automatic retry on failure | Pass-through + small margin | ~5-20% on most models | No |
| Portkey AI | Fallback chains, load balancing, conditional routing | Platform fee + pass-through | $0 markup on tokens (SaaS tier fee) | Yes (enterprise) |
| LiteLLM | Fallback list in config, retry with next provider on error | Free (open source), proxy pass-through | $0 (you bring your own keys) | Yes |
| Martian | Model router that picks the best model per request | Pass-through + routing fee | Variable per request | No |
| Not Diamond | ML-based model selection with fallback | Free tier + usage-based | Included in routing fee | No |
| Helicone | Gateway with retry/fallback, primarily an observability tool | Free tier, paid plans from $20/mo | $0 markup on tokens | Yes |
| Unify AI | Automatic provider fallback with latency-based routing | Pass-through pricing | ~2-5% margin | No |

How Each Platform Handles Fallbacks

OpenRouter

OpenRouter is the most widely used LLM marketplace for individual developers. You send requests to a single endpoint, specify your preferred model (like anthropic/claude-sonnet-4), and OpenRouter routes it to the cheapest available provider for that model.

For automatic fallback, you can pass a route parameter set to "fallback" along with an ordered list of models. If the first model returns an error or is unavailable, OpenRouter tries the next one in your list. The key detail: you pay the price of whichever model actually serves the response, not the one you originally requested.

# OpenRouter fallback example
import os
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-sonnet-4",
        "route": "fallback",
        "models": [
            "anthropic/claude-sonnet-4",
            "openai/gpt-4o",
            "google/gemini-2.5-pro"
        ],
        "messages": [{"role": "user", "content": "Hello"}]
    }
)

OpenRouter charges a margin on top of provider prices. For popular models, this is typically 5-20%. They publish the per-model pricing on their site, so you know exactly what you pay before sending a request.

Portkey AI

Portkey positions itself as an "AI gateway" rather than a marketplace. You bring your own API keys for each provider, and Portkey handles routing, fallbacks, load balancing, retries, and observability. Because you use your own keys, there is zero markup on token costs. You pay Portkey a platform fee instead.

Portkey's fallback configuration is more granular than OpenRouter's. You define a "config" object that specifies primary and fallback targets, retry counts, timeout thresholds, and conditional logic (for example, fall back only on 429 errors but not on 400 errors).

{
  "strategy": {
    "mode": "fallback"
  },
  "targets": [
    {
      "provider": "anthropic",
      "api_key": "sk-ant-...",
      "override_params": {"model": "claude-sonnet-4-20250514"}
    },
    {
      "provider": "openai",
      "api_key": "sk-...",
      "override_params": {"model": "gpt-4o"}
    }
  ]
}

Pricing: free tier covers 10K requests/month. Production plans start around $49/month with higher limits. Enterprise pricing includes self-hosted deployment.
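To send a request through the hosted gateway, a config like the one above rides along in a header. A minimal sketch, assuming Portkey's documented `x-portkey-*` headers and chat completions endpoint (verify the header names against the current docs; provider API keys are omitted here, since Portkey also supports storing them server-side):

```python
import json

PORTKEY_URL = "https://api.portkey.ai/v1/chat/completions"  # assumed endpoint

fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"provider": "anthropic",
         "override_params": {"model": "claude-sonnet-4-20250514"}},
        {"provider": "openai",
         "override_params": {"model": "gpt-4o"}},
    ],
}

headers = {
    "Content-Type": "application/json",
    "x-portkey-api-key": "YOUR_PORTKEY_KEY",          # placeholder
    "x-portkey-config": json.dumps(fallback_config),  # fallback chain travels per-request
}
body = {"messages": [{"role": "user", "content": "Hello"}]}

# With real keys in place you would POST it, e.g.:
# requests.post(PORTKEY_URL, headers=headers, json=body)
```

Because the config is per-request, different routes in your application can carry different fallback chains without redeploying anything.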

LiteLLM

LiteLLM is an open source proxy that normalizes 100+ LLM providers behind a single OpenAI-compatible API. Fallback is built into the routing config: you define a list of model/provider pairs, and LiteLLM tries them in order on failure.

Because it is open source and self-hosted, there is no markup at all. You pay each provider directly at their published rates. The trade-off is that you run and maintain the proxy yourself.

# litellm config.yaml
model_list:
  - model_name: "smart-model"
    litellm_params:
      model: "anthropic/claude-sonnet-4"
      api_key: "sk-ant-..."
  - model_name: "smart-model"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "sk-..."

router_settings:
  routing_strategy: "simple-shuffle"
  num_retries: 2
  fallbacks: [{"smart-model": ["smart-model"]}]

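Once the proxy is running, any OpenAI-compatible client can point at it. A sketch assuming the proxy listens on its default local port 4000 (adjust `PROXY_URL` for your deployment); only the request construction is shown, using nothing beyond the standard library:

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000/v1/chat/completions"  # assumed local deployment

def build_request(prompt: str, model: str = "smart-model") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request aimed at the LiteLLM proxy.
    "smart-model" is the alias from config.yaml; the proxy picks the
    underlying provider and fails over between them."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer sk-local"},  # the proxy's own key, if you set one
        method="POST",
    )

req = build_request("Hello")
# urllib.request.urlopen(req) sends it once the proxy is up.
```

Note that the application only ever names "smart-model"; swapping Claude for GPT-4o underneath is a one-line config change on the proxy.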
Tip

LiteLLM is the best option if you want zero vendor lock-in and zero markup. You can deploy it as a Docker container and point all your applications at it. The downside is you need to manage uptime, updates, and monitoring yourself.

Martian and Not Diamond (ML-Based Routers)

These platforms take a different approach. Instead of you specifying a fallback list, they use machine learning to select the best model for each request based on the prompt content, required capabilities, latency targets, and cost constraints.

Martian analyzes your prompt and routes to the model most likely to produce the best result for that specific task. If the selected model is unavailable, it falls back to the next best option. Pricing is pass-through plus a small routing fee per request.

Not Diamond works similarly, using a trained classifier to pick the optimal model. Their free tier includes a generous number of routing decisions per month, with paid tiers for higher volume.

The advantage of ML-based routing: you do not need to manually maintain fallback lists or know which model is best for which task. The disadvantage: you give up control over exactly which model serves each request, which can matter for consistency.

Pricing Breakdown: What You Actually Pay

The cost structure falls into three categories:

1. Pass-through with markup (OpenRouter, Unify)

You pay the underlying provider's token price plus a percentage margin. For example, if Claude Sonnet costs $3/$15 per million input/output tokens from Anthropic directly, OpenRouter might charge $3.15/$15.75. The margin funds the routing infrastructure.
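The arithmetic is easy to sanity-check. A quick sketch using the $3/$15 Sonnet rates above and an illustrative 5% margin (actual margins vary by model):

```python
base_input, base_output = 3.00, 15.00  # $ per million tokens, provider-direct
margin = 0.05                          # illustrative 5% marketplace markup

marked_input = base_input * (1 + margin)    # 3.15 $/M input tokens
marked_output = base_output * (1 + margin)  # 15.75 $/M output tokens

# Cost of one request with 2,000 input and 500 output tokens at marked-up rates:
cost = (2_000 * marked_input + 500 * marked_output) / 1_000_000
# ~$0.0142, versus $0.0135 provider-direct
```

At a few dollars a month the difference is noise; at thousands of dollars a month the same percentage is real money, which is what drives the volume thresholds in the table below.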

2. Platform fee, no token markup (Portkey, Helicone)

You pay each provider at their exact published rate (using your own API keys) plus a monthly subscription to the gateway service. This is cheaper at high volume because the platform fee is fixed while token costs scale linearly.

3. Free and self-hosted (LiteLLM)

You pay only the upstream providers. Zero middleman cost. You absorb the operational cost of running the proxy.

| Monthly Token Spend | Best Option | Why |
|---|---|---|
| Under $50/mo | OpenRouter | Convenience outweighs the small margin |
| $50 - $500/mo | Portkey or Unify | Platform fee is small relative to savings on markup |
| $500+/mo | LiteLLM (self-hosted) | Zero markup saves hundreds per month at scale |
| Variable, need smart routing | Martian or Not Diamond | ML routing optimizes cost per request automatically |

What to Look for When Choosing

Not all fallback implementations are equal. Here are the things that actually matter when evaluating these platforms:

  • Fallback trigger conditions. Does it retry on all errors, or can you configure it to only fall back on 503/429 but not 400 (bad request)?

  • Timeout handling. If the primary model is slow (not down, just degraded), does the gateway switch to a fallback after N seconds? Configurable timeouts matter more than binary up/down checks.

  • Model capability mapping. When falling back from Claude to GPT-4o, does the gateway handle parameter differences (like system prompt format, tool calling schema)?

  • Streaming support. If the primary model fails mid-stream, can the gateway seamlessly switch to a fallback? Most cannot. They retry from scratch, which means the client sees a delay but gets a complete response.

  • Observability. Can you see which model actually served each request, the latency, and why a fallback was triggered? Without this, debugging production issues is guesswork.
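The timeout point deserves emphasis: a degraded-but-alive primary is the case naive retry logic misses. An illustrative sketch of timeout-triggered failover (the timeouts and provider functions are stand-ins, not any gateway's real internals):

```python
import concurrent.futures
import time

def call_with_timeouts(prompt, targets):
    """targets: (name, fn, timeout_seconds) tuples, tried in order.
    A call that exceeds its timeout is abandoned, not only one that errors."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for name, fn, timeout in targets:
            future = pool.submit(fn, prompt)
            try:
                return name, future.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                continue  # degraded, not necessarily down; try the next target
            except Exception:
                continue  # hard failure; try the next target
    raise RuntimeError("all targets timed out or failed")

def slow_primary(prompt):
    time.sleep(0.5)  # simulates a degraded (not down) provider
    return "late answer"

def fast_fallback(prompt):
    return "fallback answer"

served_by, answer = call_with_timeouts(
    "Hello",
    [("primary", slow_primary, 0.1), ("fallback", fast_fallback, 5.0)],
)
# served_by == "fallback": the slow primary was abandoned after 0.1s.
```

A binary up/down health check would never have fired here; only the per-target timeout rescues the user from a 30-second wait on a limping provider.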

Common Pitfalls

  • Assuming fallback means identical output. When you fall back from Claude to GPT-4o, the response will be different. If your application depends on specific output formatting, structured JSON schemas, or model-specific features, test each fallback path independently. The gateway handles availability, not behavioral equivalence.

  • Ignoring cold-start latency. Some providers add significant latency on the first request after idle periods. If your primary model is fast but your fallback has a 2-3 second cold start, the user experience during failover is worse than you expect.

  • Stacking too many fallbacks. Three fallback levels is the practical maximum. Each additional fallback adds latency (the gateway must wait for the timeout on the previous level before trying the next). With five fallbacks and 10-second timeouts, worst case is 50 seconds before the user gets a response.

  • Not testing fallback paths. Set up a staging environment where you can force-fail each provider and verify the fallback chain works. Portkey and LiteLLM both support test modes for this. Do not wait for a real outage to discover your fallback config is broken.

Warning

If you use OpenRouter or another marketplace without your own provider API keys, you share rate limits with every other user on that marketplace. During major outages, the marketplace itself can become a bottleneck because thousands of users simultaneously hit fallback routes. Having your own direct API keys as a backup remains important even when using a marketplace.

Quick Setup Checklist

If you want to get started with automatic fallback routing today:

  1. Pick your primary and fallback models. Choose models with similar capabilities. Falling back from a coding-focused model to a general-purpose one will produce worse results for code tasks.
  2. Decide: marketplace or self-hosted. For prototyping, use OpenRouter (zero setup). For production, evaluate Portkey or LiteLLM based on your monthly spend.
  3. Set timeout thresholds. 15 seconds for the primary, 10 seconds for the first fallback. These numbers work well for most conversational use cases.
  4. Add a local model as the last resort. If you run Ollama locally, add it as the final fallback. It will never be rate limited or unavailable because of a remote outage.
  5. Monitor which fallback fires and when. If your fallback activates more than 5% of the time, reconsider your primary provider.
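Step 4 maps cleanly onto the LiteLLM config shown earlier: append a local Ollama deployment to the same model group. A sketch assuming Ollama's default port, with an illustrative model name:

```yaml
# Appended to the earlier config.yaml: local model as the last resort.
model_list:
  - model_name: "smart-model"
    litellm_params:
      model: "ollama/llama3"               # illustrative local model
      api_base: "http://localhost:11434"   # Ollama's default port
```

Quality from a small local model will be noticeably worse, but a degraded answer during an outage usually beats an error page.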

Wrapping Up

LLM marketplaces with automatic fallback exist and work well today. OpenRouter is the fastest way to get started, Portkey gives you the most control without token markup, and LiteLLM is the right choice if you want full ownership at zero cost. The pricing model you choose depends entirely on your volume: low volume favors convenience (marketplace markup), high volume favors self-hosted (zero markup). Whichever you pick, test the fallback path before you need it.

Fazm is an open source macOS AI agent that works with multiple LLM providers. It is available on GitHub.
