Smart Model Routing for AI Agents: Reduce Costs by 70% with Task-Based Routing
Most AI agents send every request to the same frontier model. A simple file rename goes to the same model as a complex multi-file refactoring. This is like taking a helicopter for every trip, including the one to the corner store. Smart model routing matches each task to the cheapest model that can handle it reliably. The result is 60-80% cost reduction with minimal quality impact. Here is how to implement it.
“Fazm supports custom API endpoints, so you can route through LiteLLM or any proxy that handles model selection automatically.”
fazm.ai
1. The Core Insight: Most Tasks Are Simple
Analyze any AI agent's workload and you will find a consistent pattern: 60-70% of tasks are simple operations that any model can handle. File reads, simple edits, search queries, formatting changes, boilerplate generation. These tasks do not require the reasoning capability of a frontier model.
Another 20-25% of tasks are moderate complexity: multi-step operations, code generation with context, debugging with log analysis. A mid-tier model handles these well.
Only 5-15% of tasks truly need a frontier model: complex architectural decisions, multi-file refactoring with subtle dependencies, reasoning about edge cases in unfamiliar code.
When you send everything to the frontier model, you are paying 10-30x more than necessary for the majority of your workload. Model routing fixes this by matching each task to the appropriate tier.
2. Routing Strategies: How to Classify Tasks
There are several approaches to classifying which model a task needs. Each has different trade-offs between accuracy and implementation complexity:
Token count routing
The simplest approach: short prompts (under 500 tokens) go to the cheapest model, medium prompts (500-2000 tokens) go to mid-tier, and long prompts (over 2000 tokens) go to the frontier model. This is a rough proxy for complexity, but it catches the obvious cases: simple questions are usually short, complex reasoning tasks usually have long prompts with lots of context.
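The thresholds above can be sketched as a small routing function. This is a sketch, not a library API: `estimate_tokens` uses the rough 4-characters-per-token heuristic, and the tier names are placeholders for your actual model IDs.

```python
# Rough heuristic: English text averages about 4 characters per token.
def estimate_tokens(prompt: str) -> int:
    return len(prompt) // 4

def route_by_tokens(prompt: str) -> str:
    """Map prompt length to a model tier using the thresholds above."""
    tokens = estimate_tokens(prompt)
    if tokens < 500:
        return "budget"       # short prompt: simple question or edit
    if tokens <= 2000:
        return "mid-tier"     # medium prompt: some context attached
    return "frontier"         # long prompt: lots of context, likely complex
```

In production you would use your provider's tokenizer instead of the character heuristic, but the routing logic stays the same.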
Tool call routing
Route based on which tools the agent needs to use. File read operations can use a cheap model. Code editing needs a mid-tier model. Multi-tool operations (search, read, edit, test) warrant the frontier model. This strategy works well because tool usage is a good proxy for task complexity.
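One way to express this mapping, sketched below with made-up tool names and tier assignments (adapt both to your agent's actual tool set):

```python
# Cheapest tier that handles each tool reliably: 0 = budget, 1 = mid-tier.
# Unknown tools fall through to tier 2 (frontier) to be safe.
TOOL_TIERS = {"read_file": 0, "search": 0, "format": 0, "edit_file": 1, "run_tests": 1}
TIER_NAMES = ["budget", "mid-tier", "frontier"]

def route_by_tools(requested_tools: list[str]) -> str:
    # Multi-tool operations (search, read, edit, test) warrant the frontier model.
    if len(requested_tools) >= 3:
        return "frontier"
    # Otherwise use the highest tier any single requested tool requires.
    tier = max((TOOL_TIERS.get(t, 2) for t in requested_tools), default=0)
    return TIER_NAMES[tier]
```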
Keyword classification
Scan the prompt for keywords that indicate complexity. Words like "refactor," "architecture," "redesign," or "why does this fail" suggest complex reasoning. Words like "format," "rename," "add import," or "fix typo" suggest simple tasks. This heuristic is imperfect but cheap to implement.
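A minimal version of this heuristic, using the example keywords from this section (extend both lists for your own workload):

```python
COMPLEX_KEYWORDS = ("refactor", "architecture", "redesign", "why does this fail")
SIMPLE_KEYWORDS = ("format", "rename", "add import", "fix typo")

def route_by_keywords(prompt: str) -> str:
    text = prompt.lower()
    # Complexity signals win over simplicity signals when both appear.
    if any(k in text for k in COMPLEX_KEYWORDS):
        return "frontier"
    if any(k in text for k in SIMPLE_KEYWORDS):
        return "budget"
    return "mid-tier"  # no signal either way: default to the middle tier
```

Defaulting ambiguous prompts to the middle tier is a deliberate choice: it caps the damage of a misclassification in either direction.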
Cascade routing
Start with the cheapest model. If the response quality is low (detected by a quality check or confidence score), retry with a better model. This approach optimizes for cost at the expense of latency, since some requests take two attempts. It works well for batch operations where latency is less important.
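In sketch form, the cascade looks like this. `call_model` and `looks_complete` are stand-ins for your API call and quality check, not real library functions:

```python
def looks_complete(response: str) -> bool:
    # Naive quality gate: non-empty and no obvious refusal. A real check
    # might use a confidence score or a cheap judge model instead.
    return bool(response.strip()) and "i can't" not in response.lower()

def cascade(prompt: str, call_model, tiers=("budget", "mid-tier", "frontier")) -> str:
    # Try each tier in cost order; escalate when the quality gate fails.
    for tier in tiers[:-1]:
        response = call_model(tier, prompt)
        if looks_complete(response):
            return response
    # The last (frontier) tier's answer is accepted unconditionally.
    return call_model(tiers[-1], prompt)
```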
3. Model Tiers: Cost and Capability Comparison
Here is how the current model landscape breaks down by tier. Prices are per million tokens (input/output):
| Tier | Models | Input $/M | Output $/M | Best for |
|---|---|---|---|---|
| Budget | Haiku, GPT-4o-mini | $0.25-0.80 | $1.25-3.00 | Simple edits, reads, formatting |
| Mid-tier | Sonnet, GPT-4o | $2.50-3.00 | $10-15 | Code generation, debugging |
| Frontier | Opus, o1, o3 | $10-15 | $30-75 | Architecture, complex reasoning |
The price difference between tiers is roughly 5-10x. If 65% of your tasks can use the budget tier and 25% can use mid-tier, you save roughly 70-75% compared to sending everything to the frontier tier. The math is straightforward: 0.65 × (1/10) + 0.25 × (1/3) + 0.10 × 1 ≈ 0.25, so the blended cost is about a quarter of the all-frontier cost.
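The blended-cost arithmetic can be checked in a few lines (the tier shares and relative costs are this section's illustrative numbers, not measurements):

```python
shares = {"budget": 0.65, "mid-tier": 0.25, "frontier": 0.10}           # share of tasks per tier
relative_cost = {"budget": 1 / 10, "mid-tier": 1 / 3, "frontier": 1.0}  # cost relative to frontier

blended = sum(shares[t] * relative_cost[t] for t in shares)
print(f"blended cost: {blended:.2f}x of all-frontier")  # → blended cost: 0.25x of all-frontier
```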
Custom endpoint support for model routing
Fazm supports custom API endpoints in its settings. Point it at a LiteLLM proxy and get automatic model routing without changing your workflow.
Try Fazm Free
4. LiteLLM Setup for Automatic Routing
LiteLLM is an open-source proxy that handles model routing transparently. You configure routing rules, point your agent at the LiteLLM endpoint, and the proxy handles model selection for each request.
# litellm_config.yaml
model_list:
  - model_name: auto
    litellm_params:
      model: anthropic/claude-3-5-haiku-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
      tags: ["cheap"]
  - model_name: auto
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      tags: ["default"]
  - model_name: auto
    litellm_params:
      model: anthropic/claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      tags: ["expensive"]
router_settings:
  routing_strategy: cost-based-routing
  enable_tag_filtering: true

Start the proxy locally and point your agent at it:
litellm --config litellm_config.yaml --port 8000

# Then in your agent config:
export ANTHROPIC_BASE_URL=http://localhost:8000
LiteLLM also provides a dashboard for monitoring costs per model, setting budgets, and viewing routing decisions. This visibility helps you tune your routing rules based on actual usage data.
5. ANTHROPIC_BASE_URL for Seamless Provider Switching
Even without a full proxy setup, ANTHROPIC_BASE_URL lets you switch between providers seamlessly. You can keep multiple endpoint configurations and swap between them:
# Direct Anthropic API (full price, best reliability)
export ANTHROPIC_BASE_URL=https://api.anthropic.com

# AWS Bedrock (potentially cheaper with reserved capacity)
export ANTHROPIC_BASE_URL=https://bedrock-runtime.us-east-1.amazonaws.com

# LiteLLM proxy (smart routing)
export ANTHROPIC_BASE_URL=http://localhost:8000

# OpenRouter (access to multiple providers)
export ANTHROPIC_BASE_URL=https://openrouter.ai/api/v1
The key benefit is that your agent code and workflow do not change. The same agent, the same prompts, the same tools. Only the routing layer underneath changes. This makes it easy to experiment with different providers and routing strategies without touching your application code.
Some teams create shell aliases for quick switching: `alias ai-cheap='export ANTHROPIC_BASE_URL=http://localhost:8000'` and `alias ai-direct='export ANTHROPIC_BASE_URL=https://api.anthropic.com'`. Switch to cheap routing for routine tasks and direct access for critical work.
6. Measuring Your Actual Savings
Before implementing routing, measure your current spending to establish a baseline. Most API providers have dashboards showing cost per model and tokens per request. Export a week of data and categorize it:
Step 1: Count the total number of requests and tokens for the week.
Step 2: Classify each request by complexity (simple, moderate, complex) based on token count or tool usage.
Step 3: Calculate what the week would have cost if each category used the appropriate model tier.
Step 4: Compare the hypothetical cost to your actual cost. The difference is your potential savings.
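The four steps above can be sketched over an exported week of request logs. Everything here is illustrative: the per-token prices are midpoints of the tier table earlier in this article, and each record is assumed to be an `(input_tokens, output_tokens, complexity)` tuple you produced in step 2.

```python
# Illustrative (input $/M, output $/M) prices, taken from the tier table's midpoints.
PRICE_PER_M = {
    "simple":   (0.50, 2.00),    # budget tier
    "moderate": (3.00, 12.00),   # mid-tier
    "complex":  (12.00, 50.00),  # frontier tier
}
FRONTIER = PRICE_PER_M["complex"]

def cost(tokens_in: int, tokens_out: int, prices: tuple[float, float]) -> float:
    return tokens_in / 1e6 * prices[0] + tokens_out / 1e6 * prices[1]

def potential_savings(requests) -> float:
    # Step 3: hypothetical cost with routing vs. step 4: everything on frontier.
    actual = sum(cost(i, o, FRONTIER) for i, o, _ in requests)
    routed = sum(cost(i, o, PRICE_PER_M[c]) for i, o, c in requests)
    return 1 - routed / actual

# A made-up week matching the typical 65/25/10 split.
week = ([(2000, 500, "simple")] * 65
        + [(4000, 1000, "moderate")] * 25
        + [(8000, 2000, "complex")] * 10)
print(f"potential savings: {potential_savings(week):.0%}")  # → potential savings: 65%
```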
Most teams find that 60-75% of their requests are simple enough for budget models, confirming the 70% savings estimate. If your workload is heavily skewed toward complex tasks (like architectural planning), the savings will be lower, perhaps 30-40%.
7. AI Agents with Built-In Routing Support
The easiest way to implement model routing is to use an agent that supports custom endpoints natively. When the agent has a settings field for the API endpoint, you can point it at a routing proxy without any configuration file editing.
Fazm is one example: it has a built-in settings field for custom API endpoints. Point it at your LiteLLM proxy and every task the agent handles gets routed to the appropriate model automatically. The agent does not need to know about the routing; it sends requests to whatever endpoint you configure, and the proxy handles model selection.
Other agents support similar configuration through environment variables or config files. The important thing is that the agent does not hardcode its model provider. Any agent that supports ANTHROPIC_BASE_URL or OPENAI_BASE_URL can be used with a routing proxy.
Model routing is one of the highest-leverage optimizations you can make for AI agent costs. It requires minimal code changes, provides immediate cost savings, and scales linearly with your usage. If you are spending more than $50/month on AI agent API calls, the 30 minutes it takes to set up LiteLLM will pay for itself within the first week.
AI agent with custom endpoint routing
Fazm is a free, open-source AI agent for macOS that controls your apps natively through accessibility APIs. Custom endpoint support lets you route through any proxy for cost optimization.
Try Fazm Free
Free to start. Fully open source. Runs locally on your Mac.