How to Cut AI Agent Costs 70%: Smart Model Routing Guide

Your AI agent bill hit $2,000 last month and it's climbing fast. The fix isn't switching providers or writing less code - it's routing each task to the right model. Here's how teams are cutting costs 60-90% without sacrificing output quality.

1. Why AI Agent Costs Balloon So Fast

Most AI agents default to a single model for everything. If your agent's default is Claude Opus or GPT-4o, every task - from renaming a variable to architecting a distributed system - burns tokens at the same premium rate.

The math gets ugly fast. A typical development session involves 50-200 API calls. If each call averages 2,000 input tokens and 1,000 output tokens, a single day with Claude Opus costs roughly $5-21 per developer. Scale that across a team of 5 developers working 20 days a month and you're looking at $500-2,100/month - just on AI tokens.

But here's the key insight: 70-80% of agent tasks are simple. File operations, string formatting, boilerplate generation, basic refactoring, lint fixes - none of these need a frontier model. They need a fast, cheap model that can follow straightforward instructions.

The core problem: Sending a simple "rename this variable from userList to users" request to Claude Opus costs nearly 20x more than sending it to Haiku - and the output quality is identical for that task.

2. Task Complexity Classification

The first step in model routing is classifying what your agent actually does. Most agent workloads break down into three tiers:

Tier 1: Simple Tasks (60-70% of calls)

  • File read/write operations with template-based changes
  • Code formatting, linting, and style fixes
  • Simple refactoring (rename, extract variable, inline)
  • Boilerplate generation from well-defined patterns
  • Status checks, parsing structured data, JSON manipulation
  • Straightforward Q&A from provided context

Best model: Claude Haiku, GPT-4o-mini, Gemini Flash. These models handle simple tasks with 95%+ accuracy at a fraction of the cost.

Tier 2: Medium Tasks (20-30% of calls)

  • Multi-file refactoring that requires understanding relationships
  • Writing new functions with moderate business logic
  • Bug diagnosis with 2-3 files of context
  • Test generation that requires understanding edge cases
  • Code review with meaningful feedback

Best model: Claude Sonnet, GPT-4o, Gemini Pro. The sweet spot of capability vs. cost for most real development work.

Tier 3: Complex Tasks (5-10% of calls)

  • System architecture decisions spanning 10+ files
  • Debugging subtle concurrency or race condition issues
  • Security audit and vulnerability analysis
  • Complex algorithm design and optimization
  • Migrating between frameworks or major refactors

Best model: Claude Opus, o1/o3, Gemini Ultra. Reserve the expensive models for tasks where reasoning depth actually matters.

3. Model Pricing Comparison: The 100x Gap

Here's the current pricing landscape that makes model routing so impactful. The price difference between the cheapest and most expensive models is staggering:

Model               Input (/1M tokens)   Output (/1M tokens)   Relative Cost
Claude Haiku 3.5    $0.80                $4.00                 1x
GPT-4o-mini         $0.15                $0.60                 0.2x
Gemini 2.0 Flash    $0.10                $0.40                 0.1x
Claude Sonnet 4     $3.00                $15.00                4x
GPT-4o              $2.50                $10.00                3x
Claude Opus 4       $15.00               $75.00                19x
o3                  $10.00               $40.00                13x

Key takeaway: GPT-4o-mini is 100x cheaper than Claude Opus on input tokens. If 70% of your agent's tasks can run on a mini/flash model, you're burning money by sending everything to a frontier model.
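To make the gap concrete, here's a small sketch that computes the dollar cost of one typical agent call (2,000 input tokens, 1,000 output tokens) for each model, using the list prices from the table above. The model keys are informal labels, not official API identifiers:

```python
# Per-1M-token list prices from the table above: (input, output), in USD.
PRICES = {
    "claude-haiku-3.5": (0.80, 4.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-2.0-flash": (0.10, 0.40),
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "claude-opus-4": (15.00, 75.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens / 1M, times the per-1M price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A typical agent call: 2,000 input tokens, 1,000 output tokens.
for model, _ in PRICES.items():
    print(f"{model}: ${call_cost(model, 2000, 1000):.4f}")
```

On these prices, a single typical call is about $0.105 on Opus versus about $0.0009 on GPT-4o-mini - the per-call gap is even wider than the input-only comparison suggests.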

4. Model Routing Strategies

There are three main approaches to routing tasks to the right model, each with different tradeoffs:

Strategy 1: Keyword/Pattern-Based Routing

The simplest approach. Classify tasks based on keywords and patterns in the prompt. If the task mentions "rename", "format", "lint", or "boilerplate" - route to the cheap model. If it mentions "architect", "debug race condition", or "security audit" - route to the expensive one.

Pros: Zero latency overhead, easy to implement. Cons: Misclassifies nuanced tasks, requires manual rule maintenance.
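A minimal sketch of this approach, with hypothetical pattern lists you would tune for your own workload:

```python
import re

# Hypothetical keyword rules - tune these to your actual task mix.
CHEAP = re.compile(r"\b(rename|format|lint|boilerplate|parse)\b", re.IGNORECASE)
EXPENSIVE = re.compile(r"\b(architect\w*|race condition|security audit|migrate)\b",
                       re.IGNORECASE)

def route_by_keywords(prompt: str) -> str:
    """First-pass router: match known-hard phrases before known-cheap ones,
    so "rename everything across the architecture" escalates correctly."""
    if EXPENSIVE.search(prompt):
        return "opus"
    if CHEAP.search(prompt):
        return "haiku"
    return "sonnet"  # default to the mid-tier when unsure
```

Checking the expensive patterns first matters: a prompt can contain both a cheap keyword and a hard one, and misrouting downward is the costly mistake (a failed cheap attempt plus a retry).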

Strategy 2: Classifier-Based Routing

Use a small, fast model (or a fine-tuned classifier) to analyze each incoming task and decide which model should handle it. The classifier itself costs almost nothing - a quick Haiku call to classify a task costs less than $0.001. The classifier looks at task description, file count, context window size, and required output complexity.

Pros: More accurate than keywords, adapts to context. Cons: Adds one small API call of latency per task.
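One way to structure this is sketched below. The classifier call is left as an injectable function so the routing logic is testable without an API key; in production it would wrap a cheap Haiku or mini call. The prompt wording and tier labels are illustrative, not a tested classification prompt:

```python
from typing import Callable

TIER_TO_MODEL = {
    "simple": "claude-haiku",
    "medium": "claude-sonnet",
    "complex": "claude-opus",
}

CLASSIFIER_PROMPT = (
    "Classify this agent task as exactly one word - simple, medium, or "
    "complex - based on its description, file count, and context size.\n\n{task}"
)

def route_with_classifier(task_description: str,
                          classify: Callable[[str], str]) -> str:
    """`classify` wraps a cheap-model call that returns a one-word tier label."""
    label = classify(CLASSIFIER_PROMPT.format(task=task_description))
    # Normalize and fall back to the mid-tier if the classifier returns junk.
    return TIER_TO_MODEL.get(label.strip().lower(), "claude-sonnet")
```

The fallback to Sonnet on unrecognized output is deliberate: a broken classifier should degrade to the safe middle tier, not to the cheapest model.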

Strategy 3: Cascading/Fallback Routing

Start every task with the cheapest model. If the output quality is low (detected via confidence scores, error rates, or validation checks), automatically retry with a more capable model. This is particularly effective for code generation where you can validate output by running tests or type checks.

For example: generate code with Haiku, run the test suite. If tests pass - done. If they fail, escalate to Sonnet. Still failing? Bring in Opus. In practice, 65-75% of tasks complete on the first try with the cheap model.

Pros: Guarantees quality, maximizes savings. Cons: Failed attempts waste some tokens, higher latency for escalated tasks.
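The escalation loop can be sketched as follows, with `generate` and `validate` left as placeholders for your model call and your test-suite or type-check step:

```python
from typing import Callable

# Cheapest-first escalation ladder.
LADDER = ["claude-haiku", "claude-sonnet", "claude-opus"]

def cascade(task: str,
            generate: Callable[[str, str], str],
            validate: Callable[[str], bool]) -> tuple[str, str]:
    """Try each model in order; return (model, output) for the first result
    that passes validation (e.g. the test suite or a type checker)."""
    output = ""
    for model in LADDER:
        output = generate(model, task)
        if validate(output):
            return model, output
    # Everything failed: keep the strongest model's attempt for human review.
    return LADDER[-1], output
```

The waste from failed cheap attempts is bounded: a Haiku attempt that fails and escalates adds only a few percent to the cost of the Sonnet or Opus call that follows it.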

5. Implementation Patterns

Here's what a basic model router looks like in practice. The key is defining clear routing rules and making them easy to adjust:

router.py
from dataclasses import dataclass

@dataclass
class AgentTask:
    type: str               # e.g. "rename", "architecture"
    files: list[str]        # files in the task's context
    estimated_tokens: int   # rough size of the prompt

SIMPLE_TYPES = {"format", "rename", "lint", "boilerplate", "parse"}
COMPLEX_TYPES = {"architecture", "security", "debug_complex", "migrate"}

def route_task(task: AgentTask) -> str:
    # Context signals: how much the model has to reason over
    file_count = len(task.files)
    token_count = task.estimated_tokens

    # Tier 1: small, well-defined tasks -> Haiku
    if file_count <= 2 and token_count < 4000 and task.type in SIMPLE_TYPES:
        return "claude-haiku"

    # Tier 3: large-context, high-stakes tasks -> Opus
    if (file_count > 8 or token_count > 50000) and task.type in COMPLEX_TYPES:
        return "claude-opus"

    # Tier 2: everything else -> Sonnet
    return "claude-sonnet"

For teams using the Anthropic API directly, you can implement routing at the API gateway level. For those using frameworks like LangChain or LlamaIndex, most have built-in model selection that you can configure per-chain or per-step.

A more advanced pattern is to track cost and quality metrics per task type over time. Log every task's model assignment, cost, and outcome (success/failure/retry). After a few weeks of data, you can optimize the routing rules based on actual performance rather than guesses. Teams that do this typically find 10-15% additional savings beyond the initial routing setup.
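A minimal sketch of such a ledger, using hypothetical `log_task` and `failure_rate` helpers rather than any particular observability tool:

```python
from collections import defaultdict

# Per (task_type, model) ledger: call count, spend, and failures.
stats = defaultdict(lambda: {"calls": 0, "cost": 0.0, "failures": 0})

def log_task(task_type: str, model: str, cost_usd: float, success: bool) -> None:
    """Record one completed task's model assignment, cost, and outcome."""
    entry = stats[(task_type, model)]
    entry["calls"] += 1
    entry["cost"] += cost_usd
    if not success:
        entry["failures"] += 1

def failure_rate(task_type: str, model: str) -> float:
    """Fraction of calls for this (task type, model) pair that failed."""
    entry = stats[(task_type, model)]
    return entry["failures"] / entry["calls"] if entry["calls"] else 0.0
```

After a few weeks, a high failure rate for a task type on the cheap model tells you to route that type up a tier; a near-zero failure rate on the mid-tier tells you to try routing it down.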

Pro tip: Cache frequently repeated queries. If your agent generates the same boilerplate or answers the same questions regularly, a simple prompt-hash cache can eliminate 10-20% of API calls entirely. Combine caching with routing for maximum savings.
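A prompt-hash cache can be as simple as the sketch below, which keys on a SHA-256 of the exact prompt text, so only byte-identical requests hit the cache (the `call_model` parameter stands in for whatever client function you use):

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model: Callable[[str], str]) -> str:
    """Return a cached response for byte-identical prompts;
    otherwise call the model and remember the result."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]
```

Note the limitation: this only helps with exact repeats, which is why it works well for boilerplate generation and recurring status queries but not for free-form prompts.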

6. Real-World Savings Breakdown

Let's do the math for a typical team of 5 developers, each making ~100 agent calls per day (500 total), averaging 2K input tokens and 1K output tokens per call.

Approach                            Monthly Cost   Savings
All Opus (no routing)               $1,050         baseline
All Sonnet (single mid-tier)        $210           80%
Smart routing (70/25/5 split)       $144           86%
Routing + caching (15% cache hit)   $123           88%

The 70/25/5 split means 70% of calls go to Haiku ($0.80/$4.00), 25% to Sonnet ($3/$15), and 5% to Opus ($15/$75). The blended effective rate drops from $15/1M input to about $2.10/1M input.
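The blended rate is just a weighted average of the per-tier prices:

```python
# 70/25/5 split across Haiku / Sonnet / Opus; per-1M-token prices
# from the pricing table above.
split        = {"haiku": 0.70, "sonnet": 0.25, "opus": 0.05}
input_price  = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}
output_price = {"haiku": 4.00, "sonnet": 15.00, "opus": 75.00}

blended_in  = sum(split[m] * input_price[m] for m in split)   # about $2.06 / 1M input
blended_out = sum(split[m] * output_price[m] for m in split)  # about $10.30 / 1M output
```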

For teams that were running everything on GPT-4o, switching to a routed setup with GPT-4o-mini handling simple tasks cuts costs by roughly 60-70%. The exact numbers depend on your task distribution, but the pattern is consistent: most of your spend is on tasks that don't need expensive models.

7. Tools That Help With Model Routing

You can build routing yourself, but several tools already handle this:

  • OpenRouter - acts as a unified API gateway across providers with built-in model fallback chains. You define a priority list of models and it automatically routes based on availability, cost, and latency.
  • LiteLLM - open-source proxy that standardizes 100+ LLM APIs behind one interface. Supports custom routing logic, cost tracking, and rate limit handling across providers.
  • Martian - specifically built for intelligent model routing. Their router analyzes each prompt and picks the cheapest model that can handle it at your target quality level.
  • Portkey - AI gateway with routing, caching, and observability. Useful for teams that need detailed cost breakdowns per feature, per developer, or per task type.
  • Fazm - for desktop automation specifically, Fazm handles model routing internally for its macOS agent tasks. Simple UI interactions use lighter models while complex multi-step workflows automatically escalate. Open-source, so you can inspect the routing logic.
  • Claude Code's built-in routing - Anthropic's CLI already routes between Haiku and Sonnet/Opus depending on task complexity when you use the auto model setting.

The best approach for most teams is to start with a simple keyword-based router, measure the results for 2-3 weeks, then graduate to a classifier-based approach once you have data on which tasks actually need expensive models. Don't over-engineer the routing before you understand your workload distribution.

Want a desktop AI agent with smart cost management built in?

Fazm is an open-source macOS AI agent that intelligently routes tasks across models - so you get fast, affordable desktop automation without manual configuration.

View on GitHub

fazm.ai - Open-source desktop AI agent for macOS