
How to Cut AI Agent Costs 50-70% with Model Routing

Fazm Team · 2 min read

Tags: model-routing, cost-reduction, ollama, claude, optimization


Running every agent action through Claude Opus is expensive. A single complex workflow might involve 50+ LLM calls: reading screen state, deciding what to click, verifying results, handling errors. At API pricing, a heavy user can easily burn through $20-50 per day.

Most of those calls do not need a frontier model. The trick is figuring out which ones do.

Simple Routing Rules

Start with task complexity. Reading a button label and confirming it matches what you expected? That is a local model task. Llama 3 running on Ollama handles simple classification and extraction just fine, and the latency is lower than an API call.

Planning a multi-step workflow that requires understanding context from three different applications? That is Claude territory. The reasoning quality difference between local and frontier models shows up most on tasks that require holding multiple pieces of context simultaneously.

A basic router looks at the task type and routes accordingly:

  • Screen element identification - local model
  • Simple yes/no decisions - local model
  • Text extraction and formatting - local model
  • Multi-step planning - frontier model
  • Error recovery and replanning - frontier model
  • Complex form filling with context - frontier model
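A router like this can be a simple lookup, no LLM required to make the routing decision itself. Here is a minimal sketch; the task-type names mirror the list above, and the model identifiers and the default-to-frontier fallback are illustrative assumptions, not Fazm's actual API:

```python
# Tasks a local model handles reliably vs. tasks that need frontier reasoning.
# These category names are hypothetical labels for the list above.
LOCAL_TASKS = {"identify_element", "yes_no_decision", "extract_text"}
FRONTIER_TASKS = {"plan_workflow", "recover_from_error", "fill_complex_form"}

def pick_model(task_type: str) -> str:
    """Route cheap, well-scoped tasks to a local model; keep the
    frontier model for multi-step reasoning."""
    if task_type in LOCAL_TASKS:
        return "ollama/llama3"   # local, low latency, effectively free
    if task_type in FRONTIER_TASKS:
        return "claude-opus"     # expensive, reserved for hard reasoning
    return "claude-opus"         # unknown task types default to the safe choice

print(pick_model("extract_text"))   # → ollama/llama3
print(pick_model("plan_workflow"))  # → claude-opus
```

Defaulting unknown task types to the frontier model trades a little cost for safety: misrouting a hard task to a local model fails the workflow, while misrouting an easy task to Claude only wastes a few cents.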

State Summarization

The other cost lever is context size. After each step, summarize the current state instead of carrying the full conversation history. An agent that has been running for 20 steps does not need the raw screenshots and tool outputs from step 3. It needs a one-line summary: "Logged into Salesforce, navigated to the Contacts page."

This cuts token usage dramatically. A 50-step workflow with full context might use 500K tokens. With summarization after every 5 steps, you can keep it under 100K.
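One way to implement this is a rolling window: keep the last few steps verbatim and collapse everything older into a single summary line. A minimal sketch, assuming a `summarize` callable backed by the local model (its name and signature are assumptions):

```python
# Collapse old history into a one-line summary every N steps so later
# LLM calls carry a fraction of the tokens. SUMMARIZE_EVERY is illustrative.
SUMMARIZE_EVERY = 5

def compact_history(history: list[str], summarize) -> list[str]:
    """Keep the most recent steps verbatim; fold older steps into one
    summary entry produced by a cheap local model."""
    if len(history) <= SUMMARIZE_EVERY:
        return history
    older, recent = history[:-SUMMARIZE_EVERY], history[-SUMMARIZE_EVERY:]
    return [summarize(older)] + recent

# Usage with a stand-in summarizer:
fake_summarize = lambda steps: f"Summary of {len(steps)} earlier steps"
history = [f"step {i}" for i in range(1, 9)]
print(compact_history(history, fake_summarize))
# First entry is the summary of steps 1-3, followed by steps 4-8 verbatim.
```

In a real agent loop you would call this after every step, so the summary itself gets re-summarized as it ages and the context stays bounded regardless of workflow length.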

Context Pruning

Not everything in the agent's context is relevant to the current step. Prune aggressively. If the agent is filling out a form, it does not need the accessibility tree of the menu bar. If it is reading an email, it does not need the state of the sidebar.
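Pruning can be as simple as filtering the accessibility tree by element role before serializing it into the prompt. A sketch under assumed node structure (the dict shape and role names are illustrative, not macOS accessibility API output):

```python
# Keep only UI elements whose role matters for the current step,
# e.g. drop the menu bar and sidebar while filling a form.
def prune_context(nodes: list[dict], focus_roles: set[str]) -> list[dict]:
    """Filter accessibility-tree nodes down to the roles relevant
    to the task at hand."""
    return [n for n in nodes if n.get("role") in focus_roles]

nodes = [
    {"role": "textfield", "label": "Email"},
    {"role": "menubar", "label": "File"},
    {"role": "sidebar", "label": "Folders"},
    {"role": "button", "label": "Submit"},
]
form_roles = {"textfield", "button"}
print(prune_context(nodes, form_roles))
# Keeps only the Email field and Submit button.
```

The focus set changes per task: form filling keeps text fields and buttons, email reading keeps the message body and headers, and so on.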

Combine routing, summarization, and pruning and you can realistically cut costs by 50-70% without meaningful quality loss on the tasks that matter.

Fazm is an open source macOS AI agent, available on GitHub.
