Multi-LLM Agent Routing - Using Different Models for Different Subtasks

Fazm Team · 3 min read

Running every agent task through a single large model is wasteful. A screen classification that takes 50 tokens does not need the same model as a complex multi-step code refactor. Smart agents route different subtasks to different models.

The Routing Architecture

A practical multi-LLM setup looks like this:

  • Orchestrator (Claude, GPT-4) - Handles planning, complex reasoning, and multi-step task decomposition. This is where you need the best model.
  • Fast classifier (Haiku, GPT-4o-mini) - Determines what is on screen, categorizes user intent, routes to the right workflow. Speed matters more than depth here.
  • Code generator (Claude Sonnet, Codex) - Writes and edits code. Needs strong coding ability but not necessarily the largest context window.
  • Vision model (GPT-4o, local Llava) - Processes screenshots when accessibility APIs are not available. Specialized for visual understanding.
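The four roles above can be expressed as a simple routing table. A minimal sketch, where the model labels are illustrative placeholders rather than pinned API model names:

```python
# Minimal multi-LLM routing table. Model identifiers are
# placeholders; swap in whichever providers you actually use.
ROUTING_TABLE = {
    "plan":     {"model": "claude-opus",   "why": "multi-step reasoning"},
    "classify": {"model": "claude-haiku",  "why": "fast, cheap labeling"},
    "code":     {"model": "claude-sonnet", "why": "strong coding, mid cost"},
    "vision":   {"model": "gpt-4o",        "why": "screenshot understanding"},
}

def model_for(task_type: str) -> str:
    """Return the model assigned to a subtask type, defaulting to the planner."""
    return ROUTING_TABLE.get(task_type, ROUTING_TABLE["plan"])["model"]
```

Keeping the table as data rather than branching logic makes it easy to swap models as pricing and benchmarks change.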

Why Not Just Use the Best Model for Everything

Cost and latency. A single orchestration step with Claude Opus costs roughly 10x what the same step costs with Haiku. If your agent makes 50 LLM calls per task, using the top model for all of them burns through budget fast.

More importantly, latency compounds. If each call takes 3 seconds instead of 0.5 seconds because you are using a larger model, a 50-call task takes 2.5 minutes instead of 25 seconds. Users notice.
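The compounding effect is easy to verify with back-of-the-envelope arithmetic, using the per-call figures quoted above:

```python
# How per-call latency compounds over a multi-call agent task.
CALLS_PER_TASK = 50
large_model_latency = 3.0   # seconds per call (figure from the text)
small_model_latency = 0.5

large_total = CALLS_PER_TASK * large_model_latency  # 150 s = 2.5 minutes
small_total = CALLS_PER_TASK * small_model_latency  # 25 s
print(f"{large_total:.0f}s vs {small_total:.0f}s")
```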

Practical Routing Rules

The routing logic does not need to be complex:

  • If classifying or categorizing - use the smallest model that achieves 95%+ accuracy on your eval set.
  • If generating or editing code - use a mid-tier coding model.
  • If planning or reasoning about multi-step actions - use the best available model. This is where mistakes are most expensive.
  • If processing images - use a vision-capable model, ideally running locally to avoid upload latency.
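These rules fit in a few lines of code. A hedged sketch, where the task-type strings and model labels are hypothetical:

```python
def route(task_type: str, image_attached: bool = False) -> str:
    """Map a subtask to a model tier following the rules above.

    Model labels are placeholders; substitute your actual providers.
    """
    if image_attached:
        return "vision-model"        # vision-capable, ideally local
    if task_type in ("classify", "categorize"):
        return "small-fast-model"    # smallest model clearing 95% on evals
    if task_type in ("generate_code", "edit_code"):
        return "mid-tier-coder"
    # Planning and reasoning default to the best available model,
    # since this is where mistakes are most expensive.
    return "best-available-model"
```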

The Orchestrator Pattern

The most effective pattern: Claude orchestrates the overall task, decides which subtasks to delegate, and shells out to cheaper or faster models for execution. It reviews the results and decides next steps. Think of it as a senior engineer delegating to specialists.
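The delegate-then-review loop can be sketched as follows. Here `call_model` is a hypothetical provider wrapper passed in by the caller, and the plan/review prompts stand in for real prompt templates:

```python
from typing import Callable

def run_task(goal: str, call_model: Callable[[str, str], str]) -> list[str]:
    """Orchestrator pattern sketch: plan with the best model, delegate
    execution to a cheaper specialist, review each result."""
    # 1. The orchestrator decomposes the goal into one subtask per line.
    plan = call_model("orchestrator-model", f"Break into subtasks: {goal}")
    results = []
    for subtask in plan.splitlines():
        # 2. Each subtask is delegated to a cheaper/faster specialist...
        output = call_model("specialist-model", subtask)
        # 3. ...and the orchestrator reviews the result before moving on.
        verdict = call_model("orchestrator-model",
                             f"Review: {subtask}\nResult: {output}")
        results.append(verdict)
    return results
```

Injecting `call_model` keeps the loop testable with a stub and independent of any particular provider SDK.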

This is not theoretical. Production AI agents already do this, and the cost and speed improvements are significant.

Fazm is an open-source macOS AI agent, available on GitHub.
