Claude Sonnet 4 to 4.6 and Opus for Agent Work: Practical Improvements That Matter
Model benchmarks tell you how a model performs on standardized tests. They do not tell you whether your AI agent will stop hallucinating tool calls that crash your pipeline, or whether it will finally handle a 20-step workflow without losing track of step 14. The improvements from Claude Sonnet 4 through Sonnet 4.6 and the latest Opus are best measured by what breaks less often in production agent workflows. This guide covers the specific, practical improvements that matter for anyone building or using AI agents, with real comparisons across model versions and honest assessments of what still does not work well.
1. Tool Calling Reliability: The Biggest Practical Improvement
The single most impactful improvement from Sonnet 4 to 4.6 for agent work is tool calling reliability. In earlier versions, the model would occasionally hallucinate tool calls: it would invent function names that did not exist, pass arguments with the wrong types, or call tools in an order that violated the workflow's constraints. Each hallucinated tool call could crash an agent pipeline or produce silently wrong results.
Sonnet 4.6 reduced hallucinated tool calls by roughly 70-80% compared to Sonnet 4 in typical agent workflows. The model is significantly better at staying within the defined tool schema, respecting required vs. optional parameters, and choosing the correct tool when multiple similar tools are available. For agent builders, this means fewer retry loops, fewer error handling branches, and more reliable end-to-end execution.
Opus takes this further with near-zero hallucinated tool calls in structured workflows. When given a clear tool schema and instructions, Opus almost never invents tools or misformats arguments. The trade-off is cost and latency. Opus is roughly 5x more expensive per token and 2-3x slower per response than Sonnet 4.6. For high-stakes agent workflows where reliability matters more than speed, Opus is the clear choice. For high-volume, latency-sensitive tasks, Sonnet 4.6 offers the better balance.
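Even with the improved reliability, production pipelines should still validate a model-proposed tool call before executing it, since that is what turns a rare hallucination into a logged rejection instead of a crash. A minimal sketch of that guard, with hypothetical tool names and a deliberately simplified schema format (real tool-use schemas are JSON Schema):

```python
# Minimal sketch: validate a model-proposed tool call against a declared
# schema before executing it. Tool names and the schema shape here are
# hypothetical and simplified for illustration.
TOOLS = {
    "get_weather": {
        "required": {"city": str},
        "optional": {"units": str},
    },
}

def validate_tool_call(name, arguments):
    """Return a list of problems; an empty list means the call is safe to run."""
    schema = TOOLS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]  # a hallucinated tool name
    problems = []
    for param, expected_type in schema["required"].items():
        if param not in arguments:
            problems.append(f"missing required parameter: {param}")
        elif not isinstance(arguments[param], expected_type):
            problems.append(f"wrong type for parameter: {param}")
    allowed = set(schema["required"]) | set(schema["optional"])
    for param in arguments:
        if param not in allowed:
            problems.append(f"unexpected parameter: {param}")
    return problems
```

Rejected calls can be fed back to the model as an error message, which is cheaper than letting a malformed call reach the actual tool.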
2. Multi-Step Planning: Following Through on Complex Tasks
Earlier Claude models would handle 3-5 step tasks reliably but start dropping or reordering steps around step 7-8. This made them unsuitable for complex agent workflows that required 15-20 sequential actions. The model would complete the first half of a task perfectly, then skip steps, repeat earlier steps, or lose track of where it was in the sequence.
Sonnet 4.6 handles 10-15 step sequences reliably, and Opus can maintain coherence through 20-30 step workflows. This is not just about context length. It is about the model's ability to maintain an internal plan, track progress against that plan, and adapt when intermediate steps produce unexpected results. The improvement shows up most clearly in coding tasks where the agent needs to: read a specification, plan the implementation, create multiple files, wire them together, write tests, run the tests, debug failures, and iterate until everything passes.
The practical impact is that agent workflows that previously required human checkpoints every 3-4 steps can now run autonomously for the full sequence. A "ship feature" workflow that required a human to verify each stage can now execute end-to-end with human review only at the final output. This does not eliminate the need for human oversight, but it changes oversight from step-by-step babysitting to result verification.
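The shift from step-by-step babysitting to result verification works best when the agent's plan is explicit rather than implicit in the conversation. One common pattern is to track the plan as data, so both the agent and a human reviewer can see exactly where the workflow is. A minimal sketch (the step names and class are illustrative, not part of any API):

```python
# Minimal sketch of an explicit plan tracker for a multi-step agent
# workflow; step names and the class itself are illustrative.
class PlanTracker:
    def __init__(self, steps):
        self.steps = list(steps)
        self.done = []

    def next_step(self):
        """Return the first step not yet completed, or None when finished."""
        for step in self.steps:
            if step not in self.done:
                return step
        return None

    def complete(self, step):
        """Mark a step done, refusing out-of-order completions."""
        if step != self.next_step():
            raise ValueError(f"out-of-order completion: {step}")
        self.done.append(step)

plan = PlanTracker(
    ["read spec", "plan implementation", "create files", "write tests", "run tests"]
)
plan.complete("read spec")
plan.complete("plan implementation")
```

Keeping the plan outside the model's context also means a crashed session can resume from the last completed step instead of starting over.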
3. The 1M Context Window: What It Changes for Agents
The expansion to 1 million tokens of context changes what is possible for agent workflows in two important ways. First, agents can hold entire codebases in context simultaneously. A 50,000-line codebase fits comfortably in a 1M context window, meaning the agent can reference any file without needing to search or read files incrementally. This eliminates a common source of agent errors: making changes that conflict with code the agent has not seen.
Second, long-running agent sessions can maintain full conversation history. A debugging session that involves reading logs, checking configurations, testing hypotheses, and iterating on fixes can now run for an extended period without losing earlier context. Previously, long sessions would hit context limits and either summarize (losing detail) or reset (losing progress).
However, more context is not always better. Models still exhibit some degradation in attention to specific details as context length increases. A rule mentioned at token position 5,000 gets more consistent attention than a rule mentioned at token position 800,000. For agent builders, this means the most important instructions should still appear early in the context, in system prompts and CLAUDE.md files that load first. The large context window is best used for reference material that the agent might need, not for critical instructions that the agent must follow.
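The positioning advice above can be applied mechanically when assembling the context: must-follow rules go first, bulky reference material last. A sketch of that assembly step, with a hypothetical helper and placeholder content:

```python
# Sketch of context assembly that puts critical instructions early and
# bulky reference material late; the helper and section headers are
# illustrative, not a prescribed format.
def assemble_prompt(critical_rules, reference_docs):
    """Build a prompt string with rules first, reference material last."""
    parts = ["## Rules (always follow)"]
    parts.extend(f"- {rule}" for rule in critical_rules)
    parts.append("## Reference material")
    parts.extend(reference_docs)
    return "\n".join(parts)

prompt = assemble_prompt(
    ["Never delete files outside the workspace.", "Run tests before committing."],
    ["(contents of repository files would go here)"],
)
```

The same ordering principle applies whether the instructions live in a system prompt, a CLAUDE.md file, or the first user message.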
4. Sonnet vs. Opus: When to Use Which for Agent Work
The choice between Sonnet and Opus for agent work is not about which is "better." It is about matching the model to the task's requirements for reliability, speed, and cost.
| Task Type | Recommended Model | Reasoning |
|---|---|---|
| Code generation (well-defined spec) | Sonnet 4.6 | Fast, cheap, reliable for clear tasks |
| Architecture decisions | Opus | Better at weighing trade-offs |
| Multi-file refactoring | Opus | Maintains coherence across many files |
| Test generation | Sonnet 4.6 | Pattern-heavy, speed matters |
| Bug investigation | Opus | Better at forming and testing hypotheses |
| Documentation | Sonnet 4.6 | Clear writing, fast output |
| Complex agent orchestration | Opus for coordinator, Sonnet for workers | Balances reliability with cost |
The hybrid approach, using Opus for the coordinator agent and Sonnet for worker agents, is emerging as the standard pattern for production agent systems. The coordinator makes high-level decisions about task breakdown, dependency ordering, and result verification. The workers execute well-defined sub-tasks where speed and cost efficiency matter more than nuanced reasoning.
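The coordinator/worker split can be expressed as a simple routing table: the role decides the model, and the coordinator fans sub-tasks out to workers. A sketch with placeholder model identifiers (these are not official API model strings):

```python
# Sketch of model routing in a hybrid agent system. The model
# identifiers are placeholders, not official Claude API model strings.
MODEL_BY_ROLE = {
    "coordinator": "opus-latest",   # plans, orders dependencies, verifies results
    "worker": "sonnet-latest",      # executes well-defined sub-tasks cheaply
}

def model_for(role):
    """Look up which model a given agent role should use."""
    try:
        return MODEL_BY_ROLE[role]
    except KeyError:
        raise ValueError(f"unknown agent role: {role}")

def dispatch(task, subtasks):
    """Assign the coordinator to the top-level task and a worker to each
    sub-task; returns (role, model, task) tuples for downstream execution."""
    assignments = [("coordinator", model_for("coordinator"), task)]
    assignments += [("worker", model_for("worker"), s) for s in subtasks]
    return assignments
```

In practice the coordinator would also verify worker outputs before marking the top-level task complete; that verification step is where the stronger model earns its cost.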
5. Version-by-Version Comparison for Agent Tasks
Here is how the Claude model versions compare on specific agent capabilities that matter in production:
| Capability | Sonnet 4 | Sonnet 4.5 | Sonnet 4.6 | Opus |
|---|---|---|---|---|
| Tool call accuracy | ~85% | ~92% | ~96% | ~99% |
| Max reliable steps | 5-7 | 8-12 | 10-15 | 20-30 |
| Context window | 200K | 200K | 1M | 1M |
| Error recovery | Basic retry | Retry with adaptation | Hypothesis-driven retry | Root cause analysis |
| Speed (tokens/sec) | ~80 | ~90 | ~100 | ~40 |
| Cost per 1M tokens (output) | $15 | $15 | $15 | $75 |
The numbers above are approximate and based on typical agent workloads rather than benchmarks. Your results will vary based on task complexity, prompt quality, and tool definitions. The key takeaway is that each version brought meaningful, measurable improvements to the capabilities that agent builders care about most.
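The cost column implies a quick back-of-the-envelope check when deciding which agents to upgrade. Using the approximate output prices from the table ($15 vs. $75 per 1M output tokens) and an assumed per-task output size:

```python
# Rough daily output-token cost comparison using the approximate prices
# from the table above; tokens-per-task is an assumed workload figure.
SONNET_PER_M = 15.0   # $ per 1M output tokens (approximate)
OPUS_PER_M = 75.0

def daily_output_cost(tasks_per_day, output_tokens_per_task, price_per_m):
    return tasks_per_day * output_tokens_per_task * price_per_m / 1_000_000

# 1,000 tasks/day at ~2,000 output tokens each:
sonnet_cost = daily_output_cost(1000, 2000, SONNET_PER_M)  # $30/day
opus_cost = daily_output_cost(1000, 2000, OPUS_PER_M)      # $150/day
```

At this volume the 5x price gap is $120/day on output tokens alone, which is why the hybrid pattern reserves Opus for the few agents where its reliability gain is worth it.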
6. Enterprise Adoption: Why Claude Became the Default
The shift toward Claude as the default model for enterprise agent workflows happened for reasons that benchmarks do not capture. Enterprises care about consistency, predictability, and the ability to debug when things go wrong. Claude's improvements in structured output, tool calling, and instruction following made it the most predictable model for production agent systems.
Consistency matters because enterprise agents run thousands of times per day. A model that produces correct output 96% of the time is dramatically different from one that produces correct output 85% of the time when you are running 1,000 agent tasks daily. The first model produces ~40 failures per day. The second produces ~150 failures per day. At scale, reliability improvements have outsized impact on operational costs.
The API design also matters for enterprise adoption. Claude's tool use API provides structured tool definitions, clear error reporting, and predictable response formats. The extended thinking feature in Opus lets enterprises audit the model's reasoning process, which is critical for regulated industries. And the system prompt caching feature reduces costs for high-volume agent deployments by avoiding redundant processing of shared instructions.
7. Building Agent Workflows on Claude in Practice
If you are building agent workflows on the Claude API, start with Sonnet 4.6 for everything and upgrade individual agents to Opus only when you identify reliability gaps. Most agent tasks do not need Opus-level reasoning, and the cost difference adds up quickly at scale. The exceptions are coordinator agents, complex debugging agents, and any agent that makes decisions affecting other agents.
Design your tool schemas carefully. The model's tool calling reliability depends heavily on clear, unambiguous tool definitions. Each tool should have a specific purpose, well-typed parameters, and a description that explains when to use it vs. similar tools. Avoid tools with overlapping functionality, as this is where the model is most likely to choose the wrong one.
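A concrete illustration of that advice, written in the shape the Claude tool use API accepts (`name`, `description`, `input_schema` as JSON Schema); the tool itself and the sibling tools named in its description are hypothetical:

```python
# A hypothetical tool definition illustrating the schema advice above:
# one specific purpose, typed parameters, an explicit required list, and
# a description that says when to use it versus similar tools.
search_tickets = {
    "name": "search_tickets",
    "description": (
        "Search existing support tickets by keyword. Use this to check "
        "for duplicates BEFORE calling create_ticket. Do not use it to "
        "fetch a ticket you already have an ID for; use get_ticket instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords to match against ticket titles and bodies.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum number of results to return (default 10).",
            },
        },
        "required": ["query"],
    },
}
```

Note that the description disambiguates this tool from its neighbors explicitly; that one sentence does more for call accuracy than any amount of retry logic.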
Use system prompt caching for any agent that runs frequently. The cached system prompt avoids re-processing your instructions on every API call, reducing both latency and cost. For agent systems that process hundreds of tasks per hour with the same base instructions, caching can reduce input token costs by 90%.
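In the Messages API, caching is enabled by marking the shared system block with a `cache_control` field so repeated calls reuse the processed prefix. A sketch of the request payload (built but not sent; the model identifier and instruction text are placeholders, and the exact field names should be checked against the Anthropic prompt caching documentation):

```python
# Request-payload sketch for system prompt caching. The model id and
# instruction text are placeholders; field names follow the Anthropic
# prompt caching documentation and should be verified against it.
SHARED_INSTRUCTIONS = "You are a ticket-triage agent. Follow the team runbook."

def build_request(user_message):
    """Build a Messages API payload whose system block is marked cacheable."""
    return {
        "model": "claude-sonnet-4-6",  # placeholder model identifier
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SHARED_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because only the stable prefix is cached, keep per-task content in the messages array rather than appending it to the system prompt, or every call becomes a cache miss.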
Desktop agent tools like Fazm, a voice-first, open-source AI computer agent for macOS built on accessibility APIs, benefit from these same model improvements. Better tool calling reliability means desktop agents can execute multi-step UI automation sequences more reliably. Better multi-step planning means desktop agents can handle complex cross-application workflows without losing track of the overall task. The model improvements compound with good agent design.