Building macOS AI Agents: Lessons from Simplifying Agent Code with Better Models
When the model gets smarter, your agent code should get simpler. But knowing what to delete — and what to keep — is the hard part.
1. The Scaffolding Problem in Agent Development
Every AI agent starts the same way: you pick a model, wire up some tools, and start testing. Within hours you're adding retry logic. Within days, you've built a context management layer. Within weeks, you have hundreds of lines of code whose only job is compensating for the model's weaknesses.
This scaffolding is necessary — until it isn't. When you upgrade from a weaker model to a stronger one, much of that compensating code becomes dead weight. It still runs, it still costs tokens in prompts, and it can actually hurt performance by over-constraining a model that doesn't need the guardrails.
Real example: One team building a macOS agent reported deleting over 300 lines of retry logic and context management code after switching from a mid-tier model to Claude Opus. The agent performed better with less code because the model handled edge cases that previously needed explicit handling.
2. What You Can Delete When Models Improve
Not all scaffolding is created equal. Here's what typically becomes unnecessary with stronger models:
- Retry loops with error classification — weaker models fail on tool calls ~15-20% of the time. Stronger models drop this to <2%. Your 50-line retry-with-backoff handler can shrink to a single retry.
- Output format enforcement — parsing logic that extracts JSON from markdown code blocks, strips trailing commas, fixes missing quotes. Better models just output valid JSON.
- Context window management — summarization chains that compress history to fit context limits. Larger context windows and better attention mean you can often just pass the raw context.
- Step-by-step decomposition prompts — "First analyze the screen. Then identify the target element. Then plan your action." Stronger models do this reasoning internally without explicit chain-of-thought prompting.
- Validation layers — checking that the model's tool calls have valid parameters before executing them. If the model reliably generates correct parameters, the validation is overhead.
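As a concrete illustration of the first two points, here is a minimal sketch of what retry and output-parsing scaffolding can collapse into once the model reliably emits valid JSON. `call_tool` is a hypothetical stand-in for whatever function sends the request to the model; it is not a real framework API.

```python
import json
import time

def call_with_retry(call_tool, payload, max_retries=1, backoff_s=1.0):
    """Single-retry wrapper: with a strong model, one retry is usually enough.

    `call_tool(payload)` is a hypothetical function returning the model's
    raw text output. No error classification, no JSON repair passes --
    the model is trusted to emit valid JSON directly.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            raw = call_tool(payload)
            return json.loads(raw)  # stronger models output clean JSON
        except (json.JSONDecodeError, RuntimeError) as exc:
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))  # linear backoff
    raise RuntimeError(
        f"tool call failed after {max_retries + 1} attempts"
    ) from last_error
```

Compare this to the 50-line version: no error taxonomy, no markdown-fence stripping, no trailing-comma repair. If the success rate doesn't drop when you swap the old handler for this one, the old handler was dead weight.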
3. What You Should Never Delete
Some code looks like scaffolding but is actually load-bearing:
- Safety boundaries — permission checks, confirmation prompts for destructive actions, rate limits on external APIs. These protect against model errors that will always happen, no matter how good the model gets.
- Logging and observability — you need to debug failures in production. Never delete structured logging just because failures are rarer.
- Timeout handling — API calls hang, processes stall, UI elements don't appear. This isn't about model quality, it's about real-world reliability.
- User feedback loops — showing the user what the agent is doing and letting them intervene. Trust but verify.
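A safety boundary of the kind described above can be sketched as a small gate in front of every action the agent takes. The action shape, the verb list, and the `confirm` callback are all illustrative assumptions, not a real library's API:

```python
# Verbs that always require explicit user confirmation, regardless of
# how good the model is. This list is an illustrative assumption.
DESTRUCTIVE_VERBS = {"delete", "overwrite", "send", "purchase"}

def guard_action(action, confirm):
    """Permission gate that survives model upgrades.

    `action` is a dict like {"verb": "delete", "target": "report.txt"};
    `confirm` is a callback that asks the user and returns True/False.
    Both names are hypothetical placeholders for your own types.
    """
    verb = action.get("verb", "").lower()
    if verb in DESTRUCTIVE_VERBS:
        prompt = f"Agent wants to {verb} {action.get('target')!r}. Allow?"
        if not confirm(prompt):
            raise PermissionError(f"user denied {verb}")
    return action
```

The point is that this check runs on every action, even when the model has been right a thousand times in a row: it guards against the error rate that never reaches zero.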
4. macOS-Specific Agent Architecture
Building agents for macOS comes with unique advantages and constraints. The platform offers powerful APIs that most agent frameworks ignore:
| macOS API | What It Gives Your Agent | Reliability |
|---|---|---|
| Accessibility (AX) APIs | Full UI tree of any app — buttons, text fields, labels with exact coordinates | Very high — system-level, no rendering variance |
| ScreenCaptureKit | Efficient screen capture with window-level filtering | High — hardware-accelerated |
| CGEvent / IOHIDEvent | Synthetic mouse/keyboard events at the system level | Very high — events enter the system input stream, so apps treat them like real hardware input |
| NSWorkspace | App launching, file handling, URL schemes | Very high — standard Cocoa API |
The key insight is that these APIs give your agent structured data about the screen state — not pixels to interpret, but actual UI elements with semantic meaning. This fundamentally changes the agent architecture from "look at a screenshot and guess" to "read the UI tree and act precisely."
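To make "read the UI tree and act precisely" concrete, here is a simplified model of what the agent sees. `AXNode` is a hypothetical stand-in for a node returned by the real Accessibility API (which exposes attributes like AXRole, AXTitle, AXPosition, and AXSize); the search returns an exact click point with no pixel estimation:

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """Simplified stand-in for a macOS Accessibility tree node.
    The real API exposes AXRole, AXTitle, AXPosition, AXSize, etc."""
    role: str    # e.g. "AXButton", "AXTextField"
    title: str
    x: float
    y: float
    width: float
    height: float
    children: list = field(default_factory=list)

def find_click_point(node, role, title):
    """Depth-first search for a UI element; returns the exact center
    of its bounding box -- coordinates come from the system, not from
    guessing at pixels."""
    if node.role == role and node.title == title:
        return (node.x + node.width / 2, node.y + node.height / 2)
    for child in node.children:
        hit = find_click_point(child, role, title)
        if hit:
            return hit
    return None
```

A screenshot-based agent has to estimate these coordinates from an image; here they are read directly from the same structures the window server uses, which is why the reliability column above says "very high."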
5. Accessibility APIs vs Screenshots
This is the single biggest architectural decision in desktop agent development. The two approaches have dramatically different trade-offs:
| Factor | Accessibility APIs | Screenshot + Vision |
|---|---|---|
| Click accuracy | ~99% (exact coordinates from UI tree) | ~80-90% (estimated from pixels) |
| Speed | ~50ms to read UI tree | ~2-5s per screenshot + inference |
| Token cost | Low (text-only UI tree) | High (image tokens expensive) |
| Dynamic content | Handles well (reads current state) | Can miss updates between captures |
| Platform support | macOS, Windows (different APIs) | Any platform with screen access |
The practical difference is enormous. An accessibility-based agent can interact with a complex form in 2-3 seconds. A screenshot-based agent needs 15-30 seconds for the same task and may fail on tricky UI elements like dropdown menus or overlapping modals.
6. Choosing the Right Model for Your Agent
Model choice for agents is different from model choice for chatbots. Agents need:
- Reliable tool calling — generating valid JSON parameters every time
- Multi-step reasoning — planning 3-5 actions ahead without losing track
- Error recovery — recognizing when an action failed and adapting
- Context utilization — using all the information provided, not just the last message
In practice, the top-tier models (Claude Opus, GPT-4o) produce dramatically simpler agent code. The mid-tier models (Sonnet, GPT-4o mini) need more scaffolding but cost 5-10x less per token. The trade-off isn't just token cost — it's engineering time spent building and maintaining compensating code.
For most teams, it's more efficient to start with the strongest model and simplify the codebase, then selectively downgrade specific tasks to cheaper models, than to build complex scaffolding around a weaker model from day one.
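That "selectively downgrade" step can be as simple as a routing table keyed by task kind. The model names and task kinds below are placeholders, not real model IDs:

```python
STRONG_MODEL = "strong-model"   # placeholder IDs, not real model names
CHEAP_MODEL = "cheap-model"

# Start every task class on the strong model; move a class to the cheap
# model only after measuring that its success rate holds up there.
ROUTES = {
    "plan_multi_step": STRONG_MODEL,
    "extract_text": CHEAP_MODEL,      # measured: cheap model passes here
    "classify_intent": CHEAP_MODEL,
}

def pick_model(task_kind):
    # Unknown task kinds default to the strongest model: fail safe, not cheap.
    return ROUTES.get(task_kind, STRONG_MODEL)
```

The design choice worth noting is the default: anything you haven't explicitly measured goes to the strong model, so a new task type degrades your costs, never your success rate.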
7. The Discipline to Simplify
The hardest part of improving your agent isn't adding features. It's removing code that works but is no longer necessary. Every developer feels the pull: "This retry logic took me two days to build and it works perfectly. Why would I delete it?"
Because unnecessary code has costs even when it works:
- Extra tokens in system prompts describing the retry behavior
- Latency from validation checks that always pass
- Cognitive load for anyone reading the codebase
- Surface area for bugs when you change something else
The best agent developers run a regular "scaffolding audit" — testing each piece of compensating code with the current model to see if it's still needed. If removing it doesn't change the success rate, it goes.
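The audit described above can be automated. The sketch below assumes a hypothetical `run_task(task, flags)` harness that runs the agent once with a given set of scaffolding toggles and returns True on success; it then disables one piece of scaffolding at a time and compares success counts:

```python
def scaffolding_audit(run_task, tasks, scaffold_flags, trials=20):
    """Compare success rates with each scaffold on vs. off.

    `run_task(task, flags)` is a hypothetical harness: it runs the agent
    once on `task` with the scaffolding toggles in `flags` (a dict of
    flag-name -> bool) and returns True on success. A flag is marked
    removable when disabling it doesn't reduce the success count.
    """
    report = {}
    baseline = sum(
        run_task(t, scaffold_flags) for t in tasks for _ in range(trials)
    )
    for flag in scaffold_flags:
        trimmed = dict(scaffold_flags)
        trimmed[flag] = False  # turn off just this one scaffold
        without = sum(
            run_task(t, trimmed) for t in tasks for _ in range(trials)
        )
        report[flag] = {
            "with": baseline,
            "without": without,
            "removable": without >= baseline,
        }
    return report
```

In practice you'd want enough trials per flag for the comparison to be statistically meaningful, but the loop itself is this simple: if removing it doesn't change the success rate, it goes.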
See a macOS agent built on these principles
Fazm is an open-source macOS AI agent using accessibility APIs for reliable desktop automation. Clean codebase, no unnecessary scaffolding. Free to use.
Explore the Source Code