Building macOS AI Agents: Lessons from Simplifying Agent Code with Better Models
When the model gets smarter, your agent code should get simpler. But knowing what to delete — and what to keep — is the hard part.
1. The Scaffolding Problem in Agent Development
Every AI agent starts the same way: you pick a model, wire up some tools, and start testing. Within hours you're adding retry logic. Within days, you've built a context management layer. Within weeks, you have hundreds of lines of code whose only job is compensating for the model's weaknesses.
This scaffolding is necessary — until it isn't. When you upgrade from a weaker model to a stronger one, much of that compensating code becomes dead weight. It still runs, it still costs tokens in prompts, and it can actually hurt performance by over-constraining a model that doesn't need the guardrails.
Real example: One team building a macOS agent reported deleting over 300 lines of retry logic and context management code after switching from a mid-tier model to Claude Opus. The agent performed better with less code because the model handled edge cases that previously needed explicit handling.
2. What You Can Delete When Models Improve
Not all scaffolding is created equal. Here's what typically becomes unnecessary with stronger models:
- Retry loops with error classification — weaker models fail on tool calls ~15-20% of the time. Stronger models drop this to <2%. Your 50-line retry-with-backoff handler can shrink to a single retry.
- Output format enforcement — parsing logic that extracts JSON from markdown code blocks, strips trailing commas, fixes missing quotes. Better models just output valid JSON.
- Context window management — summarization chains that compress history to fit context limits. Larger context windows and better attention mean you can often just pass the raw context.
- Step-by-step decomposition prompts — "First analyze the screen. Then identify the target element. Then plan your action." Stronger models do this reasoning internally without explicit chain-of-thought prompting.
- Validation layers — checking that the model's tool calls have valid parameters before executing them. If the model reliably generates correct parameters, the validation is overhead.
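As a concrete illustration of the first two points, here is a minimal sketch of what retry and output-parsing scaffolding can collapse into once the model reliably emits valid JSON. `call_tool` is a hypothetical stand-in for whatever function sends the request to the model; it is not a real framework API.

```python
import json
import time

def call_with_retry(call_tool, payload, max_retries=1, backoff_s=1.0):
    """Single-retry wrapper: with a strong model, one retry is usually enough.

    `call_tool(payload)` is a hypothetical function returning the model's
    raw text output. No error classification, no JSON repair passes --
    the model is trusted to emit valid JSON directly.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            raw = call_tool(payload)
            return json.loads(raw)  # stronger models output clean JSON
        except (json.JSONDecodeError, RuntimeError) as exc:
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))  # linear backoff
    raise RuntimeError(
        f"tool call failed after {max_retries + 1} attempts"
    ) from last_error
```

Compare this to the 50-line version: no error taxonomy, no markdown-fence stripping, no trailing-comma repair. If the success rate doesn't drop when you swap the old handler for this one, the old handler was dead weight.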
3. What You Should Never Delete
Some code looks like scaffolding but is actually load-bearing:
- Safety boundaries — permission checks, confirmation prompts for destructive actions, rate limits on external APIs. These protect against model errors that will always happen, no matter how good the model gets.
- Logging and observability — you need to debug failures in production. Never delete structured logging just because failures are rarer.
- Timeout handling — API calls hang, processes stall, UI elements don't appear. This isn't about model quality, it's about real-world reliability.
- User feedback loops — showing the user what the agent is doing and letting them intervene. Trust but verify.
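A safety boundary of the kind described above can be sketched as a small gate in front of every action the agent takes. The action shape, the verb list, and the `confirm` callback are all illustrative assumptions, not a real library's API:

```python
# Verbs that always require explicit user confirmation, regardless of
# how good the model is. This list is an illustrative assumption.
DESTRUCTIVE_VERBS = {"delete", "overwrite", "send", "purchase"}

def guard_action(action, confirm):
    """Permission gate that survives model upgrades.

    `action` is a dict like {"verb": "delete", "target": "report.txt"};
    `confirm` is a callback that asks the user and returns True/False.
    Both names are hypothetical placeholders for your own types.
    """
    verb = action.get("verb", "").lower()
    if verb in DESTRUCTIVE_VERBS:
        prompt = f"Agent wants to {verb} {action.get('target')!r}. Allow?"
        if not confirm(prompt):
            raise PermissionError(f"user denied {verb}")
    return action
```

The point is that this check runs on every action, even when the model has been right a thousand times in a row: it guards against the error rate that never reaches zero.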
4. macOS-Specific Agent Architecture
Building agents for macOS comes with unique advantages and constraints. The platform offers powerful APIs that most agent frameworks ignore:
| macOS API | What It Gives Your Agent | Reliability |
|---|---|---|
| Accessibility (AX) APIs | Full UI tree of any app — buttons, text fields, labels with exact coordinates | Very high — system-level, no rendering variance |
| ScreenCaptureKit | Efficient screen capture with window-level filtering | High — hardware-accelerated |
| CGEvent / IOHIDEvent | Synthetic mouse/keyboard events at the system level | Very high — events enter the system input stream, so apps treat them like real hardware input |
| NSWorkspace | App launching, file handling, URL schemes | Very high — standard Cocoa API |
The key insight is that these APIs give your agent structured data about the screen state — not pixels to interpret, but actual UI elements with semantic meaning. This fundamentally changes the agent architecture from "look at a screenshot and guess" to "read the UI tree and act precisely."
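To make "read the UI tree and act precisely" concrete, here is a simplified model of what the agent sees. `AXNode` is a hypothetical stand-in for a node returned by the real Accessibility API (which exposes attributes like AXRole, AXTitle, AXPosition, and AXSize); the search returns an exact click point with no pixel estimation:

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """Simplified stand-in for a macOS Accessibility tree node.
    The real API exposes AXRole, AXTitle, AXPosition, AXSize, etc."""
    role: str    # e.g. "AXButton", "AXTextField"
    title: str
    x: float
    y: float
    width: float
    height: float
    children: list = field(default_factory=list)

def find_click_point(node, role, title):
    """Depth-first search for a UI element; returns the exact center
    of its bounding box -- coordinates come from the system, not from
    guessing at pixels."""
    if node.role == role and node.title == title:
        return (node.x + node.width / 2, node.y + node.height / 2)
    for child in node.children:
        hit = find_click_point(child, role, title)
        if hit:
            return hit
    return None
```

A screenshot-based agent has to estimate these coordinates from an image; here they are read directly from the same structures the window server uses, which is why the reliability column above says "very high."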
5. Accessibility APIs vs Screenshots
This is the single biggest architectural decision in desktop agent development. The two approaches have dramatically different trade-offs:
| Factor | Accessibility APIs | Screenshot + Vision |
|---|---|---|
| Click accuracy | ~99% (exact coordinates from UI tree) | ~80-90% (estimated from pixels) |
| Speed | ~50ms to read UI tree | ~2-5s per screenshot + inference |
| Token cost | Low (text-only UI tree) | High (image tokens expensive) |
| Dynamic content | Handles well (reads current state) | Can miss updates between captures |
| Platform support | macOS, Windows (different APIs) | Any platform with screen access |
The practical difference is enormous. An accessibility-based agent can interact with a complex form in 2-3 seconds. A screenshot-based agent needs 15-30 seconds for the same task and may fail on tricky UI elements like dropdown menus or overlapping modals.
6. Choosing the Right Model for Your Agent
Model choice for agents is different from model choice for chatbots. Agents need:
- Reliable tool calling — generating valid JSON parameters every time
- Multi-step reasoning — planning 3-5 actions ahead without losing track
- Error recovery — recognizing when an action failed and adapting
- Context utilization — using all the information provided, not just the last message
In practice, the top-tier models (Claude Opus, GPT-4o) produce dramatically simpler agent code. The mid-tier models (Sonnet, GPT-4o mini) need more scaffolding but cost 5-10x less per token. The trade-off isn't just token cost — it's engineering time spent building and maintaining compensating code.
For most teams, it's more efficient to start with the strongest model and simplify the codebase, then selectively downgrade specific tasks to cheaper models, than to build complex scaffolding around a weaker model from day one.
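That "selectively downgrade" step can be as simple as a routing table keyed by task kind. The model names and task kinds below are placeholders, not real model IDs:

```python
STRONG_MODEL = "strong-model"   # placeholder IDs, not real model names
CHEAP_MODEL = "cheap-model"

# Start every task class on the strong model; move a class to the cheap
# model only after measuring that its success rate holds up there.
ROUTES = {
    "plan_multi_step": STRONG_MODEL,
    "extract_text": CHEAP_MODEL,      # measured: cheap model passes here
    "classify_intent": CHEAP_MODEL,
}

def pick_model(task_kind):
    # Unknown task kinds default to the strongest model: fail safe, not cheap.
    return ROUTES.get(task_kind, STRONG_MODEL)
```

The design choice worth noting is the default: anything you haven't explicitly measured goes to the strong model, so a new task type degrades your costs, never your success rate.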
7. The Discipline to Simplify
The hardest part of improving your agent isn't adding features. It's removing code that works but is no longer necessary. Every developer feels the pull: "This retry logic took me two days to build and it works perfectly. Why would I delete it?"
Because unnecessary code has costs even when it works:
- Extra tokens in system prompts describing the retry behavior
- Latency from validation checks that always pass
- Cognitive load for anyone reading the codebase
- Surface area for bugs when you change something else
The best agent developers run a regular "scaffolding audit" — testing each piece of compensating code with the current model to see if it's still needed. If removing it doesn't change the success rate, it goes.
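The audit described above can be automated. The sketch below assumes a hypothetical `run_task(task, flags)` harness that runs the agent once with a given set of scaffolding toggles and returns True on success; it then disables one piece of scaffolding at a time and compares success counts:

```python
def scaffolding_audit(run_task, tasks, scaffold_flags, trials=20):
    """Compare success rates with each scaffold on vs. off.

    `run_task(task, flags)` is a hypothetical harness: it runs the agent
    once on `task` with the scaffolding toggles in `flags` (a dict of
    flag-name -> bool) and returns True on success. A flag is marked
    removable when disabling it doesn't reduce the success count.
    """
    report = {}
    baseline = sum(
        run_task(t, scaffold_flags) for t in tasks for _ in range(trials)
    )
    for flag in scaffold_flags:
        trimmed = dict(scaffold_flags)
        trimmed[flag] = False  # turn off just this one scaffold
        without = sum(
            run_task(t, trimmed) for t in tasks for _ in range(trials)
        )
        report[flag] = {
            "with": baseline,
            "without": without,
            "removable": without >= baseline,
        }
    return report
```

In practice you'd want enough trials per flag for the comparison to be statistically meaningful, but the loop itself is this simple: if removing it doesn't change the success rate, it goes.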
See a macOS agent built on these principles
Fazm is an open-source macOS AI agent using accessibility APIs for reliable desktop automation. Clean codebase, no unnecessary scaffolding. Free to use.
Explore the Source Code