Building a Full macOS Desktop Agent with Claude
I built a full macOS desktop agent with Claude. The app reads the screen's accessibility tree, understands what's on screen, and can click and type in any native application. Here's how the architecture works.
The Accessibility Tree Foundation
Nearly every macOS application built with standard UI frameworks exposes its interface through the accessibility API. This gives you a structured tree of every element - buttons, text fields, labels, menus, windows - with their properties and positions.
The agent queries this tree to understand the current state of the screen. Instead of taking a screenshot and feeding it to a vision model, it gets structured data directly. A button labeled "Send" at coordinates (450, 320) is just a data point, not a pattern recognition problem.
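To make the "data point, not a pattern recognition problem" idea concrete, here is a minimal sketch in plain Python. It does not call the macOS accessibility API. The UIElement class, the find helper, and the sample window are hypothetical stand-ins for a tree that a real agent would read through the system's AXUIElement interface. Only the role strings (AXWindow, AXButton, AXTextField) are real accessibility roles.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class UIElement:
    """Hypothetical in-memory copy of one accessibility node."""
    role: str                 # accessibility role, e.g. "AXButton"
    title: str = ""
    x: float = 0.0            # top-left corner in screen points
    y: float = 0.0
    width: float = 0.0
    height: float = 0.0
    children: list["UIElement"] = field(default_factory=list)

    def walk(self) -> Iterator["UIElement"]:
        """Yield this node and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def center(self) -> tuple[float, float]:
        """Point a click would target: the middle of the element's frame."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def find(root: UIElement, role: str, title: str) -> Optional[UIElement]:
    """Return the first element matching role and title, or None."""
    for element in root.walk():
        if element.role == role and element.title == title:
            return element
    return None


# Sample tree (made up for illustration): a compose window with a Send button.
window = UIElement("AXWindow", "New Message", 200, 100, 600, 400, children=[
    UIElement("AXTextField", "To", 220, 140, 560, 24),
    UIElement("AXButton", "Send", 410, 305, 80, 30),
])

send = find(window, "AXButton", "Send")
print(send.center())  # (450.0, 320.0)
```

Finding the button is an exact lookup on role and title, and the click point falls out of the element's frame with simple arithmetic.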
How the Agent Loop Works
The core loop is straightforward:
- Observe - read the accessibility tree of the frontmost application
- Understand - send the tree structure to Claude with the current task context
- Decide - Claude determines the next action (click, type, scroll, switch apps)
- Execute - perform the action through accessibility APIs
- Verify - read the tree again to confirm the action worked
The loop repeats until the task is complete or the agent encounters something it can't handle.
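The five steps above can be sketched as a small driver function. This is an illustration under assumptions, not the agent's actual code: run_agent, Action, and the max_steps cap are hypothetical names, the decide function here is a scripted stub standing in for the call to Claude, and the "screen" is a plain dictionary rather than a real accessibility tree.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "click", "type", "scroll", "switch_app", or "done"
    target: str = ""
    text: str = ""


def run_agent(task, observe, decide, execute, max_steps=20):
    """Run observe/decide/execute/verify until done. Returns True on success."""
    history = []
    for _ in range(max_steps):
        before = observe()                       # Observe
        action = decide(task, before, history)   # Understand + Decide
        if action.kind == "done":
            return True
        execute(action)                          # Execute
        after = observe()                        # Verify: did anything change?
        history.append((action, after != before))
    return False  # gave up: step budget exhausted


# Toy stand-ins so the loop can run without macOS or a model call.
screen = {"draft": "", "sent": False}


def observe():
    return dict(screen)  # snapshot copy, so before/after can be compared


def decide(task, tree, history):
    # Scripted policy standing in for the model's decision.
    if tree["sent"]:
        return Action("done")
    if not tree["draft"]:
        return Action("type", target="body", text="hello")
    return Action("click", target="Send")


def execute(action):
    if action.kind == "type":
        screen["draft"] = action.text
    elif action.kind == "click" and action.target == "Send":
        screen["sent"] = True


finished = run_agent("send a greeting", observe, decide, execute)
print(finished)  # True
```

Passing the verification result back through history gives the decision step a way to notice that an action had no effect. The step cap is one simple way to model "something it can't handle".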
Why Native App Control Matters
Browser-based agents are limited to web apps. A desktop agent that controls native applications can automate workflows that span multiple apps - copy data from a spreadsheet, paste it into an email client, attach a file from Finder, and send it.
The accessibility API approach works with any app that follows standard macOS UI conventions. That covers most productivity software, creative tools, and system utilities.
The Hard Parts
Screen reading is easy. Making it reliable is hard. Applications update their UI unpredictably, accessibility labels are sometimes missing or misleading, and timing matters - you need to wait for animations and loading states before reading the tree again.
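One common way to handle the timing problem is to poll until the tree stops changing. The sketch below is an assumption about how that could look, not a description of the agent's implementation: wait_until_stable and its parameters are hypothetical, and fake_read simulates an app that passes through a loading state before settling.

```python
import time


def wait_until_stable(read_tree, stable_reads=3, interval=0.05, timeout=2.0):
    """Poll read_tree until it returns the same snapshot stable_reads times
    in a row. Returns that snapshot, or None if the timeout expires first."""
    deadline = time.monotonic() + timeout
    last = read_tree()
    streak = 1
    while streak < stable_reads:
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
        current = read_tree()
        if current == last:
            streak += 1
        else:
            last, streak = current, 1  # UI changed: restart the count
    return last


# Simulated app: shows "loading", then "spinner", then stays on "ready".
frames = iter(["loading", "spinner", "ready"])
state = {"last": "loading"}


def fake_read():
    state["last"] = next(frames, state["last"])
    return state["last"]


settled = wait_until_stable(fake_read, interval=0.01)
print(settled)  # ready
```

Returning None on timeout forces the caller to decide what to do with a UI that never settles, such as an app with a perpetual animation, instead of acting on a half-loaded tree.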
Building a desktop agent that works in demos is a weekend project. Building one that works reliably every day takes months of edge case handling.
Fazm is an open source macOS AI agent, available on GitHub.