Building a Full macOS Desktop Agent with Claude

Fazm Team · 2 min read

We built a full macOS desktop agent with Claude. The app reads the screen's accessibility tree, understands what's on screen, and can click and type in any native application. Here's how the architecture works.

The Accessibility Tree Foundation

Most macOS applications expose their UI through the accessibility API. This gives you a structured tree of every element - buttons, text fields, labels, menus, windows - with their properties and positions.

The agent queries this tree to understand the current state of the screen. Instead of taking a screenshot and feeding it to a vision model, it gets structured data directly. A button labeled "Send" at coordinates (450, 320) is just a data point, not a pattern recognition problem.
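To make the "data point, not pattern recognition" idea concrete, here is a minimal sketch of looking up a button in a tree. The `AXNode` class, `walk`, and `find` are hypothetical stand-ins invented for illustration, not the macOS accessibility API or Fazm's code; a real agent would fill in a structure like this from the system's accessibility calls.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class AXNode:
    """Hypothetical in-memory stand-in for one accessibility element."""
    role: str                                   # e.g. "AXButton", "AXTextField"
    title: str = ""
    position: tuple[int, int] = (0, 0)          # screen coordinates (x, y)
    children: list["AXNode"] = field(default_factory=list)


def walk(node: AXNode) -> Iterator[AXNode]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: AXNode, role: str, title: str) -> Optional[AXNode]:
    """Return the first element matching role and title, or None."""
    for node in walk(root):
        if node.role == role and node.title == title:
            return node
    return None


# A toy window shaped like the example in the text.
window = AXNode("AXWindow", "New Message", (0, 0), [
    AXNode("AXTextField", "To", (120, 80)),
    AXNode("AXButton", "Send", (450, 320)),
])

send = find(window, "AXButton", "Send")
print(send.position if send else "not found")
```

Finding the "Send" button is a plain tree search on role and title, and the click target comes straight from the element's stored position.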

How the Agent Loop Works

The core loop is straightforward:

  1. Observe - read the accessibility tree of the frontmost application
  2. Understand - send the tree structure to Claude with the current task context
  3. Decide - Claude determines the next action (click, type, scroll, switch apps)
  4. Execute - perform the action through accessibility APIs
  5. Verify - read the tree again to confirm the action worked

This loop runs continuously until the task is complete or the agent encounters something it can't handle.
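The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Fazm's implementation: `run_agent`, `read_tree`, `decide`, and `execute` are hypothetical names, the tree is a plain dict, and the "verify" step is a deliberately naive check that the tree changed at all. In a real agent, `decide` would call Claude and `read_tree`/`execute` would go through the accessibility API.

```python
from typing import Callable, Optional

# Hypothetical types for the sketch: a tree snapshot and an action are dicts.
Tree = dict
Action = dict


def run_agent(
    task: str,
    read_tree: Callable[[], Tree],
    decide: Callable[[str, Tree], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> bool:
    """Run observe/decide/execute/verify until done. Returns True on success."""
    for _ in range(max_steps):
        before = read_tree()              # 1. observe
        action = decide(task, before)     # 2-3. understand and decide
        if action is None:                # the model reports the task is complete
            return True
        execute(action)                   # 4. execute
        after = read_tree()               # 5. verify
        if after == before:
            # Naive check: nothing changed, so assume the action failed.
            return False
    return False                          # step budget exhausted


# Toy stand-ins so the sketch runs without macOS or a network call.
state = {"field": "", "sent": False}


def read_tree() -> Tree:
    return dict(state)


def decide(task: str, tree: Tree) -> Optional[Action]:
    if not tree["field"]:
        return {"type": "type", "text": "hello"}
    if not tree["sent"]:
        return {"type": "click", "target": "Send"}
    return None


def execute(action: Action) -> None:
    if action["type"] == "type":
        state["field"] = action["text"]
    elif action["type"] == "click":
        state["sent"] = True


ok = run_agent("send hello", read_tree, decide, execute)
print(ok, state)
```

The `max_steps` cap and the early return on an unchanged tree are one simple way to model "encounters something it can't handle"; a production agent would likely need richer verification than an equality check.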

Why Native App Control Matters

Browser-based agents are limited to web apps. A desktop agent that controls native applications can automate workflows that span multiple apps - copy data from a spreadsheet, paste it into an email client, attach a file from Finder, and send it.

The accessibility API approach works with any app that follows standard macOS UI conventions. That covers most productivity software, creative tools, and system utilities.

The Hard Parts

Screen reading is easy. Making it reliable is hard. Applications update their UI unpredictably, accessibility labels are sometimes missing or misleading, and timing matters - you need to wait for animations and loading states before reading the tree again.
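One common way to handle the timing problem is to poll until the tree stops changing before acting on it. The sketch below assumes a hypothetical `snapshot` callable that returns a comparable summary of the tree; `wait_until_stable` and its parameters are illustrative names and defaults, not part of any macOS API.

```python
import time
from typing import Callable


def wait_until_stable(
    snapshot: Callable[[], object],
    interval: float = 0.1,
    required_matches: int = 2,
    timeout: float = 5.0,
) -> bool:
    """Poll snapshot() until it repeats on consecutive reads.

    Returns True once the value has stayed the same for `required_matches`
    consecutive polls, or False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    last = snapshot()
    matches = 0
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = snapshot()
        if current == last:
            matches += 1
            if matches >= required_matches:
                return True
        else:
            # The UI is still changing; restart the count from this snapshot.
            matches = 0
            last = current
    return False


# Toy snapshot source: the UI "loads" for a few polls, then settles.
frames = iter(["loading", "loading", "partial", "ready"])


def fake_snapshot() -> str:
    return next(frames, "ready")


settled = wait_until_stable(fake_snapshot, interval=0.01)
print(settled)
```

Requiring more than one matching read guards against catching the UI between two animation frames that happen to look the same. It does not help with missing or misleading labels, which need app-specific handling.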

Building a desktop agent that works in demos is a weekend project. Building one that works reliably every day takes months of edge case handling.

Fazm is an open source macOS AI agent, available on GitHub.
