Building a Full macOS Desktop Agent with Claude

Matthew Diakonov

Updated March 19, 2026

I built a full macOS desktop agent with Claude. The app reads the screen's accessibility tree, understands what's on screen, and can click and type in any native application. Here's how the architecture works.

The Accessibility Tree Foundation

Most macOS applications expose their UI through the accessibility API. This gives you a structured tree of every element (buttons, text fields, labels, menus, windows) along with its properties and position.

The agent queries this tree to understand the current state of the screen. Instead of taking a screenshot and feeding it to a vision model, it gets structured data directly. A button labeled "Send" at coordinates (450, 320) is just a data point, not a pattern recognition problem.
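
To make the "data point" idea concrete, here is a minimal sketch in Python. It does not call the real accessibility API: `UIElement`, `walk`, and `find` are hypothetical names, and the tree is built by hand to stand in for what the API would return. The role strings follow the macOS naming style (`AXButton`, `AXWindow`).

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class UIElement:
    """Hypothetical, simplified stand-in for one accessibility tree node."""
    role: str                                   # e.g. "AXWindow", "AXButton"
    title: str = ""
    position: tuple[int, int] = (0, 0)          # screen coordinates
    children: list["UIElement"] = field(default_factory=list)


def walk(node: UIElement) -> Iterator[UIElement]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: UIElement, role: str, title: str) -> Optional[UIElement]:
    """Return the first element with the given role and title, or None."""
    return next((n for n in walk(root) if n.role == role and n.title == title), None)


# A hand-built tree standing in for what the accessibility API would return.
window = UIElement("AXWindow", "New Message", (100, 80), [
    UIElement("AXTextField", "To", (160, 120)),
    UIElement("AXTextArea", "Body", (160, 180)),
    UIElement("AXButton", "Send", (450, 320)),
])

send = find(window, "AXButton", "Send")
print(send.position)  # (450, 320)
```

Finding the button is a lookup by role and title. No image needs to be interpreted.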

How the Agent Loop Works

The core loop is straightforward:

1. Observe: read the accessibility tree of the frontmost application
2. Understand: send the tree structure to Claude with the current task context
3. Decide: Claude determines the next action (click, type, scroll, switch apps)
4. Execute: perform the action through accessibility APIs
5. Verify: read the tree again to confirm the action worked

This loop runs continuously until the task is complete or the agent encounters something it can't handle.
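
The loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the agent's actual code: `FakeApp` is an in-memory stand-in for a native app, `scripted_decide` is a rule-based placeholder for the call to Claude, and `Action` and `run_agent` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # title of the element the action applies to
    text: str = ""     # text to type, if any


class FakeApp:
    """In-memory stand-in for a native app; not a real accessibility client."""

    def __init__(self) -> None:
        self.fields = {"To": "", "Body": ""}
        self.sent = False

    def observe(self) -> dict:
        # Observe: a real agent would read the accessibility tree here.
        return {"fields": dict(self.fields), "sent": self.sent}

    def execute(self, action: Action) -> None:
        # Execute: a real agent would act on the live UI here.
        if action.kind == "type" and action.target in self.fields:
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "Send":
            self.sent = True


def run_agent(app: FakeApp, decide: Callable[[dict], Action], max_steps: int = 10) -> bool:
    """Run observe/decide/execute/verify until done, stuck, or out of steps."""
    for _ in range(max_steps):
        before = app.observe()
        action = decide(before)      # Understand + Decide (the model call)
        if action.kind == "done":
            return True
        app.execute(action)
        after = app.observe()        # Verify: did anything change?
        if after == before:
            return False             # no visible effect: something the agent can't handle
    return False


def scripted_decide(state: dict) -> Action:
    """Rule-based placeholder for the model's decision."""
    if state["sent"]:
        return Action("done")
    if not state["fields"]["To"]:
        return Action("type", "To", "team@example.com")
    if not state["fields"]["Body"]:
        return Action("type", "Body", "Report attached.")
    return Action("click", "Send")


app = FakeApp()
print(run_agent(app, scripted_decide))  # True
```

The verify step here compares two snapshots and stops when an action changes nothing. A real agent would likely need a more specific check, since the UI can change for reasons unrelated to the action.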

Why Native App Control Matters

Browser-based agents are limited to web apps. A desktop agent that controls native applications can automate workflows that span multiple apps: copy data from a spreadsheet, paste it into an email client, attach a file from Finder, and send it.

The accessibility API approach works with any app that follows standard macOS UI conventions. That covers most productivity software, creative tools, and system utilities.

The Hard Parts

Screen reading is easy. Making it reliable is hard. Applications update their UI unpredictably, accessibility labels are sometimes missing or misleading, and timing matters: you need to wait for animations and loading states to finish before reading the tree again.
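
One way to handle the timing problem is to poll until the tree stops changing. The sketch below is one possible approach, not necessarily what this agent does: `wait_until_stable` is a hypothetical helper, and `read_snapshot` stands in for whatever function reads and serializes the tree. The defaults for `stable_reads`, `interval`, and `timeout` are arbitrary.

```python
import time
from typing import Any, Callable, Optional


def wait_until_stable(
    read_snapshot: Callable[[], Any],
    stable_reads: int = 2,
    interval: float = 0.05,
    timeout: float = 2.0,
) -> Optional[Any]:
    """Poll until `stable_reads` consecutive snapshots are equal.

    Returns the settled snapshot, or None if the UI kept changing until the timeout.
    """
    deadline = time.monotonic() + timeout
    last = read_snapshot()
    streak = 1
    while streak < stable_reads:
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
        current = read_snapshot()
        # Extend the streak on an identical read, otherwise start over.
        streak = streak + 1 if current == last else 1
        last = current
    return last


# Simulated UI that is still loading during the first two reads.
frames = iter(["loading", "loading...", "ready", "ready", "ready"])
print(wait_until_stable(lambda: next(frames), interval=0.01))  # ready
```

Returning `None` on timeout lets the caller decide whether to retry, act anyway, or give up, for example when an app animates indefinitely.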

Building a desktop agent that works in demos is a weekend project. Building one that works reliably every day takes months of edge-case handling.

Fazm is an open source macOS AI agent, available on GitHub.
