Using an MCP Server to Read the macOS Accessibility Tree for Desktop Control

Fazm Team · 3 min read


Most AI desktop agents rely on screenshots. They capture the screen, send the image to a vision model, and try to figure out where to click. It works sometimes, but it is slow, expensive, and unreliable. There is a better way: building an MCP server that reads the macOS accessibility tree directly.

How It Works

The MCP (Model Context Protocol) server connects to Claude and exposes the macOS accessibility tree as structured data. When Claude needs to interact with any application, it queries the server to get the full UI hierarchy: every button, text field, menu item, and interactive element with its label, role, and available actions.
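To make that concrete, the hierarchy the server returns can be thought of as nested records of role, label, and available actions. This is a minimal sketch: the field names mirror AX attribute concepts (role, title, actions), but the exact schema is an assumption, not the server's actual wire format.

```python
from dataclasses import dataclass, field
import json

@dataclass
class UINode:
    """One element in the accessibility hierarchy (illustrative shape)."""
    role: str                                          # e.g. "AXButton", "AXTextField"
    label: str                                         # accessible title or description
    actions: list[str] = field(default_factory=list)   # e.g. ["AXPress"]
    children: list["UINode"] = field(default_factory=list)

def to_dict(node: UINode) -> dict:
    """Serialize the tree so it can be sent back over MCP as JSON."""
    return {
        "role": node.role,
        "label": node.label,
        "actions": node.actions,
        "children": [to_dict(c) for c in node.children],
    }

# A toy window: one Save button and one search field.
window = UINode("AXWindow", "Untitled", children=[
    UINode("AXButton", "Save", actions=["AXPress"]),
    UINode("AXTextField", "Search", actions=["AXConfirm"]),
])

print(json.dumps(to_dict(window), indent=2))
```

Each element carries everything the model needs to decide what to do with it, which is why no pixel-level interpretation is required.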

This means Claude can open any app, see every button, field, and menu, and click or type into them. No screenshot interpretation needed. No coordinate guessing. The accessibility tree tells it exactly what is available and how to interact with it.
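As a sketch of how an agent might locate a target in that structured snapshot before acting on it, here is a depth-first lookup. The tree shape and the `find_element` helper are illustrative assumptions, not part of any real MCP server.

```python
# Hypothetical snapshot: dicts with "role", "label", "actions", "children".
tree = {
    "role": "AXWindow", "label": "Untitled", "actions": [],
    "children": [
        {"role": "AXButton", "label": "Save", "actions": ["AXPress"], "children": []},
        {"role": "AXMenuItem", "label": "Export", "actions": ["AXPress"], "children": []},
    ],
}

def find_element(node, role, label):
    """Depth-first search for the first element matching role and label."""
    if node["role"] == role and node["label"] == label:
        return node
    for child in node["children"]:
        found = find_element(child, role, label)
        if found is not None:
            return found
    return None

save = find_element(tree, "AXButton", "Save")
print(save["actions"])  # the agent would then ask the server to perform "AXPress"
```

The match is by role and label rather than by screen coordinates, so it keeps working if the window moves or resizes.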

Why This Is More Reliable Than Screenshots

Screenshot-based agents have several failure modes. Resolution changes break coordinate calculations. Dark mode versus light mode confuses vision models. Overlapping windows hide elements. Animations during capture cause blurry or incomplete images.

The accessibility tree has none of these problems. It is a structured representation that does not depend on visual rendering. A button is a button regardless of what theme the app uses or where the window is positioned on screen.

Real-World Usage

In practice, this approach handles tasks that screenshot agents struggle with. Navigating complex forms across multiple applications, interacting with dropdown menus that only appear on hover, reading text from non-standard UI components: all of these work reliably because the accessibility tree exposes them as structured data.

The MCP architecture also means any tool that supports the protocol can use this capability. It is not locked into a single agent framework.

Performance Benefits

Sending structured text instead of images dramatically reduces token usage and latency. An accessibility tree snapshot for a typical application is a few kilobytes of text. A screenshot is hundreds of kilobytes of image data that requires vision model processing. The structured approach is both faster and cheaper.
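A rough back-of-the-envelope comparison illustrates the gap. The 300 KB screenshot figure below is a nominal ballpark for a full-window PNG, not a measurement, and the 100-element tree is synthetic.

```python
import json

def make_node(i):
    """One synthetic element, roughly the size of a real tree entry."""
    return {"role": "AXButton", "label": f"Button {i}", "actions": ["AXPress"]}

# An illustrative window with 100 interactive elements.
tree = {"role": "AXWindow", "label": "Demo",
        "children": [make_node(i) for i in range(100)]}

snapshot = json.dumps(tree).encode("utf-8")
screenshot_bytes = 300 * 1024  # nominal full-window PNG, assumed figure

print(f"tree snapshot: {len(snapshot)} bytes")
print(f"screenshot is roughly {screenshot_bytes / len(snapshot):.0f}x larger")
```

And the text snapshot needs no vision-model pass at all, so the latency saving compounds with the size saving.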

Fazm is an open-source macOS AI agent, available on GitHub.
