Building a macOS Desktop Agent with Accessibility APIs Instead of CSS Selectors

Fazm Team · 2 min read

Most desktop automation tools try to control apps through CSS selectors, pixel coordinates, or screenshot analysis. All of these approaches are fragile. CSS selectors break when apps update. Pixel matching fails at different resolutions. Screenshots waste tokens on visual processing that misses interactive elements.

There is a better approach: using the macOS accessibility APIs directly.

Why Accessibility APIs Win

Every macOS application exposes a structured tree of UI elements through the accessibility framework. Buttons, text fields, menus, sliders - they are all represented as nodes with roles, labels, and actions. This is the same tree that screen readers like VoiceOver use.

When you feed this tree to an LLM instead of a screenshot, the model gets structured, semantic information about every interactive element on screen. It knows what each button does, what text is in each field, and what actions are available. No guessing from pixels required.
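To make this concrete, here is a minimal sketch of what "feeding the tree to an LLM" can look like: each node rendered as one compact line, with indentation carrying the hierarchy. The dict shape (role/label/value/actions/children) is a simplified stand-in for what the macOS AX API exposes, not Fazm's actual prompt format.

```python
def serialize(node, depth=0):
    """Render one node per line; indentation shows hierarchy."""
    parts = [node["role"]]
    if node.get("label"):
        parts.append(f'"{node["label"]}"')
    if node.get("value"):
        parts.append(f"value={node['value']!r}")
    if node.get("actions"):
        parts.append(f"actions={','.join(node['actions'])}")
    lines = ["  " * depth + " ".join(parts)]
    for child in node.get("children", []):
        lines.extend(serialize(child, depth + 1))
    return lines

tree = {
    "role": "AXWindow", "label": "Untitled",
    "children": [
        {"role": "AXButton", "label": "Save", "actions": ["AXPress"]},
        {"role": "AXTextField", "label": "Search", "actions": ["AXConfirm"]},
    ],
}

print("\n".join(serialize(tree)))
# AXWindow "Untitled"
#   AXButton "Save" actions=AXPress
#   AXTextField "Search" actions=AXConfirm
```

A representation like this gives the model every interactive element and its supported actions in a handful of tokens, instead of a screenshot's worth of pixels.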

The Token Problem and Pruning

A full accessibility tree for a complex application can be enormous - thousands of nodes with attributes, children, and relationships. Feeding the entire tree to an LLM burns through context windows fast.

The solution is aggressive pruning. By filtering out decorative elements, collapsed sections, and off-screen content, you can cut token usage by roughly 60% while keeping all the actionable information. The pruning system learns which elements matter for each type of task and drops the rest.
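The filtering step can be sketched as a recursive pass over the tree. The role names below are real AX roles, but the heuristics and the `offscreen`/`collapsed` flags are illustrative assumptions, not the learned pruning rules the post describes.

```python
# Roles treated as decorative for this sketch (assumption).
DECORATIVE_ROLES = {"AXImage", "AXSplitter", "AXGroup", "AXUnknown"}

def prune(node):
    """Return a pruned copy of the tree, or None if nothing useful remains."""
    if node.get("offscreen") or node.get("collapsed"):
        return None  # off-screen or collapsed content: drop the whole subtree
    children = [p for c in node.get("children", []) if (p := prune(c))]
    keep = (
        node["role"] not in DECORATIVE_ROLES
        or node.get("actions")  # still interactive
        or node.get("label")    # still informative
    )
    if not keep and not children:
        return None
    kept = {k: v for k, v in node.items() if k != "children"}
    if children:
        kept["children"] = children
    return kept

tree = {
    "role": "AXWindow", "label": "Main",
    "children": [
        {"role": "AXImage"},  # decorative: dropped
        {"role": "AXButton", "label": "Save", "actions": ["AXPress"]},
        {"role": "AXGroup", "offscreen": True,  # off-screen: dropped
         "children": [{"role": "AXButton", "label": "Hidden"}]},
    ],
}

pruned = prune(tree)
# pruned["children"] now holds only the Save button
```

Every node that survives is either actionable itself or an ancestor of something actionable, which is what keeps the token savings from costing the model any useful context.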

Voice Control That Actually Works

Once you have reliable accessibility tree interpretation, voice control becomes straightforward. Spoken commands map to native accessibility actions - "click the save button" finds the button node and triggers its press action. "Type hello in the search field" locates the text field and inserts text.
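A hedged sketch of that command-to-action mapping, assuming the LLM has already parsed the utterance into a structured command; `find_node` and the command shape are illustrative, and the real agent would call the AX API where the comments indicate.

```python
def find_node(node, role, label):
    """Depth-first search for a node by role and (case-insensitive) label."""
    if node.get("role") == role and node.get("label", "").lower() == label.lower():
        return node
    for child in node.get("children", []):
        if (hit := find_node(child, role, label)):
            return hit
    return None

def execute(tree, command):
    """command: {'action': 'press'|'type', 'role': ..., 'label': ..., 'text'?}"""
    node = find_node(tree, command["role"], command["label"])
    if node is None:
        return f"no {command['role']} labeled {command['label']!r}"
    if command["action"] == "press":
        return f"AXPress -> {node['label']}"  # would call AXUIElementPerformAction
    if command["action"] == "type":
        node["value"] = command["text"]       # would set kAXValueAttribute
        return f"typed {command['text']!r} into {node['label']}"

tree = {"role": "AXWindow", "children": [
    {"role": "AXButton", "label": "Save"},
    {"role": "AXTextField", "label": "Search", "value": ""},
]}

execute(tree, {"action": "press", "role": "AXButton", "label": "save"})
# -> "AXPress -> Save"
```

Because the target is resolved by role and label rather than by screen position, the same command keeps working after a window resize, a theme change, or an app update that moves the button.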

This is fundamentally more reliable than voice-to-screenshot-to-click pipelines because the system knows exactly what elements exist and what actions they support. No coordinate mapping, no OCR errors, no resolution dependencies.

The Result

Desktop automation built on accessibility APIs handles app updates, resolution changes, and theme switches without breaking. The LLM works with structured data instead of raw pixels, and the pruning system keeps costs manageable.

Fazm is an open source macOS AI agent, available on GitHub.
