Building a macOS Desktop Agent with Accessibility APIs Instead of CSS Selectors

Fazm Team · 2 min read

Most desktop automation tools try to control apps through CSS selectors, pixel coordinates, or screenshot analysis. All of these approaches are fragile. CSS selectors break when apps update. Pixel matching fails at different resolutions. Screenshots waste tokens on visual processing that misses interactive elements.

There is a better approach: using the macOS accessibility APIs directly.

Why Accessibility APIs Win

Every macOS application exposes a structured tree of UI elements through the accessibility framework. Buttons, text fields, menus, sliders - they are all represented as nodes with roles, labels, and actions. This is the same tree that screen readers like VoiceOver use.

When you feed this tree to an LLM instead of a screenshot, the model gets structured, semantic information about every interactive element on screen. It knows what each button does, what text is in each field, and what actions are available. No guessing from pixels required.
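To make this concrete, here is a minimal sketch of what "feeding the tree to an LLM" can look like: each node rendered as one compact line, with indentation carrying the hierarchy. The dict shape (role/label/value/actions/children) is a simplified stand-in for what the macOS AX API exposes, not Fazm's actual prompt format.

```python
def serialize(node, depth=0):
    """Render one node per line; indentation shows hierarchy."""
    parts = [node["role"]]
    if node.get("label"):
        parts.append(f'"{node["label"]}"')
    if node.get("value"):
        parts.append(f"value={node['value']!r}")
    if node.get("actions"):
        parts.append(f"actions={','.join(node['actions'])}")
    lines = ["  " * depth + " ".join(parts)]
    for child in node.get("children", []):
        lines.extend(serialize(child, depth + 1))
    return lines

tree = {
    "role": "AXWindow", "label": "Untitled",
    "children": [
        {"role": "AXButton", "label": "Save", "actions": ["AXPress"]},
        {"role": "AXTextField", "label": "Search", "actions": ["AXConfirm"]},
    ],
}

print("\n".join(serialize(tree)))
# AXWindow "Untitled"
#   AXButton "Save" actions=AXPress
#   AXTextField "Search" actions=AXConfirm
```

A representation like this gives the model every interactive element and its supported actions in a handful of tokens, instead of a screenshot's worth of pixels.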

The Token Problem and Pruning

A full accessibility tree for a complex application can be enormous - thousands of nodes with attributes, children, and relationships. Feeding the entire tree to an LLM burns through context windows fast.

The solution is aggressive pruning. By filtering out decorative elements, collapsed sections, and off-screen content, you can cut token usage by roughly 60% while keeping all the actionable information. The pruning system learns which elements matter for each type of task and drops the rest.
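The filtering step can be sketched as a recursive pass over the tree. The role names below are real AX roles, but the heuristics and the `offscreen`/`collapsed` flags are illustrative assumptions, not the learned pruning rules the post describes.

```python
# Roles treated as decorative for this sketch (assumption).
DECORATIVE_ROLES = {"AXImage", "AXSplitter", "AXGroup", "AXUnknown"}

def prune(node):
    """Return a pruned copy of the tree, or None if nothing useful remains."""
    if node.get("offscreen") or node.get("collapsed"):
        return None  # off-screen or collapsed content: drop the whole subtree
    children = [p for c in node.get("children", []) if (p := prune(c))]
    keep = (
        node["role"] not in DECORATIVE_ROLES
        or node.get("actions")  # still interactive
        or node.get("label")    # still informative
    )
    if not keep and not children:
        return None
    kept = {k: v for k, v in node.items() if k != "children"}
    if children:
        kept["children"] = children
    return kept

tree = {
    "role": "AXWindow", "label": "Main",
    "children": [
        {"role": "AXImage"},  # decorative: dropped
        {"role": "AXButton", "label": "Save", "actions": ["AXPress"]},
        {"role": "AXGroup", "offscreen": True,  # off-screen: dropped
         "children": [{"role": "AXButton", "label": "Hidden"}]},
    ],
}

pruned = prune(tree)
# pruned["children"] now holds only the Save button
```

Every node that survives is either actionable itself or an ancestor of something actionable, which is what keeps the token savings from costing the model any useful context.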

Voice Control That Actually Works

Once you have reliable accessibility tree interpretation, voice control becomes straightforward. Spoken commands map to native accessibility actions - "click the save button" finds the button node and triggers its press action. "Type hello in the search field" locates the text field and inserts text.
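A hedged sketch of that command-to-action mapping, assuming the LLM has already parsed the utterance into a structured command; `find_node` and the command shape are illustrative, and the real agent would call the AX API where the comments indicate.

```python
def find_node(node, role, label):
    """Depth-first search for a node by role and (case-insensitive) label."""
    if node.get("role") == role and node.get("label", "").lower() == label.lower():
        return node
    for child in node.get("children", []):
        if (hit := find_node(child, role, label)):
            return hit
    return None

def execute(tree, command):
    """command: {'action': 'press'|'type', 'role': ..., 'label': ..., 'text'?}"""
    node = find_node(tree, command["role"], command["label"])
    if node is None:
        return f"no {command['role']} labeled {command['label']!r}"
    if command["action"] == "press":
        return f"AXPress -> {node['label']}"  # would call AXUIElementPerformAction
    if command["action"] == "type":
        node["value"] = command["text"]       # would set kAXValueAttribute
        return f"typed {command['text']!r} into {node['label']}"

tree = {"role": "AXWindow", "children": [
    {"role": "AXButton", "label": "Save"},
    {"role": "AXTextField", "label": "Search", "value": ""},
]}

execute(tree, {"action": "press", "role": "AXButton", "label": "save"})
# -> "AXPress -> Save"
```

Because the target is resolved by role and label rather than by screen position, the same command keeps working after a window resize, a theme change, or an app update that moves the button.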

This is fundamentally more reliable than voice-to-screenshot-to-click pipelines because the system knows exactly what elements exist and what actions they support. No coordinate mapping, no OCR errors, no resolution dependencies.

The Result

Desktop automation built on accessibility APIs handles app updates, resolution changes, and theme switches without breaking. The LLM works with structured data instead of raw pixels, and the pruning system keeps costs manageable.

Fazm is an open source macOS AI agent, available on GitHub.
