macOS Accessibility API vs Screenshot Agents: Performance Deep Dive
A recurring discussion in the local AI agent community is the gap between what macOS accessibility APIs can do and what most agents actually use. The majority of desktop AI agents today take screenshots, send them to a vision model, and click on coordinates. It works - but it is slow, fragile, and blind to the semantic structure of what is on screen. There is another approach that most agent builders overlook, and the performance difference is not incremental - depending on the task, it is 50x or more.
1. Two Approaches to Desktop Automation
If you want an AI agent to interact with applications on macOS, there are fundamentally two strategies. The first is to look at the screen the way a human does - capture a screenshot, interpret the pixels, decide where to click, and simulate mouse events. This is the approach taken by most current agent frameworks including Anthropic Computer Use, OpenAI Operator, and various open-source projects.
The second approach is to skip the visual layer entirely and talk directly to the operating system about what UI elements exist, where they are, and what they do. macOS has had APIs for this since the early releases of Mac OS X - originally built for screen readers and assistive technology. These accessibility APIs expose the full semantic tree of every running application: buttons, text fields, menus, labels, values, and their relationships.
The distinction matters because it is not just a performance difference. It is an architectural difference that affects reliability, token cost, latency, and what kinds of tasks an agent can handle. Most of the discussion around AI agents focuses on model capabilities, but the input/output interface to the desktop is equally important and far less discussed.
2. How macOS Accessibility APIs Work
The core of macOS accessibility is the AXUIElement API, part of the ApplicationServices framework. Every UI element in every application is represented as an AXUIElement with attributes you can query: its role (button, text field, menu item), its title or label, its position and size, its value (for text fields and sliders), and its available actions (press, increment, show menu).
A typical interaction looks like this: you get a reference to the frontmost application, walk its element tree to find a specific button by role and title, then call AXUIElementPerformAction to press it. The entire round trip - querying the tree and performing the action - takes roughly 5 to 50 milliseconds depending on tree depth and application complexity.
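That round trip can be sketched in Swift. This is a minimal illustration, not any particular agent's implementation - it assumes the process has been granted Accessibility permission, and the "Save" button title is a hypothetical example:

```swift
import AppKit
import ApplicationServices

// Recursively search an AX tree for a button with a given title.
func findButton(titled title: String, in element: AXUIElement) -> AXUIElement? {
    var role: CFTypeRef?
    var elementTitle: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &role)
    AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &elementTitle)
    if role as? String == kAXButtonRole, elementTitle as? String == title {
        return element
    }
    var children: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString, &children)
    for child in (children as? [AXUIElement]) ?? [] {
        if let match = findButton(titled: title, in: child) { return match }
    }
    return nil
}

// Query the frontmost app and press a (hypothetical) "Save" button.
if let app = NSWorkspace.shared.frontmostApplication {
    let appElement = AXUIElementCreateApplication(app.processIdentifier)
    if let button = findButton(titled: "Save", in: appElement) {
        AXUIElementPerformAction(button, kAXPressAction as CFString)
    }
}
```

The tree walk plus the press is the entire round trip described above - no image capture, no model call.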
For lower-level input simulation, macOS provides CGEvent - the Core Graphics event system. CGEvent lets you create and post synthetic mouse clicks, key presses, and scroll events at the Quartz event system level. This is what you use when you need to type text into a field or perform click sequences that AXUIElement actions do not cover. CGEvent operations complete in under a millisecond.
- AXUIElementCopyAttributeValue - Read any attribute of any UI element: title, value, position, size, enabled state, children, parent
- AXUIElementPerformAction - Press buttons, toggle checkboxes, open menus, confirm dialogs
- AXObserver - Watch for UI changes in real time: window moved, value changed, element created or destroyed
- CGEventCreateMouseEvent / CGEventCreateKeyboardEvent - Simulate raw input when you need pixel-precise control
- AXUIElementCopyElementAtPosition - Hit-test any screen coordinate to find out what element is there
The key insight is that these APIs give you structured data. You do not have to guess what a button says - you read its title attribute. You do not have to estimate where a text field is - you read its AXPosition and AXSize. You do not have to figure out if an element is clickable - you check its AXRole and AXEnabled attributes.
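The CGEvent half of the toolkit is equally compact. Here is a sketch of a synthetic left click - the coordinate is arbitrary, and posting events still requires the Accessibility permission:

```swift
import CoreGraphics

// Build and post a synthetic left click: a mouse-down followed by a mouse-up.
let point = CGPoint(x: 200, y: 300)
let down = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                   mouseCursorPosition: point, mouseButton: .left)
let up = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                 mouseCursorPosition: point, mouseButton: .left)
down?.post(tap: .cghidEventTap)
up?.post(tap: .cghidEventTap)
```

In practice you would derive `point` from an element's AXPosition and AXSize attributes rather than hard-coding it - which is exactly where the two APIs complement each other.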
3. How Screenshot-Based Agents Work
Screenshot-based agents follow a fundamentally different pipeline. On every action step, the agent captures a screenshot of the entire screen (or a window), encodes it as a JPEG or PNG, and sends it to a vision-language model along with the task description and action history. The model analyzes the image, identifies relevant UI elements visually, and outputs coordinates for where to click or what text to type.
The typical latency breakdown for a single action step looks something like this: 50-100ms for screenshot capture and encoding, 200-500ms for image upload to the API, 1000-2000ms for model inference, and 50-100ms for the resulting mouse or keyboard event. Total round trip: roughly 1300 to 2700 milliseconds per action, assuming a fast API connection.
Beyond latency, screenshot agents have a structural limitation: they operate on pixels, not semantics. A vision model looking at a screenshot cannot reliably distinguish between a button and an image that looks like a button. It can misread text in low-contrast themes. It struggles with overlapping windows, dropdown menus that extend beyond their parent window, and UI elements that are present in the accessibility tree but not visible on screen (scrolled-out content, hidden panels).
That said, screenshot agents have one significant advantage: they work with any application, including those with poor or nonexistent accessibility support. They can handle custom-rendered canvases, game UIs, and web applications with non-standard DOM structures. This universality is why they remain popular despite their limitations.
4. Performance Benchmarks: 50ms vs 2500ms
The performance gap between these approaches is not subtle. Here is a concrete comparison across common desktop tasks.
| Task | Accessibility API | Screenshot Agent | Speedup |
|---|---|---|---|
| Click a named button | 10-30ms | 1500-2500ms | 50-250x |
| Read text field value | 5-10ms | 1500-2500ms | 150-500x |
| Fill a 5-field form | 50-150ms | 10-15s (multi-step) | 70-300x |
| Navigate a menu hierarchy | 20-50ms | 3-5s (multiple screenshots) | 60-250x |
| Enumerate all buttons in a window | 30-80ms | Not reliably possible | N/A |
| Detect if element is enabled/disabled | 5ms | Unreliable from pixels | N/A |
The cost difference is equally significant. Each screenshot-based action step consumes vision model tokens - typically 1000 to 3000 tokens for the image plus context. A ten-step workflow costs roughly $0.02 to $0.10 in API calls. The same workflow through accessibility APIs costs zero in model inference because no vision model is needed for UI element identification.
For workflows that involve many steps - filling out a long form, processing a batch of files through an application, or navigating a complex multi-pane interface - the cumulative difference is substantial. A task that takes a screenshot agent two minutes and costs $0.50 in API calls might take an accessibility-based agent three seconds and cost nothing beyond the initial planning step.
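As a back-of-the-envelope check on those numbers - the token count and price below are illustrative assumptions, not measured figures:

```swift
// Rough cost model for a ten-step screenshot-driven workflow.
let stepsPerWorkflow = 10
let tokensPerStep = 2_000          // assumed image + context tokens per action
let dollarsPerMillionTokens = 3.0  // assumed vision-model input price

let totalTokens = stepsPerWorkflow * tokensPerStep
let cost = Double(totalTokens) / 1_000_000 * dollarsPerMillionTokens
print("≈ \(totalTokens) tokens, ≈ $\(cost) per workflow")  // ≈ 20,000 tokens, ≈ $0.06
```

Higher per-step token counts or prices push the same arithmetic toward the upper end of the $0.02-$0.10 range quoted above; the accessibility path spends none of it.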
5. Reliability and Edge Cases
Performance is only part of the story. The bigger practical difference is reliability. Screenshot agents fail in predictable ways that are hard to fix.
- Resolution and scaling - A screenshot taken on a Retina display has different pixel coordinates than the same screenshot on an external monitor. Many agents mishandle the 2x scaling factor, clicking in the wrong location.
- Dark mode and custom themes - Vision models trained primarily on light-mode screenshots can misidentify elements in dark mode. Buttons blend into backgrounds, text contrast changes, and active/inactive states look different.
- Overlapping windows - When a dialog or dropdown partially obscures the target window, the screenshot agent sees a composite image and may click on the wrong element.
- Dynamic content - Loading spinners, animations, and content that appears after the screenshot was captured cause the agent to act on stale information.
- Off-screen elements - A text field that requires scrolling to reach is invisible to a screenshot agent but fully accessible through the AX tree.
Accessibility APIs sidestep most of these issues because they operate on the semantic model, not the visual rendering. A button is a button regardless of what theme is active or whether it is partially obscured. A text field has a value attribute whether or not it is currently visible on screen. The AX tree is the source of truth for the UI state - the visual rendering is just one representation of it.
6. When to Use Each Approach
Despite the performance advantages of accessibility APIs, screenshot agents are not going away. Each approach has a clear set of scenarios where it is the better choice.
| Scenario | Best approach | Why |
|---|---|---|
| Native macOS apps (Finder, Mail, Safari) | Accessibility API | Excellent AX support, full element trees |
| Electron apps (Slack, VS Code, Discord) | Accessibility API | Chromium exposes AX tree well |
| Web browsers (Chrome, Safari, Arc) | Hybrid (AX + DOM) | AX for browser chrome, DOM for web content |
| Canvas-heavy apps (Figma, games) | Screenshot | Custom rendering, no AX tree for canvas content |
| Remote desktop / VMs | Screenshot | No local AX tree for remote UI |
| Cross-platform agents | Screenshot | AX APIs differ across Windows, Linux, macOS |
The practical reality is that most desktop agent tasks fall in the first two categories - native apps and Electron apps, which together cover the vast majority of professional desktop workflows. For these applications, the accessibility API approach is faster, cheaper, and more reliable across the board.
This is the approach Fazm takes - an open-source macOS agent built primarily on accessibility APIs (AXUIElement and CGEvent), falling back to screenshot analysis only when the accessibility tree is insufficient. The result is sub-second action execution for most desktop tasks, with the visual fallback handling edge cases like canvas-based applications.
The best agents will likely be hybrid, using accessibility APIs as the primary interface and screenshots as a fallback for applications with poor AX support. This gives you the speed and reliability of structured APIs for 90% of interactions while maintaining universality for the remaining 10%.
7. Building Hybrid Agents
If you are building a macOS agent or evaluating existing ones, here is a practical framework for thinking about the architecture.
Start with accessibility. Before reaching for a vision model, check if the target application exposes its UI through the AX tree. Most do. Use AXUIElementCopyAttributeValue to query the frontmost application and walk its children. If you can find and interact with the elements you need through AX calls alone, you have a fast, reliable, and free (no API cost) interaction path.
Add CGEvent for input simulation. AXUIElement handles reading UI state and performing standard actions (press, select, toggle). For typing text, complex mouse gestures, and keyboard shortcuts, layer in CGEvent. The combination of AXUIElement for understanding the UI and CGEvent for input covers the vast majority of desktop automation tasks.
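For typing, one common technique is to attach each character to a synthetic key event as a Unicode string, which avoids mapping characters to virtual key codes. A sketch - virtual key 0 is a placeholder, since receiving apps read the attached string:

```swift
import CoreGraphics

// Type arbitrary text by attaching each character to a synthetic key event.
// Returns the number of characters processed.
@discardableResult
func typeText(_ text: String) -> Int {
    for scalar in text.unicodeScalars {
        var utf16 = Array(String(scalar).utf16)
        let down = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true)
        down?.keyboardSetUnicodeString(stringLength: utf16.count, unicodeString: &utf16)
        down?.post(tap: .cghidEventTap)
        CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: false)?
            .post(tap: .cghidEventTap)
    }
    return text.unicodeScalars.count
}

typeText("hello")
```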
Fall back to screenshots selectively. When the AX tree is empty or incomplete - which you can detect programmatically by checking tree depth and element count - switch to screenshot analysis for that specific interaction. This keeps vision model calls to a minimum while maintaining coverage.
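The "detect programmatically" step can be a simple heuristic. A sketch - the thresholds are assumptions you would tune per application:

```swift
// Summary statistics gathered while walking an app's AX tree.
struct AXTreeStats {
    let elementCount: Int
    let maxDepth: Int
}

// Fall back to screenshot analysis when the tree looks too sparse to be
// the real UI (e.g. a canvas app exposing only a window and one group).
func shouldFallBackToVision(_ stats: AXTreeStats,
                            minElements: Int = 5,
                            minDepth: Int = 2) -> Bool {
    return stats.elementCount < minElements || stats.maxDepth < minDepth
}

let canvasLike = shouldFallBackToVision(AXTreeStats(elementCount: 2, maxDepth: 1))
let nativeLike = shouldFallBackToVision(AXTreeStats(elementCount: 80, maxDepth: 9))
print(canvasLike, nativeLike)  // true false
```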
Use AXObserver for reactive behavior. Instead of polling screenshots on a timer, register AX observers for relevant UI changes. Get notified when a dialog appears, a value changes, or a new window opens. This makes your agent event-driven rather than poll-driven, reducing both latency and resource usage.
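Registering an observer might look like the sketch below - the pid is a placeholder (a real agent would take it from NSRunningApplication), and only one notification is watched:

```swift
import ApplicationServices

let pid: pid_t = 12_345  // placeholder; use the target app's real pid
var observer: AXObserver?
let callback: AXObserverCallback = { _, _, notification, _ in
    print("UI change: \(notification)")  // e.g. a new window appeared
}

let status = AXObserverCreate(pid, callback, &observer)
if status == .success, let observer = observer {
    let app = AXUIElementCreateApplication(pid)
    AXObserverAddNotification(observer, app,
                              kAXWindowCreatedNotification as CFString, nil)
    CFRunLoopAddSource(CFRunLoopGetCurrent(),
                       AXObserverGetRunLoopSource(observer),
                       .defaultMode)
    // CFRunLoopRun() would block here, delivering notifications as they fire.
}
```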
The macOS accessibility API surface area is larger than most developers realize. Apple has invested decades in it because it underpins VoiceOver, Switch Control, and the entire assistive technology ecosystem. For AI agent builders, this mature infrastructure is a significant advantage over screenshot-first approaches - one that the current wave of agent frameworks is only beginning to exploit. The agents that figure out how to use both approaches intelligently will be meaningfully faster, cheaper, and more reliable than those that rely on vision alone.
Try a macOS agent built on accessibility APIs
Fazm is an open-source desktop agent that uses AXUIElement and CGEvent for sub-second interactions, falling back to vision only when needed. Fully local, voice-first, and free.
Get Started Free