AI Agent Reliability

AI Agent Error Recovery: Accessibility APIs vs Screenshots for Reliable Desktop Automation

When Anthropic launched computer use, it showed an AI agent controlling a full desktop, opening browsers, clicking buttons, filling forms. The demos were impressive. The reality of using it daily is more complicated. Desktop AI agents confidently click the wrong things, misread text from screenshots, and struggle to recover when something unexpected appears on screen. The core reliability problem comes down to how agents perceive the desktop: pixel matching from screenshots versus structured data from accessibility APIs.

1. The Confident Failure Problem

The most dangerous behavior in desktop AI agents is not failure. It is confident failure. An agent that crashes and stops is annoying but safe. An agent that clicks the wrong button, fills in the wrong field, or confirms the wrong dialog, all while reporting success, can cause real damage.

This happens because screenshot-based agents rely on visual pattern matching to identify UI elements. The agent sees pixels, not meaning. A "Delete" button and a "Submit" button might be the same size, the same shape, and in a similar position. If the agent misreads the text or the screenshot is slightly blurred, it clicks the wrong one with full confidence.

Common confident failure scenarios:

  • Clicking "OK" on an unexpected permission dialog without reading it
  • Selecting the wrong item from a dropdown because two options look similar in a screenshot
  • Typing into the wrong text field because the cursor position was misjudged
  • Proceeding through a multi-step wizard while missing an error message displayed in a small font or non-standard color
  • Confirming a destructive action because the confirmation dialog layout was not what the agent expected

Each of these failures looks like success to the agent. It took an action and the screen changed. Without understanding the semantic meaning of what changed, the agent has no way to detect that it did the wrong thing.

2. Screenshots vs Accessibility APIs

Desktop AI agents need to perceive what is on screen and interact with it. There are two fundamentally different approaches, and they have very different reliability profiles.

Screenshot-based (vision) approach: The agent takes a screenshot, sends it to a vision model, and gets back a description of what is on screen plus coordinates to click. This is how Anthropic's Computer Use, OpenAI's Operator, and several other agents work.

Accessibility API approach: The agent queries the operating system's accessibility tree, which provides a structured representation of every UI element on screen, including its role, label, value, position, and state. This is the same data that screen readers like VoiceOver use.
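
To make the contrast concrete, here is a minimal sketch of what "structured tree" lookup means. The `AXNode` class and the toy window below are hypothetical simplifications; a real tree (such as macOS's `AXUIElement` hierarchy) carries many more attributes, but the lookup logic is the same: exact match on role and label, no inference.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical, simplified accessibility node; real trees expose
# role, label, value, position, and state per element.
@dataclass
class AXNode:
    role: str                     # e.g. "button", "textfield"
    label: str                    # the accessible name screen readers announce
    enabled: bool = True
    children: list = field(default_factory=list)

def find(node: AXNode, role: str, label: str) -> Optional[AXNode]:
    """Depth-first exact lookup by role + label -- deterministic, no vision model."""
    if node.role == role and node.label == label:
        return node
    for child in node.children:
        hit = find(child, role, label)
        if hit is not None:
            return hit
    return None

# A toy tree standing in for a real application window.
window = AXNode("window", "Settings", children=[
    AXNode("button", "Delete Account"),
    AXNode("button", "Save", enabled=False),
])

target = find(window, "button", "Delete Account")
print(target.label, target.enabled)  # role, label, and state are unambiguous
```

Note that the disabled "Save" button is still present in the tree with `enabled=False`; a screenshot-based agent would have to infer that state from its grayed-out appearance.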

| Property | Screenshots | Accessibility APIs |
| --- | --- | --- |
| Data format | Pixels (unstructured) | Structured tree (roles, labels, values) |
| Element identification | Visual pattern matching | Exact label and role lookup |
| State awareness | Inferred from appearance | Explicit (enabled, disabled, checked, expanded) |
| Hidden elements | Not visible, not accessible | Present in tree; off-screen elements can be queried |
| Speed | Slow (screenshot + vision model inference) | Fast (local API call, no inference needed) |
| Works with any app | Yes (everything renders to pixels) | Mostly (depends on app accessibility support) |
| Cost per action | High (vision API call per step) | Low (local computation only) |

The tradeoff is universality versus reliability. Screenshots work with any application, including games, custom-rendered UIs, and remote desktops. Accessibility APIs work reliably with well-built native applications but may have incomplete data for poorly implemented apps or non-native interfaces.

3. Error Recovery Strategies

Error recovery is what separates a demo agent from a production agent. When things go wrong, and they will, the agent needs to detect the error, understand what happened, and decide on a recovery path.

Detection is the hardest part. Accessibility APIs make detection significantly easier because the agent can check element states programmatically. After clicking a "Submit" button, the agent can check: did a success message appear? Is the form still visible (meaning it failed)? Did an error alert show up? With screenshots, the agent has to take another screenshot, run inference, and try to determine if the screen changed in the expected way.

Common error recovery patterns:

  • State verification - After every action, verify the expected state change occurred. If clicking "Save" should show a confirmation toast, check for the toast. If it did not appear, the action may have failed.
  • Timeout with fallback - If an expected UI change does not happen within a reasonable time, try an alternative approach. Perhaps the button was not actually clicked, or a loading spinner appeared that the agent did not expect.
  • Undo and retry - When the agent detects it took the wrong action, use Cmd+Z or navigate back before trying again. This requires knowing which actions are reversible.
  • Escalation - When the agent cannot recover, stop and ask the user for help rather than continuing blindly. This is better than an agent that confidently destroys data while trying to fix its own mistake.
  • Checkpoint-based recovery - Save state at known-good points during a workflow. If something goes wrong, roll back to the last checkpoint rather than starting over.
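
The first two patterns, state verification and timeout with fallback, can be sketched together. This is a minimal illustration, not a real automation API: the `state` dict and `click_save` function below are stand-ins for a live accessibility tree and a real click action.

```python
import time

def wait_for(predicate, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll a state predicate until it holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return False

# Simulated app state standing in for the live accessibility tree.
state = {"toast_visible": False}

def click_save():
    state["toast_visible"] = True  # pretend the click landed and a toast appeared

click_save()
if wait_for(lambda: state["toast_visible"], timeout=1.0):
    print("save confirmed")
else:
    print("save not confirmed -- try a fallback or escalate to the user")
```

With an accessibility tree, the predicate is a cheap local query you can poll every 100 ms; with screenshots, each poll is another screenshot plus a vision-model call, which is why verification loops get slow and expensive.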

Accessibility APIs enable all of these patterns more reliably than screenshots because you get deterministic state information rather than probabilistic visual interpretation.

4. Structured Data vs Pixel Matching

The fundamental advantage of accessibility APIs is that they provide structured, semantic data. When the accessibility tree says a button is labeled "Delete Account" and its role is "button" and its state is "enabled," there is no ambiguity. The agent knows exactly what element it is dealing with.

Pixel matching, by contrast, is inherently ambiguous. The agent sees a rectangular region with certain colors and tries to infer meaning. This breaks in predictable ways:

  • Resolution sensitivity - A button that is clearly readable at 2x Retina resolution may be blurred or misread at 1x. Different monitors produce different screenshots.
  • Theme sensitivity - Dark mode changes colors, which changes how vision models interpret the UI. An agent trained on light mode screenshots may struggle with dark mode.
  • Overlapping elements - Dropdown menus, tooltips, notification banners, and modal dialogs can overlap the target element. The agent sees the overlap as part of the UI and gets confused.
  • Animation state - If the screenshot is taken during a transition animation, elements may be partially visible, in the wrong position, or distorted.
  • Localization - A button labeled "Submit" in English becomes "Soumettre" in French. The accessibility label can be language-independent (using accessibility identifiers), but the visual text changes.

None of these issues affect accessibility APIs because the data comes from the application's internal representation, not from its visual rendering. The button is a button regardless of resolution, theme, or language.

5. Hybrid Approaches in Practice

The most effective desktop agents in 2026 use a hybrid approach: accessibility APIs as the primary perception and interaction method, with screenshots as a fallback for applications that have poor accessibility support.

The hybrid workflow typically looks like this:

  1. Query the accessibility tree for the target element
  2. If the element is found with a clear label and role, interact with it directly through the accessibility API (click, type, read value)
  3. If the element is not found in the tree (custom-rendered UI, Electron app with poor accessibility, game), fall back to screenshot + vision model
  4. After any interaction, verify the result through the accessibility tree first, screenshot second
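
The fallback logic in steps 1-3 can be sketched in a few lines. `ax_find` and `vision_find` are hypothetical stand-ins for a real accessibility query and a screenshot-plus-vision-model pipeline; the point is the ordering, not the backends.

```python
def locate(role, label, ax_find, vision_find):
    """Prefer the accessibility tree; fall back to vision only when it misses."""
    node = ax_find(role, label)
    if node is not None:
        return ("accessibility", node)      # fast, deterministic path
    return ("vision", vision_find(label))   # slow, probabilistic fallback

# Toy backends: the AX tree knows the "Save" button but not a
# custom-drawn "Export" control, which only the vision path can find.
ax_tree = {("button", "Save"): {"enabled": True}}
method, hit = locate(
    "button", "Export",
    ax_find=lambda r, l: ax_tree.get((r, l)),
    vision_find=lambda l: {"x": 412, "y": 305},  # coordinates, no semantics
)
print(method, hit)
```

Note what the vision path returns: coordinates only, with no role, label, or state. That asymmetry is why step 4 verifies through the accessibility tree first whenever it can.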

On macOS specifically, most native applications have good accessibility support because Apple requires it for VoiceOver compatibility and enforces accessibility standards through the App Store review process. Electron-based applications (Slack, Discord, VS Code) have variable accessibility support: some elements are well-labeled, others are not.

Web applications rendered in a browser actually have excellent accessibility data available through the browser's accessibility tree, which mirrors the DOM structure. Tools like Playwright already use this approach for reliable browser automation. Extending the same principle to desktop applications is the natural next step.

6. Current Desktop Agent Approaches

The desktop AI agent landscape in 2026 splits roughly along the screenshot vs accessibility API line:

  • Anthropic Computer Use - Screenshot-based. Uses Claude's vision capabilities to interpret the screen and generate mouse/keyboard actions. Works with any application but has the reliability issues described above.
  • OpenAI Operator - Primarily browser-focused, using a combination of screenshots and DOM access. Reliable within the browser, limited outside it.
  • Apple Intelligence - Uses on-device models with deep OS integration. Has access to both visual data and system-level APIs but currently limited in scope.
  • Fazm - An open source AI computer agent for macOS that takes an accessibility-first approach. Voice-first interaction model with accessibility API perception. Focuses on reliability through structured data rather than screenshot inference.
  • Various RPA tools - Traditional robotic process automation tools like UiPath and Automation Anywhere have added AI capabilities. They typically use a mix of selectors, OCR, and image matching.

The trend is toward hybrid approaches. Pure screenshot agents are reliable enough for demos but struggle in production. Pure accessibility API agents are reliable but limited to well-instrumented applications. The agents that combine both approaches get the reliability of structured data with the universality of visual perception.

7. Building Reliable Agent Workflows

Whether you are using desktop AI agents or building your own, certain principles improve reliability regardless of the underlying perception method:

  • Verify after every action - Never assume an action succeeded. Always check the result before proceeding to the next step.
  • Use the most specific selector available - If an element has an accessibility identifier, use it. If not, use the label + role combination. Use coordinates only as a last resort.
  • Handle unexpected dialogs - System updates, permission requests, and notification banners can appear at any time. Build handlers for common interruptions.
  • Set explicit timeouts - Do not wait forever for a UI change. Set a timeout and define what happens when it expires.
  • Log everything - Every action, every state check, every decision. When something goes wrong (and it will), logs are how you debug it.
  • Scope the automation narrowly - An agent that automates a specific five-step workflow will be more reliable than one that tries to accomplish vague goals. Constrain what the agent can do.
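
Several of these principles, verify after every action, log everything, and stop rather than continue blindly, compose naturally into one wrapper. This is a hedged sketch: `run_step` and the simulated `state` dict are illustrative names, not part of any real agent framework.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("agent")

def run_step(name, action, verify):
    """Run one workflow step: log it, perform it, and verify the result
    before allowing the workflow to proceed."""
    log.info("step=%s starting", name)
    action()
    if not verify():
        log.error("step=%s verification failed", name)
        raise RuntimeError(f"step {name!r} did not reach the expected state")
    log.info("step=%s verified", name)

# Simulated workflow state standing in for real UI checks.
state = {"saved": False}
run_step("save",
         action=lambda: state.update(saved=True),
         verify=lambda: state["saved"])
```

Raising on a failed verification is the escalation principle in miniature: the workflow halts at a known point with a log trail, instead of an agent confidently proceeding on top of a silent failure.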

Desktop AI agents are at roughly the same stage web automation was 10 years ago. The tools work, but reliability requires thoughtful design. The teams and individuals who invest in understanding the underlying perception methods, whether accessibility APIs, screenshots, or both, will build automations that actually hold up in daily use.

Desktop Automation That Uses Structured Data, Not Guessing

Fazm is an open source AI computer agent for macOS. Voice-first, accessibility API-driven, built for reliability over demos.

Try Fazm Free