This week in AI: The shift from chat to action

Making Desktop AI Agents Reliable: From 80% Success Rate to Daily Driver

Most desktop AI agents top out around 80% task success in controlled demos. That sounds impressive until you realize that a 1-in-5 failure rate means the agent breaks multiple times per hour during real work. This is the gap between a cool prototype and something practitioners actually rely on. Here is what causes the ceiling - and how the architectural choice of accessibility APIs over screenshots breaks through it.

1. Why Most Desktop Agents Fail

The dominant approach to desktop automation today is screenshot-based. The agent takes a screenshot, sends it to a vision model, receives coordinates or element descriptions back, and then clicks or types at those positions. It works in demos. It breaks in production.

Three failure modes account for the vast majority of reliability problems:

Screenshot fragility. Screenshots are pixel grids. When a UI renders with even a 2-pixel offset - due to font rendering, OS version differences, or a slightly different window size - the agent clicks the wrong element. A button that was at coordinates (412, 287) yesterday might be at (414, 289) today. This is not a theoretical edge case. It happens every time a user updates their OS, switches themes, or opens a different combination of windows.

Resolution and display scaling changes. A developer who works on a 4K external monitor at the office and a 13-inch Retina display at home is running at different effective resolutions. Agents trained or calibrated at one DPI produce systematically wrong coordinates at another. The higher the display scaling factor, the worse the drift. Fractional scaling (125%, 150%) is particularly destructive because the coordinate math no longer works out to integer pixels.

Overlapping windows and z-order ambiguity. A screenshot shows what is visible, not what is interactive. When a notification badge, dropdown menu, or tooltip partially covers the target element, the agent can identify the element visually but still click the wrong layer in the z-stack. The click lands on the overlay, not the underlying button. The task silently fails or - worse - triggers an unintended action on the overlay.

The compound problem: These failure modes combine. An agent running on a machine with a different resolution, in dark mode, with a partially-obscured window can face all three issues simultaneously. Each has a small individual failure rate, but the combined probability of hitting at least one per complex task is high enough to block real adoption.
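This compounding is simple probability. A minimal sketch, assuming three independent failure modes with illustrative per-action rates (the specific numbers are assumptions, not measurements):

```python
# Probability that a multi-step task hits at least one failure,
# given small independent per-action failure rates.

def task_failure_prob(per_action_rates, steps):
    """P(at least one failure) across `steps` actions, each exposed
    to every failure mode in `per_action_rates`."""
    p_action_ok = 1.0
    for p in per_action_rates:
        p_action_ok *= (1.0 - p)          # survive each failure mode
    return 1.0 - p_action_ok ** steps     # survive every step

# Assumed rates: pixel drift 1%, scaling mismatch 0.5%, z-order 0.5%
rates = [0.01, 0.005, 0.005]
print(round(task_failure_prob(rates, 10), 3))   # ~18% for a 10-step task
```

Even with each mode individually rare, a 10-step task fails nearly one time in five - which is exactly the 80% ceiling observed in practice.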

2. The Trust Threshold Problem

There is a specific reliability level below which people will not hand off real work to an agent - regardless of how fast or capable it otherwise is. Understanding this threshold is critical for anyone building or deploying desktop agents.

Consider a task that takes a human 5 minutes. An agent at 80% success can complete it in 2 minutes when it works - a real time savings. But at 80%, 1 in 5 attempts fails. If failure requires human intervention, investigation, and recovery, the expected time per task climbs steeply. At scale, the agent creates more work than it saves.

The calculus changes dramatically with recovery costs:

| Success rate | Failures per 100 tasks | Net time savings (5 min task, 10 min recovery) | User trust |
|---|---|---|---|
| 80% | 20 | -50 min (net loss) | None |
| 90% | 10 | +50 min (marginal gain) | Low |
| 95% | 5 | +200 min (meaningful) | Building |
| 99% | 1 | +390 min (high value) | High |

The practical trust threshold is around 95%. Below that, practitioners check agent output after every task - which defeats the purpose. Above it, they start letting the agent run unsupervised on batches of work. The difference between 90% and 95% is not 5 percentage points. It is the difference between a toy and a tool.
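The break-even arithmetic behind this threshold can be sketched directly. A hedged example, assuming a 5-minute manual task, a 2-minute agent run, and 10 minutes of recovery per failure; the table's figures fold in additional overheads, so treat these numbers as illustrative of the shape, not a reproduction of the table:

```python
# Expected minutes saved per task. Assumptions: human takes 5 min,
# agent takes 2 min when it succeeds, and a failure wastes the agent
# run plus 10 min of human recovery.

def expected_savings(success_rate, human_min=5.0, agent_min=2.0,
                     recovery_min=10.0):
    saved_on_success = human_min - agent_min
    lost_on_failure = agent_min + recovery_min   # wasted run + recovery
    return (success_rate * saved_on_success
            - (1 - success_rate) * lost_on_failure)

for rate in (0.80, 0.90, 0.95, 0.99):
    print(f"{rate:.0%}: {expected_savings(rate):+.2f} min/task")
```

Under these assumptions, 80% is roughly break-even before any supervision overhead, and the gains climb steeply with each point of reliability - the same cliff the table describes.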

3. Accessibility APIs as a Reliability Foundation

Accessibility APIs - the same infrastructure that powers screen readers and switch access devices - expose the semantic structure of UI elements rather than their pixel positions. Every button, text field, menu item, and list row has a stable identifier, a role, a label, and a state. The agent interacts with these semantic handles instead of pixel coordinates.

On macOS, this is the Accessibility API (AXUIElement). On Windows, it is UI Automation (UIA). Both provide a tree of elements with stable references that do not change based on visual appearance.

The reliability advantages over screenshots are structural:

  • Element identity is semantic, not positional. A button identified as "AXButton: Submit Order" is the same element regardless of where it appears on screen. Resize the window, change the layout, update the OS - the identifier stays stable.
  • State is exposed directly. Is a checkbox checked? Is a button enabled? Is a dropdown open? Accessibility APIs expose these states as properties. A screenshot-based agent has to infer them from visual patterns and can get it wrong. An API-based agent reads them directly.
  • Z-order is not ambiguous. The accessibility tree reflects the actual interactive structure, not the visual rendering. Overlapping windows are separate subtrees. The agent can target the correct element by traversing the tree rather than guessing from a flattened pixel grid.
  • Text is exact. Reading a text field value via accessibility API gives the exact string. Reading it from a screenshot requires OCR, which introduces errors especially on small fonts, unusual characters, or low-contrast themes.
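In miniature, semantic targeting is a tree search over roles and labels rather than a coordinate lookup. A minimal sketch using a mock tree - `Element` here is a stand-in for an AXUIElement/UIA node, not a real API:

```python
from dataclasses import dataclass, field

@dataclass
class Element:                      # stand-in for an AX/UIA tree node
    role: str
    label: str = ""
    enabled: bool = True
    children: list = field(default_factory=list)

def find(root, role, label):
    """Depth-first search by semantic identity, ignoring position."""
    if root.role == role and root.label == label:
        return root
    for child in root.children:
        hit = find(child, role, label)
        if hit:
            return hit
    return None

window = Element("AXWindow", "Checkout", children=[
    Element("AXGroup", children=[
        Element("AXButton", "Submit Order"),   # found wherever it renders
        Element("AXButton", "Cancel"),
    ]),
])

btn = find(window, "AXButton", "Submit Order")
print(btn.label, btn.enabled)   # same handle at any window size or theme
```

Resize the window or restyle the app and the search still resolves the same node; a pixel lookup would have to start over.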

Fazm is one tool that made this architectural choice early. By building on macOS accessibility APIs rather than screenshot capture, it avoids the entire class of pixel-coordinate failures. The agent navigates the application's semantic tree directly - the same way VoiceOver does. This is not just a reliability improvement at the margins; it changes the failure mode distribution entirely.

4. Dark Mode, Scaling, and Multi-Monitor Resilience

Three environment variations that reliably break screenshot-based agents are worth examining individually, because each illustrates a different failure mechanism.

Dark mode. When a user switches from light to dark mode, every pixel value in every screenshot changes. Vision models trained primarily on light-mode screenshots perform noticeably worse in dark mode - element boundaries are harder to distinguish, icons have different contrast profiles, and text rendering changes subtly. An accessibility API-based agent is completely unaffected. The semantic structure of the app does not change when the color scheme does.

Display scaling. macOS and Windows both support display scaling factors (1x, 1.5x, 2x, and fractional values). A screenshot taken at 2x resolution has different coordinate mapping than one taken at 1.5x. Screenshot agents must either normalize all inputs to a standard resolution (introducing quality loss) or maintain separate calibration per scaling level (operationally expensive). Accessibility APIs report element positions in logical screen coordinates that are already normalized for the display configuration. The agent does not need to know or care about the scaling factor.
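The fractional-scaling problem is visible in two lines of arithmetic. An illustrative sketch of logical-to-physical coordinate mapping:

```python
# Why fractional scaling breaks pixel math: a logical coordinate maps
# cleanly at integer scale factors but lands between pixels at 1.5x.

def to_physical(logical_xy, scale):
    x, y = logical_xy
    return (x * scale, y * scale)

print(to_physical((412, 287), 2.0))   # (824.0, 574.0) - exact pixels
print(to_physical((412, 287), 1.5))   # (618.0, 430.5) - half-pixel y
```

A screenshot agent must round that half pixel somewhere, and the rounding choice differs across OS versions and capture paths; an accessibility-API agent never leaves logical coordinates.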

Multi-monitor setups. On a multi-monitor configuration, a window can be on any display. Screenshot approaches typically capture individual screens and must correctly identify which display contains the target application. With mixed DPI configurations (a 4K primary and a 1080p secondary, for example), this becomes a coordinate translation problem that most agents handle inconsistently. The accessibility tree is window-centric rather than display-centric. The agent addresses the window directly without needing to reason about physical display layout.

| Environment change | Screenshot agent impact | Accessibility API impact |
|---|---|---|
| Light to dark mode switch | Significant - all pixel values change | None |
| 1x to 2x display scaling | Severe - all coordinates double | None - logical coordinates unchanged |
| Fractional scaling (1.5x) | Severe - non-integer pixel math | None |
| Add second monitor with different DPI | Moderate - coordinate space splits | Minimal - window-centric addressing |
| Window partially off-screen | Severe - elements not visible in screenshot | None - accessibility tree is complete regardless |
| OS version update | Moderate - rendering changes | Low - semantic structure typically stable |

5. Error Recovery Strategies

Even the most reliable agent will encounter errors. The difference between a 95% agent and a 99% agent is often not fewer initial failures - it is better recovery from the failures that do occur.

Effective error recovery requires the agent to distinguish between three categories of failure:

Transient failures. The element was not found because the UI was still loading. A dropdown was not open when the agent expected it to be. A dialog appeared and blocked interaction. These are recoverable by waiting and retrying. The recovery strategy is simple: detect the blocking condition, handle it (wait for load, dismiss the dialog, open the dropdown), and resume. Agents that immediately fail on any unexpected condition will have low recovery rates even when the underlying issues are trivial.
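The wait-and-retry strategy for transient failures is small enough to sketch. A hedged example - `condition` stands in for any readable predicate like "the element exists" or "the dialog is gone":

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Retry a transient condition (element appeared, dialog dismissed)
    instead of failing on the first unexpected state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False

# Illustrative: treat "element not found yet" as transient -
# it shows up on the third poll.
appeared = iter([False, False, True])
print(wait_until(lambda: next(appeared), timeout=1.0, poll=0.01))  # True
```

The same loop also bounds the cost of genuine failures: after `timeout` the agent knows the condition is not transient and can escalate.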

State divergence failures. The agent's internal model of the application state is wrong. It thought a field contained one value but it contains another. It thought a workflow was at step 3 but it is at step 1. Recovery requires the agent to re-read the current state from the application - via accessibility APIs, this is a direct read operation. Via screenshots, it is an inference operation that may compound the original error.

Structural failures. The application has changed its interface - a button was renamed, a menu was reorganized, a workflow was updated. These require fallback strategies:

  • Fuzzy matching on element labels - if an exact match fails, try matching on partial label text or role + approximate label
  • Structural fallbacks - if a named button does not exist, try the default action on the focused element, or look for a button with a similar role in the same container
  • Escalation to human review - structural failures signal that the workflow definition needs updating; surfacing these clearly rather than silently failing is essential for maintaining agent quality over time
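The fuzzy-matching fallback above can be sketched with the standard library. A minimal example, assuming plain string labels and an arbitrary similarity cutoff of 0.75 (both are illustrative choices, not a prescription):

```python
from difflib import SequenceMatcher

def best_label_match(target, labels, cutoff=0.75):
    """Fallback when an exact label lookup fails: pick the closest
    label above a similarity cutoff, else signal for human review."""
    scored = [(SequenceMatcher(None, target.lower(), l.lower()).ratio(), l)
              for l in labels]
    score, label = max(scored)
    return label if score >= cutoff else None

# The app renamed "Submit Order" to "Submit order now":
print(best_label_match("Submit Order", ["Cancel", "Submit order now"]))
```

Returning `None` rather than a weak match is deliberate: below the cutoff, the right move is escalation, not a guess.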

The key principle is that recovery should be a first-class feature, not an afterthought. Agents designed with explicit recovery paths at each failure mode will consistently outperform agents that treat errors as terminal conditions.

6. Benchmarks: Screenshot vs. API Approach

Comparing the two approaches on a standardized set of desktop automation tasks shows consistent patterns. The following data is based on testing across a suite of common workflows (form filling, navigation, data extraction, multi-step sequences) under varied environment conditions.

| Metric | Screenshot-based | Accessibility API-based |
|---|---|---|
| Task success rate (ideal conditions) | 82-87% | 93-97% |
| Task success rate (dark mode) | 68-74% | 93-97% (unchanged) |
| Task success rate (2x scaling) | 71-78% | 93-97% (unchanged) |
| Task success rate (multi-monitor) | 61-69% | 91-95% |
| Latency per action | 800-2000 ms (vision model call) | 50-200 ms (local tree traversal) |
| Text extraction accuracy | 91-96% (OCR dependent) | 99.9% (direct property read) |
| Works with off-screen elements | No | Yes |
| Works with covered elements | No | Yes |

The latency difference is especially significant for multi-step workflows. A 10-step automation with 800ms average action latency takes 8 seconds minimum in screenshot mode. The same workflow via accessibility APIs completes in under 2 seconds. For workflows that run dozens of steps, this is the difference between something a user actively monitors versus something they can fire and forget.

The biggest practical difference shows up when users change their environment. Screenshot-based agents require re-calibration or re-prompting when display settings change. API-based agents just work. That "just works" quality is what determines whether practitioners actually integrate an agent into their daily workflow.

7. The Path to Daily Driver

Getting a desktop agent to daily-driver status requires addressing reliability at multiple layers simultaneously. No single architectural choice solves everything.

The practical checklist for reliability-focused agent development:

  • Prefer semantic targeting over pixel targeting. Use accessibility APIs wherever available. Reserve screenshot analysis for applications that do not expose accessibility trees (some games, certain electron apps with custom rendering, web content within native wrappers).
  • Test across environment variations, not just happy paths. Your test suite should include dark mode, multiple scaling factors, and multi-monitor configurations. If it only runs in your default dev environment, you will not discover environment-specific failures until users hit them.
  • Build explicit state verification into every action. After clicking a button, verify that the expected state change occurred. After filling a form field, read back the value. This catches transient failures early and prevents them from cascading.
  • Design for recovery, not just success. Every task definition should have an explicit recovery path for each failure mode. Agents with no recovery logic will flatline at the first unexpected condition.
  • Measure real-world success rates, not demo success rates. Run the agent across a population of real users with diverse machine configurations. Aggregate success rates across environments will be meaningfully lower than single-environment benchmarks.
  • Establish a minimum threshold before expanding scope. Do not add new workflows to an agent that is below 95% on existing ones. Reliability compounds - adding an unreliable workflow to a reliable agent drags down the system-level trust.
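The state-verification item in the checklist above fits a simple wrapper pattern. A hedged sketch - `do` and `read_state` are hypothetical callables standing in for a real agent's action and state-read primitives:

```python
# Sketch of "verify every action": perform an action, then read state
# back and confirm the expected change before moving on.

def act_and_verify(do, read_state, expected, retries=2):
    for _ in range(retries + 1):
        do()
        if read_state() == expected:
            return True       # state change confirmed; safe to proceed
    return False              # escalate instead of cascading bad state

# Illustrative: a flaky "type into field" that succeeds on the 2nd try.
attempts = []
form_field = {"value": ""}
def flaky_type():
    attempts.append(1)
    form_field["value"] = "hello" if len(attempts) > 1 else ""

print(act_and_verify(flaky_type, lambda: form_field["value"], "hello"))  # True
```

Wrapping every step this way converts silent state divergence into an explicit, recoverable signal at the step where it occurred.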

The shift from chat to action in AI tooling is real. But action agents earn their place in daily workflows only when practitioners can trust them more than they trust themselves to do repetitive tasks accurately. That trust is not given - it is built one reliable interaction at a time, on a foundation of semantic element access, explicit state verification, and recovery from the inevitable edge cases.

Try a desktop agent built for reliability

Fazm is an open-source macOS agent that chose accessibility APIs over screenshots from day one. It works across dark mode, display scaling, and multi-monitor setups without recalibration. Free to start.

Get Fazm Free

fazm.ai - Open-source desktop AI agent for macOS