AI Desktop Agent Edge Cases: Why Building Reliable Agents Is Harder Than It Looks

A recent discussion in the developer community captured a frustration many builders share: we can now generate code at the speed of thought, but we still cannot build desktop agents that fast. The bottleneck is not writing the code. It is handling the edge cases. Apps that do not expose their UI tree. Unexpected permission dialogs. System alerts that appear mid-workflow. Electron apps that render everything in a single webview. The gap between a demo that works on your machine and an agent that works reliably across thousands of different Mac setups is enormous. This guide covers the real challenges of building AI desktop agents, explains why domain expertise matters more than code generation speed, and describes the strategies that working agents use to handle the chaos of real desktop environments.

Works across 200+ Mac apps

Fazm handles complex desktop workflows across any Mac app, including the edge cases that break simpler agents: permission dialogs, apps with incomplete UI trees, and multi-window workflows.

fazm.ai

1. The Speed vs. Reliability Gap

Code generation has reached a point where an experienced developer working with an AI assistant can produce a functional prototype in hours instead of days. The scaffolding, the boilerplate, the initial implementation of business logic: all of this moves faster than ever. But when the goal is a desktop agent that interacts with real applications on a real operating system, the speed of code generation becomes almost irrelevant.

The core issue is that desktop environments are adversarial in ways that web environments are not. A web app controls its entire rendering surface. The browser is a relatively predictable sandbox. Desktop agents, by contrast, must interact with applications built by thousands of different developers, using different UI frameworks, following different conventions, and running in an environment where the OS itself can interrupt at any moment with a notification, a permission dialog, or a software update prompt.

The result is a pattern that many desktop agent builders have experienced: the first 80% of the agent works after a weekend of coding. The remaining 20% takes months, because it consists entirely of edge cases. Each edge case is individually small and solvable, but there are hundreds of them, and they interact with each other in unpredictable ways. This is the reliability gap, and it is the primary reason why there are far fewer working desktop agents than working web scrapers or API integrations.

2. When Apps Do Not Expose Their UI Tree

The accessibility API on macOS provides a structured tree of UI elements for every application. In theory, an agent can read this tree to understand what is on screen, find the right button, and click it programmatically. In practice, many applications have incomplete or broken accessibility trees.

Electron apps are a common source of problems. Apps like Slack, Discord, VS Code, and Notion are essentially web pages running inside Chromium. Their accessibility tree can be extremely deep (hundreds of nested divs), poorly labeled (elements identified as "group" or "generic" with no meaningful name), or structurally misleading (interactive elements buried inside non-interactive containers). An agent that works perfectly with native macOS apps like Mail or Finder can completely fail when pointed at an Electron app.
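The practical consequence is that an agent cannot simply look for a well-labeled element one level down; it has to search through layers of anonymous containers. The sketch below models this with plain dicts standing in for accessibility nodes (a real implementation would walk `AXUIElement` references); the node shape, role names, and `find_interactive` helper are illustrative assumptions, not any particular API.

```python
from collections import deque

# Hypothetical node shape: {"role": str, "label": str, "children": [...]}.
# Roles an agent typically treats as actionable targets.
INTERACTIVE_ROLES = {"button", "link", "textfield", "checkbox", "menuitem"}

def find_interactive(node, label_substring):
    """Breadth-first search that tunnels through unlabeled container
    roles ("group", "generic") and returns the first interactive
    element whose label contains the given text, or None."""
    queue = deque([node])
    while queue:
        current = queue.popleft()
        role = (current.get("role") or "").lower()
        label = current.get("label") or ""
        if role in INTERACTIVE_ROLES and label_substring.lower() in label.lower():
            return current
        queue.extend(current.get("children", []))
    return None

# A toy Electron-style tree: in practice the anonymous "group" and
# "generic" layers can be hundreds deep; three levels shown here.
tree = {
    "role": "group", "label": "",
    "children": [
        {"role": "generic", "label": "",
         "children": [
             {"role": "button", "label": "Send message", "children": []},
         ]},
    ],
}

hit = find_interactive(tree, "send")
print(hit["label"] if hit else "not found")  # Send message
```

The search deliberately ignores container roles entirely rather than trying to interpret them, which is usually the only workable strategy against deeply nested webview markup.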

Custom-rendered applications present an even harder challenge. Professional tools like CAD software, video editors, game engines, and music production apps often render their interfaces using OpenGL, Metal, or custom drawing frameworks. To the accessibility API, these apps may appear as a single opaque rectangle. There is no button to find, no menu to read, and no text field to type into, at least not through the standard accessibility interface.

Java and cross-platform apps (built with Qt, wxWidgets, or similar frameworks) fall somewhere in between. They usually expose some accessibility information, but it often comes with non-standard roles, missing labels, or inconsistent structure. An agent that relies on finding a "button" role might miss a Java Swing JButton that reports itself as a "push button", or a Qt widget that exposes an unfamiliar role name. These inconsistencies require per-framework handling that cannot be generated automatically.
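One common per-framework fix is a normalization table that maps the role strings different toolkits report onto a small canonical vocabulary. The aliases below are illustrative assumptions about what specific frameworks emit, not an exhaustive or verified catalog:

```python
# Hypothetical alias table: framework-specific role strings mapped
# onto the canonical roles the agent reasons about.
ROLE_ALIASES = {
    "push button": "button",    # e.g. Java Swing via an access bridge
    "axbutton": "button",       # native macOS accessibility role name
    "toggle button": "checkbox",
    "text": "textfield",
    "edit": "textfield",
}

def normalize_role(raw_role):
    """Map a framework-specific role string to a canonical role,
    falling back to the lowercased raw string for unknown roles."""
    key = raw_role.strip().lower()
    return ROLE_ALIASES.get(key, key)

print(normalize_role("Push Button"))  # button
```

The fallback path matters: an unknown role passes through unchanged rather than being discarded, so new frameworks degrade to "unrecognized but visible" instead of invisible.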

An agent built for real desktop edge cases

Fazm handles Electron apps, custom-rendered UIs, and permission dialogs across 200+ Mac apps. Free, open source, and built on native accessibility APIs.

Try Fazm Free

3. Unexpected Dialogs and System Interruptions

Even when an app has a perfect accessibility tree, the macOS environment introduces interruptions that can derail any automated workflow. The most common offender is the permission dialog. macOS aggressively gates access to the camera, microphone, screen recording, file system locations, automation permissions, and network access. When an agent triggers an action that requires a permission it does not yet have, macOS presents a system dialog that blocks the entire application until the user responds.

These dialogs are challenging for agents because they appear outside the application's own UI hierarchy. The agent might be monitoring the accessibility tree of Safari, but the permission dialog belongs to a system process (UserNotificationCenter or TCC). If the agent is not watching for system-level dialogs, it will hang indefinitely, waiting for a UI change that cannot happen until the user answers the system prompt.

Software update notifications, low battery warnings, incoming call alerts (from FaceTime or iPhone continuity), and Siri suggestions are all system interruptions that can appear at any time. Each one can obscure the interface the agent is trying to interact with, steal keyboard focus, or change the frontmost application. Handling these requires a monitoring layer that runs independently of the main workflow, watches for known system dialog patterns, and either dismisses them or pauses the workflow until they clear.
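A simple defensive layer against the indefinite-hang failure mode is to wrap every workflow step in a timeout, so a step silently blocked by a modal system dialog hands control to a recovery path instead of stalling forever. This is a minimal sketch: `action` and `on_blocked` are hypothetical callables the agent would supply, and a real recovery path would scan for system windows or alert the user.

```python
import threading

def with_timeout(action, seconds, on_blocked):
    """Run a workflow step in a worker thread. If it does not finish
    within `seconds` (e.g. because an OS permission dialog is blocking
    the app), invoke the recovery callback instead of hanging."""
    result = {}

    def runner():
        result["value"] = action()

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(seconds)
    if worker.is_alive():
        # The step is stuck, most likely behind a modal dialog.
        # Hand control to the recovery path rather than waiting.
        return on_blocked()
    return result.get("value")

print(with_timeout(lambda: 42, 1.0, lambda: "blocked"))  # 42
```

The timeout value is workflow-specific: too short and slow apps trigger false recoveries, too long and the user waits needlessly when a real dialog appears.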

Application-level dialogs add another layer of complexity. Save dialogs, unsaved changes warnings, authentication prompts (Keychain access, re-enter password), crash recovery dialogs, and license activation windows can all appear unexpectedly during a workflow. Unlike system dialogs, these are application-specific and highly variable. A "Do you want to save?" dialog in TextEdit looks and behaves differently from the same concept in Photoshop. Building a general-purpose handler for these requires cataloging patterns across dozens of applications, which is the kind of domain expertise that no amount of code generation speed can replace.

4. Why Domain Expertise Beats Code Generation Speed

The common assumption is that faster code generation means faster product development. For many categories of software, this is true. For desktop agents, it is misleading. The bottleneck is not writing code. It is knowing what code to write, which comes from deep familiarity with the operating system, its quirks, and the behavior of specific applications.

For example, knowing that macOS Sonoma changed how accessibility permissions are granted for automation, or that certain apps reset their window positions after a relaunch, or that Finder's column view has a different accessibility structure than list view: these are facts that emerge from months of testing and debugging, not from generating code faster. An AI coding assistant can help implement the solution once you know what the problem is, but discovering the problem in the first place requires hands-on experience with the platform.

This is why the most successful desktop agent projects tend to come from teams with deep operating system expertise, not teams with the best AI models. The model is a commodity; the edge case knowledge is the moat. A team that has spent months cataloging how 200 different Mac apps behave when automated has a massive advantage over a team that can generate a basic automation script in 10 minutes.

The pattern extends to error recovery. When something goes wrong during an automated workflow (and something always goes wrong), the agent needs to recover gracefully. What does "recovery" look like when an app has crashed? When a file save dialog is blocking the UI? When the app has entered a state that was not anticipated? Each of these scenarios requires a specific recovery strategy, and the quality of those strategies depends on the builder's understanding of how macOS and its applications actually behave in failure modes.

5. Strategies for Building Resilient Desktop Agents

Despite the challenges, teams are building desktop agents that work reliably. The strategies they use share common patterns. First, they implement multi-layer perception. Instead of relying solely on the accessibility tree or solely on screenshots, resilient agents use the accessibility tree as the primary perception layer and fall back to vision when the tree is incomplete. Some agents also monitor system logs, process lists, and window management APIs to maintain awareness of the broader system state.
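The fallback chain at the heart of multi-layer perception can be sketched in a few lines. Here `ax_lookup` and `vision_lookup` are hypothetical callables (an accessibility-tree query and a vision-model query) that each return screen coordinates or `None`; the layering logic, not the lookups themselves, is the point:

```python
def locate(target, ax_lookup, vision_lookup):
    """Layered perception sketch: consult the accessibility tree
    first, then fall back to vision when the tree yields nothing.
    Returns (coordinates, source_layer)."""
    hit = ax_lookup(target)
    if hit is not None:
        return hit, "accessibility"   # cheap, structured, preferred
    hit = vision_lookup(target)
    if hit is not None:
        return hit, "vision"          # slower, but works on opaque UIs
    return None, "not_found"
```

Recording which layer produced each hit is worth the extra return value: it tells you, per application, where the accessibility tree is failing and where vision is carrying the load.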

Second, they build comprehensive dialog handling. This means maintaining a database of known dialog patterns across applications and the OS, with strategies for each. Some teams implement this as a background monitor that continuously checks for new windows and classifies them as expected (part of the workflow), known interruption (dismiss or handle automatically), or unknown (pause and alert the user). Tools like Fazm, which aim to work across many Mac apps, invest heavily in this dialog catalog.
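The three-way classification described above reduces to a lookup against two catalogs: titles expected by the current workflow, and known interruption patterns with a handling action each. The patterns and actions below are illustrative assumptions; a production catalog would match on more than window titles.

```python
def classify_window(title, expected_titles, known_interruptions):
    """Classify a newly appeared window as part of the workflow, a
    known interruption (returning its handling action), or unknown
    (pause and alert the user)."""
    lowered = title.lower()
    if any(t.lower() in lowered for t in expected_titles):
        return "expected"
    for pattern, action in known_interruptions.items():
        if pattern in lowered:
            return action  # e.g. "dismiss" or "handle"
    return "pause_and_alert"

# Hypothetical catalog entries keyed by a title substring.
KNOWN = {"software update": "dismiss", "would like to access": "handle"}

print(classify_window("A Software Update is available", [], KNOWN))  # dismiss
```

The conservative default is what makes this safe: anything the catalog has never seen pauses the workflow rather than being clicked through blindly.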

Third, they implement state verification at every step. Instead of assuming an action succeeded, resilient agents verify the result after each interaction. Did the button click actually trigger the expected change? Is the expected screen now visible? Did a new dialog appear? This verification loop adds latency, but it catches failures early before they cascade into harder-to-diagnose problems later in the workflow.
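The act-then-verify loop can be sketched as a small retry wrapper. `action` performs the interaction and `verify` is a predicate that checks the expected UI change; both are hypothetical hooks, and the retry and polling parameters would be tuned per workflow:

```python
import time

def act_and_verify(action, verify, retries=3, delay=0.5):
    """Perform an interaction, then poll a verification predicate.
    Re-attempt the action a bounded number of times so failures
    surface at the step that caused them, not three steps later."""
    for _ in range(retries):
        action()
        deadline = time.monotonic() + delay
        while time.monotonic() < deadline:
            if verify():
                return True
            time.sleep(0.05)  # brief poll interval while the UI settles
    return False
```

Returning a plain success flag keeps the caller in charge of escalation: a failed verification might trigger the vision fallback, a dialog scan, or a pause for the user, depending on the step.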

Fourth, they design for graceful degradation. When an agent encounters a situation it cannot handle, the right behavior is not to crash or loop indefinitely. It is to pause, capture the current state (a screenshot, the accessibility tree, relevant logs), and either try an alternative approach or ask the user for help. The best agents treat each unhandled edge case as a learning opportunity, logging enough context to reproduce and fix the issue later.
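A minimal version of that capture step is a structured snapshot taken at the moment of failure. The capture hooks (`screenshot_fn`, `tree_fn`) and the field names are assumptions for illustration; the useful property is that the snapshot serializes cleanly for a bug report or a replay queue.

```python
import json
import time

def capture_failure_context(workflow_step, screenshot_fn, tree_fn):
    """On an unhandled edge case, freeze enough context to reproduce
    the failure later: when it happened, which step failed, a
    screenshot reference, and the accessibility tree at that moment."""
    return {
        "timestamp": time.time(),
        "step": workflow_step,
        "screenshot": screenshot_fn(),
        "accessibility_tree": tree_fn(),
        "status": "paused_for_review",
    }

snapshot = capture_failure_context(
    "click_save_button",
    lambda: "screenshots/fail_001.png",   # hypothetical capture hook
    lambda: {"role": "window", "children": []},
)
print(json.dumps(snapshot)[:60])
```

Logging the accessibility tree alongside the screenshot is the key choice: the screenshot shows what the user saw, while the tree shows what the agent saw, and edge-case bugs usually live in the gap between the two.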

Finally, they embrace incremental coverage. No desktop agent launches with support for every possible edge case. The practical approach is to start with a focused set of applications and workflows, handle those reliably, and expand coverage gradually. Each new application brings new edge cases, but over time the pattern library grows and new apps require less custom handling. This is fundamentally a domain expertise problem, not a code generation problem, and the teams that recognize this early build better agents faster.

A desktop agent built for the edge cases

Fazm is a free, open-source AI agent for macOS that handles the hard parts of desktop automation: incomplete UI trees, unexpected dialogs, and cross-app workflows. Built by a team with deep macOS expertise.

Try Fazm Free

Free to start. Fully open source. Runs locally on your Mac.