From 37% to 85% UI Automation Success Rate - What We Learned
When we first connected Fazm's UI automation to real macOS apps, the success rate was around 40%. Six out of ten actions failed. Clicks missed their target. Text ended up in the wrong field. Buttons got clicked before they were fully rendered.
It was humbling. It was also extremely useful data.
The Failure Taxonomy
After logging hundreds of failed actions with screenshots and accessibility tree snapshots, the failures clustered into four categories.
Coordinate Misalignment
macOS accessibility APIs report element frames as top-left origin rectangles: {x: 100, y: 200, width: 80, height: 30}. If you click at (100, 200), you are clicking the top-left corner of the element - which is often the border pixel, or the element just above it.
The fix is clicking the center point: (x + width/2, y + height/2). In this example, (140, 215). Switching from top-left to center-point clicking fixed roughly 15% of all failures in one change.
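As a minimal Swift sketch (`clickPoint` is a helper name of our choosing, not part of any API):

```swift
import Foundation

// Accessibility frames are top-left-origin rectangles. Clicking the
// origin hits the border pixel; the center is the safe target.
func clickPoint(for frame: CGRect) -> CGPoint {
    CGPoint(x: frame.midX, y: frame.midY)
}

// The frame from the example above: {x: 100, y: 200, width: 80, height: 30}
let target = clickPoint(for: CGRect(x: 100, y: 200, width: 80, height: 30))
// target is (140.0, 215.0)
```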
This is a beginner mistake but it is not obvious unless you are looking at frame data carefully. The accessibility tree gives you bounds; it does not tell you where to click.
Lazy-Loading Races
Modern apps load content progressively. The accessibility tree reflects what the app has rendered, but "in the tree" does not mean "interactive." A button can appear in the accessibility tree a few hundred milliseconds before it becomes clickable - before its event handler is attached or before a loading state resolves.
The agent would see the element, click it immediately, and hit a loading spinner or get no response. The element existed; it just was not ready.
The fix is a post-action verification loop. After clicking, re-read the accessibility tree and check whether the expected state change occurred. If the target still shows the same state, wait 200ms and retry. Cap at 3-4 retries before flagging as a failure.
```swift
// Click, then verify that the expected state change actually happened.
// Re-click and re-check up to four times before giving up.
func clickWithVerification(element: AXUIElement, expectedState: UIState) async throws {
    var attempts = 0
    while attempts < 4 {
        try performClick(on: element)
        // Give the app time to react before re-reading the tree.
        try await Task.sleep(nanoseconds: 200_000_000) // 200ms
        let currentState = try readAccessibilityState()
        if currentState == expectedState {
            return // Success: the click had its intended effect
        }
        attempts += 1
    }
    throw AutomationError.verificationFailed
}
```
Scroll Position Drift
The agent locates an element that requires scrolling to reach, scrolls to reveal it, then clicks - but the scroll animation is still in progress. The element's accessibility frame reports its final position, but the visual render is mid-animation. The click lands in the wrong place.
Two fixes. First, prefer AXScrollToVisible - asking the accessibility API to scroll the element into view, which completes the scroll before returning - over synthesizing scroll events manually. Second, if you must synthesize scroll events, wait for the animation to settle before clicking: poll the element's frame until it stops changing between reads, or add a fixed delay calibrated to your scroll distance.
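The settle check can be a small generic poll: read a value repeatedly until two consecutive reads agree. A sketch - the interval and retry budget are illustrative, and in practice `read` would wrap an accessibility frame query:

```swift
import Foundation

// Poll a value until two consecutive reads match, i.e. the animation
// has settled. Returns nil if it never stabilizes within the budget.
func waitUntilSettled<T: Equatable>(
    pollEvery interval: TimeInterval = 0.05,
    maxPolls: Int = 20,
    read: () -> T
) -> T? {
    var previous = read()
    for _ in 0..<maxPolls {
        Thread.sleep(forTimeInterval: interval)
        let current = read()
        if current == previous { return current } // two matching reads: settled
        previous = current
    }
    return nil
}
```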
Stale Tree References
The accessibility tree is a snapshot. Between reading the tree and acting on an element reference, the UI might have changed. In apps with aggressive state updates (every few hundred milliseconds), the element you captured a reference to no longer exists in the same form.
The fix is to not cache element references across operations. Re-read the relevant portion of the accessibility tree immediately before each action rather than holding references. Yes, this is slower. It is also correct.
For apps with very rapid state updates, add a tree stability check: read the relevant subtree twice, 100ms apart. If the structure is the same both times, proceed. If it changed, wait and retry.
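The double-read check can be expressed directly. A sketch, assuming a minimal role-only snapshot of the subtree (real code would populate it from the accessibility API):

```swift
import Foundation

// A minimal structural snapshot of an accessibility subtree: roles only,
// no frames, so position changes from animations don't count as churn.
struct TreeSnapshot: Equatable {
    let role: String
    let children: [TreeSnapshot]
}

// Read the subtree twice, 100ms apart. Identical structure both times
// means it is safe to act; otherwise the caller should wait and retry.
func subtreeIsStable(read: () -> TreeSnapshot,
                     gap: TimeInterval = 0.1) -> Bool {
    let first = read()
    Thread.sleep(forTimeInterval: gap)
    return read() == first
}
```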
The Single Biggest Improvement
The single most impactful change was adding post-action accessibility tree traversal. After every click, type, or scroll, Fazm re-reads the tree and compares the new state against the expected outcome.
This does three things:
- Detects failures immediately rather than letting errors compound through subsequent steps
- Provides retry signals with specific failure information (what state appeared versus what was expected)
- Builds a dataset of action outcomes that reveals which apps, which element types, and which workflows are most failure-prone
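The outcome log behind the third point can be as simple as one record per action. A sketch - the field names are our own, not Fazm's schema:

```swift
import Foundation

// One row of the action-outcome dataset. Aggregating these by app and
// element role is what surfaces the failure-prone workflows.
struct ActionOutcome: Codable {
    let app: String          // bundle identifier of the target app
    let elementRole: String  // e.g. "AXButton"
    let action: String       // "click", "type", "scroll"
    let attempts: Int        // retries the verification loop used
    let succeeded: Bool
    let timestamp: Date
}

func failureRate(_ outcomes: [ActionOutcome]) -> Double {
    guard !outcomes.isEmpty else { return 0 }
    let failures = outcomes.filter { !$0.succeeded }.count
    return Double(failures) / Double(outcomes.count)
}
```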
Implementing this pattern took us from around 40% to 85-90% success rate. It is not magic - it is just feedback. Actions without feedback are brittle. Actions with immediate verification and retry are robust.
The Remaining 10-15%
The failures that remain mostly involve apps with non-standard UI frameworks that do not expose clean accessibility trees. Electron apps vary widely in accessibility support - some expose rich trees, others expose almost nothing. Apps built with custom rendering engines (games, some creative tools) often expose no accessibility data at all.
For these apps, the fallback is screenshot-based interaction: take a screenshot, send it to a vision model to identify element locations, click at the identified coordinates. This works at a 60-70% success rate with more variance - worse than accessibility-tree-based automation, but better than nothing.
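Structurally, the fallback is a two-step pipeline. A sketch where `VisionLocator` stands in for whatever vision model you call - none of these names are real APIs:

```swift
import Foundation

// Stand-in for the vision model: given a natural-language description
// and a screenshot, return the pixel location of the element, if found.
protocol VisionLocator {
    func locate(_ description: String, in screenshot: Data) -> CGPoint?
}

enum FallbackError: Error { case elementNotFound }

// Screenshot -> vision model -> click at the identified coordinates.
func clickViaScreenshot(_ description: String,
                        screenshot: Data,
                        locator: VisionLocator,
                        click: (CGPoint) -> Void) throws {
    guard let point = locator.locate(description, in: screenshot) else {
        throw FallbackError.elementNotFound
    }
    click(point)
}
```

Because the locator is a protocol, the 60-70% vision path and the accessibility-tree path can share the same verification-and-retry loop downstream.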
MacPaw's research on parsing macOS app UI identifies the same issue: SwiftUI apps add an additional NSHostingView layer that some tools cannot navigate correctly. The hierarchy can have elements on a hidden layer that standard traversal misses. The fix is hit-testing (ask the accessibility API what element is at a specific coordinate) rather than relying solely on tree traversal.
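The hit-testing call is AXUIElementCopyElementAtPosition from ApplicationServices. A sketch of wrapping it - this requires macOS and accessibility permission, so treat it as illustrative:

```swift
import ApplicationServices

// Ask the accessibility API what element sits at a screen coordinate,
// instead of walking the tree. This reaches elements that standard
// traversal misses, such as those behind an NSHostingView layer.
func element(at point: CGPoint, in app: AXUIElement) -> AXUIElement? {
    var hit: AXUIElement?
    let result = AXUIElementCopyElementAtPosition(
        app, Float(point.x), Float(point.y), &hit
    )
    return result == .success ? hit : nil
}
```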
Lessons That Generalize
UI automation reliability is fundamentally about feedback loops. The difference between 40% and 85% is not smarter action selection - it is verifying that each action had the intended effect and retrying intelligently when it did not.
The specific failure modes (coordinate offset, lazy load, scroll drift, stale references) are macOS-specific. But the underlying pattern - "your action model has implicit assumptions that are often violated" - applies to any UI automation context.
Build in verification. Assume every action can fail for reasons outside your model. Treat failure as information rather than an error state.
- Accessibility API vs Screenshot Computer Control
- Avoid Fragile Automations With the Accessibility Tree
- AI Agent Self-Report Trap - Screenshot Verification
Fazm is an open source macOS AI agent, available on GitHub.