Why the Accessibility Tree Beats Screenshots for Desktop Automation: Lessons From Amazon Checkout
When most people think about AI controlling a computer, they picture a model staring at screenshots and clicking around like a human would. That works in demos. In production, it falls apart. We learned this the hard way while automating Amazon checkout flows.
The Screenshot Problem
Screenshots are expensive and fragile. Every frame needs to be processed by a vision model, consuming thousands of tokens. The model has to identify buttons by their visual appearance, which breaks when layouts shift, dark mode toggles, or resolution changes. On a checkout page with dozens of small buttons and text fields, the error rate climbs fast.
We were burning through tokens and still getting roughly 60% success on the checkout flow. The agent would confuse "Place your order" with "Add to cart." It would miss dropdown menus that had not loaded yet. It would click the wrong address field because two fields looked visually similar at the screenshot resolution.
The per-task cost was significant: a single Amazon checkout automation using screenshot-based vision ran to roughly 8,000-12,000 tokens. At that rate, the economics of running the automation in production simply do not work.
The Accessibility Tree Alternative
macOS exposes every UI element through the AXUIElement hierarchy - the accessibility tree. Every button, text field, checkbox, and label is represented as a structured node with properties: role, title, value, position, and enabled state. You do not need vision at all.
Instead of sending a 4,000-token screenshot, we send a structured tree that is typically 200-500 tokens for a complex page. The agent gets exact button labels, field values, and element states as text. "Place your order" is just a string property on a button node, not a pattern to recognize in pixels.
Here is what the accessibility tree output looks like for a typical checkout page:
AXGroup "Order Summary"
AXStaticText "Order Total: $47.99"
AXButton "Place your order" [enabled=true]
AXButton "Add to cart" [enabled=true]
AXGroup "Shipping Address"
AXTextField "Full name" value="John Smith"
AXTextField "Street address" value="123 Main St"
AXPopUpButton "State" value="California"
AXGroup "Payment"
AXStaticText "Visa ending in 4242"
AXCheckBox "Use a gift card" [checked=false]
Compare that to feeding a 4K screenshot to a vision model and asking "where is the place order button?" The text representation is unambiguous and requires no visual interpretation.
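To make the comparison concrete, here is a minimal sketch of how a list of extracted element dictionaries (the shape produced by the code in the next section) can be flattened into exactly this kind of compact prompt text. The serialize_for_prompt helper is illustrative, not part of any framework:

def serialize_for_prompt(elements: list[dict]) -> str:
    """Render extracted accessibility elements as compact text for an LLM prompt."""
    lines = []
    for el in elements:
        parts = [el["role"]]
        if el.get("title"):
            parts.append(f'"{el["title"]}"')
        if el.get("value") is not None:
            parts.append(f'value="{el["value"]}"')
        if not el.get("enabled", True):
            parts.append("[disabled]")
        lines.append(" ".join(parts))
    return "\n".join(lines)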
Extracting the Tree in Python
Using pyobjc to get the accessibility tree from any macOS application:
import AppKit
import ApplicationServices as AS
import Quartz


def get_interactive_elements(app_name: str) -> list[dict]:
    """Return all interactive elements (buttons, fields, checkboxes) from an app."""
    workspace = AppKit.NSWorkspace.sharedWorkspace()
    apps = workspace.runningApplications()
    target = next(
        (a for a in apps
         if a.localizedName() and app_name.lower() in a.localizedName().lower()),
        None,
    )
    if not target:
        return []
    ax_app = AS.AXUIElementCreateApplication(target.processIdentifier())
    return _collect_interactive(ax_app)


def _ax_attr(element, attribute):
    """Read a single accessibility attribute; returns None if it is missing."""
    err, value = AS.AXUIElementCopyAttributeValue(element, attribute, None)
    return value if err == 0 else None


def _collect_interactive(element, depth=0, max_depth=8) -> list[dict]:
    if depth > max_depth:
        return []
    interactive_roles = {
        "AXButton", "AXTextField", "AXTextArea",
        "AXCheckBox", "AXRadioButton", "AXPopUpButton",
        "AXMenuItem", "AXLink", "AXComboBox",
    }
    results = []
    role = _ax_attr(element, "AXRole")
    if role and str(role) in interactive_roles:
        title = _ax_attr(element, "AXTitle")
        value = _ax_attr(element, "AXValue")
        enabled = _ax_attr(element, "AXEnabled")
        # AXPosition is an AXValue wrapping a CGPoint; unwrap it before reading x/y.
        position = None
        ax_position = _ax_attr(element, "AXPosition")
        if ax_position is not None:
            ok, point = AS.AXValueGetValue(ax_position, AS.kAXValueCGPointType, None)
            if ok:
                position = {"x": point.x, "y": point.y}
        results.append({
            "role": str(role),
            "title": str(title) if title else None,
            "value": str(value) if value else None,
            "enabled": bool(enabled) if enabled is not None else True,
            "position": position,
        })
    # Recurse into children
    children = _ax_attr(element, "AXChildren")
    if children:
        for child in children:
            results.extend(_collect_interactive(child, depth + 1, max_depth))
    return results


def click_element_by_title(app_name: str, button_title: str) -> bool:
    """Click a button by its exact title string."""
    elements = get_interactive_elements(app_name)
    target = next(
        (e for e in elements if e.get("title") == button_title and e.get("enabled")),
        None,
    )
    if not target or not target.get("position"):
        return False
    pos = (target["position"]["x"], target["position"]["y"])
    # Use Quartz to simulate a full click (mouse down + mouse up) at the element's position.
    for event_type in (Quartz.kCGEventLeftMouseDown, Quartz.kCGEventLeftMouseUp):
        event = Quartz.CGEventCreateMouseEvent(
            None, event_type, pos, Quartz.kCGMouseButtonLeft
        )
        Quartz.CGEventPost(Quartz.kCGHIDEventTap, event)
    return True
This retrieves every interactive element from a running app and clicks a button by its exact title, without any screenshot or vision model.
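A quick usage sketch, assuming the calling process has been granted Accessibility permission (System Settings > Privacy & Security > Accessibility) and that Safari is the app showing the checkout page:

if __name__ == "__main__":
    # List every interactive element the app currently exposes.
    for element in get_interactive_elements("Safari"):
        print(element["role"], element["title"], element["value"])

    # Click a button by its exact accessibility title.
    clicked = click_element_by_title("Safari", "Place your order")
    print("Clicked:", clicked)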
The Results After Switching
After switching from screenshots to the accessibility tree for the Amazon checkout flow:
- Success rate: from roughly 60% to over 90%
- Token cost per task: from 8,000-12,000 tokens down to 400-800 tokens
- Latency: the vision model inference step was removed entirely
The automation also became resilient to UI changes. When Amazon updated their checkout page layout, the screenshot-based agent broke immediately because buttons had moved. The accessibility tree-based agent kept working because it found elements by semantic label, not by visual position.
The 33% Coverage Gap
One important caveat: not all macOS applications expose full accessibility metadata. Research from MacPaw's Screen2AX project found that only 33% of macOS apps provide complete accessibility support. For the other 67%, the tree is either partial or missing entirely.
In practice, this breaks down into three categories:
Well-supported: Native macOS apps (Safari, Mail, Calendar, Finder, most productivity apps), Electron apps with accessibility enabled, web apps in a browser.
Partially supported: Many enterprise desktop apps, older Cocoa apps that were not built with accessibility in mind.
Unsupported: Some games, video players, custom-rendered UIs that draw everything with CoreGraphics rather than using AppKit controls.
For the partially supported and unsupported cases, screenshots and vision models are the fallback. The best approach combines both: use the accessibility tree for all structured interactions, and fall back to screenshot + vision only when the tree cannot find the element you need.
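As a sketch of that combined approach: try the tree first and escalate to vision only when the element cannot be found. The find_element_via_vision call below is a placeholder for whatever screenshot-based locator you already have, not a real API:

def find_element(app_name: str, label: str) -> dict | None:
    """Prefer the accessibility tree; escalate to vision only when it comes up empty."""
    elements = get_interactive_elements(app_name)
    match = next((e for e in elements if e.get("title") == label), None)
    if match is not None:
        return match
    # Tree is partial or missing for this app: fall back to screenshot + vision.
    # find_element_via_vision is a hypothetical vision-model-based locator.
    return find_element_via_vision(app_name, label)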
When to Use Each
The accessibility tree is the right default for any structured interaction - forms, buttons, menus, navigation in standard apps. It is faster, cheaper, and more reliable than vision.
Screenshots still matter for tasks that require visual understanding: reading a chart, comparing product images, understanding a layout that has no text labels. A production automation system uses both, choosing the right tool based on whether the information is structural (use the tree) or visual (use screenshots).
The 10x token cost of screenshots is not the whole story - the reliability difference matters more. An agent that runs at 90% success rate with the accessibility tree versus 60% with screenshots is not just cheaper to run. It is actually deployable. The 60% agent requires so much supervision that the economics do not work.
Fazm is an open source macOS AI agent, available on GitHub.