Why the Accessibility Tree Beats Screenshots for Desktop Automation: Lessons From Amazon Checkout

Fazm Team · 2 min read

When most people think about AI controlling a computer, they picture a model staring at screenshots and clicking around like a human would. That works in demos. In production, it falls apart. We learned this the hard way while automating Amazon checkout flows.

The Screenshot Problem

Screenshots are expensive and fragile. Every frame has to be processed by a vision model, consuming thousands of tokens. The model must identify buttons by their visual appearance, which breaks when a layout shifts, dark mode toggles, or the resolution changes. On a checkout page with dozens of small buttons and text fields, the error rate climbs fast.

We were burning through tokens and still landing at roughly a 60% success rate. The agent would confuse the "Place your order" and "Add to cart" buttons, miss dropdown menus, and click the wrong address field.

The Accessibility Tree Alternative

macOS exposes every UI element through the AXUIElement hierarchy - the accessibility tree. Every button, text field, checkbox, and label is represented as a structured node with properties like its role, title, value, and position. You do not need vision at all.
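To make the idea concrete, here is a minimal sketch of what a node in that tree looks like once it has been read out of the AX hierarchy. The `AXNode` class and `find_by_title` helper are illustrative stand-ins, not part of any real API; on macOS the same properties come from `AXUIElementCopyAttributeValue` calls.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AXNode:
    """Illustrative stand-in for one AXUIElement node."""
    role: str                       # e.g. "AXButton", "AXTextField"
    title: str = ""                 # accessible label, e.g. "Place your order"
    value: str = ""                 # current contents of a text field
    frame: tuple = (0, 0, 0, 0)     # (x, y, width, height) on screen
    children: list = field(default_factory=list)

def find_by_title(node: AXNode, role: str, title: str) -> Optional[AXNode]:
    """Depth-first search for an element by exact role and label."""
    if node.role == role and node.title == title:
        return node
    for child in node.children:
        hit = find_by_title(child, role, title)
        if hit:
            return hit
    return None

# A toy slice of a checkout window.
page = AXNode("AXWindow", "Checkout", children=[
    AXNode("AXButton", "Add to cart", frame=(900, 300, 120, 32)),
    AXNode("AXButton", "Place your order", frame=(900, 420, 160, 32)),
])

target = find_by_title(page, "AXButton", "Place your order")
print(target.frame)  # exact click coordinates, no pixel matching
```

Because the label is an exact string match on a node, "Place your order" versus "Add to cart" stops being a recognition problem entirely.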

Instead of sending a 4000-token screenshot, we send a structured tree that is typically 200 to 500 tokens. The agent gets exact button labels, field values, and element states. "Place your order" is just a string property on a button node, not a pattern to recognize in pixels.
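A sketch of how that compression can work, assuming the elements have already been pulled out of the AX tree into plain dicts (the field names and indexed-line format here are our illustration, not a standard): each element becomes one short line, and the index gives the agent a stable handle for naming a click target.

```python
def serialize(node, lines=None, depth=0):
    """Flatten a tree of {role, title, value, children} dicts
    into compact indexed lines suitable for an LLM prompt."""
    if lines is None:
        lines = []
    idx = len(lines)
    part = f"[{idx}] " + "  " * depth + node["role"]
    if node.get("title"):
        part += f' "{node["title"]}"'
    if node.get("value"):
        part += f" value={node['value']!r}"
    lines.append(part)
    for child in node.get("children", []):
        serialize(child, lines, depth + 1)
    return lines

page = {
    "role": "AXWindow", "title": "Checkout", "children": [
        {"role": "AXTextField", "title": "Address line 1", "children": []},
        {"role": "AXButton", "title": "Place your order", "children": []},
    ],
}

for line in serialize(page):
    print(line)
```

The whole page collapses into a few dozen short lines, which is where the 200-to-500-token figure comes from, and the agent answers with an index instead of a pixel coordinate.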

The Results

After switching to the accessibility tree, our success rate on the Amazon checkout flow jumped from roughly 60% to over 90%. Token costs dropped by an order of magnitude. And the automation became resilient to UI changes because it relies on semantic labels rather than visual layout.

When to Use Each

The accessibility tree is the right default for any structured interaction - forms, buttons, menus, navigation. Screenshots still matter for tasks that require visual understanding, like comparing product images or reading charts. The best approach combines both, using the accessibility tree for interaction and screenshots only when visual context is truly needed.
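One way to wire up that hybrid policy is a simple router that defaults to the accessibility tree and escalates to a screenshot only for genuinely visual tasks. The function name and keyword list below are purely illustrative assumptions:

```python
def choose_modality(task: str) -> str:
    """Illustrative routing: default to the accessibility tree,
    fall back to a screenshot only when the task is visual."""
    visual_keywords = ("image", "chart", "photo", "color", "looks like")
    if any(k in task.lower() for k in visual_keywords):
        return "screenshot"
    return "accessibility_tree"

print(choose_modality("Fill in the shipping address"))    # accessibility_tree
print(choose_modality("Compare the two product images"))  # screenshot
```

In practice the routing signal could also come from the model itself, but keeping the cheap path as the default is what drives the token savings.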

Fazm is an open-source macOS AI agent, available on GitHub.
