Your Bracket Is a Speculation Play - Accessibility APIs Over Screenshots

Fazm Team · 2 min read

The computer control space has two competing approaches: screenshot-based (take a picture, ask the model what to click) and accessibility API-based (read the structured UI tree, interact with named elements). The accuracy difference is not marginal: our task success rate went from 40% to 90% when we switched.

Why Screenshots Are a Speculation Play

Screenshot-based agents are literally speculating. They see pixels and guess what those pixels represent. A button that says "Submit" next to a button that says "Cancel" - the model has to identify both, determine their boundaries, calculate click coordinates, and hope the screenshot resolution was high enough to read the text correctly.

Every step is a speculation. The model speculates about element boundaries. It speculates about what is clickable. It speculates about coordinates. Each speculation has an error rate, and they compound.
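The compounding is easy to underestimate. As a back-of-the-envelope sketch (the per-step accuracies below are illustrative assumptions, not measured values), even four individually "pretty good" steps multiply into a mediocre end-to-end result:

```python
# Illustrative only: these per-step accuracies are assumptions,
# not measurements from any real agent.
steps = {
    "detect element boundaries": 0.90,
    "classify what is clickable": 0.90,
    "read label text from pixels": 0.95,
    "compute click coordinates": 0.90,
}

# Each speculation must succeed for the action to land,
# so the error rates compound multiplicatively.
success = 1.0
for step, accuracy in steps.items():
    success *= accuracy

print(f"end-to-end success: {success:.0%}")  # roughly 69%
```

Four steps at 90-95% each already land you near two-thirds reliability per action, and a multi-step task multiplies that again.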

Why Accessibility APIs Are Precise

Accessibility APIs give you the actual UI tree. Every button has a label, a role, a position, and an action. There is no guessing. You do not need to figure out where the "Submit" button is - the API tells you exactly where it is. You do not need to read text from pixels - the text is provided as a string.

The 40% to 90% jump comes from eliminating speculation entirely. The agent goes from "I think that cluster of pixels is a button labeled Submit" to "there is a button element with accessibilityLabel Submit at these exact coordinates."
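To make that concrete, here is a minimal model of what an accessibility query looks like to the agent. `AXElement` and `find_element` are hypothetical names for illustration, but the shape of the data (role, label, frame, actions) mirrors what real accessibility APIs expose:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical model of one node in the accessibility tree.
@dataclass
class AXElement:
    role: str                          # e.g. "button"
    label: str                         # provided as a string, never OCR'd
    frame: Tuple[int, int, int, int]   # (x, y, width, height) on screen
    actions: List[str]                 # e.g. ["press"]

def find_element(tree: List[AXElement], role: str, label: str) -> Optional[AXElement]:
    """Exact lookup by role and label - no pixel guessing involved."""
    for el in tree:
        if el.role == role and el.label == label:
            return el
    return None

tree = [
    AXElement("button", "Submit", (520, 410, 80, 28), ["press"]),
    AXElement("button", "Cancel", (620, 410, 80, 28), ["press"]),
]

submit = find_element(tree, "button", "Submit")
print(submit.frame)  # exact coordinates, straight from the tree
```

The lookup either matches exactly or fails loudly; there is no confidence score to second-guess.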

The Hybrid Approach

The best approach combines both. Use accessibility APIs as the primary interaction method and fall back to screenshot analysis only when the API does not expose an element (which happens with poorly-coded apps or custom UI components). Treat LLM vision as a fallback, not the primary input.
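The fallback logic itself is simple. In this sketch, `ax_lookup` and `vision_locate` are hypothetical stand-ins for the real accessibility and vision layers:

```python
# Sketch of the hybrid strategy; ax_lookup and vision_locate are
# hypothetical stand-ins for the accessibility and vision layers.
def locate(label, ax_lookup, vision_locate):
    """Prefer the structured tree; fall back to vision only on a miss."""
    element = ax_lookup(label)       # fast, exact, no speculation
    if element is not None:
        return element, "accessibility"
    # Poorly-coded apps and custom UI components may not expose the
    # element, so ask the vision model to estimate coordinates instead.
    return vision_locate(label), "vision"

# Toy usage: "Submit" is in the tree, "CustomKnob" is not.
tree = {"Submit": (560, 424)}
ax = tree.get
vision = lambda label: (300, 300)    # pretend the model guessed here

print(locate("Submit", ax, vision))      # ((560, 424), 'accessibility')
print(locate("CustomKnob", ax, vision))  # ((300, 300), 'vision')
```

The ordering matters: the cheap, exact path runs first, and the expensive, speculative path only handles the residue.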

This is why desktop agents have an advantage over browser-only tools. macOS accessibility APIs expose the UI of every running application, giving the agent structured access to controls that screenshot-based approaches can only guess at.

Fazm is an open-source macOS AI agent, available on GitHub.
