Loop math, with line numbers from one shipping macOS agent
Why an accessibility API beats a screenshot loop, measured per turn
The screenshot loop is not a wrong choice for desktop agents. It is a choice that pays a fixed cost on every iteration: in tokens, in latency, and in coordinates the model itself can hallucinate. The accessibility-tree path pays none of those costs and buys you a different set of tradeoffs. This piece walks the math one turn at a time, using Anthropic's own published numbers for the screenshot side and the actual CoreFoundation call from one open source desktop loop for the other.
Direct answer (verified 2026-05-13)
A screenshot loop pays per iteration: a 735-token tool definition, a 466-to-499-token computer-use system prompt, an image up to 1568 pixels on the long edge, plus model time spent reading pixels, plus coordinates the model can hallucinate. An accessibility-tree read does the equivalent perception step with a single AXUIElementCopyAttributeValue call that returns structured text describing role, label, value, and frame for the focused window and its elements. The model targets elements by identity, not by pixel coordinate. The per-iteration delta is on the order of 1 to 3 seconds and several hundred tokens, compounded across every step of a task. Numbers and limitations sourced from Anthropic's computer use tool documentation.
The per-turn budget Anthropic publishes
Anthropic's computer use docs list the budget directly: 735 tokens for the tool definition on Claude 4.x models, 466 to 499 tokens of computer-use-specific system prompt, and screenshots capped at 1568 pixels on the long edge and roughly 1.15 megapixels. These are not estimates from a third party. They are the numbers the provider itself publishes on its own pricing and reference pages for the tool you are using.
The tool definition cost is paid per request, which in the agent loop means per turn. The system prompt overhead is paid per request. The screenshot is sent every time the screen state could have changed (and the docs literally recommend taking one after every step to verify the outcome, because the model otherwise assumes its action worked). The recommended starting display size of 1024x768 keeps the long edge at 1024 and the image under the 1.15-megapixel cap.
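Put together, a rough back-of-envelope for one turn looks like the sketch below. It uses the (width × height) / 750 token estimate from Anthropic's vision pricing guidance; the exact count is model- and tokenizer-specific, so treat every figure as an approximation.

```swift
// Back-of-envelope per-turn input cost for the screenshot loop, using the
// (width x height) / 750 estimate from Anthropic's vision pricing guidance.
// Every figure here is an approximation; the real tokenizer is model-specific.
let toolDefinition   = 735                   // Claude 4.x computer tool definition, per request
let systemPrompt     = 499                   // upper end of the 466-499 token range
let screenshotTokens = (1_024 * 768) / 750   // ~1,048 tokens for the recommended 1024x768 display
let perTurnOverhead  = toolDefinition + systemPrompt + screenshotTokens  // ~2,282 tokens before any conversation history
let tenStepTask      = 10 * perTurnOverhead  // ~22,820 input tokens of pure overhead across ten turns
print(perTurnOverhead, tenStepTask)
```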
What the loop actually does, in Anthropic's own code
The reference agent loop is the spine of every screenshot-based computer-use agent. The model emits a tool_use block. Your code runs the action. If the model needs to see the new screen state, your code takes another screenshot and includes it as the next tool_result. The loop continues until the model stops emitting tool_use. Here is the structure of that loop, paraphrased from the quickstart on the official docs:
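(Condensed here into a Swift sketch for consistency with the rest of this piece, not the quickstart code verbatim; the ToolUse and ModelTurn types and the three closures stand in for whatever SDK and executor the host app actually uses.)

```swift
// A structural sketch of the loop, not Anthropic's quickstart code verbatim.
// ToolUse, ModelTurn, and the three closures are placeholders for the host
// app's own SDK types, action executor, and screen capture.
struct ToolUse { let id: String; let name: String; let input: [String: String] }
struct ModelTurn { let text: String; let toolUses: [ToolUse] }

func runAgentLoop(
    task: String,
    callModel: ([String]) -> ModelTurn,   // one API call: tool definitions + system prompt + any images, every time
    executeAction: (ToolUse) -> String,   // click, type, key press, scroll...
    takeScreenshot: () -> String          // base64 PNG; the per-turn image cost
) {
    var messages = [task]
    while true {
        let turn = callModel(messages)                          // pays the fixed per-request overhead
        messages.append(turn.text)
        guard let toolUse = turn.toolUses.first else { break }  // loop ends when the model stops emitting tool_use
        let outcome = executeAction(toolUse)
        let screenshot = takeScreenshot()                       // docs recommend one after every step to verify
        messages.append("tool_result \(toolUse.id): \(outcome), screenshot[\(screenshot.count) bytes]")
    }
}
```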
Every iteration through that while loop is one API call. Every API call pays the tool definition cost and the system prompt cost. Every turn that involves a screenshot adds image tokens on top. The provider's own limitations section notes, verbatim: “the current computer use latency for human-AI interactions may be too slow compared to regular human-directed computer actions. Focus on use cases where speed is not critical.” The fix Anthropic recommends in the same doc is taking another screenshot after every step to verify the action worked, which doubles the per-step screenshot count. The loop is honest about its cost. It just has a cost.
The single call that replaces the perception step
On macOS, the accessibility API is not a wrapper around OCR. It is the same channel VoiceOver uses, exposed by every well-behaved Cocoa app, and it speaks in elements (role, label, value, frame) instead of pixels. A read of the focused window is one CoreFoundation round trip. There is no image to encode, no network payload that scales with screen resolution, no coordinate the model has to compute. Below is the exact call shipped in one open source macOS agent. The reason it is worth quoting in full is the error handling, not the call itself, because that is where the real difference lives.
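Condensed here to its core shape rather than quoted verbatim (the shipped file wraps this in permission-state bookkeeping, and the function name and return strings below are illustrative):

```swift
import ApplicationServices

// A condensed sketch of the perception read described above: one call against
// the application element, with the AXError cases that matter spelled out.
func describeFocusedWindow(ofApplicationWithPID pid: pid_t) -> String {
    let app = AXUIElementCreateApplication(pid)
    var value: CFTypeRef?
    let error = AXUIElementCopyAttributeValue(app, kAXFocusedWindowAttribute as CFString, &value)

    switch error {
    case .success:
        return "focused window element: \(String(describing: value))"  // a live AXUIElement, not pixels
    case .noValue:
        return "app has no focused window right now"                   // not a permission problem
    case .apiDisabled:
        return "accessibility is disabled system-wide"                 // user-level fix, not app-level
    case .cannotComplete:
        return "call could not complete"                               // e.g. a stale TCC grant; re-prompt or cross-check
    case .notImplemented, .attributeUnsupported:
        return "app does not implement this attribute"                 // app-side gap; permission may be fine
    default:
        return "unexpected AXError (\(error.rawValue))"
    }
}
```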
The same call that returns the AX tree returns the AXError code that disambiguates five distinct permission states. The screenshot path has one undifferentiated failure mode: the bitmap looks wrong, or the action missed. The AX path returns named outcomes (.success, .noValue, .apiDisabled, .cannotComplete, plus .notImplemented and .attributeUnsupported for apps without AX support) that map to actionable next steps. A stuck TCC cache on macOS 26 Tahoe (a real bug the file's comment at line 538 points to) is detectable here. It is not detectable from a screenshot.
The honest counterargument
The accessibility-tree path is not free. It depends on the apps you care about exposing useful AX. Native Cocoa apps (Finder, Safari, Mail, Calendar, Preview, Pages, Numbers) are excellent. AX-aware Electron apps (VS Code, Linear) are good. Typical Electron (Slack, Discord, Notion) is thin and you will hit gaps. Canvas surfaces (Figma, Whimsical, custom-drawn games) expose almost nothing structural and the screenshot path is the only honest option.
The right framing is not “AX wins, screenshots lose.” The right framing is “AX is the default substrate when it works, screenshots are the deliberate fallback when it does not.” Fazm ships both: the macos-use MCP server exposes the AX tree as the primary perception channel, and capture_screenshot exists as a tool the model can request explicitly when the tree returns nothing useful. Most desktop agents invert that: screenshots are the default and the AX tree is at best a hint. The cost difference shows up in the loop.
What this looks like to a user
Take a 10-step task that touches two apps. With the screenshot loop: each step waits on an image upload of up to 1568 pixels on the long edge, vision tokens get processed, the model emits coordinates that may or may not land on the right pixel, and the docs recommend a verification screenshot after each step. The user watches a cursor pause for one to three seconds per click. With the AX-tree loop: each step sends a few hundred tokens of structured text, the model emits an element identifier, and the click goes through AXUIElementPerformAction without any coordinate. The user watches the cursor move at something close to normal-software speed. The model never has to guess where a button is in screen space, because the AX tree already told it.
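A minimal sketch of that identity-based click, assuming the agent has already read the tree and is targeting a button by role and title (for brevity this only scans a window's direct children; a real implementation walks the tree recursively):

```swift
import ApplicationServices

// The "click" is one call against the element reference the AX tree handed back.
func pressButton(titled title: String, in window: AXUIElement) -> Bool {
    var childrenRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(window, kAXChildrenAttribute as CFString, &childrenRef) == .success,
          let children = childrenRef as? [AXUIElement] else { return false }

    for child in children {
        var roleRef: CFTypeRef?
        var titleRef: CFTypeRef?
        AXUIElementCopyAttributeValue(child, kAXRoleAttribute as CFString, &roleRef)
        AXUIElementCopyAttributeValue(child, kAXTitleAttribute as CFString, &titleRef)
        if roleRef as? String == kAXButtonRole, titleRef as? String == title {
            // No coordinate anywhere: the element reference itself is the target.
            return AXUIElementPerformAction(child, kAXPressAction as CFString) == .success
        }
    }
    return false
}
```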
The deeper effect of skipping the screenshot loop is that the loop itself shrinks. Fewer image tokens means cheaper context windows means more room for tool schemas and conversation history. A 13B-class local MLX model becomes viable when the per-turn input is 200 to 400 tokens of AX tree instead of a megapixel image. The substrate choice is not just a latency question. It is what makes local desktop agents on Apple Silicon practical at all.
“The thing that surprised me about Fazm is how fast the loop feels. It does not pause to think about pixels. It just clicks the thing, because it already knows what the thing is called.”
Skip the screenshot loop on your own Mac
Fifteen minutes, live, with the team that builds Fazm. Bring a workflow you wish your screenshot agent did not stutter on.
Frequently asked questions
Does the screenshot really get sent on every single turn?
Yes. Anthropic's docs describe the loop as: model emits a tool_use, your app executes the action, your app returns a tool_result that the model needs to see, the model emits the next tool_use. The agent loop only ends when the model stops emitting tool_use. For visual tasks that means a screenshot every time the screen state could have changed, because the model has no other channel to see what the click actually did. Even with the new computer_20251124 zoom action, the zoom returns a region of a screenshot, not a structured tree.
Is the 735 token figure for the tool definition or for each call?
It is the tool definition cost, paid once per request as part of the system context (Claude 4.x: 735 tokens per tool definition, plus 466-499 tokens of computer-use-specific system prompt). It is not paid per click, but it is paid on every API call inside the loop, and the loop typically makes one API call per action. So for a 12-step task you pay it roughly 12 times. The image tokens on top of that are the variable cost.
How big is the image, in tokens, per turn?
Anthropic constrains computer-use images to 1568 pixels on the long edge and about 1.15 megapixels total. A 1024x768 default display gets sent as is (about 0.79 megapixels). The exact token count depends on the model's image tokenizer, but Anthropic's vision pricing guidance puts the floor around 1,000 tokens per screenshot, and the count rises with resolution. Multiply by the number of turns and you have the full visual cost of the loop. None of that cost exists when the model reads structured text from the accessibility tree.
What about coordinate accuracy? Vision models are good now.
Anthropic's own limitations section is the cleanest source on this: 'Claude may make mistakes or hallucinate when outputting specific coordinates while generating actions.' That is from the official computer use tool documentation, not a benchmark someone ran. The mitigation Anthropic suggests is prompting Claude to take a screenshot after every step and verify the outcome, which doubles the screenshot count. An accessibility-tree click targets an element by role and identifier, not by pixel coordinates, so there is no coordinate to hallucinate.
Does Fazm always avoid screenshots, or does it have a fallback?
Fazm exposes capture_screenshot as a tool the model can call deliberately when the accessibility tree returns nothing useful (canvas surfaces, Electron apps with partial AX support, custom-drawn UIs). The default screen-state representation is the AX tree, served through the bundled macos-use MCP server. Screenshots are a fallback the model asks for, not the substrate the whole loop is built on. That inversion is the whole point.
Why does AppState.swift call AXUIElementCopyAttributeValue four times against four error cases?
Because the macOS accessibility permission can be in five distinct states, only two of which are visible in System Settings: granted-and-working, granted-but-stale-TCC-cache (a known macOS 26 Tahoe bug), denied, never-asked, and apiDisabled (system-wide AX off). The same call that fetches kAXFocusedWindowAttribute also returns the AXError code that disambiguates these states. Lines 488-512 of AppState.swift handle .success, .noValue, .notImplemented, .attributeUnsupported, .apiDisabled, and .cannotComplete with different paths, including a Finder cross-check at lines 517-534 because some apps (Qt, OpenGL, Python-based like PyMOL) just don't implement AX and a single failed call is not enough to declare the permission broken. A screenshot pipeline cannot distinguish 'permission is broken' from 'app has no AX support' because it never asks the question.
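The shape of that cross-check, sketched rather than quoted (the function name and the attribute probed are illustrative; the shipped file does more bookkeeping around it):

```swift
import AppKit
import ApplicationServices

// If the frontmost app fails the AX read, probe Finder before declaring the
// permission broken, because Finder always implements the AX API.
func accessibilityLooksBroken(afterFrontmostAppFailedWith error: AXError) -> Bool {
    guard error == .cannotComplete || error == .apiDisabled else {
        return false   // .notImplemented / .attributeUnsupported are app-side gaps, not permission failures
    }
    guard let finder = NSRunningApplication
        .runningApplications(withBundleIdentifier: "com.apple.finder").first else {
        return true    // nothing to cross-check against; surface the permission flow
    }
    let finderElement = AXUIElementCreateApplication(finder.processIdentifier)
    var windows: CFTypeRef?
    let finderError = AXUIElementCopyAttributeValue(finderElement, kAXWindowsAttribute as CFString, &windows)
    // Finder failing too means the grant (or a stale TCC cache) is the problem;
    // Finder succeeding means the original app simply does not expose AX.
    return finderError != .success && finderError != .noValue
}
```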
Does the accessibility API work on non-native apps like Slack, Discord, VS Code, Notion?
Variably. Slack and Discord expose enough AX for message reading and clicking the obvious controls but miss thread structure and emoji pickers. VS Code is unusually good for an Electron app because it ships custom AX nodes. Notion is weak: the editor surface is a contenteditable mess and the AX tree underdescribes block boundaries. The honest framing is: AX is excellent for native Cocoa apps (Finder, Mail, Safari, Preview, Calendar), good-enough for AX-aware Electron (VS Code, Linear), thin for typical Electron (Slack, Discord, Notion), and absent on canvas surfaces (Figma, Whimsical, custom maps). That is also why Fazm keeps capture_screenshot as a tool the model can request when the tree is thin, instead of removing it entirely.
What's the practical speedup, end to end, for a multi-step task?
It depends on the model and the steps, but the loop math is straightforward. For each iteration, the screenshot path pays: image capture (50-200ms), image encoding to base64 PNG (20-50ms), API round trip with a large image payload (network bound, typically 1-3s on a residential connection), model vision pass over ~1000+ image tokens (model bound), action emission, action execution. The AX path pays: one CoreFoundation call (single-digit milliseconds), a serialized tree typically under 400 tokens of structured text, API round trip with a small payload, model text pass over the tree (cheaper than vision tokens of equivalent information density), action emission, action execution. The per-iteration delta is in the range of 1-3 seconds, multiplied by the number of iterations. On a 10-step task that compounds.
Adjacent reading
Other notes on the same substrate question, looked at from different angles.
Accessibility API vs screenshot agents: the audit trail gap
Same substrate question, a different cost: machine-readable logs for compliance. Screenshots are images, AX events are structured rows.
Desktop AI agents accessibility APIs vs screenshots in the Cowork layer
Why AX trees on macOS and UI Automation on Windows give stable element references that survive retina, dark mode, and minor UI tweaks, where screenshot pipelines flake.
Local MLX model for desktop loops: the one settings field
Once your screen state is 200-400 tokens of AX tree instead of a 1568-px image, a 13B-class local MLX model becomes viable for multi-step desktop work. Same wiring, different endpoint.