Claude Cowork and Why Desktop Agents Need Accessibility APIs, Not Screenshots
Anthropic's 2026 workflow stack splits AI usage into five layers: Chat for ad-hoc Q&A, Code for software work, Projects for persistent context, Skills for reusable capabilities, and Cowork for AI that operates alongside you on your computer. The Cowork layer has the most physical contact with reality: it is the one that actually drives apps, fills forms, sends messages, opens windows, and clicks Save. Whether it works reliably depends almost entirely on a single architectural choice: does the agent read the operating system's accessibility tree, or does it pipe screenshots through a vision model and click on pixel coordinates? This guide makes the case that AX trees win on speed, stability, retina handling, dark mode, and auditability, and sketches what a Cowork-style agent looks like when accessibility APIs are the default input.
“Fazm is a Cowork-style desktop agent for macOS that reads the system accessibility tree by default, with vision as fallback. Free, open source, runs locally.”
fazm.ai
1. The Cowork Layer in Anthropic's Stack
The cleanest way to read Anthropic's product surface in 2026 is as a stack of five workflow layers. Chat is conversational use, mostly throwaway. Code is the developer surface, where Claude edits files and runs commands. Projects is a persistent workspace that holds context, files, and prompts across sessions. Skills are packaged capabilities that can be invoked across products. Cowork is the most physical layer: AI that actually operates a computer alongside the user, watching what happens on screen and taking actions on real apps.
The other four layers live mostly inside text. Chat moves characters around. Code edits files and runs scripts in sandboxes. Projects holds documents. Skills route function calls. Cowork is the only layer that has to read a real screen and press real buttons in software written by other people. That difference matters because the failure modes are different. A chat that loses context recovers in a turn. A Cowork agent that misclicks the wrong button can send the wrong email, delete the wrong row, or charge the wrong card.
The natural question for the Cowork layer is: how does the agent see the screen? Two architectures dominate. Option one: capture a screenshot and run it through a vision model that emits click coordinates. Option two: read the operating system's accessibility tree, walk the structured nodes, and call actions by reference. Both produce demos. Only one produces an agent that holds up over a long workday on real apps.
2. AX Tree Fundamentals
Every modern operating system exposes an accessibility tree. On macOS it is the AX API, served by AXUIElementRef and friends. On Windows it is UI Automation (UIA). On Linux it is AT-SPI. They exist because screen readers (VoiceOver, NVDA, Orca) need a structured representation of the UI in order to announce it to users who cannot see the screen. Without a tree, screen readers would be reduced to OCR, which is what they were in the 1990s and the reason that era of accessibility software was so unreliable.
The tree is structured. Every interactive element has a role (button, text field, menu, group), a label (the human-readable name), a value (the current contents or state), a position, and a set of supported actions (press, set value, focus, show menu). Walking the tree gives an agent a complete map of what the user can do, in a format that does not depend on pixels.
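To make that concrete, here is a minimal Swift sketch of such a walk over the focused window, using the macOS AX C API mentioned above. It assumes the process has been granted Accessibility permission in System Settings; error handling and depth limits are trimmed for brevity.

```swift
import ApplicationServices

// Fetch a single attribute from an AX node, or nil if unsupported.
func copyAttribute(_ element: AXUIElement, _ attribute: String) -> CFTypeRef? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, attribute as CFString, &value) == .success else {
        return nil
    }
    return value
}

// Recursively print role, label, and supported actions for every node.
func walk(_ element: AXUIElement, depth: Int = 0) {
    let role = copyAttribute(element, kAXRoleAttribute) as? String ?? "?"
    let title = copyAttribute(element, kAXTitleAttribute) as? String ?? ""

    var actionNames: CFArray?
    AXUIElementCopyActionNames(element, &actionNames)
    let actions = (actionNames as? [String]) ?? []

    print(String(repeating: "  ", count: depth) + "\(role) \"\(title)\" actions=\(actions)")

    if let children = copyAttribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        children.forEach { walk($0, depth: depth + 1) }
    }
}

// Start at the system-wide element and descend into the focused window.
let systemWide = AXUIElementCreateSystemWide()
if let app = copyAttribute(systemWide, kAXFocusedApplicationAttribute),
   let window = copyAttribute(app as! AXUIElement, kAXFocusedWindowAttribute) {
    walk(window as! AXUIElement)
}
```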
Crucially, the tree is stable. If the developer changes the color of a button, the tree node does not change. If the user switches to dark mode, the tree node does not change. If the user moves the window or zooms the display, the tree node does not change. The label, role, and action set persist across all of these. That stability is exactly what an agent needs to chain dozens of actions without flaking.
Try a Cowork-style agent that reads the accessibility tree
Fazm uses native macOS accessibility APIs by default, so it survives retina, dark mode, and UI tweaks that break vision agents. Free and open source.
3. Where Screenshot Pipelines Flake
Pure vision agents take a screenshot, send it to a multimodal model, and ask for click coordinates. The architecture is elegant in a slide deck. In production, a handful of recurring failure modes show up.
Retina and HiDPI scaling. macOS retina displays use point and pixel coordinates that differ by a factor of two or three. Many vision pipelines downscale screenshots to fit context windows, then upscale coordinates to click. A small alignment drift, often two or three points, is enough to land a click on the wrong element in a dense form. Users on Studio Display see this more than users on a stock MacBook Air.
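A rough sketch of where the drift comes from; the coordinates and the 0.64 downscale factor are invented for the example, but the two rescaling steps are the real mechanism.

```swift
import AppKit

// macOS click APIs take points; a full-resolution screenshot is in pixels,
// larger by backingScaleFactor (2.0 on most retina panels).
let scale = NSScreen.main?.backingScaleFactor ?? 2.0

// Suppose the screenshot was also downscaled by 0.64 to fit a context
// window, and the vision model emitted coordinates against that image.
let downscale: CGFloat = 0.64
let modelPoint = CGPoint(x: 1181, y: 657)

// Two rescaling steps stand between the model's output and a click in
// points; every round-trip through integer pixels can add a point of drift.
let clickPoint = CGPoint(
    x: modelPoint.x / downscale / scale,
    y: modelPoint.y / downscale / scale
)
```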
Dark mode and theming. Vision models trained heavily on light UIs stumble when text is on dark backgrounds, when accent colors are nonstandard, or when the user has set a high-contrast theme. The element is still there in the AX tree with the same role and label; the screenshot looks unfamiliar to the model.
UI tweaks between releases. App developers shift padding, change icon style, swap toolbar layouts. Every visual change is a potential regression for a vision agent. Accessibility node references survive these changes because the underlying widget identity does not move.
Latency and cost. Each screenshot is a large image token bill. Each vision call adds two to five seconds of round-trip time. On a workflow with thirty actions, that is one to two and a half minutes of pure perception overhead. AX tree walks complete in milliseconds and bill in text tokens, often a tenth of the cost.
4. What Screen-Reader-Style Input Looks Like
A screen-reader-style agent input is a serialized version of the AX tree, scoped to the focused window or the relevant region. Each node becomes a line of text the model can read. A typical entry might describe a button with its label, role, focus state, and available actions, plus a stable reference (an opaque identifier the agent can use to invoke an action without specifying coordinates).
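As a sketch of what that serialization could look like in Swift, reusing the copyAttribute helper from the walk above; the ref scheme and line format here are illustrative, not any particular agent's wire format:

```swift
// The agent keeps the live AXUIElement behind each opaque ref; the model
// only ever sees the ref string, never coordinates.
var refs: [String: AXUIElement] = [:]
var lines: [String] = []

func serialize(_ element: AXUIElement) {
    let ref = "e\(refs.count)"
    refs[ref] = element

    let role = copyAttribute(element, kAXRoleAttribute) as? String ?? "?"
    let title = copyAttribute(element, kAXTitleAttribute) as? String ?? ""
    var actionNames: CFArray?
    AXUIElementCopyActionNames(element, &actionNames)
    let actions = (actionNames as? [String]) ?? []

    lines.append("[\(ref)] \(role) \"\(title)\" actions=\(actions.joined(separator: ","))")

    if let children = copyAttribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        children.forEach(serialize)
    }
}

// A Save button might come out as:
// [e12] AXButton "Save" actions=AXPress
```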
The model receives the serialized tree, decides which element it wants, and emits an action that names the reference: press this button, set this text field to that value, choose this menu item. The agent runtime translates the action into a native AX call. No pixel math, no coordinate scaling, no retina tax.
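The action half of the loop, continuing the same sketch (AgentAction and the refs table are illustrative names, not a published API):

```swift
enum AgentAction {
    case press(ref: String)
    case setValue(ref: String, text: String)
}

// Resolve the opaque ref and make the native AX call. No pixel math anywhere.
func dispatch(_ action: AgentAction) {
    switch action {
    case .press(let ref):
        guard let element = refs[ref] else { return }  // stale ref: re-serialize
        AXUIElementPerformAction(element, kAXPressAction as CFString)
    case .setValue(let ref, let text):
        guard let element = refs[ref] else { return }
        AXUIElementSetAttributeValue(element, kAXValueAttribute as CFString, text as CFString)
    }
}

// "press e12" from the model becomes:
dispatch(.press(ref: "e12"))
```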
This is the same shape that screen readers have used for decades, just consumed by a model instead of a human. Apple, Microsoft, and the GNOME project have all spent years building this surface. The interesting realization in 2024-2026 was that the same surface, originally for users with disabilities, is also the cleanest input for AI agents on the desktop.
The other half of the picture is graceful fallback. Some apps ship with poor accessibility coverage: custom-painted controls that bypass native widgets, Electron apps with broken AX mappings, and games. A well-built Cowork-style agent uses the tree where it exists and falls back to vision for the holes, instead of running everything through pixels by default. This hybrid posture matters because it keeps the cheap, reliable path active most of the time.
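Built on the serializer above, the decision point might look like the following; the sparseness heuristic and the vision stub are placeholder assumptions, not any product's actual logic.

```swift
// Placeholder for the slow path: screenshot the window and ask a
// multimodal model to describe it. Deliberately left unimplemented here.
func describeViaVision(_ window: AXUIElement) -> String {
    return "(vision fallback output)"
}

func perceive(window: AXUIElement) -> String {
    refs.removeAll()
    lines.removeAll()
    serialize(window)

    // If the focused window yields almost nothing actionable, its AX
    // coverage is probably broken (custom-painted controls, some Electron
    // apps, games): fall back to the expensive vision path.
    let actionable = lines.filter { !$0.hasSuffix("actions=") }
    return actionable.count >= 3
        ? lines.joined(separator: "\n")   // cheap, stable default
        : describeViaVision(window)       // expensive fallback
}
```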
5. AX Trees vs Screenshots, Side by Side
| Dimension | Screenshot pipeline | AX tree input |
|---|---|---|
| Stable element references | No (coords only) | Yes (opaque AX refs) |
| Behavior on retina / HiDPI | Drift on downscale | Unaffected |
| Dark mode and theming | Accuracy drops | Unchanged |
| Latency per action | 2 to 5 seconds | Tens of milliseconds |
| Token cost per action | Image tokens (high) | Text tokens (low) |
| Auditability | Coordinates in logs | Element label and role |
| Coverage of legacy apps | Universal (any pixels) | High where AX is implemented |
The screenshot pipeline has one structural advantage: it requires no special permission and works on any pixels. That advantage matters for prototypes and for browser-only flows. For long-lived Cowork-style agents that run all day on a developer's or operator's desktop, the AX tree wins on every other axis that matters in practice.
The most resilient agents combine both. Tools like Fazm on macOS read the AX tree by default and reach for vision only when the tree is missing or incomplete. That posture keeps the common case fast and stable while still covering the edge cases that pure tree walks cannot reach.
6. Where Cowork-Style Agents Fit on This Axis
A Cowork-style agent is not the same as a chat agent that occasionally clicks something. It is a continuous companion that sits in the corner of the screen, watches what the user is doing, takes actions on the user's behalf, and hands work back and forth. That use pattern puts unusual stress on the input layer. The agent has to be cheap enough to run constantly, stable enough to survive a typical workday of UI churn, and auditable enough that the user trusts it with real credentials.
Vision pipelines fail all three criteria as the default input. They are too expensive to run continuously, too brittle to survive a day of theme tweaks and window resizes, and too hard to audit because the logs come back as click coordinates. AX tree input passes all three: it is cheap, stable, and produces logs that humans can actually read.
The implication for product builders is concrete. If you are shipping a Cowork-style agent on macOS or Windows, the architectural decision that matters most is making the accessibility tree the primary input and letting vision be a fallback. Get that ordering right and most of the operational problems take care of themselves. Get it wrong and you spend quarters chasing flakes that never resolve.
7. FAQ
What exactly is the Cowork layer?
It is the layer in Anthropic's 2026 workflow stack where the AI operates the user's computer alongside them, in real time, on real apps. Distinct from Chat (text), Code (file edits), Projects (persistent workspace), and Skills (packaged capabilities), Cowork is the layer that has to perceive a live UI and take actions on it.
Why not just use screenshots, given how good vision models are?
Vision models are excellent at recognizing things, but agent workflows reward stability and speed more than recognition. AX trees give stable references and millisecond response; screenshots give pixel coordinates and seconds of latency. For workflows that chain dozens of actions, the difference compounds.
Do AX trees cover every desktop app?
Most modern apps, yes. Coverage is uneven for older apps with custom-painted widgets, some Electron apps with broken accessibility mappings, and games. The fix is a hybrid agent that reads the tree when it is there and falls back to vision when it is not, instead of running everything through pixels.
What about privacy? Does the agent send my screen to a server?
With AX-tree input, the agent sends a structured text description of the focused UI to whichever model it is using. That is much smaller and less identifying than a full screenshot. Some agents (Fazm included) run with local models for the perception step, so even the AX content stays on the machine.
Is this just a macOS thing?
No. macOS, Windows, and Linux all expose accessibility trees (AX, UIA, AT-SPI respectively). The tradeoffs and advantages are similar across platforms. The implementation of any given Cowork-style agent will be platform-specific, but the architectural pattern (tree first, vision fallback) holds across all three.
Where does Fazm sit on this axis?
Fazm is one option among many for a Cowork-style agent on macOS. It reads the AX tree by default, falls back to vision when the tree is incomplete, runs locally, and is voice-first. It is open source and free to start. Other Cowork-flavored options include vision-first stacks like Anthropic's Computer Use and OpenAI's Operator, plus various RPA platforms layering AI on top of older automation engines.
Run a Cowork-style desktop agent built on accessibility APIs
Fazm is an open source AI desktop agent for macOS. AX tree by default, vision as fallback, runs locally, voice-first.
Free to start. Fully open source. Runs locally on your Mac.