The Cowork Layer for Desktop AI Agents: Why Accessibility APIs Beat Screenshots

Anthropic recently split their product surface into five layers: Chat, Code, Projects, Skills, and Cowork. The first four are familiar. Cowork is the new one, and it is the layer where an agent sits next to you on the same desktop, watches what you are doing, and acts on the same windows you are clicking. That layer has a substrate problem. The agent has to perceive a UI it did not design and act on widgets it does not own. There are two serious ways to do that today: drive a screenshot pipeline through a vision model, or read the operating system's accessibility tree. This guide makes the technical case for the second.

Fazm (fazm.ai) is built on macOS accessibility APIs, so it cooperates with whatever app is in front of you instead of guessing pixels. Free and open source.

1. What the cowork layer actually is

In the Chat / Code / Projects / Skills / Cowork breakdown, each layer answers a different question. Chat is "help me think". Code is "help me write". Projects is "hold context across turns". Skills is "package a competence so an agent can pick it up". Cowork is the layer where the agent stops being on the other side of a tab and starts being on the same desktop you are. It opens windows you can see. It moves the cursor you watch move. It saves files into folders you already had open in Finder.

That is a different engineering problem than chat. In chat, the agent owns the surface, so it gets to define every input and output. In cowork, the agent does not own the surface. It walks into someone else's app: Numbers, Photoshop, Logic Pro, Linear, a custom internal CRM, a 1998 banking client running through a terminal emulator. The agent has to figure out what is in front of the user, what is interactive, what state things are in, and what action will get to the goal. It has to do this while the user is also moving the mouse and hitting keys.

The substrate the cowork layer chooses for that perception task decides almost every other property of the system: latency, reliability, cost per action, what apps it can support, whether dark mode breaks it, whether a Retina display breaks it, whether a 10% font-size change in an enterprise design system breaks it. This is the choice that matters more than model choice and more than prompt design.

2. The two substrates a cowork agent can sit on

Substrate one is screenshots plus a vision model. Capture the screen. Send the bitmap to a multimodal model. Ask it where the "Send" button is, in pixel coordinates. Synthesize a CGEvent or equivalent click at those coordinates. Wait. Capture again. Read the new state. This is what Computer Use, Operator, and most early demo agents do. It is universally compatible because every desktop has pixels, and conceptually it mirrors how humans operate a computer.

Substrate two is the operating system's accessibility API. On macOS this is the Accessibility framework, exposed primarily as AXUIElement and the AX* attribute family from the ApplicationServices framework. On Windows it is UI Automation (UIA), the successor to MSAA. On Linux there is AT-SPI. On the web there is the platform accessibility tree the browser builds for assistive tech. These are not new APIs; VoiceOver, JAWS, NVDA, and Orca have driven them for two decades so that blind users could use the same software as everyone else.

The cowork layer needs the same thing assistive tech needs: a structured, semantic description of what is on screen, plus a way to perform actions on those elements without inventing pixel coordinates. The accessibility APIs were designed for exactly this. The fact that AI agents and screen readers want the same substrate is not a coincidence; it is the whole point.

3. AXUIElement, AX trees, and how macOS describes a window

On macOS, every running application built on AppKit, Catalyst, SwiftUI, or any framework that bridges to NSAccessibility (including Electron apps and Chromium-based browsers, via their accessibility trees) exposes its UI as a tree of AXUIElement nodes. You acquire the system-wide element with AXUIElementCreateSystemWide() or the element for an app with AXUIElementCreateApplication(pid). From there you walk the tree by reading the kAXChildrenAttribute of each node.
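
As a minimal sketch in Swift, assuming the process already has the Accessibility permission described in section 6, acquiring the frontmost app's element and reading its top-level children looks roughly like this:

    import AppKit
    import ApplicationServices

    // Get the AX element for the frontmost application and list its top-level
    // children (usually windows plus the menu bar). Requires Accessibility permission.
    guard let app = NSWorkspace.shared.frontmostApplication else { exit(1) }
    let appElement = AXUIElementCreateApplication(app.processIdentifier)

    var childrenRef: CFTypeRef?
    if AXUIElementCopyAttributeValue(appElement, kAXChildrenAttribute as CFString, &childrenRef) == .success,
       let children = childrenRef as? [AXUIElement] {
        print("\(app.localizedName ?? "app") exposes \(children.count) top-level AX children")
    }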

Each element carries a set of attributes. The interesting ones for an agent are kAXRoleAttribute (button, text field, group, web area, menu item, etc.), kAXSubroleAttribute (close button, secure text field, search field), kAXTitleAttribute and kAXDescriptionAttribute (the human-readable label), kAXValueAttribute (current contents of a text field, slider position, checkbox state), kAXEnabledAttribute, kAXFocusedAttribute, kAXSelectedAttribute, and the geometry pair kAXPositionAttribute and kAXSizeAttribute.
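
A sketch of reading those attributes off a single element; the copyAttr and summary helpers are illustrative names, but the kAX* constants are the real ones:

    import ApplicationServices

    // Copy one attribute off an element, or nil if it is missing or unreadable.
    func copyAttr(_ element: AXUIElement, _ attribute: String) -> CFTypeRef? {
        var value: CFTypeRef?
        let err = AXUIElementCopyAttributeValue(element, attribute as CFString, &value)
        return err == .success ? value : nil
    }

    // Condense the attributes an agent usually cares about into one line.
    func summary(of element: AXUIElement) -> String {
        let role    = copyAttr(element, kAXRoleAttribute) as? String ?? "?"
        let subrole = copyAttr(element, kAXSubroleAttribute) as? String
        let title   = copyAttr(element, kAXTitleAttribute) as? String
            ?? copyAttr(element, kAXDescriptionAttribute) as? String ?? ""
        let enabled = copyAttr(element, kAXEnabledAttribute) as? Bool ?? true
        let value   = copyAttr(element, kAXValueAttribute)
        var line = "\(role)\(subrole.map { "/" + $0 } ?? "") \"\(title)\""
        if let value { line += " value=\(value)" }
        if !enabled  { line += " [disabled]" }
        return line
    }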

For interaction, every element advertises a list of actions via AXUIElementCopyActionNames(). Common ones are kAXPressAction, kAXIncrementAction, kAXDecrementAction, kAXShowMenuAction, and kAXPickAction. Calling AXUIElementPerformAction(elem, kAXPressAction) is the programmatic equivalent of clicking that button. You can also write a value with AXUIElementSetAttributeValue(elem, kAXValueAttribute, str) to fill a text field directly, which is faster and more reliable than synthesizing keystrokes.
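
A sketch of the two write paths, pressing an element and setting a text field's value directly; sendButton and searchField stand in for handles already resolved from the tree:

    import ApplicationServices

    // Programmatic click: the AX equivalent of pressing the button.
    func press(_ element: AXUIElement) -> Bool {
        AXUIElementPerformAction(element, kAXPressAction as CFString) == .success
    }

    // Write a text field's contents in one call instead of synthesizing keystrokes.
    func setValue(_ element: AXUIElement, to text: String) -> Bool {
        AXUIElementSetAttributeValue(element, kAXValueAttribute as CFString, text as CFString) == .success
    }

    // Usage, with hypothetical handles:
    // _ = setValue(searchField, to: "quarterly report")
    // _ = press(sendButton)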

Notifications close the loop. Subscribe an AXObserver to the application element, register for events like kAXFocusedUIElementChangedNotification, kAXValueChangedNotification, or kAXWindowCreatedNotification, and the agent gets push-based updates when the UI changes. No polling loop, no screenshot diffing.
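
A sketch of the subscription side; the notification names are the real constants, while the callback body and run-loop wiring are kept to the minimum (real code passes a context object through the refcon pointer instead of printing):

    import ApplicationServices

    // The callback must be a C function pointer, so it cannot capture Swift state.
    let axCallback: AXObserverCallback = { _, _, notification, _ in
        print("AX event:", notification)
    }

    func observe(pid: pid_t) {
        var observerRef: AXObserver?
        guard AXObserverCreate(pid, axCallback, &observerRef) == .success,
              let observer = observerRef else { return }

        let appElement = AXUIElementCreateApplication(pid)
        for note in [kAXFocusedUIElementChangedNotification,
                     kAXValueChangedNotification,
                     kAXWindowCreatedNotification] {
            AXObserverAddNotification(observer, appElement, note as CFString, nil)
        }
        // Deliver notifications on this thread's run loop; no polling, no screenshot diffing.
        CFRunLoopAddSource(CFRunLoopGetCurrent(),
                           AXObserverGetRunLoopSource(observer),
                           .defaultMode)
    }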

The Windows side mirrors this. UIA exposes an IUIAutomationElement tree with control patterns (InvokePattern, ValuePattern, SelectionPattern, TextPattern) that describe what an element supports, and FindFirst with property conditions for lookup. The shape of the API is different but the model is the same: a structured tree with roles, names, values, and actions.

Try a cowork agent built on accessibility APIs

Fazm reads the AX tree, performs AXPress, sets values directly, and listens to AXObserver notifications. No screenshot loop. Free and open source.

4. Stable element references and what makes them stable

The single biggest engineering advantage of accessibility APIs for cowork is that you get a reference to an element that does not depend on where the element happens to be drawn. An AXUIElement handle points to the same logical widget across a dragged window, a resized sidebar, a dark mode toggle, a zoom level change, or a Retina vs non-Retina display move. The button object did not move; only its pixels did.

Pixel coordinates are stable for the duration of one screenshot and almost nothing else. A 4-pixel shift in the toolbar from a new app version, a 1.25x display scaling change, a font weight tweak in a design system rollout, or even the user widening a sidebar by 8 points is enough to invalidate every coordinate the model produced from the previous screenshot. Pixel coordinates have to be recomputed for every single action. Element references survive across actions and across many UI changes.

For scripting and replay this is decisive. A workflow that says "press the element with role AXButton, title Send, inside the window titled Inbox" will continue to work next month, after the user upgrades to dark mode, after the icon changes, after the developer renames the internal class name, and after the display switches between built-in and external. A workflow that says "click at (1284, 376)" will not survive any of those.

There are still cases where elements get replaced rather than updated, and the handle goes stale. A defensive cowork layer re-resolves elements by their semantic predicate (role + label + parent context) rather than caching raw handles forever, but the predicate itself is dramatically more stable than any pixel description of the same widget.
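
A sketch of that re-resolution step, reusing the copyAttr helper from earlier; the depth limit and the match rule (exact role plus title) are assumptions a real implementation would refine:

    import ApplicationServices

    // Re-resolve an element by semantic predicate instead of trusting a cached handle.
    func resolve(role: String, title: String,
                 under root: AXUIElement, depth: Int = 8) -> AXUIElement? {
        if depth < 0 { return nil }
        if copyAttr(root, kAXRoleAttribute) as? String == role,
           copyAttr(root, kAXTitleAttribute) as? String == title {
            return root
        }
        let children = copyAttr(root, kAXChildrenAttribute) as? [AXUIElement] ?? []
        for child in children {
            if let match = resolve(role: role, title: title, under: child, depth: depth - 1) {
                return match
            }
        }
        return nil
    }

    // e.g. resolve(role: "AXButton", title: "Send", under: inboxWindow)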

5. The specific failure modes of screenshot pipelines

Screenshot pipelines do not fail in one big way. They fail in a long tail of small ways that each look fixable in isolation and are exhausting in aggregate. Here are the ones that matter most for a cowork layer.

Retina and DPI scaling. macOS reports a logical point grid (e.g. 1440 x 900) while the underlying framebuffer is 2880 x 1800. A vision model that returns coordinates in image space has to be told which space to dispatch the click in, and mixed-DPI multi-monitor setups make this worse. AX positions are always in screen points; there is no conversion to do.

Dark mode and theme variants. The same button rendered light and dark is two different training distributions for the vision model. Custom themes, high contrast mode, and colorblind palettes compound this. AX does not care what color anything is.

Disabled and hidden state. A vision model can be tricked by a button that looks pressable but has kAXEnabledAttribute = false, or by an element that has already scrolled out of view but still shows up in the bitmap because of overlays or a stale frame. AX exposes enabled and visible state directly.

Latency stack-up. A screenshot loop is capture, encode, upload, multimodal inference, decode coordinates, dispatch click, wait for repaint, recapture. A single action commonly costs 2 to 5 seconds end to end. A twenty-step workflow loses a minute to perception alone. AX calls are local IPC measured in single-digit milliseconds.

Token cost. A 1440 x 900 screenshot at a typical detail setting consumes thousands of input tokens. Every step of a workflow re-pays that cost. An AX subtree serialized to text for the same window is usually a few hundred tokens and can be cached and diffed cheaply between actions.

User collisions. In a true cowork setting the user is also using the machine. They move the window. They switch desktops. They drop a Slack notification on top of the target. A screenshot taken at frame N and acted on at frame N+1 can hit a different element entirely. AX-driven actions target an element handle, so an obscuring overlay does not redirect the click; at worst the action fails cleanly because the element is no longer hittable, which the agent can detect and recover from.

Animation and transient state. A modal that is mid-fade-in has indeterminate pixels. AX reports the modal as present and interactive the moment it is in the tree, regardless of opacity.

6. Building a cowork layer in practice

A cowork layer built on accessibility APIs has roughly four pieces. First, a perception module that, on demand or on notification, walks the AX tree of the focused application and serializes it into a compact text form for the model. The useful fields are role, subrole, title or description, value, enabled, focused, and a stable id assembled from the path. You usually want to drop decorative groups and leaf text nodes that are children of a labeled element, because the label already carries that information.
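
A sketch of that serialization step, again reusing copyAttr; the filtering rule and the line format are illustrative choices, not the only reasonable ones:

    import ApplicationServices

    // One line per element for the model, plus the ref-to-handle map the action module needs.
    struct Snapshot {
        var lines: [String] = []
        var handles: [String: AXUIElement] = [:]
    }

    func snapshot(_ element: AXUIElement, into snap: inout Snapshot) {
        let role    = copyAttr(element, kAXRoleAttribute) as? String ?? "AXUnknown"
        let title   = copyAttr(element, kAXTitleAttribute) as? String
            ?? copyAttr(element, kAXDescriptionAttribute) as? String ?? ""
        let enabled = copyAttr(element, kAXEnabledAttribute) as? Bool ?? true

        // Skip purely decorative groups; keep anything that carries a label.
        if role != "AXGroup" || !title.isEmpty {
            let ref = "el_\(snap.handles.count)"
            snap.handles[ref] = element
            snap.lines.append("\(ref) \(role) \"\(title)\"" + (enabled ? "" : " disabled"))
        }
        for child in (copyAttr(element, kAXChildrenAttribute) as? [AXUIElement] ?? []) {
            snapshot(child, into: &snap)
        }
    }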

Second, an action module that takes a structured command from the model (for example, { ref: "el_27", action: "press" } or { ref: "el_44", action: "set_value", value: "hello" }) and dispatches it via AXUIElementPerformAction or AXUIElementSetAttributeValue. The module owns the mapping from the model's short refs to live AX handles, and it owns staleness detection: if a handle no longer resolves, re-snapshot and report.
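
A sketch of that dispatch path; the command enum and result cases are assumptions about shape, but the AX calls and the invalid-element error are the real mechanism behind staleness detection:

    import ApplicationServices

    enum AgentAction {
        case press(ref: String)
        case setValue(ref: String, value: String)
    }

    enum ActionResult { case ok, staleRef, axError(AXError) }

    // Resolve the model's short ref to a live handle and dispatch through AX.
    // A dead handle surfaces as an AX error (e.g. .invalidUIElement): re-snapshot and report.
    func dispatch(_ action: AgentAction, handles: [String: AXUIElement]) -> ActionResult {
        switch action {
        case .press(let ref):
            guard let element = handles[ref] else { return .staleRef }
            let err = AXUIElementPerformAction(element, kAXPressAction as CFString)
            return err == .success ? .ok : .axError(err)
        case .setValue(let ref, let value):
            guard let element = handles[ref] else { return .staleRef }
            let err = AXUIElementSetAttributeValue(element, kAXValueAttribute as CFString, value as CFString)
            return err == .success ? .ok : .axError(err)
        }
    }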

Third, an event module that subscribes to AX notifications on the focused app and on the system-wide element, so the agent sees focus changes, value changes, and window lifecycle in real time. This lets the agent confirm an action took effect without asking the model to compare two screenshots.

Fourth, a permission and trust boundary. macOS gates this whole stack behind the System Settings → Privacy & Security → Accessibility toggle. The user has to explicitly grant an app the right to read other apps' UIs and synthesize input. This is good. It is the same gate that protects users from rogue screen readers, and it gives the cowork layer a clear, revocable consent surface that pixel-based agents do not have a clean equivalent for.
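
Checking that gate from code is one call (a sketch; AXIsProcessTrustedWithOptions can additionally ask macOS to show the grant prompt):

    import ApplicationServices

    // Reports whether this process has been granted the Accessibility permission.
    if !AXIsProcessTrusted() {
        print("Grant access under System Settings → Privacy & Security → Accessibility, then relaunch.")
    }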

Fazm is one example of this architecture in the wild. It runs locally on macOS, asks for accessibility permission at first launch, walks the AX tree of whatever app the user is in, and performs press and value-set operations directly on AX handles instead of dispatching pixel clicks. The codebase is open source so you can read the perception and action modules end to end.

7. When you do still need vision

Accessibility APIs are not a complete answer on their own. There is a real long tail of surfaces where the AX tree is empty or misleading. Custom-rendered canvases (Figma, many games, some DAW plugin chrome) draw their UI as a single opaque view with no children. Older Carbon apps and some Java Swing apps expose very thin trees. Embedded web views sometimes report a single AXWebArea with no internal structure if the page disabled the accessibility tree.

A production cowork layer should treat vision as a fallback for those cases rather than the default. The decision rule is simple: if the AX subtree for the focused window has fewer interactive elements than expected for its visual complexity, fall back to a screenshot of just that window and ask the model to ground in pixels. This keeps the slow, expensive path off the hot loop while still letting the agent operate inside Figma when it has to.
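
A sketch of that decision rule; the role list and the threshold are illustrative assumptions, and copyAttr is the helper from section 3:

    import ApplicationServices

    let interactiveRoles: Set<String> = [
        "AXButton", "AXTextField", "AXTextArea", "AXCheckBox", "AXRadioButton",
        "AXPopUpButton", "AXMenuItem", "AXLink", "AXSlider"
    ]

    // Fall back to a window screenshot only when the AX tree looks implausibly thin.
    func shouldFallBackToVision(window: AXUIElement, minimumInteractive: Int = 3) -> Bool {
        var count = 0
        func walk(_ element: AXUIElement) {
            if let role = copyAttr(element, kAXRoleAttribute) as? String,
               interactiveRoles.contains(role) {
                count += 1
            }
            for child in (copyAttr(element, kAXChildrenAttribute) as? [AXUIElement] ?? []) {
                walk(child)
            }
        }
        walk(window)
        return count < minimumInteractive
    }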

The other place vision still matters is verification. After a sensitive action, a one-shot screenshot to confirm the result looks right is cheap insurance, even if the action itself was dispatched through AX. The point is that vision becomes a small, optional part of the loop instead of the loop itself.

Wrap-up

The cowork layer is the part of the agent stack where physics stops being abstract. The agent has to act on the same desktop the user is acting on, in real time, against UIs that change under it. Accessibility APIs are the substrate that gives that layer stable element references, push-based state, millisecond-scale perception, semantic roles, and a clean OS-level consent gate. Screenshot pipelines have their place, mostly as a fallback for apps that do not expose a tree, but they are the wrong default.

If you are building or evaluating a desktop agent stack right now, the question to ask any cowork product is not which model it uses or how clever its planner is. The question is what its perception substrate is, and what it falls back to when the substrate is thin. That answer determines whether the agent still works after the next OS update, the next dark mode toggle, and the next time the user resizes a window mid-task.

Try a cowork-layer agent built the right way

Fazm is a free, open-source AI agent for macOS. AX-tree perception, direct AXPress and value-set actions, AXObserver-driven event loop, vision only as fallback.

Free to start. Fully open source. Runs locally on your Mac.
