Open Computer Agent: Why the Best Ones Skip Screenshots
Every popular open computer agent today works the same way: take a screenshot, send it to a vision model, get back pixel coordinates, click there. It works, sort of. But there is a fundamentally different approach that most people building in this space have overlooked: reading the accessibility tree directly.
“Fazm uses real accessibility APIs instead of screenshots, so it interacts with any app on your Mac reliably and fast. Free to start, fully open source.”
fazm.ai
1. How screenshot-based agents actually work
Open computer agents like Coasty AI's Open Computer Use, E2B Desktop Sandbox, and Agent-S all follow the same loop. The agent takes a screenshot of the desktop. That image gets encoded (usually as base64 PNG) and sent to a vision-language model. The model looks at the image and responds with a coordinate pair: "click at (742, 381)." The agent moves the mouse to those coordinates and clicks.
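That loop is short enough to sketch. In this hypothetical Python version, `vision_model` stands in for a real vision-language-model API call, and the actual mouse click is omitted; only the encode-ask-parse shape of the loop is real:

```python
import base64
import re

def vision_model(image_b64: str, instruction: str) -> str:
    """Stub for a vision-language model call; a real agent would hit an API.
    Returns a coordinate prediction as free-form text."""
    return "click at (742, 381)"

def parse_coordinates(reply: str) -> tuple[int, int]:
    """Extract the (x, y) pair the model predicted from its text reply."""
    m = re.search(r"\((\d+),\s*(\d+)\)", reply)
    if not m:
        raise ValueError(f"no coordinates in model reply: {reply!r}")
    return int(m.group(1)), int(m.group(2))

def agent_step(screenshot_png: bytes, instruction: str) -> tuple[int, int]:
    """One iteration of the screenshot loop: encode image, ask model, parse click target."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    reply = vision_model(image_b64, instruction)
    return parse_coordinates(reply)
```

Note that the model's answer arrives as free text that must be parsed back into numbers, and nothing in the loop verifies that an interactive element actually exists at those coordinates.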
This works well enough in controlled benchmarks. Agent-S reports 72.6% on OSWorld. But the approach has inherent weaknesses. The model is literally guessing where to click based on what pixels look like. It has no semantic understanding of the UI. A button labeled "Submit" and a decorative image that happens to look like a button are equally valid click targets from the model's perspective.
Most of these agents also run inside a Docker container with a virtual Linux desktop. That means they can control browser windows and terminal apps inside the container, but they cannot touch native applications on your actual machine.
2. The accessibility tree: what it is and why it matters
Every modern operating system maintains an accessibility tree. This is a structured representation of every UI element currently on screen: buttons, text fields, menus, labels, sliders, checkboxes, and their relationships to each other. The tree exists so that screen readers (like VoiceOver on macOS) can describe the interface to users who cannot see it.
On macOS, the accessibility tree is exposed through the AXUIElement API. Any process with accessibility permissions can query it. For each element, you get its role (button, text field, menu item), its name or label, its position and size on screen, whether it is enabled or disabled, and its parent/child relationships in the element hierarchy.
This tree is not an approximation. It is the actual structure of the UI as the operating system understands it. When you read the tree, you know with certainty that there is a button called "Send" at a specific location, that it is currently enabled, and that it belongs to the Mail app's compose window. No vision model required.
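As a mental model (a simplification, not the real AXUIElement API), the tree is just nested nodes carrying a role, a label, geometry, and state, which can be searched structurally:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """Simplified stand-in for an accessibility element."""
    role: str                  # e.g. "AXButton", "AXTextField"
    title: str = ""
    x: int = 0
    y: int = 0
    w: int = 0
    h: int = 0
    enabled: bool = True
    children: list[AXNode] = field(default_factory=list)

def find(node: AXNode, role: str, title: str) -> AXNode | None:
    """Depth-first search for an element by role and title."""
    if node.role == role and node.title == title:
        return node
    for child in node.children:
        hit = find(child, role, title)
        if hit:
            return hit
    return None

# A compose window with a Send button, as the OS might report it:
window = AXNode("AXWindow", "New Message", children=[
    AXNode("AXButton", "Send", x=842, y=614, w=72, h=32),
])
send = find(window, "AXButton", "Send")
```

The search returns the element itself, with its position and enabled state attached, rather than a guess about where something might be drawn.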
3. Inside the architecture: AXUIElement, ref IDs, and MCP
Here is how Fazm actually reads and acts on the accessibility tree. This is the part that no other guide on open computer agents covers, because most agents do not work this way.
Fazm runs a compiled macOS binary called mcp-server-macos-use. This binary is a Model Context Protocol (MCP) server that makes native macOS API calls. When the agent needs to understand what is on screen, the MCP server calls AXUIElementCopyAttributeValue to walk the accessibility tree. It reads attributes like kAXRoleAttribute, kAXTitleAttribute, kAXPositionAttribute, and kAXSizeAttribute for every visible element.
The server converts this tree into a structured text format where each element gets a ref ID. A typical output line looks like:
[Button] "Send" x:842 y:614 w:72 h:32 visible [ref=e47]

When Claude decides to click that button, it does not say "click at pixel (878, 630)." It says click ref=e47. The MCP server looks up e47 in its element map, finds the actual AXUIElement handle, and calls AXUIElementPerformAction with kAXPressAction. The click lands on the exact element, every time.
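The serialization and ref-ID dispatch can be sketched in Python. This is a simplified model, not Fazm's actual implementation: `press` stands in for AXUIElementPerformAction with kAXPressAction, and the starting ref index is arbitrary:

```python
from dataclasses import dataclass

@dataclass
class Element:
    """Stand-in for an element handle held by the MCP server."""
    role: str
    title: str
    x: int
    y: int
    w: int
    h: int
    visible: bool = True
    pressed: bool = False

    def press(self) -> None:
        """Stand-in for AXUIElementPerformAction(kAXPressAction)."""
        self.pressed = True

def serialize(elements: list[Element]) -> tuple[list[str], dict[str, Element]]:
    """Emit one text line per element and build the ref-ID -> handle map."""
    lines, refs = [], {}
    for i, el in enumerate(elements):
        ref = f"e{i + 47}"  # arbitrary starting index, chosen to match the example line
        refs[ref] = el
        vis = "visible" if el.visible else "hidden"
        lines.append(f'[{el.role}] "{el.title}" x:{el.x} y:{el.y} w:{el.w} h:{el.h} {vis} [ref={ref}]')
    return lines, refs

def click(refs: dict[str, Element], ref: str) -> None:
    """Resolve a ref ID to its element handle and press it -- no pixel guessing."""
    refs[ref].press()

lines, refs = serialize([Element("Button", "Send", 842, 614, 72, 32)])
click(refs, "e47")
```

The key property is that the model only ever names a ref ID; the mapping from ID back to a live element handle stays on the server side, so a click cannot land anywhere except on the element it was addressed to.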
This architecture (Swift desktop app, Node.js ACP bridge, MCP servers) means the agent can run multiple tools concurrently. It uses macos-use for native app control and Playwright MCP for browser automation, choosing the right tool for each task without the user having to specify which.
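Tool selection can be as simple as routing on what the task targets. The keyword heuristic below is purely illustrative (it is not Fazm's actual routing logic; in practice the LLM itself chooses among the available MCP tools):

```python
def choose_tool(task: str) -> str:
    """Route a task to an MCP server: a browser tool for web work,
    a native-control tool for everything else. Illustrative heuristic only."""
    web_markers = ("http://", "https://", "website", "browser")
    if any(marker in task.lower() for marker in web_markers):
        return "playwright"
    return "macos-use"
```

Because both servers are plain MCP tools, they can be registered side by side and invoked in the same session without the user naming either one.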
Try the accessibility-tree approach yourself
Fazm is a free, open-source Mac app. No Docker, no API keys, no developer setup. Just download and tell it what to do.
Try Fazm Free

4. What breaks with screenshots (and does not break with the tree)
Screenshot agents have several failure modes that accessibility-tree agents avoid entirely:
Resolution and scaling. A screenshot on a Retina display is 2x the pixel density of a standard display. The vision model's coordinate prediction may be off by the scaling factor if the training data did not include enough Retina screenshots. The accessibility tree reports positions in screen-coordinate space regardless of display scaling.
Overlapping elements. When a dropdown menu overlaps a button, a screenshot shows both stacked. The vision model might try to click the button underneath. The accessibility tree knows which element is in the foreground because the tree hierarchy reflects z-order.
Dynamic content. If a loading spinner replaces a button between the screenshot and the click, the agent clicks where the button used to be. An accessibility-tree agent checks element state before acting and can wait for the target element to appear.
Dark mode, themes, and custom styling. A screenshot agent trained primarily on light-mode UIs may struggle with dark-mode interfaces. The accessibility tree is style-agnostic: a button is a button regardless of its background color.
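The dynamic-content case reduces to a wait-and-verify loop: re-query the tree until the target element exists and is enabled, instead of clicking blind. A sketch, with `query_tree` standing in for a fresh read of the accessibility tree:

```python
import time

def wait_for_element(query_tree, title: str, timeout: float = 5.0, interval: float = 0.1):
    """Poll the accessibility tree until an enabled element with `title` appears."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for el in query_tree():
            if el.get("title") == title and el.get("enabled"):
                return el
        time.sleep(interval)
    raise TimeoutError(f"element {title!r} never became available")

# Simulate a button that stays disabled behind a spinner for two polls:
calls = {"n": 0}
def query_tree():
    calls["n"] += 1
    enabled = calls["n"] >= 3  # becomes enabled on the third poll
    return [{"title": "Submit", "enabled": enabled}]

button = wait_for_element(query_tree, "Submit")
```

A screenshot agent has no equivalent of this check: by the time the image reaches the model, the screen may already have changed.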
5. Native Mac apps: the gap no Docker agent can fill
This is the biggest practical difference. Open computer agents built on Docker (Coasty AI, E2B, and most others) run a virtual Linux desktop inside a container. They can automate Firefox or Chromium inside that container, and they can run terminal commands. But they cannot open Finder, compose an email in Apple Mail, edit a spreadsheet in Numbers, or interact with Slack's native Mac app.
Fazm runs directly on your Mac. When you ask it to "find the latest invoice PDF in my Downloads folder and attach it to an email to accounting," it opens Finder, navigates to Downloads, identifies the file, opens Mail, creates a new message, and attaches the file. Each step uses the accessibility tree of the respective app. No container boundary, no file-sharing hacks, no clipboard workarounds.
This also means it works with apps that have no web equivalent. Preview, Automator, System Settings, Xcode, Final Cut Pro. If the app has a window and exposes accessibility elements, Fazm can interact with it.
6. A consumer agent, not a developer framework
Most open computer agents are developer frameworks. Agent-S requires Python, a grounding model, and configuration of multiple sub-agents. E2B requires an API key and Docker. Coasty AI runs a Next.js frontend with a FastAPI backend. These are tools built for developers who want to build agent systems.
Fazm is a Mac app. You download it, grant accessibility permissions, and type what you want done in a floating bar that sits over your desktop. There is no terminal involved. There is no configuration file. The agent runs your task and shows you what it is doing in real time.
This distinction matters because the people who would benefit most from a computer agent are often not developers. They are the people spending hours on repetitive tasks across multiple apps, copy-pasting data between spreadsheets and CRMs, filing reports that require pulling data from three different tools. A framework that requires Docker and Python is not solving their problem.
Frequently asked questions
What is the difference between an accessibility-based computer agent and a screenshot-based one?
A screenshot-based agent takes a picture of your screen, sends the image to a vision model, and guesses pixel coordinates to click. An accessibility-based agent reads the operating system's accessibility tree, which contains every interactive element with its name, role, and exact position. It targets elements by structured ref IDs instead of guessing coordinates, making actions faster and more reliable.
Can an open computer agent control native Mac apps like Finder, Mail, or Excel?
Most open computer agents run inside a Linux Docker container and can only control browser tabs or terminal windows. Fazm uses macOS accessibility APIs (AXUIElement) to interact with any app that exposes an accessibility tree, including Finder, Mail, Numbers, and third-party apps like Slack or Figma.
Do I need to write code or set up a developer environment to use Fazm?
No. Fazm is a native Mac app you download and run. There is no Docker setup, no Python environment, and no API keys to configure. You type what you want done in a floating bar and the agent executes it across your apps.
How does the accessibility tree approach handle apps that do not expose accessibility data?
Some apps (especially games or custom-rendered canvases) do not expose a full accessibility tree. In those cases, Fazm can fall back to screenshot capture via ScreenCaptureKit. But for the vast majority of productivity apps on macOS, the accessibility tree provides complete element information, because standard AppKit and SwiftUI controls expose accessibility elements automatically.
Is Fazm open source?
Yes. The desktop app, the ACP bridge, and the macos-use MCP server are all open source. You can inspect exactly how it reads the accessibility tree, what data it sends to the LLM, and how it executes actions on your behalf.
What LLM does Fazm use under the hood?
Fazm connects to Claude via Anthropic's Agent Client Protocol (ACP). The agent receives the accessibility tree as structured text, not as an image, which means it uses a text model rather than a vision model for most interactions. This reduces latency and token cost compared to screenshot-based agents that must encode and process large images.
See the accessibility-tree agent in action
Fazm is free, open source, and works with every app on your Mac. No Docker setup. No API keys. Just download and go.
Try Fazm Free

macOS 14+ required