Hugging Face Open Computer Agent runs a Linux VM in a browser tab. A real Mac agent reads the accessibility tree of your actual running apps.
Every top result for this keyword describes Hugging Face Open Computer Agent as Qwen2-VL-72B plus smolagents plus E2B Desktop, and stops there. None of them note the consequence: the 'computer' in 'Open Computer Agent' is a virtual Linux box rendered inside a browser iframe, so it cannot touch your real Mac, your Finder, your logged-in Stripe, or your iMessage. This guide walks through the exact architectural difference, and where a native Mac agent like Fazm sits on the other side.
“The anchor fact that no SERP result for 'hugging face open computer agent' covers: Fazm ships a 21,149,408-byte ARM64 Mach-O binary called mcp-server-macos-use inside Fazm.app/Contents/MacOS, registered by the Node bridge at acp-bridge/src/index.ts line 63 (const macosUseBinary = join(contentsDir, 'MacOS', 'mcp-server-macos-use')), spawned as MCP server 'macos-use' at lines 1056-1064, and whitelisted as a built-in at line 1266 (BUILTIN_MCP_NAMES). The binary calls AXUIElementCreateApplication + AXUIElementCopyAttributeValue with kAXFocusedWindowAttribute (AppState.swift lines 439, 441, 470, 472) to walk the live accessibility tree of any running Mac app and returns rows like '[AXButton] "Send" x:842 y:712 w:68 h:32 visible'. Hugging Face Open Computer Agent, by contrast, runs Qwen2-VL-72B against a screenshot of an E2B Desktop Linux VM rendered in an iframe on huggingface.co. The architectural ceiling is not a model choice; it is the sandbox boundary.”
/Users/matthewdi/fazm/acp-bridge/src/index.ts:63 + Desktop/Sources/AppState.swift:439-472
How a Mac-native computer-use call actually flows
A prompt enters the Fazm floating bar. The model picks the macos-use tool. The bundled native binary walks the accessibility tree of the target app and the click resolves to a labelled element, not a guessed pixel.
From a sentence in Fazm to a click on the actual Send button
The five code sites that make the difference
None of this is marketing. These are the exact lines you would grep if you opened the Fazm source tree yourself.
mcp-server-macos-use (21MB ARM64 Mach-O)
Ships inside Fazm.app/Contents/MacOS/mcp-server-macos-use (verified: 21,149,408 bytes). Registered by acp-bridge/src/index.ts line 63: const macosUseBinary = join(contentsDir, 'MacOS', 'mcp-server-macos-use'). Exposes click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, open_application_and_traverse, refresh_traversal. Each returns a compact text summary plus a full .txt dump of the accessibility tree and a PNG screenshot.
BUILTIN_MCP_NAMES
acp-bridge/src/index.ts line 1266 declares: new Set(['fazm_tools', 'playwright', 'macos-use', 'whatsapp', 'google-workspace']). macos-use is a first-class built-in, not an optional plug-in.
Accessibility API call site
Desktop/Sources/AppState.swift lines 439-472 call AXUIElementCreateApplication() and AXUIElementCopyAttributeValue(appElement, kAXFocusedWindowAttribute as CFString, &focusedWindow). Lines 470-472 do the same for Finder specifically. No screenshot is involved in locating the window.
CGEvent tap permission probe
AppState.swift lines 487-503 use CGEvent.tapCreate(.cgSessionEventTap, ...) to verify that Accessibility permission is live against the TCC database. If the tap is nil, the app disables automation until the user grants permission.
Tool routing in the system prompt
Desktop/Sources/Chat/ChatPrompts.swift line 56: 'ALWAYS use capture_screenshot. NEVER use browser_take_screenshot.' Line 59: 'macos-use tools for Finder, Settings, Mail, etc.' Line 61: 'playwright tools ONLY for web pages inside Chrome.' The model is explicitly steered away from vision-first clicks on Mac apps.
What 'read the accessibility tree' looks like on the wire
The agent does not get pixels here. It gets four rows of role, label, and resolved coordinates, and picks one by name.
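To make the row format concrete, here is a minimal sketch of parsing such rows and selecting one by role and label. The grammar is an assumption inferred from the single sample row quoted in this article (`[AXButton] "Send" x:842 y:712 w:68 h:32 visible`); the names `AxRow`, `parseRow`, and `findByRoleAndLabel` are hypothetical and are not taken from the Fazm source.

```typescript
// One parsed row of an accessibility-tree dump (hypothetical type).
interface AxRow {
  role: string;
  label: string;
  x: number;
  y: number;
  w: number;
  h: number;
  visible: boolean;
}

// Assumed grammar: [Role] "Label" x:N y:N w:N h:N [visible]
const ROW_PATTERN =
  /^\[(\w+)\]\s+"([^"]*)"\s+x:(-?\d+)\s+y:(-?\d+)\s+w:(\d+)\s+h:(\d+)(\s+visible)?$/;

// Returns null for any line that does not fit the assumed grammar.
function parseRow(line: string): AxRow | null {
  const m = ROW_PATTERN.exec(line.trim());
  if (m === null) return null;
  return {
    role: m[1],
    label: m[2],
    x: Number(m[3]),
    y: Number(m[4]),
    w: Number(m[5]),
    h: Number(m[6]),
    visible: m[7] !== undefined,
  };
}

// Select visible rows by name, the way the article says the agent picks a target.
function findByRoleAndLabel(rows: AxRow[], role: string, label: string): AxRow[] {
  return rows.filter((r) => r.visible && r.role === role && r.label === label);
}

const dump = [
  '[AXButton] "Send" x:842 y:712 w:68 h:32 visible', // sample row from this article
  '[AXTextField] "Message" x:300 y:712 w:520 h:32 visible', // hypothetical row
];
const rows = dump.map(parseRow).filter((r): r is AxRow => r !== null);
console.log(findByRoleAndLabel(rows, "AXButton", "Send"));
```

The point of the sketch is that selection is a string match on role and label; the coordinates are carried along from the tree rather than estimated.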
The system prompt that routes Mac apps to the native tool
Notice the hard 'NEVER use browser_take_screenshot' for desktop apps. This is the line that prevents a vision-first pass from ever happening on a Mac app.
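In Fazm the routing is done by natural-language instructions in the prompt, not by code. Purely as an illustration of the rule those prompt lines express, the same decision could be sketched as a function. Everything below (`Target`, `toolFamilyFor`, `SCREENSHOT_TOOL`) is hypothetical; only the tool names `macos-use`, `playwright`, and `capture_screenshot` come from the quoted prompt.

```typescript
// What the agent is about to act on (hypothetical type).
type Target =
  | { kind: "mac-app"; app: string }
  | { kind: "web-page"; url: string };

type ToolFamily = "macos-use" | "playwright";

// Prompt line 56: capture_screenshot is the only screenshot tool to use.
const SCREENSHOT_TOOL = "capture_screenshot";

// Prompt lines 59 and 61: Mac apps go to macos-use; playwright is reserved
// for web pages inside Chrome.
function toolFamilyFor(target: Target): ToolFamily {
  return target.kind === "web-page" ? "playwright" : "macos-use";
}

console.log(toolFamilyFor({ kind: "mac-app", app: "Finder" }));
console.log(toolFamilyFor({ kind: "web-page", url: "https://dashboard.stripe.com" }));
console.log(SCREENSHOT_TOOL);
```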
Where the native binary gets registered as an MCP server
Three blocks in one file. The path, the spawn, the built-in whitelist. Line numbers are real as of 2026-04-19.
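A condensed sketch of how those three pieces could fit together follows. The binary path segments and the contents of `BUILTIN_MCP_NAMES` are taken from the lines quoted in this article; the `McpServerEntry` shape, the helper names, and the empty `args` array are assumptions made for illustration, since the spawn block itself is not reproduced here.

```typescript
// Built-in server names, as quoted from acp-bridge/src/index.ts line 1266.
const BUILTIN_MCP_NAMES = new Set([
  "fazm_tools",
  "playwright",
  "macos-use",
  "whatsapp",
  "google-workspace",
]);

// Hypothetical shape of one MCP server entry; the real type is not shown in the article.
interface McpServerEntry {
  name: string;
  command: string;
  args: string[];
}

// The source uses join(contentsDir, 'MacOS', 'mcp-server-macos-use');
// a template string gives the same result for macOS paths.
function macosUseBinaryPath(contentsDir: string): string {
  return `${contentsDir}/MacOS/mcp-server-macos-use`;
}

// Assumed registration: the server is named 'macos-use' and launched from the bundle.
function macosUseEntry(contentsDir: string): McpServerEntry {
  return { name: "macos-use", command: macosUseBinaryPath(contentsDir), args: [] };
}

// True for the five built-ins, false for anything user-configured.
function isBuiltin(name: string): boolean {
  return BUILTIN_MCP_NAMES.has(name);
}

const entry = macosUseEntry("/Applications/Fazm.app/Contents");
console.log(entry.command, isBuiltin(entry.name));
```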
One macos-use turn, actor by actor
A prompt to send a message in Messages.app. Traced through the same call chain Fazm uses in production.
click_and_traverse on Messages.app
Hugging Face Open Computer Agent vs a Mac-native agent, feature by feature
Ten rows. Each one is a downstream consequence of the core architecture: a sandboxed Linux VM rendered in a browser iframe, versus a signed Mac app that reads the live accessibility tree and drives CGEvents against your real screen.
| Feature | Hugging Face Open Computer Agent (VLM + E2B sandbox) | Fazm (native Mac, accessibility-tree driven) |
|---|---|---|
| Where the computer lives | An E2B Desktop virtual Linux machine rendered in an iframe inside your browser tab on huggingface.co/spaces/smolagents/computer-agent. The 'computer' is a cloud sandbox, not your machine. | Your actual Mac. The agent runs as a Swift desktop app plus a Node bridge process that spawns a bundled 21MB ARM64 binary called mcp-server-macos-use (Fazm.app/Contents/MacOS/mcp-server-macos-use). |
| How the agent sees the UI | Qwen2-VL-72B reads a screenshot of the virtual Linux desktop as image pixels. The model generates candidate click coordinates from what it sees in the image. | Reads the macOS accessibility tree via AXUIElementCreateApplication + AXUIElementCopyAttributeValue(kAXFocusedWindowAttribute) (AppState.swift lines 439, 441, 470, 472). Elements come back as a text tree with role, label, and already-resolved x, y, w, h. |
| How a click is decided | The VLM guesses pixel coordinates from the image and issues a mouse-move + click. A redesign, a modal overlay, or a font change shifts pixels and the guess misses. | The agent picks an element by its accessibility-tree text (role + label) or by passing the 'element' parameter. The tool centers the click at (x+w/2, y+h/2) using coordinates already in the tree. |
| Access to logged-in SaaS | None. The sandbox boots fresh each session with a Firefox inside a Linux VM. You would have to log into every service inside the sandbox each time, and 2FA / WebAuthn tied to your real device cannot reach it. | Everything you are already logged into on this Mac: Mail, iMessage, Slack, Figma desktop, Notion desktop, Stripe Dashboard in your Chrome, Gmail in your Chrome. The agent attaches to your real sessions, not new ones. |
| What apps the agent can drive | Whatever ships in the E2B Desktop Linux image. Typically Firefox, a file manager, a terminal. No macOS apps exist inside the VM by definition. | Any app with an accessibility interface: Finder, System Settings, Mail, Messages, Slack, Calendar, Notes, Keynote, Xcode, Terminal, plus any Electron app (ChatGPT desktop, Notion, Linear), plus any website through Playwright MCP in the user's real Chrome. |
| Model | Qwen2-VL-72B as the primary vision-language model, scheduled on the Space's shared inference hardware. If the Space is under heavy load, you queue. | User-configurable. Default is Anthropic Claude 4.6 Sonnet, with one-click switching to Opus, Haiku, GPT, or local Ollama. Model is swapped per-session from the floating bar. |
| Audit trail | Screenshots at each step plus the model's generated coordinates. There is no semantic label attached to the click, only 'I clicked at 482, 317 because the screenshot showed a button there'. | Each tool call is logged as a structured row: app name, element role, label, click coordinates, result. You can grep the accessibility tree .txt dump and trace exactly which UI element was clicked. |
| File access | The sandbox's own virtual file system. To move a file out, you type commands in the sandbox terminal and upload through a browser form. Your real Downloads folder is unreachable. | macOS sandbox rules apply. The app asks for Full Disk Access, Accessibility, and Automation permissions explicitly. Once granted, the agent can read and write your actual Documents, Downloads, and iCloud. |
| Open source | smolagents is MIT-licensed on GitHub. The E2B Desktop sandbox image is available. But the Space as deployed is bound to Hugging Face infra; running it on your own Mac means replicating the VM stack locally. | Fazm desktop ships as a signed and notarized macOS app. The Mac app source tree lives at /Users/matthewdi/fazm on the author's machine; the bundled MCP binaries (mcp-server-macos-use, whatsapp MCP, google-workspace MCP) are installed in Contents/MacOS/ and can be inspected with otool or codesign. Not a hosted Space. |
| What happens on CAPTCHAs | Well-known failure mode. Hugging Face's own page lists CAPTCHAs, multi-step authentication, and elaborate forms as the explicit limits of the Space. | For Mac apps, there are no CAPTCHAs. For web pages, the Playwright MCP runs inside your real Chrome via the Playwright MCP Bridge extension, so Cloudflare Turnstile and hCaptcha pass transparently against your real browser fingerprint. |
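The click rule in the "How a click is decided" row, (x+w/2, y+h/2), is plain arithmetic on the frame that the tree already reports. A minimal sketch, with hypothetical names `Frame` and `clickPoint`, applied to the sample Send button frame quoted in this article:

```typescript
// Frame of an element as reported in an accessibility-tree row.
interface Frame {
  x: number;
  y: number;
  w: number;
  h: number;
}

interface Point {
  x: number;
  y: number;
}

// Center of the frame: (x + w/2, y + h/2).
function clickPoint(frame: Frame): Point {
  return { x: frame.x + frame.w / 2, y: frame.y + frame.h / 2 };
}

// Sample row from the article: x:842 y:712 w:68 h:32
const sendButton: Frame = { x: 842, y: 712, w: 68, h: 32 };
console.log(clickPoint(sendButton)); // { x: 876, y: 728 }
```

Because the frame comes from the tree, a layout change that moves the button changes the input to this function, not the way the target is identified.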
Apps a Mac-native agent can actually touch
Every one of these is unreachable from a Linux sandbox by definition. mcp-server-macos-use reads the accessibility tree of each, so the agent clicks by label, not by pixel.
One realistic prompt, run through both
The task: 'open my Stripe dashboard and download April invoices'. Here is where each agent stops being able to proceed.
User opens Hugging Face Open Computer Agent
The page loads huggingface.co/spaces/smolagents/computer-agent. An iframe boots an E2B Desktop VM. A virtual Firefox appears inside the iframe. The 'computer' the agent is about to control is a Linux box in a data center, not the user's Mac.
User types 'open my Stripe dashboard and download April invoices'
The VLM (Qwen2-VL-72B) reads a screenshot of the virtual Linux desktop. It generates a plan: open Firefox, navigate to stripe.com. Firefox opens inside the sandbox. Stripe prompts for login because this is a fresh Firefox with an empty cookie jar.
The agent is now stuck at the login wall
Your real Stripe session lives on your Mac, in your Chrome, protected by the TOTP app on your iPhone. None of that reaches the sandbox. A fresh login inside a cloud VM would likely trip Stripe's new-device check and ask for a second factor that the VM has no way to receive.
Same prompt, run through Fazm on the Mac
Fazm's model reads the accessibility tree of the frontmost Chrome window via macos-use's refresh_traversal, or it uses Playwright MCP in extension mode to attach to the real Chrome tab where Stripe is already open. Either way, the session is live on turn one. The click path is 'Invoices' → April → 'Download PDF' × 3, each click identified by label in the tree.
File lands in ~/Downloads on the real Mac
Because Fazm is running as a real macOS process with Full Disk Access, Downloads is the user's actual Downloads. No export step. No upload dance from a sandbox file system.
What the Fazm log shows during a real macos-use turn
This is the kind of trace you would see in /tmp/fazm-dev.log when the agent sends a Slack message via the native accessibility tree.
Why this is auditable, not marketing
Every claim above maps to a path, a line number, or a tool call you can run yourself. This is the trust surface.
What you can verify on your own machine
- The mcp-server-macos-use binary is bundled inside the Fazm.app package, and you can verify its signature with `codesign -dvvv /Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use`.
- Every macos-use tool call writes a full accessibility-tree dump to /tmp/macos-use/<timestamp>_<action>.txt, plus a screenshot PNG next to it. The agent reads the tree; you read the same tree.
- By default, the agent does not use a screenshot to locate a button. It reads a labelled row from the tree and clicks the resolved centroid; the screenshot is a fallback for ambiguous cases.
- The CGEvent tap probe at AppState.swift:487-503 runs before the agent starts; if Accessibility permission is not granted, automation is refused rather than silently degraded.
- The routing prompt in ChatPrompts.swift:56-61 is the reason the model never asks for a browser_take_screenshot on a Mac app. The prompt is source-controlled and auditable.
- Nothing about this design requires a hosted Space. Your data does not leave the machine except through the model call the user explicitly made.
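If you want to review those dumps yourself, one approach is to pair each tree file with its screenshot by shared file stem. The sketch below assumes only the `<timestamp>_<action>.txt` and `.png` naming described above; the timestamp format is not specified in this article, so the file names in the example are hypothetical, as are `TurnArtifacts` and `pairArtifacts`.

```typescript
// The tree dump and screenshot written by one tool call (hypothetical type).
interface TurnArtifacts {
  stem: string;
  tree?: string;
  screenshot?: string;
}

// Group file names from /tmp/macos-use by stem, keeping only .txt and .png files.
function pairArtifacts(fileNames: string[]): TurnArtifacts[] {
  const byStem = new Map<string, TurnArtifacts>();
  for (const name of fileNames) {
    const dot = name.lastIndexOf(".");
    if (dot <= 0) continue; // no extension, or a dotfile
    const stem = name.slice(0, dot);
    const ext = name.slice(dot + 1).toLowerCase();
    if (ext !== "txt" && ext !== "png") continue;
    const entry = byStem.get(stem) ?? { stem };
    if (ext === "txt") entry.tree = name;
    else entry.screenshot = name;
    byStem.set(stem, entry);
  }
  return Array.from(byStem.values());
}

// Hypothetical directory listing; real timestamps may look different.
const listing = [
  "1713500000_click_and_traverse.txt",
  "1713500000_click_and_traverse.png",
  "1713500007_type_and_traverse.txt",
];
console.log(pairArtifacts(listing));
```

A turn whose entry has a tree but no screenshot (or the reverse) is then easy to spot.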
Want an agent that drives your actual Mac, not a sandbox in a browser tab?
Fazm is a signed, notarized Mac app. It reads the accessibility tree of your running apps, attaches to your real Chrome for the web, and takes natural-language commands from a floating bar. Free to download, one-time install.
Download Fazm →
See mcp-server-macos-use drive your Mac live
Book 20 minutes and we will install Fazm on your machine, show the accessibility-tree trace, and run a task of your choosing against your actual apps.
Book a call →
Hugging Face Open Computer Agent, answered against the Fazm source
What is Hugging Face Open Computer Agent, exactly?
It is a Hugging Face Space hosted at huggingface.co/spaces/smolagents/computer-agent. Under the hood it runs Qwen2-VL-72B as the vision-language model, the smolagents framework for the agent loop, and an E2B Desktop sandbox image that boots a small Linux VM with Firefox and a file manager inside. The Space renders the VM in an iframe in your browser tab. The agent reads screenshots of the VM and generates mouse and keyboard events against the VM.
Why can it not drive my Mac directly?
Because there is no Mac in the loop. The 'computer' the agent controls is a Linux virtual machine running in a data center, exposed to you through a browser iframe. Your real apps, your real Finder, your real logged-in sessions, your real Keychain, and your real filesystem all live outside the sandbox boundary. The sandbox is the entire point of the design: it isolates the agent from every real resource.
What does Fazm do differently at the architecture level?
Fazm runs as a signed macOS app on your own Mac. It ships a native ARM64 binary called mcp-server-macos-use (21,149,408 bytes) inside Fazm.app/Contents/MacOS, and registers it as an MCP server in acp-bridge/src/index.ts at line 63 (the path) and lines 1056-1064 (the spawn). The binary uses the macOS accessibility API (AXUIElement) to walk the UI tree of any running Mac app. Because the agent reads structured accessibility data rather than pixel screenshots, clicks are bound to labelled elements with resolved coordinates, not guessed from images.
What does 'reads the accessibility tree' mean in practice?
The mcp-server-macos-use binary calls AXUIElementCreateApplication() on a target app's PID and walks the window hierarchy via AXUIElementCopyAttributeValue with attribute keys like kAXFocusedWindowAttribute, kAXChildrenAttribute, and kAXRoleAttribute. It writes the result to /tmp/macos-use/<timestamp>_<action>.txt as rows like `[AXButton] "Send" x:842 y:712 w:68 h:32 visible`. The agent picks a row by role and label, passes the row identifier to click_and_traverse, and the tool clicks at the centroid. No vision model guesses coordinates.
So is vision ever used?
Yes, for disambiguation. Each macos-use tool call also returns a PNG screenshot at /tmp/macos-use/<timestamp>_<action>.png. When the accessibility tree is ambiguous (two buttons with the same label, a canvas-based UI like Figma, a game), the agent can read the screenshot to decide which element to target. But the baseline is the tree, and the tree is why reliability holds up across app updates that change pixels but not semantic labels.
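One way to picture the tree-first, screenshot-second order is a resolver that only asks for the image when the tree does not yield exactly one match. This is a sketch of that decision, not Fazm's actual logic; `TreeRow`, `Resolution`, `resolveTarget`, and `needsScreenshot` are hypothetical names.

```typescript
// Minimal view of a tree row for matching purposes (hypothetical type).
interface TreeRow {
  role: string;
  label: string;
}

type Resolution =
  | { kind: "unique"; row: TreeRow }
  | { kind: "ambiguous"; candidates: TreeRow[] }
  | { kind: "missing" };

// Match by role and label, and report whether the match is unambiguous.
function resolveTarget(rows: TreeRow[], role: string, label: string): Resolution {
  const candidates = rows.filter((r) => r.role === role && r.label === label);
  if (candidates.length === 1) return { kind: "unique", row: candidates[0] };
  if (candidates.length > 1) return { kind: "ambiguous", candidates };
  return { kind: "missing" };
}

// The screenshot is consulted only when the tree alone cannot settle the target,
// for example duplicate labels or a canvas UI with a thin tree.
function needsScreenshot(resolution: Resolution): boolean {
  return resolution.kind !== "unique";
}

const twoSendButtons: TreeRow[] = [
  { role: "AXButton", label: "Send" },
  { role: "AXButton", label: "Send" },
];
console.log(needsScreenshot(resolveTarget(twoSendButtons, "AXButton", "Send"))); // true
```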
Can I run Hugging Face Open Computer Agent against my Mac if I run it locally?
You can run smolagents locally on your Mac, and you can spin up E2B Desktop on your Mac. But the architecture is unchanged: the agent still drives a Linux VM, not macOS. To drive macOS you would need a macOS-accessibility tool analogous to mcp-server-macos-use, and smolagents does not ship one. This is the specific gap Fazm fills.
What about the web-page case? HF's agent has a browser, Fazm has a browser too.
The HF Open Computer Agent has a Firefox inside a Linux VM with a blank cookie jar. To act on any logged-in site, the agent has to log in from zero, which fails on SSO, WebAuthn, and TOTP-to-your-phone. Fazm bundles Playwright MCP running in --extension mode against a Chrome Web Store extension (Playwright MCP Bridge, ID mmlmfjhmonkocbjadbfplnigmagldckm) that attaches to your real Chrome on the Mac. Every site you are already logged into is live on turn one, and Cloudflare Turnstile and hCaptcha pass against your real browser fingerprint. The code that wires this up is in acp-bridge/src/index.ts lines 1029-1031.
Is Fazm open source?
Fazm is a consumer Mac app. You install it from fazm.ai as a signed, notarized .app bundle. You can inspect the bundled MCP binaries (mcp-server-macos-use, the whatsapp MCP, google-workspace MCP, the Node acp-bridge) with standard macOS tools like `codesign -dvvv`, `otool -L`, and `strings`, and you can read the bundled acp-bridge JavaScript directly in Fazm.app/Contents/Resources/acp-bridge. You do not need to trust a hosted service; the process runs locally on your Mac.
What kinds of tasks is each agent actually good for?
Hugging Face Open Computer Agent shines as a demo of vision-language-driven control on an arbitrary Linux desktop, especially for researchers benchmarking VLMs. Fazm shines for day-to-day Mac work: read your inbox, schedule a meeting, summarize the active Slack, fill a form in your actual Stripe Dashboard, rename files in Finder, push a git branch. The first is a research vehicle; the second is a consumer tool.
What stops me from running them together?
Nothing. Fazm's BUILTIN_MCP_NAMES (acp-bridge/src/index.ts line 1266) defines five built-in MCP servers, and the app accepts additional user-configured MCP servers at runtime. You could attach a smolagents-backed MCP server that fans out to an E2B sandbox, and use it from Fazm as just another tool. The architectures are complementary: Fazm is the native-Mac lane, the sandbox is the 'run this script on a throwaway Linux box' lane.
Where can I verify the file sizes and line numbers in this post?
The Fazm app source tree lives on the author's Mac at /Users/matthewdi/fazm. The specific citations: acp-bridge/src/index.ts lines 63 (binary path), 1056-1064 (MCP spawn), 1266 (BUILTIN_MCP_NAMES); Desktop/Sources/AppState.swift lines 439, 441, 470, 472 (AXUIElement calls), lines 487-503 (CGEvent tap probe); Desktop/Sources/Chat/ChatPrompts.swift lines 56-61 (tool routing prompt). File sizes: /Users/matthewdi/fazm/build/Fazm Dev.app/Contents/MacOS/mcp-server-macos-use reports 21,149,408 bytes via `ls -la`.
Does 'reads the accessibility tree' work on every app, including Electron?
Yes for Electron, because Chromium exposes a full accessibility tree through macOS's AX* APIs when accessibility is requested. This means Slack, Notion, Linear, VS Code, Discord, ChatGPT desktop, and Claude desktop are all readable by mcp-server-macos-use as plain accessibility trees. Native AppKit apps (Finder, Settings, Mail, Notes) work out of the box. The only tricky case is canvas or GPU-drawn apps (Figma, games), where the tree is thin and the screenshot fallback does the heavy lifting.
More on accessibility trees, sandbox ceilings, and native Mac agents
Neighboring guides
Accessibility API vs screenshot agents
The same argument applied to every Mac app: label-bound clicks beat pixel guesses across UI updates.
A web browser automation tool that runs inside your real Chrome
Playwright MCP in --extension mode attaches to the Chrome you already use, so logged-in sessions are live on turn one.
Claude Computer Use agent vs a Mac-native agent
Another vision-first computer-use model, evaluated against a native macOS accessibility-tree approach.