A browser automation tool where the browser is one tool of five, and none of them click by screenshot
The top ten results for this keyword all review browser-only libraries or screenshot-driven agents. Fazm is neither. It is a signed Mac app that runs five MCP servers as peers (fazm_tools, playwright, macos-use, whatsapp, google-workspace), reads the macOS accessibility tree instead of pixels, and asks the language model to pick a numbered ref, not a coordinate. Here is what that actually looks like in the source.
THE TWO EXISTING CATEGORIES
Every browser automation tool on the SERP is one of these two
Read the top ten results for this keyword back to back and the shape is obvious. Tools come in two flavors, both bounded, both with a known failure mode.
Category A: code frameworks
Playwright, Puppeteer, Selenium, WebDriverIO. You write code against a browser driver. Deterministic, fast, robust in CI. The tradeoff: they are bound to a single tab, and any task that leaves the tab (desktop app, OS dialog, local file) is out of scope. You also have to write the script, which is not a tool your parents can use.
Category B: vision AI agents
Browser-Use, Nanobrowser, Multi-On, Adept. A vision model looks at a screenshot of the page, labels buttons, clicks by coordinates. Works with no code, which is great. Breaks the moment the page scrolls, re-renders, or has a button the vision model mislabels. And still bound to a tab.
Category C, Fazm: structural, multi-app, consumer
No recorded script, no vision click. The agent reads a YAML accessibility snapshot of the page (Playwright is launched with --image-responses omit explicitly, acp-bridge/src/index.ts:1033) and picks a ref. When the task leaves the tab, the agent switches to the macos-use MCP server and keeps going against the native app via the real AX API. One download, one permission grant, one English sentence.
THE ANCHOR FACT
Five names in one Set, in one file
This is the thing that makes Fazm structurally different from every other result on this page. In the bridge that connects the Fazm Mac app to the Claude Agent SDK, there is a single source of truth for which tools are built in. It is a JavaScript Set with five names in it. The browser is one of them.
Every result the agent produces is routed through one of these five. If the task is "email the PDF to Sam", the agent calls google-workspace. If it is "rename the folder in Finder", it calls macos-use. If it is "fill the form on this site", it calls playwright. Same message, same agent, different peer. You cannot build that from a browser library, because a browser library starts from the assumption that the world ends at the URL bar.
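The article quotes this set verbatim later in the FAQ; as a runnable sketch (an illustration of the quoted line, not the verbatim source file) it looks like this:

```typescript
// The five built-in MCP server names, as quoted from acp-bridge/src/index.ts:1266.
// Sketch for illustration; surrounding details in the real file may differ.
const BUILTIN_MCP_NAMES = new Set<string>([
  'fazm_tools',
  'playwright',
  'macos-use',
  'whatsapp',
  'google-workspace',
]);

// The browser is one member of the set, not the root of a hierarchy.
console.log(BUILTIN_MCP_NAMES.size);              // 5
console.log(BUILTIN_MCP_NAMES.has('playwright')); // true
```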
THE SHAPE OF IT
One agent, five peers, one accessibility tree
How one English sentence picks a tool
WHAT EACH ONE DOES
The five peers, broken out
playwright
Chrome over Playwright. Launched with --output-mode file --image-responses omit so the model sees a YAML snapshot, not a screenshot. In extension mode, attaches to your real signed-in Chrome via the Chrome Web Store extension Playwright MCP Bridge.
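A sketch of those launch arguments as a flat array (the spawn wrapper around them is hypothetical; the three flags themselves are the ones the article quotes from acp-bridge/src/index.ts:1033):

```typescript
// Hypothetical sketch of the Playwright MCP launch args described in the text.
const playwrightArgs = [
  '--output-mode', 'file',               // snapshots are written to disk
  '--image-responses', 'omit',           // no base64 screenshots reach the model
  '--output-dir', '/tmp/playwright-mcp', // where the YAML snapshots land
];

console.log(playwrightArgs.join(' '));
```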
macos-use
A native binary bundled inside the Mac app at Contents/MacOS/mcp-server-macos-use. Reads the accessibility tree of whatever app you tell it (Finder, Settings, Mail, Notes, Preview, VSCode). Works on anything that responds to the AX API.
whatsapp
A separate native MCP binary for WhatsApp, because WhatsApp on Mac is a Catalyst app and its AX surface is quirky enough to deserve its own driver.
google-workspace
A Python MCP (bundled with its own venv and PYTHONHOME) that wraps the Gmail, Drive, and Calendar APIs. Credentials live in ~/.google_workspace_mcp/credentials/, owned by the user, not by a cloud service.
fazm_tools
The Fazm app's own API surface. Read-only SQL against the local database, screenshot capture for context (not for clicking), permission checks, ai-browser-profile queries, file indexing, skill installation.
User-added MCPs
After the five built-ins, the bridge reads ~/.fazm/mcp-servers.json and appends user-defined entries in the same shape Claude Code uses. A custom MCP for Linear, Jira, or a home-made API is a file away.
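A hypothetical entry in that file might look like the following. Only the five keys (name, command, args, env, enabled) come from the article; the top-level array, the Linear server, and its command are invented for illustration:

```json
[
  {
    "name": "linear",
    "command": "npx",
    "args": ["-y", "linear-mcp-server"],
    "env": { "LINEAR_API_KEY": "…" },
    "enabled": true
  }
]
```

Drop a file like this in place and, per the article, the agent gets the new tool on next launch with no code change.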
THE PIXEL REFUSAL
Why Fazm deliberately strips screenshots from the agent's view
This is the second half of the differentiation, and it is easy to miss. On the browser leg specifically, Fazm does not allow the model to see the page as an image.
The effect is that when the agent asks Playwright for a page snapshot, it gets a YAML file on disk with a hierarchical tree of elements, each with a stable structural reference like ref=e12. The agent says "click ref=e12". Playwright maps that ref to a DOM element and clicks it. No base64 image ever enters the context window.
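A hypothetical fragment of such a snapshot, to make the shape concrete (the page, element names, and nesting are invented; only the role-plus-ref convention comes from the article):

```yaml
# Hypothetical page snapshot; only the ref=eN convention is from the article.
- banner:
  - link "Home" [ref=e3]
- main:
  - heading "Contact us" [ref=e8]
  - textbox "Email address" [ref=e11]
  - button "Submit" [ref=e12]   # the agent says: click ref=e12
```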
What this buys you
- Context stays small, so multi-step tasks do not hit the token ceiling
- Clicks do not drift when a page scrolls or re-renders
- The same ref works across a second message, because it is in the snapshot file
- No OCR errors, no misreads of icon-only buttons
- The agent logic ports one-to-one to the native macos-use tool, which uses the same tree shape
WHAT HAPPENS ON A SINGLE REQUEST
One message, five tools on standby
rename and email flow
Against the two dominant categories
| Feature | browser-only tools | Fazm |
|---|---|---|
| How the agent sees the page | DOM selectors (Playwright/Selenium) or screenshot plus vision labels (Browser-Use) | YAML accessibility snapshot on disk, structural refs (ref=e1..eN) |
| What it can drive | Only the browser tab | Chrome, Finder, Mail, Notes, Settings, WhatsApp, any AX-compliant Mac app |
| Setup | pip/npm install, write a script, run it | Download signed Mac app, grant Accessibility permission, type in English |
| Session fingerprint | Fresh Chromium (no cookies) or a separate user profile | Real Chrome via Chrome Web Store extension, or fresh Chromium, user's choice |
| Architecture | One tool, one runtime | Five peer MCP servers plus user-defined extensions in ~/.fazm/mcp-servers.json |
| Pixel dependence | High for vision agents, none for code frameworks | None. --image-responses omit is explicit in the Playwright launch args |
THE SELF-TEST AT BOOT
How the app proves to itself that Accessibility is live
This is a small detail that says something about the architecture. Every time Fazm starts, it runs a real AX call against whatever app is frontmost, and if that fails it runs a secondary check against Finder, and if that fails it probes the live TCC database via a CGEvent tap. The primitive is treated as load-bearing, not decorative.
WHAT YOU ACTUALLY SEE IN THE LOGS
What boot looks like from the bridge side
THE NUMBERS
Counts you can check in the repo right now
- 5 built-in MCP servers (BUILTIN_MCP_NAMES)
- Bundled skills (BundledSkills/*.skill.md)
- 0 screenshots in the browser leg's context (--image-responses omit)
- AX self-test fallback chain: AX → Finder → EventTap
“All popular browser automation tools reviewed in the top SERP for this keyword are either code frameworks bound to a tab or vision agents clicking by pixel. Fazm is neither, by design.”
Fazm repo, acp-bridge/src/index.ts:1033 and :1266
HOW TO TRY IT
What installing this actually looks like
Download the Mac app
A signed, notarized .dmg from fazm.ai. Drag into /Applications. This is the same distribution shape as any other consumer Mac app, nothing Homebrew-adjacent.
Grant Accessibility and Screen Recording
The onboarding flow prompts for both and verifies they stuck. The AX self-test at Desktop/Sources/AppState.swift:433 runs on every launch and tells you if something regressed.
Optionally connect the real Chrome bridge
Install the Chrome Web Store extension Playwright MCP Bridge, paste the base64url token into Fazm. After that, the playwright leg drives your logged-in Chrome instead of a fresh Chromium. Skippable.
Type a sentence
No scripts, no selectors, no YAML. The agent picks which of the five peers to call, and you watch it happen against the apps you already use.
Want a walkthrough across the five tools?
20 minutes, we run one request that touches Chrome, Finder, and Gmail in a single agent turn.
Book a call →
Frequently asked questions
What makes Fazm a different kind of browser automation tool?
Two things. First, the browser is not the whole product. Fazm registers five built-in MCP servers at runtime (fazm_tools, playwright, macos-use, whatsapp, google-workspace) and the browser is one of them, not the top of the hierarchy. You can watch this directly in acp-bridge/src/index.ts at line 1266, where BUILTIN_MCP_NAMES is literally a new Set with those five names. Second, the browser leg does not operate on screenshots. Playwright is launched with --image-responses omit (acp-bridge/src/index.ts:1033), so the agent reads a YAML accessibility snapshot of the page and clicks by structural reference, not by pixel coordinates.
How is this different from Playwright, Puppeteer, or Selenium?
Those are libraries. You import a package, write code, run it in CI or from a terminal. They do exactly what you programmed, and they only drive a browser. Fazm is a signed, notarized macOS app you download and double-click. You type in English, a local AI agent decides which of the five MCP tools to call, and one of those tools is Playwright under the hood. The difference is not about raw capability on a single tab. It is that the tab is one surface in a larger plane and you are not writing code to cross it.
How is this different from Browser-Use, Nanobrowser, or other vision-based browser agents?
Browser-Use style agents take a screenshot of the page, ask a vision LLM to label the buttons, and click by pixel coordinates. That works until the page re-renders, scrolls, or the model mislabels a control. Fazm asks Playwright and macos-use for the accessibility tree of the current target, gets back a YAML snapshot with numbered refs (ref=e1, ref=e2), and asks the language model to pick a ref. No OCR, no vision step. On the Mac side the same approach uses AXUIElementCreateApplication and AXUIElementCopyAttributeValue against whatever app is frontmost, which you can see in Desktop/Sources/AppState.swift:439.
What do you mean by 'accessibility tree' and why does it matter here?
Every macOS app (including Chrome, Safari, Finder, Mail, Notes, Settings, VSCode, and Catalyst apps like WhatsApp) publishes a typed tree of UI elements to the system accessibility API. Each element has a role (button, text field, menu item), a label, bounds, and children. That tree is the same data a screen reader uses. Fazm uses it as the canonical ground truth for what is on screen, because it is structured, it does not change when a user scrolls, and it maps to a stable click target. Screenshots are lossy and nondeterministic. The tree is neither.
Is Fazm a developer framework?
No. There is no CLI, no package to install, no YAML config, no test runner. You download a signed Mac app, it walks you through granting Accessibility and Screen Recording permissions, and you talk to it. Under the hood it is running the Claude Agent SDK with five bundled MCP servers, but you do not see any of that unless you go read the source.
What are the five built-in MCP servers the agent can call?
They are hardcoded in acp-bridge/src/index.ts at line 1266 as BUILTIN_MCP_NAMES = new Set(['fazm_tools', 'playwright', 'macos-use', 'whatsapp', 'google-workspace']). fazm_tools is Fazm's own helper API (database queries, screenshots for context, permissions, the ai-browser-profile). playwright is the browser leg. macos-use is a native bundled binary at Contents/MacOS/mcp-server-macos-use that drives any AX-compliant Mac app. whatsapp is a dedicated native MCP for the WhatsApp Catalyst app. google-workspace is a Python MCP for Gmail, Drive, and Calendar APIs. They are peers. The agent picks whichever one fits the sentence you typed.
Why would I want something other than a browser driver?
Because the tasks that hurt most are the ones that cross apps. Pull a receipt from Gmail, upload it to a vendor portal in Chrome, rename the file in Finder, log the amount in a Numbers sheet. Any browser-only tool stops being useful the moment the task leaves the tab, and you are back to copy-pasting. A tool that treats the browser as one peer automates the whole chain in one request.
Does it actually strip screenshots from the agent's view?
Yes, explicitly. In acp-bridge/src/index.ts on line 1033 the Playwright MCP is started with the flags --output-mode file --image-responses omit --output-dir /tmp/playwright-mcp. The YAML snapshot is written to a file on disk that the agent can reference, and inline base64 screenshots are dropped before they reach the model context. This is a deliberate architecture choice, not a side effect. Context stays clean and the model picks moves by ref=eN instead of guessing at pixels.
Does the browser leg use my real Chrome or a fresh Chromium?
Both modes are supported. If PLAYWRIGHT_USE_EXTENSION is set to true (see acp-bridge/src/index.ts around line 1029), the Playwright MCP attaches to the Chrome you are already signed into via the Chrome Web Store extension 'Playwright MCP Bridge'. If not, Playwright launches its own headless Chromium. Real-Chrome mode is the upgrade: your cookies, SSO sessions, and browser fingerprint come along. But even fresh-Chromium mode still benefits from the structured-snapshot approach.
How does Fazm verify that the Accessibility API actually works at boot?
Inside Desktop/Sources/AppState.swift there is a function called testAccessibilityPermission (line 433) that calls AXUIElementCreateApplication on whatever app is frontmost, then reads kAXFocusedWindowAttribute back with AXUIElementCopyAttributeValue. If that round trip fails with .cannotComplete, the code does a secondary check against Finder (lines 468 to 485) to distinguish a real permission problem from an app that does not implement AX. If Finder also fails, it falls back to a CGEvent tap probe that checks the live TCC database, bypassing the per-process cache that can go stale on macOS 26. That self-test is why Accessibility as a primitive is trustworthy in practice.
Can I add my own MCP servers on top of the five built-ins?
Yes. Right after the five built-ins are registered, acp-bridge/src/index.ts reads ~/.fazm/mcp-servers.json and appends any entries with the same shape Claude Code uses: name, command, args, env, enabled. If you write a custom MCP for, say, Linear or Jira, Fazm picks it up on next launch with no code change and the agent gets a new tool immediately.
Why is this a consumer app and not a developer tool?
Because the user surface is the Mac app and English sentences, not a library import. The target user is somebody who wants their Mac to automate a thing, not somebody who wants to write a scraper. The fact that there is Playwright inside is an implementation detail. You would not describe Apple Shortcuts as 'a workflow engine wrapping AppleScript, Automator, and Focus'. Same framing here.
Adjacent reads
Automation web browser, the snapshot-first approach
How the playwright leg reads the page as YAML, clicks by ref, and leaves the tab without losing state.
Browser automation extension, the real-Chrome bridge
The Chrome Web Store listing, the token handshake, and why a real signed-in Chrome matters.
Advantages of business process automation
Where cross-app agents save time that browser-only tools cannot.