TWO ACCESSIBILITY TREES / ONE SESSION / NO PIXEL CLICKS

Browser testing automation that does not stop at the edge of the tab

Every playbook on this topic compares Selenium, Cypress, Playwright, and BrowserStack. All four are locked inside the browser viewport by construction. The problem is that real product flows are not. An upload opens a native file picker. A login triggers Keychain. An export drops a file in Finder. A notification surfaces in desktop Slack. Fazm is built around a different premise: the same agent turn should be able to click across the boundary, because macOS already exposes an accessibility tree for every one of those surfaces. You just need a runtime that reads both.

Matthew Diakonov, Fazm

Published April 23, 2026 · 11 min read

Written from the Fazm source tree

Playwright MCP on the page

macos-use MCP on the desktop

Aria refs, not pixels

Real Chrome via extension

Signed consumer Mac app

Browser testing that crosses the window

One agent, two accessibility trees, same aria vocabulary

Playwright MCP for the page

macos-use MCP for the OS

Hand off mid-flow when the test leaves the tab

Every click resolved against a role tree, never a pixel

THE PROBLEM

Browser-only runners are a partial answer

A browser automation tool that cannot click outside the browser has a shape problem, not a feature problem. No amount of waits, retries, or selector hardening lets Cypress dismiss a native Keychain prompt. No Playwright locator reaches the macOS file picker that pops over Chrome the instant you click an upload button. BrowserStack can reach a remote Windows VM, but its world is still one viewport. The moment the flow under test leaves the tab, the runner is guessing.

What you actually want is the same aria vocabulary on both sides: page and desktop. Accessibility roles already exist on macOS (AXButton, AXTextField, AXMenuItem, AXWindow), and they already exist on the web (button, textbox, menuitem, dialog). They were designed, originally, for screen readers. They happen to be a perfect automation surface.
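A minimal sketch of what "the same aria vocabulary on both sides" can mean in practice. The lookup table holds only the four role pairs named in this article, and the `RoleTarget` type and `fromAxElement` helper are hypothetical names for illustration, not Fazm's actual code.

```typescript
// Hypothetical lookup covering only the role pairs named in this article.
const AX_TO_ARIA: Record<string, string> = {
  AXButton: "button",
  AXTextField: "textbox",
  AXMenuItem: "menuitem",
  AXWindow: "dialog",
};

// A surface-neutral target: what an agent could reason about on either side.
interface RoleTarget {
  surface: "page" | "desktop";
  role: string; // normalized to the web aria vocabulary where a mapping exists
  name: string; // accessible name on the web, title on macOS
}

// Normalize a macOS AX element into the shared vocabulary.
function fromAxElement(axRole: string, title: string): RoleTarget {
  // Unknown AX roles pass through unchanged rather than being guessed.
  const role = AX_TO_ARIA[axRole] ?? axRole;
  return { surface: "desktop", role, name: title };
}

// Example: the "Choose" button in a native file picker.
const choose = fromAxElement("AXButton", "Choose");
// choose.role is "button", the same word the page side uses for "Upload".
```

Passing unmapped roles through untouched keeps the sketch honest: it never invents an equivalence the two trees do not actually share.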

AXButton · role=button · AXTextField · role=textbox · AXMenuItem · role=menuitem · AXWindow · role=dialog · ref=e14 · /tmp/playwright-mcp · /tmp/macos-use · mmlmfjhmonkocbjadbfplnigmagldckm

HOW IT IS WIRED

Five MCP servers per session, two of them accessibility-based

Every Fazm agent session calls buildMcpServers in acp-bridge/src/index.ts and boots five built-in MCP servers. Three of them are product plumbing (fazm_tools, whatsapp, google-workspace). Two are the ones that matter for this topic.

One agent turn, two accessibility surfaces

acp-bridge/src/index.ts (lines 1027 to 1064)

The three flags on the Playwright line (--output-mode file, --image-responses omit, --output-dir /tmp/playwright-mcp) are not incidental. They are what forces structural targeting. Playwright MCP writes the page as a YAML aria snapshot with ref=eN identifiers to disk on every capture, and strips base64 screenshots out of the model context. The macos-use binary alongside it does the same thing for the rest of the Mac: its traversals are saved as .txt files in /tmp/macos-use/ with role, title, and explicit x, y, w, h pulled from the AX tree.
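The registration described above can be sketched roughly as follows. This is not the code from acp-bridge/src/index.ts: the `McpServerSpec` and `BuildOptions` types, the `buildAccessibilityServers` name, and the `playwrightCommand` parameter are assumptions for illustration. Only the flag values, the `--extension` switch, and the two server names come from this article.

```typescript
// Hypothetical shapes; the real types in acp-bridge/src/index.ts may differ.
interface McpServerSpec {
  name: string;
  command: string;
  args: string[];
}

interface BuildOptions {
  playwrightCommand: string; // assumption: however the bridge launches Playwright MCP
  useExtension: boolean;     // mirrors PLAYWRIGHT_USE_EXTENSION being set to true
  macosUseBinary?: string;   // path to the native binary, set only if it exists on disk
}

function buildAccessibilityServers(opts: BuildOptions): McpServerSpec[] {
  const servers: McpServerSpec[] = [];

  const playwrightArgs: string[] = [];
  // Attach to the running Chrome instead of launching a fresh browser.
  if (opts.useExtension) playwrightArgs.push("--extension");
  // The three flags that push the agent toward structural targeting.
  playwrightArgs.push(
    "--output-mode", "file",
    "--image-responses", "omit",
    "--output-dir", "/tmp/playwright-mcp",
  );
  servers.push({ name: "playwright", command: opts.playwrightCommand, args: playwrightArgs });

  // The desktop server is registered only when the binary is available.
  if (opts.macosUseBinary !== undefined) {
    servers.push({ name: "macos-use", command: opts.macosUseBinary, args: [] });
  }
  return servers;
}
```

In this sketch the caller is responsible for checking that the binary exists before passing its path, which keeps the function pure and easy to test.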

5 built-in MCP servers per agent session

2 that expose a live accessibility tree

0 pixel comparisons in the assertion path

WALKTHROUGH

A flow that crosses the boundary, step by step

Consider a real one. The test sentence is: "upload the monthly report CSV on our admin dashboard, wait for the PDF to generate, confirm it landed in Downloads, then check that the ops Slack channel got the success ping." In a browser-only runner, step two is flaky, step three is unwritable, and step four needs a completely separate tool. In Fazm, it is one session.

One session, two MCPs, four surfaces

Every handoff is symmetric. The agent does not know or care that the Upload click resolved through Playwright's web aria tree and the Choose click resolved through macOS AXUIElement. It sees role, name, and a ref, reasons against the tree, and picks the next step. That is the point.
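One way to picture the walkthrough is as a plain list of steps, each tagged with the server that resolves it. The sketch below is illustrative only: the `FlowStep` type and `countHandoffs` helper are hypothetical, and the step breakdown follows the upload, file picker, toast, Downloads, Slack sequence described in this article.

```typescript
type Server = "playwright" | "macos-use";

interface FlowStep {
  server: Server;
  action: string;
}

// The monthly-report flow, one entry per agent step.
const reportFlow: FlowStep[] = [
  { server: "playwright", action: "click Upload on the admin dashboard" },
  { server: "macos-use", action: "select the CSV in the native file picker" },
  { server: "playwright", action: "wait for the in-page toast" },
  { server: "macos-use", action: "confirm the PDF in ~/Downloads via Finder" },
  { server: "macos-use", action: "read the latest message in desktop Slack" },
];

// Count how many times the flow crosses the browser/desktop boundary.
function countHandoffs(steps: FlowStep[]): number {
  let handoffs = 0;
  for (let i = 1; i < steps.length; i++) {
    if (steps[i].server !== steps[i - 1].server) handoffs++;
  }
  return handoffs;
}

const crossings = countHandoffs(reportFlow); // 3
```

Three boundary crossings in a five-step flow is the part a browser-only runner cannot express at all.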

WHAT BROWSER-ONLY RUNNERS MISS

The cross-surface checks that keep breaking in CI

These are the real flakes. They do not get mentioned in any comparison between Selenium, Cypress, and Playwright because none of those tools can write the check at all. That is the gap Fazm is built to fill.

Native file pickers

Click Upload → native Open dialog → navigate and select. Fazm hands off to macos-use for the dialog, Playwright resumes on the resulting page toast.

Keychain unlock

When a page triggers a Keychain prompt, browser runners hang. macos-use reads the prompt from AXUIElement and can dismiss or approve.

Password manager

1Password and Bitwarden popups float over Chrome but are not inside it. AX tree sees them; viewport runners do not.

Downloads verification

'Did the CSV actually land on disk?' is a filesystem question. macos-use opens Finder at ~/Downloads, reads the list, confirms the file.

Desktop Slack / Notion

In-product notifications often arrive via the desktop client, not email. Fazm activates the app and reads the latest message via AX.

OS permission prompts

Screen Recording, Microphone, Full Disk Access. These prompts are modal over the entire OS. macos-use handles them in the same agent turn.

RUNTIME TRACE

What the session actually writes to disk

After any Fazm session that touched the browser and the desktop, two directories get populated. Both are auditable. Both are plain text. You can open them, diff them across runs, and see exactly what the agent saw.

ls /tmp/playwright-mcp and /tmp/macos-use after a cross-surface run

HOW FAZM PICKS A SIDE

The decision the agent makes on every turn

The agent does not toggle a flag between "browser mode" and "desktop mode". Both MCP servers are live for the whole session. The turn-level prompt (ChatPrompts.swift in the Fazm source) lays out the split as a routing rule, not a state switch.

Step is about a web page

Navigate, click a link in Chrome, fill a form, read a page. Agent picks Playwright. Snapshots land as YAML in /tmp/playwright-mcp. Target by role plus accessible name plus ref=eN.

Step is about a desktop app

Finder, System Settings, Mail, a native file picker, a Keychain prompt, Slack. Agent picks macos-use. Traversals land as .txt in /tmp/macos-use alongside a screenshot for a sanity check. Target by role + title + explicit x, y, w, h.

Step is about screenshots specifically

If a human-visible screenshot is the output, ALWAYS use the native capture tool (screen or window), NEVER browser_take_screenshot. Browser screenshot only sees the viewport, not the desktop. This is spelled out verbatim in ChatPrompts.swift line 66.

Step needs tab hygiene

Before navigating to a new URL, the agent checks tabs with browser_tabs action 'list', matches by domain, and reuses or switches rather than opening a new tab. Tabs the agent opened during the session get closed when the turn ends. Written explicitly into the prompt at line 73.
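The routing rules above are prompt text, not code, but they can be approximated as a small function to make the split concrete. Everything here is a sketch under assumptions: `StepKind`, `Route`, `routeStep`, `Tab`, and `findReusableTab` are hypothetical names, "native-capture" stands in for whatever the native screenshot tool is called, and matching by exact hostname is one simple reading of "matches by domain".

```typescript
type StepKind = "web-page" | "desktop-app" | "screenshot";
type Route = "playwright" | "macos-use" | "native-capture";

// Routing rule, not a mode switch: both servers stay live the whole session.
function routeStep(kind: StepKind): Route {
  switch (kind) {
    case "web-page":
      return "playwright";
    case "desktop-app":
      return "macos-use";
    case "screenshot":
      // A browser screenshot only sees the viewport, so never route it there.
      return "native-capture";
  }
}

interface Tab {
  index: number;
  url: string;
}

// Tab hygiene: reuse an open tab on the same host instead of opening a new one.
function findReusableTab(tabs: Tab[], targetUrl: string): Tab | undefined {
  const host = new URL(targetUrl).hostname;
  return tabs.find((tab) => {
    try {
      return new URL(tab.url).hostname === host;
    } catch {
      return false; // ignore tabs whose URL cannot be parsed
    }
  });
}

const openTabs: Tab[] = [
  { index: 0, url: "https://mail.google.com/mail/u/0/" },
  { index: 1, url: "https://admin.example.com/reports" },
];
const reuse = findReusableTab(openTabs, "https://admin.example.com/upload");
// reuse is the tab at index 1, so the agent would switch rather than open a new tab.
```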

Fazm vs a browser-only runner

Same check, different surface coverage

Feature	Playwright / Selenium / Cypress	Fazm
Click inside the page by aria role	Yes	Yes, via Playwright MCP
Click the native file picker after upload	Blocked at the browser boundary	Yes, via macos-use MCP
Confirm a downloaded file landed on disk	Needs a separate filesystem runner	Yes, via macos-use + Finder AX tree
Read a desktop Slack notification	Not supported	Yes, via macos-use on the native Slack app
Handle Keychain / 1Password / SSO popups	Hangs or needs manual intervention	Yes, AX tree sees the modal
Run against real logged-in Chrome session	Separate profile, custom launch args	Yes, via extension id mmlmfjhmonkocbjadbfplnigmagldckm
Structural targeting (role + name + ref)	CSS or XPath selectors	aria role + accessible name + ref=eN, on both surfaces
Screenshot-based targeting	Optional (visual regression tools)	Disabled by default (--image-responses omit)

ANCHOR FACT

The five MCP servers, verbatim

If you doubt the architecture claim, this constant is the receipt. It is declared in acp-bridge/src/index.ts at line 1266 and guards the branch that separates built-in from user-configured MCP servers.

acp-bridge/src/index.ts (line 1266)

Three of them (playwright, macos-use, whatsapp) are accessibility-based automation surfaces. The second and third read the same underlying macOS AX tree; the first reads the page-level aria tree that Chrome already exposes for screen readers. One idiom, three surfaces, one agent turn.

NOT A FRAMEWORK

This is a signed Mac app, not an npm package

There is no import, no chromedriver, no Playwright install step, no CI job to configure. You download Fazm from fazm.ai, grant Accessibility permission once, install the Chrome extension (id mmlmfjhmonkocbjadbfplnigmagldckm) through the onboarding window, and type your check in plain English. The agent picks from five built-in MCP servers for every step.

This is the consumer-app side of the tradeoff. You give up the frozen, repeat-for-every-commit discipline of a CI test suite, and you get back the ability to verify cross-surface flows on your real machine, against your real cookies, in one run.

WHO THIS IS FOR

Real people, real flows

If any of these describe your last broken automation, you are the target

A test that worked in CI but breaks when a user actually runs it, because a native file picker opened over Chrome
A flow that needs 1Password or Keychain to unlock, and your runner freezes at the OS prompt
An assertion like 'the PDF really did land in ~/Downloads' that no browser-only runner can write
A cross-app verification: fire an action on the web, confirm the desktop Slack, Notion, or Mail client received the side effect
Any automation where the selector matters less than the real aria role the screen reader would hit
A team that wants to verify an agent can actually complete a weekly workflow on a real machine, not in a sandbox

0 screenshots

“aria role plus accessible name is stable across scroll, theme, viewport size, and most minor UI tweaks. A pixel-based agent can click at (420, 710) on the first run and (418, 702) on the second.”

From the Fazm engineering notes on why both surfaces use structural targeting

Frequently asked questions

What makes this different from Selenium, Cypress, or a Playwright script?

Selenium, Cypress, and a Playwright script are all sandboxed to the browser viewport. The moment a flow hits a native file dialog, a Keychain unlock, a 1Password quick-access popup, or a desktop Slack notification, they are blind. Fazm is not a library you import; it is a signed Mac app whose agent runtime registers five built-in MCP servers per session (fazm_tools, playwright, macos-use, whatsapp, google-workspace) at acp-bridge/src/index.ts line 1266. Two of them, playwright and macos-use, both expose an aria-style accessibility tree. So the same agent turn that clicks 'Upload' inside the browser can also click 'Choose' inside the macOS file dialog on the very next step, and it sees both as structured role trees, not pixels.

Where in the code are the two accessibility surfaces actually registered?

In acp-bridge/src/index.ts, the function buildMcpServers (line 992) is called for every ACP session. Starting at line 1027 it pushes the Playwright MCP server with the exact flags --output-mode file --image-responses omit --output-dir /tmp/playwright-mcp on line 1033. Then at line 1057, if the macosUseBinary exists on disk, it pushes a second MCP server named macos-use which runs a native binary that speaks macOS AXUIElement directly. Both servers are live for the whole session. The agent picks which one to call based on what the proposed step looks like: page-level means Playwright, OS-level means macos-use.

Why is skipping screenshots a feature, not a limitation?

Because pixel comparisons are fragile: a check that matches images can fail on changes that have nothing to do with behavior. The three flags on line 1033 (--output-mode file, --image-responses omit, --output-dir /tmp/playwright-mcp) cooperate to make the Playwright MCP server write its aria snapshots to YAML files on disk and strip inline base64 screenshots from the model context. That forces the agent to pick its next click against a stable structural tree (role + accessible name + ref=eN) instead of an image. Re-runs on the same page pick the same refs. Scrolling, theme flips, banner shifts, and 1 px layout jitter do not change role trees, so they do not change test outcomes.
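To make the structural-targeting argument concrete, here is a minimal sketch of resolving a target by role and accessible name in a tree. The `AriaNode` shape and `findByRoleAndName` helper are hypothetical, not Playwright MCP's internal representation. The two sample trees hold the same nodes in a different order, which stands in for a layout shift.

```typescript
interface AriaNode {
  role: string;
  name: string;
  ref: string;
  children?: AriaNode[];
}

// Depth-first search for the first node matching role and accessible name.
function findByRoleAndName(root: AriaNode, role: string, name: string): AriaNode | undefined {
  if (root.role === role && root.name === name) return root;
  for (const child of root.children ?? []) {
    const hit = findByRoleAndName(child, role, name);
    if (hit) return hit;
  }
  return undefined;
}

const upload: AriaNode = { role: "button", name: "Upload", ref: "e14" };
const banner: AriaNode = { role: "banner", name: "Maintenance tonight", ref: "e9" };

// Same nodes, different visual order: a layout shift that moves pixels around.
const firstRun: AriaNode = { role: "main", name: "", ref: "e1", children: [upload, banner] };
const secondRun: AriaNode = { role: "main", name: "", ref: "e1", children: [banner, upload] };

const a = findByRoleAndName(firstRun, "button", "Upload");
const b = findByRoleAndName(secondRun, "button", "Upload");
// Both lookups resolve to the same node, even though its position changed.
```

A coordinate-based lookup would need new numbers after the reorder. The role and name lookup does not notice it.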

How does it hook into my real Chrome rather than a fresh Chromium?

Via a Chrome Web Store extension with the literal id mmlmfjhmonkocbjadbfplnigmagldckm (Desktop/Sources/BrowserExtensionSetup.swift line 678). When the environment variable PLAYWRIGHT_USE_EXTENSION is set to true, acp-bridge/src/index.ts lines 1029 to 1031 append --extension to the Playwright CLI args, which tells Playwright MCP to attach to the running Chrome session rather than launching its own. That means the tests run against your actual logged-in cookies, real 2FA state, and real extensions. A test that says 'check my Gmail inbox for the verification code' can just work, because the cookie is there.

What does a cross-surface flow actually look like end to end?

Take a typical broken flow: 'upload a CSV on the admin dashboard, verify that the generated PDF landed in ~/Downloads, then check Slack for the success notification'. Browser-only tools handle step one and give up on two and three. Fazm handles all three: Playwright MCP clicks the visible Upload button in the dashboard, the OS file picker opens, macos-use takes over and navigates the dialog to the CSV, Playwright watches for the in-page toast, then macos-use opens Finder, confirms the PDF exists in ~/Downloads, and finally opens the Slack desktop app to read the latest notification in the target channel. Same agent, same aria idiom, different MCP servers.

Is this a framework or a product I install?

It is a product. Fazm is a signed Mac app you download from fazm.ai and double-click. There is no npm install, no driver binary to keep in sync with Chrome, no test runner to configure, no CI hookup required. You describe the check in plain English, the agent picks from its five MCP servers, and every proposed step is resolved against an accessibility tree. That is the tradeoff: the cycle time of typing English beats writing code, but you are not building a frozen regression suite for CI. If you need 500 deterministic tests on every pull request, keep Playwright in CI and use Fazm for the flows a script cannot reach.

What counts as a failure mode this setup catches that a browser-only runner does not?

Several. One: the test flow requires an OS-level permission prompt (Screen Recording, Microphone, Full Disk Access) that cannot be dismissed from the browser. Two: the flow redirects to a native SSO prompt hosted by Keychain or 1Password. Three: the flow depends on an attachment actually existing on disk after download, and the browser runner has no filesystem checker. Four: the flow posts to desktop Slack via a native notification, and the assertion is 'did the notification actually arrive'. Five: any flow where the real browser extension ecosystem matters (ad blocker, password manager, MCP bridge). Fazm catches all five because macos-use talks to the real macOS accessibility tree, not a headless sandbox.

Where do the aria snapshots land, and can I audit them after a run?

They land in /tmp/playwright-mcp as YAML files. Each file is a full role tree snapshot of the page at the moment the agent captured it, with ref=eN identifiers on every element. After a session, you can cd /tmp/playwright-mcp and open the files directly to see exactly which roles the agent reasoned against. That is the closest thing to a deterministic replay log without instrumenting your own page. Native macOS accessibility traversals from macos-use are logged the same way, as .txt files in /tmp/macos-use/, each with a timestamped screenshot alongside for visual verification.
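Auditing those files across runs can be as simple as pulling out role, name, and ref from each line and comparing. The sketch below assumes snapshot lines shaped like `- button "Upload" [ref=e14]`; treat that line shape, and the `parseSnapshot` and `refsOnlyIn` helpers, as assumptions to check against the files on your own machine.

```typescript
// Assumed line shape: `- button "Upload" [ref=e14]` (the name is optional).
const REF_LINE = /^\s*-\s+(\w+)(?:\s+"([^"]*)")?.*\[ref=(e\d+)\]/;

interface SnapshotEntry {
  role: string;
  name: string;
  ref: string;
}

// Pull every referenced element out of a snapshot's text.
function parseSnapshot(text: string): SnapshotEntry[] {
  const entries: SnapshotEntry[] = [];
  for (const line of text.split("\n")) {
    const m = REF_LINE.exec(line);
    if (m) entries.push({ role: m[1], name: m[2] ?? "", ref: m[3] });
  }
  return entries;
}

// Refs of elements present in `a` whose role and name do not appear in `b`.
function refsOnlyIn(a: SnapshotEntry[], b: SnapshotEntry[]): string[] {
  const inB = new Set(b.map((e) => `${e.role}|${e.name}`));
  return a.filter((e) => !inB.has(`${e.role}|${e.name}`)).map((e) => e.ref);
}

const runA = [
  "- generic [ref=e1]:",
  '  - heading "Reports" [level=1] [ref=e2]',
  '  - button "Upload" [ref=e3]',
].join("\n");
const runB = [
  "- generic [ref=e1]:",
  '  - heading "Reports" [level=1] [ref=e2]',
].join("\n");

const missingInB = refsOnlyIn(parseSnapshot(runA), parseSnapshot(runB)); // ["e3"]
```

Comparing on role and name rather than on the ref itself means the diff reports what disappeared from the page, not which identifier happened to be assigned.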

Does the agent run in the browser tab I am watching, or in a separate one?

In the one you are watching. Once the Chrome extension is connected, Playwright MCP attaches to your real running Chrome. Fazm injects a visible overlay (a full-viewport div with id 'fazm-overlay' at z-index 2147483647, defined in acp-bridge/browser-overlay-init.js) with a centered pill reading 'Browser controlled by Fazm'. The overlay uses pointer-events:none, so you can still click, switch tabs, and keep working while the agent operates. Every page it touches gets this overlay so you can always tell whether the agent has control.

Can I still use Playwright, Selenium, or Cypress alongside Fazm?

Yes, and you probably should. The two modes solve different problems. Playwright in CI gives you frozen, deterministic regression on every pull request. Fazm on your Mac gives you a way to check a real end-to-end flow against your actual signed-in sessions, today, across both the browser and the rest of the desktop. If your production automation contract is 'ship with green CI', keep Playwright. If you also need 'verify this cross-app flow really works on my machine before I ship', Fazm complements it. They do not replace each other.

Does the macos-use side ever fall back to screenshots?

Only for verification, never for targeting. macos-use targets elements from the accessibility tree output, which is stored as a plain .txt file per traversal. Each element has role, title, and explicit x, y, w, h coordinates read from the AX tree itself. Clicks are auto-centered at (x+w/2, y+h/2). A screenshot is saved alongside only so a human (or the agent) can visually double-check that the action had the intended effect. Targeting is still structural; the screenshot is a sanity check, not a selector.
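The centering rule is simple arithmetic, sketched here in TypeScript. The `AxElement` shape mirrors the fields the article lists (role, title, x, y, w, h), while the on-disk .txt format, the `findElement` helper, and the sample coordinates are assumptions for illustration.

```typescript
interface AxElement {
  role: string;
  title: string;
  x: number; // left edge, as read from the AX tree
  y: number; // top edge
  w: number; // width
  h: number; // height
}

// Select structurally first: role plus title, never a raw coordinate.
function findElement(elements: AxElement[], role: string, title: string): AxElement | undefined {
  return elements.find((e) => e.role === role && e.title === title);
}

// Then derive the click point from the element's own frame: (x + w/2, y + h/2).
function clickPoint(el: AxElement): { x: number; y: number } {
  return { x: el.x + el.w / 2, y: el.y + el.h / 2 };
}

const traversal: AxElement[] = [
  { role: "AXButton", title: "Cancel", x: 500, y: 400, w: 80, h: 24 },
  { role: "AXButton", title: "Choose", x: 600, y: 400, w: 80, h: 24 },
];

const chooseButton = findElement(traversal, "AXButton", "Choose");
const point = chooseButton ? clickPoint(chooseButton) : undefined;
// point is { x: 640, y: 412 }
```

The coordinates are an output of the structural lookup, not an input to it, which is why a moved window changes the click point without changing which element gets clicked.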

What do I actually need on my machine for this to work?

A Mac, Google Chrome, the signed Fazm app, and Accessibility permission granted in System Settings. At first launch the app runs a three-stage Accessibility self-test (testAccessibilityPermission in Desktop/Sources/AppState.swift) to confirm the runtime is alive before the agent attempts any work, then opens the Playwright MCP Bridge onboarding window to install the Chrome extension with id mmlmfjhmonkocbjadbfplnigmagldckm. After that, no drivers, no PATH edits, no test runner config. The agent uses whichever MCP server the step needs, per turn.
