Browser testing automation that does not stop at the edge of the tab
Every playbook on this topic compares Selenium, Cypress, Playwright, and BrowserStack. All four are locked inside the browser viewport by construction. The problem is that real product flows are not. An upload opens a native file picker. A login triggers Keychain. An export drops a file in Finder. A notification surfaces in desktop Slack. Fazm is built around a different premise: the same agent turn should be able to click across the boundary, because macOS already exposes an accessibility tree for every one of those surfaces. You just need a runtime that reads both.
THE PROBLEM
Browser-only runners are a partial answer
A browser automation tool that cannot click outside the browser has a shape problem, not a feature problem. No amount of waits, retries, or selector hardening lets Cypress dismiss a native Keychain prompt. No Playwright locator reaches the macOS file picker that pops over Chrome the instant you click an upload button. BrowserStack can reach a remote Windows VM, but its world is still one viewport. The moment the flow under test leaves the tab, the runner is guessing.
What you actually want is the same aria vocabulary on both sides: page and desktop. Accessibility roles already exist on macOS (AXButton, AXTextField, AXMenuItem, AXWindow), and they already exist on the web (button, textbox, menuitem, dialog). They were designed, originally, for screen readers. They happen to be a perfect automation surface.
HOW IT IS WIRED
Five MCP servers per session, two of them accessibility-based
Every Fazm agent session calls buildMcpServers in acp-bridge/src/index.ts and boots five built-in MCP servers. Three of them are product plumbing (fazm_tools, whatsapp, google-workspace). Two are the ones that matter for this topic.
One agent turn, two accessibility surfaces
The three flags passed to the Playwright MCP server (--output-mode file, --image-responses omit, --output-dir /tmp/playwright-mcp) are not incidental. They are what forces structural targeting. On every capture, Playwright MCP writes the page to disk as a YAML aria snapshot with ref=eN identifiers, and strips base64 screenshots out of the model context. The macos-use binary alongside it does the same thing for the rest of the Mac: its traversals are saved as .txt files in /tmp/macos-use/ with role, title, and explicit x, y, w, h pulled from the AX tree.
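As a rough picture of that registration, here is a minimal sketch. It is not the contents of acp-bridge/src/index.ts: the McpServerConfig shape, the buildAccessibilityServers name, and the npx launch command are assumptions made for illustration. Only the three flags, the output directory, and the two server names come from the text.

```typescript
// Hypothetical config shape; the real type in acp-bridge may differ.
interface McpServerConfig {
  name: string;
  command: string;
  args: string[];
}

// The three flags quoted in the text, as flag/value pairs.
const PLAYWRIGHT_FLAGS: string[] = [
  "--output-mode", "file",
  "--image-responses", "omit",
  "--output-dir", "/tmp/playwright-mcp",
];

// Assumed helper: returns only the two accessibility-based servers.
// `macosUseBinary` is the path to the native binary, or null when it is absent.
function buildAccessibilityServers(macosUseBinary: string | null): McpServerConfig[] {
  const servers: McpServerConfig[] = [
    // The launch command is an assumption; the text only names the flags.
    { name: "playwright", command: "npx", args: ["@playwright/mcp", ...PLAYWRIGHT_FLAGS] },
  ];
  // macos-use is registered only when the binary exists on disk.
  if (macosUseBinary !== null) {
    servers.push({ name: "macos-use", command: macosUseBinary, args: [] });
  }
  return servers;
}
```

Keeping the flags in one list makes it easy to see that nothing in the Playwright configuration asks for inline images.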
WALKTHROUGH
A flow that crosses the boundary, step by step
Consider a real one. The test sentence is: "upload the monthly report CSV on our admin dashboard, wait for the PDF to generate, confirm it landed in Downloads, then check that the ops Slack channel got the success ping." In a browser-only runner, step two is flaky, step three is unwritable, and step four needs a completely separate tool. In Fazm, it is one session.
One session, two MCPs, four surfaces
Every handoff is symmetric. The agent does not know or care that the Upload click resolved through Playwright's web aria tree and the Choose click resolved through macOS AXUIElement. It sees role, name, and a ref, reasons against the tree, and picks the next step. That is the point.
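To make the symmetry concrete, here is a minimal sketch of one lookup that works on either tree. The AxNode shape, the findRef helper, and the sample trees are illustrative assumptions, not Fazm types. In particular, the desktop ref is a placeholder: the text describes macos-use targets as role, title, and coordinates.

```typescript
// One node shape for both surfaces. Field names are illustrative.
interface AxNode {
  role: string;       // "button" on the web, "AXButton" on macOS
  name: string;       // accessible name (web) or AX title (macOS)
  ref: string;        // e.g. "e12" in a Playwright snapshot; placeholder on macOS
  children: AxNode[];
}

// Depth-first search by role and name. Returns the first match's ref, or null.
function findRef(root: AxNode, role: string, name: string): string | null {
  if (root.role === role && root.name === name) return root.ref;
  for (const child of root.children) {
    const hit = findRef(child, role, name);
    if (hit !== null) return hit;
  }
  return null;
}

// Made-up sample trees: a web page and a native Open dialog.
const samplePage: AxNode = {
  role: "main", name: "Admin dashboard", ref: "e1",
  children: [{ role: "button", name: "Upload", ref: "e12", children: [] }],
};
const sampleDialog: AxNode = {
  role: "AXWindow", name: "Open", ref: "ax1",
  children: [{ role: "AXButton", name: "Choose", ref: "ax7", children: [] }],
};

const uploadRef = findRef(samplePage, "button", "Upload");     // web side
const chooseRef = findRef(sampleDialog, "AXButton", "Choose"); // desktop side
```

The same function serves both calls; only the role vocabulary differs between the two trees.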
WHAT BROWSER-ONLY RUNNERS MISS
The cross-surface checks that keep breaking in CI
These are the real flakes. They do not get mentioned in any comparison between Selenium, Cypress, and Playwright because none of those tools can write the check at all. That is the gap Fazm is built to fill.
Native file pickers
Click Upload → native Open dialog → navigate and select. Fazm hands off to macos-use for the dialog, Playwright resumes on the resulting page toast.
Keychain unlock
When a page triggers a Keychain prompt, browser runners hang. macos-use reads the prompt from AXUIElement and can dismiss or approve.
Password manager
1Password and Bitwarden popups float over Chrome but are not inside it. AX tree sees them; viewport runners do not.
Downloads verification
'Did the CSV actually land on disk?' is a filesystem question. macos-use opens Finder at ~/Downloads, reads the list, confirms the file.
Desktop Slack / Notion
In-product notifications often arrive via the desktop client, not email. Fazm activates the app and reads the latest message via AX.
OS permission prompts
Screen Recording, Microphone, Full Disk Access. These prompts are modal over the entire OS. macos-use handles them in the same agent turn.
RUNTIME TRACE
What the session actually writes to disk
After any Fazm session that touched the browser and the desktop, two directories get populated: /tmp/playwright-mcp and /tmp/macos-use. Both are auditable. Both are plain text. You can open them, diff them across runs, and see exactly what the agent saw.
HOW FAZM PICKS A SIDE
The decision the agent makes on every turn
The agent does not toggle a flag between "browser mode" and "desktop mode". Both MCP servers are live for the whole session. The turn-level prompt (ChatPrompts.swift in the Fazm source) lays out the split as a routing rule, not a state switch.
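One way to picture a routing rule, as opposed to a state switch, is a pure function of the step with nothing remembered between turns. This is a minimal sketch: the Step shape and the tool labels are assumptions, and in Fazm the decision is made by the model reading the prompt, not by code like this.

```typescript
// Labels are illustrative; "native-capture" stands in for the native
// screen or window capture tool.
type Tool = "playwright" | "macos-use" | "native-capture";

// Hypothetical description of a single proposed step.
interface Step {
  surface: "web" | "desktop";
  wantsScreenshot: boolean; // true when a human-visible screenshot is the output
}

// Stateless routing: the answer depends only on the step itself.
function pickTool(step: Step): Tool {
  // The screenshot rule wins regardless of surface, because a browser
  // screenshot only sees the viewport.
  if (step.wantsScreenshot) return "native-capture";
  return step.surface === "web" ? "playwright" : "macos-use";
}
```

Because the function holds no mode, two consecutive steps can land on different servers without any handoff ceremony.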
Step is about a web page
Navigate, click a link in Chrome, fill a form, read a page. Agent picks Playwright. Snapshots land as YAML in /tmp/playwright-mcp. Target by role plus accessible name plus ref=eN.
Step is about a desktop app
Finder, System Settings, Mail, a native file picker, a Keychain prompt, Slack. Agent picks macos-use. Traversals land as .txt in /tmp/macos-use alongside a screenshot for sanity check. Target by role + title + explicit x, y, w, h.
Step is about screenshots specifically
If a human-visible screenshot is the output, ALWAYS use the native capture tool (screen or window), NEVER browser_take_screenshot. Browser screenshot only sees the viewport, not the desktop. This is spelled out verbatim in ChatPrompts.swift line 66.
Step needs tab hygiene
Before navigating to a new URL, the agent checks tabs with browser_tabs action 'list', matches by domain, and reuses or switches rather than opening a new tab. Tabs the agent opened during the session get closed when the turn ends. This rule is written explicitly into the prompt at line 73.
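The tab hygiene rule amounts to a match by domain. Here is a minimal sketch, assuming a simple Tab shape and exact hostname comparison. Both are assumptions; the prompt only says "matches by domain".

```typescript
// Hypothetical shape for one entry of a tab listing.
interface Tab {
  index: number;
  url: string;
}

// Returns the first open tab on the same hostname as targetUrl, or null.
// Throws if targetUrl itself is not a valid absolute URL.
function findReusableTab(tabs: Tab[], targetUrl: string): Tab | null {
  const targetHost = new URL(targetUrl).hostname;
  for (const tab of tabs) {
    try {
      if (new URL(tab.url).hostname === targetHost) return tab;
    } catch {
      // Skip tabs whose URL cannot be parsed.
    }
  }
  return null;
}
```

A caller would switch to the returned tab when there is one and open a new tab only when the result is null.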
Fazm vs a browser-only runner
Same check, different surface coverage
| Feature | Playwright / Selenium / Cypress | Fazm |
|---|---|---|
| Click inside the page by aria role | Yes | Yes, via Playwright MCP |
| Click the native file picker after upload | Blocked at the browser boundary | Yes, via macos-use MCP |
| Confirm a downloaded file landed on disk | Needs a separate filesystem runner | Yes, via macos-use + Finder AX tree |
| Read a desktop Slack notification | Not supported | Yes, via macos-use on the native Slack app |
| Handle Keychain / 1Password / SSO popups | Hangs or needs manual intervention | Yes, AX tree sees the modal |
| Run against real logged-in Chrome session | Separate profile, custom launch args | Yes, via extension id mmlmfjhmonkocbjadbfplnigmagldckm |
| Structural targeting (role + name + ref) | CSS or XPath selectors | aria role + accessible name + ref=eN, on both surfaces |
| Screenshot-based targeting | Optional (visual regression tools) | Disabled by default (--image-responses omit) |
ANCHOR FACT
The five MCP servers, verbatim
If you doubt the architecture claim, the constant that lists the built-in servers is the receipt. It is declared in acp-bridge/src/index.ts at line 1266 and guards the branch that separates built-in from user-configured MCP servers.
Three of them (playwright, macos-use, whatsapp) are accessibility-based automation surfaces, although only the first two matter for browser testing. The second and third read the same underlying macOS AX tree; the first reads the page-level aria tree that Chrome already exposes for screen readers. One idiom, three surfaces, one agent turn.
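For readers who want a picture of what such a guard looks like, here is a minimal sketch. The five names are the ones listed in this article; the constant name, the helper, and the sample user server are hypothetical and are not copied from the Fazm source.

```typescript
// The five built-in server names as listed in the article.
// The constant name itself is hypothetical.
const BUILTIN_MCP_SERVERS: ReadonlySet<string> = new Set([
  "fazm_tools",
  "playwright",
  "macos-use",
  "whatsapp",
  "google-workspace",
]);

// Splits a mixed list of server names into built-in and user-configured.
function partitionServers(names: string[]): { builtin: string[]; user: string[] } {
  const builtin: string[] = [];
  const user: string[] = [];
  for (const name of names) {
    (BUILTIN_MCP_SERVERS.has(name) ? builtin : user).push(name);
  }
  return { builtin, user };
}
```

A set membership test like this is enough to separate the two branches the text describes.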
NOT A FRAMEWORK
This is a signed Mac app, not an npm package
There is no import, no chromedriver, no Playwright install step, no CI job to configure. You download Fazm from fazm.ai, grant Accessibility permission once, install the Chrome extension (id mmlmfjhmonkocbjadbfplnigmagldckm) through the onboarding window, and type your check in plain English. The agent picks from five built-in MCP servers for every step.
This is the consumer-app side of the tradeoff. You give up the frozen, repeat-for-every-commit discipline of a CI test suite, and you get back the ability to verify cross-surface flows on your real machine, against your real cookies, in one run.
WHO THIS IS FOR
Real people, real flows
If any of these describe your last broken automation, you are the target
- A test that worked in CI but breaks when a user actually runs it, because a native file picker opened over Chrome
- A flow that needs 1Password or Keychain to unlock, and your runner freezes at the OS prompt
- An assertion like 'the PDF really did land in ~/Downloads' that no browser-only runner can write
- A cross-app verification: fire an action on the web, confirm the desktop Slack, Notion, or Mail client received the side effect
- Any automation where the selector matters less than the real aria role the screen reader would hit
- A team that wants to verify an agent can actually complete a weekly workflow on a real machine, not in a sandbox
“aria role plus accessible name is stable across scroll, theme, viewport size, and most minor UI tweaks. A pixel-based agent can click at (420, 710) on the first run and (418, 702) on the second.”
From the Fazm engineering notes on why both surfaces use structural targeting
See a cross-surface flow run on your machine
Walk through one of your own end-to-end flows with me: browser plus desktop, one session, aria refs on both sides.
Book a call →
Frequently asked questions
What makes this different from Selenium, Cypress, or a Playwright script?
Selenium, Cypress, and a Playwright script are all sandboxed to the browser viewport. The moment a flow hits a native file dialog, a Keychain unlock, a 1Password quick-access popup, or a desktop Slack notification, they are blind. Fazm is not a library you import; it is a signed Mac app whose agent runtime registers five built-in MCP servers per session (fazm_tools, playwright, macos-use, whatsapp, google-workspace) at acp-bridge/src/index.ts line 1266. Two of them, playwright and macos-use, both expose an aria-style accessibility tree. So the same agent turn that clicks 'Upload' inside the browser can also click 'Choose' inside the macOS file dialog as its next step, and it sees both as structured role trees, not pixels.
Where in the code are the two accessibility surfaces actually registered?
In acp-bridge/src/index.ts, the function buildMcpServers (line 992) is called for every ACP session. Starting at line 1027 it pushes the Playwright MCP server with the exact flags --output-mode file --image-responses omit --output-dir /tmp/playwright-mcp on line 1033. Then at line 1057, if the macosUseBinary exists on disk, it pushes a second MCP server named macos-use which runs a native binary that speaks macOS AXUIElement directly. Both servers are live for the whole session. The agent picks which one to call based on what the proposed step looks like: page-level means Playwright, OS-level means macos-use.
Why is skipping screenshots a feature, not a limitation?
Because a check that compares pixels breaks on changes that do not matter to the user. The three flags on line 1033 (--output-mode file, --image-responses omit, --output-dir /tmp/playwright-mcp) cooperate to make the Playwright MCP server write its aria snapshots to YAML files on disk and strip inline base64 screenshots from the model context. That forces the agent to pick its next click against a stable structural tree (role + accessible name + ref=eN) instead of an image. Re-runs on the same page pick the same refs. Scrolling, theme flips, banner shifts, and 1 px layout jitter do not change role trees, so they do not change test outcomes.
How does it hook into my real Chrome rather than a fresh Chromium?
Via a Chrome Web Store extension with the literal id mmlmfjhmonkocbjadbfplnigmagldckm (Desktop/Sources/BrowserExtensionSetup.swift line 678). When the environment variable PLAYWRIGHT_USE_EXTENSION is set to true, acp-bridge/src/index.ts lines 1029 to 1031 append --extension to the Playwright CLI args, which tells Playwright MCP to attach to the running Chrome session rather than launching its own. That means the tests run against your actual logged-in cookies, real 2FA state, and real extensions. A test that says 'check my Gmail inbox for the verification code' can just work, because the cookie is there.
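The environment check described here can be sketched as a small pure function. The function name and the way the environment is passed in are assumptions; the variable name, the value "true", and the --extension argument come from the text.

```typescript
// Returns a copy of baseArgs with "--extension" appended when the
// PLAYWRIGHT_USE_EXTENSION variable is exactly the string "true".
// The environment is passed in as a plain record so the function stays testable.
function withExtensionFlag(
  baseArgs: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  const args = [...baseArgs]; // copy, so the caller's array is not mutated
  if (env["PLAYWRIGHT_USE_EXTENSION"] === "true") {
    args.push("--extension");
  }
  return args;
}
```

In a Node process the second argument would typically be process.env; any other value of the variable leaves the arguments unchanged in this sketch.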
What does a cross-surface flow actually look like end to end?
Take a typical broken flow: 'upload a CSV on the admin dashboard, verify that the generated PDF landed in ~/Downloads, then check Slack for the success notification'. Browser testing automation tools handle step one and give up on two and three. Fazm handles all three: Playwright MCP clicks the visible Upload button in the dashboard, the OS file picker opens, macos-use takes over and navigates the dialog to the CSV, Playwright watches for the in-page toast, then macos-use opens Finder, confirms the PDF exists in ~/Downloads, and finally opens the Slack desktop app to read the latest notification in the target channel. Same agent, same aria idiom, different MCP servers.
Is this a framework or a product I install?
It is a product. Fazm is a signed Mac app you download from fazm.ai and double-click. There is no npm install, no driver binary to keep in sync with Chrome, no test runner to configure, no CI hookup required. You describe the check in plain English, the agent picks from its five MCP servers, and every proposed step is resolved against an accessibility tree. That is the tradeoff: the cycle time of typing English beats writing code, but you are not building a frozen regression suite for CI. If you need 500 deterministic tests on every pull request, keep Playwright in CI and use Fazm for the flows a script cannot reach.
What counts as a failure mode this setup catches that a browser-only runner does not?
Several. One: the test flow requires an OS-level permission prompt (Screen Recording, Microphone, Full Disk Access) that cannot be dismissed from the browser. Two: the flow redirects to a native SSO prompt hosted by Keychain or 1Password. Three: the flow depends on an attachment actually existing on disk after download, and the browser runner has no filesystem checker. Four: the flow posts to desktop Slack via a native notification, and the assertion is 'did the notification actually arrive'. Five: any flow where the real browser extension ecosystem matters (ad blocker, password manager, MCP bridge). Fazm catches all five because macos-use talks to the real macOS accessibility tree, not a headless sandbox.
Where do the aria snapshots land, and can I audit them after a run?
They land in /tmp/playwright-mcp as YAML files. Each file is a full role tree snapshot of the page at the moment the agent captured it, with ref=eN identifiers on every element. After a session, you can cd /tmp/playwright-mcp and open the files directly to see exactly which roles the agent reasoned against. That is the closest thing to a deterministic replay log without instrumenting your own page. Native macOS accessibility traversals from macos-use are logged the same way, as .txt files in /tmp/macos-use/, each with a timestamped screenshot alongside for visual verification.
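An audit can be as simple as pulling the ref=eN identifiers out of two snapshots and comparing them. This is a minimal sketch: the helper names are hypothetical, the sample snapshot text is made up, and the only thing assumed about the file format is that refs appear as ref=eN, as described above.

```typescript
// Collects every ref=eN identifier in a snapshot's text, in document order.
function extractRefs(snapshot: string): string[] {
  const matches = snapshot.match(/ref=e\d+/g) ?? [];
  return matches.map((m) => m.slice("ref=".length));
}

// Refs present in the first run but absent from the second.
function missingRefs(before: string[], after: string[]): string[] {
  const seen = new Set(after);
  return before.filter((ref) => !seen.has(ref));
}

// Made-up snapshot text for illustration only.
const runA = [
  '- main [ref=e1]:',
  '  - button "Upload" [ref=e12]',
  '  - textbox "Search" [ref=e15]',
].join("\n");
const runB = [
  '- main [ref=e1]:',
  '  - textbox "Search" [ref=e15]',
].join("\n");

const dropped = missingRefs(extractRefs(runA), extractRefs(runB));
```

Here dropped holds the refs that disappeared between the two runs, which is usually the first thing to look at when a step stops resolving.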
Does the agent run in the browser tab I am watching, or in a separate one?
In the one you are watching. Once the Chrome extension is connected, Playwright MCP attaches to your real running Chrome. Fazm injects a visible overlay (a full-viewport div with id 'fazm-overlay' at z-index 2147483647, defined in acp-bridge/browser-overlay-init.js) with a centered pill reading 'Browser controlled by Fazm'. The overlay uses pointer-events:none, so you can still click, switch tabs, and keep working while the agent operates. Every page it touches gets this overlay so you can always tell whether the agent has control.
Can I still use Playwright, Selenium, or Cypress alongside Fazm?
Yes, and you probably should. The two modes solve different problems. Playwright in CI gives you frozen, deterministic regression on every pull request. Fazm on your Mac gives you a way to check a real end-to-end flow against your actual signed-in sessions, today, across both the browser and the rest of the desktop. If your production automation contract is 'ship with green CI', keep Playwright. If you also need 'verify this cross-app flow really works on my machine before I ship', Fazm complements it. They do not replace each other.
Does the macos-use side ever fall back to screenshots?
Only for verification, never for targeting. macos-use targets elements from the accessibility tree output, which is stored as a plain .txt file per traversal. Each element has role, title, and explicit x, y, w, h coordinates read from the AX tree itself. Clicks are auto-centered at (x+w/2, y+h/2). A screenshot is saved alongside only so a human (or the agent) can visually double-check that the action had the intended effect. Targeting is still structural; the screenshot is a sanity check, not a selector.
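The auto-centering is plain arithmetic on the frame read from the AX tree. Here is a minimal sketch, with a hypothetical AxFrame type and function name:

```typescript
// Frame of an element as described in the text: origin plus size.
interface AxFrame {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Center of the frame: (x + w/2, y + h/2).
function clickPoint(frame: AxFrame): { x: number; y: number } {
  return { x: frame.x + frame.w / 2, y: frame.y + frame.h / 2 };
}

// A button at (100, 200) that is 80 wide and 24 tall is clicked at (140, 212).
const center = clickPoint({ x: 100, y: 200, w: 80, h: 24 });
```

Because the point is derived from the element's own frame, it moves with the element when the window is moved or resized.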
What do I actually need on my machine for this to work?
A Mac, Google Chrome, the signed Fazm app, and Accessibility permission granted in System Settings. At first launch the app runs a three-stage Accessibility self-test (testAccessibilityPermission in Desktop/Sources/AppState.swift) to confirm the runtime is alive before the agent attempts any work, then opens the Playwright MCP Bridge onboarding window to install the Chrome extension with id mmlmfjhmonkocbjadbfplnigmagldckm. After that, no drivers, no PATH edits, no test runner config. The agent uses whichever MCP server the step needs, per turn.