Browser test automation that does not stop at the tab
Every popular guide on this topic lands on Playwright, Selenium, or Cypress, which all live inside a single browser tab. This one covers what happens when the flow leaves the tab: a native OAuth sheet, a Finder prompt, a system permission dialog, a Slack desktop notification. On a Mac you can drive all of it through one accessibility tree, and the binary that does it ships inside Fazm.
What every other guide on this topic skips
If you read the ten most-linked articles on this topic, they all converge on the same answer: pick a browser driver, write a spec, run it in CI. Playwright versus Selenium versus Cypress versus Puppeteer versus a couple of no-code wrappers around the same engines. That answer is fine for one kind of flow, a flow that lives entirely inside a single tab. For any real product, a sizeable fraction of the flow lives outside the tab: OAuth sheets that open as native windows, Finder pickers, permission dialogs, desktop notifications, clipboard handoffs to the Mail or Numbers app, 3DS challenges that pop a separate window, system menu bar actions. Browser drivers go blind the moment control crosses that boundary. Teams that care about those flows usually bolt on a second stack: AppleScript, pyautogui, image-match tools, sometimes a third for assertions. Three automation layers, one test. That is the gap. The rest of this page shows what it looks like when you replace the stack with one abstraction that treats the browser as a peer of every other app on your Mac.
Where control goes when a sign-in button is clicked
// Playwright / Selenium / Cypress world
// Lives inside a single browser tab.
await page.goto("https://example.com/login")
await page.fill("input[name=email]", "matt@fazm.ai")
await page.click("button:has-text('Sign in with Google')")
// A native macOS window now has focus.
// page.click() has nothing to click. The driver is blind.
// The test either times out, or you bolt on a second
// automation stack (AppleScript, pyautogui, image match)
// that doesn't share cookies, refs, or a test runner.
await page.waitForURL("**/dashboard", { timeout: 30000 })
The six tools that make up the entire surface
The bundled Swift binary, mcp-server-macos-use, exposes six tools and nothing else. Every step in a flow is a call to one of these six, regardless of whether the target is Chrome, Finder, a macOS dialog, or a Slack window. The entire surface fits on one screen.
open_application_and_traverse
Opens or activates an app by name, bundle ID, or path, then walks its accessibility tree. Returns a file path to the full tree plus a sample of visible elements with coordinates.
click_and_traverse
Click by (x, y) or by element text. Auto-centers at (x+w/2, y+h/2). Supports chained text and pressKey params, so click then type then Return is one call.
type_and_traverse
Types into the app specified by PID. Also accepts an optional pressKey to fire after the text, which covers the typical search-box or chat-box flow in one round trip.
press_key_and_traverse
Fires Return, Escape, Tab, arrow keys, or any character with modifier flags. Valid flags: CapsLock, Shift, Control, Option, Command, Function, NumericPad, Help.
scroll_and_traverse
Scroll wheel event at coordinates inside the target app. Re-traverses after so the diff shows what newly entered the viewport.
refresh_traversal
Pure read. No input. Emits the current accessibility tree for a PID. Useful as a checkpoint or to diff against a later state without changing the screen.
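The checkpoint-and-diff pattern that refresh_traversal enables can be sketched in a few lines. This assumes only what the tools above state: each traversal is a plain-text file with one element per line, so a set diff shows what appeared or disappeared between two reads.

```python
# Sketch: diff two refresh_traversal snapshots (one element per line).
# The line format is whatever the binary writes; no parsing is needed
# for a pure appeared/disappeared diff.
def traversal_diff(before: str, after: str) -> dict:
    old, new = set(before.splitlines()), set(after.splitlines())
    return {
        "appeared": sorted(new - old),      # entered the tree since `before`
        "disappeared": sorted(old - new),   # left the tree since `before`
    }
```

Because refresh_traversal is a pure read, the diff is safe to take at any point without changing what is on screen.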
Inputs, binary, outputs
The anchor fact, in eleven lines of Swift
The part that makes a cross-app flow actually work is the role filter plus the click-centering rule. The centering rule is quoted below; the role filter is a private constant of twelve interactive role prefixes (listed in full in the FAQ). That is the whole set the binary promotes into the compact visible_elements summary. Anything else stays in the full tree file and is reachable by grep, but the model-facing summary stays tight enough that a single prompt can scan it.
“The tool auto-centers the click at (x+w/2, y+h/2).”
main.swift, tool description, line ~1432
A real flow, start to finish
Here is what a Gmail login looks like when the test crosses three processes (Chrome, a native OAuth window, back to Chrome). Every step is one of the six tools. No second automation stack, no image matching, no pixel coordinates in the call site.
Six tool calls, three processes
The app launches and asks for Accessibility permission once
Fazm checks permission with a three-layer probe that works around the stale TCC cache on macOS 26 Tahoe. If the live probe fails, the bundled binary cannot walk any tree and every downstream step is a no-op.
The bundled Swift MCP binary starts
On build, build.sh copies mcp-server-macos-use into $APP_BUNDLE/Contents/MacOS/mcp-server-macos-use. Codemagic signs and notarizes it. At runtime, Fazm spawns it over stdio using the config declared in .mcp.json.
Chrome is opened and traversed in one call
open_application_and_traverse with identifier 'Google Chrome' brings it to the front and writes a .txt file of every AXRole element on the focused window. Visible controls arrive with their x, y, w, h frames, so the next step never guesses pixels.
A click crosses from Chrome into a native OAuth window
click_and_traverse with element='Sign in with Google' centers at (x+w/2, y+h/2) and fires. The response reports an app_switch event when the frontmost PID changes, and re-traverses the new window automatically.
The native sheet is typed into like any other element
type_and_traverse targets the new PID, types the email, presses Return, and re-reads the tree. No protocol switch, no second automation framework, no image matching. Same tool surface, different app.
Control returns to Chrome, refs stay stable
Once OAuth redirects back, Chrome becomes frontmost again. A final click_and_traverse against the returned Chrome PID completes the flow. The test has crossed three processes on a single accessibility-tree abstraction.
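The crossing steps of this walkthrough can be reduced to a schematic sketch. The tool names are the real ones; the `call` wrapper and response field names below are illustrative assumptions about a generic MCP client, not a Fazm API.

```python
# Schematic sketch of the cross-process part of the flow. `call` stands in
# for whatever MCP client reaches the stdio server; its signature and the
# exact response keys ("pid", "app_switch") are assumptions.
def run_login_flow(call) -> int:
    # One tool opens Chrome and returns a traversal in the same response.
    chrome = call("open_application_and_traverse", identifier="Google Chrome")

    # Click by text; the binary resolves the frame and centers the click.
    resp = call("click_and_traverse", pid=chrome["pid"],
                element="Sign in with Google")

    # If a native OAuth window took focus, the response reports it as an
    # app_switch; otherwise keep driving Chrome.
    target_pid = resp.get("app_switch", {}).get("pid", chrome["pid"])

    # Same tool surface for the native sheet: type, then press Return.
    call("type_and_traverse", pid=target_pid,
         text="matt@fazm.ai", pressKey="Return")
    return target_pid
```

The point of the sketch is the absence of a branch: whether `target_pid` is Chrome or a native sheet, the next call is the same tool.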
What actually ships inside the app
The binary is not a separate install. It is copied into the app bundle at build time and signed as part of notarization, so the consumer never sees a second install prompt and never has to wire up a server. The copy happens in build.sh, which places mcp-server-macos-use inside $APP_BUNDLE/Contents/MacOS, and the Codemagic pipeline signs it alongside the app.
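A representative sketch of that bundling step, using the paths named in the FAQ below. The default locations, the stub fallback, and the signing identity are placeholders, not Fazm's literal script.

```shell
# Sketch of the bundle-and-sign step. Paths follow the FAQ below;
# defaults and the identity string are placeholders.
APP_BUNDLE="${APP_BUNDLE:-/tmp/Fazm.app}"
SRC="${SRC:-$HOME/mcp-server-macos-use/.build/release/mcp-server-macos-use}"
DEST="$APP_BUNDLE/Contents/MacOS/mcp-server-macos-use"

mkdir -p "$(dirname "$DEST")"
# Fall back to a stub so the sketch is runnable without a built binary.
[ -f "$SRC" ] || { SRC="$(mktemp)"; printf 'stub' > "$SRC"; }
cp "$SRC" "$DEST"
chmod +x "$DEST"
echo "bundled: $DEST"

# Signing happens in CI, alongside the app (placeholder identity):
# codesign --force --options runtime \
#   --sign "Developer ID Application: EXAMPLE" "$DEST"
```

Because the binary sits inside Contents/MacOS and is signed with the app, notarization covers it and Gatekeeper never sees a second artifact.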
The shape of the call, not the shape of the target
The key design decision is that the tool interface does not change when the target app changes. Chrome uses the same click_and_traverse that Finder uses. A macOS system dialog uses the same type_and_traverse that a chat box in Slack uses. There is no driver per app, no grid, no protocol selection. The browser is a citizen of the accessibility tree, not the thing the tree was designed for.
What fits where
A traditional browser driver is still the right pick for single-tab spec flows on CI. Fazm is the right pick for flows that cross process boundaries or need your real logins.
| Feature | Browser driver (Playwright / Selenium) | Fazm (accessibility tree on Mac) |
|---|---|---|
| Scope of what can be clicked | Elements inside one browser tab | Every AXUIElement on your Mac |
| Cross-process flows | Needs a second automation stack for native windows | Same six tools work on Chrome, Finder, Slack, system dialogs |
| How a target is named | CSS selector or XPath | AXRole + text, or partial case-insensitive text match |
| Click math | x, y in page or viewport coordinates | auto-centered at (x+w/2, y+h/2) from the element frame |
| Native OAuth sign-in windows | Opaque, timeout territory | Reported as app_switch, re-traversed automatically |
| Who is the user | QA engineer writing a spec file | Anyone with a Mac who wants a repeatable flow |
| Where it runs | CI node, headless Chromium | Your Mac, against your real apps with your real logins |
Scenarios that break a browser driver and pass this one
When a browser driver is still the better answer
If your flow lives entirely inside one tab, you write spec files, your tests run in CI on Linux, and you have no appetite for an extra desktop dependency, a browser driver is still the right tool. Playwright in particular has excellent trace viewers, per-test isolation, and a strong story for parallelism. Fazm is not trying to replace it on that axis. What Fazm replaces is the moment you realize your flow leaves the tab, or you need to drive it on a real Mac with real logins (because a CI node does not have your 1Password or your Google account), or the user running the flow is not an engineer at all. Those are the cases where one accessibility tree wins over one tab.
See a browser-plus-native flow run on your own Mac
Fifteen minutes, your laptop, a real flow. We will walk the same six tools against whatever app stack you live in.
Book a call →
Frequently asked questions
What makes this different from Playwright, Selenium, Puppeteer, or Cypress?
Those four are browser drivers. They speak WebDriver or CDP, and their world ends at the edges of a single browser tab. Any flow that leaves the tab (a native OAuth window, a Finder open-file sheet, a system permission prompt, a desktop Slack notification) is out of reach. Fazm drives the browser using the exact same tool surface it uses for every other app on the Mac: the macOS Accessibility APIs. The bundled mcp-server-macos-use binary exposes six tools (open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal) and each one works on Chrome, Safari, Finder, Slack, Mail, and a system dialog with no change in interface.
How does it target a button without pixel coordinates?
click_and_traverse accepts either (x, y) or a text string. When you pass text, the binary does a case-insensitive partial match against the visible elements in the current accessibility tree, picks the first hit, reads its frame, and centers the click at (x+w/2, y+h/2). That math is documented in the tool's own description string in Sources/MCPServer/main.swift around line 1432. The practical consequence: you can write 'click Sign in with Google' and the binary resolves it against the actual on-screen frame, even after a layout shift.
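The resolution rule in that answer fits in a short sketch. The (x+w/2, y+h/2) math and the first-hit, case-insensitive partial match are the documented behavior; the exact `x:… y:… w:… h:…` line format is an assumption about the traversal output, so adjust the regex to the real lines.

```python
import re

# Sketch of the documented resolution rule: case-insensitive partial text
# match against visible elements, first hit wins, click centered at
# (x+w/2, y+h/2). The frame-field format below is an assumption.
FRAME = re.compile(r"x:(-?\d+)\s+y:(-?\d+)\s+w:(\d+)\s+h:(\d+)")

def resolve_click(visible_lines, text):
    needle = text.lower()
    for line in visible_lines:                  # first match wins
        m = FRAME.search(line)
        if m and needle in line.lower():
            x, y, w, h = map(int, m.groups())
            return (x + w / 2, y + h / 2)       # the documented centering
    return None
```

This is why a layout shift does not break the call site: the text is resolved against the current frame on every click, not against coordinates recorded earlier.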
Which accessibility roles does it consider 'interactive'?
Twelve prefixes: AXButton, AXLink, AXTextField, AXTextArea, AXCheckBox, AXRadioButton, AXPopUpButton, AXComboBox, AXSlider, AXMenuItem, AXMenuButton, AXTab. The list is a private constant (interactiveRolePrefixes) in main.swift at line 916. When the tool summarizes a visible_elements section, it caps interactive hits at 30 per call and static text at 10, which keeps the payload tight enough that a model can still pick one. Non-interactive elements are filtered out of the compact summary but kept in the full .txt traversal file for grep.
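The summarization policy that answer describes can be sketched directly. The prefix list and the 30/10 caps come from the answer above; the element dict shape is an assumption for illustration.

```python
# Sketch of the visible_elements policy: promote elements whose role starts
# with an interactive prefix, cap interactive hits at 30 and static text at
# 10. The dict shape ({"role": ...}) is an assumption.
INTERACTIVE_PREFIXES = (
    "AXButton", "AXLink", "AXTextField", "AXTextArea", "AXCheckBox",
    "AXRadioButton", "AXPopUpButton", "AXComboBox", "AXSlider",
    "AXMenuItem", "AXMenuButton", "AXTab",
)

def summarize(elements, max_interactive=30, max_static=10):
    interactive = [e for e in elements
                   if e["role"].startswith(INTERACTIVE_PREFIXES)]
    static = [e for e in elements if e["role"] == "AXStaticText"]
    return interactive[:max_interactive] + static[:max_static]
```

Note these are prefixes, not exact roles, so a role like AXTabGroup would also match AXTab; everything filtered here still lives in the full .txt traversal for grep.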
What actually ships inside the app?
Two native binaries, bundled at build time. build.sh lines 131-139 copy ~/mcp-server-macos-use/.build/release/mcp-server-macos-use into $APP_BUNDLE/Contents/MacOS/mcp-server-macos-use. codemagic.yaml lines 200-230 build it as a universal (arm64 + x86_64) binary, cache it by version, and line 534 signs it alongside the app. The source is 1,917 lines of Swift in Sources/MCPServer/main.swift. A smaller ScreenshotHelper binary is built the same way for the occasional case where a visual snapshot beats an AX dump.
How is a cross-app flow reported back to the caller?
Every tool response includes an optional app_switch block. If the frontmost app changes during a click, the tool records the new PID, the new app name, and the full traversal of the new window. You see that in the serializer at main.swift line 872 onward: lines.append('app_switch: ...'); lines.append('app_switch_elements: X total, Y visible'); and an app_switch_visible_elements section. So when you click 'Sign in with Google' in Chrome and a native sheet takes focus, the response tells you exactly what to target next without a second tool call to figure out where you are.
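Consuming that block on the caller side is a line-oriented parse. The `app_switch:` and `app_switch_elements:` line prefixes are quoted from the answer above; what follows each colon is an assumption about the serializer's field layout.

```python
# Sketch: pull the app_switch block out of a line-oriented tool response.
# The two line prefixes are the documented ones; the payload after each
# colon is treated as opaque text here.
def parse_app_switch(response_text: str):
    info = {}
    for line in response_text.splitlines():
        if line.startswith("app_switch:"):
            info["target"] = line.split(":", 1)[1].strip()
        elif line.startswith("app_switch_elements:"):
            info["elements"] = line.split(":", 1)[1].strip()
    return info or None   # None means focus never left the app
```

A caller can branch on the return value: None means keep driving the same PID, anything else means the next target is already traversed and named.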
What modifier keys are supported on press_key_and_traverse?
Eight flags, all documented in the tool's description: CapsLock (or Caps), Shift, Control (or Ctrl), Option (or Opt or Alt), Command (or Cmd), Function (or Fn), NumericPad (or Numpad), and Help. They combine. Command plus Shift plus 4 still fires a screenshot hotkey. The Swift side maps the strings to CGEventFlags, so anything CoreGraphics accepts as a modifier is reachable.
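The alias table in that answer normalizes cleanly before a call is made. The aliases below mirror the documented ones; the function itself is an illustrative caller-side helper, not part of the binary.

```python
# Sketch: normalize the documented modifier aliases to canonical flag names
# before passing them to press_key_and_traverse. Alias table mirrors the
# answer above; the helper is illustrative.
ALIASES = {
    "caps": "CapsLock", "capslock": "CapsLock",
    "shift": "Shift",
    "ctrl": "Control", "control": "Control",
    "opt": "Option", "alt": "Option", "option": "Option",
    "cmd": "Command", "command": "Command",
    "fn": "Function", "function": "Function",
    "numpad": "NumericPad", "numericpad": "NumericPad",
    "help": "Help",
}

def normalize_modifiers(combo: str):
    # "Cmd+Shift" -> ["Command", "Shift"]
    return [ALIASES[part.strip().lower()] for part in combo.split("+")]
```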
Does this really work for typical browser-era test cases, like filling a form and asserting a toast?
Yes, with one caveat. The binary returns a .txt file of the accessibility tree plus a compact visible_elements summary. You assert by grepping the tree: text that appears in a toast (for example, 'Saved') will show up as an AXStaticText element with a frame and an in_viewport flag. You do not need to wait for a selector; the tree is a single snapshot. The caveat: if a page renders the toast inside a canvas instead of text, AX will not see it. That is the one case where the bundled ScreenshotHelper is useful: dump a PNG of the focused window and let a vision model read it. Fazm keeps both options on the same tool surface.
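The grep-the-tree assertion style can be sketched in a few lines. It assumes the traversal file lists one element per line with its AXRole, its text, and an in_viewport marker, as the answer above describes; the exact token names are assumptions.

```python
# Sketch: assert a toast from the traversal snapshot instead of waiting on
# a selector. Assumes one element per line carrying the AXRole, the text,
# and an "in_viewport" marker (token names are assumptions).
def toast_visible(tree_text: str, message: str) -> bool:
    for line in tree_text.splitlines():
        if ("AXStaticText" in line
                and message in line
                and "in_viewport" in line):
            return True
    return False
```

There is no polling loop because the tree is a single snapshot; if timing matters, take a second snapshot with refresh_traversal and check again.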
Who should actually use this instead of a traditional framework?
People whose flows already cross process boundaries. If your check is 'log into Gmail with Google OAuth and confirm the verification email landed', you need browser plus native plus system notifications. If your team lives inside a spec runner on CI and never leaves the tab, a browser driver is still the right tool. Fazm is a consumer app, not a pytest plugin. You install it on your Mac, you say what you want in English, and the six-tool surface executes the flow. No spec file, no grid, no Docker. The fit is someone who thinks of browser automation as a means to an end, not a thing they maintain.
What happens when Accessibility permission is not granted?
Every tool call fails upstream because AXUIElementCopyAttributeValue returns apiDisabled or cannotComplete. Fazm runs a three-layer probe on startup that checks the live TCC database through a CGEvent tap, because on macOS 26 Tahoe AXIsProcessTrusted sometimes caches a stale 'granted' answer inside the process. If the probe fails, the app surfaces a specific onboarding screen instead of silently accepting empty traversals. Details on that probe are in AppState.swift lines 431-504, and we walk through it in the April 2026 news piece on this site.
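The layered-probe pattern in that answer can be shown schematically. The layer names below are illustrative, not Fazm's actual checks; the structure is simply "run progressively more authoritative checks and trust the first failure."

```python
# Schematic sketch of a layered permission probe: each layer is a named
# check; the first failing layer wins, because a later, more authoritative
# layer (like a live event tap) can contradict a cached earlier answer.
def probe_permission(checks):
    for name, check in checks:
        if not check():
            return (False, name)   # which layer failed drives the UI shown
    return (True, None)
```

The reason the ordering matters is exactly the stale-cache problem described above: a cached "granted" from an early layer must not short-circuit a live probe that would reveal the truth.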
Is this open source? Can I read the binary's code?
The MCP server is open at github.com/mediar-ai/mcp-server-macos-use. The full tool surface, the role filter, the click math, and the app_switch reporting all live in Sources/MCPServer/main.swift (currently 1,917 lines). The consumer app around it is closed; the automation core that actually drives the Mac is not.
Adjacent writing on how Fazm handles the screen
Keep reading
AI browser automation on a Mac
How Fazm reads the screen through the accessibility tree instead of screenshots, and why that changes what a model can do.
Browser automation tools, 2026
A short, honest survey of the options if you just need a few repeatable flows on your own machine.
Anthropic Claude news, April 2026
Opus 4.7 GA, ACP bumps, and the three-layer accessibility probe that makes April's news land on a Mac.