A browser automation test that lives in the aria role tree, not in a screenshot diff
Most guides on this topic walk you through Selenium or Playwright scripts and the flakiness rituals that follow. This one does not. Fazm is a Mac app, not a library, and its approach to testing a browser automation has two unusual properties: the runtime self-tests before it runs anything, and the agent sees the page as a YAML aria tree with structural refs, not as pixels. Both choices are load-bearing, and both show up in the source.
THE PREMISE
What is a browser automation test actually trying to prove?
Three things, in order of how much they matter. First, that the automation still runs at all on this machine, against this build of the browser, with these permissions. Second, that it produces the same result twice in a row. Third, that it fails loudly and precisely when the target page changes. Most advice online collapses all three into "write a Playwright script and watch it", which answers the first, punts on the second, and leaves you to judge the third by eye.
Fazm is designed from the opposite end. The first layer is a runtime self-test. The second layer is a structural assertion surface (an aria snapshot). The third layer is the agent itself, which logs every ref it clicked against in a file you can read after the session.
LAYER ONE
The three-stage Accessibility self-test
Before a single click reaches a page, Fazm proves that the Accessibility API is actually alive. The check runs on every boot and on a recurring timer. It is not a read of the permission toggle in System Settings; it is a real API round trip. On macOS 26 the two answers can diverge, and the app trusts the round trip.
The interesting bit is the .cannotComplete branch. That error is ambiguous by design; it can mean the permission is broken, or it can mean the frontmost app just does not implement AX (PyMOL, some Qt apps, fullscreen games). Fazm handles the ambiguity with two fallbacks, in this order.
Stage 1. Real AX call against the frontmost app
AXUIElementCreateApplication on the PID of whichever app is on top, then read kAXFocusedWindowAttribute back. .success, .noValue, .notImplemented, and .attributeUnsupported all count as 'AX is working'. .apiDisabled is a hard negative and short-circuits out.
Stage 2. Cross-check against Finder
On .cannotComplete, run the exact same two calls against Finder (a known AX-compliant reference app). If Finder also fails, the permission is truly broken. If Finder succeeds, the original failure was app-specific incompatibility, not a permission regression. See lines 468 to 485 of AppState.swift.
Stage 3. CGEvent tap probe against the live TCC database
If Finder is not running, fall through to CGEvent.tapCreate, which checks the live TCC (privacy) database directly and bypasses the per-process cache that can go stale on macOS 26 Tahoe (lines 490 to 504 of AppState.swift). This is the tie-breaker when neither of the first two stages gave a clean answer.
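The order of the three stages can be modeled as a small decision function. This is a sketch in TypeScript, not the Swift in AppState.swift: the names `AXResult`, `Probes`, and `accessibilityIsLive` are hypothetical, only the result codes named above are modeled, and it assumes Finder's answer is judged by the same "working" set as the frontmost app.

```typescript
// Hypothetical model of the three-stage decision. The real check is Swift.
type AXResult =
  | "success"
  | "noValue"
  | "notImplemented"
  | "attributeUnsupported"
  | "apiDisabled"
  | "cannotComplete";

interface Probes {
  frontmostApp: () => AXResult;   // stage 1: real AX round trip on the frontmost app
  finder: () => AXResult | null;  // stage 2: same calls on Finder, null if Finder is not running
  eventTap: () => boolean;        // stage 3: true if an event tap could be created
}

// Results that count as "AX is working" according to the text above.
const WORKING: AXResult[] = ["success", "noValue", "notImplemented", "attributeUnsupported"];

function isWorking(r: AXResult): boolean {
  return WORKING.indexOf(r) !== -1;
}

function accessibilityIsLive(p: Probes): boolean {
  const first = p.frontmostApp();
  if (isWorking(first)) return true;
  if (first === "apiDisabled") return false; // hard negative, short-circuit

  // Remaining case is "cannotComplete", which is ambiguous on its own.
  const finder = p.finder();
  if (finder !== null) return isWorking(finder); // assumption: same success set for Finder

  // Finder is not running, so the event tap probe is the tie-breaker.
  return p.eventTap();
}
```

The probes are injected as functions so the ordering can be exercised without touching the real Accessibility API.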
WHY BOTHER
Most guides skip this step. It is the whole test.
Other playbooks treat the runtime as a given. "Did Playwright install? Does your Chromedriver match Chrome? Do you have permission to control Safari?" These questions get four sentences in a setup guide, and then the guide moves on to assertions. In practice, much of the time spent debugging a flaky browser automation is time spent discovering that one of those preconditions silently failed. Fazm moves its own precondition, a live Accessibility API, into a self-test that runs on every boot, re-runs every 5 seconds on retry, and escalates to a visible alert if it still fails after three tries.
Failure modes the self-test catches before the agent sees them
- Accessibility permission toggle flipped on but the AX API cache is stale (macOS 26)
- Permission revoked silently after a macOS minor update
- Frontmost app does not implement AX at all (identified as app-specific by the Finder cross-check, rather than hanging)
- TCC database says no but the per-process cache still says yes
- Event tap creation fails because Accessibility is genuinely not trusted
LAYER TWO
The page is a YAML aria tree on disk, not a screenshot in context
Once the runtime is proven, the agent needs a view of the page. Fazm deliberately gives it a structural one. The Playwright MCP server is launched with three flags that are the technical core of how this product tests: an output directory, a mode that writes snapshots as files, and an explicit instruction to omit base64 screenshots from the tool output.
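As a rough illustration, the three flags could be assembled like this. The flag names and the PLAYWRIGHT_USE_EXTENSION switch are the ones quoted in the FAQ further down; the helper `buildPlaywrightMcpArgs` is hypothetical and is not the code in acp-bridge/src/index.ts.

```typescript
// Sketch only: a hypothetical helper that builds the Playwright MCP argument list.
function buildPlaywrightMcpArgs(env: Record<string, string | undefined>): string[] {
  const args = [
    "--output-mode", "file",               // write snapshots as files
    "--image-responses", "omit",           // keep base64 screenshots out of tool output
    "--output-dir", "/tmp/playwright-mcp", // where the YAML snapshots land
  ];
  // Real-Chrome mode is opt-in through an environment variable.
  if (env["PLAYWRIGHT_USE_EXTENSION"] === "true") {
    args.push("--extension");
  }
  return args;
}
```

Passing the environment in as a plain object keeps the function pure, so both the fresh-Chromium and the real-Chrome variants can be checked side by side.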
When the agent asks Playwright for a page snapshot, the result is a YAML file written to /tmp/playwright-mcp with the full aria role tree of the page. Every element gets a numbered ref like ref=e14. The agent tells Playwright "click ref=e14" and Playwright maps that back to the DOM element. No screenshot, no OCR, no vision step.
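To make the ref indirection concrete, here is a minimal sketch that pulls role, name, and ref out of snapshot text and resolves a target by role plus name. It assumes each element appears on one line shaped like `- button "Submit" [ref=e14]`, which is an assumption about the snapshot layout, and it uses a regular expression rather than a real YAML parser. `parseSnapshot` and `resolveRef` are hypothetical names.

```typescript
interface SnapshotNode {
  role: string;
  name: string; // empty string when the element has no accessible name
  ref: string;
}

// Extract every line that carries a [ref=eN] marker.
function parseSnapshot(yaml: string): SnapshotNode[] {
  const line = /^\s*-\s+(\w+)(?:\s+"([^"]*)")?[^\n]*?\[ref=(e\d+)\]/;
  const nodes: SnapshotNode[] = [];
  for (const raw of yaml.split("\n")) {
    const m = line.exec(raw);
    if (m) nodes.push({ role: m[1] ?? "", name: m[2] ?? "", ref: m[3] ?? "" });
  }
  return nodes;
}

// Resolve role + name to a ref, failing loudly when the match is missing or ambiguous.
function resolveRef(nodes: SnapshotNode[], role: string, name: string): string {
  const hits = nodes.filter((n) => n.role === role && n.name === name);
  const hit = hits[0];
  if (hits.length !== 1 || hit === undefined) {
    throw new Error(`expected exactly one ${role} "${name}", found ${hits.length}`);
  }
  return hit.ref;
}

// Illustrative snapshot text, invented for this example.
const sample = [
  "- generic [ref=e1]:",
  '  - heading "Sign in" [level=1] [ref=e2]',
  '  - textbox "Email" [ref=e3]',
  '  - button "Submit" [ref=e14]',
].join("\n");

const submitRef = resolveRef(parseSnapshot(sample), "button", "Submit");
```

Throwing on zero or multiple matches is a design choice for the sketch: a changed page surfaces as a precise error rather than a click on the wrong element.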
WHAT CHANGES BETWEEN RUNS
Pixel click vs. ref click, on the same page, on the same machine
Re-running the same check twice
A vision model screenshots the page, labels 'Submit', and clicks at (420, 710). Second run: a banner shifts layout, 'Submit' moves to (418, 702), the model clicks at the old coordinates, and lands on the checkbox below it. Third run: a modal opens first, the screenshot is of the modal, the model clicks 'X' but calls it 'Submit', and the test 'passes' against the wrong element. None of these failures are detected without human eyeballing.
- Click coordinates drift with layout jitter
- Screenshot label disagrees with DOM reality
- Modal overlays confuse the target
- No stable identifier across runs
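The drift described above can be reproduced with a toy layout model. Everything in this sketch is invented for illustration (the `Box` type, the coordinates, the 50-pixel shift); it only shows why a recorded coordinate can land on a different element after a layout change while a role-plus-name lookup still finds the intended one.

```typescript
interface Box { role: string; name: string; x: number; y: number; w: number; h: number; }

// What a pixel click hits: the first box containing the point.
function hitTest(layout: Box[], x: number, y: number): Box | undefined {
  for (const b of layout) {
    if (x >= b.x && x < b.x + b.w && y >= b.y && y < b.y + b.h) return b;
  }
  return undefined;
}

// What a structural lookup hits: the box with a matching role and name.
function byRoleAndName(layout: Box[], role: string, name: string): Box | undefined {
  for (const b of layout) {
    if (b.role === role && b.name === name) return b;
  }
  return undefined;
}

const firstRun: Box[] = [
  { role: "button", name: "Submit", x: 380, y: 690, w: 80, h: 40 },
  { role: "checkbox", name: "Remember me", x: 380, y: 740, w: 80, h: 30 },
];

// Second run: a banner collapses and everything moves up by 50 pixels.
const secondRun: Box[] = firstRun.map((b) => ({ ...b, y: b.y - 50 }));

const recorded = { x: 420, y: 710 }; // coordinates captured during the first run

const pixelFirst = hitTest(firstRun, recorded.x, recorded.y);   // the Submit button
const pixelSecond = hitTest(secondRun, recorded.x, recorded.y); // the checkbox
const structural = byRoleAndName(secondRun, "button", "Submit"); // still the Submit button
```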
THE NUMBERS
Counts you can verify in the repo right now
- 3 self-test stages at boot (AX → Finder → EventTap)
- 5-second retry interval (accessibilityRetryInterval)
- 0 screenshots in model context (--image-responses omit)
- 2147483647 overlay z-index (fazm-overlay)
WHAT A RUN LOOKS LIKE
The log from one end-to-end run
Following tail -f /tmp/fazm-dev.log during a single check shows the steps in this order: the self-test runs first, Playwright writes the aria snapshot to disk, and the agent picks a ref. At no point is a base64 image in the record.
TWO DIFFERENT JOBS
Where this fits next to CI regression suites
Fazm's approach is not a drop-in replacement for Playwright-in-CI. It is a different shape of check for a different moment. The table below is the way to think about which to reach for.
| Feature | Playwright / Selenium in CI | Fazm (runtime dry-run on your Mac) |
|---|---|---|
| Who runs it | CI runner, scheduled or on pull request | You, on your own Mac, when you want to check a flow |
| What you author | Code files, selectors, assertions, fixtures | An English sentence |
| Assertion surface | DOM selector or toMatchScreenshot | Aria role + name (ref=eN) from the YAML snapshot |
| Runtime health check | Implicit: did the job succeed? | Explicit: three-stage AX self-test at boot and every 5 seconds |
| Session state | Fresh browser, fixtures for login | Your real signed-in Chrome, via the Chrome Web Store extension |
| Artifact you can replay | Video, trace.zip, screenshots | YAML aria snapshots in /tmp/playwright-mcp |
| Best use | Frozen regression suite that blocks merges | Verifying a workflow end-to-end on the real environment, once |
WHY THE FIVE PIECES MATTER TOGETHER
The five layers that make a run reproducible
Runtime self-test
Accessibility is proven before any tool runs. See testAccessibilityPermission at Desktop/Sources/AppState.swift line 433, plus the Finder cross-check and the CGEvent.tapCreate tie-breaker.
Structural snapshot
Playwright writes aria YAML to /tmp/playwright-mcp. --image-responses omit drops base64 screenshots from context so the model never sees pixels.
Ref-based clicks
Every action addresses a DOM element by role + name through a ref=eN indirection. The same sentence produces the same ref on the same page.
Visible overlay
Four gradient wings plus a centered pill at z-index 2147483647. pointer-events:none so it does not intercept clicks. Injected on every page load so navigations cannot hide it.
Cross-app continuity
If the flow leaves the tab (OS dialog, Finder, native app), the same agent keeps going through macos-use against the real AX tree. The test surface is the system, not a single tab.
“The common playbooks on testing a browser automation focus on writing tests. Fazm focuses on proving the runtime first, then asserting against an aria snapshot on disk. Two different problems, rarely addressed together.”
Fazm repo: AppState.swift:433, acp-bridge/src/index.ts:1033
HOW TO TRY ONE
A four-step dry-run you can do today
1. Install the Mac app
Signed, notarized .dmg from fazm.ai. Drag to /Applications. No Homebrew, no npm, no pip. At first launch the app runs the three-stage self-test and tells you immediately if Accessibility is live.
2. Grant Accessibility and Screen Recording
Two toggles in System Settings. The onboarding flow verifies them via real AX calls, not the cached permission state, so you will know within seconds if something did not stick.
3. Connect the real-Chrome extension (optional)
Install Playwright MCP Bridge from the Chrome Web Store and paste the base64url token into Fazm once. After that, every browser check runs against the Chrome you are already signed into, not a throwaway Chromium.
4. Type the check in English
Describe the flow you want to verify. Watch the overlay appear in Chrome, the agent click through via refs, and the aria snapshot land in /tmp/playwright-mcp. Re-run the same sentence to see how ref-based reproducibility behaves on your actual site.
THE SELF-HEAL LOOP
What happens when the self-test fails mid-session
The retry logic is short and worth knowing. Fazm does not assume the permission is stable; it assumes the opposite and plans the recovery path.
The alert that fires after three failed retries is titled "Accessibility Permission Needs Restart" and offers "Quit & Reopen" as the primary button. If the user accepts, relaunchApp() (line 422) spawns a delayed open on the bundle path and terminates. This is specifically designed for the macOS 26 cache-stale case, where only a process restart clears the bad state.
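The retry policy can be sketched as a small state transition: each timer tick either clears the failure, keeps waiting, or escalates. The 5-second interval and the limit of three come from the text; `onRetryTick`, `RetryState`, and the action names are hypothetical, and the real logic is Swift in AppState.swift.

```typescript
const RETRY_INTERVAL_MS = 5000; // tick spacing, per the 5-second interval described above
const MAX_RETRIES = 3;

interface RetryState { failedAttempts: number; }

type RetryAction =
  | "stop-timer"          // self-test passed again, nothing more to do
  | "wait"                // still failing, try again on the next tick
  | "show-restart-alert"; // out of retries, offer "Quit & Reopen"

// Pure transition for one timer tick. The caller owns the actual timer.
function onRetryTick(
  state: RetryState,
  axIsLive: boolean,
): { state: RetryState; action: RetryAction } {
  if (axIsLive) {
    return { state: { failedAttempts: 0 }, action: "stop-timer" };
  }
  const failedAttempts = state.failedAttempts + 1;
  const action: RetryAction = failedAttempts >= MAX_RETRIES ? "show-restart-alert" : "wait";
  return { state: { failedAttempts }, action };
}
```

Keeping the transition free of timers and UI means the escalation path can be checked without waiting 15 seconds or showing a dialog.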
SANITY CHECK
How to audit a finished run
After the agent is done, the artifacts are plain files you can open. No dashboard, no cloud uploader, no replay service to log into.
Frequently asked questions
What does Fazm actually test when it tests a browser automation?
Two distinct things at two moments. At app launch, it runs a live Accessibility API probe (testAccessibilityPermission in Desktop/Sources/AppState.swift at line 433) that calls AXUIElementCreateApplication on whatever app is frontmost and reads kAXFocusedWindowAttribute back. That is a health check on the runtime itself. Then at agent turn, every proposed click is resolved against a YAML aria snapshot that Playwright writes to /tmp/playwright-mcp; the agent picks moves by ref=eN, not pixel coordinates. So the test is structural, not visual, and the runtime is verified before it is used.
Why a three-stage Accessibility self-test instead of just trusting the permission toggle?
On macOS 26 (Tahoe) the per-process AX cache is known to go stale. AXIsProcessTrusted() can return true while real AX calls fail silently. Fazm handles that by (1) running a real round-trip call on the frontmost app, (2) on .cannotComplete, cross-checking against Finder (a known AX-compliant reference app), which distinguishes an app-specific AX incompatibility from a genuine permission break, and (3) as a tie-breaker, probing the live TCC database via CGEvent.tapCreate, which bypasses the stale per-process cache. You can read the exact switch statement in Desktop/Sources/AppState.swift at lines 444 to 462, the Finder fallback at lines 468 to 485, and the event-tap probe at lines 490 to 504.
Why does the Playwright leg deliberately drop screenshots?
Because a test that compares pixels is sensitive to rendering noise such as antialiasing, themes, and layout jitter, which makes it prone to flaking. In acp-bridge/src/index.ts on line 1033, Playwright MCP is launched with the flags --output-mode file --image-responses omit --output-dir /tmp/playwright-mcp. That puts the page's aria snapshot on disk as a YAML file with refs (ref=e1, ref=e2, and so on) and strips inline base64 screenshots from the model context. The agent therefore has to pick its next click against a stable structural tree, not an image. Re-runs on the same page version pick the same refs; scrolling, theme changes, and layout jitter do not change the aria role tree, so they do not change the test outcome.
How is this different from a Selenium or Playwright test script?
A Selenium or Playwright script is code that you write, run, and then babysit through flaky selector breakages. Fazm is a signed Mac app you double-click. You describe the check in English, the agent picks tools (Playwright is one of five MCP servers it can call), and every proposed action is resolved against the aria tree. No CSS or XPath selectors to maintain, no test runner, no CI wiring. The upside is the cycle time of typing the check; the tradeoff is that this is a runtime dry-run, not a frozen regression test in a CI pipeline.
Does Fazm replace Playwright or Selenium for production regression testing?
No, and it is not trying to. Production regression suites want frozen code, deterministic execution on CI runners, and tight assertions. Fazm is a consumer Mac app aimed at someone who wants to verify a workflow on their own machine, against their own signed-in sessions, today. If you are shipping a SaaS and need 500 tests on every pull request, you still want Playwright in CI. If you want to check whether an AI agent can actually complete your weekly reconciliation on your real browser, with your real cookies, you want Fazm.
What happens if the Accessibility self-test fails mid-session?
Fazm starts a retry timer that re-runs the check every 5 seconds, up to 3 attempts (see startAccessibilityRetryTimer at Desktop/Sources/AppState.swift line 375). If all retries fail, it opens an alert titled 'Accessibility Permission Needs Restart' with two buttons, 'Quit & Reopen' and 'Later'. If the user taps the first, the app relaunches itself via a delayed open (relaunchApp at line 422). The design assumption is that a stale Accessibility cache is a real and common failure mode on recent macOS, not a theoretical one, so the self-test runs continuously rather than once at boot.
What does 'click by ref, not pixel' buy me in practice?
Reproducibility. If you describe a check twice, the agent reads the aria tree twice and picks the same ref both times, because aria role plus accessible name is stable across scroll, theme, viewport size, and most minor UI tweaks. A pixel-based agent can click at (420, 710) on the first run and (418, 702) on the second, then (0, 0) on the third because the page has a banner that shifted layout. The aria tree does not care about the banner.
Does Fazm also test the overlay injection itself?
Yes, implicitly. The overlay is injected by a page 'load' / 'domcontentloaded' listener registered by scripts/patch-playwright-overlay.cjs against Playwright's extensionContextFactory.js. On every page navigation, the script looks for the element #fazm-overlay and re-injects if missing (browser-overlay-init.js line 18 short-circuits if the element already exists). So it is self-healing across iframes, single-page-app route changes, and tab swaps.
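The idempotent re-injection can be sketched as follows. This is not the content of browser-overlay-init.js: `DocLike` and `ElLike` are hypothetical stand-ins for the browser DOM so the example runs outside a page, and the `position:fixed;inset:0` styling is an assumption about how a full-viewport overlay might be laid out. Only the element id, the z-index, and pointer-events:none come from the article.

```typescript
// Minimal stand-ins for the parts of the DOM this sketch touches (hypothetical).
interface ElLike { id: string; style: { cssText: string }; }
interface DocLike {
  getElementById(id: string): ElLike | null;
  createElement(tag: string): ElLike;
  body: { appendChild(el: ElLike): void };
}

const OVERLAY_ID = "fazm-overlay";

// Returns true if an overlay was injected, false if one was already present.
function ensureOverlay(doc: DocLike): boolean {
  if (doc.getElementById(OVERLAY_ID) !== null) return false; // short-circuit: already there

  const el = doc.createElement("div");
  el.id = OVERLAY_ID;
  // Top-of-stack and click-through; the positioning rules are assumed.
  el.style.cssText = "position:fixed;inset:0;z-index:2147483647;pointer-events:none";
  doc.body.appendChild(el);
  return true;
}
```

Because the function checks before it injects, calling it from every load or navigation listener is safe: repeated calls leave exactly one overlay in place.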
Can the agent call other tools during a browser check, or is it tab-locked?
Not tab-locked. Inside the bridge, the agent has five peer MCP servers available (fazm_tools, playwright, macos-use, whatsapp, google-workspace). If the check needs to pull a code from Gmail mid-run, it can; if it needs to rename a file in Finder, it can. Any step that leaves the tab still uses the accessibility tree, just via the macOS-native macos-use MCP instead of Playwright. The test surface is the system, not one tab.
Where exactly are the aria snapshots written, and can I audit them after a run?
They land in /tmp/playwright-mcp as YAML files. Each file has the full role tree of the page at the moment the agent took a snapshot, with ref=eN identifiers on every element. After a Fazm session, you can cd /tmp/playwright-mcp and read those files directly to see exactly which aria roles the agent was reasoning against. That is the closest you get to a deterministic replay log without instrumenting the page itself.
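One way to audit two runs against each other is to compare which role-plus-name pairs appear in each snapshot. The sketch below works on snapshot text that the caller has already read from disk, assumes each element sits on one line shaped like `- button "Submit" [ref=e14]`, and deliberately ignores the ref numbers so that renumbering alone is not reported as a change. `snapshotKeys` and `diffSnapshots` are hypothetical helpers.

```typescript
// Collect unique `role "name"` keys from snapshot text, ignoring ref numbers.
function snapshotKeys(yaml: string): string[] {
  const line = /^\s*-\s+(\w+)(?:\s+"([^"]*)")?[^\n]*?\[ref=e\d+\]/;
  const keys: string[] = [];
  for (const raw of yaml.split("\n")) {
    const m = line.exec(raw);
    if (!m) continue;
    const key = `${m[1] ?? ""} "${m[2] ?? ""}"`;
    if (keys.indexOf(key) === -1) keys.push(key);
  }
  return keys;
}

// Report elements that disappeared or appeared between two snapshots.
function diffSnapshots(before: string, after: string): { removed: string[]; added: string[] } {
  const a = snapshotKeys(before);
  const b = snapshotKeys(after);
  return {
    removed: a.filter((k) => b.indexOf(k) === -1),
    added: b.filter((k) => a.indexOf(k) === -1),
  };
}

// Invented example: the button was renamed and the refs were renumbered.
const beforeText = '- button "Submit" [ref=e14]\n- link "Help" [ref=e15]';
const afterText = '- button "Send" [ref=e9]\n- link "Help" [ref=e3]';
const change = diffSnapshots(beforeText, afterText);
```

In this example the diff reports the renamed button in both lists and leaves the Help link out, since only its ref number changed.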
Does this work on my real Chrome profile or a fresh one?
Either, controlled by the PLAYWRIGHT_USE_EXTENSION environment variable. If it is set to true, the bridge appends --extension to the Playwright args (acp-bridge/src/index.ts lines 1029 to 1031), which hooks Playwright MCP into the Chrome Web Store extension called Playwright MCP Bridge and drives your already-running, already-signed-in Chrome. If not, Playwright launches its own fresh Chromium. Real-Chrome mode is the one that makes a test useful for flows behind a login.
How do I know the agent is actually running against my Chrome and not a sandbox?
The overlay. Every page the agent touches gets a full-viewport div with id fazm-overlay at z-index 2147483647, with four animated gradient wings and a centered pill that reads 'Browser controlled by Fazm · Feel free to switch tabs or use other apps'. The overlay uses pointer-events:none, so it is visible but never intercepts clicks. The CSS is in acp-bridge/browser-overlay-init.js, lines 16 to 68. If you see the pill, the agent has control; if you do not, it does not.