SNAPSHOT-FIRST / REAL CHROME / NATIVE CROSSOVER

Automation of a web browser, done by reading the accessibility tree, not the pixels

The SERP for this keyword flattens three very different products into one line: developer frameworks, cloud headless runners, and AI browser replacements. None of them drive the Chrome you are already signed into, none of them click by accessibility ref, and none of them keep working once the task leaves the tab. This page is about the fourth thing: a consumer Mac agent that does all three, with a specific file on disk you can point at to verify.

Matthew Diakonov, Fazm

Published April 21, 2026. 9 min read

Verified from the Fazm source tree

Reads the a11y tree, not screenshots

Drives your real signed-in Chrome

Clicks by ref=eN, not by selector

Reuses open tabs, closes what it opens

Continues into Mail, Finder, WhatsApp

Automation web browser, translated

Not a script. Not a headless container. An agent that reads the tree and keeps going.

Snapshot written to .playwright-mcp/*.yml

Click by ref=eN, not CSS selector

Your real Chrome, your real cookies

Screenshots only when a human would squint

Same agent continues into Mail and Finder

THE CATEGORY THE SERP HIDES

Three things pretending to be one thing

When someone searches "automation web browser" they could mean any of three very different things. They get lumped together on the SERP, and picking the wrong one is most of why automation projects stall.

Developer framework in a headless Chromium

Playwright, Puppeteer, Selenium. You write code, it spawns a fresh Chromium, scripts run, tests assert. Powerful, but the browser has no cookies, no extensions, and a brand-new fingerprint. The first time your flow needs your actual login or hits a site that scores browsers by history, it fails. And you have to write and maintain the code.

Cloud headless runner behind an API

Browserless, Firecrawl, Cloudflare Browser Run, Bright Data. Same headless Chromium, but rented over HTTP. Good for scraping public pages at scale. Bad for anything that wants your real session, because the session is on your Mac and the browser is in a data center. The agent also never leaves the browser; it has no concept of your Mail app.

Agentic-browser replacement

Perplexity Comet, Chrome Auto Browse. A new browser, often AI-native, that runs tasks in its own sandbox. Nice for read-and-summarize. But it is not driving your browser — it is a browser. Your saved tabs, bookmarks, extensions, and logged-in services live in the one you already use.

Desktop agent that attaches to your real Chrome

The fourth, smaller category. A local AI agent plugs into the Chrome window you are already running via a Web Store extension, reads each page as an accessibility-tree YAML snapshot, and continues into native Mac apps through the same message. This is what Fazm ships.

snapshot-first / ref=e17 / real Chrome / real cookies / .playwright-mcp/*.yml / no selectors / no test files / no CI / no fresh fingerprint / continues into Mail / continues into Finder / continues into WhatsApp

THE ANCHOR

The agent clicks by ref, not by pixel

The single most copied design on the "automation web browser" SERP is "take a screenshot, ask a vision model what to click, compute pixel coordinates, send a mouse event." It is expensive in tokens, fragile across themes and zoom levels, and blind to any control the browser is not currently rendering. Fazm's agent does not do this. When it calls the playwright browser_snapshot tool, the Node bridge writes a YAML file under .playwright-mcp/ that names each interactive element with a stable integer ref. The agent then calls browser_click with that ref. No selector, no coordinate, no screenshot.

.playwright-mcp/snapshot.yml

That YAML is what the model reads. "Open the Stripe summary" compiles to browser_click(ref: "e12"). When the page re-renders, the ref changes, the snapshot refreshes, the model picks the new one. A selector-based script would have died at the first redesign; this did not have a selector to break.

tool-call.ts
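A minimal sketch of that flow in TypeScript, assuming snapshot lines of the form: role, quoted name, then [ref=eN]. The sample snapshot text, the parseSnapshot and clickByName helpers, and the ToolCall shape are illustrative assumptions, not Fazm's or Playwright MCP's actual code. Only the browser_click tool name and its ref argument come from the text above.

```typescript
// Illustrative snapshot text (an assumption); real files depend on the page.
const snapshot = `
- heading "Payments" [level=1] [ref=e3]
- link "Invoices" [ref=e9]
- button "Open the Stripe summary" [ref=e12]
`;

interface SnapshotNode {
  role: string;
  name: string;
  ref: string;
}

// Pull the role, accessible name, and ref out of each snapshot line.
function parseSnapshot(text: string): SnapshotNode[] {
  const pattern = /^\s*-\s+(\w+)\s+"([^"]*)".*\[ref=(e\d+)\]/;
  const nodes: SnapshotNode[] = [];
  for (const raw of text.split("\n")) {
    const m = pattern.exec(raw);
    if (!m) continue;
    const [, role = "", name = "", ref = ""] = m;
    nodes.push({ role, name, ref });
  }
  return nodes;
}

// Hypothetical shape for a tool call handed to the bridge.
interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

// Resolve a control by role and name, then click it by ref.
// No CSS selector and no pixel coordinate is involved.
function clickByName(nodes: SnapshotNode[], role: string, name: string): ToolCall {
  const hit = nodes.find((n) => n.role === role && n.name === name);
  if (!hit) throw new Error(`no ${role} named "${name}" in snapshot`);
  return { tool: "browser_click", args: { ref: hit.ref } };
}

const call = clickByName(parseSnapshot(snapshot), "button", "Open the Stripe summary");
```

If the page re-renders and the button comes back as a different ref, the same lookup against the fresh snapshot returns the new ref. Nothing in the caller has to change.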

Screenshot model vs. snapshot model

Same task, two substrates. The difference is a token cost gap on a typical long-form page, plus whether a renamed field breaks the run.

What 'automating a web browser' costs, per step

Screenshot-first vision automation: model re-reads the page as pixels on every action, must localize every control visually, and hallucinates button labels under glare or compression.

Vision tokens per step: 1k to 3k per screenshot
Blind to off-screen controls and background windows
A page redesign invalidates every learned pixel heuristic
Requires a vision-capable model in the hot loop
Cannot see app-private panels (modal behind another modal)

HOW THE TOOL CALL FLOWS

One message, four possible automation surfaces

One agent, four native targets

The hub is the ACP (Agent Control Protocol) bridge, a Node subprocess Fazm spawns at launch. It multiplexes MCP servers: four ship bundled, and you can add your own under ~/.fazm/mcp-servers.json. From the agent's point of view, "click a button in Chrome" and "rename a file in Finder" are two tool calls on the same surface, with the same result contract, and it routes by task, not by app.
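One way to picture "same surface, same result contract" is a small dispatcher, sketched here in TypeScript. The Bridge class, the ToolResult shape, and the stub handlers are assumptions made for illustration, and send_message is a hypothetical tool name. Only the four server names and browser_click come from this page.

```typescript
// The four bundled server names listed on this page.
type ServerName = "playwright" | "macos-use" | "whatsapp" | "google-workspace";

// Hypothetical shared contract: every server takes a tool name plus
// arguments and returns the same result shape.
interface ToolCall {
  server: ServerName;
  tool: string;
  args: Record<string, unknown>;
}

interface ToolResult {
  ok: boolean;
  text: string;
}

type Handler = (tool: string, args: Record<string, unknown>) => ToolResult;

class Bridge {
  private servers = new Map<ServerName, Handler>();

  register(name: ServerName, handler: Handler): void {
    this.servers.set(name, handler);
  }

  // Route a call to whichever server owns it. The caller never
  // needs to know which app ends up doing the work.
  dispatch(call: ToolCall): ToolResult {
    const handler = this.servers.get(call.server);
    if (!handler) {
      return { ok: false, text: `no server registered as ${call.server}` };
    }
    return handler(call.tool, call.args);
  }
}

// Stub handlers stand in for the real MCP servers.
const bridge = new Bridge();
bridge.register("playwright", (tool) => ({ ok: true, text: `playwright:${tool}` }));
bridge.register("macos-use", (tool) => ({ ok: true, text: `macos-use:${tool}` }));

const web = bridge.dispatch({ server: "playwright", tool: "browser_click", args: { ref: "e12" } });
const missing = bridge.dispatch({ server: "whatsapp", tool: "send_message", args: {} });
```

Because every handler returns the same ToolResult, the model can chain a browser step and a native-app step without special-casing either one.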

What happens when you ask Fazm to answer that email thread

WHAT IT LOOKS LIKE IN THE LOG

A real run, boiled down

When you watch the bridge stderr (Fazm writes it to /tmp/fazm-dev.log in dev builds), the same shape repeats for every web task: list tabs, snapshot, click by ref, type, snapshot again, then hand off.

Task: reply to Stripe thread, then save CSV to desktop

What you actually get when you pick each option

Playwright / Puppeteer / Selenium

Write code against a fresh Chromium. No session, no extensions, new fingerprint on every run. You maintain the selectors, the waits, the CI. It works; you are now a test engineer.

Browserless / Firecrawl / Browser Run

Rent a headless Chromium over HTTP. Great for scraping public pages at scale. Useless for 'in my actual Gmail,' and it has no view of your Mac outside the tab.

Perplexity Comet / Chrome Auto Browse

A new AI-native browser that runs tasks in its own sandbox. Strong at read-and-summarize. Not driving your existing Chrome, not touching your other apps.

Fazm

Local Mac app. Attaches to your real Chrome via the Playwright MCP Bridge extension, clicks by ref against YAML accessibility snapshots, and continues into Mail, Finder, Calendar, and WhatsApp through MCP servers registered in the same bridge.

Bardeen / Automa / UI.Vision

Record-and-replay extensions. You click through a flow once, they save DOM selectors, replay later. Fine for five-step repetitive tasks on stable pages. First redesign wins.

Snapshot-first local agent vs. HTTP-API headless browser

The two categories closest to colliding on this keyword. One rents a clean browser, the other drives your actual one.

Feature	Cloud headless runner	Fazm
Runs inside your signed-in Chrome	No, clean headless container	Yes, via Web Store extension
Clicks by accessibility ref	No, CSS selectors / coordinates	Yes, browser_click ref=eN
Reuses your open tabs	No, spawns a new page per run	Yes, browser_tabs list/select first
Sees the desktop outside the browser	No, viewport only	Yes, capture_screenshot + macos-use
Requires a vision-capable model	Often yes (screenshot + OCR)	No, YAML snapshot is text
Can act after the task leaves the tab	No, stops at the browser	Yes, native apps via MCP
Setup time	API key + SDK + container	Install app, paste one token
Where the browser runs	Their data center	Your Mac, your Chrome binary

cheaper per step

Rough token ratio between a YAML accessibility snapshot of the active page region and a 1024px-wide screenshot of the same view.

4 native targets

MCP servers the same bridge routes to: playwright, macos-use, whatsapp, and google-workspace. One agent, four surfaces.

token length

20+

Minimum validated token length in BrowserExtensionSetup.swift line 578. The auth handshake is a loopback base64url string, stored locally under UserDefaults.playwrightExtensionToken.

zero selectors

“Once the model is clicking by accessibility ref, a UI redesign stops being a fire drill. The tree regenerates, the refs change, the next snapshot carries the new ones, and the job runs.”

Fazm runtime, Desktop/Sources/Chat/ChatPrompts.swift line 61
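The token check mentioned above (a base64url string of at least 20 characters) can be approximated as follows. This is a sketch in TypeScript, not the Swift code in BrowserExtensionSetup.swift. The function name, the whitespace trimming, and the exact character class are assumptions.

```typescript
// Minimum length stated on this page; everything else here is assumed.
const MIN_TOKEN_LENGTH = 20;

// Unpadded base64url alphabet: letters, digits, "-" and "_".
const BASE64URL = /^[A-Za-z0-9_-]+$/;

// Returns true when the pasted text plausibly is an extension token.
// Surrounding whitespace is trimmed first, since pasted tokens often
// carry a stray space or newline.
function looksLikeExtensionToken(input: string): boolean {
  const token = input.trim();
  return token.length >= MIN_TOKEN_LENGTH && BASE64URL.test(token);
}

const accepted = looksLikeExtensionToken("  Zm9vYmFyLWJhel9xdXgxMjM0\n");
const tooShort = looksLikeExtensionToken("abc123");
const badChars = looksLikeExtensionToken("this has spaces and is long enough");
```

A shape check like this only catches paste mistakes early. Whether the token is actually valid is decided by the handshake with the extension.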

WHAT THE SAME AGENT CAN REACH

The web browser is the first surface, not the only one

One Fazm agent (ACP + MCP) reaches: Chrome, Mail.app, Finder, Calendar, Notes, Settings, Messages, Gmail, Drive

Tab hygiene: the one rule that makes shared-browser automation usable

The failure mode of every naive "agent in your browser" is the tab explosion. You ask one question, it opens fifteen tabs, leaves them open, and by the end of the day you have a browser with 300 tabs and no idea which ones you actually opened. Fazm's system prompt has an explicit rule against this. It lives at ChatPrompts.swift line 63:

The browser_tabs rule, paraphrased from Desktop/Sources/Chat/ChatPrompts.swift

Before navigating anywhere, call browser_tabs action 'list' and scan for a domain match.
If the domain is already open, switch via browser_tabs action 'select' instead of opening a new tab.
If the current tab matches, reuse it rather than opening a new one at all.
After a task is done, close any tabs the agent itself opened via browser_tabs action 'close'.
Only open multiple tabs if the user explicitly asked for them.
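The rule above reduces to a small decision function. The sketch below is TypeScript written for illustration: the Tab shape, the openedByAgent flag, and the choice to match on hostname with a leading www. stripped are assumptions, since the prompt itself only says "domain match".

```typescript
interface Tab {
  index: number;
  url: string;
  openedByAgent: boolean; // hypothetical bookkeeping flag
}

type TabAction =
  | { kind: "reuse"; index: number }
  | { kind: "select"; index: number }
  | { kind: "open"; url: string };

// Extract the hostname without relying on a URL global.
function hostOf(url: string): string {
  const m = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)/i.exec(url);
  return (m?.[1] ?? "").toLowerCase().replace(/^www\./, "");
}

// Decide what to do before navigating: reuse the current tab,
// switch to an already-open tab, or open a new one as a last resort.
function planNavigation(tabs: Tab[], currentIndex: number, targetUrl: string): TabAction {
  const host = hostOf(targetUrl);
  if (host === "") return { kind: "open", url: targetUrl };

  const current = tabs.find((t) => t.index === currentIndex);
  if (current && hostOf(current.url) === host) {
    return { kind: "reuse", index: current.index };
  }
  const match = tabs.find((t) => hostOf(t.url) === host);
  if (match) return { kind: "select", index: match.index };
  return { kind: "open", url: targetUrl };
}

// After the task, only tabs the agent opened are candidates for closing.
function tabsToClose(tabs: Tab[]): number[] {
  return tabs.filter((t) => t.openedByAgent).map((t) => t.index);
}

const tabs: Tab[] = [
  { index: 0, url: "https://mail.google.com/mail/u/0", openedByAgent: false },
  { index: 1, url: "https://dashboard.stripe.com/payments", openedByAgent: false },
  { index: 2, url: "https://example.com/", openedByAgent: true },
];
const plan = planNavigation(tabs, 0, "https://dashboard.stripe.com/invoices");
```

With Gmail as the current tab and Stripe already open in tab 1, the plan is to select tab 1 rather than open a second Stripe tab. Tabs the user opened are never in the close list.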

How you actually wire it up

Four phases, two buttons you have to click yourself. The rest is automatic: a 2-second polling timer watches for Chrome to appear at /Applications/Google Chrome.app, and a separate timer watches the Chrome profile directory for the extension install.

Fazm browser setup, in four scenes

Scene 1 of 4: Welcome

Short explanation of what browser access does. A single 'Set Up' primary button. Dismissible with no penalty: you can run Fazm without ever enabling the extension and it falls back to a headless Chromium for web tasks.

Frequently asked questions

What does 'automation web browser' usually mean, and how is Fazm different?

It usually means one of three things. One, a dev framework you write code against to drive a headless Chromium (Playwright, Puppeteer, Selenium). Two, a cloud service that rents you a headless browser over an API (Browserless, Firecrawl, Cloudflare Browser Run). Three, an AI browser replacement (Perplexity Comet, Chrome Auto Browse) that runs in a sandbox separate from your Mac. Fazm is none of those. It is a consumer macOS app that attaches a local AI agent to the Chrome window you are already signed into and reads the page as a YAML accessibility snapshot rather than a screenshot. The same agent continues into Mail, Finder, Calendar, and WhatsApp once the browser leg is done. The tool routing for all of this lives in Desktop/Sources/Chat/ChatPrompts.swift lines 56 to 63.

Snapshot versus screenshot: which does Fazm use, and why does it matter?

Snapshot, almost always. When Fazm's agent calls the playwright browser_snapshot tool, the Node bridge writes a YAML file under .playwright-mcp/ that contains the accessibility tree for the page: every role, label, value, and a numbered reference like [ref=e17]. The agent then calls browser_click with ref: 'e17', with no CSS selector, no XY coordinate, and no pixel-reading. Screenshots are a fallback reserved for 'when you need visual confirmation' because, in the words of the system prompt, 'it costs extra tokens.' The choice matters for three reasons. A screenshot loses each control's role. A page redesign breaks a selector, but a ref computed against the live tree survives it. And the accessibility tree includes off-screen and background content that a screenshot cannot capture.

Is this a dev framework or a consumer app?

Consumer. You install a signed, notarized Mac app. The first time you run a flow that needs the browser, Fazm opens an 880x520 setup window that walks you through installing Google Chrome (if missing), adding the Playwright MCP Bridge extension from the Chrome Web Store (ID mmlmfjhmonkocbjadbfplnigmagldckm), clicking the puzzle piece icon to copy the token, and pasting it. Four numbered steps on the left, animated GIFs on the right. No CLI. No npm install. No YAML to hand-edit. If you want a Python SDK you want playwright-python; if you want 'ask the AI to do the thing in your own Chrome,' you want Fazm.

What does the agent actually see, and where is it stored?

After any navigate, click, or type action, the agent gets a compact YAML file on disk at .playwright-mcp/<timestamp>.yml. Each interactive element gets a bracketed ref — [ref=e3], [ref=e17] — that stays stable across the same page load. The agent uses those refs for browser_click, browser_type, browser_fill_form, browser_select_option. Only when a human would need to visually verify something (chart contents, a captcha picture, a layout bug) does it call browser_take_screenshot, and even then the system prompt tells it to prefer capture_screenshot (desktop-wide) over browser_take_screenshot (viewport only), because the agent's real work is on the whole Mac, not inside the browser tab.

Does it run against my real logged-in Chrome, or a fresh headless build?

Your real Chrome. The setup checks for /Applications/Google Chrome.app explicitly, and a 2-second polling timer waits until it appears if you have not installed it. The bridge then sets the env var PLAYWRIGHT_USE_EXTENSION=true and passes your auth token as PLAYWRIGHT_MCP_EXTENSION_TOKEN (wired at Desktop/Sources/Chat/ACPBridge.swift lines 368 to 377). From that point on, every playwright call runs inside the Chrome you have been using for months, with your cookies, your saved logins, your installed extensions, and the browser fingerprint that a site has already seen. There is a non-extension fallback that launches its own Chromium if you skip the setup, but you lose the session benefit.
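The two environment variables named above can be pictured as a small helper that decides which mode the playwright server starts in. This TypeScript sketch is an illustration only: the real wiring is Swift code in ACPBridge.swift, and the function name and the empty-object fallback are assumptions.

```typescript
// Build the extra environment for the playwright MCP server.
// With a token, the server attaches to the user's real Chrome through
// the extension. Without one, no extension variables are set and the
// fallback described above (a separately launched Chromium) applies.
function playwrightEnv(token: string | undefined): Record<string, string> {
  const trimmed = (token ?? "").trim();
  if (trimmed === "") return {};
  return {
    PLAYWRIGHT_USE_EXTENSION: "true",
    PLAYWRIGHT_MCP_EXTENSION_TOKEN: trimmed,
  };
}

// Placeholder token value, not a real one.
const withToken = playwrightEnv("Zm9vYmFyLWJhel9xdXgxMjM0");
const withoutToken = playwrightEnv(undefined);
```

Keeping the decision in one place makes the trade-off explicit: the session benefits (cookies, logins, fingerprint) exist only on the branch that sets both variables.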

How does the agent handle tabs? Does it open ten new ones per task?

No. The routing prompt has an explicit tab-hygiene rule (ChatPrompts.swift line 63): before navigating to a site, the agent calls browser_tabs action 'list' to check if you already have it open. If a match is found by domain, it switches via browser_tabs action 'select' instead of opening a new tab. If the current tab matches, it reuses that one. After finishing a task, tabs the agent opened are closed with browser_tabs action 'close'. Multiple tabs are opened only when the user explicitly asks for them. This is the difference between 'an agent sharing your browser' and 'an agent living in its own browser.'

When does web browser automation need to leave the browser?

Constantly, if the job is a real workflow. Paying a bill ends in Mail. Scheduling something ends in Calendar. Following up ends in WhatsApp or iMessage. Downloading a file ends in Finder. The problem with every cloud-headless 'browser automation' product is that the moment the workflow crosses that boundary, the automation stops and you take over manually. Fazm registers four automation targets in the same bridge: playwright (Chrome), macos-use (Finder, Settings, Mail, Notes, any AX-compliant app via the native mcp-server-macos-use binary), whatsapp (the native Catalyst app), and google-workspace (Gmail, Drive, Calendar APIs). One agent, one message, four surfaces.

What does the developer experience look like if I already know Playwright?

Familiar, but inverted. The Playwright API surface is there — browser_navigate, browser_snapshot, browser_click, browser_fill_form, browser_type, browser_take_screenshot, browser_tabs — but you never write against it yourself. You describe the result you want, the model picks which tools to call, reads each response, and composes the next step. There is no test file, no BeforeAll, no Page Object, no CI. If the page layout shifts in a way a scripted selector would not survive, the model re-reads the new snapshot and picks the new ref. If the job was 'reply to that thread,' it just runs.

Why not just use Perplexity Comet or Chrome Auto Browse?

They are real products; they just solve a different problem. Perplexity Comet and Chrome Auto Browse are agentic browsers. They replace your browser with theirs, run tasks in their container, and show you a summary. Fazm does not replace anything: it drives the Chrome you already use, keeps your tabs, keeps your bookmarks, and also drives your native apps. If your job is read-and-summarize the web, a new AI browser is fine. If your job is 'do a multi-app thing on my Mac that happens to involve the web,' a fresh browser is the wrong primitive.

Is anything about this cloud-dependent?

The model is. Fazm's default is the Claude API, but the ACP bridge reads a user-configurable ANTHROPIC_BASE_URL (Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382) from the customApiEndpoint UserDefault. Point it at a local shim that speaks the Anthropic API in front of Ollama, LM Studio, or vLLM and the model call is local too. The automation layer (playwright, macos-use, whatsapp) is all local either way: the browser extension talks to the local Node bridge over a loopback token, and the native mcp-server-macos-use binary ships inside Fazm.app/Contents/MacOS and never phones home.
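As a rough TypeScript illustration of that override, assuming the stored setting is either empty or a URL string: the modelEnv helper and the loopback address below are hypothetical, and only the ANTHROPIC_BASE_URL variable and the customApiEndpoint setting come from the text above.

```typescript
// Map the stored customApiEndpoint setting to the model environment.
// An empty or missing setting leaves the default endpoint in place.
function modelEnv(customApiEndpoint: string | null): Record<string, string> {
  const endpoint = (customApiEndpoint ?? "").trim();
  return endpoint === "" ? {} : { ANTHROPIC_BASE_URL: endpoint };
}

// Hypothetical local shim address, chosen for illustration only.
const localModel = modelEnv("http://127.0.0.1:8080");
const defaultModel = modelEnv(null);
```

Only the model call moves when this setting changes. The automation servers are unaffected either way.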
