Automation of a web browser, done by reading the accessibility tree, not the pixels
The SERP for this keyword flattens three very different products into one line: developer frameworks, cloud headless runners, and AI browser replacements. None of them drive the Chrome you are already signed into, none of them click by accessibility ref, and none of them keep working once the task leaves the tab. This page is about the fourth thing: a consumer Mac agent that does all three, with a specific file on disk you can point at to verify.
THE CATEGORY THE SERP HIDES
Three things pretending to be one thing
When someone searches "automation web browser" they could mean any of three very different things. They get lumped together on the SERP, and picking the wrong one is most of why automation projects stall.
Developer framework in a headless Chromium
Playwright, Puppeteer, Selenium. You write code, it spawns a fresh Chromium, scripts run, tests assert. Powerful, but the browser has no cookies, no extensions, and a brand-new fingerprint. The first time your flow needs your actual login or hits a site that scores browsers by history, it fails. And you have to write and maintain the code.
Cloud headless runner behind an API
Browserless, Firecrawl, Cloudflare Browser Run, Bright Data. Same headless Chromium, but rented over HTTP. Good for scraping public pages at scale. Bad for anything that wants your real session, because the session is on your Mac and the browser is in a data center. The agent also never leaves the browser; it has no concept of your Mail app.
Agentic-browser replacement
Perplexity Comet, Chrome Auto Browse. A new browser, often AI-native, that runs tasks in its own sandbox. Nice for read-and-summarize. But it is not driving your browser — it is a browser. Your saved tabs, bookmarks, extensions, and logged-in services live in the one you already use.
Desktop agent that attaches to your real Chrome
The fourth, smaller category. A local AI agent plugs into the Chrome window you are already running via a Web Store extension, reads each page as an accessibility-tree YAML snapshot, and continues into native Mac apps through the same message. This is what Fazm ships.
THE ANCHOR
The agent clicks by ref, not by pixel
The single most copied design on the "automation web browser" SERP is "take a screenshot, ask a vision model what to click, compute pixel coordinates, send a mouse event." It is expensive in tokens, fragile across themes and zoom levels, and blind to any control the browser is not currently rendering. Fazm's agent does not do this. When it calls the playwright browser_snapshot tool, the Node bridge writes a YAML file under .playwright-mcp/ that labels each interactive element with a short ref such as e12, which stays stable for the life of that page load. The agent then calls browser_click with that ref. No selector, no coordinate, no screenshot.
That YAML is what the model reads. "Open the Stripe summary" compiles to browser_click(ref: "e12"). When the page re-renders, the ref changes, the snapshot refreshes, the model picks the new one. A selector-based script would have died at the first redesign; this did not have a selector to break.
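As a rough illustration of what "click by ref" means in practice, the sketch below indexes the refs in a snapshot and builds the arguments for a click. The snapshot text is a made-up sample in the general shape of an accessibility snapshot, not a file captured from Fazm, and `index_refs`, `find_ref`, and `click_call` are hypothetical helper names. The real browser_click tool may expect more fields than the single `ref` shown here.

```python
import re

# Illustrative snapshot text (assumption); real files under .playwright-mcp/ will differ.
SAMPLE_SNAPSHOT = """\
- navigation [ref=e2]:
  - link "Dashboard" [ref=e5]
  - link "Payments" [ref=e7]
- main [ref=e9]:
  - button "Open the Stripe summary" [ref=e12]
  - textbox "Search" [ref=e14]
"""

# A role, an optional quoted label, then anything up to the [ref=eN] marker.
LINE_RE = re.compile(
    r'^\s*-\s+(?P<role>\w+)(?:\s+"(?P<label>[^"]*)")?.*\[ref=(?P<ref>e\d+)\]'
)


def index_refs(snapshot: str) -> list:
    """Return one dict (role, label, ref) per snapshot line that carries a ref."""
    elements = []
    for line in snapshot.splitlines():
        match = LINE_RE.match(line)
        if match:
            elements.append(match.groupdict())
    return elements


def find_ref(snapshot: str, role: str, label: str):
    """Look up the ref for an element by role and label, or None if absent."""
    for element in index_refs(snapshot):
        if element["role"] == role and element["label"] == label:
            return element["ref"]
    return None


def click_call(ref: str) -> dict:
    """Build a tool call that clicks by ref: no selector, no coordinates."""
    return {"tool": "browser_click", "arguments": {"ref": ref}}


ref = find_ref(SAMPLE_SNAPSHOT, "button", "Open the Stripe summary")
print(click_call(ref))
```

Because the lookup keys on role and label, a re-render that renumbers the refs only means re-running `find_ref` against the next snapshot.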
Screenshot model vs. snapshot model
Same task, two substrates. The difference is a gap in token cost for a typical long-form page, plus whether a renamed field breaks the run.
What 'automating a web browser' costs, per step
Screenshot-first vision automation: model re-reads the page as pixels on every action, must localize every control visually, and hallucinates button labels under glare or compression.
- Vision tokens per step: 1k to 3k per screenshot
- Blind to off-screen controls and background windows
- A page redesign invalidates every learned pixel heuristic
- Requires a vision-capable model in the hot loop
- Cannot see app-private panels (modal behind another modal)
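To make the per-step figure concrete, here is a small arithmetic sketch that scales the 1k to 3k vision tokens per screenshot quoted above across a multi-step run. The function name and the 12-step run are illustrative assumptions, and nothing here estimates the snapshot side.

```python
def screenshot_token_range(steps: int, low: int = 1_000, high: int = 3_000) -> tuple:
    """Cumulative vision tokens if every step re-reads the page as one screenshot."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return steps * low, steps * high


# A hypothetical 12-step task: between 12,000 and 36,000 vision tokens.
print(screenshot_token_range(12))  # (12000, 36000)
```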
HOW THE TOOL CALL FLOWS
One message, four possible automation surfaces
One agent, four native targets
The hub is the ACP (Agent Control Protocol) bridge, a Node subprocess Fazm spawns at launch. It multiplexes MCP servers: four ship bundled, and you can add your own under ~/.fazm/mcp-servers.json. From the agent's point of view, "click a button in Chrome" and "rename a file in Finder" are two tool calls on the same surface, with the same result contract, and it routes by task, not by app.
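The idea of one surface with one result contract can be sketched as a small dispatcher. This is a minimal Python model, not Fazm's Node bridge: `Bridge`, `ToolResult`, `demo_bridge`, and the `rename_file` tool name are all hypothetical, and the handlers are stubs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class ToolResult:
    """Uniform result contract, whichever server handled the call."""
    ok: bool
    content: Any = None
    error: Optional[str] = None


Handler = Callable[[dict], Any]


class Bridge:
    """Maps tool names to (server, handler) so callers never pick a server."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tuple[str, Handler]] = {}

    def register(self, server: str, tool: str, handler: Handler) -> None:
        self._tools[tool] = (server, handler)

    def server_for(self, tool: str) -> Optional[str]:
        entry = self._tools.get(tool)
        return entry[0] if entry else None

    def call(self, tool: str, arguments: dict) -> ToolResult:
        entry = self._tools.get(tool)
        if entry is None:
            return ToolResult(ok=False, error=f"unknown tool: {tool}")
        _server, handler = entry
        try:
            return ToolResult(ok=True, content=handler(arguments))
        except Exception as exc:  # surface handler failures in the same contract
            return ToolResult(ok=False, error=str(exc))


def demo_bridge() -> Bridge:
    bridge = Bridge()
    # Stub handlers; a real bridge would forward these to MCP servers.
    bridge.register("playwright", "browser_click", lambda a: f"clicked {a['ref']}")
    bridge.register("macos-use", "rename_file", lambda a: f"renamed to {a['name']}")
    return bridge


bridge = demo_bridge()
print(bridge.call("browser_click", {"ref": "e12"}))
print(bridge.call("rename_file", {"name": "invoice.pdf"}))
```

The caller sees the same `ToolResult` shape from the browser tool and the desktop tool, which is what lets the model route by task rather than by app.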
What happens when you ask Fazm to answer that email thread
WHAT IT LOOKS LIKE IN THE LOG
A real run, boiled down
When you watch the bridge stderr (Fazm writes it to /tmp/fazm-dev.log in dev builds), the same shape repeats for every web task: list tabs, snapshot, click by ref, type, snapshot again, then hand off.
What you actually get when you pick each option
Playwright / Puppeteer / Selenium
Write code against a fresh Chromium. No session, no extensions, new fingerprint on every run. You maintain the selectors, the waits, the CI. It works; you are now a test engineer.
Browserless / Firecrawl / Browser Run
Rent a headless Chromium over HTTP. Great for scraping public pages at scale. Useless for 'in my actual Gmail,' and it has no view of your Mac outside the tab.
Perplexity Comet / Chrome Auto Browse
A new AI-native browser that runs tasks in its own sandbox. Strong at read-and-summarize. Not driving your existing Chrome, not touching your other apps.
Fazm
Local Mac app. Attaches to your real Chrome via the Playwright MCP Bridge extension, clicks by ref against YAML accessibility snapshots, and continues into Mail, Finder, Calendar, and WhatsApp through MCP servers registered in the same bridge.
Bardeen / Automa / UI.Vision
Record-and-replay extensions. You click through a flow once, they save DOM selectors, replay later. Fine for five-step repetitive tasks on stable pages. First redesign wins.
Snapshot-first local agent vs. HTTP-API headless browser
The two categories closest to colliding on this keyword. One rents a clean browser, the other drives your actual one.
| Feature | Cloud headless runner | Fazm |
|---|---|---|
| Runs inside your signed-in Chrome | No, clean headless container | Yes, via Web Store extension |
| Clicks by accessibility ref | No, CSS selectors / coordinates | Yes, browser_click ref=eN |
| Reuses your open tabs | No, spawns a new page per run | Yes, browser_tabs list/select first |
| Sees the desktop outside the browser | No, viewport only | Yes, capture_screenshot + macos-use |
| Requires a vision-capable model | Often yes (screenshot + OCR) | No, YAML snapshot is text |
| Can act after the task leaves the tab | No, stops at the browser | Yes, native apps via MCP |
| Setup time | API key + SDK + container | Install app, paste one token |
| Where the browser runs | Their data center | Your Mac, your Chrome binary |
Rough token ratio between a YAML accessibility snapshot of the active page region and a 1024px-wide screenshot of the same view.
MCP servers the same bridge routes to: playwright, macos-use, whatsapp, and google-workspace. One agent, four surfaces.
Minimum validated token length in BrowserExtensionSetup.swift line 578. The auth handshake is a loopback base64url string, stored locally under UserDefaults.playwrightExtensionToken.
“Once the model is clicking by accessibility ref, a UI redesign stops being a fire drill. The tree regenerates, the refs change, the next snapshot carries the new ones, and the job runs.”
Fazm runtime, Desktop/Sources/Chat/ChatPrompts.swift line 61
WHAT THE SAME AGENT CAN REACH
The web browser is the first surface, not the only one
Tab hygiene: the one rule that makes shared-browser automation usable
The failure mode of every naive "agent in your browser" is the tab explosion. You ask one question, it opens fifteen tabs, leaves them open, and by the end of the day you have a browser with 300 tabs and no idea which ones you actually opened. Fazm's system prompt has an explicit rule against this. It lives at ChatPrompts.swift line 63:
The browser_tabs rule, paraphrased from Desktop/Sources/Chat/ChatPrompts.swift
- Before navigating anywhere, call browser_tabs action 'list' and scan for a domain match.
- If the domain is already open, switch via browser_tabs action 'select' instead of opening a new tab.
- If the current tab matches, reuse it rather than opening a new one at all.
- After a task is done, close any tabs the agent itself opened via browser_tabs action 'close'.
- Only open multiple tabs if the user explicitly asked for them.
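The first three rules above amount to a small decision function. The sketch below is one way to express it, assuming "domain match" means an exact hostname match after dropping a leading `www.`; the prompt itself does not define the match, and `plan_navigation` and its return labels are hypothetical names rather than browser_tabs arguments.

```python
from urllib.parse import urlparse


def domain(url: str) -> str:
    """Hostname without a leading 'www.' (assumed definition of a domain match)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def plan_navigation(target_url: str, tabs: list, current_index: int) -> tuple:
    """Decide how to reach target_url given the URLs of the open tabs.

    Returns ("reuse", i) if the current tab already matches, ("select", i) if
    another open tab matches, or ("open", None) if no tab matches.
    """
    target = domain(target_url)
    if 0 <= current_index < len(tabs) and domain(tabs[current_index]) == target:
        return ("reuse", current_index)
    for index, url in enumerate(tabs):
        if domain(url) == target:
            return ("select", index)
    return ("open", None)


open_tabs = [
    "https://mail.google.com/mail/u/0/",
    "https://dashboard.stripe.com/payments",
]
print(plan_navigation("https://dashboard.stripe.com/balance", open_tabs, 0))
```

Only the `("open", None)` branch creates a tab, so that is the only branch whose tab the agent would later need to close.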
How you actually wire it up
Four phases, two buttons you have to click yourself. The rest is automatic: a 2-second polling timer watches for Chrome to appear at /Applications/Google Chrome.app, and a separate timer watches the Chrome profile directory for the extension install.
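The polling step can be pictured with a short sketch. Fazm's setup code is Swift; this Python version only models the behavior described above, and `wait_for_path` with its injectable `exists` and `sleep` parameters is a hypothetical helper added so the loop can be exercised without waiting.

```python
import os
import time


def wait_for_path(path, interval=2.0, timeout=None,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll until path exists. Returns True once it appears, False on timeout."""
    waited = 0.0
    while not exists(path):
        if timeout is not None and waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True


if __name__ == "__main__":
    # Checks every 2 seconds and gives up after 10 so the demo cannot hang.
    found = wait_for_path("/Applications/Google Chrome.app", interval=2.0, timeout=10.0)
    print("Chrome found" if found else "Chrome not installed yet")
```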
Fazm browser setup, in four scenes
Frequently asked questions
What does 'automation web browser' usually mean, and how is Fazm different?
It usually means one of three things. One, a dev framework that writes code against a headless Chromium (Playwright, Puppeteer, Selenium). Two, a cloud service that rents you a headless browser over an API (Browserless, Firecrawl, Cloudflare Browser Run). Three, an AI browser replacement (Perplexity Comet, Chrome Auto Browse) that runs in a sandbox separate from your Mac. Fazm is none of those. It is a consumer macOS app that attaches a local AI agent to the Chrome window you are already signed into, reads the page as a YAML accessibility snapshot rather than a screenshot, and the same agent continues into Mail, Finder, Calendar, and WhatsApp once the browser leg is done. The tool routing for all of this lives in Desktop/Sources/Chat/ChatPrompts.swift lines 56 to 63.
Snapshot versus screenshot: which does Fazm use, and why does it matter?
Snapshot, almost always. When Fazm's agent calls the playwright browser_snapshot tool, the Node bridge writes a YAML file under .playwright-mcp/ that contains the accessibility tree for the page: every role, label, value, and an integer reference like [ref=e17]. The agent then calls browser_click with ref: 'e17' — no CSS selector, no XY coordinate, no pixel-reading. Screenshots are a fallback reserved for 'when you need visual confirmation' because, in the words of the system prompt, 'it costs extra tokens.' The choice matters because screenshots lose the button's role, a page redesign breaks a selector but a ref computed against the live tree survives it, and accessibility trees see background content that a screenshot cannot crop into.
Is this a dev framework or a consumer app?
Consumer. You install a signed, notarized Mac app. The first time you run a flow that needs the browser, Fazm opens an 880x520 setup window that walks you through installing Google Chrome (if missing), adding the Playwright MCP Bridge extension from the Chrome Web Store (ID mmlmfjhmonkocbjadbfplnigmagldckm), clicking the puzzle piece icon to copy the token, and pasting it. Four numbered steps on the left, animated GIFs on the right. No CLI. No npm install. No YAML to hand-edit. If you want a Python SDK you want playwright-python; if you want 'ask the AI to do the thing in your own Chrome,' you want Fazm.
What does the agent actually see, and where is it stored?
After any navigate, click, or type action, the agent gets a compact YAML file on disk at .playwright-mcp/<timestamp>.yml. Each interactive element gets a bracketed ref — [ref=e3], [ref=e17] — that stays stable across the same page load. The agent uses those refs for browser_click, browser_type, browser_fill_form, browser_select_option. Only when a human would need to visually verify something (chart contents, a captcha picture, a layout bug) does it call browser_take_screenshot, and even then the system prompt tells it to prefer capture_screenshot (desktop-wide) over browser_take_screenshot (viewport only), because the agent's real work is on the whole Mac, not inside the browser tab.
Does it run against my real logged-in Chrome, or a fresh headless build?
Your real Chrome. The setup checks for /Applications/Google Chrome.app explicitly, and a 2-second polling timer waits until it appears if you have not installed it. The bridge then sets the env var PLAYWRIGHT_USE_EXTENSION=true and passes your auth token as PLAYWRIGHT_MCP_EXTENSION_TOKEN (wired at Desktop/Sources/Chat/ACPBridge.swift lines 368 to 377). From that point on, every playwright call runs inside the Chrome you have been using for months, with your cookies, your saved logins, your installed extensions, and the browser fingerprint that a site has already seen. There is a non-extension fallback that launches its own Chromium if you skip the setup, but you lose the session benefit.
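The two environment variables named above could be assembled like this before the bridge process is spawned. This is a sketch only: `build_bridge_env` is a hypothetical helper, the character check merely reflects that the token is described elsewhere on this page as a base64url string, and the minimum-length check Fazm performs is deliberately not reproduced.

```python
import os
import re

# base64url alphabet, with optional '=' padding at the end.
BASE64URL_RE = re.compile(r"[A-Za-z0-9_-]+=*")


def build_bridge_env(token: str, base_env=None) -> dict:
    """Return a copy of the environment with the extension settings added."""
    token = token.strip()
    if not BASE64URL_RE.fullmatch(token):
        raise ValueError("token does not look like a base64url string")
    env = dict(os.environ if base_env is None else base_env)
    env["PLAYWRIGHT_USE_EXTENSION"] = "true"
    env["PLAYWRIGHT_MCP_EXTENSION_TOKEN"] = token
    return env


# Placeholder token for illustration; a real one comes from the extension popup.
env = build_bridge_env("abc-DEF_123", base_env={})
print(sorted(env))
```

The resulting dict can be passed as the `env` argument of `subprocess.Popen` when launching a child process.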
How does the agent handle tabs? Does it open ten new ones per task?
No. The routing prompt has an explicit tab-hygiene rule (ChatPrompts.swift line 63): before navigating to a site, the agent calls browser_tabs action 'list' to check if you already have it open. If a match is found by domain, it switches via browser_tabs action 'select' instead of opening a new tab. If the current tab matches, it reuses that one. After finishing a task, tabs the agent opened are closed with browser_tabs action 'close'. Multiple tabs are opened only when the user explicitly asks for them. This is the difference between 'an agent sharing your browser' and 'an agent living in its own browser.'
When does web browser automation need to leave the browser?
Constantly, if the job is a real workflow. Paying a bill ends in Mail. Scheduling something ends in Calendar. Following up ends in WhatsApp or iMessage. Downloading a file ends in Finder. The problem with every cloud-headless 'browser automation' product is that the moment the workflow crosses that boundary, the automation stops and you take over manually. Fazm registers four automation targets in the same bridge: playwright (Chrome), macos-use (Finder, Settings, Mail, Notes, any AX-compliant app via the native mcp-server-macos-use binary), whatsapp (the native Catalyst app), and google-workspace (Gmail, Drive, Calendar APIs). One agent, one message, four surfaces.
What does the developer experience look like if I already know Playwright?
Familiar, but inverted. The Playwright API surface is there — browser_navigate, browser_snapshot, browser_click, browser_fill_form, browser_type, browser_take_screenshot, browser_tabs — but you never write against it yourself. You describe the result you want, the model picks which tools to call, reads each response, and composes the next step. There is no test file, no BeforeAll, no Page Object, no CI. If the page layout shifts in a way a scripted selector would not survive, the model re-reads the new snapshot and picks the new ref. If the job was 'reply to that thread,' it just runs.
Why not just use Perplexity Comet or Chrome Auto Browse?
They are real products, they are just solving a different problem. Perplexity Comet and Chrome Auto Browse are agentic browsers. They replace your browser with theirs, run tasks in their container, and show you a summary. Fazm does not replace anything: it drives the Chrome you already use, keeps your tabs, keeps your bookmarks, and also drives your native apps. If your job is read-and-summarize the web, a new AI browser is fine. If your job is 'do a multi-app thing on my Mac that happens to involve the web,' a fresh browser is the wrong primitive.
Is anything about this cloud-dependent?
The model is. Fazm's default is the Claude API, but the ACP bridge reads a user-configurable ANTHROPIC_BASE_URL (Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382) from the customApiEndpoint UserDefault. Point it at a local shim that accepts Anthropic-style requests in front of Ollama, LM Studio, or vLLM and the model call is local too. The automation layer (playwright, macos-use, whatsapp) is all local either way: the browser extension talks to the local Node bridge over a loopback token, and the native mcp-server-macos-use binary ships inside Fazm.app/Contents/MacOS and never phones home.