The honest taxonomy of browser automation tools, including the category the lists never name
Every guide on this subject lists the same twelve projects and calls it coverage. The projects are real, but the taxonomy is wrong. There are four architecturally different categories, and the line between the third (click by pixel) and the fourth (click by ref) is the split that actually predicts which tool breaks under production load. This page walks through all four, names tools in each, and shows where Fazm sits, with line numbers.
THE SHAPE OF THE MARKET
Four categories, not one
If you read five of the ranking guides on this subject back to back, you notice they all list the same projects, and almost none of them split the list along architectural lines. The honest split is not by license or popularity. It is by how the tool decides what to click, and that question splits everything into exactly four groups.
1. Code frameworks
You import a library and write a script against a browser driver. Deterministic, fast, headless-friendly, excellent in CI. Bound to a tab. Examples: Playwright, Puppeteer, Selenium, Cypress, WebDriverIO, Nightwatch, TestCafe.
2. Record and replay
Cloud testing suites that record human sessions and let non-engineers author flows in a low-code UI. Great for QA teams, weaker when the underlying DOM shifts. Examples: TestRigor, Mabl, Katalon, Ghost Inspector, BrowserStack Low Code.
3. Vision AI agents
A multimodal model looks at a screenshot of the page, labels buttons, clicks by coordinates. Works without selectors. Breaks on re-renders and ambiguous controls. Examples: Browser-Use, Nanobrowser, Skyvern, Multi-On, Adept ACT-1.
4. Accessibility tree agents
The agent reads the structured accessibility tree of the current surface (browser tab, native app, dialog) and asks the language model to pick a ref. No pixel math. Can cross out of the browser into the OS. Fazm is the consumer flavor of this.
Almost no public list names category 4 as distinct from category 3, which is why tools that read the accessibility tree get lumped in with tools that read pixels. They are not the same thing, and they fail under different conditions.
THE ANCHOR FACT
One file, one line, five names, one Set
This is the thing that makes Fazm structurally different from everything else on this topic. In the bridge that wires the Fazm Mac app to the Claude Agent SDK, there is a single source of truth for which tools are built in. It is a JavaScript Set with five names in it. The browser is one of them. Not the first. Not the whole product.
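That one line, as the FAQ on this page quotes it from acp-bridge/src/index.ts, looks like this; the membership checks below it are added here for illustration:

```typescript
// The single source of truth for built-in tools, as quoted from
// acp-bridge/src/index.ts line 1266 elsewhere on this page.
const BUILTIN_MCP_NAMES = new Set([
  "fazm_tools",
  "playwright",
  "macos-use",
  "whatsapp",
  "google-workspace",
]);

// Membership is the whole routing contract: a tool either is a peer or it is not.
console.log(BUILTIN_MCP_NAMES.size);              // 5
console.log(BUILTIN_MCP_NAMES.has("playwright")); // true
```

Five strings, one of which is the browser. That is the anchor fact the rest of this page keeps returning to.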
Every action the agent takes is routed through one of these five peers. If the sentence is "email the PDF to Sam," the agent picks google-workspace. If it is "rename the folder in Finder," macos-use. If it is "fill the form on this site," playwright. Same agent, same conversation, different peer. A library cannot do this; a library starts from the assumption that the world ends at the URL bar.
THE FIVE PEERS, AS AN ORBIT
The agent at the center, five tools around it
The visual model is not a stack with the browser on top. It is a hub and spokes. The agent decides which peer to call for each step of your request.
[Diagram: the agent at the hub, with the five MCP peers (fazm_tools, playwright, macos-use, whatsapp, google-workspace) as spokes]
The five names are exactly the five strings in the Set on line 1266. Read the source yourself; this page is just a mirror.
THE NAMED TOOLS IN EACH CATEGORY
Who lives in each of the four buckets
Most commentary on this subject is either unfiltered listicle or a thin review of one tool. Here is the compressed version. Pick the row that matches your task shape, not the row with the most GitHub stars.
CATEGORY 1 / CODE FRAMEWORKS
Playwright, Puppeteer, Selenium, Cypress, WebDriverIO, TestCafe, Nightwatch
You write code. The library drives a browser. Deterministic, fast, excellent for CI. Playwright is the modern favorite and what Fazm uses under the hood for its browser leg. Selenium has the longest track record. Cypress is the QA darling. All of them stop the moment your task leaves the tab. If you are an engineer and your scenario is "re-run this exact flow forever," this is your category. If you wanted a tool your parents could use, this is not it.
CATEGORY 2 / RECORD AND REPLAY
TestRigor, Mabl, Katalon, Ghost Inspector, BrowserStack Low Code
Cloud QA suites. Record a human session, replay it with some heuristics for handling minor DOM changes. Great for non-engineer test authors on stable apps. Weaker when the app ships a design refresh. Built for regression, not for new tasks. They are genuinely useful for the team that owns the app; they are not useful for the end user who wants "do this thing on the internet for me."
CATEGORY 3 / VISION AI AGENTS
Browser-Use, Nanobrowser, Skyvern, Multi-On, Adept ACT-1
A vision model looks at a screenshot, labels buttons, clicks by pixel coordinates. Works with zero selectors, which is powerful. The stability failure mode is mislabels: two similar controls, an overlapping popover, or a scroll that shifts the frame can all make the click land on the wrong thing. These tools ship to production and work, but they burn tokens on every frame and the error budget is real. The right choice when the DOM is hostile or unavailable.
CATEGORY 4 / ACCESSIBILITY TREE AGENTS (where Fazm lives)
Fazm, plus a handful of research projects
The agent reads the accessibility tree of the current surface. On the web, that is the Playwright MCP’s YAML snapshot. On the desktop, it is the macOS AX API (AXUIElementCreateApplication, AXUIElementCopyAttributeValue). Every element has a role, a label, bounds, and a stable ref. The language model picks the ref. No pixel math, no OCR, no vision-step token cost. The tradeoff is it will not work on pure canvas or heavily custom-drawn UIs. The upside is the same primitive works equally well on Chrome, Finder, Mail, WhatsApp, and System Settings, which is how Fazm covers the whole task instead of just the tab.
PIXEL VS REF, AS A DIAGRAM
What the agent actually sees
The difference between a vision agent and an accessibility tree agent is not subtle. A vision agent receives an image. An accessibility tree agent receives a typed tree. The model is picking targets from one or the other, and that choice shapes the failure mode.
| | Category 3 input | Category 4 input |
|---|---|---|
| What the model receives | A base64 screenshot | A YAML snapshot |
| Why a click lands | The model guessed pixel coordinates from an image | The ref is a stable pointer into the tree |
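A minimal sketch of why the ref survives a re-render and the pixel does not. The snapshot shape, element names, and helper functions here are illustrative, not Fazm's actual types:

```typescript
// Illustrative only: a toy snapshot keyed by ref, the way a category 4
// agent addresses targets. Names and shapes are hypothetical.
type AxNode = { ref: string; role: string; name: string };

const snapshot: AxNode[] = [
  { ref: "e1", role: "textbox", name: "Email" },
  { ref: "e4", role: "button", name: "Submit" },
];

// Category 4: resolve by ref. A re-render that moves the button on screen
// does not change its ref, so the lookup still finds the same element.
function clickByRef(tree: AxNode[], ref: string): AxNode | undefined {
  return tree.find((n) => n.ref === ref);
}

// Category 3: resolve by coordinates. If the layout shifted 40px, the
// guessed point now lands on whatever happens to occupy it.
function clickByPixel(x: number, y: number): [number, number] {
  return [x, y]; // nothing ties the point back to an element
}

console.log(clickByRef(snapshot, "e4")?.name); // "Submit"
```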
HOW FAZM COMPARES, ROW BY ROW
| Feature | Typical browser automation tool | Fazm (category 4) |
|---|---|---|
| Install model | npm install / pip install / CLI | Signed Mac app, double-click to run |
| Scope of control | Single browser tab | Any AX-compliant Mac app, plus a browser tab |
| How the agent picks targets | CSS / XPath selectors or pixel coordinates | Structural refs (ref=e4) from the accessibility tree |
| Screenshots in agent context | Often included (vision agents) | Stripped at launch (--image-responses omit, line 1033) |
| User interface | Code editor or low-code UI | English sentence in a chat window |
| Cross-app tasks | You write the glue between tools | Agent routes across five MCP peers automatically |
| Where tools are declared | Varies per framework | One JavaScript Set on line 1266 of acp-bridge/src/index.ts |
| Works with your real Chrome profile | Sometimes (persistent context flags) | Yes, via the Chrome Web Store bridge extension |
WHAT IT LOOKS LIKE AT RUNTIME
One sentence, three peers
Here is a task that no browser-only tool can do end to end. It leaves Gmail, lands in a vendor portal, comes back to Finder, finishes in Numbers. The agent stays the same; the peer changes at each step.
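The routing for a run like that can be written down as data. The peer names below are the real Set members; the step schema and the task wording are invented here for illustration:

```typescript
// Hypothetical step list for a cross-app receipt task.
// Peer names are real; the Step shape is not Fazm's plan format.
type Step = {
  peer: "google-workspace" | "playwright" | "macos-use";
  action: string;
};

const run: Step[] = [
  { peer: "google-workspace", action: "search Gmail for the receipt, download the PDF" },
  { peer: "playwright", action: "upload the PDF in the vendor portal tab" },
  { peer: "macos-use", action: "rename the file in Finder, log the total in Numbers" },
];

const peersUsed = new Set(run.map((s) => s.peer));
console.log(peersUsed.size); // 3
```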
Three of the five peers in one run. No glue code, no selectors, no pixel math.
WHY THE SCREENSHOT FLAG MATTERS
The three flags that keep the agent honest
When the Playwright MCP is launched, three specific flags turn it from a vision tool into a tree tool. Each flag is explained below; if you want to verify the exact invocation, it is in acp-bridge/src/index.ts at line 1033.
What each flag does
- --output-mode file tells Playwright MCP to write the page snapshot to a .yml file on disk rather than inlining it in the tool response.
- --image-responses omit drops any base64-encoded screenshots the browser tool would have returned, so they never enter the model context.
- --output-dir /tmp/playwright-mcp pins where the snapshots live so the agent can reference them by path if it wants to re-read.
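Those three flags, assembled the way a launcher might pass them. The flag strings are the ones documented above; the surrounding function is a sketch, not the bridge's actual code:

```typescript
// Sketch: build the Playwright MCP argument list with the three flags
// described above. The helper name is hypothetical.
function playwrightMcpArgs(outputDir: string): string[] {
  return [
    "--output-mode", "file",     // snapshot goes to a .yml on disk
    "--image-responses", "omit", // base64 screenshots never reach the model
    "--output-dir", outputDir,   // pin where snapshots live
  ];
}

console.log(playwrightMcpArgs("/tmp/playwright-mcp").join(" "));
```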
The net effect is that the agent is reading a YAML tree, not pixels, every time it picks a click target. That is the single most important architectural difference between categories 3 and 4.
THE CATEGORY 4 NUMBERS
Five peers, two permissions, one app bundle
The pitch for category 4 compresses into a few numbers. These are direct counts from the source tree and the installer, not marketing.
HOW THE AX API IS ACTUALLY VERIFIED
The boot-time probe that makes accessibility trustworthy
Category 4 tools live and die on whether the macOS Accessibility API is responding correctly. Permission can be granted in System Settings but silently stale after a macOS upgrade or a developer-ID change. Fazm probes it at startup with a real AX call, not with the cached AXIsProcessTrusted() result that macOS usually returns.
If the round trip returns .cannotComplete, the app disambiguates by running the same probe against Finder, then falls back to a CGEvent.tapCreate check that reads the live TCC database. This redundancy is the reason the accessibility tree is a reliable primitive in practice and not just in theory.
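The escalation order can be modeled as a small decision function. The result strings mirror the prose above; this is a sketch of the logic only, not the Swift implementation in AppState.swift:

```typescript
// Sketch of the boot-time probe escalation described above.
// "cannotComplete" mirrors the AXError case; the labels are invented naming.
type ProbeResult = "ok" | "cannotComplete";

function nextCheck(frontmost: ProbeResult, finder?: ProbeResult): string {
  if (frontmost === "ok") return "accessibility healthy";
  // Frontmost app failed: it may simply not implement AX.
  // Probe Finder, a known AX-compliant app, to disambiguate.
  if (finder === undefined) return "probe Finder";
  if (finder === "ok") return "target app lacks AX support";
  // Finder failed too: fall back to the CGEvent tap probe that reads
  // the live TCC database instead of the stale per-process cache.
  return "run CGEvent tap probe";
}

console.log(nextCheck("cannotComplete", "cannotComplete")); // "run CGEvent tap probe"
```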
HOW TO PICK
A rule of thumb for the four categories
There is no "best" browser automation tool in the abstract. There is the one that matches your task shape. The question you have to ask is: who is the user, and does the task stay inside the browser?
If you are an engineer on a team that owns the app → category 1.
Playwright or Puppeteer. You want CI determinism and your scope is bounded. Fazm uses Playwright under the hood for exactly this reason; it is the right library for the browser leg.
If you are a QA lead on a stable app → category 2.
TestRigor or Mabl. You want non-engineers to author and maintain tests. The cloud-suite overhead is worth it.
If the target page is canvas-heavy or has no usable DOM → category 3.
Browser-Use or Skyvern. Vision is the honest tool when pixels are the only signal. Budget for occasional mislabels.
If you are a human on a Mac and the task crosses apps → category 4.
Fazm. The right tool when the sentence involves Gmail and Chrome and Finder and Numbers, when you do not want to write code, and when you want the clicks to be stable under re-renders.
“Every guide on this subject lists the same twelve projects and calls it coverage. The architectural split between pixels and refs is the one that predicts which tool survives the next page update.”
The honest taxonomy, not the listicle
Want the cross-app run demoed live?
Fifteen minutes, live on your Mac. We will run a real multi-peer task so you can see the five MCP servers routing in real time.
Book a call →

Frequently asked questions
What is the fourth category you keep referring to?
The fourth category is accessibility-tree desktop agents that happen to also drive a browser. Every standard list puts browser automation into three buckets: code frameworks (Playwright, Puppeteer, Selenium, Cypress, WebDriverIO), record-and-replay cloud suites (TestRigor, Mabl, Katalon, Ghost Inspector), and vision AI agents (Browser-Use, Nanobrowser, Skyvern, Multi-On). Fazm is in a fourth bucket that the lists almost never name. The agent reads a YAML accessibility snapshot of whatever surface is active (a Chrome tab, a Finder window, a Mail message, a WhatsApp chat) and asks a language model to pick a structural ref, not a pixel coordinate. Browser automation is one peer feature of a larger Mac agent, not the whole thing.
Why does the pixel versus ref distinction matter?
Because pixel clicks are nondeterministic and refs are not. A vision model that looks at a screenshot can mislabel a button, miss an element that scrolled, or pick the wrong target when two similar controls are on screen. A structural ref (ref=e1, ref=e17) is a pointer into the rendered DOM or accessibility tree. It stays valid whether the page re-renders, scrolls, or zooms. Fazm runs the Playwright MCP with the flag --image-responses omit on line 1033 of acp-bridge/src/index.ts, which means base64 screenshots are stripped from the agent context entirely. The model only ever sees the structured snapshot. This is a deliberate architecture choice; it costs some visual reasoning and buys a lot of stability.
How many tools does Fazm expose to the agent, and where are they listed?
Five built-in MCP servers, hardcoded as a single JavaScript Set at acp-bridge/src/index.ts line 1266: const BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"]). fazm_tools is the internal helper API (SQL over the local fazm.db, capture_screenshot for context, permission status, browser profile lookup). playwright is the browser leg. macos-use is a native bundled binary at Contents/MacOS/mcp-server-macos-use that drives any accessibility-compliant Mac app. whatsapp is a dedicated native controller for the WhatsApp Catalyst app. google-workspace is a Python MCP for Gmail, Drive, and Calendar. These five are peers. The agent picks whichever one fits the sentence you typed.
When would I pick a code framework over a tool like this?
When you want determinism in CI, when your task is bounded to a single domain you control, and when you have engineers who can maintain selectors. Playwright is excellent at staging a signup flow for regression testing. Puppeteer is excellent at PDF rendering and deep Chrome DevTools work. Selenium is battle-tested. If your scenario is "run this same sequence on this one site on every pull request," code wins. If your scenario is "pull a receipt from Gmail, upload it to a vendor portal in Chrome, rename the file in Finder, and log the amount in Numbers," no browser-only library covers the full path, and you are writing glue code.
When would I pick a vision AI agent like Browser-Use or Skyvern?
When you absolutely cannot run anything on the user's machine, when the target page has no usable DOM structure (heavily canvas-rendered apps, certain legacy Flash-style UIs), or when you explicitly need a cloud-hosted agent because the workload is high volume. Vision agents trade stability for universality. If the only signal you have is pixels, a vision model is the honest choice. The tradeoff is that every scroll, re-render, and A/B test is a chance for the model to mislabel a control. In practice Fazm bets that for 90% of consumer tasks on the Mac, the accessibility tree is present and correct, so picking a ref is the better move.
Is Fazm a framework I have to install with a CLI?
No. It is a signed, notarized macOS app you download, double-click, and grant two permissions to (Accessibility and Screen Recording). No package manager, no npm install, no YAML config. The Claude Agent SDK, the five bundled MCP servers, the Playwright runtime, and the native binaries all ship inside the app bundle. You type in English and the agent decides which of the five tools to call. The fact that there is Playwright under the hood is an implementation detail, not a user surface.
Can I add my own MCP servers on top of the five that ship in the box?
Yes. Right after the five built-ins are registered in acp-bridge/src/index.ts, the code reads ~/.fazm/mcp-servers.json and appends any entries that match Claude Code's schema (name, command, args, env, enabled). If you write a custom MCP for Linear, Jira, or a private internal tool, Fazm picks it up on next launch and the agent gets a new peer with no app update. The five built-ins just cover the most common surfaces for a Mac user.
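A hypothetical ~/.fazm/mcp-servers.json entry following the fields named above (name, command, args, env, enabled). The Linear server, its command, and the top-level array shape are all invented for illustration; only the field names come from the text:

```json
[
  {
    "name": "linear",
    "command": "npx",
    "args": ["-y", "my-linear-mcp"],
    "env": { "LINEAR_API_KEY": "YOUR_KEY" },
    "enabled": true
  }
]
```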
How does Fazm verify that the macOS Accessibility API is actually healthy before it relies on it?
There is a boot-time probe in Desktop/Sources/AppState.swift. The function testAccessibilityPermission at line 433 calls AXUIElementCreateApplication on the frontmost app and reads kAXFocusedWindowAttribute via AXUIElementCopyAttributeValue. If that returns .cannotComplete, the code runs a secondary check against Finder (a known AX-compliant app) to disambiguate a real permission problem from a target app that does not implement AX (Qt, OpenGL, some Python apps). If Finder also fails, it falls back to a CGEvent tap probe that reads the live TCC database, bypassing the per-process cache that goes stale on macOS 26 Tahoe. That redundancy is why accessibility can be the primitive we build on.
Does Fazm use my real Chrome with my logins, or a fresh Chromium?
Both are supported. If the environment variable PLAYWRIGHT_USE_EXTENSION is set to true, the Playwright MCP attaches to the Chrome you are already signed into via the Chrome Web Store extension "Playwright MCP Bridge". Your cookies, SSO sessions, and browser fingerprint come along, which transparently passes Cloudflare Turnstile and Google sign-in. If not, Playwright launches its own headless Chromium for an isolated session. Real-Chrome mode is the daily driver; fresh-Chromium mode is the hatch for throwaway browsing.
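The mode switch reduces to one environment variable. The variable name is from the text above; the function and its return labels are a sketch invented here:

```typescript
// Sketch: pick the browser mode the way the text describes.
// PLAYWRIGHT_USE_EXTENSION is the real env var; the labels are illustrative.
function browserMode(
  env: Record<string, string | undefined>,
): "real-chrome" | "fresh-chromium" {
  return env.PLAYWRIGHT_USE_EXTENSION === "true"
    ? "real-chrome"     // attach to your signed-in Chrome via the bridge extension
    : "fresh-chromium"; // isolated headless Chromium, no cookies or sessions
}

console.log(browserMode({ PLAYWRIGHT_USE_EXTENSION: "true" })); // "real-chrome"
console.log(browserMode({}));                                   // "fresh-chromium"
```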
What does a cross-app run look like from the user's point of view?
One sentence. Example: "Pull my March Uber receipts from Gmail, upload them to concur.com, and log the total in my Numbers sheet." The agent calls google-workspace to search Gmail and download the PDFs, then calls playwright to drive the concur.com upload flow, then calls macos-use to switch to Numbers and type the total into the right cell. Three different MCP tools, one message, zero glue code. This is the shape that no list of "twelve browser automation tools" ever captures, because every tool on those lists starts and ends at the URL bar.
Is there a free version, and what does it cost to run?
Download is free. Fazm uses your own API keys for the underlying LLMs, which keeps cost transparent (you pay Anthropic or Google directly for the tokens the agent consumes). There is no recurring subscription gated behind the browser automation feature itself. The five MCP servers are always available once Accessibility and Screen Recording permissions are granted.
Three neighboring angles on the same topic
Keep reading
Browser automation tool (singular)
Why the browser is one of five peer tools in Fazm, not the whole product, with the same line-number citations.
AI browser automation
How the accessibility tree approach fits into the broader story of AI-driven browser agents in 2026.
Browser automation extension
The Chrome Web Store bridge that lets Fazm drive your real Chrome profile instead of a fresh Chromium.