The honest taxonomy of browser automation tools, including the category the lists never name
Every guide on this subject lists the same twelve projects and calls it coverage. The projects are real, but the taxonomy is wrong. There are four architecturally different categories, and the line between the third (click by pixel) and the fourth (click by ref) is the split that actually predicts which tool breaks under production load. This page walks through all four, names tools in each, and shows where Fazm sits, with line numbers.
THE SHAPE OF THE MARKET
Four categories, not one
If you read five of the ranking guides on this subject back to back, you notice they all list the same projects, and almost none of them split the list along architectural lines. The honest split is not by license or popularity. It is by how the tool decides what to click, and that question splits everything into exactly four groups.
1. Code frameworks
You import a library and write a script against a browser driver. Deterministic, fast, headless-friendly, excellent in CI. Bound to a tab. Examples: Playwright, Puppeteer, Selenium, Cypress, WebDriverIO, Nightwatch, TestCafe.
2. Record and replay
Cloud testing suites that record human sessions and let non-engineers author flows in a low-code UI. Great for QA teams, weaker when the underlying DOM shifts. Examples: TestRigor, Mabl, Katalon, Ghost Inspector, BrowserStack Low Code.
3. Vision AI agents
A multimodal model looks at a screenshot of the page, labels buttons, clicks by coordinates. Works without selectors. Breaks on re-renders and ambiguous controls. Examples: Browser-Use, Nanobrowser, Skyvern, Multi-On, Adept ACT-1.
4. Accessibility tree agents
The agent reads the structured accessibility tree of the current surface (browser tab, native app, dialog) and asks the language model to pick a ref. No pixel math. Can cross out of the browser into the OS. Fazm is the consumer flavor of this.
Almost no public list names category 4 as distinct from category 3, which is why tools that read the accessibility tree get lumped in with tools that read pixels. They are not the same thing, and they fail under different conditions.
THE ANCHOR FACT
One file, one line, five names, one Set
This is the thing that makes Fazm structurally different from everything else on this topic. In the bridge that wires the Fazm Mac app to the Claude Agent SDK, there is a single source of truth for which tools are built in. It is a JavaScript Set with five names in it. The browser is one of them. Not the first. Not the whole product.
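That one line, as the FAQ on this page quotes it from acp-bridge/src/index.ts, looks like this; the membership checks below it are added here for illustration:

```typescript
// The single source of truth for built-in tools, as quoted from
// acp-bridge/src/index.ts line 1266 elsewhere on this page.
const BUILTIN_MCP_NAMES = new Set([
  "fazm_tools",
  "playwright",
  "macos-use",
  "whatsapp",
  "google-workspace",
]);

// Membership is the whole routing contract: a tool either is a peer or it is not.
console.log(BUILTIN_MCP_NAMES.size);              // 5
console.log(BUILTIN_MCP_NAMES.has("playwright")); // true
```

Five strings, one of which is the browser. That is the anchor fact the rest of this page keeps returning to.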
Every action the agent takes is routed through one of these five peers. If the sentence is "email the PDF to Sam," the agent picks google-workspace. If it is "rename the folder in Finder," macos-use. If it is "fill the form on this site," playwright. Same agent, same conversation, different peer. A library cannot do this; a library starts from the assumption that the world ends at the URL bar.
THE FIVE PEERS, AS AN ORBIT
The agent at the center, five tools around it
The visual model is not a stack with the browser on top. It is a hub and spokes. The agent decides which peer to call for each step of your request.
[Diagram: the agent at the hub, with the five MCP peers (fazm_tools, playwright, macos-use, whatsapp, google-workspace) as spokes]
The five names are exactly the five strings in the Set on line 1266. Read the source yourself; this page is just a mirror.
THE NAMED TOOLS IN EACH CATEGORY
Who lives in each of the four buckets
Most commentary on this subject is either unfiltered listicle or a thin review of one tool. Here is the compressed version. Pick the row that matches your task shape, not the row with the most GitHub stars.
CATEGORY 1 / CODE FRAMEWORKS
Playwright, Puppeteer, Selenium, Cypress, WebDriverIO, TestCafe, Nightwatch
You write code. The library drives a browser. Deterministic, fast, excellent for CI. Playwright is the modern favorite and what Fazm uses under the hood for its browser leg. Selenium has the longest track record. Cypress is the QA darling. All of them stop the moment your task leaves the tab. If you are an engineer and your scenario is "re-run this exact flow forever," this is your category. If you wanted a tool your parents could use, this is not it.
CATEGORY 2 / RECORD AND REPLAY
TestRigor, Mabl, Katalon, Ghost Inspector, BrowserStack Low Code
Cloud QA suites. Record a human session, replay it with some heuristics for handling minor DOM changes. Great for non-engineer test authors on stable apps. Weaker when the app ships a design refresh. Built for regression, not for new tasks. They are genuinely useful for the team that owns the app; they are not useful for the end user who wants "do this thing on the internet for me."
CATEGORY 3 / VISION AI AGENTS
Browser-Use, Nanobrowser, Skyvern, Multi-On, Adept ACT-1
A vision model looks at a screenshot, labels buttons, clicks by pixel coordinates. Works with zero selectors, which is powerful. The stability failure mode is mislabels: two similar controls, an overlapping popover, or a scroll that shifts the frame can all make the click land on the wrong thing. These tools ship to production and work, but they burn tokens on every frame and the error budget is real. The right choice when the DOM is hostile or unavailable.
CATEGORY 4 / ACCESSIBILITY TREE AGENTS (where Fazm lives)
Fazm, plus a handful of research projects
The agent reads the accessibility tree of the current surface. On the web, that is the Playwright MCP’s YAML snapshot. On the desktop, it is the macOS AX API (AXUIElementCreateApplication, AXUIElementCopyAttributeValue). Every element has a role, a label, bounds, and a stable ref. The language model picks the ref. No pixel math, no OCR, no vision-step token cost. The tradeoff is it will not work on pure canvas or heavily custom-drawn UIs. The upside is the same primitive works equally well on Chrome, Finder, Mail, WhatsApp, and System Settings, which is how Fazm covers the whole task instead of just the tab.
PIXEL VS REF, AS A DIAGRAM
What the agent actually sees
The difference between a vision agent and an accessibility tree agent is not subtle. A vision agent receives an image. An accessibility tree agent receives a typed tree. The model is picking targets from one or the other, and that choice shapes the failure mode.
| | Category 3 input | Category 4 input |
|---|---|---|
| What the model receives | A base64 screenshot | A YAML snapshot |
| Why a click lands | The model guessed pixel coordinates from an image | The ref is a stable pointer into the tree |
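A minimal sketch of why the ref survives a re-render and the pixel does not. The snapshot shape, element names, and helper functions here are illustrative, not Fazm's actual types:

```typescript
// Illustrative only: a toy snapshot keyed by ref, the way a category 4
// agent addresses targets. Names and shapes are hypothetical.
type AxNode = { ref: string; role: string; name: string };

const snapshot: AxNode[] = [
  { ref: "e1", role: "textbox", name: "Email" },
  { ref: "e4", role: "button", name: "Submit" },
];

// Category 4: resolve by ref. A re-render that moves the button on screen
// does not change its ref, so the lookup still finds the same element.
function clickByRef(tree: AxNode[], ref: string): AxNode | undefined {
  return tree.find((n) => n.ref === ref);
}

// Category 3: resolve by coordinates. If the layout shifted 40px, the
// guessed point now lands on whatever happens to occupy it.
function clickByPixel(x: number, y: number): [number, number] {
  return [x, y]; // nothing ties the point back to an element
}

console.log(clickByRef(snapshot, "e4")?.name); // "Submit"
```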
HOW FAZM COMPARES, ROW BY ROW
| Feature | Typical browser automation tool | Fazm (category 4) |
|---|---|---|
| Install model | npm install / pip install / CLI | Signed Mac app, double-click to run |
| Scope of control | Single browser tab | Any AX-compliant Mac app, plus a browser tab |
| How the agent picks targets | CSS / XPath selectors or pixel coordinates | Structural refs (ref=e4) from the accessibility tree |
| Screenshots in agent context | Often included (vision agents) | Stripped at launch (--image-responses omit, line 1033) |
| User interface | Code editor or low-code UI | English sentence in a chat window |
| Cross-app tasks | You write the glue between tools | Agent routes across five MCP peers automatically |
| Where tools are declared | Varies per framework | One JavaScript Set on line 1266 of acp-bridge/src/index.ts |
| Works with your real Chrome profile | Sometimes (persistent context flags) | Yes, via the Chrome Web Store bridge extension |
WHAT IT LOOKS LIKE AT RUNTIME
One sentence, three peers
Here is a task that no browser-only tool can do end to end. It leaves Gmail, lands in a vendor portal, comes back to Finder, finishes in Numbers. The agent stays the same; the peer changes at each step.
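The routing for a run like that can be written down as data. The peer names below are the real Set members; the step schema and the task wording are invented here for illustration:

```typescript
// Hypothetical step list for a cross-app receipt task.
// Peer names are real; the Step shape is not Fazm's plan format.
type Step = {
  peer: "google-workspace" | "playwright" | "macos-use";
  action: string;
};

const run: Step[] = [
  { peer: "google-workspace", action: "search Gmail for the receipt, download the PDF" },
  { peer: "playwright", action: "upload the PDF in the vendor portal tab" },
  { peer: "macos-use", action: "rename the file in Finder, log the total in Numbers" },
];

const peersUsed = new Set(run.map((s) => s.peer));
console.log(peersUsed.size); // 3
```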
Three of the five peers in one run. No glue code, no selectors, no pixel math.
WHY THE SCREENSHOT FLAG MATTERS
The three flags that keep the agent honest
When the Playwright MCP is launched, three specific flags turn it from a vision tool into a tree tool. Each flag is explained below; if you want to verify the exact invocation, it is in acp-bridge/src/index.ts at line 1033.
What each flag does
- --output-mode file tells Playwright MCP to write the page snapshot to a .yml file on disk rather than inlining it in the tool response.
- --image-responses omit drops any base64-encoded screenshots the browser tool would have returned, so they never enter the model context.
- --output-dir /tmp/playwright-mcp pins where the snapshots live so the agent can reference them by path if it wants to re-read.
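Those three flags, assembled the way a launcher might pass them. The flag strings are the ones documented above; the surrounding function is a sketch, not the bridge's actual code:

```typescript
// Sketch: build the Playwright MCP argument list with the three flags
// described above. The helper name is hypothetical.
function playwrightMcpArgs(outputDir: string): string[] {
  return [
    "--output-mode", "file",     // snapshot goes to a .yml on disk
    "--image-responses", "omit", // base64 screenshots never reach the model
    "--output-dir", outputDir,   // pin where snapshots live
  ];
}

console.log(playwrightMcpArgs("/tmp/playwright-mcp").join(" "));
```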
The net effect is that the agent is reading a YAML tree, not pixels, every time it picks a click target. That is the single most important architectural difference between categories 3 and 4.
THE CATEGORY 4 NUMBERS
Five peers, two permissions, one app bundle
The pitch for category 4 compresses into a few numbers. These are direct counts from the source tree and the installer, not marketing.
HOW THE AX API IS ACTUALLY VERIFIED
The boot-time probe that makes accessibility trustworthy
Category 4 tools live and die on whether the macOS Accessibility API is responding correctly. Permission can be granted in System Settings but silently stale after a macOS upgrade or a developer-ID change. Fazm probes it at startup with a real AX call, not with the cached AXIsProcessTrusted() result that macOS usually returns.
If the round trip returns .cannotComplete, the app disambiguates by running the same probe against Finder, then falls back to a CGEvent.tapCreate check that reads the live TCC database. This redundancy is the reason the accessibility tree is a reliable primitive in practice and not just in theory.
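The escalation order can be modeled as a small decision function. The result strings mirror the prose above; this is a sketch of the logic only, not the Swift implementation in AppState.swift:

```typescript
// Sketch of the boot-time probe escalation described above.
// "cannotComplete" mirrors the AXError case; the labels are invented naming.
type ProbeResult = "ok" | "cannotComplete";

function nextCheck(frontmost: ProbeResult, finder?: ProbeResult): string {
  if (frontmost === "ok") return "accessibility healthy";
  // Frontmost app failed: it may simply not implement AX.
  // Probe Finder, a known AX-compliant app, to disambiguate.
  if (finder === undefined) return "probe Finder";
  if (finder === "ok") return "target app lacks AX support";
  // Finder failed too: fall back to the CGEvent tap probe that reads
  // the live TCC database instead of the stale per-process cache.
  return "run CGEvent tap probe";
}

console.log(nextCheck("cannotComplete", "cannotComplete")); // "run CGEvent tap probe"
```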
HOW TO PICK
A rule of thumb for the four categories
There is no "best" browser automation tool in the abstract. There is the one that matches your task shape. The question you have to ask is: who is the user, and does the task stay inside the browser?
If you are an engineer on a team that owns the app → category 1.
Playwright or Puppeteer. You want CI determinism and your scope is bounded. Fazm uses Playwright under the hood for exactly this reason; it is the right library for the browser leg.
If you are a QA lead on a stable app → category 2.
TestRigor or Mabl. You want non-engineers to author and maintain tests. The cloud-suite overhead is worth it.
If the target page is canvas-heavy or has no usable DOM → category 3.
Browser-Use or Skyvern. Vision is the honest tool when pixels are the only signal. Budget for occasional mislabels.
If you are a human on a Mac and the task crosses apps → category 4.
Fazm. The right tool when the sentence involves Gmail and Chrome and Finder and Numbers, when you do not want to write code, and when you want the clicks to be stable under re-renders.
“Every guide on this subject lists the same twelve projects and calls it coverage. The architectural split between pixels and refs is the one that predicts which tool survives the next page update.”
The honest taxonomy, not the listicle
Want the cross-app run demoed live?
Fifteen minutes, live on your Mac. We will run a real multi-peer task so you can see the five MCP servers routing in real time.
Book a call →

Frequently asked questions
What is the fourth category you keep referring to?
The fourth category is accessibility-tree desktop agents that happen to also drive a browser. Every standard list puts browser automation into three buckets: code frameworks (Playwright, Puppeteer, Selenium, Cypress, WebDriverIO), record-and-replay cloud suites (TestRigor, Mabl, Katalon, Ghost Inspector), and vision AI agents (Browser-Use, Nanobrowser, Skyvern, Multi-On). Fazm is in a fourth bucket that the lists almost never name. The agent reads a YAML accessibility snapshot of whatever surface is active (a Chrome tab, a Finder window, a Mail message, a WhatsApp chat) and asks a language model to pick a structural ref, not a pixel coordinate. Browser automation is one peer feature of a larger Mac agent, not the whole thing.
Why does the pixel versus ref distinction matter?
Because pixel clicks are nondeterministic and refs are not. A vision model that looks at a screenshot can mislabel a button, miss an element that scrolled, or pick the wrong target when two similar controls are on screen. A structural ref (ref=e1, ref=e17) is a pointer into the rendered DOM or accessibility tree. It stays valid whether the page re-renders, scrolls, or zooms. Fazm runs the Playwright MCP with the flag --image-responses omit on line 1033 of acp-bridge/src/index.ts, which means base64 screenshots are stripped from the agent context entirely. The model only ever sees the structured snapshot. This is a deliberate architecture choice; it costs some visual reasoning and buys a lot of stability.
How many tools does Fazm expose to the agent, and where are they listed?
Five built-in MCP servers, hardcoded as a single JavaScript Set at acp-bridge/src/index.ts line 1266: const BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"]). fazm_tools is the internal helper API (SQL over the local fazm.db, capture_screenshot for context, permission status, browser profile lookup). playwright is the browser leg. macos-use is a native bundled binary at Contents/MacOS/mcp-server-macos-use that drives any accessibility-compliant Mac app. whatsapp is a dedicated native controller for the WhatsApp Catalyst app. google-workspace is a Python MCP for Gmail, Drive, and Calendar. These five are peers. The agent picks whichever one fits the sentence you typed.
When would I pick a code framework over a tool like this?
When you want determinism in CI, when your task is bounded to a single domain you control, and when you have engineers who can maintain selectors. Playwright is excellent at staging a signup flow for regression testing. Puppeteer is excellent at PDF rendering and deep Chrome DevTools work. Selenium is battle-tested. If your scenario is "run this same sequence on this one site on every pull request," code wins. If your scenario is "pull a receipt from Gmail, upload it to a vendor portal in Chrome, rename the file in Finder, and log the amount in Numbers," no browser-only library covers the full path, and you are writing glue code.
When would I pick a vision AI agent like Browser-Use or Skyvern?
When you absolutely cannot run anything on the user's machine, when the target page has no usable DOM structure (heavily canvas-rendered apps, certain legacy Flash-style UIs), or when you explicitly need a cloud-hosted agent because the workload is high volume. Vision agents trade stability for universality. If the only signal you have is pixels, a vision model is the honest choice. The tradeoff is that every scroll, re-render, and A/B test is a chance for the model to mislabel a control. In practice Fazm bets that for 90% of consumer tasks on the Mac, the accessibility tree is present and correct, so picking a ref is the better move.
Is Fazm a framework I have to install with a CLI?
No. It is a signed, notarized macOS app you download, double-click, and grant two permissions to (Accessibility and Screen Recording). No package manager, no npm install, no YAML config. The Claude Agent SDK, the five bundled MCP servers, the Playwright runtime, and the native binaries all ship inside the app bundle. You type in English and the agent decides which of the five tools to call. The fact that there is Playwright under the hood is an implementation detail, not a user surface.
Can I add my own MCP servers on top of the five that ship in the box?
Yes. Right after the five built-ins are registered in acp-bridge/src/index.ts, the code reads ~/.fazm/mcp-servers.json and appends any entries that match Claude Code's schema (name, command, args, env, enabled). If you write a custom MCP for Linear, Jira, or a private internal tool, Fazm picks it up on next launch and the agent gets a new peer with no app update. The five built-ins just cover the most common surfaces for a Mac user.
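A hypothetical ~/.fazm/mcp-servers.json entry following the fields named above (name, command, args, env, enabled). The Linear server, its command, and the top-level array shape are all invented for illustration; only the field names come from the text:

```json
[
  {
    "name": "linear",
    "command": "npx",
    "args": ["-y", "my-linear-mcp"],
    "env": { "LINEAR_API_KEY": "YOUR_KEY" },
    "enabled": true
  }
]
```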
How does Fazm verify that the macOS Accessibility API is actually healthy before it relies on it?
There is a boot-time probe in Desktop/Sources/AppState.swift. The function testAccessibilityPermission at line 433 calls AXUIElementCreateApplication on the frontmost app and reads kAXFocusedWindowAttribute via AXUIElementCopyAttributeValue. If that returns .cannotComplete, the code runs a secondary check against Finder (a known AX-compliant app) to disambiguate a real permission problem from a target app that does not implement AX (Qt, OpenGL, some Python apps). If Finder also fails, it falls back to a CGEvent tap probe that reads the live TCC database, bypassing the per-process cache that goes stale on macOS 26 Tahoe. That redundancy is why accessibility can be the primitive we build on.
Does Fazm use my real Chrome with my logins, or a fresh Chromium?
Both are supported. If the environment variable PLAYWRIGHT_USE_EXTENSION is set to true, the Playwright MCP attaches to the Chrome you are already signed into via the Chrome Web Store extension "Playwright MCP Bridge". Your cookies, SSO sessions, and browser fingerprint come along, which transparently passes Cloudflare Turnstile and Google sign-in. If not, Playwright launches its own headless Chromium for an isolated session. Real-Chrome mode is the daily driver; fresh-Chromium mode is the hatch for throwaway browsing.
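The mode switch reduces to one environment variable. The variable name is from the text above; the function and its return labels are a sketch invented here:

```typescript
// Sketch: pick the browser mode the way the text describes.
// PLAYWRIGHT_USE_EXTENSION is the real env var; the labels are illustrative.
function browserMode(
  env: Record<string, string | undefined>,
): "real-chrome" | "fresh-chromium" {
  return env.PLAYWRIGHT_USE_EXTENSION === "true"
    ? "real-chrome"     // attach to your signed-in Chrome via the bridge extension
    : "fresh-chromium"; // isolated headless Chromium, no cookies or sessions
}

console.log(browserMode({ PLAYWRIGHT_USE_EXTENSION: "true" })); // "real-chrome"
console.log(browserMode({}));                                   // "fresh-chromium"
```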
What does a cross-app run look like from the user's point of view?
One sentence. Example: "Pull my March Uber receipts from Gmail, upload them to concur.com, and log the total in my Numbers sheet." The agent calls google-workspace to search Gmail and download the PDFs, then calls playwright to drive the concur.com upload flow, then calls macos-use to switch to Numbers and type the total into the right cell. Three different MCP tools, one message, zero glue code. This is the shape that no list of "twelve browser automation tools" ever captures, because every tool on those lists starts and ends at the URL bar.
Is there a free version, and what does it cost to run?
Download is free. Fazm uses your own API keys for the underlying LLMs, which keeps cost transparent (you pay Anthropic or Google directly for the tokens the agent consumes). There is no recurring subscription gated behind the browser automation feature itself. The five MCP servers are always available once Accessibility and Screen Recording permissions are granted.
Three neighboring angles on the same topic
Keep reading
Browser automation tool (singular)
Why the browser is one of five peer tools in Fazm, not the whole product, with the same line-number citations.
AI browser automation
How the accessibility tree approach fits into the broader story of AI-driven browser agents in 2026.
Browser automation extension
The Chrome Web Store bridge that lets Fazm drive your real Chrome profile instead of a fresh Chromium.