Reframe: it is not either-or

AI agent plugin vs UI automation is a per-call decision, not an architecture commitment.

Most of the writing on this is ideological. Plugin people say UI automation is fragile. UI automation people say plugins do not exist for half the apps that matter. Both are right and both are missing the point: a real agent ships both kinds of channels and lets the model pick at the tool-call layer. Fazm's bridge registers five MCP servers in one process and exposes the routing rule to the model in plain English.

Matthew Diakonov

Direct answer, verified 2026-05-05

Plugin when an API exists. UI automation when it does not. Ship both.

The pragmatic shape is a routing layer that registers both kinds of tools and picks per intent. Fazm does this with five MCP servers in one bridge: fazm_tools, playwright, macos-use, whatsapp, google-workspace. Lines 1473, 1544, 1553, 1563, 1576 of acp-bridge/src/index.ts. A sixth path at line 1599 merges user-defined MCPs from ~/.fazm/mcp-servers.json. Source on GitHub.

Verified against the Fazm source tree:

  • Five MCP servers registered in one bridge process
  • Plugin lane: google-workspace via real OAuth API
  • UI-automation lane: AX trees from playwright, macos-use, whatsapp
  • User extension via ~/.fazm/mcp-servers.json (no recompile)

The two lanes, side by side, in one Mac app

Each entry below is an app or surface Fazm can drive. The parenthetical marks the lane: a real API plugin, accessibility-tree UI automation, or a custom user MCP added via config.

Gmail (plugin) · Google Calendar (plugin) · Google Docs (plugin) · Google Sheets (plugin) · Google Drive (plugin) · Apple Mail (UI) · Notes (UI) · Reminders (UI) · Finder (UI) · System Settings (UI) · Safari (UI) · Slack Catalyst (UI) · Discord Catalyst (UI) · WhatsApp (UI) · Figma Desktop (UI) · VS Code (UI) · Chrome (UI via Playwright) · Custom MCP (~/.fazm/mcp-servers.json)
5 MCP servers registered in one bridge process

1 plugin lane (google-workspace, real APIs)

3 UI-automation lanes (playwright, macos-use, whatsapp)

1 user-config extension (~/.fazm/mcp-servers.json)

Single-mode agent vs Fazm's both-mode routing

The choice is not about which architecture is purer. It is about whether your agent can reach the next app in the user's workflow without a context switch. The comparison below shows what each posture actually gets you.

One-shot architecture vs per-call routing

Single-mode agent. Picks one architecture and commits. Plugin-only flavor: works great on Gmail, Drive, Calendar. Stops dead at Finder. Cannot rename a downloaded PDF, cannot click a button in Notes, cannot drive Slack Catalyst. UI-only flavor: works on every Mac app, but every Gmail send is 8 to 12 AX tree turns. A 'reply to all customers from yesterday' workflow burns through token budget and breaks on a focus shift. No structured message IDs to reuse downstream. Either way, half the user's real workflows are off-limits.

  • Plugin-only: half the Mac is invisible
  • UI-only: every API call is 10x its true cost
  • Either way: workflows hit a wall
  • Architecture is a one-shot decision

What the bridge actually registers

Six surfaces. Five hard-coded MCP servers and one user-extension hook. Each one is a real stdio subprocess the bridge spawns at session start, with its own command, args, and env block.
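As a minimal sketch of what "spawn a stdio subprocess with its own command, args, and env block" means in practice (the spec shape mirrors the fields described above, but the function and option choices here are assumptions, not the bridge's actual code):

```typescript
// Hypothetical sketch: spawn one MCP server as a stdio subprocess.
// The real registration lives in acp-bridge/src/index.ts; only the
// command/args/env fields are taken from the article.
import { spawn, type ChildProcess } from "node:child_process";

interface ServerSpec {
  name: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

function spawnServer(spec: ServerSpec): ChildProcess {
  // A stdio MCP server speaks JSON-RPC over the child's stdin/stdout;
  // stderr is passed through for diagnostics.
  return spawn(spec.command, spec.args ?? [], {
    env: { ...process.env, ...spec.env },
    stdio: ["pipe", "pipe", "inherit"],
  });
}
```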

Five MCP servers, one bridge process

fazm_tools (1473), playwright (1544), macos-use (1553), whatsapp (1563), google-workspace (1576). The bridge spawns each as a stdio subprocess at session start and exposes their union of tools to the model. Tool names are prefixed with mcp__<server>__<tool> so routing is deterministic; the bridge dispatches by prefix, the model picks by description.
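The prefix convention is simple enough to sketch in a few lines. This is an illustration of the naming scheme, not the bridge's actual dispatcher, and the unprefixed fallback is an assumption:

```typescript
// Hypothetical sketch of mcp__<server>__<tool> prefix routing.
type ToolCall = { name: string; args: Record<string, unknown> };

// "mcp__google-workspace__gmail_send" → { server: "google-workspace", tool: "gmail_send" }
function parseToolName(name: string): { server: string; tool: string } | null {
  const m = name.match(/^mcp__(.+?)__(.+)$/);
  return m ? { server: m[1], tool: m[2] } : null;
}

function routeCall(call: ToolCall): string {
  const parsed = parseToolName(call.name);
  // Assumption: unprefixed names fall through to the built-in tools.
  if (!parsed) return "fazm_tools";
  return parsed.server; // deterministic: the prefix alone picks the server
}
```

The model never sees this dispatch; it only sees tool names and descriptions, which is why routing can stay deterministic on the bridge side while the choice stays semantic on the model side.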

Plugin lane: google-workspace

Python stdio MCP. Real Google APIs over OAuth 2.0. Credentials at ~/.google_workspace_mcp/credentials. One structured call sends a Gmail message; one returns a row from Sheets. Zero AX traversal.

UI lane: macos-use

Native macOS binary. Calls AXUIElementCreateApplication and walks kAXChildren on any AX-compliant app: Mail, Notes, Reminders, Finder, System Settings, Slack Catalyst, Figma desktop. Returns a text accessibility tree, not pixels.

UI lane: playwright (snapshots, not screenshots)

Browser automation. The bridge passes --output-mode file --image-responses omit at line 1491, so browser_snapshot returns AX YAML to a file and only a small reference flows into context. The model picks elements by [ref=e1] handles, not by pixel coords.

UI lane: whatsapp (Catalyst)

Dedicated MCP for the macOS WhatsApp Catalyst app. There is no public WhatsApp desktop API; AX is the only path. Search, open chat, verify active chat, send message, all through accessibility tools.

User extension: ~/.fazm/mcp-servers.json

Lines 1599-1632 read this file at session start and merge entries (same format as Claude Code's mcpServers) into the server list. Add your own plugin or UI-automation MCP without touching Fazm source. Disabled entries are skipped via cfg.enabled === false at line 1610.

5 servers

servers.push({ name: 'fazm_tools' })  // and likewise for 'playwright', 'macos-use', 'whatsapp', 'google-workspace'

acp-bridge/src/index.ts, lines 1473, 1544, 1553, 1563, 1576

When the plugin lane is the right call

These are the conditions under which a real API beats UI automation, every time. If the target app fits, route to the plugin and skip the AX work entirely.

Pick the plugin path when

  • App exposes a public, stable API (Gmail, Calendar, Docs, Sheets, Drive)
  • Action is bulk or scripted (read N messages, send N replies, batch update rows)
  • You need structured return data (message IDs, event IDs, file IDs) for the next step
  • Token budget matters: one structured call beats 8 UI turns
  • The action runs server-side and does not need the user's UI focus
  • Reliability dominates: APIs do not break on a UI redesign

When the UI-automation lane is the right call

These are the conditions under which the accessibility tree beats every API attempt, because no usable API exists or the API lags the UI.

Pick the UI-automation path when

  • App has no public API (Apple Mail, Notes, Finder, System Settings)
  • Action is logged-in-only and cookie-bound (most consumer web flows)
  • You need exactly the path a human takes (drag, drop, multi-select in Finder)
  • API exists but lags the UI feature (Slack Catalyst, WhatsApp desktop)
  • Read-only inspection: scrape a chart from a dashboard with no export endpoint
  • Cross-app sequencing where one app is plugin-able and the other is not

One real turn that crosses both lanes

User asks: “Reply to the latest Stripe invoice and file the PDF in Documents/Invoices/2026.” Watch the routing. The first three tool calls go to the plugin lane (Gmail). The next three go to the UI-automation lane (Finder). One bridge, one model, one context.

Cross-lane workflow: Gmail plugin then Finder UI

Participants: User, Fazm UI, acp-bridge, Model, target MCP.

  1. User: 'Reply to latest Stripe invoice and file the PDF'
  2. Fazm UI → acp-bridge: session/prompt with the full tool catalog (all five MCP servers' tools available)
  3. Model: tool_use mcp__google-workspace__gmail_read_threads → bridge routes to google-workspace (line 1576) → JSON: thread_id, sender, subject, body
  4. Model: tool_use mcp__google-workspace__gmail_send → routes to google-workspace (plugin path) → JSON: { sent: true, messageId }
  5. Model: tool_use mcp__macos-use__open_application_and_traverse Finder → routes to macos-use (line 1553, UI path) → AX tree: [AXWindow] Finder > [AXOutline] Documents
  6. Model: tool_use mcp__macos-use__click_and_traverse 'Invoices' → routes to macos-use → AX tree: invoice file selected
  7. Stream to user: 'Replied and filed under Documents/Invoices/2026'

How to add your own MCP without touching Fazm source

Lines 1599 to 1632 of acp-bridge/src/index.ts read ~/.fazm/mcp-servers.json at session start. The shape mirrors Claude Code's mcpServers config:

{
  "stripe": {
    "command": "/usr/local/bin/stripe-mcp",
    "args": ["--readonly"],
    "env": { "STRIPE_API_KEY": "sk_test_..." },
    "enabled": true
  },
  "linear": {
    "command": "node",
    "args": ["/Users/me/linear-mcp/dist/index.js"],
    "enabled": true
  }
}

Every entry shows up in the model's tool catalog on next session. Plugin-style or UI-automation-style, the routing layer treats them identically. This is the integration surface: a config file the user owns, not a roadmap the vendor controls.
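The merge step can be sketched like this. It is a hedged illustration: the field names (command, args, env, enabled) follow the config shown above, but the real implementation at lines 1599-1632 may differ in detail:

```typescript
// Hypothetical sketch of the user-MCP merge step.
interface McpServerConfig {
  command: string;
  args?: string[];
  env?: Record<string, string>;
  enabled?: boolean;
}

type RegisteredServer = { name: string } & McpServerConfig;

// Parse the raw JSON of ~/.fazm/mcp-servers.json and append every
// enabled entry to the already-registered server list.
function mergeUserServers(
  servers: RegisteredServer[],
  rawJson: string,
): RegisteredServer[] {
  const user: Record<string, McpServerConfig> = JSON.parse(rawJson);
  const extras = Object.entries(user)
    .filter(([, cfg]) => cfg.enabled !== false) // disabled entries are skipped
    .map(([name, cfg]) => ({ name, ...cfg }));
  return [...servers, ...extras];
}
```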

See the five-server bridge route across both lanes live

Twenty-minute demo. We will run a workflow that calls the Google Workspace plugin lane and the macos-use UI-automation lane in the same conversation, and watch the bridge dispatch by tool prefix in real time.

FAQ

Frequently asked questions

Plugin or UI automation, which should I use?

Plugin every time the target app exposes a stable API. UI automation only when it does not. Plugins are faster, cheaper in tokens, deterministic, and survive UI redesigns. UI automation is the floor: it works on any AX-compliant app even when no API exists. A serious agent ships both and picks at the tool-call layer, not at the architecture layer. Fazm registers five MCP servers in one bridge process to do this: fazm_tools (acp-bridge/src/index.ts:1473), playwright (1544), macos-use (1553), whatsapp (1563), google-workspace (1576). The model picks per call.

Why is 'AI agent plugin vs UI automation' a false dichotomy?

Because every nontrivial agent task crosses a mix of apps with APIs and apps without. Sending an email through Gmail is a plugin call. Renaming a file in Finder is UI automation. A real workflow uses both back to back. Treating the choice as an architectural commitment forces you to either rebuild every integration as a plugin (impossible for closed-source desktop apps) or to drive everything via screenshots (slow, fragile, expensive). The pragmatic shape is a routing layer that sees both kinds of tools and picks per intent. Fazm's ChatPrompts.swift lines 97-105 are that routing layer, written into the system prompt the model receives every turn.

What does Fazm's plugin path actually look like?

The google-workspace MCP server is a Python stdio process registered at acp-bridge/src/index.ts:1576. It speaks to real Google APIs over OAuth 2.0, with credentials stored under ~/.google_workspace_mcp/credentials, and exposes typed tool calls for Gmail, Calendar, Docs, Sheets, Drive. No screenshot, no DOM walk: a structured request, a structured response. When the user says 'send Carol the proposal as an attachment', the model calls a Gmail tool directly, not Mail.app via accessibility.

What does Fazm's UI automation path actually look like?

Three channels. One, playwright (registered at index.ts:1544) drives the browser, but uses browser_snapshot accessibility YAML rather than screenshots; the bridge passes --output-mode file --image-responses omit at line 1491 to keep image bytes out of context. Two, macos-use (1553) is a native binary that walks the macOS accessibility tree via AXUIElementCopyAttributeValue for any AX-compliant app: Finder, Mail, Notes, Slack Catalyst, Figma desktop. Three, whatsapp (1563) is a dedicated MCP for the WhatsApp Catalyst app, also AX-based. All three paths return text trees, not pixels.

How does the model know which tool to use?

The system prompt explicitly routes it. ChatPrompts.swift line 99 says 'WhatsApp: whatsapp tools (mcp__whatsapp__*) for sending/reading WhatsApp messages via the native macOS WhatsApp app'. Line 101 says 'Desktop apps: macos-use tools (mcp__macos-use__*) for Finder, Settings, Mail, etc.' Line 103 says 'Browser: playwright tools ONLY for web pages inside Chrome'. The plugin path (Google Workspace) gets used implicitly because its tool names (mcp__google-workspace__*) describe Gmail, Calendar, Docs operations that the desktop apps cannot match without UI work. The model reads tool descriptions and picks the cheapest path that matches the intent.

What about apps that have neither a real plugin nor a stable AX tree?

The user can add their own MCP server. acp-bridge/src/index.ts lines 1599-1632 read ~/.fazm/mcp-servers.json (same format as Claude Code's mcpServers config) and append every entry to the registered server list. So if your accounting app has a CLI, you wrap it in an MCP server and Fazm picks it up at next launch. The routing layer extends without code changes. This is the difference between a closed agent and an open one: the integration surface is a config file, not a roadmap.

Does the same agent really call a Gmail API and click around in Finder in the same turn?

Yes, that is the entire point. A typical session looks like: user asks 'reply to the latest invoice from Stripe and file the PDF in Documents/Invoices/2026'. The model calls a google-workspace tool to read the latest Stripe message and draft a reply, then calls a macos-use tool to drag the PDF into the right folder. Both happen in the same conversation, in the same bridge process, against the same context. The tool-result handler at index.ts:2271-2307 strips images and forwards text from every channel uniformly.

Why not just plugin everything and skip UI automation?

Three reasons. One, most desktop apps do not have public APIs. Apple Mail, Notes, Reminders, Finder, System Settings have no plugin SDK; AX tree is the only stable read/write surface. Two, browser sites with logged-in state often do not expose API access for what the user can already do in the UI; cookie-bound flows are AX or nothing. Three, even when an API exists, it can lag the UI. Slack's plugin API does not cover every Catalyst-only feature. Fazm's whatsapp MCP (1563) exists because there is no public WhatsApp API for the desktop client; AX is the only path.

Why not just UI-automate everything and skip plugins?

Token cost and reliability. A Gmail send through google-workspace's API is one structured call returning a message ID. The same 'send' through UI automation is: focus Mail, click Compose, click To field, type address, click Subject, type subject, click Body, type body, click Send. Eight to twelve tool turns, eight to twelve AX tree retrievals, and any failure mid-sequence (a popup, a focus shift) requires recovery. Plugins skip the whole choreography. Fazm uses UI automation for apps that need it, not for apps that do not.
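To make the turn-count gap concrete, here is a purely illustrative tally (the tool and step names are hypothetical, not the real MCP schemas): the same send intent is one structured call on the plugin lane and a long choreography on the UI lane.

```typescript
// Illustrative only: turn counts for the same "send email" intent.
const pluginLane = [
  { tool: "mcp__google-workspace__gmail_send" }, // one call, returns a messageId
];

// Each UI step is its own tool turn, each returning a fresh AX tree,
// and each a point where a popup or focus shift can derail the sequence.
const uiSteps = [
  "focus Mail", "click Compose", "focus To field", "type address",
  "focus Subject", "type subject", "focus Body", "type body", "click Send",
];
const uiLane = uiSteps.map((step) => ({
  tool: "mcp__macos-use__click_and_traverse",
  step,
}));

console.log(`plugin: ${pluginLane.length} turn, UI: ${uiLane.length} turns`);
```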

Is this an 'RPA vs AI agents' debate?

Not quite. RPA tools (UiPath, Automation Anywhere, Power Automate Desktop) record screen actions and replay them deterministically. They do not reason; they re-execute. A computer-use AI agent reasons per turn and adapts to a changed screen. UI automation in the AI-agent sense is RPA's substrate (driving the UI) wired to a model that decides what to drive. Fazm's macos-use returns AX trees that the model consumes; the model picks the next click based on the tree, not a recorded path. Plus the plugin lane (google-workspace) is something RPA tools historically lack as a first-class equal.

How do I verify the five-server registration in the source?

Open ~/fazm/acp-bridge/src/index.ts. Search for 'servers.push'. Five matches: line 1473 (fazm_tools), 1544 (playwright), 1553 (macos-use), 1563 (whatsapp), 1576 (google-workspace). A sixth path at 1621 reads user-defined entries from ~/.fazm/mcp-servers.json. Run 'grep -n "servers.push" ~/fazm/acp-bridge/src/index.ts' yourself; the line numbers in this article match the shipped source.

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.