Automation in business process: the layer every guide skips is how the agent actually reads the app

Every prominent guide on this topic answers a different question from the one that matters. IBM, Red Hat, UiPath, Salesforce, Atlassian, Acronis, and Wikipedia all describe business-process automation as an orchestration problem of HTTP boxes and arrows. That framing is fine for server-to-server work. It is useless the moment a process step lives in Mail, Numbers, Contacts, or a small-business CRM that has no public API. This guide is about the layer those articles never open: how an on-device agent on a Mac actually reads the window it is automating, and why reading the accessibility tree beats matching pixels on a screenshot.

Matthew Diakonov
11 min read
4.8 from early Mac users
  • Six accessibility-tree tools in the bundled binary
  • Works across Mail, Numbers, Calendar, Slack, Finder, and any AX-compliant app
  • Consumer app, no developer framework required

What the other guides actually say, and where they stop

Open the first ten pages that currently answer this query and they share a four-beat structure. A one-paragraph definition that pulls from Gartner. A list of benefits (cost reduction, compliance, accuracy). A catalogue of processes suitable for automation (invoice routing, employee onboarding, CRM updates). A pitch for the vendor's cloud iPaaS. None of them cross the boundary from "orchestrating APIs" to "driving a desktop app." That crossover is where the question gets interesting, because that is where most of the day-to-day work of a person running a business actually happens.

  • IBM: iPaaS + RPA
  • Red Hat: Ansible workflows
  • UiPath: Windows RPA bots
  • Salesforce: Flow + connectors
  • Atlassian: Jira automation rules
  • Acronis: workload orchestration
  • Wikipedia: BPA definition
  • Fazm: accessibility tree, any Mac app

The missing layer is the one between "call the API" and "click the button." When the API does not exist, or the vendor gates it behind an enterprise plan, you either give up on automating that step, or you write a script that drives the app. The state of the art for driving apps, until very recently, was one of two things: a Windows-only RPA bot that image-matches its way around a screen, or a browser extension that can only see HTML. Neither reaches native Mac apps.

6 native '_and_traverse' tools bundled
1.6.0 mcp-server-macos-use version
0 pip or npm installs required
100% runs local on macOS 13+

The anchor fact: six tools that end in _and_traverse

Install Fazm and peek inside the application bundle, and you will find a file at Contents/MacOS/mcp-server-macos-use. It is a Mach-O arm64 binary compiled from a small Swift server. Run it directly and the first thing it prints on stdout is its tool list:

mcp-server-macos-use --startup-log

That naming convention is not cosmetic. Every action tool (click, type, press_key, scroll, open_application) ends in _and_traverse because the second half of every call walks the accessibility tree of the window that resulted from the action and writes the result to disk. The agent gets back a text file of the new window state, plus a sample of on-screen elements. No screenshot step sits between "click happened" and "what is on screen now." The traversal is the ground truth.

6 tools / 1 binary
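The action-plus-traversal contract described above can be sketched as a single function. This is an illustrative model, not Fazm's actual server code; the names, the result shape, and the sample size are assumptions:

```typescript
// Illustrative sketch of the "_and_traverse" contract, NOT Fazm's real API.
// What it models: every action tool performs its action AND re-reads the
// accessibility tree in the same call, so observation is never a separate step.
type Action = () => Promise<void>;
type Traverse = () => Promise<string[]>; // lines of the new window's AX tree

interface TraverseResult {
  treePath: string;  // where the full traversal is written on disk
  sample: string[];  // small inline excerpt for the LLM's context
}

async function actAndTraverse(
  action: Action,
  traverse: Traverse,
  treePath: string,
  sampleSize = 5,
): Promise<TraverseResult> {
  await action();                  // 1. perform the UI action
  const lines = await traverse();  // 2. immediately walk the resulting tree
  // (a real server would write `lines` to treePath here)
  return { treePath, sample: lines.slice(0, sampleSize) };
}
```

The point of the shape is that there is no moment where the agent has acted but not yet observed: the new window state is part of the action's return value.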

Each line in the traversal .txt has the format [Role] "text" x:N y:N w:W h:H visible. The agent passes those raw values to the next click_and_traverse, which auto-centres the click at (x+w/2, y+h/2).

macos-use MCP server protocol, version 1.6.0
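Assuming the line format above, a client-side parse and the auto-centre arithmetic look roughly like this; the regex and function names are illustrative, not Fazm's internals:

```typescript
// Parse one traversal line of the form:
//   [Role] "text" x:N y:N w:W h:H visible
// The regex and names here are illustrative assumptions.
interface AXLine {
  role: string; text: string;
  x: number; y: number; w: number; h: number;
  visible: boolean;
}

const LINE_RE = /^\[(\w+)\] "([^"]*)" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+)( visible)?$/;

function parseTraversalLine(line: string): AXLine | null {
  const m = LINE_RE.exec(line.trim());
  if (!m) return null;
  return {
    role: m[1], text: m[2],
    x: Number(m[3]), y: Number(m[4]),
    w: Number(m[5]), h: Number(m[6]),
    visible: m[7] !== undefined,
  };
}

// click_and_traverse auto-centres the click at (x + w/2, y + h/2)
function clickCentre(el: AXLine): { x: number; y: number } {
  return { x: el.x + el.w / 2, y: el.y + el.h / 2 };
}
```

For example, the row [Row] "Inbox (42 unread)" x:16 y:160 w:232 h:28 visible parses to a frame whose centre is (132, 174).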

Where the binary is wired in

The registration happens in the ACP bridge (the TypeScript process that stitches the Fazm app to the agent runtime). Around line 1056 of acp-bridge/src/index.ts there is an existsSync check against the binary path; if it is present, the bridge announces a new MCP server called macos-use and the agent inherits all six traversal tools automatically.

acp-bridge/src/index.ts
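The registration pattern described above reduces to a conditional server list. This is a hedged reconstruction of the idea, not the real bridge code; the McpServerSpec shape and function name are assumptions:

```typescript
import { existsSync } from "node:fs";

// Hedged sketch of the registration pattern described in the article,
// NOT the actual acp-bridge/src/index.ts. The spec shape is an assumption.
interface McpServerSpec {
  name: string;
  command: string; // path to the executable MCP server binary
}

function bundledMcpServers(binaryPath: string): McpServerSpec[] {
  const servers: McpServerSpec[] = [];
  // Only announce the macos-use server if the bundled binary is present;
  // the agent then inherits all of its traversal tools automatically.
  if (existsSync(binaryPath)) {
    servers.push({ name: "macos-use", command: binaryPath });
  }
  return servers;
}
```

The existence check makes the feature degrade gracefully: a build without the binary simply never advertises the tools.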

The permission probe on the Swift side is where the accessibility-first design proves itself. Fazm does not ask the OS "do I have permission?" via AXIsProcessTrusted, which is known to go stale on macOS 26 Tahoe. Instead it makes a real AX call on the frontmost app and interprets the AXError cases. The code is in Desktop/Sources/AppState.swift around line 433, and it handles a common production gotcha: some apps (Qt, OpenGL, Python-based desktop apps) return AXError.cannotComplete because they do not implement AX, not because the permission is broken. The code disambiguates by re-probing against Finder.

Desktop/Sources/AppState.swift
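The disambiguation can be expressed as a small decision function. The real implementation is Swift; this TypeScript sketch only models the logic described above, and the state and case names are assumptions:

```typescript
// Sketch of the AXError disambiguation described in the article. The real
// code is Swift (Desktop/Sources/AppState.swift); names are illustrative.
type AXProbe = "success" | "apiDisabled" | "cannotComplete";

type PermissionState =
  | "granted"
  | "denied"
  | "grantedButAppLacksAX"; // permission is fine, the app just has no AX tree

function permissionState(
  frontmostProbe: AXProbe,
  probeFinder: () => AXProbe, // Finder always implements AX: a clean control
): PermissionState {
  switch (frontmostProbe) {
    case "success":
      return "granted";
    case "apiDisabled":
      return "denied"; // Accessibility permission not granted
    case "cannotComplete":
      // Ambiguous: either the permission is broken, or the frontmost app
      // (Qt, OpenGL, some Python apps) simply does not speak AX.
      return probeFinder() === "success" ? "grantedButAppLacksAX" : "denied";
  }
}
```

Probing a known-good app as a control is what turns an ambiguous error code into an actionable answer.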

This is the kind of detail that a screenshot-only agent never has to think about, because it never talks to the OS about apps as apps. It talks to pixels as pixels. The cost of that simplicity shows up later as coordinate drift in production.

A single cycle of the loop, end to end

Imagine a small bookkeeping process: "open Mail, find the latest invoice from the co-packer, grab the PDF, open Numbers, paste a new row." Here is what runs on the user's Mac, turn by turn. The LLM never sees a pixel; it sees a file path to a traversal and a small inline sample.

One business-process run, as the agent sees it

1

open_application_and_traverse(app: "Mail")

The MCP server brings Mail to the front with NSWorkspace.open, waits for the window to stabilize, then calls AXUIElementCreateApplication(mailPid) and walks the tree. Returns a path to /tmp/macos-use/*_traverse.txt and a short inline sample.

2

Agent reads the .txt, finds the mailbox list

The traversal has lines like [Outline] "Mailboxes" x:0 y:80 w:260 h:520 and [Row] "Inbox (42 unread)" x:16 y:160 w:232 h:28 visible. The agent greps for the right row and extracts x/y/w/h.

3

click_and_traverse(x: 132, y: 174)

One call. The centre of the Inbox row is clicked, Mail re-renders, the server re-traverses and returns the new window state. No second 'take a screenshot now' step. The traversal and the action are atomic.

4

type_and_traverse(text: "from:copacker has:attachment")

The agent has already identified the search field from the new traversal. One call types the query AND reads the resulting list of matching messages in the same round trip.

5

press_key_and_traverse(key: "Return") → click first result → export PDF

Each step is one tool call returning the resulting window tree. The agent composes a five-step chain without ever asking for a screenshot, because the tree tells it what is on the screen at each step.

6

Switch to Numbers, click the target cell, type the row, save

Because this is the OS accessibility tree, not a browser DOM, the identical tool set drives Numbers as cleanly as it drove Mail. Same six tools, different app.
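Steps 2 and 3 of the walkthrough ("grep for the right row, click its centre") condense into one helper. A sketch under the line format stated earlier; the function name and matching rules are assumptions:

```typescript
// Sketch of the "grep the traversal, extract the frame, click the centre"
// step from the walkthrough. Names and matching rules are illustrative.
function findClickPoint(
  traversal: string,     // contents of a *_traverse.txt file
  role: string,          // e.g. "Row"
  textContains: string,  // e.g. "Inbox"
): { x: number; y: number } | null {
  // Only match elements flagged visible; offscreen rows are not clickable.
  const re = /^\[(\w+)\] "([^"]*)" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+) visible$/;
  for (const line of traversal.split("\n")) {
    const m = re.exec(line.trim());
    if (!m) continue;
    if (m[1] !== role || !m[2].includes(textContains)) continue;
    const [x, y, w, h] = [m[3], m[4], m[5], m[6]].map(Number);
    return { x: x + w / 2, y: y + h / 2 }; // auto-centre, as the server does
  }
  return null; // not on screen: the agent scrolls or refreshes instead
}
```

A null result is itself a signal: the element is not currently visible, so the next tool call should be a scroll or a refresh, not a blind click.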

Two ways of reading an app, compared

This is the tradeoff that every guide on this topic glosses over. Once you decide to automate a process that touches a native app, you pick between a vision-first agent (a screenshot + OCR + coordinate predictor) and an accessibility-first agent (reads the tree). Both work for the golden path. They diverge under stress.

Feature | Screenshot-only agent | Accessibility-tree (Fazm)
How the agent finds a button | OCR the pixels, match the word 'Send', predict coordinates | Read the AX tree line [Button] "Send" x:y:w:h, click the centre
Behaviour when the window re-renders | Coordinate drift; click lands on wrong element | New traversal after every action; coordinates always current
Works on non-browser Mac apps | Depends on OCR quality per theme / DPI | Yes, any app that implements AX (essentially all mainstream Mac apps)
Context cost per turn | One JPEG of the full window (tens of KB base64) | A text sample of on-screen elements + file path to the full tree
What breaks the approach | Dark-mode themes, fonts, scale-factor changes | Apps that opted out of AX (rare: Qt, some Java, OpenGL canvases)
Installation model | Usually a developer framework or Docker container | Consumer Mac app; the binary ships inside the .app bundle

Fazm still uses capture_screenshot when the agent needs to describe the screen to the user or when an app genuinely has no AX tree. But the default control path for business-process actions is traversal, not vision.

Which kinds of business processes this unlocks

The point of wiring an accessibility-tree-native agent into a consumer Mac app is not "automate HTTP calls"; there are excellent tools for that. The point is the shape of process that was previously impossible to automate, because every prior option either could not see native apps or could not be installed by a non-engineer. Here are a few examples drawn from conversations with early users.

Month-end invoice reconciliation in Numbers

Open Mail, filter for vendor invoices by date, open each PDF, extract amount + due date, paste a new row in a Numbers workbook, mark the email as read. The whole flow stays on the user's Mac. No API keys, no third-party SaaS.

Inbound support triage

Read unread messages in a support inbox (Mail or a native client), categorize by product area, draft a reply in the user's voice, leave it as a draft for review. The agent sees the threading, attachments, and signature blocks because the client exposes them via AX.

CRM data hygiene in a niche Mac app

Small-trade CRMs (field-service, real-estate, accounting tools) ship as Mac desktop apps with no public API. An accessibility-first agent can still read their tables, flag stale contacts, and bulk-edit rows the way a human assistant would.

Product-listing cross-posting

A small seller posts the same item to eBay, Etsy, and their own Shopify. The agent opens each site in the browser (Playwright MCP), drives the form, and uses macos-use for the local image-picker dialog. Native OS dialogs live in the accessibility tree, not the HTML DOM.

Weekly standup summary from a local journal

Read the user's local journal files (Obsidian vault, Notes.app, a plain folder of markdown), extract what was shipped, draft a Slack message, wait for user approval before sending.

Contact-to-calendar handoff

Open Contacts, find a specific person, copy their email, switch to Calendar, create a meeting, invite them. Three apps, one agent, each app driven through the same accessibility-tree interface.

Scoring this against what you had before

If you are evaluating whether an on-device, accessibility-tree agent fills a real gap in your current automation stack, these are the practical questions. None of them are about the agent's IQ. All of them are about what layer of the process you can reach.

What your current stack can already do

  • Zapier / Make / n8n wire HTTP endpoints together.
  • Browser RPA (Playwright, Puppeteer recipes) drive web apps.
  • Shell scripts and cron drive filesystem and CLI work.
  • SaaS iPaaS platforms orchestrate long-running workflows.

What it cannot reach, that a tree-reading agent can

  • Native Mac apps with no API (Mail, Numbers, a specific CRM).
  • OS-level dialogs (Save As, file pickers, permission prompts).
  • Catalyst / Electron apps where the DOM differs from the accessibility tree.
  • Processes that cross three or more apps in one run.

This is not a recommendation to rip out your iPaaS. It is a recommendation to be honest about the gap between "process that fits on a diagram of boxes and arrows" and "process that lives in the desktop apps open on your Mac right now." A large share of a small-business day is the second kind.

Trying it on your own machine

You do not need to read the Swift sources to verify any of the above. Install Fazm, grant Accessibility permission in System Settings, and ask the agent something that requires touching a native app ("open Mail and tell me how many unread messages I have"). Under the hood, the agent will call open_application_and_traverse, then read the traversal, then answer. Tail the app log to watch it happen.

~/Library/Logs/Fazm/fazm.log (tail)

The file path pattern matters. A full accessibility tree for a content-heavy app like Mail can run into tens of thousands of lines. Dumping that inline every turn would blow the LLM context. Writing it to a file and returning a path + a sampled excerpt is how Fazm keeps the agent's running context small enough to chain five or ten steps across three apps without losing coherence.
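The file-plus-sample pattern can be sketched as follows. The sampling heuristic here (prefer visible, labelled elements; cap the count) is an assumption about how such a reply might be built, not Fazm's actual policy:

```typescript
// Sketch of the file-plus-sample reply pattern described above.
// The sampling heuristic is an illustrative assumption, not Fazm's policy.
interface TraversalReply {
  treePath: string;  // path to the full *_traverse.txt on disk
  sample: string[];  // small inline excerpt for the LLM's context
}

function buildReply(treePath: string, lines: string[], cap = 40): TraversalReply {
  // Prefer visible elements with a non-empty label for the inline sample;
  // the agent can always read the full file at treePath for everything else.
  const visible = lines.filter(l => l.endsWith(" visible") && !l.includes('""'));
  const pool = visible.length > 0 ? visible : lines;
  return { treePath, sample: pool.slice(0, cap) };
}
```

Whatever the exact heuristic, the invariant is the same: the inline sample stays bounded no matter how large the window's tree grows.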

Walk through the traversal loop live

15 minutes, screen-shared. We open the Fazm app bundle, run mcp-server-macos-use directly, and watch the agent drive a real Mac app.

Book a call

Frequently asked questions

Every 'automation in business process' write-up sounds identical. Why?

Because they share a blind spot. IBM, Red Hat, UiPath, Salesforce, Atlassian, Acronis, and the Wikipedia entry all describe business-process automation as an orchestration problem: boxes and arrows connecting SaaS systems over APIs. That framing works when every process endpoint is an HTTP endpoint. It stops working the moment a step lives inside a native desktop app that has no public API, which is the majority of the work a small-business owner actually does on a Mac: Numbers, Mail, Contacts, a local accounting tool, a specific trade CRM, Finder. For those apps the automation has to literally read the window, and that is the layer those guides do not describe.

What does Fazm use instead of screenshots to read an app?

The macOS accessibility tree. Every Mac app exposes an AXUIElement hierarchy of its visible UI: windows, buttons, text fields, tables, each tagged with a role, a label, and a frame. Fazm ships a bundled binary, mcp-server-macos-use version 1.6.0, that reads that tree directly with AXUIElementCreateApplication and AXUIElementCopyAttributeValue. After any action it returns the traversal as a text file with lines shaped like [Button] "Send" x:1204 y:760 w:68 h:32 visible. The agent operates on that, not on pixels. You can see the binary at Contents/MacOS/mcp-server-macos-use inside the Fazm app bundle.

What are the six 'and_traverse' tools and why do they matter?

Run the mcp-server-macos-use binary directly and it logs the tool list on startup: macos-use_open_application_and_traverse, macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse, macos-use_scroll_and_traverse, macos-use_refresh_traversal. The naming convention is deliberate. Each action tool does the thing AND re-reads the accessibility tree in the same call, so the agent's next decision is always grounded in the new window state, not in a pixel prediction. That is how you chain 'open Mail, click the inbox, open the first email from Stripe, reply' without the agent hallucinating a button that moved two pixels.

Is this really different from the 'computer use' / vision-based approach other agents take?

Yes, and the difference is the failure mode. A screenshot-only agent has to OCR the pixels every turn, predict coordinates, and hope nothing shifts. When the window renders at a different scale factor, or a modal appears, or a dark-mode theme changes the contrast, the coordinate prediction drifts and the click lands on the wrong thing. An accessibility-tree agent reads the structured hierarchy the app already publishes, matches by role + label, and clicks the centre of the element's published frame. It is the difference between an OCR-based PDF scraper and reading the actual text layer. Same outcome when everything is pristine, wildly different when the UI re-renders.

Does Fazm use screenshots at all?

Yes, but only for describing the screen to the user or to the LLM as a last resort, not as the coordinate source for actions. The screenshot tool is capture_screenshot in Desktop/Sources/Providers/ChatToolExecutor.swift (the case label is at line 65 of that file), and it is separate from the macos-use traversal tools. You get a JPEG when you ask 'what's on my screen' and the agent decides a visual description is worth more than the tree. For 'click the Send button in Mail,' the agent uses click_and_traverse, not capture_screenshot.

Why can this work on any Mac app and not just the browser?

Because the accessibility tree is an OS-level API, not a browser API. Once the user grants Accessibility permission in System Settings, Fazm can read and act on any window from any app, including native SwiftUI and AppKit apps, Electron apps, Catalyst apps, and Java desktop apps. A browser-only agent can only see the HTML DOM of the current tab. A native-automation-layer agent can see the Finder window, the Calendar popover, a payroll app's table, and the browser tab, and treat them identically. For business-process work that spans Mail to Numbers to a CRM, this is what makes chaining possible.

What about apps that do not implement accessibility well (Qt, some Java apps)?

Honest answer: the traversal will be sparse, and the agent has to fall back to screenshot + keyboard shortcuts. Fazm's AppState.swift (line 433, testAccessibilityPermission) explicitly handles the AXError.cannotComplete case as ambiguous: it could mean the app does not speak AX (Qt, OpenGL, some PyMOL-class apps) rather than a permission problem, and the code disambiguates by probing Finder as a secondary control. For a business process running in a poorly AX-compliant app, we recommend a keyboard-driven flow instead of click-driven. Most mainstream Mac apps used by small businesses (Mail, Calendar, Contacts, Numbers, Safari, Chrome, Slack, Zoom, QuickBooks, FreshBooks, HubSpot's Mac wrapper) expose the tree fully.

Why did you build this as a Swift binary and ship it in the app bundle?

Three reasons. First, the macOS Accessibility API is only comfortably callable from Swift or Objective-C; doing it from Node or Python requires shelling out or using fragile FFI bridges. Second, shipping the binary inside Contents/MacOS means the user never runs 'pip install' or 'npm install' to get it; the first time they click a button in Fazm's chat, the binary is already there, signed and notarized. Third, it lets the tool act as a pure MCP server that any other agent (Claude, Gemini, GPT, whatever the user plugs in next) can speak to without Fazm-specific glue. The Fazm app bundle also ships mcp-server-whatsapp (same pattern for WhatsApp Catalyst) and mcp-server-google-workspace as a bundled Python venv.

How does the traversal get from the MCP server back to the LLM?

The traversal is written to a .txt file under /tmp/macos-use/ and the MCP response returns the file path plus a small sample of on-screen elements inline. The agent reads the file to find the exact element it needs, copies the x/y/w/h values, and passes them to the next click_and_traverse call, which auto-centres at (x+w/2, y+h/2). The file-based pattern matters because a full accessibility tree for a content-heavy app like Mail can be tens of thousands of lines, which would blow the LLM context if passed inline every turn. File + sample + grep is how Fazm keeps the agent's context lean enough to run a long business-process chain.

If I already use Zapier / Make / n8n, where does this fit?

They complement each other. Zapier is excellent for HTTP-to-HTTP orchestration: Stripe webhook to Slack message, Typeform submission to Airtable row. It cannot touch the state of a Mail.app window on your Mac. Fazm is the opposite: it can drive Mail, Numbers, and a local accounting tool as if a person were at the keyboard. The question is not 'which one,' it is 'which layer of the process.' Reporting and cross-SaaS glue belongs in Zapier. The parts of the process that live in desktop apps with no public API belong in a native agent that reads the accessibility tree. Most real small-business processes have both.