One real example of business process automation, walked all the way through
Every other article on this keyword is a listicle. Ten to fifteen named examples at one paragraph each. Useful for motivation, useless for deciding whether your shop can actually build the thing. This page takes one example, the weekly vendor bill triage loop, and traces it step by step from the PDF landing in Mail to the payment-ETA reply going out. Every click, every type, every tree read. The Mac apps involved have no public API, so the only honest automation surface is the OS accessibility tree. Here is what that looks like.
How one chat command fans out across four Mac apps
The Fazm tool router takes the inputs on the left, sends each action to the right tool, and leaves artifacts for every step. The router lives in the ChatToolExecutor.swift switch statement; the primitives live in the mcp-server-macos-use binary wired into the app via the ACP bridge at acp-bridge/src/index.ts.
The nine steps of the loop, each as a short sequence of primitive calls
This is the level of specificity other listicles skip. Each step names the apps, the tree elements, and the tool calls. If you install Fazm and ask it to run this exact loop, the sequence below is what will land in /tmp/macos-use/ as a set of nine .txt files plus a few helpers for the cross-app handoffs.
1. Monday 9:04 am. The bill arrives as a PDF in Mail.
A single-line chat command starts the loop: 'triage this week's vendor bills'. The router opens Mail via open_application_and_traverse with identifier 'Mail'. The tree comes back with one AXRow per unread message in the Bills mailbox. Each row has AXStaticText children for sender, subject, date.
2. Open the first bill, grab the attachment path.
click_and_traverse on the first AXRow element. The message loads. The tree now shows an AXImage child for each inline attachment and a Save button in the toolbar. One click_and_traverse on that Save button, with text set to the target Finder path, saves the PDF to ~/Downloads/. Nothing is estimated from pixels; every coordinate comes from the tree line returned by the previous call.
3. Extract line items with the bundled pdf skill.
The pdf skill (Desktop/Sources/BundledSkills/pdf.skill.md) gets loaded. It runs pdfplumber against ~/Downloads/<file>.pdf inside the bundled Python and returns structured JSON: vendor name, bill date, due date, line items with quantity, unit price, total, tax. This is the only step in the loop that is not driven by the accessibility tree, because a PDF's internal structure is richer than what the Preview viewport exposes.
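The shape of that structured JSON can be sketched in a few lines. The field names below are illustrative assumptions, not the real schema (which lives in pdf.skill.md); the point is that the output is machine-checkable, so the agent can cross-foot the line items before typing anything into the accounting app.

```python
# Sketch of the structured JSON the pdf skill might return for one bill.
# Field names and values are illustrative; the real schema is defined in
# Desktop/Sources/BundledSkills/pdf.skill.md.
bill = {
    "vendor": "Acme Paper Co.",
    "bill_date": "2026-04-06",
    "due_date": "2026-04-20",
    "line_items": [
        {"desc": "A4 ream", "qty": 10, "unit_price": 4.50, "total": 45.00},
        {"desc": "Toner",   "qty": 2,  "unit_price": 62.00, "total": 124.00},
    ],
    "tax": 16.90,
}

def check_bill(bill):
    """Cross-check each line's total, then return the grand total incl. tax."""
    for item in bill["line_items"]:
        assert abs(item["qty"] * item["unit_price"] - item["total"]) < 0.01
    return sum(i["total"] for i in bill["line_items"]) + bill["tax"]

grand_total = check_bill(bill)  # 45.00 + 124.00 + 16.90 = 185.90
```

A consistency check like this is what lets the later steps trust the extraction instead of trusting the model's self-report.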
4. Open the desktop accounting app, start a new bill entry.
open_application_and_traverse 'QuickBooks Desktop' (or whatever accounting app the user has installed; the desktop tree looks structurally similar for FreshBooks, Xero Desktop, Sage Desktop). click_and_traverse on the Vendors AXMenu element, then on New Bill. The tree now exposes the form fields: AXTextField 'Vendor', AXTextField 'Ref No', AXTable for line items. No REST API exists for any of this. The tree is the only surface.
5. Type each line item into the table.
One click_and_traverse per field, with the text parameter set from the extracted JSON. Vendor name populates from the PDF. Each cell of the line items table gets a single click_and_traverse on the AXCell, with the value as the text to type and 'Tab' as the trailing key, so the click, the typing, and the Tab all land in one call. The tree refreshes after each call, so the agent always knows which cell is active next. Six line items take six combined calls, not 18 separate ones.
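The per-cell sequence can be sketched as a loop. The tool name is the real primitive; the parameter names and dict shape are illustrative, not the actual MCP payload:

```python
def line_item_calls(values):
    """Sketch: one combined click + type + Tab call per table cell.
    'click_and_traverse' is the real primitive name; the 'text' and
    'press_key' field names here are illustrative assumptions."""
    calls = []
    for v in values:
        calls.append({
            "tool": "click_and_traverse",
            "text": str(v),       # typed into the cell after the click
            "press_key": "Tab",   # advances focus to the next AXCell
        })
    return calls

row = ["A4 ream", 10, 4.50, 45.00]
calls = line_item_calls(row)
assert len(calls) == len(row)  # one call per cell, not three
```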
6. Save the bill, write a ledger row in Numbers.
press_key_and_traverse 's' with Command modifier saves the bill. open_application_and_traverse 'Numbers' brings the master tracker forward. The xlsx skill (Desktop/Sources/BundledSkills/xlsx.skill.md) formats a ledger row: date, vendor, amount, category, status: Posted. click_and_traverse on the first empty AXRow of sheet Bills, type_and_traverse the row, press Return. Two apps, one logical operation, zero pixel guessing.
7. File the PDF into the right Finder folder.
open_application_and_traverse 'Finder', then click_and_traverse on the path breadcrumb to /Users/me/Bills/2026-Q2/ (or create the folder with Command-N in the same traversal if it is missing). Drag-and-drop is simulated with the pasteboard: press_key_and_traverse 'c' with Command, switch to the target folder, press_key_and_traverse 'v' with Command. The original stays in ~/Downloads/ until the agent verifies the copy landed by reading back the Finder tree.
8. Reply to the vendor with a payment ETA.
Back to Mail. click_and_traverse on Reply. The doc-coauthoring skill drafts a two-line message: 'Thanks, received and logged. Will process by <due date>.' The tree shows the body as AXTextArea; type_and_traverse fills it. The send button is not clicked; the agent stops one step short of send and hands control back with 'draft ready, press Cmd+Shift+D to send'. That stop-short-of-irreversible behavior is the default throughout the loop.
9. Every action logged to /tmp/macos-use/ for audit.
Each step writes a .txt tree dump and a .png screenshot to /tmp/macos-use/, named <timestamp>_<action>.txt and .png. The .txt file is the ground truth; the .png is an extra for visual review. An hour later you can open the folder and read what the agent saw, what it clicked, and what came back. Screenshot-based RPA does not produce this artifact trail because there is no structured tree to save.
The anchor fact: grep the tree dumps after a run
After Fazm runs the loop, every action leaves a plain text file in /tmp/macos-use/. One element per line, role + label + x/y/w/h + visibility. You can check what the agent saw and what it clicked long after the run is over. This is the single feature that most distinguishes accessibility-tree automation from screenshot-based RPA: the artifact is structured, not a PNG.
What a single step looks like at the tool-call layer
Every step in the timeline above decomposes into one or two primitive calls. The coordinates come from the tree line returned by the previous call, not from estimating pixels in a screenshot. A click, typed text, and a trailing key press chained into one call is the single most common shape; the whole loop is built out of roughly nineteen of these.
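A sketch of what one such call might carry. The tool name is real; the parameter names are illustrative assumptions (the pinned contract is llms.txt in mcp-server-macos-use):

```python
# Illustrative shape of one combined click + type + key call.  Parameter
# names are assumptions; only the tool name comes from the primitive spec.
call = {
    "tool": "click_and_traverse",
    "x": 680, "y": 520,          # from the tree line the previous call returned
    "text": "2026-Q2",           # typed after the click lands
    "press_key": "Tab",          # the key chained into the same call
}

# The response carries the refreshed tree, one element per line, e.g.:
response_line = '[AXButton (button)] "Save" x:680 y:520 w:80 h:30 visible'
```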
Why this one example fails on iPaaS
Row by row, where the accessibility-tree approach can reach and where a traditional iPaaS or screenshot RPA stops short. Same process, different surface.
| Feature | Zapier / Make / screenshot RPA | Fazm (accessibility tree) |
|---|---|---|
| Open Mail.app, walk its unread list | Zapier/Make: IMAP only, no Mail.app tagging | Yes, via AXRow tree of the Bills mailbox |
| Extract line items from the attached PDF | Requires a separate OCR vendor + glue | Bundled pdf skill + pdfplumber, structured JSON |
| Create a new bill in QuickBooks Desktop for Mac | No QuickBooks Desktop connector exists | Yes, form fields reachable via accessibility tree |
| Write a ledger row in Numbers | iPaaS uses Google Sheets as substitute | Yes, bundled xlsx skill + cell-level tree writes |
| File the PDF into a specific Finder folder | Folder Actions scripts, brittle on rename | Yes, Finder tree + pasteboard actions |
| Draft the vendor reply, stop before send | Send happens or no send happens, no pause | doc-coauthoring skill + Mail compose tree |
| Cost per automation step | ~2000 tokens per screenshot for vision RPA | ~20 tokens per tree line |
| Audit artifact | Pixel screenshots, no structural record | /tmp/macos-use/*.txt per action, one tree line per element |
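The cost row is back-of-envelope arithmetic from the page's own figures: roughly 2,000 tokens per screenshot for vision RPA, roughly 20 tokens per tree line, and roughly 20 steps per bill. Assuming, for the sketch, one relevant tree line per step:

```python
steps = 20                    # roughly 20 tool calls per bill (see the FAQ)
screenshot_tokens = 2000      # vision RPA: one Retina screenshot per step
tree_line_tokens = 20         # tree approach: assume one relevant line per step

vision_cost = steps * screenshot_tokens   # 40,000 tokens per bill
tree_cost = steps * tree_line_tokens      # 400 tokens per bill
ratio = vision_cost // tree_cost          # two orders of magnitude at these figures
```

In practice a traversal returns many tree lines, so the real ratio is smaller, but the gap stays large because most lines are short structured text, not pixels.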
The uncopyable parts: file paths, counts, and a prompt
These are the facts you can verify against the Fazm source and against your own disk after installing. None of them is invented for the page; every path and count is real in the 2026-04 build. If any of them drifts in a later version, the verification commands still work and still return a number.
17 bundled skills, shipped with the app
ai-browser-profile, canvas-design, deep-research, doc-coauthoring, docx, find-skills, frontend-design, google-workspace-setup, pdf, pptx, social-autoposter, social-autoposter-setup, telegram, travel-planner, video-edit, web-scraping, xlsx. For the bill-triage example, pdf + xlsx + doc-coauthoring do the non-desktop work. On disk at Desktop/Sources/BundledSkills/.
Six primitives cover the loop end to end
open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal. Spec pinned at /Users/matthewdi/mcp-server-macos-use/llms.txt. Every step in the timeline on this page decomposes into a short sequence of these six.
Tool routing is explicit in source
Desktop/Sources/Chat/ChatPrompts.swift line 60 under the <tools> block: Screenshots: ALWAYS use capture_screenshot (modes: screen or window). NEVER use browser_take_screenshot, that only sees the browser viewport, not the desktop. Desktop apps: macos-use. Browser: playwright. The prompt is the contract.
Every action leaves a tree dump on disk
/tmp/macos-use/<timestamp>_<action>.txt per step, plus a .png screenshot. rg -n AXButton /tmp/macos-use/*.txt will return the exact lines the model read. Screenshot-based RPA cannot produce this artifact.
Four apps in this one example have no public API
Mail.app, Preview.app, the desktop accounting app (QuickBooks Desktop for Mac has no REST), and Finder. The accessibility tree is the only honest automation surface. iPaaS connectors do not cover any of them.
Blocked SQL keywords in the chat's execute_sql
DROP, ALTER, CREATE, PRAGMA, ATTACH, DETACH, VACUUM are blocked at the Swift layer (ChatToolExecutor.swift line 152). Multi-statement queries are blocked too. The execute_sql tool is real; the prompt routes structured questions there instead of asking the model to guess from free text.
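The guard is simple enough to restate as a sketch. The real check lives in ChatToolExecutor.swift, in Swift; this is an illustrative Python equivalent of the two rules described above (blocked keywords, no multi-statement queries):

```python
import re

# Keyword list from the page above; the real guard is in ChatToolExecutor.swift.
BLOCKED = {"DROP", "ALTER", "CREATE", "PRAGMA", "ATTACH", "DETACH", "VACUUM"}

def is_allowed(sql):
    """Illustrative re-implementation: reject blocked keywords and
    multi-statement queries."""
    words = {w.upper() for w in re.findall(r"[A-Za-z_]+", sql)}
    if words & BLOCKED:
        return False
    # A semicolon followed by anything but whitespace means two statements.
    return not re.search(r";\s*\S", sql)

assert is_allowed("SELECT vendor, amount FROM bills WHERE status = 'Posted'")
assert not is_allowed("DROP TABLE bills")
assert not is_allowed("SELECT 1; SELECT 2")
```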
Why this example is worth walking all the way through
Most buyers of business process automation software have never seen one run at the layer the engineer is working at. They have seen product marketing that claims coverage, a sales demo that runs a happy path in Salesforce, and a listicle that names ten examples. They have not seen what happens when the target app is QuickBooks Desktop for Mac or Preview or Mail and the REST API is nowhere to be found.
The answer is not magic. It is a tree. Every visible element in every Mac app exposes its role and label through the AXUIElement API. An agent that reads that tree and clicks by role + label is, at worst, mildly slower than an API call. At best, for apps without APIs, it is the only automation that exists.
That is what is special about this one example. Not the inbox, not the PDF, not the ledger row. The fact that all four apps in the loop ship no public automation surface and the loop still runs on a consumer Mac with one chat command.
Questions people ask after reading this
Why just one example, not ten?
Because every other result for this keyword is a listicle. Kissflow, IBM, Workato, Tallyfy, and ProcessMaker each give you 10 to 15 named examples at one-paragraph depth, and none of them walks a single example to the level of the actual clicks and tool calls. The honest answer to 'what does business process automation look like?' is not a longer list. It is one loop, traced all the way through, so you can tell whether the thing is buildable for your shop. This page takes vendor bill triage because it is the most common weekly process in a small business and because it runs through four Mac apps that have no usable public API.
Why vendor bill triage specifically?
Three reasons. It runs weekly in almost every small business, so the payoff is obvious. It touches four apps with no honest API surface for this flow on macOS (Mail, Preview, a desktop accounting app, Finder), which forces the automation to use the accessibility tree instead of REST endpoints. And every step produces a verifiable artifact (a PDF, a ledger row, a filed document, a sent reply) so you can check the automation's work without trusting the model's self-report.
What is the accessibility tree and why is it the key to this example?
Apple's AXUIElement API exposes every onscreen element as a tree node with a role (AXButton, AXTextField, AXRow, AXCell, AXStaticText), a label, x/y coordinates, width, height, and visibility. Fazm drives the Mac through that tree instead of taking screenshots and asking a vision model where to click. Each traversal writes a structured text file to /tmp/macos-use/<timestamp>_<action>.txt where every line looks like [AXButton (button)] "Save" x:680 y:520 w:80 h:30 visible. You can verify it after any Fazm session: run rg -n AXButton /tmp/macos-use/*.txt and you will see dozens of matches. The tree is the contract between the model and the OS, and it is what makes this example reliable enough to actually run week after week.
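The line format shown above is regular enough to parse with a short expression. A sketch, assuming the dump format matches the example exactly (it may vary by build):

```python
import re

# Matches the tree-line shape shown above; the real dump format may vary.
LINE = re.compile(
    r'\[(?P<role>AX\w+) \((?P<subrole>[^)]+)\)\] '
    r'"(?P<label>[^"]*)" '
    r'x:(?P<x>-?\d+) y:(?P<y>-?\d+) w:(?P<w>\d+) h:(?P<h>\d+) '
    r'(?P<vis>visible|hidden)'
)

def parse(line):
    m = LINE.match(line)
    if not m:
        return None
    d = m.groupdict()
    for k in ("x", "y", "w", "h"):
        d[k] = int(d[k])
    return d

el = parse('[AXButton (button)] "Save" x:680 y:520 w:80 h:30 visible')
# The clickable center is derivable directly from the tree line:
center = (el["x"] + el["w"] // 2, el["y"] + el["h"] // 2)
```

This is the whole trick: the coordinates an agent clicks come from parsing lines like this, not from a vision model pointing at pixels.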
What exactly is Fazm doing when it runs this example?
Six `macos-use` primitives cover the whole loop: open_application_and_traverse (launch or focus an app and read its tree), click_and_traverse (click a tree element, optionally type text, optionally press a key, all in one call), type_and_traverse (type into the focused field), press_key_and_traverse (fire a key with optional modifiers), scroll_and_traverse (scroll by N lines at a coordinate), and refresh_traversal (re-read the tree without acting). Every step in the timeline on this page decomposes into a short sequence of these six calls. The primitive set is pinned in /Users/matthewdi/mcp-server-macos-use/llms.txt and the Mac client wires it up at /Users/matthewdi/fazm/acp-bridge/src/index.ts.
How is this different from screenshot-based agents like Claude Computer Use or OpenAI Operator?
Those agents take a screenshot of the screen, pass the image to a vision model, and ask the model to point at pixel coordinates where it wants to click. That fails at three things this example needs. First, token cost: a full-resolution Retina screenshot is thousands of tokens per step, and this example is roughly 20 steps per bill. Second, brittleness: a 1-pixel UI shift, a theme change, or a different monitor scale invalidates a template match, but does not change the AXButton role and label. Third, scope: Claude Computer Use runs in a Docker container with a virtual display, so it cannot reach your real Mail or a real desktop accounting app; Operator is browser-only. Fazm runs locally, attaches to the accessibility tree of real Mac apps, and pays dozens of tokens per tree line instead of thousands per screenshot.
Why not just use Zapier or Make for this?
Zapier, Make, n8n, Workato, and Power Automate Cloud are iPaaS. They move data between apps that already expose REST or SOAP endpoints. In this example: Mail on Mac does not expose an iPaaS-friendly attachment API (IMAP does, but attachment handling in IMAP connectors is spotty, and the categorization lives in the Mail.app scripting surface, which iPaaS products do not speak). QuickBooks Desktop for Mac ships no public REST API, full stop; QuickBooks Online's API does not cover Desktop; there is no Zapier connector for it. Preview is not a cloud service. Finder tagging and filing is a macOS-only affair. The iPaaS tier works beautifully when it works, which is when every app in the chain is a cloud SaaS with a documented API. That is not this example.
What are the 17 bundled skills and why do they matter for this walkthrough?
Fazm ships 17 skill.md files inside the app bundle at Desktop/Sources/BundledSkills/: ai-browser-profile, canvas-design, deep-research, doc-coauthoring, docx, find-skills, frontend-design, google-workspace-setup, pdf, pptx, social-autoposter, social-autoposter-setup, telegram, travel-planner, video-edit, web-scraping, xlsx. Each is a prompt that gets loaded when the user's request matches its description. For the vendor bill example, three of them do most of the work: pdf (extract structured line items out of the attachment), xlsx (write ledger rows into the user's master tracker), and doc-coauthoring (draft the vendor reply). The skill files are plain Markdown, versioned with the app, and visible to you on disk, not hidden behind a vendor API.
How does Fazm decide when to read the tree and when to screenshot?
The routing rule is explicit in Desktop/Sources/Chat/ChatPrompts.swift under the <tools> block. Paraphrased: desktop apps go to macos-use (tree); web pages inside Chrome go to Playwright (DOM); screenshots are taken only when the user literally asks 'what does this look like'. The rule says, verbatim, ALWAYS use capture_screenshot for screenshots (with modes screen or window). NEVER use browser_take_screenshot, that only sees the browser viewport, not the desktop. The effect is that a multi-step automation across Mail + Preview + Numbers + a desktop app costs tens of tokens per step, not thousands, because the model is reading structured text, not looking at pixels.
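The routing rule restated as code. Function and argument names here are illustrative; the real contract is the prompt text in ChatPrompts.swift, not a Swift function with this shape:

```python
def route(target, wants_screenshot=False):
    """Illustrative restatement of the <tools> routing rule: desktop apps go
    to the tree, browser pages to the DOM, screenshots only on explicit
    request -- and never via browser_take_screenshot for desktop views."""
    if wants_screenshot:
        return "capture_screenshot"   # modes: screen or window
    if target == "browser":
        return "playwright"           # Chrome pages via the DOM
    return "macos-use"                # every desktop app via the AX tree

assert route("Mail") == "macos-use"
assert route("browser") == "playwright"
assert route("Numbers", wants_screenshot=True) == "capture_screenshot"
```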
What happens if something on the accounting-app screen is unexpected?
Two things. First, every tool call returns the refreshed tree after the action, so the model sees the new state before its next call; if a modal dialog appears that the plan did not anticipate, the model's next response handles that dialog (clicking Cancel, reading the error, adjusting). Second, InputGuard blocks the user's keyboard and mouse during an action so a stray click does not land in the middle of the automated sequence, a 30-second watchdog prevents permanent lockout if the server ever crashes mid-step, and pressing Escape cancels immediately. All four mechanisms (tree refresh, InputGuard, watchdog, Escape cancel) belong to the mcp-server-macos-use binary, not something the prompt layer has to coordinate.
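The guard-plus-watchdog shape is a standard pattern. A minimal sketch, assuming nothing about the real implementation beyond what the answer above describes (block during an action, auto-release on a timeout so a crashed server cannot lock the machine):

```python
import threading

class InputGuardSketch:
    """Illustrative model of the guard described above, NOT the real
    mechanism: block input for one action, with a watchdog timer that
    lifts the block even if release() is never called."""

    def __init__(self, watchdog_seconds=30.0):
        self.blocked = False
        self._timeout = watchdog_seconds
        self._timer = None

    def block(self):
        self.blocked = True
        self._timer = threading.Timer(self._timeout, self.release)
        self._timer.daemon = True
        self._timer.start()   # watchdog: auto-release after the timeout

    def release(self):
        self.blocked = False
        if self._timer:
            self._timer.cancel()

guard = InputGuardSketch(watchdog_seconds=0.05)
guard.block()
assert guard.blocked
threading.Event().wait(0.2)   # simulate a crashed server: release() never comes
assert not guard.blocked      # the watchdog lifted the block anyway
```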
Can I verify this walkthrough against my own machine?
Yes. Install Fazm from fazm.ai, grant accessibility permission when asked, then ask it to run this exact example on your own bill inbox. After the run, open /tmp/macos-use/ in Finder. You will see a .txt file and a .png file per action. The .txt file is the accessibility tree snapshot at the moment of the action. Grep it: rg -n AXButton /tmp/macos-use/*.txt. The lines match the format shown in this guide exactly. You can also browse the bundled skills on disk at /Applications/Fazm.app/Contents/Resources/BundledSkills/ (path may vary by build), and the primitive contract is pinned at github.com/mediar-ai/mcp-server-macos-use in llms.txt.
Is this only for bookkeeping, or does the pattern generalize?
The pattern generalizes to any weekly loop where the inputs land in one app, the processing happens in a second, the record-of-truth lives in a third, and the outbound message leaves from a fourth, and none of those apps expose a usable public API. Real examples of the same shape include: client onboarding across Contacts + Calendar + a CRM desktop app + Mail, tax-period prep across Preview + Numbers + a scanned-document folder in Finder + Mail, grant-tracking across a bank dashboard in Chrome + Numbers + a desktop CRM + Mail. The only thing that changes between them is which apps sit in which slot and which of the 17 bundled skills gets loaded. The six primitives stay the same.
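The four-slot shape can be written down as a plain mapping. The slot names below are illustrative, and the slot assignments for the second loop are a guess from the paragraph above; the apps and skill names are the page's own:

```python
# The four-slot pattern as data.  Slot names are illustrative; which app
# fills which slot for "client-onboarding" is an assumption.
LOOPS = {
    "vendor-bill-triage": {
        "inbox": "Mail", "processing": "QuickBooks Desktop",
        "record": "Numbers", "outbound": "Mail",
        "skills": ["pdf", "xlsx", "doc-coauthoring"],
    },
    "client-onboarding": {
        "inbox": "Contacts", "processing": "CRM desktop app",
        "record": "Calendar", "outbound": "Mail",
        "skills": ["doc-coauthoring"],
    },
}

# Same four slots every time; only the apps and skills change.
for loop in LOOPS.values():
    assert {"inbox", "processing", "record", "outbound"} <= set(loop)
```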
Run this exact example on your own Mac
Install Fazm, grant accessibility permission once, and ask it to triage your bills. Watch /tmp/macos-use/ fill up with tree dumps. Free to start.
Download Fazm →