A local AI app that actually does things on your Mac.
Most lists under this heading are wrappers around a 7B model in a chat window. You type, it types back, and nothing outside the window changes. Fazm is a different shape. It is a signed Mac app that ships a small Swift server at Contents/MacOS/mcp-server-macos-use, which exposes six accessibility tools and hands the agent a flat text tree of the whole screen after every click. The AI greps that file the way you would grep a log, finds the button it wants, and presses it.
THE MISREAD
When people say “local AI app”, they usually mean a chat window
If you searched this topic and opened the first few pages, you got a ranked list of the same half dozen names: Ollama, LM Studio, GPT4All, Jan, Msty, Faraday. Every one of them is the same shape. A local LLM runner with a chat UI on top. You load a model, you type into a box, it types back. If you ask it to rename a file, it will write a sentence about how you could rename a file. It will not rename the file. It cannot reach outside the tab.
That is a real product category and a good one. It is not the thing most people actually want when they type this topic into Google. The intent underneath the words is usually closer to “an AI that lives on my Mac and does stuff on my Mac”. The chat-in-a-box answer and the do-stuff answer are two different products that happen to share three words of marketing language.
This page is about the second one.
Chat-only local AI vs. a local AI that operates the apps you already have
A pretty window, a blinking cursor, and a model waiting for you to describe what you want. Everything it can do is constrained to text it emits back into the window. Your Finder, Mail, Safari, Slack, and Notes are unchanged.
- Bound to a chat tab
- No hands on your OS
- Cannot click a Send button
- Cannot reorder a Finder list
- Great for private conversation, not for work
THE BUNDLE
What is inside the .app that makes this work
If you right-click the Fazm app bundle and open Contents/MacOS, you will find two things. One is the main Swift/SwiftUI app. The other is a small binary called mcp-server-macos-use. It is a standalone Model Context Protocol server, built from a separate repo, bundled universal (arm64 + x86_64), signed and notarized with the rest of the app. The Fazm release pipeline literally runs file /tmp/mcp-server-macos-use-universal | grep -q "universal binary" as a gate on every build.
When the app starts, it spawns the server as a child process and speaks to it over stdio. Every time the model wants to do something on your Mac, it sends a tool call, the server runs it, and the server writes back a compact text summary plus a path to a fresh accessibility tree. That is the loop.
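That parent-child loop can be sketched in a few lines. This is a simplified JSON-RPC-style framing, not Fazm's code: the real wire format is defined by the Model Context Protocol spec, and the field names and values here are illustrative.

```python
import json

def build_tool_call(call_id: int, tool: str, arguments: dict) -> str:
    """Frame a tool call as one JSON line, the way an MCP client writes
    to the child server's stdin. (Shape simplified for illustration.)"""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": call_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

def parse_tool_result(line: str) -> dict:
    """Parse one response line read back from the server's stdout."""
    return json.loads(line).get("result", {})

# The app writes a call like this to the child's stdin...
req = build_tool_call(1, "macos-use_open_application_and_traverse",
                      {"identifier": "Mail"})

# ...and the server answers with a compact summary plus the path to a
# fresh accessibility tree (values here are made up for the sketch).
fake_reply = json.dumps({"jsonrpc": "2.0", "id": 1,
                         "result": {"pid": 4242,
                                    "file": "/tmp/macos-use/tree-1.txt"}})
result = parse_tool_result(fake_reply)
print(result["file"])
```

The important property is in the last two lines: the response is not the tree itself, just a path to it, so the model's context stays small.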
Voice in, accessibility out: the local loop
THE SIX TOOLS
Everything the agent is allowed to do on your Mac, in one list
The surface area is intentionally small. Not fifty tools, not thirty, not a metatool that takes a free-form command. Six. Each one performs a specific physical action and returns a fresh accessibility tree so the next step starts with ground truth.
macos-use_open_application_and_traverse
Opens or activates an app by name, bundle ID, or file path, then dumps the accessibility tree of the new active window.
Used as the first step of almost every flow. The response includes the PID that the next tool calls will target.
macos-use_click_and_traverse
Clicks at (x + w/2, y + h/2) inside the PID's window. Optionally types and presses a key in the same call.
The element parameter accepts a label match against the tree so the agent does not have to carry coordinates in its head.
macos-use_type_and_traverse
Types a string into whatever field currently has focus, then retraverses. Also takes an optional pressKey so type-and-enter is one call.
The agent is explicitly forbidden from typing its chain of thought into a user document, a rule shipped in the system prompt.
macos-use_press_key_and_traverse
Synthesizes a key event with optional modifier flags (Cmd, Shift, Option, Control, Fn).
This is how Cmd+S, Cmd+F, Cmd+Shift+N, and the rest of the macOS keyboard vocabulary become available to the agent.
macos-use_scroll_and_traverse
Simulates a scroll wheel event at a given coordinate and retraverses. Useful for long documents, long lists, long chat histories.
macos-use_refresh_traversal
Retraverses the PID's accessibility tree without performing an action. The ‘look again’ tool for when an animation finished or the user typed something.
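As a rough illustration of the keyboard tool's input, a shortcut like Cmd+Shift+N can be split into the key plus its modifier flags before being handed to a press_key-style call. The argument shape below is an assumption for the sketch, not the server's actual schema.

```python
# The modifier vocabulary the press_key tool supports, per the list above.
MODIFIERS = {"Cmd", "Shift", "Option", "Control", "Fn"}

def parse_shortcut(spec: str) -> dict:
    """Split 'Cmd+Shift+N' into the key and modifier flags a
    press_key-style tool would synthesize. Illustrative shape only."""
    parts = spec.split("+")
    mods = [p for p in parts[:-1] if p in MODIFIERS]
    return {"key": parts[-1], "modifiers": mods}

print(parse_shortcut("Cmd+Shift+N"))
# {'key': 'N', 'modifiers': ['Cmd', 'Shift']}
```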
THE ANCHOR FACT
Every element, one line of text, this exact shape
The server dumps one row per accessibility element into a flat .txt file, with a format pinned in the tool instructions so the model is told the shape it will receive. It is not JSON. It is not a tree with indentation that matters. It is one element per line that the agent can grep. The shape is:

[Role] "text" x:N y:N w:W h:H visible
That is the piece nobody else writing about this topic mentions, because they do not know. Open up Fazm.app/Contents/MacOS, look at a traversal output, and the format will match exactly. It is what lets the agent use the accessibility tree the way it uses any other text: read, grep, decide.
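That read-grep-decide step is easy to sketch. This is not Fazm's code: the regex below is fitted to the row shape quoted in the FAQ ('[Role] "text" x:N y:N w:W h:H visible'), and the sample row's label and coordinates are invented.

```python
import re

# One row per element, matching the documented shape:
#   [Role] "text" x:N y:N w:W h:H visible
ROW = re.compile(
    r'\[(?P<role>[^\]]+)\]\s+"(?P<text>[^"]*)"\s+'
    r'x:(?P<x>-?\d+)\s+y:(?P<y>-?\d+)\s+w:(?P<w>\d+)\s+h:(?P<h>\d+)'
    r'(?P<visible>\s+visible)?'
)

def parse_row(line: str):
    """Pull role, label, frame, and visibility out of one tree row."""
    m = ROW.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    return {"role": d["role"], "text": d["text"],
            "x": int(d["x"]), "y": int(d["y"]),
            "w": int(d["w"]), "h": int(d["h"]),
            "visible": d["visible"] is not None}

def click_point(el: dict) -> tuple[int, int]:
    """click_and_traverse auto-centers the click at (x + w/2, y + h/2)."""
    return el["x"] + el["w"] // 2, el["y"] + el["h"] // 2

el = parse_row('[Button] "Send" x:912 y:138 w:56 h:28 visible')
print(el["role"], click_point(el))
# Button (940, 152)
```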
A REAL TREE
What the agent actually sees when it opens Mail
After opening Mail and activating the message body, the server writes one row per element to the .txt file. Every role is from AXUIElement, every coordinate is pulled from AXPosition and AXSize, and the ‘visible’ flag comes from the window geometry. When the agent wants to send the message, it greps for [Button] "Send", grabs the x/y/w/h, and calls click_and_traverse. No pixels are involved on the way in.
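To make that concrete, here is a miniature version of the grep-then-click step. The rows below are invented to match the row shape documented in the FAQ; the labels and coordinates are hypothetical, not real Mail output.

```python
# Hypothetical rows in the documented shape; real output comes from AXUIElement.
tree = '''\
[Window] "New Message" x:400 y:100 w:600 h:480 visible
[TextField] "To:" x:420 y:150 w:560 h:24 visible
[TextField] "Subject:" x:420 y:180 w:560 h:24 visible
[TextArea] "" x:420 y:210 w:560 h:320 visible
[Button] "Send" x:912 y:138 w:56 h:28 visible
'''

# The agent "greps" the flat file for the button it wants...
send = next(line for line in tree.splitlines()
            if line.startswith('[Button] "Send"'))

# ...and pulls x/y/w/h straight out of the row to center the click.
fields = dict(f.split(":") for f in send.split() if ":" in f)
cx = int(fields["x"]) + int(fields["w"]) // 2
cy = int(fields["y"]) + int(fields["h"]) // 2
print((cx, cy))
# (940, 152)
```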
OUT OF THE BOX
What a local AI that speaks accessibility can actually do
File and Finder work
Move, rename, compress, zip, share, create folders, tag, reveal in Finder. Every Finder column is a table in the AX tree, every row has a label, every context-menu item becomes a pressable Button.
Mail drafts and replies
Open the right thread by subject match, dictate a reply, press Cmd+Shift+D or click Send by AX label.
Calendar changes
Create, reschedule, or cancel events with the New Event button and the native editor. Reads your actual Calendar database.
Slack and Discord
Electron apps publish a usable AX tree. Pick the channel, type the message, attach the file, send. No browser tab, no copy-paste.
Notes and Pages
Dictate into a specific note, add bullets, bold a section, insert a table, export to PDF.
System Settings
Toggle Do Not Disturb, rotate wallpapers, switch audio output, change network. Every pane is an accessible form.
APPS IT DRIVES ON DAY ONE
What people call a local AI app, and what Fazm is
The other entries are good products in a different category. Put side by side, the category confusion is easy to see.
| Feature | Typical local LLM runner | Fazm |
|---|---|---|
| Runs as a native Mac app | Yes | Yes |
| Chat interface | Yes | Yes (voice first) |
| Operates other Mac apps (Finder, Mail, Safari, Slack) | No | Yes, via six accessibility tools |
| Reads the macOS accessibility tree | No | Yes, after every action, as a grep-able .txt |
| Uses screenshots as the primary context | Usually no (no screen access at all) | No (the AX tree is primary, screenshots are a fallback) |
| Keeps history and memory on your Mac | Yes | Yes (per-user SQLite in Application Support) |
| Voice input built in | Rarely | Yes, with a hotkey and a floating bar |
| Needs a terminal to install | Sometimes | No, signed DMG, drag to Applications |
THE LOOP
How “make the contract change in that doc” becomes real events
Here is what actually happens when you speak a request into Fazm. The four middle steps are the ones this page exists to describe.
Capture
Audio comes in through the floating control bar. Transcription produces a user message. Memory and browser profile load the parts of your history that are relevant.
Tool choice
The agent decides the first action. For Mac work, that is almost always macos-use_open_application_and_traverse with the name of the target app.
Accessibility read
The server walks the AX tree of the target window, writes one row per element to a temp .txt file, and returns the path. The model reads the file with a grep, not a full context load.
Synthesized input
Click, type, press, scroll. Every synthesized event is a CGEvent delivered to the target app's event queue, not a cursor dragged across the screen.
Verify and continue
A fresh AX tree is written after each action. The agent greps the new tree to confirm the state changed the way it expected, and either moves to the next step or retries.
Summarize
The agent tells you what it did in one or two lines, not an essay. If anything was ambiguous, it asks with quick-reply buttons.
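The middle of the loop — act, re-read the tree, verify, retry — can be sketched as one control function. The names and stubs below are illustrative, not Fazm's internals; the point is the shape: every action is followed by a fresh tree read before the agent moves on.

```python
def run_step(tool_call, execute, read_tree, expect, max_retries=2):
    """One iteration of the act -> re-read -> verify loop.

    execute(tool_call) performs the synthesized event and returns the
    path to a fresh tree; read_tree(path) returns its rows; expect is a
    predicate some row of the new tree must satisfy before continuing.
    """
    for attempt in range(max_retries + 1):
        path = execute(tool_call)
        rows = read_tree(path)
        if any(expect(row) for row in rows):
            return rows          # state changed the way we expected
    raise RuntimeError(f"verification failed after {max_retries + 1} tries")

# Stubbed example: "click Send, then confirm the message went out".
def fake_execute(call):
    return "/tmp/tree.txt"       # would be a real traversal path

def fake_read(path):
    return ['[StaticText] "Message sent" x:10 y:10 w:120 h:20 visible']

rows = run_step({"tool": "macos-use_click_and_traverse", "element": "Send"},
                fake_execute, fake_read,
                expect=lambda row: "Message sent" in row)
print(len(rows))
# 1
```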
WHY “LOCAL” STILL MATTERS
Where the boundary actually is
“Local” is doing a lot of work in this topic, and it is worth separating the parts that move. The model that chooses tool calls runs in the cloud today, because a good tool-use model is bigger than what most Macs can host. Everything else is local: the accessibility reads, the synthesized events, the transcribed audio, the indexed files, the conversation history, the knowledge graph, the browser profile. All of it sits in a per-user GRDB SQLite database under ~/Library/Application Support/Fazm/users/{userId}.
The practical effect is that your file tree, your Mail, your documents, your calendar, your memory, and your voice never leave the machine. The only thing that does is the description of the tool call the model wants to make next. That is a different privacy story than either “nothing is local” or “the whole model is local but it cannot do anything”, and it is the one that actually helps you get work done.
How to try it in five minutes
1. Download Fazm.dmg from fazm.ai, drag Fazm.app into /Applications, and open it.
2. Grant Accessibility in System Settings > Privacy & Security. That is the only permission the tools need.
3. Press the hotkey and say “open my inbox and summarize the unread threads”. Watch the AX tree flow through the floating bar.
4. Try something tactile: “rename everything in Downloads from last Tuesday” or “add two bullets to my standup note”.
5. If you want to see exactly what it did, the .txt trees are in /tmp/macos-use/ until they are garbage collected.
Want to see it click real buttons on a real Mac?
Fifteen minutes, one Zoom, a live demo of the six accessibility tools driving your stack.
Book a call →
Frequently asked questions
What actually makes Fazm a local AI app rather than a chat wrapper around a local model?
Fazm is a signed, notarized macOS app that runs on your machine and operates other macOS apps directly. The agent's 'hands' are a Swift MCP server that ships inside the bundle at Contents/MacOS/mcp-server-macos-use. That server wraps Apple's AXUIElement accessibility APIs and exposes six tools: macos-use_open_application_and_traverse, macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse, macos-use_scroll_and_traverse, and macos-use_refresh_traversal. Every call returns a .txt file containing the current accessibility tree. That is what makes it 'local AI' in the working sense of the phrase: it lives on your Mac and acts on the apps that are already open, rather than sitting in a tab asking what you want to talk about.
Is the model itself running on my Mac, or in the cloud?
The model that writes the chain of tool calls runs in the cloud today. The part that matters for 'local' is the execution surface: every tool call, every AX tree read, every key event, every click coordinate lives and stays on your Mac. A chat-only local-LLM wrapper inverts that: the model is local but nothing it says ever touches Finder, Mail, Safari, or your other apps. Fazm is the opposite shape. A future swap to an on-device model is a config change because the agent loop is the same.
Why does Fazm read an accessibility tree instead of taking screenshots like other computer-use agents?
Every row the server writes has the shape '[Role] "text" x:N y:N w:W h:H visible' — role from AXUIElement, label from AXTitle/AXValue, and exact pixel coordinates from AXPosition and AXSize. That is the same data macOS hands to VoiceOver. The agent greps the .txt for the button it wants and passes the x/y/w/h straight back to click_and_traverse, which auto-centers the click at (x+w/2, y+h/2). Screenshot-only agents have to re-detect every element on every step and frequently misread custom controls, dark mode, or non-Latin text. Reading the AX tree is faster, cheaper, and correct by construction.
Which Mac apps can Fazm actually drive?
Anything that publishes an accessibility tree - which is almost every Mac app, because macOS assistive technologies depend on it. Native apps like Finder, Mail, Calendar, Notes, Messages, Safari, System Settings, Keynote, Pages, and Numbers all work. Electron-based apps (Slack, Discord, Notion, Cursor, VS Code) work. Cross-platform Qt/Chromium apps work. For the browser specifically, Fazm also bundles a separate Playwright MCP bridge so the model can choose between clicking via AX or using real DOM selectors inside Chrome. The fallback when an app has a broken AX tree is a screenshot, but it is not the default path.
Exactly where does the accessibility tree live on disk, and what does one row look like?
The server writes each traversal to a temporary .txt file and returns the path as the 'file' field of the tool response. Every row has the format '[Role] "text" x:N y:N w:W h:H visible' — one element per line, no JSON, no hierarchy, just flat grep-able text. That format is defined in the server's instruction string inside Sources/MCPServer/main.swift and shipped to the model as part of the tool metadata, so the agent is explicitly told to grep the file rather than reload it into its context window. The 'visible' flag is present or absent per line.
Does Fazm need terminal access or Homebrew? Is this a developer-only tool?
No. You download a DMG from fazm.ai, drag the app into /Applications, and open it. The only permission it asks for is Accessibility (granted in System Settings > Privacy & Security > Accessibility). There is no pip install, no Node install, no chromedriver on PATH, no Docker. The bundled mcp-server-macos-use binary is signed with Fazm's Developer ID and universal (arm64 + x86_64), which is why the file | grep "universal binary" check at codemagic.yaml line 223 of the Fazm repo has to pass on every release build.
How do voice commands turn into actions on my Mac?
You press a hotkey or say a wake phrase, Fazm captures audio, transcription runs, the transcript becomes a user message in the conversation, the agent chooses a tool call, the tool call hits mcp-server-macos-use over stdio, the server dispatches to MacosUseSDK which in turn uses the AXUIElement family of APIs and CGEvent for synthesizing input, and the accessibility tree written after the action comes back to the model as a .txt file path. The whole loop is stdio and function calls, not HTTP. The server binary communicates with the app over stdin/stdout per the MCP spec.
What happens when Fazm can't find a button? Does it click the wrong thing?
The click tool accepts either exact x/y coordinates or an 'element' string that is matched against the AX tree. If the match is ambiguous or missing, the tool returns an error with the nearby rows of the tree and the agent retries with either refined text or a refresh_traversal. The server's instructions to the model, included on every tool call, explicitly forbid estimating coordinates from screenshots: 'NEVER estimate coordinates visually from screenshots. Screenshot pixel positions do NOT match screen coordinates (they differ by the window origin offset).' The rule is enforced by the prompt attached to the tool, not the model's general judgment.
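The match-or-error behavior that answer describes can be sketched as a small function. This assumes a simple substring match; the server's actual matching rules are not documented here, so treat it as a model of the behavior, not the implementation.

```python
def match_element(rows, label):
    """Return the single tree row matching `label`, or an error payload
    with nearby candidates, mirroring the ambiguous/missing behavior."""
    hits = [r for r in rows if label in r]
    if len(hits) == 1:
        return {"ok": True, "row": hits[0]}
    reason = "ambiguous" if hits else "not found"
    return {"ok": False, "error": reason, "nearby": hits or rows[:5]}

# Hypothetical tree rows in the documented shape.
rows = [
    '[Button] "Reply" x:10 y:10 w:40 h:20 visible',
    '[Button] "Reply All" x:60 y:10 w:60 h:20 visible',
    '[Button] "Send" x:130 y:10 w:40 h:20 visible',
]
print(match_element(rows, "Reply")["ok"])   # "Reply" also matches "Reply All"
print(match_element(rows, "Send")["ok"])
# False
# True
```

The ambiguous case is the interesting one: returning the candidate rows instead of guessing is what lets the agent refine the label or refresh the tree rather than click the wrong button.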
Will this conflict with my existing apps or steal focus like a screen recorder?
The AXUIElement APIs send synthesized events to the target app process without stealing global focus the way a TCP-driven automation binary does. You can keep typing in a different app while Fazm is clicking something in Mail, because the press and click events land on the target window's event queue directly. The visible signal that Fazm is doing work is the floating control bar and the overlay the Chrome bridge paints on browser tabs, not a cursor that jumps around the screen.
Is my data sent anywhere while Fazm is operating my apps?
The accessibility tree rows the model reads are the same metadata assistive technologies receive: role, label, visible frame. The raw pixels of your screen are not sent unless a screenshot is explicitly taken (which is rare in the default agent loop because the AX tree is the primary context). Memory, conversation history, indexed files, and the knowledge graph are stored in a per-user GRDB SQLite database at ~/Library/Application Support/Fazm/users/{userId}/ on your Mac. That is the 'local' surface that matters: your text, your files, your history never leave the machine.