A local AI app that actually does things on your Mac.
Most lists under this heading are wrappers around a 7B model in a chat window. You type, it types back, and nothing outside the window changes. Fazm is a different shape. It is a signed Mac app that ships a small Swift server at Contents/MacOS/mcp-server-macos-use, which exposes six accessibility tools and hands the agent a flat text tree of the whole screen after every click. The AI greps that file the way you would grep a log, finds the button it wants, and presses it.
THE MISREAD
When people say “local AI app”, they usually mean a chat window
If you searched this topic and opened the first few pages, you got a ranked list of the same half dozen names: Ollama, LM Studio, GPT4All, Jan, Msty, Faraday. Every one of them is the same shape. A local LLM runner with a chat UI on top. You load a model, you type into a box, it types back. If you ask it to rename a file, it will write a sentence about how you could rename a file. It will not rename the file. It cannot reach outside the tab.
That is a real product category and a good one. It is not the thing most people actually want when they type this topic into Google. The intent underneath the words is usually closer to “an AI that lives on my Mac and does stuff on my Mac”. The chat-in-a-box answer and the do-stuff answer are two different products that happen to share three words of marketing language.
This page is about the second one.
Chat-only local AI vs. a local AI that operates the apps you already have
A pretty window, a blinking cursor, and a model waiting for you to describe what you want. Everything it can do is constrained to text it emits back into the window. Your Finder, Mail, Safari, Slack, and Notes are unchanged.
- Bound to a chat tab
- No hands on your OS
- Cannot click a Send button
- Cannot reorder a Finder list
- Great for private conversation, not for work
THE BUNDLE
What is inside the .app that makes this work
If you right-click the Fazm app bundle and open Contents/MacOS, you will find two things. One is the main Swift/SwiftUI app. The other is a small binary called mcp-server-macos-use. It is a standalone Model Context Protocol server, built from a separate repo, bundled universal (arm64 + x86_64), signed and notarized with the rest of the app. The Fazm release pipeline literally runs file /tmp/mcp-server-macos-use-universal | grep -q "universal binary" as a gate on every build.
When the app starts, it spawns the server as a child process and speaks to it over stdio. Every time the model wants to do something on your Mac, it sends a tool call, the server runs it, and the server writes back a compact text summary plus a path to a fresh accessibility tree. That is the loop.
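That parent-child loop can be sketched in a few lines. This is a simplified JSON-RPC-style framing, not Fazm's code: the real wire format is defined by the Model Context Protocol spec, and the field names and values here are illustrative.

```python
import json

def build_tool_call(call_id: int, tool: str, arguments: dict) -> str:
    """Frame a tool call as one JSON line, the way an MCP client writes
    to the child server's stdin. (Shape simplified for illustration.)"""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": call_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

def parse_tool_result(line: str) -> dict:
    """Parse one response line read back from the server's stdout."""
    return json.loads(line).get("result", {})

# The app writes a call like this to the child's stdin...
req = build_tool_call(1, "macos-use_open_application_and_traverse",
                      {"identifier": "Mail"})

# ...and the server answers with a compact summary plus the path to a
# fresh accessibility tree (values here are made up for the sketch).
fake_reply = json.dumps({"jsonrpc": "2.0", "id": 1,
                         "result": {"pid": 4242,
                                    "file": "/tmp/macos-use/tree-1.txt"}})
result = parse_tool_result(fake_reply)
print(result["file"])
```

The important property is in the last two lines: the response is not the tree itself, just a path to it, so the model's context stays small.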
Voice in, accessibility out: the local loop
THE SIX TOOLS
Everything the agent is allowed to do on your Mac, in one list
The surface area is intentionally small. Not fifty tools, not thirty, not a metatool that takes a free-form command. Six. Each one performs a specific physical action and returns a fresh accessibility tree so the next step starts with ground truth.
macos-use_open_application_and_traverse
Opens or activates an app by name, bundle ID, or file path, then dumps the accessibility tree of the new active window.
Used as the first step of almost every flow. The response includes the PID that the next tool calls will target.
macos-use_click_and_traverse
Clicks at (x + w/2, y + h/2) inside the PID's window. Optionally types and presses a key in the same call.
The element parameter accepts a label match against the tree so the agent does not have to carry coordinates in its head.
macos-use_type_and_traverse
Types a string into whatever field currently has focus, then retraverses. Also takes an optional pressKey so type-and-enter is one call.
The agent is explicitly forbidden from typing its chain of thought into a user document, a rule shipped in the system prompt.
macos-use_press_key_and_traverse
Synthesizes a key event with optional modifier flags (Cmd, Shift, Option, Control, Fn).
This is how Cmd+S, Cmd+F, Cmd+Shift+N, and the rest of the macOS keyboard vocabulary become available to the agent.
macos-use_scroll_and_traverse
Simulates a scroll wheel event at a given coordinate and retraverses. Useful for long documents, long lists, long chat histories.
macos-use_refresh_traversal
Retraverses the PID's accessibility tree without performing an action. The ‘look again’ tool for when an animation finished or the user typed something.
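As a rough illustration of the keyboard tool's input, a shortcut like Cmd+Shift+N can be split into the key plus its modifier flags before being handed to a press_key-style call. The argument shape below is an assumption for the sketch, not the server's actual schema.

```python
# The modifier vocabulary the press_key tool supports, per the list above.
MODIFIERS = {"Cmd", "Shift", "Option", "Control", "Fn"}

def parse_shortcut(spec: str) -> dict:
    """Split 'Cmd+Shift+N' into the key and modifier flags a
    press_key-style tool would synthesize. Illustrative shape only."""
    parts = spec.split("+")
    mods = [p for p in parts[:-1] if p in MODIFIERS]
    return {"key": parts[-1], "modifiers": mods}

print(parse_shortcut("Cmd+Shift+N"))
# {'key': 'N', 'modifiers': ['Cmd', 'Shift']}
```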
THE ANCHOR FACT
Every element, one line of text, this exact shape
The server dumps one row per accessibility element into a flat .txt file, with a format pinned in the tool instructions so the model is told the shape it will receive. It is not JSON. It is not a tree with indentation that matters. It is one element per line that the agent can grep. The shape is:

[Role] "text" x:N y:N w:W h:H visible
That is the piece nobody else writing about this topic mentions, because they do not know. Open up Fazm.app/Contents/MacOS, look at a traversal output, and the format will match exactly. It is what lets the agent use the accessibility tree the way it uses any other text: read, grep, decide.
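That read-grep-decide step is easy to sketch. This is not Fazm's code: the regex below is fitted to the row shape quoted in the FAQ ('[Role] "text" x:N y:N w:W h:H visible'), and the sample row's label and coordinates are invented.

```python
import re

# One row per element, matching the documented shape:
#   [Role] "text" x:N y:N w:W h:H visible
ROW = re.compile(
    r'\[(?P<role>[^\]]+)\]\s+"(?P<text>[^"]*)"\s+'
    r'x:(?P<x>-?\d+)\s+y:(?P<y>-?\d+)\s+w:(?P<w>\d+)\s+h:(?P<h>\d+)'
    r'(?P<visible>\s+visible)?'
)

def parse_row(line: str):
    """Pull role, label, frame, and visibility out of one tree row."""
    m = ROW.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    return {"role": d["role"], "text": d["text"],
            "x": int(d["x"]), "y": int(d["y"]),
            "w": int(d["w"]), "h": int(d["h"]),
            "visible": d["visible"] is not None}

def click_point(el: dict) -> tuple[int, int]:
    """click_and_traverse auto-centers the click at (x + w/2, y + h/2)."""
    return el["x"] + el["w"] // 2, el["y"] + el["h"] // 2

el = parse_row('[Button] "Send" x:912 y:138 w:56 h:28 visible')
print(el["role"], click_point(el))
# Button (940, 152)
```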
A REAL TREE
What the agent actually sees when it opens Mail
After opening Mail and activating the message body, the server writes one row per element to the .txt file. Every role is from AXUIElement, every coordinate is pulled from AXPosition and AXSize, and the ‘visible’ flag comes from the window geometry. When the agent wants to send the message, it greps for [Button] "Send", grabs the x/y/w/h, and calls click_and_traverse. No pixels are involved on the way in.
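To make that concrete, here is a miniature version of the grep-then-click step. The rows below are invented to match the row shape documented in the FAQ; the labels and coordinates are hypothetical, not real Mail output.

```python
# Hypothetical rows in the documented shape; real output comes from AXUIElement.
tree = '''\
[Window] "New Message" x:400 y:100 w:600 h:480 visible
[TextField] "To:" x:420 y:150 w:560 h:24 visible
[TextField] "Subject:" x:420 y:180 w:560 h:24 visible
[TextArea] "" x:420 y:210 w:560 h:320 visible
[Button] "Send" x:912 y:138 w:56 h:28 visible
'''

# The agent "greps" the flat file for the button it wants...
send = next(line for line in tree.splitlines()
            if line.startswith('[Button] "Send"'))

# ...and pulls x/y/w/h straight out of the row to center the click.
fields = dict(f.split(":") for f in send.split() if ":" in f)
cx = int(fields["x"]) + int(fields["w"]) // 2
cy = int(fields["y"]) + int(fields["h"]) // 2
print((cx, cy))
# (940, 152)
```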
OUT OF THE BOX
What a local AI that speaks accessibility can actually do
File and Finder work
Move, rename, compress, zip, share, create folders, tag, reveal in Finder. Every Finder column is a table in the AX tree, every row has a label, every context-menu item becomes a pressable Button.
Mail drafts and replies
Open the right thread by subject match, dictate a reply, press Cmd+Shift+D or click Send by AX label.
Calendar changes
Create, reschedule, or cancel events with the New Event button and the native editor. Reads your actual Calendar database.
Slack and Discord
Electron apps publish a usable AX tree. Pick the channel, type the message, attach the file, send. No browser tab, no copy-paste.
Notes and Pages
Dictate into a specific note, add bullets, bold a section, insert a table, export to PDF.
System Settings
Toggle Do Not Disturb, rotate wallpapers, switch audio output, change network. Every pane is an accessible form.
APPS IT DRIVES ON DAY ONE
What people call a local AI app, and what Fazm is
The other entries are good products in a different category. Put side by side, the category confusion is easy to see.
| Feature | Typical local LLM runner | Fazm |
|---|---|---|
| Runs as a native Mac app | Yes | Yes |
| Chat interface | Yes | Yes (voice first) |
| Operates other Mac apps (Finder, Mail, Safari, Slack) | No | Yes, via six accessibility tools |
| Reads the macOS accessibility tree | No | Yes, after every action, as a grep-able .txt |
| Uses screenshots as the primary context | Usually no (no screen access at all) | No (the AX tree is primary, screenshots are a fallback) |
| Keeps history and memory on your Mac | Yes | Yes (per-user SQLite in Application Support) |
| Voice input built in | Rarely | Yes, with a hotkey and a floating bar |
| Needs a terminal to install | Sometimes | No, signed DMG, drag to Applications |
THE LOOP
How “make the contract change in that doc” becomes real events
Here is what actually happens when you speak a request into Fazm. The four middle steps are the ones this page exists to describe.
Capture
Audio comes in through the floating control bar. Transcription produces a user message. Memory and browser profile load the parts of your history that are relevant.
Tool choice
The agent decides the first action. For Mac work, that is almost always macos-use_open_application_and_traverse with the name of the target app.
Accessibility read
The server walks the AX tree of the target window, writes one row per element to a temp .txt file, and returns the path. The model reads the file with a grep, not a full context load.
Synthesized input
Click, type, press, scroll. Every synthesized event is a CGEvent delivered to the target app's event queue, not a cursor dragged across the screen.
Verify and continue
A fresh AX tree is written after each action. The agent greps the new tree to confirm the state changed the way it expected, and either moves to the next step or retries.
Summarize
The agent tells you what it did in one or two lines, not an essay. If anything was ambiguous, it asks with quick-reply buttons.
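The middle of the loop — act, re-read the tree, verify, retry — can be sketched as one control function. The names and stubs below are illustrative, not Fazm's internals; the point is the shape: every action is followed by a fresh tree read before the agent moves on.

```python
def run_step(tool_call, execute, read_tree, expect, max_retries=2):
    """One iteration of the act -> re-read -> verify loop.

    execute(tool_call) performs the synthesized event and returns the
    path to a fresh tree; read_tree(path) returns its rows; expect is a
    predicate some row of the new tree must satisfy before continuing.
    """
    for attempt in range(max_retries + 1):
        path = execute(tool_call)
        rows = read_tree(path)
        if any(expect(row) for row in rows):
            return rows          # state changed the way we expected
    raise RuntimeError(f"verification failed after {max_retries + 1} tries")

# Stubbed example: "click Send, then confirm the message went out".
def fake_execute(call):
    return "/tmp/tree.txt"       # would be a real traversal path

def fake_read(path):
    return ['[StaticText] "Message sent" x:10 y:10 w:120 h:20 visible']

rows = run_step({"tool": "macos-use_click_and_traverse", "element": "Send"},
                fake_execute, fake_read,
                expect=lambda row: "Message sent" in row)
print(len(rows))
# 1
```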
WHY “LOCAL” STILL MATTERS
Where the boundary actually is
“Local” is doing a lot of work in this topic, and it is worth separating the parts that move. The model that chooses tool calls runs in the cloud today, because a good tool-use model is bigger than what most Macs can host. Everything else is local: the accessibility reads, the synthesized events, the transcribed audio, the indexed files, the conversation history, the knowledge graph, the browser profile. All of it sits in a per-user GRDB SQLite database under ~/Library/Application Support/Fazm/users/{userId}.
The practical effect is that your file tree, your Mail, your documents, your calendar, your memory, and your voice never leave the machine. The only thing that does is the description of the tool call the model wants to make next. That is a different privacy story than either “nothing is local” or “the whole model is local but it cannot do anything”, and it is the one that actually helps you get work done.
How to try it in five minutes
1. Download Fazm.dmg from fazm.ai, drag Fazm.app into /Applications, and open it.
2. Grant Accessibility in System Settings > Privacy & Security. That is the only permission the tools need.
3. Press the hotkey and say “open my inbox and summarize the unread threads”. Watch the AX tree flow through the floating bar.
4. Try something tactile: “rename everything in Downloads from last Tuesday” or “add two bullets to my standup note”.
5. If you want to see exactly what it did, the .txt trees are in /tmp/macos-use/ until they are garbage collected.
Want to see it click real buttons on a real Mac?
Fifteen minutes, one Zoom, a live demo of the six accessibility tools driving your stack.
Book a call →
Frequently asked questions
What actually makes Fazm a local AI app rather than a chat wrapper around a local model?
Fazm is a signed, notarized macOS app that runs on your machine and operates other macOS apps directly. The agent's 'hands' are a Swift MCP server that ships inside the bundle at Contents/MacOS/mcp-server-macos-use. That server wraps Apple's AXUIElement accessibility APIs and exposes six tools: macos-use_open_application_and_traverse, macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse, macos-use_scroll_and_traverse, and macos-use_refresh_traversal. Every call returns a .txt file containing the current accessibility tree. That is what makes it 'local AI' in the working sense of the phrase: it lives on your Mac and acts on the apps that are already open, rather than sitting in a tab asking what you want to talk about.
Is the model itself running on my Mac, or in the cloud?
The model that writes the chain of tool calls runs in the cloud today. The part that matters for 'local' is the execution surface: every tool call, every AX tree read, every key event, every click coordinate lives and stays on your Mac. A chat-only local-LLM wrapper inverts that: the model is local but nothing it says ever touches Finder, Mail, Safari, or your other apps. Fazm is the opposite shape. A future swap to an on-device model is a config change because the agent loop is the same.
Why does Fazm read an accessibility tree instead of taking screenshots like other computer-use agents?
Every row the server writes has the shape '[Role] "text" x:N y:N w:W h:H visible' — role from AXUIElement, label from AXTitle/AXValue, and exact pixel coordinates from AXPosition and AXSize. That is the same data macOS hands to VoiceOver. The agent greps the .txt for the button it wants and passes the x/y/w/h straight back to click_and_traverse, which auto-centers the click at (x+w/2, y+h/2). Screenshot-only agents have to re-detect every element on every step and frequently misread custom controls, dark mode, or non-Latin text. Reading the AX tree is faster, cheaper, and correct by construction.
Which Mac apps can Fazm actually drive?
Anything that publishes an accessibility tree - which is almost every Mac app, because macOS assistive technologies depend on it. Native apps like Finder, Mail, Calendar, Notes, Messages, Safari, System Settings, Keynote, Pages, and Numbers all work. Electron-based apps (Slack, Discord, Notion, Cursor, VS Code) work. Cross-platform Qt/Chromium apps work. For the browser specifically, Fazm also bundles a separate Playwright MCP bridge so the model can choose between clicking via AX or using real DOM selectors inside Chrome. The fallback when an app has a broken AX tree is a screenshot, but it is not the default path.
Exactly where does the accessibility tree live on disk, and what does one row look like?
The server writes each traversal to a temporary .txt file and returns the path as the 'file' field of the tool response. Every row has the format '[Role] "text" x:N y:N w:W h:H visible' — one element per line, no JSON, no hierarchy, just flat grep-able text. That format is defined in the server's instruction string inside Sources/MCPServer/main.swift and shipped to the model as part of the tool metadata, so the agent is explicitly told to grep the file rather than reload it into its context window. The 'visible' flag is present or absent per line.
Does Fazm need terminal access or Homebrew? Is this a developer-only tool?
No. You download a DMG from fazm.ai, drag the app into /Applications, and open it. The only permission it asks for is Accessibility (granted in System Settings > Privacy & Security > Accessibility). There is no pip install, no Node install, no chromedriver on PATH, no Docker. The bundled mcp-server-macos-use binary is signed with Fazm's Developer ID and universal (arm64 + x86_64), which is why the file | grep "universal binary" check at codemagic.yaml line 223 of the Fazm repo has to pass on every release build.
How do voice commands turn into actions on my Mac?
You press a hotkey or say a wake phrase, Fazm captures audio, transcription runs, the transcript becomes a user message in the conversation, the agent chooses a tool call, the tool call hits mcp-server-macos-use over stdio, the server dispatches to MacosUseSDK which in turn uses the AXUIElement family of APIs and CGEvent for synthesizing input, and the accessibility tree written after the action comes back to the model as a .txt file path. The whole loop is stdio and function calls, not HTTP. The server binary communicates with the app over stdin/stdout per the MCP spec.
What happens when Fazm can't find a button? Does it click the wrong thing?
The click tool accepts either exact x/y coordinates or an 'element' string that is matched against the AX tree. If the match is ambiguous or missing, the tool returns an error with the nearby rows of the tree and the agent retries with either refined text or a refresh_traversal. The server's instructions to the model, included on every tool call, explicitly forbid estimating coordinates from screenshots: 'NEVER estimate coordinates visually from screenshots. Screenshot pixel positions do NOT match screen coordinates (they differ by the window origin offset).' The rule is enforced by the prompt attached to the tool, not the model's general judgment.
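The match-or-error behavior that answer describes can be sketched as a small function. This assumes a simple substring match; the server's actual matching rules are not documented here, so treat it as a model of the behavior, not the implementation.

```python
def match_element(rows, label):
    """Return the single tree row matching `label`, or an error payload
    with nearby candidates, mirroring the ambiguous/missing behavior."""
    hits = [r for r in rows if label in r]
    if len(hits) == 1:
        return {"ok": True, "row": hits[0]}
    reason = "ambiguous" if hits else "not found"
    return {"ok": False, "error": reason, "nearby": hits or rows[:5]}

# Hypothetical tree rows in the documented shape.
rows = [
    '[Button] "Reply" x:10 y:10 w:40 h:20 visible',
    '[Button] "Reply All" x:60 y:10 w:60 h:20 visible',
    '[Button] "Send" x:130 y:10 w:40 h:20 visible',
]
print(match_element(rows, "Reply")["ok"])   # "Reply" also matches "Reply All"
print(match_element(rows, "Send")["ok"])
# False
# True
```

The ambiguous case is the interesting one: returning the candidate rows instead of guessing is what lets the agent refine the label or refresh the tree rather than click the wrong button.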
Will this conflict with my existing apps or steal focus like a screen recorder?
The AXUIElement APIs send synthesized events to the target app process without stealing global focus the way a TCP-driven automation binary does. You can keep typing in a different app while Fazm is clicking something in Mail, because the press and click events land on the target window's event queue directly. The visible signal that Fazm is doing work is the floating control bar and the overlay the Chrome bridge paints on browser tabs, not a cursor that jumps around the screen.
Is my data sent anywhere while Fazm is operating my apps?
The accessibility tree rows the model reads are the same metadata assistive technologies receive: role, label, visible frame. The raw pixels of your screen are not sent unless a screenshot is explicitly taken (which is rare in the default agent loop because the AX tree is the primary context). Memory, conversation history, indexed files, and the knowledge graph are stored in a per-user GRDB SQLite database at ~/Library/Application Support/Fazm/users/{userId}/ on your Mac. That is the 'local' surface that matters: your text, your files, your history never leave the machine.