A local LLM gives you tokens. A local AI agent gives you actions.
The difference is a harness, and on a Mac the harness is the part that spawns subprocesses, talks to AXUIElementCreateApplication, parses tool calls, asks for Accessibility permission, drops image bytes, and renders a chat window. Fazm is that harness. Five built-in MCP servers, one Swift app, one bridge subprocess, and a model endpoint of your choice. This page walks through what is actually inside.
Difference between a local LLM and a local AI agent on Mac
A local LLM (Ollama, LM Studio, MLX, llama.cpp) is just inference. Prompt in, tokens out. A local AI agent is the harness around that inference: a tool dispatcher, an MCP server set, accessibility-API access to your apps, a permission layer, and a UI to drive it.
Fazm is the harness, not the model. Its built-in tool set is declared explicitly at acp-bridge/src/index.ts:1496 with BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"]): five subprocess-spawned servers, plus the Swift app driving them. Source on GitHub: github.com/m13v/fazm.
The two stacks side by side
Eight rows. The middle column is what you get from a local LLM stack on its own. The right column is what Fazm gives you when you install it. There is overlap (both run on your Mac, both respect your privacy) and there is real divergence (one ends at tokens, the other ends at clicks).
| Feature | Local LLM only (Ollama, LM Studio, MLX, llama.cpp) | Fazm (the agent harness) |
|---|---|---|
| Primary output | Tokens. Strings emitted from a model API | Actions on real apps (clicks, typing, file writes, browser navigations) |
| Reads your apps | No. The model only sees what you paste into the prompt | Yes, via macos-use MCP and macOS accessibility APIs (AXUIElementCopyAttributeValue) |
| Tool calling | Up to you. The model emits text; nothing parses or executes it | MCP protocol, five built-in servers spawned as subprocesses, dispatched in acp-bridge/src/index.ts |
| macOS permissions | None. Inference does not touch the OS | Requests Accessibility, Screen Recording, Microphone on first run; live AX probe at AppState.swift:439 |
| Voice input | Not part of the LLM. You build it yourself | Built in. WhisperKit on-device transcription drives the chat |
| UI | A REST endpoint and a CLI. UI is your problem | Native SwiftUI macOS app with chat, history, settings, onboarding |
| Context discipline | No context layer. The model gets exactly what you send | Tool-result filter at lines 2271-2307 drops image bytes; only text flows into model context |
| Where the tokens come from | Itself. That is the entire job | Pluggable. Anthropic by default, swappable to any Anthropic-compatible endpoint including a local Ollama or MLX bridge |
What is actually inside the harness
Six concrete pieces, each tied to a specific file in the Fazm source tree. This is the part the existing playbooks online tend to wave at without showing. If you tried to build it yourself, you would end up reinventing roughly this stack.
Five MCP servers, registered at startup
acp-bridge/src/index.ts:1496 declares BUILTIN_MCP_NAMES = new Set(['fazm_tools', 'playwright', 'macos-use', 'whatsapp', 'google-workspace']). Each one is a separate subprocess. fazm_tools handles internal Fazm calls. playwright drives Chrome. macos-use drives any Mac app via the AX API. whatsapp drives the Catalyst app. google-workspace drives Gmail, Calendar, Drive, Docs, Sheets through Google APIs. Your model calls them by tool name prefix.
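As a rough sketch of what that registration amounts to (not Fazm's actual code; the command paths and argument lists below are placeholders, only the five names come from the source):

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// The five built-in names are real (index.ts:1496); the commands here are placeholders.
const BUILTIN_MCP_SERVERS: Record<string, { command: string; args: string[] }> = {
  fazm_tools: { command: "node", args: ["./dist/fazm-tools-server.js"] },
  playwright: { command: "npx", args: ["@playwright/mcp@latest"] },
  "macos-use": { command: "./Contents/MacOS/mcp-server-macos-use", args: [] },
  whatsapp: { command: "./Contents/MacOS/mcp-server-whatsapp", args: [] },
  "google-workspace": { command: "python3", args: ["-m", "google_workspace_mcp"] },
};

// Spawn each server as its own stdio subprocess and keep a handle per name,
// so a tool call prefixed with that name can be routed to the right process.
const running = new Map<string, ChildProcess>();
for (const [name, { command, args }] of Object.entries(BUILTIN_MCP_SERVERS)) {
  const child = spawn(command, args, { stdio: ["pipe", "pipe", "inherit"] });
  running.set(name, child);
}
```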
Native macos-use binary
Bundled at Contents/MacOS/mcp-server-macos-use, registered at index.ts:1287 only if it exists. It is what reads kAXRole, kAXTitle, kAXValue, kAXPosition, kAXSize for whatever app is in front. A local LLM has no equivalent.
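The "only if it exists" check is easy to picture. A minimal sketch, assuming a hypothetical bundle path and registry (neither is Fazm's real wiring):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical stand-ins: the real bridge computes the bundle path and has its own registry.
const appBundlePath = process.env.FAZM_APP_BUNDLE ?? "/Applications/Fazm.app";
const registry = new Map<string, { command: string; args: string[] }>();

// Register the native AX traversal server only if the bundled binary is actually present,
// mirroring the "only if it exists" behavior described above.
const macosUseBinary = join(appBundlePath, "Contents", "MacOS", "mcp-server-macos-use");
if (existsSync(macosUseBinary)) {
  registry.set("macos-use", { command: macosUseBinary, args: [] });
} else {
  console.warn("mcp-server-macos-use not found; AX traversal tools will be unavailable");
}
```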
Tool-result filter that drops image bytes
acp-bridge/src/index.ts lines 2271-2307 contain two branches, both for text. There is no item.type === 'image' branch. A screenshot tool result is dropped on the floor; an AX tree flows through. That keeps the model's context window from being consumed by base64.
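The same idea in a minimal sketch (the item shape is simplified; this is not the bridge's literal code):

```typescript
// Shape of MCP tool-result content items, simplified for the sketch.
type ToolResultItem =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

// Forward only text into model context; image payloads (base64 screenshots) are dropped,
// which is what keeps a 500 KB capture from eating the context window.
function filterToolResult(items: ToolResultItem[]): string {
  return items
    .filter((item): item is Extract<ToolResultItem, { type: "text" }> => item.type === "text")
    .map((item) => item.text)
    .join("\n");
}
```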
Live AX permission probe
AppState.swift:439 calls AXUIElementCreateApplication on the frontmost app's pid and AXUIElementCopyAttributeValue for the focused window. If the result is .apiDisabled, Fazm knows the user revoked Accessibility access and surfaces the System Settings flow. A local LLM never has to know about this.
User MCP servers in ~/.fazm/mcp-servers.json
MCPServerManager.swift loads ~/.fazm/mcp-servers.json on launch and joins those servers with the built-ins. The format mirrors Claude Code (name, command, args, env, enabled). Plug in any MCP-compatible server and the agent gets new tools without a code change.
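One plausible entry, to show the shape; the surrounding key and the example server are guesses, only the per-server fields (name, command, args, env, enabled) come from the description above:

```json
{
  "mcpServers": [
    {
      "name": "my-notes",
      "command": "npx",
      "args": ["-y", "some-mcp-server"],
      "env": { "NOTES_DIR": "/Users/me/Notes" },
      "enabled": true
    }
  ]
}
```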
On-device voice and chat UI
WhisperKit transcribes mic input on-device. The Swift app renders the chat, the tool-call traces, the permission prompts, and the onboarding. You talk, the agent acts. None of that is what an LLM API gives you.
“BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"])”
acp-bridge/src/index.ts line 1496
The five built-in servers, one paragraph each
Each entry is one of the subprocesses Fazm's bridge spawns. The model picks which one to call based on the tool-name prefix; a minimal routing sketch follows the list. None of them ship inference; each one wraps an OS-level capability the LLM cannot reach on its own.
fazm_tools
Internal stdio MCP. Settings, app history, window info. Where the agent looks first when a request is about Fazm itself.
playwright
Chrome via the Playwright MCP. Real browser, real cookies, real logins. Snapshots written to /tmp/playwright-mcp/ as YAML.
macos-use
Native binary that traverses any AX-compliant Mac app. Mail, Calendar, Slack Catalyst, Figma, VS Code, Cursor, Obsidian, Finder.
whatsapp
Native binary for the WhatsApp Catalyst desktop app. Read chats, search, open by index, send messages, all through accessibility.
google-workspace
Bundled Python MCP for Gmail, Calendar, Drive, Docs, Sheets. Hits Google APIs directly with a stored OAuth token, no browser scripting.
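Routing by prefix, in a minimal sketch. Tool names follow the mcp__<server>__<tool> convention the walkthrough below uses; routeToolCall is illustrative, not the bridge's actual function:

```typescript
// Split an incoming tool name like "mcp__macos-use__open_application_and_traverse"
// into the server that should receive it and the tool to invoke on that server.
function routeToolCall(toolName: string): { server: string; tool: string } | null {
  const match = /^mcp__(.+?)__(.+)$/.exec(toolName);
  if (!match) return null; // not an MCP tool call; handled elsewhere
  const [, server, tool] = match;
  return { server, tool };
}

// => { server: "macos-use", tool: "open_application_and_traverse" }
console.log(routeToolCall("mcp__macos-use__open_application_and_traverse"));
```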
The corollary: pick your model freely
Because the harness is decoupled from the model, you have a real choice on the inference side. Fazm defaults to Anthropic Claude through the Claude Agent SDK because in 2026 it is still the most reliable option for multi-step tool calling. You can also point the bridge at any Anthropic-compatible endpoint, including a local proxy that exposes Ollama or MLX behind that API contract.
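Swapping the endpoint is a client-side change, not a harness change. An illustrative sketch (not Fazm's wiring) using the official Anthropic TypeScript SDK, which accepts a baseURL override; the port and model name are placeholders for whatever your local proxy exposes:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Point the client at a local Anthropic-compatible proxy (e.g. one fronting Ollama or MLX).
// The URL and model name are placeholders; the proxy decides which local weights actually run.
const client = new Anthropic({
  apiKey: "not-used-by-local-proxy",
  baseURL: "http://localhost:8080",
});

const response = await client.messages.create({
  model: "qwen3:32b", // whatever identifier the proxy maps to local weights
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize today's unread Mail subjects." }],
});

console.log(response.content);
```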
The honest tradeoff: open-weights models are gaining fast on tool-call reliability, especially Qwen3 and Gemma 3, but they still emit malformed JSON arguments more often than Claude on long-horizon tasks. If your workflow is "open Mail, find the Stripe invoice, reply with thanks" and you are running locally, expect to retry more. The harness is forgiving; the model is where the wins or losses are.
Tools in the local-LLM-only category
Ollama, LM Studio, MLX, and llama.cpp are excellent at one job: running a model and exposing tokens through a stable API. They are not agents, and they do not pretend to be. Pair them with a harness (Fazm or otherwise) and you have a complete stack.
One concrete task, on each side
Same prompt: open Mail, find the latest Stripe invoice, reply with thanks.
Local LLM only
You paste the prompt. Ollama emits text suggesting how a person could do this. No app opens. No invoice gets read. You copy the suggestion, switch windows yourself, click Mail, find the message, type the reply. The model was a notepad.
Fazm
Voice or chat in. The model emits a tool call: mcp__macos-use__open_application_and_traverse. The macos-use subprocess opens Mail, walks the AX tree, returns text. The bridge filter forwards the text. The model picks the Stripe row by kAXTitle, clicks Reply, types thanks. Mail really replies.
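For concreteness, here is roughly what that first step looks like as an Anthropic-style tool_use content block. Only the tool name comes from above; the input fields are an assumption, not Fazm's schema:

```typescript
// One assistant content block in Anthropic tool_use shape. The bridge routes it to the
// macos-use subprocess by prefix; the input fields are illustrative placeholders.
const toolUse = {
  type: "tool_use",
  id: "toolu_example_only",
  name: "mcp__macos-use__open_application_and_traverse",
  input: {
    appName: "Mail", // hypothetical parameter name
  },
} as const;
```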
When you actually want each thing
Pick a local LLM only when your job is local chat, local code completion in your editor, batch summarisation of files you own, or anything where the output is text and a human reads it next. The simpler the dependency, the better. Ollama plus a 7B model on Apple Silicon will do this for free, forever, with no agent layer in sight.
Pick a local AI agent (Fazm or similar) when the job is to do something on your Mac: open apps, fill forms, draft replies in your own voice, update a spreadsheet, follow up on yesterday's emails, schedule a meeting from a thread. The friction of doing those by hand is the entire reason agents exist, and a model on its own cannot do any of them.
Use both by pointing the agent harness at a local model. You keep the inference on your machine and you keep the action loop. The reliability ceiling becomes the model's tool-calling quality, which is the part of "local AI" that has improved fastest in the last twelve months.
Watch the harness run on your own Mac
20-minute demo. We open Mail, Slack, and a Google Sheet from a single prompt, walk through the macos-use traversals streaming through the bridge, and answer the model question last. Bring your local LLM stack if you have one.
Frequently asked questions
Short version: what is the difference between a local LLM and a local AI agent on macOS?
A local LLM is the inference engine. Ollama, LM Studio, MLX, llama.cpp, all of them take a prompt and emit tokens, and that is the entire job. A local AI agent is the harness that turns those tokens into actions on your Mac. The harness is the part that parses tool calls, dispatches them to subprocesses, reads your apps through accessibility APIs, types into text fields, clicks buttons, and renders progress in a UI. On Fazm specifically, the harness is the Swift macOS app plus an acp-bridge Node subprocess that registers five built-in MCP servers (fazm_tools, playwright, macos-use, whatsapp, google-workspace) and routes every tool call through them. The model is swappable. The harness is the product.
Can I just point Ollama at my files and call it an agent?
Not in any practical sense. Ollama gives you an HTTP API at localhost:11434 that takes a prompt and returns text. It does not click anything, it does not see your apps, it does not know whether the click landed. To turn that into an agent on macOS you need three new pieces: a tool-calling loop that parses the model's output, a set of executable tools (open Safari, type into Mail, run a shell command), and a permission layer that has been granted Accessibility access in System Settings. Fazm ships all three already wired together. If you want to build it yourself, the minimum viable stack is roughly: Ollama or MLX for inference, MCP (Anthropic's Model Context Protocol) for tools, a Swift or Node host that has Accessibility entitlements, and a UI to drive it.
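To make "tool-calling loop" concrete, a minimal sketch against Ollama's /api/chat endpoint, which accepts tool definitions on tool-capable models. The open_app tool and its arguments are stand-ins, not a real tool:

```typescript
// Minimal single-turn tool-call check against a local Ollama instance.
// Assumes a tool-capable model is pulled locally; the "open_app" tool is hypothetical.
const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen3:32b",
    stream: false,
    messages: [{ role: "user", content: "Open Mail and find the latest Stripe invoice." }],
    tools: [
      {
        type: "function",
        function: {
          name: "open_app",
          description: "Open a macOS application by name",
          parameters: {
            type: "object",
            properties: { name: { type: "string" } },
            required: ["name"],
          },
        },
      },
    ],
  }),
});

const data = await res.json();
// If the model chose to call a tool, Ollama returns it under message.tool_calls.
// Executing it, checking the result, and feeding it back is the part the harness owns.
console.log(data.message?.tool_calls ?? data.message?.content);
```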
What does the harness actually do that the LLM cannot?
Five things, all visible in the Fazm source tree. (1) It registers MCP servers on startup, see acp-bridge/src/index.ts around line 1280 where playwright, macos-use, whatsapp, and google-workspace get spawned. (2) It dispatches tool calls to the right server based on the tool name prefix. (3) It filters tool results into model context, dropping image bytes and forwarding text only at lines 2271-2307, so a 500 KB screenshot does not blow up the context window. (4) It runs a live accessibility probe at AppState.swift:439 to verify the AX API is actually working before any click goes out. (5) It manages voice input, transcription, and the chat UI. None of that is the LLM's job.
Why use accessibility APIs instead of just feeding the model screenshots?
A 1920x1200 screenshot is roughly 350K image tokens for the modern Anthropic tokenizer. A macOS accessibility tree for a single Mail window is typically 500 to 2000 tokens. Ten turns on screenshots costs about 3.5 million tokens of pixels before the model thinks; ten turns on AX trees costs about five thousand. Beyond cost, the AX tree gives you named fields (kAXRole, kAXTitle, kAXValue, kAXPosition, kAXSize) instead of pixel coordinates that go stale the moment a window moves. macos-use, the native binary registered at acp-bridge/src/index.ts:1287, is what makes that path possible for any AX-compliant Mac app, not just a browser tab.
If the harness matters more than the model, which model should I run locally?
Honest answer: the harness works with any Anthropic-compatible endpoint, so you have options. Out of the box Fazm routes through Claude (via the Claude Agent SDK as a Node subprocess), and that is the highest reliability option for tool calling today. If you want fully local, point the bridge at a local proxy that exposes an Anthropic-compatible API on top of Ollama or MLX (a few open-source bridges exist). Reality check: open-weights models in 2026 are still meaningfully behind Claude on multi-step tool calling, especially for nested arguments and JSON repair. A local agent that calls tools poorly is worse than a cloud agent that calls them well.
Where can I read the harness code myself?
Fazm is MIT-licensed on GitHub at github.com/m13v/fazm. Key anchors: acp-bridge/src/index.ts holds the bridge that spawns MCP servers and routes tool calls (built-in server set declared at line 1496; macos-use registration at line 1287; tool-result filter at lines 2271-2307). Desktop/Sources/MCPServerManager.swift loads ~/.fazm/mcp-servers.json so you can plug in your own servers next to the built-ins. Desktop/Sources/AppState.swift around line 439 holds the live AX permission probe.
What about Ollama, LM Studio, MLX, llama.cpp on their own? Are they useless?
Not at all. They are excellent at the one job they have: running an LLM on your hardware and exposing tokens through a stable API. If your need is local chat or local code completion, that is the entire stack. The mismatch is when people try to make the model itself the agent. A model is a function from string to string. An agent is a process with state, tools, permissions, retries, error handling, and a UI. You need both, but they are not interchangeable.
Is Fazm a local LLM, a local AI agent, or both?
Local AI agent. Fazm does not ship inference weights. It ships the macOS app that drives the agent loop, and the bridge that calls out to whichever LLM endpoint you configure. By default that is Anthropic Claude via the Claude Agent SDK; you can route through a corporate proxy, GitHub Copilot, or any Anthropic-compatible gateway, including a local one running on top of Ollama or MLX. That gives you the harness benefits (AX-tree access, MCP servers, voice, UI) with full choice on where the tokens come from.
What does the inside of the Fazm harness look like, in one sentence?
A Swift macOS app launches a Node subprocess (acp-bridge), the bridge spawns the Claude Agent SDK as another subprocess plus five MCP server subprocesses (fazm_tools, playwright, macos-use, whatsapp, google-workspace), every model tool call is dispatched to one of those servers, every tool result is filtered through the text-only branch at lines 2271-2307 of the bridge, and the Swift app renders the conversation while AppState.swift:439 keeps probing that the AX API is still working.
Related guides
Local AI agent: what it is and how to run one
Plain explainer of what a local AI agent does on a Mac, and how it differs from cloud chatbots and browser-only agents.
Accessibility tree vs screenshots
The exact code path in Fazm's bridge that drops screenshot bytes from every tool response, and the math on what that saves.
Local AI desktop agent: the Fazm observer loop
Fazm watches a rolling 60-minute screen video, asks Gemini for the next task, and runs it through Claude Agent SDK locally.