Ollama is the model half of a local AI agent. Here are the two halves it does not ship.
Every tutorial on the first page of Google teaches you how to `ollama pull` a GGUF, launch `ollama serve`, and point an OpenAI-compatible client at localhost:11434. Nobody explains what turns that into an agent that actually clicks Send in Mail, files a receipt in Finder, or answers a Slack thread. This guide walks through the two layers Ollama deliberately stops short of, and shows why the accessibility-tree approach beats a screenshot VLM for any model, local or cloud.
The command that does nothing
Open a terminal and run `ollama run llama3.1 'reply thanks to the top mail'`. The model dutifully explains, step by step, how it would do it. The characters stream into your terminal. Mail does not open. Nothing in your inbox moves. There was never any mechanism for it to.
That gap is not a bug in Ollama. It is a scope decision. Ollama is a model-serving runtime, one of the best-engineered in the ecosystem. It hit 52 million monthly downloads in Q1 2026, hosts GGUF builds of Llama 3.1, DeepSeek-R1, Qwen 3, Kimi-K2.5, gpt-oss, and Gemma 3, and speaks an OpenAI-compatible REST API on port 11434. What it does not do, and was never trying to do, is tell the model what is on your screen or translate the model's words into a real mouse click. Those are the two layers Ollama-on-Mac tutorials skip.
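To make the scope concrete, here is a minimal sketch of what a client sends to that REST API once tool use enters the picture. The endpoint path and payload shape follow Ollama's OpenAI-compatible chat-completions API; the model tag, system prompt, and tool parameters are illustrative, and the request is only constructed here, not sent, so no server is needed.

```python
import json

def build_request(goal: str) -> dict:
    # Payload for POST http://localhost:11434/v1/chat/completions.
    return {
        "model": "llama3.1",  # any tool-capable local model tag
        "messages": [
            {"role": "system", "content": "You are a Mac agent. Use tools."},
            {"role": "user", "content": goal},
        ],
        # Tool schemas are what turn a chat model into an agent driver.
        # This one mirrors the click tool's shape; the fields are guesses.
        "tools": [{
            "type": "function",
            "function": {
                "name": "macos-use_click_and_traverse",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "role": {"type": "string"},
                        "title": {"type": "string"},
                    },
                    "required": ["role", "title"],
                },
            },
        }],
    }

payload = build_request("reply thanks to the top mail")
print(json.dumps(payload, indent=2))
```

Everything below this request, serving the weights and emitting the tool call, is Ollama's job. Everything the tool call is supposed to *do* is not.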
The anatomy of a local Mac agent
Four layers. Ollama owns one.
Every real local-AI agent on a Mac is at least four things wired in series. People conflate the first layer with the whole stack, which is why the question "can I make an agent with Ollama?" sounds simpler than it is.
Layer 1 – Model runtime
Downloads the weights, loads them into unified memory, serves tokens. This is exactly where Ollama lives: `ollama pull llama3.1`, `ollama serve`, OpenAI-compatible REST on :11434. 135,000 GGUF-formatted models on HuggingFace plug in here. Ollama owns this layer.
Layer 2 – Perception
Turns 'what is on the screen' into tokens the model can reason about. Two real options on a Mac: ship a 4K base64 screenshot and hope the model has a vision head, or walk the macOS accessibility tree and ship a few KB of structured text. Fazm ships the latter as mcp-server-macos-use (a 21 MB ARM64 Mach-O inside the app bundle).
Layer 3 – Action
Turns the model's tool call back into a real OS event. On macOS that is CGEventCreateMouseEvent, CGEventKeyboardSetUnicodeString, and AXUIElement Press actions. Fazm collapses observe+act into single `_and_traverse` tool calls so the next step already has fresh ground truth.
Layer 4 – Planner / memory
The outer loop: session state, tool-use policy, fallbacks when a tree dump is empty. Fazm's ACP bridge at acp-bridge/src/index.ts is one concrete implementation. Any Ollama-driven copy of this stack re-implements this layer around a local model.
One perception server. Any local model. Any Mac app.
The bundled accessibility-tree server is model-agnostic. It speaks MCP over stdio and returns structured UTF-8. Swap the client and you can drive it from Llama 3.1, DeepSeek-R1, Qwen 3, or Kimi-K2.5 just as easily as from Claude.
The 21 MB Mach-O that does the seeing
Install Fazm and you can find the perception layer on disk at Fazm.app/Contents/MacOS/mcp-server-macos-use. It is a 21 MB ARM64 Mach-O, a Swift binary that registers as an MCP server over stdio and exposes exactly six tools:
- macos-use_open_application_and_traverse
- macos-use_click_and_traverse
- macos-use_type_and_traverse
- macos-use_press_key_and_traverse
- macos-use_scroll_and_traverse
- macos-use_refresh_traversal
Under the hood each tool calls AXUIElementCreateApplication(pid) and AXUIElementCopyAttributeValue(kAXFocusedWindowAttribute), the same primitives VoiceOver uses. The ACP bridge registers the binary at acp-bridge/src/index.ts:1056-1064 whenever it exists on disk. Nothing about that registration is gated to Claude. The server does not know or care which model is on the other end of the MCP pipe.
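The server emits one text line per element. Using the line shape quoted later in this article (`[AXButton] "Send" x:842 y:712 w:68 h:32 visible`), a client-side parser is a few lines of regex; the real dump may carry more fields, so treat this as a sketch of the format, not its specification.

```python
import re

# One element per line: role in brackets, quoted title, frame, visibility.
LINE = re.compile(
    r'\[(?P<role>\w+)\]\s+"(?P<title>[^"]*)"\s+'
    r'x:(?P<x>\d+)\s+y:(?P<y>\d+)\s+w:(?P<w>\d+)\s+h:(?P<h>\d+)'
)

def parse_line(line: str):
    m = LINE.search(line)
    if not m:
        return None
    d = m.groupdict()
    return {
        "role": d["role"],
        "title": d["title"],
        # A click wants a point, so precompute the element's center.
        "cx": int(d["x"]) + int(d["w"]) // 2,
        "cy": int(d["y"]) + int(d["h"]) // 2,
    }

el = parse_line('[AXButton] "Send" x:842 y:712 w:68 h:32 visible')
print(el)
```

This is the whole point of the format: it is grep-able text, so the "vision" problem reduces to string matching.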
What a single observation step weighs
These numbers come from a real traversal of a Fazm Dev window. Same screen, two ways of feeding it to a model.
The same screen, captured as a 4K base64 screenshot, weighs roughly 350K input tokens. The AX tree dump of that screen is closer to 10K. That ratio is the reason a 7B Ollama model can run this loop at all.
Why screenshots are a worse fit for Ollama than for Claude
Frontier cloud models can tolerate a clumsy perception layer. Anthropic's reference computer-use loop sends a full-resolution screenshot on every observation step, and it mostly works, because a 200B+ parameter model has enough capacity to find a 42x32 pixel Send button in a 4K JPEG and because Anthropic happily prefills that image for you.
A local Ollama model does not have either of those luxuries. Prefilling 350K tokens of base64 image data on a Mac with a 32 GB unified-memory budget is the difference between an interactive agent and a background task. And a 7B model, even a 32B model, is usually not good enough to pick the right button out of raw pixels. The accessibility tree is the cheat code: the OS has already labelled every clickable element with a role and a title, and AX gives you a C API to read them.
That is what makes the tree path interesting specifically for local AI. It is not just a Claude optimization. It is the architecture that makes smaller models viable as Mac agents.
Screenshot VLM vs accessibility tree
Both approaches can theoretically work with an Ollama-served model. Only one is practical today.
| Feature | Screenshot VLM | AX tree |
|---|---|---|
| Token cost per observation | 350K tokens (base64 4K screenshot) | ~10K tokens (structured AX tree) |
| Round trips per step | 2 (act, then screenshot) | 1 (every tool returns post-action tree) |
| Works on a 7B local model | No — context window blows up | Yes — fits inside 32K context |
| Element lookup | Model has to infer coords from pixels | Substring match on role + title |
| Hardware requirement | VLM head + enough VRAM for image tokens | Any text-only instruct-tuned model |
| Real file on disk after the call | screenshot.png (opaque to grep) | /tmp/macos-use/*.txt (grep-able, diff-able) |
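The "element lookup" row is worth making concrete. With the tree, finding the Send button is a substring match over role and title; no model inference involved. A minimal sketch, with illustrative tree lines:

```python
def find_element(tree_lines, role, title):
    """Case-insensitive lookup: match on bracketed role plus title substring."""
    needle = title.lower()
    for line in tree_lines:
        if line.startswith(f"[{role}]") and needle in line.lower():
            return line
    return None

tree = [
    '[AXTextArea] "Message body" x:20 y:120 w:800 h:500 visible',
    '[AXButton] "Send" x:842 y:712 w:68 h:32 visible',
]
print(find_element(tree, "AXButton", "send"))
```

A screenshot VLM has to do the equivalent of this function with attention heads over pixels. A tree client does it with `str.lower()`.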
A single reply, step by step
Step 0 – User types a goal
What the Ollama side of this would look like
Fazm today is Claude-backed. The part you care about, the perception and action layers, is not. Llama 3.1, DeepSeek-R1, Qwen 3, and Kimi-K2.5 all support structured tool calls when served through Ollama. Wire the inference loop to speak MCP over stdio to the bundled Mach-O and you have an Ollama-driven Mac agent that uses the same tree-first pattern.
The key detail is that every _and_traverse tool response contains both the action result and the freshly re-walked tree. That is what prevents an Ollama loop from paying a second round trip per step just to see what happened.
How the shipped product actually routes
The system prompt at Desktop/Sources/Chat/ChatPrompts.swift lines 56-61 hard-codes the routing rule: desktop apps go to the accessibility tree, web pages inside Chrome go to Playwright, and the screenshot fallback is explicitly the last resort, not the default.
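That routing rule is small enough to state as code. The server names below are the ones this article describes; the dispatch function itself is a toy reconstruction of the rule, not the shipped Swift.

```python
def route(target_kind: str) -> str:
    """Pick a perception backend for a target, tree-first."""
    if target_kind == "chrome_page":
        return "playwright"          # web pages go through Playwright
    if target_kind == "desktop_app":
        return "macos-use"           # desktop apps go to the AX tree
    return "capture_screenshot"      # explicit last resort, never the default

for kind in ("desktop_app", "chrome_page", "metal_game"):
    print(kind, "->", route(kind))
```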
How to think about the stack
If you are evaluating Ollama for a Mac agent project, the question is not "which model should I pull", it is "which of these four layers am I actually building, and which am I buying".
1. Model runtime
Use Ollama. Serves GGUF, has an OpenAI-compatible API, supports tool calls on every modern model. You do not build this.
2. Perception layer
On a Mac this should be an AX-tree server, not a screenshot VLM. Either use mcp-server-macos-use from the Fazm bundle or re-implement the AXUIElement walk yourself. Either way, speak MCP to it so you can swap clients later.
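If you re-implement the walk yourself, the MCP side is a JSON-RPC dispatcher over stdio. The sketch below is drastically simplified: real MCP has an initialize handshake and capability negotiation, and a real server would walk AXUIElement where the stub returns a canned tree line. It shows only the two methods an agent loop actually exercises, tools/list and tools/call.

```python
import json

TOOLS = [{"name": "macos-use_refresh_traversal",
          "description": "Re-walk the focused window's AX tree"}]

def handle(message: dict) -> dict:
    """Dispatch one JSON-RPC request; a real server reads these off stdin."""
    method = message.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        # Placeholder for the AXUIElement walk; returns a stub tree line.
        result = {"content": [{"type": "text",
                               "text": '[AXWindow] "Mail" x:0 y:0 w:1440 h:900 visible'}]}
    else:
        return {"jsonrpc": "2.0", "id": message.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": message.get("id"), "result": result}

resp = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
print(json.dumps(resp))
```

Keeping the transport this dumb is what makes the client swappable: Claude today, an Ollama loop tomorrow, same pipe.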
3. Action layer
CGEventCreateMouseEvent, CGEventKeyboardSetUnicodeString, and the AXPress action. The same Mach-O above already bundles all three behind the _and_traverse tools; if you roll your own, fold the action and the post-action tree walk into a single tool response.
4. Planner / memory
Your inference loop. Session state, retry policy, fallback to screenshot when the tree is empty, model switching between haiku-class and opus-class models when the task gets harder. This is where most of the app-specific logic lives.
Vocabulary you actually need
The one thing the SERP misses
Ollama solves model serving. It does not solve Mac agents.
Every Ollama tutorial published in 2026 talks about pulls, quantization, context length, and $0 inference. All of that is true. It is also, on its own, not enough to close a single Mail reply on your Mac.
The perception and action layers are where the product work lives. Fazm ships one real implementation of both, wired to Claude today and structurally ready for any local model you point at it tomorrow. The 21 MB Mach-O inside the app bundle is the part of the stack most worth copying if you want to build an Ollama-first version yourself.
Thinking about an Ollama-driven Mac agent?
Fifteen minutes with the team that already shipped the tree-first perception and action layers. We will walk the exact binary, the tool schemas, and where Ollama would plug in.
Book a call →
Ollama local AI on macOS, straight answers
Does Fazm use Ollama or local models today?
No. Fazm is a consumer Mac app that ships with Claude Sonnet 4.6 as the default model, hardcoded as DEFAULT_MODEL at acp-bridge/src/index.ts line 1245. The interesting thing for someone researching Ollama is that the part of Fazm that reads and drives the Mac is completely model-agnostic. The mcp-server-macos-use binary speaks MCP over stdio and returns plain UTF-8 text; any model that can consume tool responses, including a local Ollama-served model, could be wired to the same perception and action layer. Fazm uses Claude because it is fast and already trained on tool use, not because the tree mechanism depends on it.
Why can't Ollama drive my Mac by itself?
Ollama is a model-serving runtime. `ollama run llama3.1` gives you a REPL that returns text. It has no code path that reads your screen, no code path that synthesizes a mouse click, and no code path that types into the focused app. Turning a local model into an agent requires at least two more layers wrapped around it: a perception layer (something that reports what is on the screen as tokens the model can reason about) and an action layer (something that translates a model's text output back into CGEvent, keyboard, and shell calls). Ollama's scope deliberately stops before those layers begin.
Why is the accessibility tree better than a screenshot VLM for a local model?
Three reasons, all worse on small local models than on frontier cloud models. First, size: a base64-encoded 4K Retina screenshot is typically 500 KB to 5 MB of text in a tool response, versus a few kilobytes for a full AX tree dump. For a 7B local model with an 8K or 32K context window, that difference is the gap between one step and a hundred. Second, precision: the tree already has element roles, titles, and CGFloat frames. The model substring-searches for the word 'Send' and gets back `[AXButton] "Send" x:842 y:712 w:68 h:32 visible`. A VLM has to infer the same thing from pixels. Third, speed: a traversal of a real Fazm Dev window takes about 0.72 seconds and returns 441 elements. Encoding, shipping, and prefilling a 4K screenshot on a consumer GPU is slower.
What is in the 21 MB ARM64 Mach-O at Fazm.app/Contents/MacOS/mcp-server-macos-use?
A Swift binary that implements an MCP server over stdio and exposes six tools: macos-use_open_application_and_traverse, macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse, macos-use_scroll_and_traverse, and macos-use_refresh_traversal. Under the hood it calls AXUIElementCreateApplication(pid) and AXUIElementCopyAttributeValue(kAXFocusedWindowAttribute), walks the tree depth-first, and emits one text line per element with role, title, x, y, width, height, and a visibility flag. The ACP bridge registers it at acp-bridge/src/index.ts lines 1056-1064 whenever the binary exists on disk. It is not gated to Claude, it is gated to whatever model the bridge is currently talking to.
Could I build my own Ollama-driven Mac agent with this approach?
Technically yes. The pattern is: run your Ollama model in the tool-use mode supported since Llama 3.1 and DeepSeek-R1, define the same six _and_traverse tool schemas, and proxy tool calls to an MCP server that wraps AXUIElement. Every one of those tools should return not just the action result but the post-action accessibility tree in the same tool response, because that is what collapses observe-act-observe into a single round trip. Fazm ships the server half of this stack as the bundled binary. The client half (the inference loop that speaks to Ollama instead of Anthropic) is the part you would write.
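Defining "the same six tool schemas" for Ollama's tools parameter looks like this. The six names are the real ones listed earlier in this article; the parameter bodies are plausible guesses, not the shipped schemas.

```python
def tool(name, props, required):
    """Wrap a name and parameter set in OpenAI-style function-tool JSON."""
    return {"type": "function",
            "function": {"name": name,
                         "parameters": {"type": "object",
                                        "properties": props,
                                        "required": required}}}

S = {"type": "string"}
TOOLS = [
    tool("macos-use_open_application_and_traverse", {"name": S}, ["name"]),
    tool("macos-use_click_and_traverse", {"role": S, "title": S}, ["role", "title"]),
    tool("macos-use_type_and_traverse", {"text": S}, ["text"]),
    tool("macos-use_press_key_and_traverse", {"key": S}, ["key"]),
    tool("macos-use_scroll_and_traverse", {"direction": S}, ["direction"]),
    tool("macos-use_refresh_traversal", {}, []),
]
```

Pass `TOOLS` in the request body and route each returned tool call to the MCP server; the tree comes back inside the same response, so the loop stays at one round trip per step.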
What is the actual speed penalty for the screenshot-first approach on a local model?
Most of it shows up in the prefill. A 500 KB base64 screenshot at ~1.3 bytes per token is on the order of 350K input tokens for a single observation step. On Anthropic's API that is expensive but tolerable. On a 7B Ollama model running on a Mac with 32 GB of unified memory, prefilling that many tokens is the difference between an interactive agent and a background task. The accessibility tree dump is closer to 10K tokens for the same screen, which is the range these models actually feel fast in.
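The back-of-envelope version of those numbers, with the ~1.3 bytes-per-token figure from above treated as an approximation rather than a tokenizer measurement:

```python
def est_tokens(n_bytes: int, bytes_per_token: float = 1.3) -> int:
    """Rough input-token estimate for raw text pasted into a prompt."""
    return int(n_bytes / bytes_per_token)

screenshot = est_tokens(500_000)  # 500 KB base64 image -> ~385K tokens
tree = est_tokens(13_000)         # ~13 KB AX tree dump  -> ~10K tokens
print(screenshot, tree, screenshot // tree)
```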
Does the accessibility tree work on every Mac app?
On most of them. Finder, Mail, Calendar, Messages, Safari, Slack, Notes, Reminders, and the System Settings panel expose a full tree via AX. Catalyst apps (including the Mac WhatsApp client) also work, which is why Fazm bundles a separate whatsapp MCP that goes deeper into that specific app. The edge cases are apps that render directly to a Metal or OpenGL canvas and skip AX entirely, usually games and a few Qt or SDL ports. For those apps any local-AI-plus-Mac agent has to fall back to the screenshot path, which Fazm does via a separate capture_screenshot tool.
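The fallback described above is a two-line policy: try the tree first, drop to pixels only when the tree comes back empty. The function names here are hypothetical stand-ins for the traversal and screenshot tools.

```python
def observe(traverse, screenshot):
    """Tree-first observation with a screenshot as the explicit last resort."""
    tree = traverse()
    if tree.strip():
        return ("tree", tree)
    # Empty tree usually means a Metal/OpenGL canvas with no AX elements.
    return ("screenshot", screenshot())

kind, _ = observe(lambda: "", lambda: b"\x89PNG")
print(kind)
```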
How does this compare to running a pure VLM locally, like LLaVA or Qwen2-VL?
You can do that, and projects like the Hugging Face Open Computer Agent run Qwen2-VL-72B for exactly this reason. The tradeoff is that on a consumer Mac, a 72B VLM is too big and a small VLM is not accurate enough. A 7B to 32B text-only Ollama model paired with an AX tree dump tends to outperform a similarly-sized VLM paired with a raw screenshot, because the tree has already done the vision work. The tree approach essentially offloads 'where is the Send button' from the neural net into a 21 MB Mach-O that walks a C API in under a second.
Is there a real product that demonstrates this architecture on Claude today?
Yes. Fazm. The desktop binary registers macos-use, playwright, whatsapp, google-workspace, and fazm_tools as builtin MCP servers (BUILTIN_MCP_NAMES at acp-bridge/src/index.ts line 1266). The system prompt at Desktop/Sources/Chat/ChatPrompts.swift line 56 explicitly forbids browser_take_screenshot for desktop apps and routes them through macos-use instead. The model is Claude, not Ollama, but the perception and action layers are the same ones a local-first implementation would need, and they are open to inspection on disk after any install.
Where does this leave Ollama on Mac, strategically?
Ollama in 2026 is the clear winner of the local-model-runtime layer; it hit 52 million monthly downloads in Q1 and HuggingFace hosts 135K GGUF-formatted models aimed at it. That layer is solved. The open frontier is not 'which local model runs fastest', it is 'what does the model look at, and what does the model control'. The answer that works on a Mac is the accessibility tree for perception and CGEvent-plus-AXPress for action. The answer that works everywhere a browser runs is Playwright. Ollama stays where it is strongest: model serving, prompt templating, and OpenAI-compatible APIs. Everything above it is the agent framework, and that is the layer most Ollama-on-Mac tutorials skip entirely.
Keep reading
Related guides
Claude Computer Use Agent: the tool-schema swap that runs on a real Mac
Anthropic's reference 'computer' tool takes screenshots. Fazm swaps it for six _and_traverse MCP tools that fold observation into the action response.
Building Local AI Agents on macOS: Accessibility APIs, Security, and Practical Setup
Why local-first agent execution matters, how accessibility APIs outperform screenshots, and how to wire the two layers together.
macOS Accessibility API Agent Speed: what a 50x speedup actually looks like
Measurements of a tree-first agent vs a screenshot-first agent on the same workflow. The gap is mostly prefill time, not model time.