Field notes from one shipping macOS agent
Local MLX model for desktop loops: the one settings field that wires it in
MLX gives you fast token generation on Apple Silicon. A desktop loop gives you the right thing to ask the model on each turn. Most posts stop at the first half. This one is about the seam where they meet, using a real shipping open source loop on macOS as the reference.
Direct answer (verified 2026-05-11)
Run your MLX model behind an Anthropic Messages-compatible HTTP server on localhost (claude-code-mlx-proxy, vllm-mlx, mlx-omni-server with a translation shim, claude-code-local, or LM Studio 0.4 with its /v1/messages endpoint). Then point a desktop computer-use agent loop at that URL. In fazm specifically that is one field: Settings > AI Chat > Custom API Endpoint = http://127.0.0.1:8888. Save, the bridge restarts, and the same 20 native tools and the same accessibility-tree screen state are now driven by your local MLX model. The seam is customApiEndpoint at SettingsPage.swift line 885, exported as ANTHROPIC_BASE_URL at ACPBridge.swift line 469.
The seam, in three lines of code
The whole reason this works is that the agent loop and the model server are separate processes joined by an HTTP URL. Most desktop AI products tightly couple the two. Fazm does not. The agent loop is the open source claude-agent-acp 0.29.2 spawned as a Node child process; the model behind it is whatever URL ANTHROPIC_BASE_URL resolves to when that child process starts. Swap the URL, you swap the model. The 20 tools, the cron scheduler, the screen-state representation, the conversation store, the permission system: all unchanged.
The Settings UI side is one stored value:
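A sketch of its shape. The key, the @AppStorage wrapper, and the file location are from the source; the empty-string default and the view around it are assumed:

// SettingsPage.swift, line 885 (sketch; surrounding settings view omitted)
// Stored under the key the bridge reads back on spawn.
@AppStorage("customApiEndpoint") private var customApiEndpoint: String = ""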
The bridge-spawn side reads it once and exports it as an env var to the child process before exec:
// Desktop/Sources/Chat/ACPBridge.swift, lines 467-470
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
!customEndpoint.isEmpty {
env["ANTHROPIC_BASE_URL"] = customEndpoint
}

That is the entire wiring. Anything that speaks the Anthropic Messages API on the other end of that URL works. The bridge has no opinion about whether the tokens are coming from api.anthropic.com, from a corporate gateway, from a GitHub Copilot proxy, or from a claude-code-mlx-proxy serving Qwen2.5-Coder-32B-Instruct in 4-bit MLX on the same Mac.
What goes on the other end of that URL
Fazm's bridge does not run MLX itself. What goes on the other end of the URL is a small HTTP server that accepts Anthropic Messages requests, translates them to whatever tool format the underlying MLX model was trained on, runs inference via mlx-lm, and translates the response back. The active community has settled on a handful of these. The right one depends on your model and how strict you are about API shape.
Anthropic-compatible MLX bridges that drop into the customApiEndpoint field
claude-code-mlx-proxy
MLX-backed Anthropic Messages server, designed for Claude Code. Defaults to http://localhost:8888.
vllm-mlx
Native MLX vLLM-style server with continuous batching, MCP tool calling, vision-language support. /v1/messages endpoint.
claude-code-local
MLX-native Anthropic-API server tuned for offline / NDA workflows. Ships Qwen 3.5 122B, Llama 3.3 70B, Gemma 4 31B presets.
mlx-serve
Native Zig server, no Python. Exposes both OpenAI-compatible and Anthropic-compatible HTTP. Ships with a menu bar app.
LM Studio 0.4+
Desktop app, runs MLX models locally. Exposes a /v1/messages endpoint that any tool built for the Anthropic API can talk to with just a base URL change.
mlx-omni-server
Apple Silicon inference server, OpenAI-shaped. Put a thin Anthropic-shaped translation layer in front if you want fazm to talk to it directly.
claude-code-mlx-proxy and claude-code-local are the most opinionated (designed for Claude Code, ship presets). vllm-mlx is the most general (continuous batching, vision-language models). LM Studio is the most polished if you want a GUI to manage downloads. mlx-serve is the right pick if you want zero Python in the path. mlx-omni-server is OpenAI-shaped, so for fazm you need a thin translator in front that accepts /v1/messages and forwards to /v1/chat/completions. Read the README of whichever you pick before downloading a model. Each one has a recommended set of tool-trained MLX checkpoints under huggingface.co/mlx-community.
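A sketch of the request-side mapping such a translator performs, in Swift for illustration: text-only, tool blocks omitted, and not any listed bridge's actual code.

// Illustrative sketch: map an Anthropic Messages request body to an
// OpenAI chat-completions body. Tool_use/tool_result mapping omitted.
func openAIBody(fromAnthropic body: [String: Any]) -> [String: Any] {
    var messages: [[String: Any]] = []
    // Anthropic carries the system prompt as a top-level field;
    // OpenAI-shaped servers expect it as the first message.
    if let system = body["system"] as? String {
        messages.append(["role": "system", "content": system])
    }
    for msg in body["messages"] as? [[String: Any]] ?? [] {
        let role = msg["role"] as? String ?? "user"
        if let text = msg["content"] as? String {
            // Plain-string content maps one to one.
            messages.append(["role": role, "content": text])
        } else if let blocks = msg["content"] as? [[String: Any]] {
            // Typed block arrays: keep the text blocks here; tool blocks
            // need their own mapping.
            let text = blocks.compactMap { $0["text"] as? String }
                .joined(separator: "\n")
            messages.append(["role": role, "content": text])
        }
    }
    return [
        "model": body["model"] ?? "local-mlx",
        "max_tokens": body["max_tokens"] ?? 1024,
        "messages": messages,
    ]
}

The response direction is the mirror image, OpenAI choices folded back into Anthropic content blocks, and the tool-call translation on top of that is the part you are really choosing a bridge for.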
The data flow when an MLX model drives a fazm session
Inputs and outputs around the local MLX bridge
Notice what is on each side. On the left, every input the loop collects on its own (transcribed voice, current accessibility tree, tool schemas, MCP server registrations) is assembled by the loop before the model ever sees a token. On the right, the model writes tool-use blocks that the loop dispatches with hard wall-clock timeouts. The model is one node in the middle. The loop is the shape around it. Swapping cloud Claude for a local MLX model changes one node, not the shape.
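Concretely, all of those left-side inputs land in a single Anthropic Messages request per turn. A trimmed sketch of the shape; the framing is the public Messages format, the values are placeholders, not fazm's actual prompts:

{
  "model": "local-mlx",
  "max_tokens": 1024,
  "system": "<agent instructions assembled by the loop>",
  "tools": [
    {
      "name": "capture_screenshot",
      "description": "<one of the 20 native tool schemas>",
      "input_schema": { "type": "object", "properties": {} }
    }
  ],
  "messages": [
    { "role": "user", "content": "<transcribed voice + serialized accessibility tree>" }
  ]
}

The reply's tool_use blocks are what the loop dispatches under its wall-clock timeouts, and the tool_result blocks it appends on the next turn are the model's only feedback channel.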
Why a 13B class MLX model can actually drive desktop work
The single design choice that makes a local MLX model viable on a consumer Mac is what the loop sends as the screen. Most computer-use agents send a screenshot every turn. A typical macOS window vision-tokenizes to roughly 1500-3000 input tokens. Multiply that by the 8-15 turns it takes to finish a real workflow and you have spent 12,000 to 45,000 input tokens, more compute on re-tokenizing pixels than on actually thinking. On a 13B 4-bit MLX model with a Mac mini class chip, that path puts you in two-digit minutes per task.
The accessibility-tree path is roughly 200-400 input tokens for the same window: each element is a role string, optional text, and a four-number bounding box. The model never gets a screenshot unless the loop deliberately calls capture_screenshot. That is the entire reason you can stand up an MLX-driven desktop loop on a 16GB Mac and expect it to finish a task before you lose patience.
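For intuition, one window's serialized entries look something like the lines below. The exact rendering is the macos-use server's business and this one is illustrative; the three ingredients per element (role, optional text, bounding box) are the ones named above:

AXButton     "Send"       (712, 830, 64, 28)
AXTextArea   ""           (120, 400, 560, 300)
AXStaticText "To: Dana"   (120, 360, 180, 20)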
Per-turn input cost on the same window
Screenshot path: capture the visible window, encode as PNG, vision-tokenize. The model gets pixels.
- 1500 to 3000 input tokens per turn
- Prefill cost dominates wall-clock latency on 13B-class MLX
- Vision tokens cannot reference elements by ID; clicks are coordinate guesses
- Re-paid every turn even when only one element changed
Accessibility-tree path: serialize each visible element's role, text, and bounding box. The model gets structure.
- 200 to 400 input tokens per turn
- Prefill stays short enough for minutes per turn, not tens of minutes, on the same hardware
- Elements carry roles and bounding boxes, so clicks land on real coordinates
- Cheap enough to re-send in full every turn
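Back-of-envelope arithmetic on why those token counts decide the felt latency. The prefill rate below is an assumption for illustration, not a benchmark; substitute your own measurement:

// Prefill dominates on local MLX, so the floor on time-to-first-tool-call
// each turn is roughly inputTokens / prefillRate. The 250 tok/s figure is
// an illustrative assumption, not a measured number.
func turnFloorSeconds(inputTokens: Double, prefillTokensPerSecond: Double) -> Double {
    inputTokens / prefillTokensPerSecond
}

print(turnFloorSeconds(inputTokens: 3000, prefillTokensPerSecond: 250)) // screenshot path: 12 s before the first output token
print(turnFloorSeconds(inputTokens: 400, prefillTokensPerSecond: 250))  // AX-tree path: 1.6 s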
The three contracts that stay constant when you swap to MLX
Once you have set the URL, three things in the loop are now load-bearing in a way they were not when you were running on frontier cloud Claude. Worth checking each one before blaming the model for a session that went sideways.
1. Tool schemas: same JSON, fewer of them per session
The 20 native tools registered in acp-bridge/src/fazm-tools-stdio.ts still flow into the system prompt, plus any MCP tools the bridge spawned. On cloud Claude this is fine. On a 13B MLX model the tool-list section of the prompt is a real chunk of your context budget, and longer tool schemas push out the actual screen state. The pragmatic move is to disable MCP servers you are not using for a given session, and lean on the 10s/120s/300s timeout tiers (TOOL_TIMEOUT_INTERNAL_MS, TOOL_TIMEOUT_MCP_MS, TOOL_TIMEOUT_DEFAULT_MS at acp-bridge/src/index.ts lines 114-116) as the safety net for tool-call mistakes that smaller models will make more often.
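The tiers are TypeScript constants, but the pattern they implement is just a race between the tool call and a deadline. A minimal Swift sketch of that pattern for illustration; the helper and runTool are hypothetical names, not fazm's code:

struct ToolTimeout: Error {}

// Races a tool call against a deadline so a stuck call cannot freeze the
// whole agent. Hypothetical helper; the real implementation is TypeScript
// in acp-bridge/src/index.ts.
func withTimeout<T: Sendable>(ms: UInt64, _ toolCall: @escaping @Sendable () async throws -> T) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await toolCall() }
        group.addTask {
            try await Task.sleep(nanoseconds: ms * 1_000_000)
            throw ToolTimeout() // surfaces like a TOOL_TIMEOUT event in the log
        }
        guard let result = try await group.next() else { throw ToolTimeout() }
        group.cancelAll() // first child to finish wins; the loser is cancelled
        return result
    }
}

// Usage mirrors the three tiers (runTool is a hypothetical stand-in):
// try await withTimeout(ms: 10_000)  { try await runTool(call) } // internal
// try await withTimeout(ms: 120_000) { try await runTool(call) } // MCP
// try await withTimeout(ms: 300_000) { try await runTool(call) } // default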
2. Screen state: still AX-tree, still accessed the same way
The macos-use MCP server (bundled binary at MacOS/mcp-server-macos-use, registered at acp-bridge/src/index.ts line 1687) keeps doing what it did. Your MLX model sees the same compact tree the cloud model did. The difference is that the model needs to actually read it, so a prompt that worked on Claude with long, narrative screen descriptions should be tightened to explicit bullet-style summaries. Smaller models reward discipline in the prompt more than larger ones do.
3. Scheduler: cron-runner spawns one fresh ACP session per fire
acp-bridge/src/cron-runner.mjs spawns a new ACP session against the same bridge on each scheduled run, and the agent itself can schedule its future runs through routines_create, routines_list, routines_update, routines_remove, and routines_runs. With a local MLX bridge that means every fire reloads the model into MLX cache once (cold start matters), then runs to completion. If your MLX server supports keeping the model resident across requests, that cold start is one-time per server lifetime; otherwise plan for it on the first scheduled fire each session.
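For concreteness, a scheduled run starts life as a standard Anthropic tool_use block. The block framing below is the real Messages format; the input field names are hypothetical placeholders, since routines_create's schema is not spelled out here:

{
  "type": "tool_use",
  "id": "toolu_01…",
  "name": "routines_create",
  "input": {
    "schedule": "<cron expression (field name hypothetical)>",
    "prompt": "<what the fresh ACP session should do (field name hypothetical)>"
  }
}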
A pragmatic setup, end to end
- Pick a tool-trained MLX checkpoint sized to your hardware: Qwen2.5-Coder-32B-Instruct in 4-bit MLX on a 32GB Mac is a sensible sweet spot; Llama-3.3-70B in 4-bit on 64GB if you can spare the memory.
- Pick a bridge that explicitly speaks Anthropic /v1/messages. Today that is claude-code-mlx-proxy, vllm-mlx, claude-code-local, or LM Studio 0.4+. Start it, confirm it is bound on a localhost port, and hit /v1/messages with a curl to verify the model loads and returns a valid Anthropic-shaped response (a Swift version of the same check follows this list).
- Open fazm Settings, AI Chat tab, toggle Custom API Endpoint, paste your local URL. The bridge restarts on submit. The next session you start will spawn the agent loop with ANTHROPIC_BASE_URL pointed at your bridge.
- Run a smoke test: ask the loop to capture a screenshot, name what it sees, and click the closest button to a specific element. If it chains those two tool calls correctly, the loop is in sync with the model. If it does not, switch to a more heavily tool-trained checkpoint or a stricter bridge.
- Trim the surface area. In Settings, disable the MCP servers you are not using this session. Tool schemas you do not need are input tokens you are paying for on every turn.
- Watch the bridge stderr alongside the fazm log at /tmp/fazm-dev.log (dev) or /tmp/fazm.log (production). Malformed tool-call JSON errors and TOOL_TIMEOUT events show up there first.
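The verification request, as a Swift script if you would rather not hand-write curl flags. The port and the model name are assumptions; match them to your bridge:

// Smoke test for a local Anthropic-shaped bridge. Assumes 127.0.0.1:8888
// (claude-code-mlx-proxy's default); adjust for your server. Runnable with
// `swift check.swift` on Swift 5.7+ (top-level await).
import Foundation

var request = URLRequest(url: URL(string: "http://127.0.0.1:8888/v1/messages")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version") // required by some bridges, ignored by others
request.httpBody = try! JSONSerialization.data(withJSONObject: [
    "model": "local-mlx", // most local bridges ignore or remap the model name
    "max_tokens": 64,
    "messages": [["role": "user", "content": "Reply with the word ok."]],
])

do {
    let (data, response) = try await URLSession.shared.data(for: request)
    // Healthy: HTTP 200 and an Anthropic-shaped body, i.e. a top-level
    // "type": "message" with a content array of text blocks.
    print((response as! HTTPURLResponse).statusCode)
    print(String(data: data, encoding: .utf8) ?? "<non-UTF8 body>")
} catch {
    print("Bridge unreachable: \(error)")
}

If the status is 200 but the body is OpenAI-shaped (a choices array instead of content blocks), you are pointed at a /v1/chat/completions server and need one of the translators above.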
Bringing a local MLX bridge to a real desktop workflow?
If you want to walk through the swap on your machine, share what you tried and where your model goes wrong in the loop, and I can sit on a call and help.
Frequently asked questions
Does fazm talk to MLX directly?
No, and that is the design. Fazm spawns the open source claude-agent-acp 0.29.2 (acp-bridge/package.json) as its agent loop, which speaks the Anthropic Messages API. To use a local MLX model you put a small Anthropic-compatible HTTP server in front of MLX (claude-code-mlx-proxy, vllm-mlx, mlx-omni-server, claude-code-local, or LM Studio 0.4 plus its /v1/messages endpoint). Fazm then routes through that server. The agent loop never imports MLX. The bridge stays Anthropic-shaped. Only the URL changes.
What exactly do I change in fazm to point at my local MLX bridge?
One field. Settings > AI Chat > toggle Custom API Endpoint, paste your local URL (the placeholder shown in the field is a generic proxy URL, but 127.0.0.1 plus whatever port your bridge picked is fine). On submit the bridge is restarted via restartBridgeForEndpointChange. Under the hood that field is an @AppStorage value at SettingsPage.swift line 885 keyed customApiEndpoint, and the bridge reads it on spawn at ACPBridge.swift lines 468-470 and exports it as ANTHROPIC_BASE_URL. That is the entire surface area.
What size MLX model should I run if I want the loop to actually finish tasks?
Below 8B, expect the loop to get confused on multi-step tool sequences. 13B is the floor where you can reliably finish a 6-8 step desktop workflow that touches two apps. 30B class and up (Qwen2.5-Coder-32B-Instruct in 4-bit MLX, Llama-3.3-70B in 4-bit MLX on a 64GB Mac) is where it starts feeling like a normal Claude session, with the catch that prefill speed becomes the bottleneck once accessibility-tree input gets long. Generation tok/s past about 25 stops mattering; what matters is how fast the prefill chews through the system prompt plus tool schemas plus current screen state on each turn.
Why does the loop use an accessibility tree instead of screenshots? Doesn't MLX have vision?
It does, and you can use a vision MLX model if you want, but the math punishes you. A typical macOS window vision-tokenizes to roughly 1500-3000 input tokens. The same window through the macOS accessibility API serializes to roughly 200-400 tokens, each element contributing a role, optional text, and a bounding box. On a 13B MLX model that is the difference between two-minute turns and twelve-minute turns. Fazm exposes both paths: capture_screenshot is a tool the loop can call deliberately, but the default screen-state representation goes through the macos-use MCP server (bundled binary at MacOS/mcp-server-macos-use, registered in acp-bridge/src/index.ts line 1687) and feeds the model a compact tree, not pixels.
What about tool calling? My MLX model wasn't trained on Anthropic tool-use tokens.
That is the bridge's problem, not the loop's problem. claude-code-mlx-proxy and vllm-mlx both translate the Anthropic tool-use block format to whatever native format your model speaks (Hermes-style, Llama-3.1-style, Qwen function calls) and back, so the agent loop on the fazm side keeps emitting valid Anthropic Messages and never knows the model wasn't trained on them. mlx-omni-server presents OpenAI-format tools, so for fazm specifically you want a bridge that explicitly speaks /v1/messages, not /v1/chat/completions. Read the bridge's README before picking a model: each one has a recommended set of tool-trained MLX checkpoints under huggingface.co/mlx-community.
Will the 20 native tools and the cron scheduler still work if I'm pointed at a local MLX bridge?
Yes. The 20 stdio tools (execute_sql, capture_screenshot, request_permission, scan_files, ask_followup, routines_create, routines_list, routines_update, routines_remove, routines_runs, save_knowledge_graph, save_observer_card, speak_response, set_user_preferences, complete_onboarding, extract_browser_profile, edit_browser_profile, query_browser_profile, check_permission_status, and one more) are defined in acp-bridge/src/fazm-tools-stdio.ts and registered with the agent loop independently of the model behind it. The cron scheduler at acp-bridge/src/cron-runner.mjs spawns a fresh ACP session against the same bridge on each scheduled fire. From the model's point of view there is just a tool schema and the same Anthropic Messages format. The runtime swap is invisible to the rest of the system.
What breaks first when I move from cloud Claude to local MLX?
Tool-call discipline. A frontier model will plan a 6-step sequence and dispatch tools in the right order. A 13B MLX model will sometimes call the same tool twice in a row, sometimes skip a confirmation, sometimes return malformed JSON in a tool-input field. The 10s/120s/300s timeout tiers in acp-bridge/src/index.ts (TOOL_TIMEOUT_INTERNAL_MS, TOOL_TIMEOUT_MCP_MS, TOOL_TIMEOUT_DEFAULT_MS at lines 114-116) keep a stuck call from freezing the whole agent, but you will hit more of them. Mitigation is shorter system prompts, fewer tools loaded per session, and using the ask_followup tool aggressively so the model checks in instead of guessing.
Is there an MLX model size where the loop just doesn't work, no matter the bridge?
Yes. Anything below 7B routinely fails to emit the tool-use block format reliably enough for the loop to stay in sync, even when the bridge pretends to translate. A 7B Qwen-Coder will limp through a single-app task. A 13B will do real desktop work for short sessions. A 30B 4-bit will hold up for an hour. A 70B 4-bit on a 64GB Mac is the comfortable floor for doing useful multi-app work all day, but you will pay for it in prefill latency. The cleanest test is: ask the model to capture a screenshot, name what it sees, then click the closest button to a specific element. If it does not chain those two tool calls correctly, the loop will not survive a real workflow.
Can I keep cloud Claude and a local MLX bridge configured at the same time and switch?
Today, no. customApiEndpoint is a single field. Either it is empty (default Anthropic) or it has a value (your bridge). The toggle next to the field clears the value when flipped off, which restores the default endpoint immediately. If you want fast switching, the practical move is to point at a small front proxy on localhost (one nginx config, one route flag) that forwards either to your MLX bridge or to api.anthropic.com depending on which you want. The bridge restart on save takes about 1.5 seconds.
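A sketch of that front proxy, assuming nginx; the ports are illustrative, and the switch is one edited line plus a reload:

# Hypothetical localhost front proxy. Point fazm's Custom API Endpoint at
# http://127.0.0.1:8787 once; switch backends here instead of in Settings.
server {
    listen 127.0.0.1:8787;
    location / {
        proxy_pass http://127.0.0.1:8888;        # local MLX bridge
        # proxy_pass https://api.anthropic.com;  # cloud; the TLS upstream
        # proxy_ssl_server_name on;              # also needs SNI enabled
    }
}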
Adjacent reading
Three pieces that pair with this one if you are setting up a local desktop loop.
Local LLM runtime done, agent loop missing: the six pieces the runtime never shipped
The runtime gives you forward pass, KV cache, and decoding. The other six things an agent needs (tool schema, tool sandbox, screen state, conversation state, scheduler, swappable endpoint) live in the loop, with files in the open source app for each.
Local LLM desktop agent throughput: the number that matters is not generation tok/s
Once you swap in a local MLX model, the bottleneck flips from generation to prefill. Per-turn input length divided by prefill tok/s sets your floor on time-to-first-tool-call, which is the latency a user actually feels.
Accessibility tree limits beyond the browser
Why the AX-tree representation works for most macOS apps, where it stops working (canvas surfaces, custom-drawn UIs, web views without ARIA), and which fallbacks earn their place when the tree returns nothing useful.