An argument, not a model shootout
The interesting half of "local first AI coding agent" is the agent, not the weights
Almost every guide on this topic is a model shootout. Qwen3-Coder versus Llama 4 versus DeepSeek, with a paragraph on Ollama at the top and a recommended VRAM tier at the bottom. The model question is real, and the new local models are good enough to matter for real coding work. It is also the easier half of the question. The harder half, and the one almost nobody writes about, is what local-first does to the agent: where the loop runs, what its access surface looks like, and what stops being possible when the agent is a SaaS endpoint instead of a process on your machine. This page is about that half, anchored to one shipping desktop agent whose source you can open right now.
Two layers, only one of them is what people argue about
A coding agent is two stacked things. There is the model, which is a stateless function from prompt and tool results to the next message. And there is the agent process, which is the loop and the bookkeeping around that function: the system prompt, the conversation buffer, the registry of tools, the per-tool wall-clock watchdog, the permission layer, the cancellation path, the diagnostic dump on user interrupt. The agent process is what turns a stateless next-token predictor into something that can pick up a task, call tools, fail safely, and resume tomorrow.
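To make that second layer concrete, here is a minimal sketch of the shape, not Fazm's actual code: every name in it (Tool, callModel, agentTurn) is illustrative. The point is that the loop, the buffer, the tool registry, and the watchdog all live outside the model call.

```typescript
type ToolCall = { name: string; args: unknown };
type ModelReply = { text: string; toolCalls: ToolCall[] };

interface Tool {
  name: string;
  run(args: unknown): Promise<string>;
}

async function agentTurn(
  callModel: (messages: string[]) => Promise<ModelReply>, // stateless: context in, next message out
  tools: Map<string, Tool>,                               // the tool registry
  messages: string[],                                     // the conversation buffer
  toolTimeoutMs = 120_000,                                // the per-tool wall-clock watchdog
): Promise<string[]> {
  for (;;) {
    const reply = await callModel(messages);
    messages.push(reply.text);
    if (reply.toolCalls.length === 0) return messages;    // no tools requested: the turn is done
    for (const call of reply.toolCalls) {
      const tool = tools.get(call.name);
      if (!tool) { messages.push(`unknown tool: ${call.name}`); continue; }
      // If a tool stalls, resolve with a synthetic result instead of hanging the loop.
      const timedOut = new Promise<string>((resolve) =>
        setTimeout(() => resolve(`[tool timed out after ${toolTimeoutMs} ms]`), toolTimeoutMs),
      );
      messages.push(await Promise.race([tool.run(call.args), timedOut]));
    }
  }
}
```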
When the existing playbooks say "local first", they almost always mean the first layer: the model weights are on your machine instead of someone else's. That is a real distinction, and it is the right thing to care about if your blocker is privacy or per-token cost. But the second layer is where the difference between an editor extension and a desktop agent actually lives. You can run a brilliant local model from inside a hosted IDE and still have an agent process whose entire world is the file tree the IDE chose to expose to it. You can run a remote model from a desktop agent and have an agent process that sees the same things you see. The model question and the agent question are independent.
The rest of this page is about the second layer, with one specific desktop agent (Fazm, the macOS app) as the worked example. The reason it works as an example is that the source for the agent layer is one TypeScript file you can open: acp-bridge/src/index.ts, 2914 lines, MIT-licensed, in the public repo at github.com/m13v/fazm.
What "the agent runs locally" actually means here
Concretely, when the user opens the chat panel in the Fazm Mac app, the Swift side spawns a Node subprocess that runs the bridge. The bridge in turn spawns claude-code-acp as another local subprocess and talks to it over JSON-RPC on stdio. The model API call is the only thing that leaves the machine, and only if the configured endpoint is remote. Sessions, tool registries, MCP subprocesses, conversation buffers, screenshots, accessibility tree dumps, and the watchdog timers all live in process memory on the Mac.
The spawn itself is at acp-bridge/src/index.ts:538: acpProcess = spawn(nodeBin, [acpEntry], { env, stdio: ["pipe", "pipe", "pipe"], detached: true }). The detached: true is meaningful: the agent's lifetime is decoupled from the chat window, so closing the floating bar does not tear it down. The default cwd is the user's home directory, set on line 1294 as const DEFAULT_CWD = homedir(), which is what makes Claude Code's native memory layer persist at a stable path under ~/.claude/projects/ across app launches.
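For readers who want the shape without opening the file, here is the spawn with minimal scaffolding around it. The startAgent wrapper and its parameter names are illustrative; the options object and the DEFAULT_CWD line are as quoted above.

```typescript
import { spawn, type ChildProcess } from "node:child_process";
import { homedir } from "node:os";

const DEFAULT_CWD = homedir();                 // line 1294: sessions default to the home directory

let acpProcess: ChildProcess | undefined;

function startAgent(nodeBin: string, acpEntry: string, env: NodeJS.ProcessEnv): void {
  acpProcess = spawn(nodeBin, [acpEntry], {
    env,                                       // carries ANTHROPIC_BASE_URL through if the user set one
    stdio: ["pipe", "pipe", "pipe"],           // JSON-RPC to claude-code-acp travels over these pipes
    detached: true,                            // agent lifetime decoupled from the chat window
  });
}
```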
Where the model lives is a separate question. The bridge reads exactly one environment variable, ANTHROPIC_BASE_URL, to decide that. The Swift side, in Desktop/Sources/Chat/ACPBridge.swift at lines 391 to 393, reads the customApiEndpoint UserDefault and exports it into the subprocess environment if it is non-empty:
```swift
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint
}
```

The Settings page that drives that default contains a hint that reads, verbatim: "Route API calls through a custom endpoint (e.g. local LLM bridge, corporate proxy, or GitHub Copilot bridge). Leave empty to use the default Anthropic API." That is one line of UI, one UserDefault, one env var, one process tree. The model choice is genuinely orthogonal to everything else in this stack.
Where the parts live when you press Send
The access surface is the part the model question hides
A coding agent inside an editor can read your files, run your test suite, and push to your branch. That is genuinely most of the work, and an editor extension is the right shape for that work. What the editor cannot do is look at the failing CI run, read the email from the customer who reported the bug, click the toggle in the Vercel dashboard that controls the env var the bug fix needs, or check whether the change actually rendered the way it was supposed to in a real Chrome window. Each of those tasks is ten seconds of human attention. Stacked across a workday, they are most of the friction.
The relevant function in this stack is buildMcpServers on line 992 of the bridge. It is the one place where the agent's access surface is decided. It returns an array of MCP server configs, one for each subprocess the agent can call into. The set is hardcoded by name on line 1266: new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"]). Each one is a different angle into your computer that an editor extension does not get.
The agent's access surface, line 1266
Five MCP servers are wired by default:
- fazm_tools: bash, file read/write/edit, screen capture, SQLite, and the framework's internal helpers, exposed through a Unix socket back to the Swift app
- playwright: drives the user's real Chrome session through the Fazm extension token, not a headless instance, with snapshots written to /tmp/playwright-mcp
- macos-use: a native Swift binary that reads and writes the macOS accessibility tree, so the agent can click in any Mac app the way a person does
- whatsapp: controls the WhatsApp Catalyst app via accessibility APIs; supports search, open chat, send, read, scroll, and quit
- google-workspace: a bundled Python server for Gmail, Calendar, Docs, Sheets, Drive, with credentials kept under ~/.google_workspace_mcp/credentials/
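The return value of buildMcpServers is an array of configs along the lines of the sketch below. The field names follow the usual MCP stdio-server convention and are an assumption rather than a quote from the file; the server names and the PLAYWRIGHT_USE_EXTENSION variable come from the text above, and the command and args values are placeholders.

```typescript
// Hedged sketch: one plausible shape for the configs buildMcpServers (line 992) returns.
type McpServerConfig = {
  name: string;                      // must appear in the hardcoded Set on line 1266
  command: string;                   // executable to spawn as a subprocess
  args: string[];
  env?: Record<string, string>;
};

const BUILTIN_MCP_NAMES = new Set([
  "fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace",
]);

const playwrightServer: McpServerConfig = {
  name: "playwright",
  command: "npx",                              // placeholder launcher
  args: ["@playwright/mcp@latest"],            // placeholder package spec
  env: { PLAYWRIGHT_USE_EXTENSION: "true" },   // attach to the user's real Chrome, not a headless instance
};
```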
The one to look at if you only look at one is macos-use. It is a native Swift binary, bundled at Contents/MacOS/mcp-server-macos-use inside the app, that drives Mac apps through the macOS accessibility tree. The agent does not take a screenshot and ask the model to find a button by pixel. It enumerates the AX tree, reads the button's role, title, and frame, and clicks at the geometric center of that frame. It is how the agent can change a value in System Settings, a CRM you have logged into in Safari, or an email you have open in Mail, while the user is doing something else.
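A minimal illustration of the click-at-center step, again not Fazm's code: the AXElement shape below is hypothetical, but the mechanism is the one described above.

```typescript
type AXFrame = { x: number; y: number; width: number; height: number };
type AXElement = { role: string; title: string; frame: AXFrame };

// The agent clicks the geometric center of the element's frame instead of asking
// the model to locate a button by pixel in a screenshot.
function clickPoint(el: AXElement): { x: number; y: number } {
  return {
    x: el.frame.x + el.frame.width / 2,
    y: el.frame.y + el.frame.height / 2,
  };
}
```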
That is also the part the model choice does not affect. A 4-bit local quantization of Llama 4 Scout running on a 64 GB MacBook can drive the accessibility tree exactly as well as a hosted Sonnet, modulo tool-call schema reliability. The access surface is set by the agent, not by the weights.
Three things a desktop agent does that a SaaS agent cannot
The first is auto-approving its own tool permissions. In Fazm's bridge, the session/request_permission handler around line 580 picks the allow_always option whenever it is offered. That is the right default for a single-user machine where the user already granted Accessibility, Screen Recording, and Automation permissions to the parent app at install time. It is the wrong default for a hosted product that runs the same agent against someone else's environment, which is why hosted products either prompt on every tool or run with a much smaller pre-approved tool set. The desktop shape is the only one where bypass-permissions is honest.
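A hedged sketch of that auto-approval behaviour, for orientation: the option type and kind strings below are illustrative, not quoted from the handler around line 580, but the decision rule (prefer allow_always whenever it is offered) is the one described above.

```typescript
type PermissionOption = {
  optionId: string;
  kind: "allow_always" | "allow_once" | "reject_once" | "reject_always";
};

function choosePermission(options: PermissionOption[]): string | null {
  const always = options.find((o) => o.kind === "allow_always");
  if (always) return always.optionId;
  const once = options.find((o) => o.kind === "allow_once");
  return once ? once.optionId : null;     // otherwise decline and let the caller handle it
}
```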
The second is using the user's real, logged-in browser session. The playwright MCP, when launched with PLAYWRIGHT_USE_EXTENSION=true and a token from the Fazm Chrome extension, attaches to your running Chrome instead of spawning a fresh, fingerprintable headless instance. That means the agent inherits your Cloudflare Turnstile clearance, your Google sign-in, your saved cookies, your two-factor session. Hosted agents cannot do that without you uploading those cookies, which the better hosted agents refuse to support and the worse ones support insecurely.
The third is the diagnostic dump on user interrupt. When the user hits stop, the bridge walks an in-flight tools map and writes one stderr line per stuck tool, including the tool name, kind, session, elapsed seconds, and the truncated arguments the model passed. That dump goes to the user's local log file at /tmp/fazm-dev.log on dev builds and /tmp/fazm.log on production. A SaaS agent can do the same, but the user does not own the log; debugging a stuck remote agent is exactly as good as the vendor decides to make it. With a desktop agent, the log is yours, and the agent process can be attached to with a debugger if you actually want to know why a tool stalled.
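The shape of that dump, sketched under the assumption of a simple in-flight map (the actual map structure in the bridge may differ): one stderr line per stuck tool, with name, kind, session, elapsed time, and truncated arguments.

```typescript
type InFlightTool = {
  name: string;
  kind: string;
  sessionId: string;
  startedAt: number;     // Date.now() at tool start
  args: unknown;
};

function dumpInFlightTools(inFlight: Map<string, InFlightTool>): void {
  for (const [callId, t] of inFlight) {
    const elapsedS = ((Date.now() - t.startedAt) / 1000).toFixed(1);
    const args = JSON.stringify(t.args ?? null).slice(0, 200);   // truncate long arguments
    process.stderr.write(
      `[interrupt] stuck tool=${t.name} kind=${t.kind} call=${callId} ` +
      `session=${t.sessionId} elapsed=${elapsedS}s args=${args}\n`,
    );
  }
}
```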
The counterargument, taken seriously
The honest objection is that an agent with this much access surface is a much bigger blast radius than an editor extension that can only touch the repo it has open. A local agent that auto-approves tools, drives your real browser session, and clicks in arbitrary Mac apps via accessibility can also auto-approve a tool that ships your private SSH keys to a remote MCP, drive your real browser into a phishing page that the model hallucinated, and click "Confirm payment" in a Stripe dashboard tab the user did not realise was open. A bug in the agent and a bug in the model both produce real-world consequences instead of a stack trace.
The mitigations are real but partial. The bridge logs every tool call to a local file and every tool result to a separate one. The user can see which MCP servers are loaded in the chat header. The Chrome extension injects a visual overlay on every tab the agent is driving so the user knows it is happening (the --init-page flag passed to playwright on line 1037 is what enforces that). Tool timeouts cap how long any individual tool can run before the harness forces a synthetic completion. The watchdog is unrelated to safety, but the side effect is that an agent that has gone off the rails inside a tool surfaces fast instead of hanging silently.
None of that mitigates the case where the model actively decides to do something destructive. For that, the answer is the one the model providers already use: keep destructive operations behind a read-write-confirm seam (drop tables, force pushes, payment confirmations, file system rm), and prefer additive operations everywhere else. Whether that seam is enough is a real open question, and a buyer who is not comfortable with it should not run a desktop agent. That is fine. The point of this guide is not that everyone should adopt this shape. The point is that the choice is a real choice, and that the model question ("which weights do I run") is genuinely orthogonal to it.
What "local first" should actually mean for a coding agent in 2026
The clean version is two questions, not one.
- Where do the weights live? This is the model question. Answers: a hosted endpoint (Anthropic, OpenAI, Bedrock), a corporate proxy that rewrites the same wire format, a local llama-server with a small adapter in front, an Ollama instance with LiteLLM in proxy mode. In Fazm, this is one env var.
- Where does the agent process live, and what can it touch? This is the agent question. Answers: a SaaS endpoint that opens a sandboxed VM (Devin-shaped), a hosted-but-tunneled agent that reads your repo over a tunnel (Cursor Composer-shaped), an editor extension that reads your file tree but nothing else (Cline, Continue, Aider), or a process on your own machine that has the access of a logged-in user (Fazm-shaped, also Goose, also various agent CLIs run with full shell).
A search for "local first AI coding agent" today returns mostly combinations of question 1's answers ranked against each other, with question 2 collapsed into a single implicit "VS Code extension" answer. The interesting space is the Cartesian product. A remote model and a local agent (most of Fazm's users today). A local model and a hosted agent (Cursor configured with a self-hosted vLLM endpoint). A local model and a local agent (the maximalist privacy answer, which the new local-coder models finally make practical). A remote model and a hosted agent (the default everywhere else).
For most working programmers right now, the answer that earns its keep is a local agent process paired with whichever weights they trust. Privacy buyers are served by question 1, and access-surface buyers are served by question 2. The two are independent. Pages that conflate them are answering the easier question and skipping the one that actually changes the day-to-day shape of the work.
Want to see this run on your own desktop?
Walk through the bridge with the team, point it at any Anthropic-compatible endpoint, and see what changes when the agent has full Mac access instead of a sandboxed file tree.
Questions readers actually ask about this
If the model still runs in Anthropic's data center by default, in what sense is this 'local first'?
Two things are usually conflated. The model weights live somewhere (could be Anthropic, could be a Cloudflare-fronted proxy, could be llama-server on your Mac). The agent process is something else: the loop that decides which tool to call, the bookkeeping for sessions and timeouts, the MCP server registry, the permission layer, the diagnostic dump. In Fazm, that whole agent process runs on your Mac as a Node subprocess spawned by the Swift app, regardless of where the model is. The reframe is that the local-first conversation has been collapsed into 'where do the weights live', and that is the easier half of the question. The harder half is whether the agent has access to your real desktop or only to a sandboxed file tree.
How does Fazm actually point the agent at a different model endpoint?
There is one setting and one environment variable. The user opens Settings, Advanced, and toggles Custom API Endpoint. The Swift side reads the customApiEndpoint UserDefault and, on bridge restart, exports it as ANTHROPIC_BASE_URL into the spawned claude-code-acp subprocess. The acp-bridge passes the env through unchanged. Anything that speaks the Anthropic Messages API on the other end works: a corporate proxy, GitHub Copilot's Anthropic-compatible bridge, an Ollama or LM Studio endpoint with a small translator in front. The relevant lines are Desktop/Sources/Chat/ACPBridge.swift at lines 391 to 393 for the env var export, and the spawn at acp-bridge/src/index.ts:538.
What is the agent's actual access surface? An editor extension is just the file tree.
Five MCP servers are wired by default in the buildMcpServers function (acp-bridge/src/index.ts, line 992). They are listed explicitly in the BUILTIN_MCP_NAMES Set on line 1266: fazm_tools (the framework's internal tools, including bash, file edit, screen capture, SQLite query), playwright (drives a real Chrome via the Fazm extension token, not a headless browser), macos-use (a native Swift binary that controls Mac apps via accessibility APIs), whatsapp (controls the WhatsApp Catalyst app), and google-workspace (a bundled Python server for Gmail, Calendar, Docs, Sheets, Drive). A user can append more in ~/.fazm/mcp-servers.json. That is the access surface. The same agent that wrote your code can also reply to the customer-support email about that code, in the same session.
What is the workspace and where does memory persist?
The default cwd for ACP sessions is the user's home directory. Line 1294 of acp-bridge/src/index.ts says it plainly: const DEFAULT_CWD = homedir(). The reason is that Claude Code's native memory layer writes to ~/.claude/projects/<encoded-cwd>/memory/, and choosing the home directory gives the broadest memory coverage and shares it with any CLI session started from the same place. When the user changes the workspace inside the app, a new ACP session is created with the new cwd, and the bridge invalidates the cached session entry. The session map is keyed by sessionKey and stores cwd alongside sessionId and model.
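A hedged sketch of that session bookkeeping: only the keying by sessionKey and the stored fields (sessionId, cwd, model) are taken from the text above; the type and function names below are illustrative.

```typescript
type SessionEntry = { sessionId: string; cwd: string; model: string };

const sessions = new Map<string, SessionEntry>();   // keyed by sessionKey

function handleWorkspaceChange(sessionKey: string, newCwd: string): void {
  const entry = sessions.get(sessionKey);
  if (entry && entry.cwd !== newCwd) {
    // Invalidate the cached entry; a new ACP session is created with the new cwd.
    sessions.delete(sessionKey);
  }
}
```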
If the model is still Anthropic's, isn't this just a wrapper?
It is a wrapper in the same sense that a Mac app that uses CoreData is a wrapper around SQLite. The interesting part is what wraps it. Three things in particular. First, the agent process is detached, lives across many user prompts, and owns its own MCP subprocess fleet: closing the chat window does not tear down the agent. Second, the bridge auto-approves tool permissions in the request_permission handler (line 580), which is the right default for a single-user Mac app and the wrong default for a hosted product, so a hosted product cannot do this even if it wanted to. Third, the access surface is not editor-shaped. None of those follow from the model choice. They follow from the agent running on the user's machine.
Can I actually point this at a local LLM today, or is the custom endpoint hypothetical?
It works the moment the URL on the other end speaks the Anthropic Messages API. For raw Ollama or LM Studio you need a small translator in front (Anthropic format in, OpenAI format to the local server, OpenAI format back, Anthropic format out). The community shipped at least three of those in the second half of 2025, and one of them (LiteLLM in proxy mode) is what most people end up using. The Settings hint inside the app reads, verbatim: 'Route API calls through a custom endpoint (e.g. local LLM bridge, corporate proxy, or GitHub Copilot bridge). Leave empty to use the default Anthropic API.' That is in Desktop/Sources/MainWindow/Pages/SettingsPage.swift around line 946. Tool-call reliability matters more than raw coding quality for this path: a local model good enough to write a unit test is fine; a local model that cannot reliably emit a tool-call schema is not.
Why does it matter that the agent can drive a real Chrome session, for coding work?
Real coding work is not 100 percent in the editor. A typical change to a SaaS feature involves: reading the current Stripe invoice in the dashboard to confirm the pricing tier, opening the GitHub PR to see review comments, checking the failing CI run on Vercel, adjusting an env var in the Vercel UI, then writing the fix. An editor-bound agent can write the fix. It cannot read the dashboard or set the env var. A coding agent that lives in the editor and a separate person clicking around in five tabs is the current default. A coding agent that is also the person clicking around in five tabs is the version that compresses the loop. Whether or not that bargain is worth the surface area is a real question, but it is the right question, and it is invisible if 'local first' means only 'which model'.
Is the code for this open and readable, or is this a marketing claim?
Open. The full bridge is github.com/m13v/fazm under acp-bridge/src/index.ts, MIT-licensed, 2914 lines as of this writing. The bridge-side line numbers in this guide (538 for the spawn, 992 for buildMcpServers, 1266 for the BUILTIN_MCP_NAMES Set, 1294 for DEFAULT_CWD) all resolve in that file; the env var export at lines 391 to 393 is on the Swift side, in Desktop/Sources/Chat/ACPBridge.swift, in the same repo. The native macOS-use accessibility binary lives in github.com/browser-use/macOS-use and is bundled into the app under Contents/MacOS/mcp-server-macos-use. Every claim above can be checked by reading those files.
From the same source tree
Adjacent reading
AI agent harness scaffolding, told as the failure-recovery layer most write-ups skip
Field notes on the part of an agent harness that earns its keep when a tool stalls, anchored to the same file as this guide.
Open source AI agent framework that ships as a Mac app
The same agent process, framed as a framework you can read instead of a SaaS endpoint you can only invoke.
Run vLLM locally on Mac with an AI agent
What changes when the weights actually live on your machine, and what does not.