The April 2026 local LLM wave is real. Your desktop agent still can't drive a button.
Every local-LLM roundup this month is about models: Llama 4, Qwen 3.5, Gemma 4 with day-one vLLM support, DeepSeek, MoE variants that fit in 16 GB of RAM. Every one of them stops at the model. If you want to run a local model and have it actually drive your Mac, the model is not the hard part. The hard part is the context you feed it. This guide reads those April 2026 updates through the lens of one consumer Mac app whose source tree already solved the context problem, with four files you can read today.
The April 2026 update, in one paragraph
A wave of open-weight releases landed inside roughly two weeks. Llama 4 variants reached public weights. Qwen 3.5 shipped in both dense and MoE. Gemma 4 arrived with day-one vLLM support in v0.19.0 across four variants: E2B, E4B, 26B MoE, and 31B Dense. DeepSeek pushed another open release. The tooling layer kept up: vLLM went v0.18.0 to v0.19.0 to v0.19.1rc0 in 14 days; Ollama and LM Studio picked up the new checkpoints within days. The practical consumer headline is that Q4 quantization puts a strong 13B-class model inside 16 GB of unified memory.
what shipped
Three shipping clusters, read honestly
Models
Llama 4, Qwen 3.5, Gemma 4, DeepSeek pushed open weights. Consumer takeaway: 16 GB of unified memory now runs a strong 13B-class model at Q4 with usable latency. 32 GB opens Gemma 4 26B MoE, which is the current sweet spot for agent-style reasoning per unit of active parameter cost.
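The RAM claims above are simple arithmetic. A back-of-envelope sketch (the flat 2 GB allowance for KV cache and runtime overhead is an assumption here, not a measured figure):

```python
def q4_footprint_gb(params_billions: float, overhead_gb: float = 2.0) -> float:
    """Rough Q4 memory estimate: ~0.5 bytes per parameter for weights,
    plus a flat allowance for KV cache and runtime overhead."""
    weights_gb = params_billions * 0.5  # 4-bit quant ≈ 0.5 bytes/param
    return weights_gb + overhead_gb

# 13B at Q4: ~6.5 GB of weights, ~8.5 GB total -> fits in 16 GB with room.
# 26B at Q4: ~13 GB of weights, ~15 GB total -> wants the 32 GB tier.
print(q4_footprint_gb(13))  # 8.5
print(q4_footprint_gb(26))  # 15.0
```

MoE models complicate this slightly: all expert weights must fit in memory, but only the active parameters cost compute per token, which is why the 26B MoE feels faster per turn than a 26B dense model.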
Serving tooling
vLLM v0.18.0 added gRPC and GPU NGram speculative decoding. v0.19.0 shipped Gemma 4 on day one and flipped async scheduler on by default, which lowers time-to-first-token. v0.19.1rc0 followed the next day. Ollama and LM Studio absorbed the new checkpoints quickly.
Client plumbing
Less visible in the SERP. Fazm shipped custom API endpoints (v2.2.0, 2026-04-11) and custom MCP servers via ~/.fazm/mcp-servers.json (v2.4.0, 2026-04-20). This is the part that turns a fast local model into an agent that can actually touch apps, not a chat window.
the gap nobody names
A fast local LLM with screenshot context is still a bad agent
The standard way a desktop agent tells the model what is on screen is a screenshot plus a prompt. That works for frontier vision models. It falls over on a local LLM, for three reasons stacked on top of each other. Tokens: a 1024-wide PNG is an order of magnitude more context than the accessibility tree for the same window. Fidelity: small local models OCR-hallucinate labels under load; the tree names them. Reach: screenshots cannot see background windows, disabled controls, or app-private panels; the accessibility API can.
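The token gap is easy to estimate yourself. A crude sketch, assuming roughly 4 characters per token and a modest screenshot size (both numbers are ballpark assumptions):

```python
def approx_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Crude token estimate from character count."""
    return int(n_chars / chars_per_token)

png_bytes = 150_000                 # assumed size of a 1024-wide screenshot
base64_chars = png_bytes * 4 // 3   # base64 inflates payloads by ~33%
tree_chars = 2_000                  # a few hundred labels, roles, values

print(approx_tokens(base64_chars))  # 50000 -- tens of thousands of tokens
print(approx_tokens(tree_chars))    # 500   -- a few hundred tokens
```

Two orders of magnitude of context budget, before the model has read a single instruction.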
How the pieces actually connect
A desktop agent that uses a local LLM needs three things in the same room: a local model server, a translation shim between the agent protocol and the server's API, and a screen-context producer that is not a screenshot. Fazm wires the third one in natively, and exposes the first two as configuration. Here is the flow.
Fazm with a local LLM, end to end
the uncopyable part
Four files, two line numbers, and a UserDefaults key
The contract between Fazm and whatever is serving the model is exactly one string. It lives in UserDefaults under the key customApiEndpoint, and it gets exported as ANTHROPIC_BASE_URL on the child process that speaks the Messages API. Four lines of Swift enforce it.
The Settings UI that writes this value is at Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 906 to 952, titled "Custom API Endpoint" with the placeholder https://your-proxy:8766. The second half of the plumbing is tools, not models. Any local-model-backed MCP server registers in ~/.fazm/mcp-servers.json.
“The bundled mcp-server-macos-use binary lives at Contents/MacOS in the .app bundle. It reads the accessibility tree for whatever app is focused and returns structured elements, not pixels. That is what your local LLM gets as context.”
The bundling step for that binary: run.sh line 212, codemagic.yaml line 200.
What you have to do to point Fazm at a local model
Five steps, no magic. The middle three are the ones every roundup skips.
Run a local server
Pick one: Ollama, LM Studio, or vLLM. Pull a model that fits your RAM. April 2026 sweet spots: Llama 3.3 8B Q4 on 8 GB, Qwen 3.5 14B on 16 GB, Gemma 4 26B MoE on 32 GB.
Run an Anthropic-to-OpenAI shim
Fazm speaks the Anthropic Messages API. Your local server speaks OpenAI-compatible. One of the open-source shims in the middle translates the two. Run it on a local port.
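The core of what such a shim does is a request-shape translation. A minimal non-streaming sketch (real shims also map tool-use blocks, stop reasons, and streaming events; this only shows the basic field mapping between the two public API shapes):

```python
def anthropic_to_openai(req: dict) -> dict:
    """Map an Anthropic Messages API request body onto an
    OpenAI-compatible /v1/chat/completions body. Sketch only:
    no tool-use, streaming, or error handling."""
    messages = []
    if "system" in req:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])  # both APIs use role/content pairs
    return {
        "model": req["model"],
        "max_tokens": req.get("max_tokens", 1024),
        "messages": messages,
    }

body = anthropic_to_openai({
    "model": "qwen3.5:14b",
    "system": "You are a macOS agent.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Click Save."}],
})
print(body["messages"][0]["role"])  # system
```

The reverse mapping (OpenAI response back into an Anthropic Messages response) is the other half of the shim's job; the off-the-shelf shims handle both directions.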
Paste the shim URL into Settings
Fazm → Settings → Chat → Custom API Endpoint. Toggle on, paste the shim URL (example: http://127.0.0.1:8766). On submit, the app calls restartBridgeForEndpointChange() and the next query respawns the bridge with the new env var.
Optionally register local-model-backed tools
Edit ~/.fazm/mcp-servers.json. Add any stdio MCP server (including one that wraps your local LLM) with name, command, args, env, enabled. The Settings UI lists it and you can toggle per server.
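A sketch of one entry using the five documented fields. The server name, binary path, and the file's exact top-level shape are assumptions for illustration; check Desktop/Sources/MCPServerManager.swift for the authoritative schema:

```python
import json

# Hypothetical entry: a stdio MCP server wrapping a local Ollama endpoint.
# Field names (name, command, args, env, enabled) come from the guide;
# everything else here is made up for illustration.
server = {
    "name": "local-llm-tools",
    "command": "/usr/local/bin/my-mcp-server",        # hypothetical binary
    "args": ["--backend", "http://127.0.0.1:11434"],  # e.g. an Ollama URL
    "env": {"LOG_LEVEL": "info"},
    "enabled": True,
}
print(json.dumps(server, indent=2))
```

Toggle `enabled` to false to park a server without deleting its entry; the Settings UI exposes the same switch per server.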
Send a message
The accessibility tree of the focused app, your message history, and any tool results go to your local model. The model responds. Fazm executes the tool calls it emits via mcp-server-macos-use. No screenshots leave your machine.
What a verification session looks like
Screenshots vs. accessibility tree, line by line
| Feature | Screenshot agents | Fazm (accessibility tree) |
|---|---|---|
| Context size per window | Tens of thousands of tokens as base64 image | A few hundred tokens of structured labels |
| Requires vision-capable model | Yes. Non-vision local models can't consume it | No. Any text LLM works |
| Small-model reliability (7B-14B) | Low. OCR hallucinations compound | High. Labels are explicit |
| Background window reach | No. Screenshot is foreground only | Yes. AX sees every app |
| Disabled / hidden element state | Guessed from pixel cues | Read directly from AXEnabled, AXHidden |
| Local-only data path | Depends on the vendor | Yes. Local binary, local LLM, zero cloud |
What actually works with what
Any OpenAI-compatible local server works once an Anthropic-to-OpenAI shim is in front of it. Fazm only cares about the URL and the Messages API shape.
Models this month, ranked for Mac agent use
What the top search results get right, and what they miss
The top five results on this query are: LLM news aggregators, price-per-token trackers, open-source model comparisons, Ollama-and-LM-Studio guides, and a hardware buying guide. Each is well-written for its audience. All five stop at the model or the hardware.
The question they do not answer: once you have a running local model, why does a small open-source Mac agent still miss the Save button half the time? The answer is not model quality. It is that most agents still ship a screenshot to the model. This guide is written for the reader who already has a local model and is stuck on the next step.
The checklist, if you are shipping a local-LLM Mac agent this month
Shipping checklist
- Pick a local server (Ollama, LM Studio, vLLM)
- Pull a model sized to your RAM, not the biggest one
- Verify the server's OpenAI-compatible endpoint with curl
- Run an Anthropic-to-OpenAI shim on a local port
- Stop sending screenshots as context if you can help it
- Send accessibility-tree context to the model instead
- Keep tool outputs short; local models hate long prompts
- Register local-backed tools in ~/.fazm/mcp-servers.json
- Expect 7B-14B local models to need tighter prompts than Claude
Wiring a local LLM into a real Mac workflow?
Jump on a 20-minute call; bring the model you picked and the app you want it to drive, and we will wire it up live.
Book a call →
Questions people actually ask about this
Frequently asked questions
What were the actual local LLM updates in April 2026?
The April 2026 news is a clustered shipping wave: Llama 4 variants reached public weights, Qwen 3.5 shipped both dense and MoE, Gemma 4 landed with day-one vLLM support in v0.19.0 across E2B, E4B, 26B MoE, and 31B Dense, DeepSeek pushed another open release, and the tooling layer caught up fast. Ollama and LM Studio picked up the new checkpoints within days. vLLM released v0.18.0, v0.19.0, and v0.19.1rc0 inside a 14-day window. The headline is not any one model; it is that a Q4-quantized 7B model now fits in 4 to 5 GB of VRAM and a 16 GB Mac can serve a strong 13B-class model without swapping.
If the local models are this good, why does my Mac agent still feel broken?
Because the model is not the bottleneck anymore. The bottleneck is what you feed the model. Most agents pass a screenshot of the current screen plus a prompt. Screenshots lose the semantic structure of the UI: button roles, field labels, window hierarchy, enabled state, ARIA values. The local LLM guesses instead of reads. On macOS the alternative exists and has for years: the Accessibility API (AXUIElement) returns the actual tree, with every label, role, value, and relationship named. An agent that consumes the accessibility tree does not OCR a button, it asks the system what the button does. Fazm v1.5.0 (2026-03-27) replaced the screenshot path with this, and the bundled binary that does it is mcp-server-macos-use.
Can Fazm use a local LLM today, or is it Claude-only?
It can use one, indirectly, through the Custom API Endpoint setting that shipped in v2.2.0 on April 11, 2026. You put a URL in Settings, Fazm writes it to UserDefaults under the key customApiEndpoint, and the ACP bridge exports that string as env ANTHROPIC_BASE_URL on the Node subprocess at Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382. If you run an Anthropic-to-OpenAI shim in front of Ollama, LM Studio, vLLM, or any other OpenAI-compatible local server, Fazm talks to that endpoint without an app update. The shim is what makes it work; Fazm itself is protocol-agnostic past the Messages API shape.
What file do I write to add a local-LLM-backed tool to Fazm?
~/.fazm/mcp-servers.json. The Settings UI shipped on 2026-04-20 in v2.4.0 and the file is managed by Desktop/Sources/MCPServerManager.swift. Each entry has a name, a command, args, env, and an enabled boolean, mirroring Claude Code's MCP server schema. You can point it at a local-model-backed MCP server (for example one that wraps llama.cpp or an Ollama endpoint) and it becomes a tool the agent can call by name. The config file path is shown in Settings at Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 1841.
Why is the accessibility-tree approach better than a screenshot for a local LLM?
Three reasons, stacked. First, tokens: the accessibility tree for a typical app window is a few hundred tokens of structured labels; a screenshot re-rendered to 1024 pixels wide is an order of magnitude more and requires a vision-capable model you may not be running locally. Second, fidelity: the tree names the button; the screenshot guesses at it. Small local models OCR-hallucinate field labels under load. Third, reach: accessibility sees background windows, disabled controls, and app-private panels that a screenshot cannot capture at all. If the goal is to drive apps, not describe pixels, accessibility is the better substrate.
Does any of this require Claude, or can I run it entirely local?
The desktop side does not require Claude. The accessibility capture layer is a local Swift binary (mcp-server-macos-use) bundled inside Fazm.app at Contents/MacOS; it does not call out to any cloud. The custom MCP servers you register via ~/.fazm/mcp-servers.json are stdio processes on your own machine. The one part that still goes out today is the model. Point customApiEndpoint at a local server and that stops too. The shim layer in front of a local server has to translate Anthropic Messages API shape to whatever the local server exposes (usually OpenAI-compatible); that is the single external dependency and it runs on your machine as well.
Which April 2026 local model is actually a good fit for a Mac agent?
The short answer: Gemma 4 26B MoE if you have 32 GB of unified memory, Qwen 3.5 14B if you have 16 GB, and Llama 3.3 8B Q4 if you have 8 GB. The longer answer depends on whether you want reasoning or snappiness. Agents that emit tool calls in a loop feel latency-bound, so time-to-first-token dominates. MoE models with fewer active parameters feel faster per turn. Gemma 4's 26B MoE has the best ratio of active-parameter cost to agent-style reasoning quality I have seen this month. If you just want a code-biased completion, Qwen2.5-Coder variants are still very strong and run happily at Q4 on consumer silicon.
What exactly is passed to the local model on a typical Fazm request?
Not a screenshot. The message history plus the output of the active tool calls, which for macOS control means structured accessibility data from mcp-server-macos-use. The system prompt tells the model to save snapshots to files and strip inline base64 screenshots to keep context small; that guidance lives in acp-bridge/src/index.ts around line 313. The practical effect is that even a 7B local model has a realistic shot at picking the right button, because the prompt is labels-and-roles, not pixels.
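To make the contrast concrete, here is a toy rendering of what two UI elements look like as tree context. The field names below are assumptions for illustration, not the actual mcp-server-macos-use output schema:

```python
# Illustrative only: field names are assumed, not the binary's real schema.
elements = [
    {"role": "AXButton", "title": "Save", "enabled": True},
    {"role": "AXTextField", "title": "File name", "value": "report.md"},
]
context = "\n".join(
    f'{e["role"]} "{e["title"]}"'
    + ("" if e.get("enabled", True) else " (disabled)")
    for e in elements
)
print(context)
```

A handful of explicit tokens a 7B model can read verbatim, versus tens of thousands of base64 tokens it would have to OCR.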
Where can I verify these claims in the source?
Four files. CHANGELOG.json in the repo root, search for 2026-03-27 (accessibility switch), 2026-04-11 (custom API endpoint), 2026-04-20 (custom MCP servers). Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382 for the endpoint export. Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 906 to 952 for the Custom API Endpoint UI. Desktop/Sources/MCPServerManager.swift for the ~/.fazm/mcp-servers.json loader. The bundling step for the accessibility binary is at run.sh line 212 and codemagic.yaml line 200. The tree is MIT-licensed at github.com/mediar-ai/fazm.
Is there a real benchmark showing accessibility-tree context beats screenshots?
The honest answer is that public benchmarks for agent-driven UI control still lean heavily on screenshot-plus-vision because that is what vision-capable frontier models are tuned for. The argument for accessibility-tree context on a local LLM is mechanistic, not benchmarked: small local LLMs handle short structured strings far better than noisy OCR from screenshots, and the tree preserves information (disabled state, role) that OCR discards. If you are running a local model below 30B parameters and care about reliability, ship tree to it, not pixels. If your local model has strong vision and runs at 70B, either works; above that scale the model absorbs the noise.
Does Fazm require installing Ollama or LM Studio to use a local model?
No. Fazm does not care which local server you use, only that it speaks an OpenAI-compatible surface reachable at the URL you put in Custom API Endpoint and that something translates Anthropic Messages into that surface. Ollama has an OpenAI-compatible endpoint built in; LM Studio does too; vLLM does. The Anthropic-shim layer is the piece you install yourself; there are two or three open-source shims that do this well. Then you flip the Custom API Endpoint toggle in Settings, paste the URL, and restart the chat. That is it.
Related guides
vLLM release 2026 update: three ships in 14 days
What changed in vLLM v0.18.0, v0.19.0, and v0.19.1rc0, and why upgrading your server means zero changes to your Mac agent.
Open-source LLM news 2026
A running list of open-weight releases and the practical tradeoffs for running them on consumer hardware.
Model release or new LLM announcement after 2026-04-12
Weekly roundup format focused on what is usable, not what is announced.
The Fazm source tree is MIT-licensed at github.com/mediar-ai/fazm. Every file path in this guide is verifiable there.
Four files, two line numbers, one UserDefaults key: that is the whole contract between Fazm and your local LLM.