The April 2026 local LLM wave is real. Your desktop agent still can't drive a button.
Every local-LLM roundup this month is about models: Llama 4, Qwen 3.5, Gemma 4 with day-one vLLM support, DeepSeek, MoE variants that fit in 16 GB of RAM. Every one of them stops at the model. If you want to run a local model and have it actually drive your Mac, the model is not the hard part. The hard part is the context you feed it. This guide reads those April 2026 updates through the lens of one consumer Mac app whose source tree already solved the context problem, with four files you can read today.
The April 2026 update, in one paragraph
A wave of open-weight releases landed inside roughly two weeks. Llama 4 variants reached public weights. Qwen 3.5 shipped in both dense and MoE. Gemma 4 arrived with day-one vLLM support in v0.19.0 across four variants: E2B, E4B, 26B MoE, and 31B Dense. DeepSeek pushed another open release. The tooling layer kept up: vLLM went v0.18.0 to v0.19.0 to v0.19.1rc0 in 14 days; Ollama and LM Studio picked up the new checkpoints within days. The practical consumer headline is that Q4 quantization puts a strong 13B-class model inside 16 GB of unified memory.
what shipped
Three shipping clusters, read honestly
Models
Llama 4, Qwen 3.5, Gemma 4, DeepSeek pushed open weights. Consumer takeaway: 16 GB of unified memory now runs a strong 13B-class model at Q4 with usable latency. 32 GB opens Gemma 4 26B MoE, which is the current sweet spot for agent-style reasoning per unit of active parameter cost.
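The RAM claims above are simple arithmetic. A back-of-envelope sketch (the flat 2 GB allowance for KV cache and runtime overhead is an assumption here, not a measured figure):

```python
def q4_footprint_gb(params_billions: float, overhead_gb: float = 2.0) -> float:
    """Rough Q4 memory estimate: ~0.5 bytes per parameter for weights,
    plus a flat allowance for KV cache and runtime overhead."""
    weights_gb = params_billions * 0.5  # 4-bit quant ≈ 0.5 bytes/param
    return weights_gb + overhead_gb

# 13B at Q4: ~6.5 GB of weights, ~8.5 GB total -> fits in 16 GB with room.
# 26B at Q4: ~13 GB of weights, ~15 GB total -> wants the 32 GB tier.
print(q4_footprint_gb(13))  # 8.5
print(q4_footprint_gb(26))  # 15.0
```

MoE models complicate this slightly: all expert weights must fit in memory, but only the active parameters cost compute per token, which is why the 26B MoE feels faster per turn than a 26B dense model.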
Serving tooling
vLLM v0.18.0 added gRPC and GPU NGram speculative decoding. v0.19.0 shipped Gemma 4 on day one and flipped async scheduler on by default, which lowers time-to-first-token. v0.19.1rc0 followed the next day. Ollama and LM Studio absorbed the new checkpoints quickly.
Client plumbing
Less visible in the SERP. Fazm shipped custom API endpoints (v2.2.0, 2026-04-11) and custom MCP servers via ~/.fazm/mcp-servers.json (v2.4.0, 2026-04-20). This is the part that turns a fast local model into an agent that can actually touch apps, not a chat window.
the gap nobody names
A fast local LLM with screenshot context is still a bad agent
The standard way a desktop agent tells the model what is on screen is a screenshot plus a prompt. That works for frontier vision models. It falls over on a local LLM, for three reasons stacked on top of each other. Tokens: a 1024-wide PNG is an order of magnitude more context than the accessibility tree for the same window. Fidelity: small local models OCR-hallucinate labels under load; the tree names them. Reach: screenshots cannot see background windows, disabled controls, or app-private panels; the accessibility API can.
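The token gap is easy to estimate yourself. A crude sketch, assuming roughly 4 characters per token and a modest screenshot size (both numbers are ballpark assumptions):

```python
def approx_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Crude token estimate from character count."""
    return int(n_chars / chars_per_token)

png_bytes = 150_000                 # assumed size of a 1024-wide screenshot
base64_chars = png_bytes * 4 // 3   # base64 inflates payloads by ~33%
tree_chars = 2_000                  # a few hundred labels, roles, values

print(approx_tokens(base64_chars))  # 50000 -- tens of thousands of tokens
print(approx_tokens(tree_chars))    # 500   -- a few hundred tokens
```

Two orders of magnitude of context budget, before the model has read a single instruction.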
How the pieces actually connect
A desktop agent that uses a local LLM needs three things in the same room: a local model server, a translation shim between the agent protocol and the server's API, and a screen-context producer that is not a screenshot. Fazm wires the third one in natively, and exposes the first two as configuration. Here is the flow.
Fazm with a local LLM, end to end
the uncopyable part
Four files, two line numbers, and a UserDefaults key
The contract between Fazm and whatever is serving the model is exactly one string. It lives in UserDefaults under the key customApiEndpoint, and it gets exported as ANTHROPIC_BASE_URL on the child process that speaks the Messages API. Four lines of Swift enforce it.
The Settings UI that writes this value is at Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 906 to 952, titled "Custom API Endpoint" with the placeholder https://your-proxy:8766. The second half of the plumbing is tools, not models. Any local-model-backed MCP server registers in ~/.fazm/mcp-servers.json.
“The bundled mcp-server-macos-use binary lives at Contents/MacOS in the .app bundle. It reads the accessibility tree for whatever app is focused and returns structured elements, not pixels. That is what your local LLM gets as context.”
The bundling step for that binary: run.sh line 212, codemagic.yaml line 200.
What you have to do to point Fazm at a local model
Five steps, no magic. The middle three are the ones every roundup skips.
Run a local server
Pick one: Ollama, LM Studio, or vLLM. Pull a model that fits your RAM. April 2026 sweet spots: Llama 3.3 8B Q4 on 8 GB, Qwen 3.5 14B on 16 GB, Gemma 4 26B MoE on 32 GB.
Run an Anthropic-to-OpenAI shim
Fazm speaks the Anthropic Messages API. Your local server speaks OpenAI-compatible. One of the open-source shims in the middle translates the two. Run it on a local port.
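The core of what such a shim does is a request-shape translation. A minimal non-streaming sketch (real shims also map tool-use blocks, stop reasons, and streaming events; this only shows the basic field mapping between the two public API shapes):

```python
def anthropic_to_openai(req: dict) -> dict:
    """Map an Anthropic Messages API request body onto an
    OpenAI-compatible /v1/chat/completions body. Sketch only:
    no tool-use, streaming, or error handling."""
    messages = []
    if "system" in req:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])  # both APIs use role/content pairs
    return {
        "model": req["model"],
        "max_tokens": req.get("max_tokens", 1024),
        "messages": messages,
    }

body = anthropic_to_openai({
    "model": "qwen3.5:14b",
    "system": "You are a macOS agent.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Click Save."}],
})
print(body["messages"][0]["role"])  # system
```

The reverse mapping (OpenAI response back into an Anthropic Messages response) is the other half of the shim's job; the off-the-shelf shims handle both directions.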
Paste the shim URL into Settings
Fazm → Settings → Chat → Custom API Endpoint. Toggle on, paste the shim URL (example: http://127.0.0.1:8766). On submit, the app calls restartBridgeForEndpointChange() and the next query respawns the bridge with the new env var.
Optionally register local-model-backed tools
Edit ~/.fazm/mcp-servers.json. Add any stdio MCP server (including one that wraps your local LLM) with name, command, args, env, enabled. The Settings UI lists it and you can toggle per server.
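A sketch of one entry using the five documented fields. The server name, binary path, and the file's exact top-level shape are assumptions for illustration; check Desktop/Sources/MCPServerManager.swift for the authoritative schema:

```python
import json

# Hypothetical entry: a stdio MCP server wrapping a local Ollama endpoint.
# Field names (name, command, args, env, enabled) come from the guide;
# everything else here is made up for illustration.
server = {
    "name": "local-llm-tools",
    "command": "/usr/local/bin/my-mcp-server",        # hypothetical binary
    "args": ["--backend", "http://127.0.0.1:11434"],  # e.g. an Ollama URL
    "env": {"LOG_LEVEL": "info"},
    "enabled": True,
}
print(json.dumps(server, indent=2))
```

Toggle `enabled` to false to park a server without deleting its entry; the Settings UI exposes the same switch per server.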
Send a message
The accessibility tree of the focused app, your message history, and any tool results go to your local model. The model responds. Fazm executes the tool calls it emits via mcp-server-macos-use. No screenshots leave your machine.
What a verification session looks like
Screenshots vs. accessibility tree, line by line
| Feature | Screenshot agents | Fazm (accessibility tree) |
|---|---|---|
| Context size per window | Tens of thousands of tokens as base64 image | A few hundred tokens of structured labels |
| Requires vision-capable model | Yes. Non-vision local models can't consume it | No. Any text LLM works |
| Small-model reliability (7B-14B) | Low. OCR hallucinations compound | High. Labels are explicit |
| Background window reach | No. Screenshot is foreground only | Yes. AX sees every app |
| Disabled / hidden element state | Guessed from pixel cues | Read directly from AXEnabled, AXHidden |
| Local-only data path | Depends on the vendor | Yes. Local binary, local LLM, zero cloud |
What actually works with what
Any OpenAI-compatible local server works once an Anthropic-to-OpenAI shim is in front of it. Fazm only cares about the URL and the Messages API shape.
Models this month, ranked for Mac agent use
What the top search results get right, and what they miss
The top five results on this query are: LLM news aggregators, price-per-token trackers, open-source model comparisons, Ollama-and-LM-Studio guides, and a hardware buying guide. Each is well-written for its audience. All five stop at the model or the hardware.
The question they do not answer: once you have a running local model, why does a small open-source Mac agent still miss the Save button half the time? The answer is not model quality. It is that most agents still ship a screenshot to the model. This guide is written for the reader who already has a local model and is stuck on the next step.
The checklist, if you are shipping a local-LLM Mac agent this month
Shipping checklist
- Pick a local server (Ollama, LM Studio, vLLM)
- Pull a model sized to your RAM, not the biggest one
- Verify the server's OpenAI-compatible endpoint with curl
- Run an Anthropic-to-OpenAI shim on a local port
- Stop sending screenshots as context if you can help it
- Send accessibility-tree context to the model instead
- Keep tool outputs short; local models hate long prompts
- Register local-backed tools in ~/.fazm/mcp-servers.json
- Expect 7B-14B local models to need tighter prompts than Claude
Wiring a local LLM into a real Mac workflow?
Jump on a 20-minute call; bring the model you picked and the app you want it to drive, and we will wire it up live.
Book a call →
Questions people actually ask about this
Frequently asked questions
What were the actual local LLM updates in April 2026?
The April 2026 news is a clustered shipping wave: Llama 4 variants reached public weights, Qwen 3.5 shipped both dense and MoE, Gemma 4 landed with day-one vLLM support in v0.19.0 across E2B, E4B, 26B MoE, and 31B Dense, DeepSeek pushed another open release, and the tooling layer caught up fast. Ollama and LM Studio picked up the new checkpoints within days. vLLM released v0.18.0, v0.19.0, and v0.19.1rc0 inside a 14-day window. The headline is not any one model; it is that a Q4-quantized 7B model now fits in 4 to 5 GB of VRAM and a 16 GB Mac can serve a strong 13B-class model without swapping.
If the local models are this good, why does my Mac agent still feel broken?
Because the model is not the bottleneck anymore. The bottleneck is what you feed the model. Most agents pass a screenshot of the current screen plus a prompt. Screenshots lose the semantic structure of the UI: button roles, field labels, window hierarchy, enabled state, ARIA values. The local LLM guesses instead of reads. On macOS the alternative exists and has for years: the Accessibility API (AXUIElement) returns the actual tree, with every label, role, value, and relationship named. An agent that consumes the accessibility tree does not OCR a button, it asks the system what the button does. Fazm v1.5.0 (2026-03-27) replaced the screenshot path with this, and the bundled binary that does it is mcp-server-macos-use.
Can Fazm use a local LLM today, or is it Claude-only?
It can use one, indirectly, through the Custom API Endpoint setting that shipped in v2.2.0 on April 11, 2026. You put a URL in Settings, Fazm writes it to UserDefaults under the key customApiEndpoint, and the ACP bridge exports that string as env ANTHROPIC_BASE_URL on the Node subprocess at Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382. If you run an Anthropic-to-OpenAI shim in front of Ollama, LM Studio, vLLM, or any other OpenAI-compatible local server, Fazm talks to that endpoint without an app update. The shim is what makes it work; Fazm itself is protocol-agnostic past the Messages API shape.
What file do I write to add a local-LLM-backed tool to Fazm?
~/.fazm/mcp-servers.json. The Settings UI shipped on 2026-04-20 in v2.4.0 and the file is managed by Desktop/Sources/MCPServerManager.swift. Each entry has a name, a command, args, env, and an enabled boolean, mirroring Claude Code's MCP server schema. You can point it at a local-model-backed MCP server (for example one that wraps llama.cpp or an Ollama endpoint) and it becomes a tool the agent can call by name. The config file path is shown in Settings at Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 1841.
Why is the accessibility-tree approach better than a screenshot for a local LLM?
Three reasons, stacked. First, tokens: the accessibility tree for a typical app window is a few hundred tokens of structured labels; a screenshot re-rendered to 1024 pixels wide is an order of magnitude more and requires a vision-capable model you may not be running locally. Second, fidelity: the tree names the button; the screenshot guesses at it. Small local models OCR-hallucinate field labels under load. Third, reach: accessibility sees background windows, disabled controls, and app-private panels that a screenshot cannot capture at all. If the goal is to drive apps, not describe pixels, accessibility is the better substrate.
Does any of this require Claude, or can I run it entirely local?
The desktop side does not require Claude. The accessibility capture layer is a local Swift binary (mcp-server-macos-use) bundled inside Fazm.app at Contents/MacOS; it does not call out to any cloud. The custom MCP servers you register via ~/.fazm/mcp-servers.json are stdio processes on your own machine. The one part that still goes out today is the model. Point customApiEndpoint at a local server and that stops too. The shim layer in front of a local server has to translate Anthropic Messages API shape to whatever the local server exposes (usually OpenAI-compatible); that is the single external dependency and it runs on your machine as well.
Which April 2026 local model is actually a good fit for a Mac agent?
The short answer: Gemma 4 26B MoE if you have 32 GB of unified memory, Qwen 3.5 14B if you have 16 GB, and Llama 3.3 8B Q4 if you have 8 GB. The longer answer depends on whether you want reasoning or snappiness. Agents that emit tool calls in a loop feel latency-bound, so time-to-first-token dominates. MoE models with fewer active parameters feel faster per turn. Gemma 4's 26B MoE has the best ratio of active-parameter cost to agent-style reasoning quality I have seen this month. If you just want a code-biased completion, Qwen2.5-Coder variants are still very strong and run happily at Q4 on consumer silicon.
What exactly is passed to the local model on a typical Fazm request?
Not a screenshot. The message history plus the output of the active tool calls, which for macOS control means structured accessibility data from mcp-server-macos-use. The system prompt tells the model to save snapshots to files and strip inline base64 screenshots to keep context small; that guidance lives in acp-bridge/src/index.ts around line 313. The practical effect is that even a 7B local model has a realistic shot at picking the right button, because the prompt is labels-and-roles, not pixels.
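To make the contrast concrete, here is a toy rendering of what two UI elements look like as tree context. The field names below are assumptions for illustration, not the actual mcp-server-macos-use output schema:

```python
# Illustrative only: field names are assumed, not the binary's real schema.
elements = [
    {"role": "AXButton", "title": "Save", "enabled": True},
    {"role": "AXTextField", "title": "File name", "value": "report.md"},
]
context = "\n".join(
    f'{e["role"]} "{e["title"]}"'
    + ("" if e.get("enabled", True) else " (disabled)")
    for e in elements
)
print(context)
```

A handful of explicit tokens a 7B model can read verbatim, versus tens of thousands of base64 tokens it would have to OCR.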
Where can I verify these claims in the source?
Four files. CHANGELOG.json in the repo root, search for 2026-03-27 (accessibility switch), 2026-04-11 (custom API endpoint), 2026-04-20 (custom MCP servers). Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382 for the endpoint export. Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 906 to 952 for the Custom API Endpoint UI. Desktop/Sources/MCPServerManager.swift for the ~/.fazm/mcp-servers.json loader. The bundling step for the accessibility binary is at run.sh line 212 and codemagic.yaml line 200. The tree is MIT-licensed at github.com/mediar-ai/fazm.
Is there a real benchmark showing accessibility-tree context beats screenshots?
The honest answer is that public benchmarks for agent-driven UI control still lean heavily on screenshot-plus-vision because that is what vision-capable frontier models are tuned for. The argument for accessibility-tree context on a local LLM is mechanistic, not benchmarked: small local LLMs handle short structured strings far better than noisy OCR from screenshots, and the tree preserves information (disabled state, role) that OCR discards. If you are running a local model below 30B parameters and care about reliability, ship tree to it, not pixels. If your local model has strong vision and runs at 70B, either works; above that scale the model absorbs the noise.
Does Fazm require installing Ollama or LM Studio to use a local model?
No. Fazm does not care which local server you use, only that it speaks an OpenAI-compatible surface reachable at the URL you put in Custom API Endpoint and that something translates Anthropic Messages into that surface. Ollama has an OpenAI-compatible endpoint built in; LM Studio does too; vLLM does. The Anthropic-shim layer is the piece you install yourself; there are two or three open-source shims that do this well. Then you flip the Custom API Endpoint toggle in Settings, paste the URL, and restart the chat. That is it.
Related guides
vLLM release 2026 update: three ships in 14 days
What changed in vLLM v0.18.0, v0.19.0, and v0.19.1rc0, and why upgrading your server means zero changes to your Mac agent.
Open-source LLM news 2026
A running list of open-weight releases and the practical tradeoffs for running them on consumer hardware.
Model release or new LLM announcement after 2026-04-12
Weekly roundup format focused on what is usable, not what is announced.
The Fazm source tree is MIT-licensed at github.com/mediar-ai/fazm. Every file path in this guide is verifiable there.
Four files, two line numbers, one UserDefaults key: that is the whole contract between Fazm and your local LLM.