Local LLM updates, April 2026: the context-size angle no roundup covers

Every April 2026 local LLM roundup measures parameters. The number that actually decides which one can drive your Mac is the size of the context you feed in.

Qwen 3 on April 8. Mistral Medium 3 on April 9. Gemma 4 and the TurboQuant paper on April 11. Llama 3.3 70B still at the top of the general-purpose chart. Every roundup published this month ranks them by params, benchmarks, and VRAM. None of them answer the one question that matters if you actually want a local model to take actions on your Mac: how many tokens does a real desktop agent need per turn? This guide measures it, from the traversal files a shipping Mac agent leaves on disk. The number is smaller than every screenshot-based guide would lead you to believe, and that changes which April 2026 models are in play.

Fazm · 12 min read
Context-size numbers measured from real /tmp/macos-use/*.txt files on disk
Every file and line reference points at a real location in the Fazm source
ANTHROPIC_BASE_URL bridge lets any local LLM take over today

Local LLM updates that landed in April 2026

Qwen 3 (0.6B - 72B) · Qwen 3 32B @ 4-bit · Mistral Medium 3 · Gemma 4 (27B) · Gemma 3 license refresh · Llama 3.3 70B · TurboQuant (ICLR) · Ollama · LM Studio · MLX · vLLM · 7 open-source drops / 12 days

What actually dropped for local LLM users in April 2026

The six items every aggregator agreed on. Four model releases, one research paper, one license event, plus the steady count of smaller open-weight drops. What follows on this page is the part the aggregators skipped: translating these into a pick-order for a real Mac agent.

Qwen 3 (Alibaba) — April 8

Full lineup from 0.6B to 72B. Dual-mode thinking: each model can run in slow-chain-of-thought or fast-direct modes. Qwen 3 32B at 4-bit fits on a 24 GB GPU and matches GPT-4o on several reasoning benchmarks.

Mistral Medium 3 — April 9

Open weights, strongest European-language performance, first release to ship EU AI Act compliance metadata alongside the weights.

Gemma 4 (Google, Apache 2.0) — April 11

9B and 27B variants. Hits Ollama and MLX within days. Same weekend Gemma 3's license was refreshed to remove the old user-count cap.

Llama 3.3 70B — still the default

Best overall locally-runnable model per llm-stats. Wants 48 GB+ for Q4. Most reliable on structured output and tool-calling.

TurboQuant paper (ICLR, April 11)

PolarQuant rotation plus Quantized Johnson-Lindenstrauss projection cuts KV cache memory. Shifts long-context economics, not shipped weights.

Seven open-source model drops in 12 days

llm-stats counted seven major open-weight releases in the first twelve days of April 2026 alone. Most landed in Ollama and MLX within hours of the official drop.

7 open-source model drops in first 12 days of April
24 GB VRAM for Qwen 3 32B at 4-bit quantization
23,434 bytes in Fazm's real full-window traversal
388 UI elements in that same traversal

The number no April 2026 local LLM roundup publishes

Read llm-stats, read Julien Simon's Medium post, read PromptQuorum, read Till Freitag's open-source comparison. Every one of them ranks the April 2026 local LLM updates by parameters, VRAM, and benchmark score. Every one of them stops before answering the next question: can this model actually drive a real Mac, and what does that take?

Driving a Mac means emitting tool calls against the UI state of whatever app is in front of you. The model needs two things: reliable JSON tool-use discipline, and enough context headroom to fit the current app state plus a sensible history plus the tool schema. The first is a model-training problem. The second is a number, and the number depends on whether your agent sends the screen as pixels or as a text tree.

For a screenshot-based agent, a 1024x768 PNG encodes to 1,500 to 6,000 tokens on today's vision LLMs. Five turns of that and you have eaten 30k tokens of context, with no room for history. A 7B local model with 32k context window is out of runway by turn 4. Qwen 3 32B at 4-bit on a 24 GB GPU manages maybe six turns before compaction kicks in.
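The turn arithmetic above can be sketched directly. The window size and reserved headroom below are assumptions for illustration (a typical 32k-context small model, ~2k tokens set aside for system prompt and tool schema), not measured values:

```python
# Back-of-envelope turn budget. CTX and RESERVED are illustrative assumptions.
CTX = 32_000       # context window of a typical small local model
RESERVED = 2_000   # assumed headroom for system prompt + tool schema

def turns_before_overflow(tokens_per_turn: int,
                          ctx: int = CTX,
                          reserved: int = RESERVED) -> int:
    """Full turns that fit if every turn appends tokens_per_turn tokens."""
    return (ctx - reserved) // tokens_per_turn

screenshot_turns = turns_before_overflow(6_000)  # worst-case PNG encoding -> 5
tree_turns = turns_before_overflow(1_000)        # mid-sized accessibility tree -> 30
```

The same window that holds five screenshot turns holds thirty tree turns, which is the whole argument of this page in two function calls.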

For an accessibility-tree agent, the math is different by an order of magnitude. The rest of this page is that math, measured from a real shipping consumer Mac app's traversal files on disk.

Real bytes from a real Mac agent session

These are the traversal files left on disk by Fazm's macos-use MCP tool after a recent session. Each file is the plain-text accessibility tree captured at one moment in one app's window, passed verbatim to the selected model as context. Run wc -c /tmp/macos-use/*.txt on any Mac that has run Fazm in the last hour and you get the same shape.

wc -c /tmp/macos-use/*.txt

The initial traversal of Fazm's own window, containing 388 UI elements, is 23,434 bytes. The largest click-and-traverse call (which captured a larger, more complex view) is 42,318 bytes. The smallest, a narrow click on a single-pane result, is 1,351 bytes. The entire nine-call session fits in 112,586 bytes, or roughly 28k tokens, including every app's tree and the raw history.

112,586 B

The full file list and byte counts above are taken verbatim from /tmp/macos-use on a machine that had just run Fazm. No estimates.

wc -c /tmp/macos-use/*.txt, captured 2026-04-16
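Converting those byte counts to the ~28k-token figure implies roughly 4 bytes per token. That ratio is an assumption (a real tokenizer varies by model), but it makes the conversion reproducible:

```python
def approx_tokens(byte_count: int, bytes_per_token: float = 4.0) -> int:
    """Rough token estimate for plain-text accessibility trees.

    bytes_per_token ~ 4 is an assumed average for English text,
    not the output of any specific tokenizer.
    """
    return int(byte_count / bytes_per_token)

session = approx_tokens(112_586)   # whole nine-call session, ~28k tokens
largest = approx_tokens(42_318)    # largest single traversal, ~10.5k tokens
smallest = approx_tokens(1_351)    # narrow click result, ~340 tokens
```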

Why the tree is text, not pixels

Fazm reads the macOS accessibility API directly. The call that starts the traversal is in Desktop/Sources/AppState.swift around line 439. No screenshot is ever taken for primary context. The tree arrives as lines that name each element by role, visible text, and screen coordinates, in a format any text-only LLM can parse without a vision encoder.

Desktop/Sources/AppState.swift, ~line 439

The pipeline, end to end

Fazm's desktop process reads the tree locally, hands it to the ACP subprocess, which posts it over an Anthropic-shaped Messages API to whichever endpoint is active. If you set ANTHROPIC_BASE_URL via the Custom API Endpoint field, that endpoint can be your local Qwen 3 or Gemma 4 instead of Anthropic's servers. Every April 2026 local LLM update plugs in at the rightmost node of this diagram without a Fazm release.

Fazm tree + any April 2026 local LLM

macOS accessibility API
Frontmost app tree
Session history
Fazm ACP subprocess
Qwen 3 32B
Gemma 4 27B
Llama 3.3 70B
Claude Sonnet 4.6

Why an accessibility-tree agent beats a screenshot agent for local LLM use

The numbers that decide whether your April 2026 local LLM can actually drive your Mac.

Feature | Screenshot agent | Fazm (AX tree)
Context per turn (typical app window) | 500 KB - 2 MB (PNG) | 1 KB - 42 KB (text)
Token cost per turn | ~1,500 - 6,000 (vision tokens) | ~300 - 10,000 (text tokens)
Vision capability required | Yes (multimodal model) | No (text-only model works)
Click targets named as | Pixel coordinates from OCR | Role + label from AX tree
Context fits on 7B local model (32k ctx) | 3-5 turns before overflow | 30+ turns with headroom
Deterministic input across model swaps | No (PNGs vary slightly) | Yes (same tree every time)

Wiring any April 2026 local LLM into Fazm in five steps

The setting that makes this work is a single field in Settings > Advanced. No app update, no build from source, no feature flag.

1

Pick a local inference server

Ollama, LM Studio, vLLM, or MLX all work. Pull the April 2026 model of your choice (Qwen 3 32B, Gemma 4 27B, Mistral Medium 3, Llama 3.3 70B). Confirm you can hit it at a local URL like http://localhost:11434.
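Before wiring anything else up, it is worth confirming that something actually answers at that URL. A generic reachability probe (this is not a Fazm or Ollama API, just plain HTTP; the default URL is Ollama's standard port):

```python
import urllib.error
import urllib.request

def server_reachable(url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """True if anything answers HTTP at the given URL."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server answered, even if with an error status
    except OSError:
        return False  # refused, unreachable, or timed out
```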

2

Run an Anthropic-shape proxy in front

Fazm speaks the Anthropic Messages API to the ACP subprocess. Put a small translator in front of your local server that converts /v1/messages requests into your server's /api/chat (Ollama) or /v1/chat/completions (LM Studio, vLLM) shape, including tool-use arguments.
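A minimal sketch of the request-shape half of that translator, mapping an Anthropic Messages body to Ollama's /api/chat shape. This is one direction only (a working proxy must also translate the response and streaming back into Anthropic shape), and the model tag qwen3:32b is a placeholder:

```python
def anthropic_to_ollama(body: dict, model: str = "qwen3:32b") -> dict:
    """Map an Anthropic /v1/messages request body to Ollama /api/chat shape."""
    messages = []
    if body.get("system"):
        # Anthropic carries the system prompt as a top-level field
        messages.append({"role": "system", "content": body["system"]})
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):
            # Flatten Anthropic content blocks down to their text parts
            content = "".join(
                block.get("text", "") for block in content
                if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    tools = [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                "parameters": t.get("input_schema", {}),  # Anthropic's JSON Schema field
            },
        }
        for t in body.get("tools", [])
    ]
    return {"model": model, "messages": messages, "tools": tools, "stream": False}
```

The structural work is almost entirely in the tool definitions: Anthropic's name/description/input_schema triple becomes the function-call shape that Ollama (and the OpenAI-compatible servers) expect.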

3

Paste the proxy URL into Fazm

Fazm Settings > Advanced > Custom API Endpoint. UI is at Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 933. It writes to @AppStorage("customApiEndpoint"). No app restart required; the value is read on every new chat session.

4

Verify the env var is getting picked up

When a chat starts, Desktop/Sources/Chat/ACPBridge.swift line 380 sets env["ANTHROPIC_BASE_URL"] to your URL before spawning the ACP subprocess. Tail /tmp/fazm-dev.log and you will see the exact env value. If it is empty, the field did not save.

5

Run a tool-heavy query first

Ask Fazm to 'open Mail and draft a reply to the top thread.' That exercises app-switching, accessibility-tree traversal, and a multi-step tool chain. If your local model picks the right tool names 9 out of 10 tries, it is usable. If it hallucinates tools more than twice, drop back to Claude Sonnet.
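The 9-out-of-10 bar in step 5 is easy to score by hand; as a trivial helper (the tool names are real Fazm identifiers quoted elsewhere on this page, the trial data is invented for illustration):

```python
def tool_pick_accuracy(picks: list, expected: str) -> float:
    """Fraction of trials in which the model emitted the expected tool name."""
    return sum(p == expected for p in picks) / len(picks)

# Hypothetical run: nine correct picks, one wrong-tool pick
trials = ["macos-use.click_and_traverse"] * 9 + ["macos-use.type_and_traverse"]
usable = tool_pick_accuracy(trials, "macos-use.click_and_traverse") >= 0.9  # True
```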

The two lines of Swift that make local LLMs addressable

Every detail above is downstream of these two lines. They live in Desktop/Sources/Chat/ACPBridge.swift at lines 379 and 380. If the user has filled in the Custom API Endpoint setting, the ACP subprocess gets ANTHROPIC_BASE_URL set to that value before it spawns. Every call Fazm makes to the model now goes to your local inference server instead.

Desktop/Sources/Chat/ACPBridge.swift, lines 378-381

That is the entire bridge from Fazm to the April 2026 local LLM of your choice. Two lines. One user-facing setting. Zero model updates shipped by Fazm to add support for Qwen 3, Gemma 4, Mistral Medium 3, or anything else released this month.
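The mechanism is easy to demonstrate outside Swift: set the variable in the child's environment before spawning, and the subprocess inherits the override. A Python sketch (the URL is a placeholder; the child here just echoes the variable, standing in for the ACP subprocess):

```python
import os
import subprocess
import sys

env = dict(os.environ)
env["ANTHROPIC_BASE_URL"] = "http://localhost:8080"  # placeholder proxy URL

# Spawn a child that reports the variable it sees in its own environment
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['ANTHROPIC_BASE_URL'])"],
    env=env, capture_output=True, text=True,
)
seen = child.stdout.strip()  # "http://localhost:8080"
```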

Practical pick order for April 2026, for Mac-agent use

Ranked for tool-call reliability against Fazm's accessibility tree, not general benchmark score. A model that wins MMLU is not automatically the one that will correctly pick macos-use.click_and_traverse over macos-use.type_and_traverse ten times in a row.
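One cheap guardrail when evaluating any of the models below: reject every emitted call whose tool name is not in the roster, so hallucinated tools fail loudly instead of silently. A sketch, using a two-entry subset of the real roster (a real check would load the names from the MCP manifest):

```python
KNOWN_TOOLS = {
    "macos-use.click_and_traverse",
    "macos-use.type_and_traverse",
}  # illustrative subset of the real tool roster

def is_valid_tool_call(call: dict, roster: set = KNOWN_TOOLS) -> bool:
    """True iff the emitted call names a tool that actually exists."""
    return call.get("name") in roster

is_valid_tool_call({"name": "macos-use.click_and_traverse"})  # real tool
is_valid_tool_call({"name": "macos-use.click_element"})       # hallucinated
```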

First pick

Qwen 3 32B, 4-bit, thinking mode

Strongest tool-use on open weights as of April 2026. Fits in 24 GB VRAM. Thinking mode burns latency but picks tool names correctly on longer chains. Context budget is not a concern against Fazm's ~10k-token-per-turn tree.

If you have the hardware

Llama 3.3 70B

Best overall locally-runnable model per llm-stats. Wants 48 GB+ for Q4. Most reliable on structured output. The one local LLM this month that approaches Claude Sonnet on long tool-chains without hallucinated tool names.

If you are in the EU

Mistral Medium 3

Ships with EU AI Act compliance metadata. Best European-language tool-use this month. Weaker on rare tool names than Qwen 3 but meaningfully better on non-English UI trees.

Fastest on smaller hardware

Gemma 4 27B

Apache 2.0 since April 11. Lands in Ollama and MLX within days of release. Fastest tokens-per-second at this size class. Less reliable on multi-tool schemas than Qwen 3; fine for simpler Fazm workflows.

Want the text-tree pipeline without writing the proxy yourself?

Fazm ships with Claude Sonnet 4.6 by default (label: Fast). Run the same accessibility-tree flow the rest of this page describes, then flip the Custom API Endpoint the moment a local model's tool-calling catches up.

Download Fazm

Frequently asked questions

What are the big local LLM updates in April 2026?

Five land on every roundup. Qwen 3 (0.6B through 72B, dual-mode thinking) on April 8. Mistral Medium 3 with open weights and EU AI Act compliance metadata on April 9. Google Gemma 4 (9B and 27B variants, Apache 2.0) on April 11 alongside a Gemma 3 license refresh that removed the old user-count cap. Llama 3.3 70B remains the best general-purpose locally-runnable model per llm-stats. Qwen 3 32B quantized to 4-bit fits on a single 24 GB GPU and matches or beats GPT-4o on several reasoning benchmarks. What no roundup tells you is which of these can actually drive a Mac agent, which depends on a number no one publishes.

What does 'drive a Mac agent' actually require from a local LLM in April 2026?

Two things: reliable JSON tool-calling, and enough context headroom to fit the target app's UI state plus a reasonable turn history. For a screenshot-based agent, context means 1-2 MB of image tokens per turn, which blows past most 32k-context local models after 3-5 turns. For an accessibility-tree agent like Fazm, context is plain text, and the trees Fazm actually captures on disk in /tmp/macos-use/ run 1,351 bytes for a small click result, 23,434 bytes for a full app window with 388 UI elements, and 42,318 bytes for the largest capture in a recent session. That's 0.3k to 10k tokens of text per turn, not a megabyte of pixels. A 7B local model with a 32k context window has room for the tree, a 30-turn history, and tool definitions combined.

Can I actually point Fazm at Qwen 3, Gemma 4, or any other April 2026 local LLM?

Yes, through one setting. Fazm v2.2.0 on April 11, 2026 added a Custom API Endpoint field. The UI lives at Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 933. It writes to the UserDefault key customApiEndpoint. The plumbing is at Desktop/Sources/Chat/ACPBridge.swift lines 379 to 380: the ACP subprocess gets env.ANTHROPIC_BASE_URL set to that URL. Point it at a local proxy that translates Anthropic Messages API requests into whatever shape your inference server speaks (Ollama, LM Studio, vLLM, MLX) and Fazm sends the same accessibility-tree-powered tool calls to your local model. Latency depends on your hardware. Tool-use reliability on smaller open-source models in April 2026 is meaningfully worse than on Claude Sonnet or Opus, and that's the real reason Fazm ships with Claude even though the context is small enough for local.

How large is Fazm's actual context per turn, and why does it matter for local LLM selection?

The real traversal files sitting in /tmp/macos-use/ from one recent Fazm session: 23,434 bytes for the initial open-and-traverse of the Fazm Dev window, then eight click_and_traverse calls averaging 11,144 bytes each, with a minimum of 1,351 bytes and a maximum of 42,318 bytes. Total for the nine-call session: 112,586 bytes, roughly 28k tokens. That's the entire structured UI state Fazm streams to whichever model is selected, over nine turns, as plain text with [Role] 'text' x:N y:N w:W h:H visible on every line. A screenshot agent sending 1024x768 PNGs at the same cadence would send 9 to 18 megabytes and many multiples more tokens. This is the reason text-tree agents are the only viable shape for local-LLM-driven desktop automation in April 2026.

Which specific local LLM should I try first for desktop automation?

As of April 2026, the practical ranking for tool-calling (not general benchmark scores) is: Qwen 3 32B in thinking mode (strongest tool-use on open weights, fits in 24 GB quantized), Llama 3.3 70B (most reliable on structured output but wants 48 GB+), Mistral Medium 3 (best European-language tool-use, EU AI Act metadata), then Gemma 4 27B (fast but less reliable on complex tool schemas). All of these are text-only calls against Fazm's accessibility-tree context, no vision required. None of them will match Claude Sonnet 4.6 on agentic reliability in April 2026, which is why Fazm's three default labels (Scary, Fast, Smart at ShortcutSettings.swift lines 152 to 154) all point at Claude. The Custom API Endpoint is for the user who values local-first strongly enough to accept the gap.

Why is Fazm not shipping a built-in Ollama integration if it's all just HTTP?

Three reasons, visible in the source. First, the April 2026 release window on-disk shows the product prioritizing reliability on Claude, not local parity. Second, the v2.3.2 changelog dated 2026-04-16 in CHANGELOG.json has an explicit entry: 'Tightened privacy language in onboarding and system prompts to accurately say local-first instead of nothing leaves your device.' Fazm's position is that the app runs locally on your Mac and reads the accessibility tree locally, but inference is remote unless you configure the endpoint yourself. Third, wiring Ollama through ANTHROPIC_BASE_URL requires a proxy that translates the Anthropic Messages API into Ollama's /api/chat shape. The translator has opinions about tool-use formatting and system prompts that vary per model, so Fazm leaves that shim to the user's proxy of choice rather than baking one in.

Does tool-calling reliability on local LLMs in April 2026 match the benchmarks you see on leaderboards?

No. The public leaderboards (MMLU, BBH, reasoning benchmarks in the llm-stats aggregates) measure completion quality, not the narrow skill of 'emit valid JSON tool calls against a 20-tool schema, recover from one malformed call, and never hallucinate a tool that doesn't exist.' Fazm's tool roster includes macos-use, gmail, playwright-extension, and others in the .mcp.json manifest. A 7B local model will pick the right tool on the easy cases and then invent a tool name on the hard ones. The TurboQuant paper from ICLR 2026 on April 11, which cuts KV cache memory with PolarQuant rotation plus Quantized Johnson-Lindenstrauss projection, helps serving economics but does not make small local models better at tool-call discipline. Expect to run a local model at Fast (Sonnet) quality, not Smart (Opus) quality, on tool-heavy workflows this month.

Can I verify the context-size numbers in this guide myself?

Yes. The accessibility tree files live at /tmp/macos-use/*.txt after any Fazm session that uses the macos-use MCP tool. Run 'wc -c /tmp/macos-use/*.txt' on any Mac that has run Fazm recently and you'll see the same shape: a 20k to 40k bytes initial traversal followed by much smaller click traversals. The AXUIElementCreateApplication call that reads the tree is in Desktop/Sources/AppState.swift at line 439 of the Fazm source tree. The Custom API Endpoint plumbing is at lines 379 to 380 of Desktop/Sources/Chat/ACPBridge.swift. The three hardcoded default model labels that make up Fazm's picker live at lines 151 to 155 of Desktop/Sources/FloatingControlBar/ShortcutSettings.swift. Every claim in this guide maps to a specific file and line range.

What this month actually changed

The April 2026 local LLM updates narrowed the tool-calling gap. Qwen 3's thinking mode reliably picks the right tool on short-to-medium chains. Llama 3.3 70B is steady enough that someone with the VRAM to run it can replace Claude Sonnet for non-critical workflows today. TurboQuant makes the serving math cheaper for every model. The gap to Claude Opus on long tool-chains is still real; that is the honest reason Fazm's defaults still point at Claude.

What the month did not change is the context-size story. A text-based accessibility tree remains an order of magnitude smaller than a screenshot. That single fact is what makes local LLM-driven desktop automation viable on a laptop at all in 2026. The agents that send screenshots as primary context will still be context-bound on local models a year from now. The agents that send the tree will not.

fazm · AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
