Open models, April 2026

Local LLM releases, April 2026: a month of weights, and the macOS half nobody writes up

Six orgs shipped open-weight models in a two-week window: Gemma 4, Qwen 3, Mistral Medium 3, Llama 4 Scout and Maverick, plus smaller entries from DeepSeek and 01.AI. Every popular guide stops at ollama pull. This one keeps going, because on a Mac the weights are half the problem. The other half is the bridge that lets a model drive your actual apps, and that bridge is called accessibility.

Matthew Diakonov
11 min read

The releases, in order

This is the straight timeline, dense, no commentary. The section after this one is where the interesting part starts. For each release I list the weights that actually turned up on Ollama, LM Studio, and Hugging Face in the same window, since a paper-only release is not a local release.


April 2, 2026 — Google Gemma 4

Apache 2.0, four sizes tuned for different deployment surfaces: mobile, laptop, single-GPU workstation, and server. The 27B variant replaced Gemma 2 27B as the default pick for mixed instruction-following plus retrieval on a single consumer GPU. MLX-ready on day one.


April 8, 2026 — Alibaba Qwen 3 family

Seven sizes from 0.6B to 72B, dense architecture across the board, dual-mode thinking built into every checkpoint. The same weights run in a fast standard mode or a slower reasoning mode triggered by a control token. Llama.cpp, MLX, and vLLM kernels landed within 72 hours. Quantized GGUFs on Hugging Face by April 10.


April 9, 2026 — Mistral Medium 3

Open weights, dense, aimed squarely at the gap between small local models and frontier proprietary. Strongest April 2026 open model for European languages and long-form French, Italian, and Spanish document work. Hosted pricing on Mistral's own API is $2 per million input tokens and $6 per million output, a reference point for what you save by running it locally.


April 10-14, 2026 — Meta Llama 4 rollout

Two Mixture-of-Experts variants shipped across the week. Scout, at 17B active parameters, is the default pick for 12GB VRAM and 16GB unified-memory Macs. Maverick, with a 10M token context window, is aimed at document analysis workloads on higher-end rigs. Quantization kernels for MoE landed in llama.cpp by April 12 after a 48-hour gap where MLX was ahead.


Throughout April — DeepSeek V3.1 and 01.AI Yi 2

Two smaller but notable releases. DeepSeek V3.1 for code-heavy local work. Yi 2 for English-Chinese bilingual. Neither dominated downloads, but both filled specific niches that the big four do not cover as cleanly.

The half everyone skips

You have weights. They respond. Congratulations, you built a chatbot. Now try to get that chatbot to reply to an email in Mail, reschedule a meeting in Calendar, or move a card in Linear. You hit a wall, because a language model does not have hands. The bridge between text and action is the part of the stack that is almost never covered in a local-model writeup. On macOS, that bridge has a specific name.

Where the bridge lives

User intent → Local model → ACP bridge → macos-use MCP → Mail / Slack / Figma / Calendar

The hub in that diagram is not marketing shorthand. It is a real binary named mcp-server-macos-use that ships inside the Fazm app bundle and speaks the Model Context Protocol over stdio. The rest of this piece is about what that binary does, where it lives in the source tree, and why this is the component that actually turns an April 2026 local model into a usable Mac agent.

Anchor: the five built-in MCP servers

Open acp-bridge/src/index.ts in the Fazm repo. Line 63 resolves a path. Line 1056 checks the binary exists. Line 1059 registers it under the name macos-use. Line 1266 locks the set of built-in names. That is five lines of wiring that decide whether the product is browser-only like most web-scale agents, or any-app like a real desktop assistant.

acp-bridge/src/index.ts
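Condensed, that wiring reads roughly like the sketch below. The helper shapes and variable names are assumptions; the five built-in names and the role of each referenced line come from the walkthrough of the bridge source in this article.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// The frozen set of built-in server names (≈ line 1266).
const BUILTIN_MCP_NAMES = Object.freeze([
  "fazm_tools",
  "playwright",
  "macos-use",
  "whatsapp",
  "google-workspace",
]);

// Registry of stdio MCP servers the bridge will spawn.
type StdioServer = { command: string; args: string[]; env: Record<string, string> };
const servers = new Map<string, StdioServer>();

function registerMacosUse(appBundlePath: string): boolean {
  // Resolve the native binary inside the app bundle (≈ line 63).
  const bin = join(appBundlePath, "Contents", "MacOS", "mcp-server-macos-use");
  // Register only if the binary actually shipped (≈ line 1056).
  if (!existsSync(bin)) return false;
  // Register under the name the rest of the bridge looks up (≈ line 1059):
  // no arguments, empty env, stdio transport.
  servers.set("macos-use", { command: bin, args: [], env: {} });
  return true;
}
```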

Five names. Three of them wrap an app or a service the user already uses (Playwright for the browser, WhatsApp, Google Workspace). One is Fazm's own tool-exposing MCP. The fifth, macos-use, is the one that makes the product any-app instead of some-apps. When a model says "open TextEdit, paste this paragraph, save as draft.rtf," the tool call fans out through macos-use, which walks the accessibility tree of the focused app, finds the right AXUIElement, and posts an AXPress or AXSetValue. No pixel inference.

Two ways to reach a running Mac app

At the desktop level, a model has two practical options for observing and acting on a native Mac app. Screenshots, or accessibility. They do not cost the same, and they do not fail the same way. I've run both in production and the difference is not small.

| Feature | Screenshot loop | Accessibility tree (macos-use) |
|---|---|---|
| Input shape | PNG, usually 1-3MB, base64-encoded into tokens | Structured text tree with role, label, value, frame |
| macOS permission prompt | Screen Recording plus the always-on menu bar indicator | Accessibility only |
| Token cost per observation | Hundreds of KB of image data | Single-digit KB of text |
| Behavior under multi-monitor or full-screen | Coordinate math breaks on display changes | Unaffected, the tree does not move |
| Click reliability | Pixel match, fragile to zoom, theme, sidebar width | Deterministic, AXPress against a stable AXUIElement |
| Local-model fit | Requires a vision-capable model, doubles your VRAM | Works with a 7B model that can read text |
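The token-cost row is easy to sanity-check with arithmetic. A sketch, assuming a naive base64-inline pipeline and a rough 4-characters-per-token ratio; vision APIs that accept images natively tokenize them differently, so treat the screenshot figure as an upper-bound illustration:

```typescript
// Base64 inflates payloads by 4/3 (3 input bytes become 4 characters).
function base64Chars(bytes: number): number {
  return Math.ceil(bytes / 3) * 4;
}

// ~4 characters per token is a rule of thumb for ASCII-ish text.
function roughTokens(chars: number): number {
  return Math.ceil(chars / 4);
}

const screenshotTokens = roughTokens(base64Chars(900 * 1024)); // ~900KB PNG
const snapshotTokens = roughTokens(6 * 1024);                  // ~6KB tree text
// The ratio between the two is what the table row is pointing at.
```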

What the accessibility tree actually looks like

Concrete example. Here is what a macos-use snapshot of a Mail compose window gets back, trimmed to the interesting lines. The full snapshot is a few hundred lines. A vision model has to stare at 900KB of pixels to find the equivalent structure. A local Qwen 3 7B can read this snippet in its sleep.

macos-use → accessibility snapshot (trimmed)
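An illustrative rendering of the shape such a snapshot takes — the AX roles are real macOS accessibility roles, but the labels, values, and attribute layout here are invented for the example, not captured output:

```
AXWindow "New Message" (Mail)
  AXTextField "To:"           value=""
  AXTextField "Subject:"      value="Q2 planning notes"
  AXTextArea  "Message body"  value="Hi Anna, draft attached…"
  AXButton    "Send"          AXEnabled=false   (AXDisabled: no recipient yet)
```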

Notice the last line. The agent tried to press Send, the tree said AXDisabled, the agent recovered. Same loop on a screenshot stack would have either clicked a greyed-out button and sent nothing, or burned three vision-model calls inferring why the click did not register. This is the reliability lever the accessibility path buys you. It is also the lever local models benefit from most, because they are the ones with the tightest token budget.

6 open-weight families shipped in 14 days
5 built-in MCP servers in the ACP bridge
10M token context on Llama 4 Maverick
72B Qwen 3 top-end dense size

Sizing the April 2026 picks for Apple Silicon

Unified memory changes the math. On an M-series Mac, the model runs out of the same pool as the OS, so the useful threshold is not "fits in VRAM" but "fits in RAM with something left for the rest of your day." Practical sizing by machine:
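Back-of-envelope, the unified-memory rule works out like this sketch. The 1.1 overhead factor, the 4GB OS headroom, and treating Q4_K_M as roughly 4.5 bits per weight are all rules of thumb, not measurements:

```typescript
// Rough in-memory size of a quantized model. The 1.1 factor covers
// embeddings, norms, and runtime buffers — an assumption, not a benchmark.
function quantizedSizeGB(paramsB: number, bitsPerWeight: number): number {
  return paramsB * (bitsPerWeight / 8) * 1.1;
}

// Unified memory: the model shares RAM with macOS and your apps, so the
// test is "fits with ~4GB of headroom", not "fits in VRAM".
function fitsComfortably(paramsB: number, bitsPerWeight: number, ramGB: number): boolean {
  return quantizedSizeGB(paramsB, bitsPerWeight) <= ramGB - 4;
}

// Qwen 3 32B at ~4.5 bits/weight on a 24GB machine: ~19.8GB against a
// 20GB budget. Tight, which matches the "sweet spot" framing below.
```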

M2 / M3 with 16GB

Llama 4 Scout 17B MoE at Q4_K_M. Active parameters fit, and the MoE expert routing keeps inference snappy even in unified memory. Qwen 3 8B at Q4 is the fallback if you want dense.

M3 Pro / M4 Pro with 24GB

Qwen 3 32B at Q4_K_M. This is the sweet spot. Strongest tool-use on the list of April releases, dual-mode thinking lets you control cost per call, MLX kernel shipped April 9.

M3 Max / M4 Max with 64GB+

Mistral Medium 3 at Q5 or Q6 if you care about document work. Or Qwen 3 72B at Q4 for the highest reasoning ceiling the month produced on open weights.

Mac mini M4 with 16GB

Gemma 4 9B at Q5 for always-on background use. It runs at low wattage, holds quality for summarization and routing, and leaves headroom for an agent loop that also wants to drive apps.

Older Intel Mac

Skip local inference, run Ollama on a separate box and point your agent at the HTTP endpoint. The accessibility half still works the same way, it just talks to a remote brain.
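Pointing the loop at a remote brain is one HTTP call. A minimal sketch against Ollama's standard /api/chat endpoint; the host address and model tag are placeholders for your own setup:

```typescript
// The box with the GPU — placeholder address.
const OLLAMA_HOST = "http://192.168.1.50:11434";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Build the request body for Ollama's /api/chat endpoint.
// stream:false asks for a single JSON object instead of a token stream.
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return {
    url: `${OLLAMA_HOST}/api/chat`,
    body: { model, messages, stream: false },
  };
}

// The agent loop does not care where the brain lives.
async function ask(model: string, messages: ChatMessage[]): Promise<string> {
  const { url, body } = buildChatRequest(model, messages);
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  const data = (await res.json()) as { message: { content: string } };
  return data.message.content;
}
```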

What changes on your desk

If you wire one of these April releases into an accessibility-aware agent loop, the list of things your Mac will start doing on its own looks like this. These are the verbs that stop being manual.

Unlocked when you pair a local LLM with accessibility orchestration

  • Reply to email in Mail without opening the compose window yourself
  • Reschedule a calendar event by voice while you're in another app
  • Move a Linear issue between columns and paste the status into Slack
  • Dictate to Figma, have the agent adjust layers from the accessibility tree
  • Pull a Notion page into a document, strip the header, save as Markdown
  • Watch a new invoice land in Gmail, extract totals, file it, reply
  • Keep Slack replies short by having the agent pre-draft from context
  • Switch Messages threads and paste the right snippet from your clipboard history

The release marquee

Fast pass through what dropped in April. Rolling reference, not every variant.

Gemma 4 2B · Gemma 4 9B · Gemma 4 27B · Qwen 3 0.6B · Qwen 3 4B · Qwen 3 8B · Qwen 3 14B · Qwen 3 32B · Qwen 3 72B · Mistral Medium 3 · Llama 4 Scout 17B MoE · Llama 4 Maverick · DeepSeek V3.1 · Yi 2 34B

One number worth remembering

6
Open-weight families shipped in the first two weeks of April 2026. The part that did not ship alongside them is the orchestration layer. On a Mac, the cleanest layer is a native MCP server that walks the accessibility tree. Fazm bundles one and registers it as the third entry in BUILTIN_MCP_NAMES.

Why we still default the brain to Claude, and how that might change

Honest framing. Fazm's consumer Mac app runs Claude by default today. Opus 4.7 went GA on April 22, 2026, and on the set of real-world agent tasks we care about (multi-step planning, tool argument shaping, recovery from mid-flight errors, handling the kind of ambiguity where a user says "the thing I was working on yesterday") it is still the most reliable lever we can pull. That is a statement about the full loop, not about the weights in isolation. A local Qwen 3 32B is strong at reasoning. The gap is not reasoning, it is the long tail of agent behavior.

That said, the part of Fazm that is already local is most of the product. The app runs on-device. The database lives in an on-device SQLite file. The knowledge graph is two local tables, local_kg_nodes and local_kg_edges. The browser-profile extraction is done locally against the user's Chromium SQLite files. The accessibility loop is local, full stop. The only part of the stack that crosses the network is the call to the Claude API, and when the quality gap closes or when a user strongly prefers a fully-local brain, the ACP bridge is already MCP-shaped, which means a swap is feasible. The repo is at github.com/mediar-ai/fazm if you want to do that swap today.

Field notes for picking a model this month

If the only thing you take from this is what to download, here is the short version.

  • Best all-round local agent brain for a modern Mac: Qwen 3 32B at Q4_K_M. Dual-mode thinking matters for cost control, dense architecture quantizes cleanly, tool-use quality is the best of the April batch on our own tests against macos-use and Playwright MCP.
  • Best small-machine agent brain: Llama 4 Scout 17B MoE. Active parameter count stays in your RAM budget on a 16GB Mac, and the MoE routing makes it feel faster than a dense 17B.
  • Best long-context document worker: Llama 4 Maverick if you have the RAM, Mistral Medium 3 if you want dense and predictable.
  • Best always-on router in the background: Gemma 4 9B. Low wattage, Apache 2.0, good for classification and triage where you do not want a reasoning budget.
  • Code-specific: DeepSeek V3.1 if that is the only thing you do with it.

Running local, wiring into your Mac?

Book a call and I will walk you through how macos-use works end-to-end, and what it takes to point a local Qwen 3 or Llama 4 at the same MCP surface we ship Claude against.

Frequently asked questions

Which local LLMs actually shipped in April 2026?

Between April 2 and April 14, six organizations pushed open-weight releases. Google Gemma 4 landed on April 2 under Apache 2.0 in four deployment-targeted sizes. Alibaba Qwen 3 shipped on April 8 across 0.6B, 1.7B, 4B, 8B, 14B, 32B, and 72B, with a dual-mode system where the same weights run in a reasoning mode and a standard mode. Mistral Medium 3 followed on April 9 with open weights aimed at the gap between small local models and frontier proprietary ones, priced at $2 per million input and $6 output on Mistral's hosted API for reference. Meta Llama 4 ship dates drifted across the April 10-14 window, with the Scout 17B Mixture-of-Experts variant leading the Ollama downloads for 12GB VRAM machines, and the Maverick variant stretching a 10M token context window on higher-end rigs. These are the four families that dominated April downloads on Ollama, LM Studio, and Hugging Face.

Why does running weights locally on a Mac not give you an agent?

Because a language model produces text. An agent produces clicks, keystrokes, file saves, email sends, and API calls. The bridge between the two is what most guides skip. On macOS specifically, the bridge is the system accessibility layer: NSAccessibility and AXUIElement, the same APIs that power VoiceOver. Every native Mac app exposes its UI as a tree of AXUIElement nodes with role, label, value, and position. A model can read that tree, decide what to do, and post a synthetic AXPress or AXSetValue back. That is the shortest, most reliable loop between weights and action on a Mac. You can skip it and send screenshots into a vision model, but you will pay for it in latency, tokens, and brittleness every time a modal pops up or a sidebar changes width.

Where exactly does Fazm wire the accessibility layer?

In acp-bridge/src/index.ts at line 63, the file resolves a path to a native binary called mcp-server-macos-use at Contents/MacOS/mcp-server-macos-use inside the app bundle. At line 1056 the bridge checks that the binary exists, then registers it as an MCP server named macos-use at line 1059 with no arguments and an empty env. That server is one of exactly five built-in MCP names listed at line 1266: fazm_tools, playwright, macos-use, whatsapp, google-workspace. When a Claude or agent query needs to read or act on a Mac app that is not the browser, the call goes through the macos-use MCP, which is doing AXUIElement traversal, not image decoding. That is the behavior that makes the product work with any app.

Can you point a local LLM at this accessibility bridge?

Today Fazm runs the orchestration through Claude via the ACP bridge. The macos-use binary itself is protocol-plain MCP, which means anything that speaks Model Context Protocol over stdio can consume it. If you stand up an MCP-capable local runtime, say a Qwen 3 32B or Llama 4 Scout served through an MCP client shim, you can register the same binary as a server there. The binary does not care what is on the other end of the pipe, only that the caller speaks MCP. So the accessibility half of the story is model-agnostic even though the consumer app ships with Claude.
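A minimal sketch of that consumption path: spawn the binary and speak newline-delimited JSON-RPC over stdio, as the MCP stdio transport specifies. The binary path comes from this article; the protocolVersion string and capability shapes follow the MCP spec's conventions and should be checked against whatever client runtime you use:

```typescript
import { spawn } from "node:child_process";

// Path the article gives for the bundled binary.
const BIN = "/Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use";

let nextId = 1;

// One JSON-RPC 2.0 request per line: MCP's stdio transport is
// newline-delimited JSON.
function rpc(method: string, params: object): string {
  return JSON.stringify({ jsonrpc: "2.0", id: nextId++, method, params }) + "\n";
}

// Notifications carry no id and expect no response.
function notify(method: string): string {
  return JSON.stringify({ jsonrpc: "2.0", method }) + "\n";
}

// The standard MCP opening sequence. A real client waits for the
// initialize response before sending the rest; this shows the wire format.
function handshake(): string[] {
  return [
    rpc("initialize", {
      protocolVersion: "2024-11-05",
      capabilities: {},
      clientInfo: { name: "local-llm-shim", version: "0.1.0" },
    }),
    notify("notifications/initialized"),
    rpc("tools/list", {}),
  ];
}

// Usage sketch:
//   const proc = spawn(BIN, [], { stdio: ["pipe", "pipe", "inherit"] });
//   for (const msg of handshake()) proc.stdin!.write(msg);
```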

Why accessibility APIs instead of screenshots for a Mac agent?

Four reasons. First, determinism: AXUIElement returns structured data with stable identifiers, so 'click the Save button in Figma' is a deterministic tree walk, not a pixel-matching gamble. Second, permissions: a screenshot pipeline on macOS requires Screen Recording and triggers the recording indicator at the menu bar the entire time the agent is active. Accessibility does not. Third, latency and tokens: an accessibility snapshot of a focused window is kilobytes of text, a screenshot is hundreds of kilobytes of base64 that the model has to reason over. Fourth, occlusion and scale: screenshots fail at arbitrary zoom, multiple monitors, and full-screen toggles. Accessibility trees survive all of that because they are not pixel-space.

Which local model release from April 2026 is the best fit for desktop orchestration?

For a 24GB Mac with unified memory, Qwen 3 32B at 4-bit quantization is the strongest point on the curve for tool-use and reasoning, and it ships with a dual-mode that lets you pay for thinking only when the task needs it. For a 16GB machine, Llama 4 Scout 17B MoE runs well because the active-parameter count is low even though the total parameter count is higher. For a 64GB M-series machine, Mistral Medium 3 is the most capable local option for mixed-language, document-heavy work. None of these ship with native MCP client behavior, so to plug any of them into a desktop automation loop you need a client shim on top that translates their tool-call outputs into MCP requests.
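The "client shim" in that last sentence can be very small. A sketch, assuming the common {name, arguments} tool-call shape local runtimes emit after constrained decoding; the tool name in the usage line is hypothetical, not a documented macos-use tool:

```typescript
// The tool-call shape most local runtimes emit.
type ModelToolCall = { name: string; arguments: Record<string, unknown> };

// Wrap a model tool call as an MCP tools/call request. The outer
// JSON-RPC envelope is fixed by the protocol; only params varies.
function toMcpRequest(call: ModelToolCall, id: number) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call",
    params: { name: call.name, arguments: call.arguments },
  };
}

// Usage (hypothetical tool name):
const req = toMcpRequest(
  { name: "open_app", arguments: { bundleId: "com.apple.TextEdit" } },
  7
);
```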

Does this replace Claude inside Fazm for users who care about running fully local?

Not today. Fazm's consumer product runs Claude by default because the reliability of the full agent loop (tool selection, argument shaping, recovery from errors, multi-step planning) is currently set by frontier model quality, and Opus 4.7 was a meaningful jump there on April 22, 2026. The local side of the story is that the rest of Fazm is already local: the app, the database, the knowledge graph, the browser-profile extraction, and most importantly the accessibility automation itself all run on-device. Swapping the brain to a local Qwen 3 or Llama 4 is a future lever, not a present default. The open-source repo at github.com/mediar-ai/fazm is what makes that lever pullable.

How do the April 2026 releases compare on MoE vs dense architecture?

Two of the four families went MoE. Llama 4 Scout and Maverick are both Mixture-of-Experts, which is why Scout at 17B active parameters runs on a 12GB VRAM card despite having a much larger total parameter count. Qwen 3 stayed dense across the family, which makes it easier to quantize cleanly at 4-bit and 8-bit without the expert-routing artifacts you sometimes see. Mistral Medium 3 is dense. Gemma 4 is dense. The practical takeaway: if you have VRAM constraints and want the strongest parameter count per GB, Llama 4 Scout is the answer. If you want the most predictable quantization behavior and the widest quant-kernel support across llama.cpp, MLX, and vLLM, the dense Qwen 3 family is the safer pick.

What is the MCP server model and why does it matter for April 2026 local LLMs?

MCP, Model Context Protocol, is an open stdio-based protocol for servers that expose tools to an LLM client. Each server ships a tool schema, the client calls tools, the server returns results. The reason MCP matters specifically for this month's releases is that every April 2026 open model is good enough at tool-use that the bottleneck stops being the model and starts being the ecosystem of tools it can actually reach. macos-use is one such server. Playwright, whatsapp, and google-workspace are three more that Fazm bundles. The moment a local model speaks MCP, everything built for frontier-model tool-use ecosystems becomes available to it without a rewrite.

What changed on April 22, 2026 that is relevant to local LLM users?

Claude Opus 4.7 went GA with a 1M context window and a material jump in long-horizon coding quality. That is not a local release, but it resets the ceiling that local models are chasing, and in practice it is the reference point against which the April open-weight releases are measured on real agent tasks. The other thing that shipped the same weekend was a refreshed Claude consumer terms page, which broke a lot of client apps that did not distinguish a terms-not-accepted 400 from an auth-failure 400. That second change is invisible to a user running a fully local loop, which is one of the arguments for running a fully local loop in the first place.