YEAR IN REVIEW · MODEL LAYER → AGENT LAYER

On-device LLM updates 2026, and the 3 Swift lines that point any of them at your Mac

The brain layer matured fast this year. Apple shipped a stable Swift API for the on-device foundation model. MLX became the default Apple Silicon inference framework after three WWDC sessions and landed inside Ollama on March 30. Four frontier-class open-weight models dropped in 48 hours on April 28 and 29. The body layer, the agent loop that actually clicks around inside your real Mac apps, is still where most of the engineering work has to land. Below is the calendar with sources, and the three lines in Fazm that close the gap.

M
Matthew Diakonov
11 min read
4.9from Sourced from Apple, Ollama, vLLM release notes and the Fazm source tree
Apple Foundation Models matured Q1-Q2 2026
Ollama v0.19 (Mar 30): MLX default on Apple Silicon
4 frontier open-weights in 48h (Apr 28-29)
M5 Max prefill: 1,810 tok/s on Qwen3.5-35B-A3B
ACPBridge.swift lines 527-530: the proxy hook

DIRECT ANSWER · VERIFIED 2026-05-17

What actually changed on-device in 2026

Three streams shipped at once. Apple's Foundation Models framework matured through 0 quarters into a stable Swift API exposing the ~3B parameter on-device model behind Apple Intelligence, with guided generation against Swift @Generable types. MLX consolidated as the default Apple Silicon inference framework and reached production-grade integration inside Ollama on March 30, 2026 (v0.19), where M5 Max prefill hits 0 tok/s on Qwen3.5-35B-A3B in NVFP4. The open-weight calendar concentrated into a 0-hour cluster on April 28 and 29 with four frontier-class releases under MIT, NVIDIA Open Model Agreement, Apache 2.0, and Modified MIT licenses. OpenBMB then shipped MiniCPM-V 4.6 1.3B on May 11 under Apache 2.0. The implication for Mac users is not which model to pick, it is that the brain layer is now mature enough that the bottleneck has shifted to the agent loop, and the entire agent-loop integration in Fazm is 0 lines of Swift in ACPBridge.swift at lines 527 to 530. Primary sources: Apple ML Research, Ollama MLX blog, github.com/ollama/ollama/releases.

THE NAMES THAT MATTERED THIS YEAR

Apple Foundation Models
MLX
Ollama v0.19
MiMo-V2.5-Pro
Nemotron 3 Nano Omni
Granite 4.1
Mistral Medium 3.5
MiniCPM-V 4.6
Llama 4 Scout
Gemma 4 31B
vLLM v0.20.1
DeepSeek V4

THE CALENDAR

What shipped, in order

One entry per substantive on-device update through May 17, 2026. The list is short on purpose, the year was concentrated. Most of the open-source work landed in roughly six weeks between March 30 and May 11.

1

WWDC 2025: MLX established as the Apple Silicon framework

Three dedicated WWDC sessions on MLX in 2025 marked the formal shift. The Foundation Models framework, a Swift API for the on-device ~3B model, also launched at the same event and shipped with macOS 26 Tahoe.

Source: developer.apple.com/documentation/FoundationModels and the WWDC 2025 session catalogue. The framework matured through Q1 and Q2 2026 into something an app can depend on, with guided generation against a Swift @Generable type.

2

Feb-Mar 2026: Ollama gains MLX preview, then ships it as default on Apple Silicon

MLX support landed in Ollama v0.17.5 (March 2), expanded in v0.19.0 (March 30), and became the recommended Apple Silicon backend by April. On M5 Max the GPU Neural Accelerators raise prefill from 1,154 to 1,810 tok/s and decode from 58 to 112 tok/s on Qwen3.5-35B-A3B in NVFP4.

Source: ollama.com/blog/mlx and the official Ollama release notes on GitHub.

3

Apr 2: Gemma 4 family lands in Ollama (E2B / E4B / 26B / 31B)

v0.20.0 added the four Gemma 4 sizes for direct pull on Apple Silicon, with the 31B coding model later getting a 2x speedup via MTP speculative decoding in v0.23.1 on May 5.

4

Apr 28-29: four frontier-class open-weight releases in 48 hours

Xiaomi MiMo-V2.5-Pro (1.02T MoE / 42B active, MIT) on Apr 28. NVIDIA Nemotron 3 Nano Omni (30B / 3B active multimodal, runs in 25GB) on Apr 28. IBM Granite 4.1 (3B / 8B / 30B dense, Apache 2.0, up to 512K context) on Apr 29. Mistral Medium 3.5 (128B dense, 256K context, Modified MIT, 77.6% SWE-Bench Verified) on Apr 29.

Three of four ship under a permissive license (MIT or Apache 2.0). Mistral Medium 3.5 needs a careful read of the Modified MIT terms for commercial deployments.

5

May 3: vLLM v0.20.1 stabilises DeepSeek V4 inference

vLLM v0.20.1 on May 3, 2026 ships the configurable VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD knob and stabilises DeepSeek V4 inference. Behind a LiteLLM shim it becomes an Anthropic-compatible endpoint that a Mac agent can target.

6

May 5: Ollama v0.23.1 ships Gemma 4 MTP speculative decoding

The 31B coding model picks up an over-2x speed increase on Apple Silicon. Speculative decoding is exactly the workload an agent loop benefits from, because tool-call generation is structured output.

7

May 11: OpenBMB MiniCPM-V 4.6 1.3B (Apache 2.0)

A 1.3B parameter multimodal model with a 262K context, optimised for on-device inference. Too small to be an agent planner, but the right fit for the always-on screen-understanding pass underneath a frontier model.

THE ARGUMENT

The brain layer is mature. The body layer is not.

Read the 2026 on-device updates back-to-back and a pattern shows up. Every entry is in one of three categories: a new model, a faster runtime, or a new API surface for the same on-device model. Every entry stops at the point where the model emits a token.

On a Mac that is roughly half a product. The other half is the loop that takes the model's tokens, executes them as real actions against real software (Mail, Calendar, the browser tab you actually have open, the CRM your business actually uses), and feeds the resulting state back into the next turn. The runtime layer is silent about that half. So is every model card.

This is not a complaint, it is a division of labour. The Apple, MLX, Ollama, vLLM, and Hugging Face teams are doing exactly what their tools should do: shipping faster, smaller, more permissive models and runtimes. The body layer is a different engineering problem with different constraints: macOS accessibility APIs, multi-tool planning, session persistence, screen capture, Google Workspace OAuth, browser extensions. Nothing about that work belongs inside an inference server.

The interesting question for a Mac user in mid-2026 is therefore not which on-device runtime to install. All of them work. The interesting question is what agent loop sits on top, and whether that loop can swap its brain to the runtime you picked. In Fazm's case the answer happens to be three lines of Swift.

THE ANCHOR FACT

ACPBridge.swift, lines 527 to 530

Open Desktop/Sources/Chat/ACPBridge.swift in github.com/m13v/fazm. Scroll to line 527. The whole bridge between any on-device LLM and the Claude Code agent loop is this:

ACPBridge.swift · L527
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"), !customEndpoint.isEmpty {
  env["ANTHROPIC_BASE_URL"] = customEndpoint
}

That is it. The ACP subprocess that is launched right after this block is the unmodified Claude Code agent loop, running locally as a Node child process. It speaks the Anthropic Messages API to whatever URL ANTHROPIC_BASE_URL resolves to. Point that URL at LiteLLM in front of Ollama (or MLX, or llama.cpp, or vLLM), and the same agent loop now reasons with weights that never leave your Mac.

A working local-brain setup

THE REQUEST PATH

What a single turn looks like end to end

The reason an on-device LLM update is interesting to an agent author is that every step in the path below is local. There is no cloud round-trip in this diagram. The user, the UI, the agent loop, the proxy, the model, and the tools all sit on one Mac.

One turn, fully local

UserFazm UIClaude Code (ACP)Local proxyOn-device LLMType or speak a requestSpawn ACP subprocess with ANTHROPIC_BASE_URL=http://127.0.0.1:4000POST /v1/messages (Anthropic shape)Translate, call MLX / Ollama / llama.cppTokens stream backAnthropic-shaped response with tool_use blocksExecute macos-use / playwright tools against the live MacResult rendered in the chat, action visible on screen

THE TOOLS YOU CAN POINT AT IT

Runtimes that plug into the proxy

Eight runtimes covered the 2026 on-device LLM landscape. Any of them, fronted by a LiteLLM Anthropic shim, becomes a valid brain for the agent.

Apple Foundation Models

Built-in ~3B model on macOS 26. Swift API, guided generation, no weights to ship.

MLX

Apple's native ML framework. Unified-memory zero-copy on Apple Silicon. WWDC-default since 2025.

Ollama

MLX backend default on Apple Silicon since v0.19. OpenAI-compatible server on localhost:11434.

LM Studio

GUI + model browser tied to Hugging Face. Runs MLX or GGUF, exposes an OpenAI-compatible local API.

llama.cpp

GGUF reference runtime. Still the broadest model coverage and the easiest CLI to script around.

vLLM

Server-side high-throughput runtime. v0.20.1 (May 3, 2026) stabilises DeepSeek V4 inference.

mlx-lm

Direct MLX CLI / Python entry point. Lowest-overhead path to an MLX model on an M-series Mac.

LiteLLM

Anthropic <-> OpenAI <-> Bedrock <-> local shim. The piece every agent stack ends up adding.

THE GAP, ROW BY ROW

What the runtime layer covers, and what it does not

The runtime layer is excellent at what it does. The agent layer is a different surface area, and the on-device updates of 2026 do not address it.

FeatureOn-device runtimes (Ollama / MLX / vLLM / llama.cpp)Fazm agent loop
Inference (next-token generation)Strong. MLX-backed Ollama runs Qwen3.5-35B at 112 tok/s decode on M5 Max.Delegated entirely. Fazm sends to whatever endpoint you point ANTHROPIC_BASE_URL at.
Model selection / quantisationExcellent. ollama pull, lms get, mlx_lm convert all cover it cleanly.Out of scope. Fazm does not ship a model browser; it sees whatever the proxy speaks.
Reading the screen via macOS accessibility treeNot addressed. Runtime ends at next-token generation.Native. Nine macos-use tools wrap AXUIElement APIs; output is a text file the model greps.
Driving the browser, native apps, Google WorkspaceNot addressed.Native. Playwright MCP for the browser, macos-use for native apps, Google Apps MCP for Workspace.
Persistent chat state across restartsOut of scope.Yes. Every window auto-restored with the full conversation history.
One-click chat forkingOut of scope.Yes. Each chat has a fork button, opens a new window with the full prior context.
Voice-first input (hold a hotkey, talk)Out of scope.Yes, via WhisperKit / Parakeet on-device.

THE FOUR-STEP STACK

How to actually wire one of these models to your Mac

The shortest viable path from a fresh M-series Mac to a working on-device agent loop in 2026 is four steps. Pick a runtime, front it with LiteLLM, paste the URL into Fazm, talk to your Mac. No new languages to learn, no fork of the agent to maintain.

  1. 1

    Pick a runtime

    Ollama v0.19+, mlx-lm, LM Studio, or vLLM. Pull one model that fits your RAM.

  2. 2

    Front it with LiteLLM

    litellm --config litellm.yaml. Map your runtime to the Anthropic Messages shape on port 4000.

  3. 3

    Paste the URL into Fazm

    Settings > AI Chat > Custom API Endpoint = http://127.0.0.1:4000. Save.

  4. 4

    Talk to your Mac

    Hold the hotkey, speak the task. The same agent loop now drives your apps from a local brain.

The model layer is finished work for almost everyone reading this. The interesting engineering left is the body.
F
From the Fazm CLAUDE.md
On the 2026 on-device picture

THE HONEST CAVEATS

Two things that did not improve in 2026

No runtime ships an Anthropic Messages surface natively. Every on-device runtime in the list above speaks either its own dialect or the OpenAI Chat Completions shape. Agents written against the Anthropic Messages API (which is most of the serious agent stacks that use tool_use blocks and the streaming delta shape) still need a shim. LiteLLM is the standard shim, takes about 12 lines of YAML, and runs as a separate process on a separate port. The conversion is mechanical, the operational cost is one extra process to keep alive. Nothing in the 2026 changelog removed this step.

Open-weight models still lag on tool-use planning. The frontier hosted models (Claude, GPT, Gemini) have been trained on millions of tool-use trajectories. The open-weight models will produce a tool call when you ask, and they will produce a structured object when you constrain decoding, but the depth of multi-step planning under the kinds of long-horizon prompts an agent generates is noticeably weaker. The practical answer in mid-2026 is hybrid: a local 8B-30B model on high-volume cheap turns, a frontier model on the planning turns, with the proxy choosing which one to call. The agent loop does not care; the proxy is where the routing lives.

Neither caveat is fatal. Both are visible in the architecture of the stack and easy to engineer around. They are listed here because the rest of this page makes the 2026 on-device picture sound finished, and it is not finished, it is just past the brain bottleneck.

Want to wire your local model into Fazm together?

Twenty minutes on a call. Bring your Mac, your runtime of choice, and the workflow you want the agent to run. We will get from cold install to a working local-brain agent loop on the call.

Frequently asked questions

What is the short version of on-device LLM updates in 2026?

Three streams shipped at once. First, Apple's Foundation Models framework, introduced at WWDC 2025 and matured through Q1 and Q2 2026, exposes the ~3B parameter on-device foundation model behind Apple Intelligence to any Swift app through a stable API, with guided generation that constrains output to a Swift type at the token level. Second, MLX became the default inference framework for Apple Silicon after three dedicated WWDC 2025 sessions, then landed inside Ollama on March 30, 2026 (v0.19), with measured prefill of 1,810 tokens per second and decode of 112 tokens per second on M5 Max running Qwen3.5-35B-A3B in NVFP4 quantization. Third, the open-weight calendar concentrated into a 48-hour cluster on April 28 and 29: Xiaomi MiMo-V2.5-Pro (1.02T MoE, MIT), NVIDIA Nemotron 3 Nano Omni (30B / 3B active, NVIDIA Open Model Agreement), IBM Granite 4.1 (3B / 8B / 30B dense, Apache 2.0), and Mistral Medium 3.5 (128B dense, Modified MIT). OpenBMB then shipped MiniCPM-V 4.6 1.3B under Apache 2.0 on May 11. Five entries in three streams. Everything else in the on-device space this year is a point release inside one of those three.

Did Apple Intelligence and the Foundation Models framework actually change anything for developers in 2026?

Yes, the practical change is that every Apple Silicon Mac running macOS 26 Tahoe ships with a built-in LLM that a third-party app can call without bundling weights, without a network round-trip, and without standing up its own inference server. The Foundation Models framework wraps the same ~3B parameter on-device model that powers Siri, Writing Tools, and notification summaries, and exposes it through Swift. The interesting engineering work is the guided generation feature, which is constrained decoding at the token level against a Swift @Generable type. You declare the shape you want, the model is forced to produce it. The constraint is what makes the framework usable for tool calls and structured extraction, not just text generation. The model is small (~3B), not designed as a general-knowledge chatbot. For the cases where 3B is enough, the integration is one Swift import; for the cases where you need a frontier model, you fall back to the Anthropic-compatible-proxy path described below.

How big a deal was MLX going mainstream in 2026?

It is the difference between Apple Silicon being a place you can technically run a model and a place you actively prefer for inference. MLX is designed around the unified memory architecture: the CPU and GPU share the same physical RAM, so tensor operations cross the boundary with zero-copy, eliminating the CPU-to-GPU transfer overhead that dominates traditional frameworks on discrete GPUs. The 2026 milestone was Ollama v0.19 on March 30, 2026, which made MLX the inference backend on Apple Silicon, not just an optional flag. On M5 series chips the GPU Neural Accelerators apply to both time-to-first-token and tokens per second; Ollama's own blog post on the MLX backend reports prefill jumping from 1,154 to 1,810 tok/s and decode from 58 to 112 tok/s on M5 Max running Qwen3.5-35B-A3B in NVFP4. Those are not theoretical numbers from a research paper, they are the numbers the runtime hits when you `ollama serve` and point a client at localhost:11434. The same generation of changes landed in raw mlx-lm, in LM Studio, and in third-party clients that wrap MLX directly.

Which open-weight models shipped in 2026 that are realistic for a Mac?

The realistic Mac targets sort by parameter count and memory budget. For 64GB Apple Silicon, IBM Granite 4.1 (3B/8B/30B dense, Apache 2.0, up to 512K context) is the cleanest pick, both for license simplicity and because the 30B dense fits with room for a long context. Mistral Medium 3.5 (128B dense, 256K context, Modified MIT) is borderline on 64GB but comfortable on 128GB and benchmarks at 77.6% on SWE-Bench Verified, which makes it the realistic coding pick. NVIDIA Nemotron 3 Nano Omni (30B / 3B active multimodal, runs in 25GB of RAM) is the most efficient multimodal entry, useful when you want the model to read what is on screen as part of an agent loop. Llama 4 70B Q4_K_M sits in roughly 40GB and runs at 25-32 tok/s on M5 Max, which is interactive for chat. Xiaomi MiMo-V2.5-Pro (1.02T MoE, 42B active) is the highest-capability open release but its 42B active footprint is tight on 64GB; it is a 128GB-and-up target. OpenBMB MiniCPM-V 4.6 1.3B is too small to be the planner in an agent loop but is the right always-on screen reader for a Mac that already has a frontier model behind a proxy.

What does any of this have to do with running an agent on macOS?

Most articles in this space stop after listing the models or the runtimes. The model is necessary, the runtime is necessary, and neither one is sufficient. To actually drive your real Mac you need an agent loop that reads the macOS accessibility tree, plans multi-tool sequences, and clicks, types, scrolls inside arbitrary apps. The model layer is the brain, the agent loop is the body. The 2026 picture is that the brain layer matured fast and the body layer is still where most of the engineering work has to land. Fazm is one specific implementation of that body layer: a native Swift app that wraps the Claude Code agent loop via ACP, exposes nine macos-use tools that read and act on the accessibility tree, and lets you swap the brain to any Anthropic-Messages-compatible endpoint, including an on-device proxy in front of an MLX runtime or a local Ollama server. The integration surface is small enough to read in one sitting; see the next question for the exact line numbers.

Where in Fazm does the on-device proxy actually plug in?

Lines 527 to 530 of Desktop/Sources/Chat/ACPBridge.swift. Three lines, plus the comment. The block reads the `customApiEndpoint` string from UserDefaults and, if it is non-empty, writes it to `env["ANTHROPIC_BASE_URL"]` on the spawned ACP subprocess. That is the entire mechanism. Anything that speaks the Anthropic Messages API on a local port (LiteLLM in front of Ollama, LiteLLM in front of vLLM, an MLX-backed proxy, an Apple Foundation Models adapter wrapped in an Anthropic shim) becomes the brain of the agent. The ACP subprocess is the unmodified Claude Code agent loop running locally as a Node process; it sends every model request to ANTHROPIC_BASE_URL, gets back tool calls, executes them against the local macos-use and playwright MCP servers, and feeds the responses back to the model. The code is open at github.com/m13v/fazm and the path is stable; verify it yourself.

Are these on-device updates enough that I can actually go offline?

It depends which definition of offline you mean. If offline means no inference traffic leaves the Mac, yes: pull weights once, point a local proxy at a local MLX or llama.cpp or Ollama server, and the entire agent loop runs without touching the network. The accessibility tree, the tool execution, the chat state, the personal context profile, all of that already stays local in Fazm by design. If offline means no software downloads ever again, no: the agent loop itself, Claude Code via ACP, is Node code that you keep updated through normal channels, and the bundled binaries (Node, ffmpeg, cloudflared) live in Fazm's resources directory and get updated through app releases. The middle ground that most people land on is hybrid: a local 8B or 30B model for high-frequency, low-stakes turns (reading the screen, picking the next click) and a frontier model for hard reasoning, with the proxy in the middle deciding which one to call. The proxy is where you spend the engineering effort; the agent loop just sends requests.

What did NOT change in 2026 that probably should have?

Two things. First, none of the major on-device runtimes ship an Anthropic Messages API natively; you still have to put a shim (most commonly LiteLLM) between the runtime and any agent that expects the Anthropic shape. The conversion is mechanical but it adds a process and a port. Ollama in particular has been adding `ollama launch <integration>` targets all year (`pi`, `hermes`, `kimi-cli`, `cline`, `opencode`, `claude-desktop`) but not a native Anthropic surface on top of its own API. Second, none of the on-device model releases come with the tool-use training that frontier hosted models have. Open-weight models will produce a tool call when you ask, but the depth of multi-step planning is noticeably weaker, especially under the kinds of long-horizon prompts that an agent generates. The practical implication: a hybrid setup is still the right answer for actual agent work in mid-2026, with the local model handling the high-volume cheap turns and a frontier model handling the planning turns.

How do I verify any of these dates and numbers?

Each entry traces to a primary source. Apple Foundation Models updates: machinelearning.apple.com/research/apple-foundation-models-2025-updates and the Foundation Models page in developer.apple.com/documentation. MLX as the WWDC 2025 default and the Apple Silicon throughput numbers: the Ollama blog post 'Ollama is now powered by MLX on Apple Silicon in preview' on ollama.com/blog/mlx. Ollama 2026 release-by-release changelog: github.com/ollama/ollama/releases. The April 28 to 29 open-weight cluster: each model's Hugging Face page (XiaomiMiMo/MiMo-V2.5-Pro, nvidia/Nemotron-3-Nano-Omni-30B, ibm-granite/granite-4.1, mistralai/Mistral-Medium-3.5). MiniCPM-V 4.6: huggingface.co/openbmb/MiniCPM-V-4_6. Fazm's three-line block: github.com/m13v/fazm at Desktop/Sources/Chat/ACPBridge.swift, lines 527 to 530 (verify against the current commit on main; the path is stable but line numbers shift with refactors).

How did this page land for you?

React to reveal totals

Comments ()

Leave a comment to see what others are saying.

Public and anonymous. No signup.