On-device LLMs · 2026

On-device LLMs in 2026: the update that let local models leave the chat window

Most of what got written about local models this year was a leaderboard: which model, which runtime, how many tokens per second on which Mac. Useful, but it misses the change that actually moved local models from a chat box into real work. That change was not a model at all.

M
Matthew Diakonov
8 min read

Direct answer · verified 2026-05-30

The pivotal on-device LLM update of 2026 was interface, not weights. Ollama and LM Studio both shipped a native Anthropic-compatible /v1/messages endpoint, so a local model now drops into any tool that reads ANTHROPIC_BASE_URL by changing one URL. That includes Claude Code and the desktop agents built on it. In Fazm you paste http://localhost:11434 into Settings > Advanced > Custom API Endpoint and the same agent that drives your browser and native apps runs on a model that never leaves your machine.

Confirmed against Ollama's Anthropic compatibility docs and LM Studio's Anthropic endpoint docs.

What actually changed in 2026

Four things moved this year. The first two are the ones everyone covered. The third is the one that mattered, and the fourth is what it unlocked.

1

Small models caught up to last year's big ones

Through 2026 the 3B to 8B range kept absorbing capability that used to require 30B-plus. MoE designs like Google's Gemma family and the Qwen3 line activate only a few billion parameters per token, so a 32 to 48 GB Mac runs them at interactive speed. The practical effect: a model small enough to live on your laptop became good enough to drive a tool loop, not just answer trivia.

2

MLX became the fastest way to run them on Apple Silicon

Apple's MLX framework pulled ahead of llama.cpp's Metal backend for models under roughly 14B, and Ollama added an MLX backend on Apple Silicon. Unified memory means a 64 GB Mac runs models that will not fit on a 24 GB discrete GPU. Runtime stopped being the bottleneck.

3

The runtimes learned to speak Anthropic

This is the update the round-ups skipped. Ollama and LM Studio both added a native Anthropic-compatible /v1/messages endpoint, with streaming, system prompts, and tool calling supported. For the first time a local model could answer in the exact wire shape that Claude Code and the agents built on it already expect, with no translator in the middle.

Ollama: http://localhost:11434 · LM Studio: http://localhost:1234. Both expose /v1/messages.
4

Which unlocked local models for desktop agents

Once the local server answers the Anthropic Messages API, any tool that reads ANTHROPIC_BASE_URL can point at it. That includes Fazm, which already wraps Claude Code in a native Mac UI and reaches into the browser and native apps through accessibility APIs. Setting one URL swaps the cloud model for a local one without touching anything else about how the agent acts.

The update the round-ups skipped

For most of local-model history, every runtime spoke the OpenAI chat format. That was fine for chat clients, but it meant anything built against the Anthropic Messages API could not talk to a local model directly. Claude Code, and every desktop agent built on top of it, sat on the wrong side of a format gap. You could bridge it with a proxy, but a proxy is one more process to run, one more thing to break.

In 2026 the runtimes closed the gap themselves. Ollama and LM Studio each added a native /v1/messages endpoint that follows the Anthropic Messages specification, including streaming, system prompts, and tool calling. From the caller's side, a local model on your laptop now looks exactly like Anthropic's API. You set one environment variable and a tool that was hard-wired for Claude talks to your local model with no translator in between.

point-anything-at-a-local-model.sh

For backends that still only expose the OpenAI shape, a raw llama.cpp server or vLLM, you put LiteLLM in front: it accepts Anthropic-format requests on its own /v1/messages route and forwards them downstream. Either way the contract the agent sees is identical. That uniformity is the whole point.

How Fazm turns that one URL into a desktop agent

Fazm wraps Claude Code in a native macOS UI and, separately, reaches into your browser and native Mac apps through Apple's accessibility APIs. The agent loop is the real Claude Code, so it already speaks the Anthropic Messages API. That is exactly the format the 2026 local runtimes now answer. The Custom API Endpoint field is the seam where the two meet.

What happens when you paste a URL there is specific and worth knowing, because it explains a couple of gotchas. Fazm validates the value as a real URL before doing anything with it: it requires an http or https scheme and a non-empty host. A bare localhost:11434 fails that check, because it has no scheme, and Fazm quietly falls back to the default Anthropic endpoint rather than injecting a malformed base URL that would make every request throw. Type the scheme. Once it passes, the endpoint becomes ANTHROPIC_BASE_URL for the agent process, and Fazm swaps its bundled Anthropic key for the placeholder sk-fazm-custom-endpoint so it never sends real credentials to a server you control.

ACPBridge.swift

That is the entire mechanism. There is no separate "local model mode," no model bundled with the app, no runtime to babysit inside Fazm. You run your own Ollama or LM Studio, you paste its URL, and the agent that was reaching for Anthropic now reaches for your laptop.

What stays local, what the model decides

It helps to separate two things that "on-device" tends to blur. The first is where the actions happen. Fazm reads the macOS accessibility tree of whatever app is open and acts through Apple's AXUIElement APIs, on your machine, in every configuration. It does not stream pixels of your screen to a model to be interpreted. The clicking, typing, and scrolling are local whether the model is in the cloud or on your laptop.

The second is where the reasoning happens, which is what the Custom API Endpoint decides. With the default setup, the prompt and conversation go to Anthropic through your account. With a local endpoint set, they go to your local server instead. Combine the two and you get something that did not have a clean path a year ago: a desktop agent where both the deciding and the doing stay on the Mac. That is the configuration the 2026 runtime update finally made a one-field change instead of a weekend project.

The honest caveats

Running the agent on a local model is not free of friction, and pretending otherwise helps no one. Two things bite in practice.

First, caching. Ollama's Anthropic compat layer does not support prompt caching, so a long agent run re-processes the system prompt and the whole conversation every turn. On a busy context that is real latency you do not pay against the hosted API. Keep tasks scoped and the context lean.

Second, tool reliability. An agent loop lives or dies on tool calling, and smaller local models are less consistent at picking the right tool than a frontier model. A model that mishandles tool selection can loop on a step that a larger model would clear in one shot. Use a tool-capable model, the 2026 Qwen and Llama tool variants are the usual picks, and treat the smallest models as chat companions rather than agents. For privacy-sensitive desktop work where the data must not leave the machine, the tradeoff is usually worth it. For throwaway tasks, the hosted model is still the faster path.

Want to point Fazm at your own local model?

Bring your setup to a working session and we will wire your Ollama or LM Studio endpoint into the desktop agent live.

Frequently asked questions

What was the biggest on-device LLM update in 2026?

Interface, not weights. In early 2026 both Ollama and LM Studio added a native Anthropic-compatible /v1/messages endpoint. Before that, every local runtime spoke the OpenAI chat format, so anything built against the Anthropic Messages API (Claude Code, and the desktop agents built on top of it) needed a translation proxy in front of the model. After it, you point ANTHROPIC_BASE_URL at http://localhost:11434 (Ollama) or http://localhost:1234 (LM Studio) and the local model answers in the shape those tools already expect. New models kept landing all year, but this one change is what let a local model run a tool loop instead of just chatting in a window.

Can I run a local model behind a computer-use agent on my Mac?

Yes, and the path got short in 2026. A computer-use agent built on Claude Code talks to the model over the Anthropic Messages API. Once Ollama and LM Studio exposed /v1/messages, you can swap the cloud model for a local one by changing the base URL the agent uses. In Fazm that field is Settings > Advanced > Custom API Endpoint. You paste http://localhost:11434, and the same agent that reads your accessibility tree and clicks buttons now runs on a model that never leaves the machine. The actions were always local; this makes the reasoning local too.

Does Fazm itself run an on-device LLM out of the box?

No. Out of the box Fazm wraps Claude Code and Codex, which call Anthropic and OpenAI in the cloud through your own account. The on-device part is opt-in: the Custom API Endpoint setting lets you redirect those calls to a local Anthropic-compatible server you run yourself (Ollama, LM Studio, or a llama.cpp or vLLM server behind a LiteLLM /v1/messages gateway). Fazm does not bundle a model or a runtime. It bundles the wiring that lets your local one drive the desktop.

Why does a bare localhost:11434 not work in the Custom API Endpoint field?

Because Fazm validates the value as a full URL before it forwards it. The check in ACPBridge.swift requires an http or https scheme and a non-empty host. A value like localhost:11434 has no scheme, so it is rejected and Fazm falls back to the default Anthropic endpoint rather than injecting a malformed ANTHROPIC_BASE_URL that would make every request throw 'Invalid URL'. Type http://localhost:11434, with the scheme, and it passes. This is a deliberate guard, not a bug.

Which local backends speak the Anthropic format natively, and which need a proxy?

Ollama and LM Studio speak it natively in 2026: both expose /v1/messages, so you can point ANTHROPIC_BASE_URL straight at them. Backends that only expose an OpenAI-compatible API, like a raw llama.cpp server or vLLM, still need a translator. The common answer there is LiteLLM, which accepts Anthropic-format requests on its own /v1/messages route and forwards them to whatever downstream you configure. Either way, the endpoint Fazm talks to has to answer in the Anthropic Messages shape. A raw OpenAI or Gemini key pasted into the field will not work.

What are the honest tradeoffs of running the agent on a local model?

Two real ones. First, Ollama's Anthropic compat layer does not support prompt caching, so a long agent run re-processes the system prompt and history every turn, which costs latency on a busy context. Second, an agent loop leans hard on tool calling, and smaller local models are less reliable at picking the right tool than a frontier model; a model that struggles with tool_choice can loop. Use a tool-capable model (the 2026 Qwen and Llama tool variants are the usual picks) and keep tasks scoped. For privacy-sensitive desktop work where the data must not leave the Mac, the tradeoff is usually worth it.

Do my screen and keystrokes leave the machine when the model is local?

The agent's actions are local in every configuration. Fazm reads the macOS accessibility tree and acts through Apple's AXUIElement APIs on your machine; that part never depends on the model location. What a local endpoint changes is the reasoning: with a Custom API Endpoint set, the prompt and conversation go to your local server instead of Anthropic, and Fazm disables its bundled key so it never sends credentials to your proxy. So with a local model, both the doing and the deciding stay on the Mac.

How did this page land for you?

React to reveal totals

Comments ()

Leave a comment to see what others are saying.

Public and anonymous. No signup.