Build range b8913 - b8925

The llama.cpp releases of April 2026, read as a swap-in backend for a Mac agent

The April 2026 llama.cpp builds tightened Metal reporting, fixed prefix caching for Anthropic-style clients, and added a parser fix that matters for tool-calling agents. This page annotates each release from the perspective of a native Mac app that can actually consume them, and shows the exact one-key hook inside Fazm that points it at a loopback instead of Anthropic.

Matthew Diakonov · 10 min read · 4.8 from early Fazm users
Open-source Mac app at github.com/mediar-ai/fazm
customApiEndpoint key mapped to ANTHROPIC_BASE_URL in one file
Metal GPU description now visible in llama-server boot log

The Mac-relevant builds, annotated

llama.cpp tags a new build almost every time someone merges. Most of the April 2026 entries are not load-bearing on an Apple Silicon Mac. These are, in order, the ones that actually are.

Source of truth: github.com/ggml-org/llama.cpp/releases and the weekly project report for llama.cpp dated April 06 to 13, 2026.

1

b8913 - Apr 24, 03:00 UTC, WebGPU shader buffer aliasing fix

Fixed a buffer aliasing edge case in the WebGPU RMS fusion path. Not load-bearing for native Mac builds, but worth noting because it was the first b89xx tag of the day.

2

b8914 - Apr 24, 07:14 UTC, Hexagon SOLVE_TRI + f32 vectorization

Adds SOLVE_TRI on Qualcomm Hexagon and vectorizes some f32 paths. Matters for Android/DSP targets. Does not change anything on Apple Silicon.

3

b8919 - Apr 24, 14:25 UTC, Jinja template warnings under clang 21

Silences a pile of warnings when building the Jinja template engine with clang 21. You already see these if you build from source on a current Xcode toolchain on macOS.

4

b8920 - Apr 24, 15:24 UTC, metal: print GPU description

The headline Mac-relevant change. llama-server now logs the Metal GPU it picked up at startup, including core count and capability flags. First reliable way to verify your M3 or M4 GPU is actually doing the work if you wire llama-server behind a proxy.

5

b8922 - Apr 24, 19:58 UTC, WebGPU flash attention on browsers without subgroup matrix

Adds a tile-based flash-attention fallback so WebGPU can run on browser configurations that lack subgroup matrix extensions. Irrelevant for native Mac builds, but confirms WebGPU is a parallel effort, not a replacement for Metal.

6

b8924 - Apr 24, 23:15 UTC, Hexagon HMX frequency bumped to max corner

Pushes the Hexagon Matrix eXtension clock to its top frequency bucket for throughput. Qualcomm-specific. If you see this in the tag name, you are not on an Apple GPU path.

7

b8925 - Apr 24, 23:39 UTC, parser: fix structured-output bug

Fixes a flaw in the structured-output parser. For a tool-calling agent (Fazm runs tool calls on most turns) this is the kind of fix you actually feel: malformed JSON from the model no longer crashes the response pipeline.
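The failure class is easy to reproduce on the client side. A hedged sketch, in Python rather than llama.cpp's actual C++ parser: the property the b8925-style fix restores is that malformed JSON from the model becomes a recoverable error instead of taking down the response pipeline. The `parse_tool_call` name and the return shape are illustrative, not anything Fazm or llama.cpp actually exposes.

```python
import json

def parse_tool_call(raw):
    """Defensively parse a model's structured tool-call output.

    Illustrative only: malformed JSON yields an error result the
    caller can surface or retry on, rather than an exception that
    crashes the turn.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return {"ok": False, "error": f"malformed tool call: {e.msg}"}
    if not isinstance(call, dict) or "name" not in call:
        return {"ok": False, "error": "missing tool name"}
    return {"ok": True, "name": call["name"],
            "arguments": call.get("arguments", {})}
```

An agent that runs a tool call on most turns hits the malformed branch often enough that the difference between "error result" and "crash" is the difference between a retried turn and a dead session.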

The one hook in Fazm that makes any of this consumable

A new llama.cpp build is only interesting if your consumer Mac app can actually use it. Most native macOS agents hard-code their cloud vendor. Fazm does not. The entire backend-swap mechanism is one UserDefaults read in one Swift file. If you change the value, the next time the ACP bridge starts it passes the new URL to the Claude agent subprocess as ANTHROPIC_BASE_URL.

Desktop/Sources/Chat/ACPBridge.swift - customApiEndpoint handoff
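The Swift source itself is two lines; as a hypothetical Python rendering of the handoff the caption describes (names here are illustrative, not the real API): a non-empty `customApiEndpoint` value is exported to the agent subprocess as `ANTHROPIC_BASE_URL`, and an empty value leaves the default cloud path untouched.

```python
import os

def agent_env(custom_api_endpoint, base_env=None):
    """Build the environment for the agent subprocess.

    Python sketch of the ACPBridge handoff: if the setting is
    non-empty, export it as ANTHROPIC_BASE_URL; otherwise make sure
    the variable is unset so the agent talks to Anthropic directly.
    """
    env = dict(base_env if base_env is not None else os.environ)
    if custom_api_endpoint:
        env["ANTHROPIC_BASE_URL"] = custom_api_endpoint
    else:
        env.pop("ANTHROPIC_BASE_URL", None)
    return env

# Hypothetical spawn of the agent with the swapped base URL:
# subprocess.Popen(["claude-agent"], env=agent_env("http://127.0.0.1:4000"))
```

That the swap happens at subprocess start, not per-request, is why a changed setting only takes effect the next time the ACP bridge launches the agent.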

The setting itself is exposed in-app at Settings, Advanced, AI Chat, Custom API Endpoint. The in-app copy describes it as a hook for corporate proxies, which is the most common use case, but the mechanism is the same whether the other side is a gateway or a loopback llama-server.

Wiring llama-server into a Mac agent

llama.cpp speaks an OpenAI-compatible surface natively. Fazm speaks Anthropic. A thin shim sits in between. That is the whole architecture. The diagram below is literally the path the packets take.

Outbound agent traffic after customApiEndpoint is set

Fazm (Swift) → ACP Bridge → Claude agent → Anthropic-compat shim → llama-server (llama.cpp prefix cache) → Metal GPU
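The shim's job can be reduced to one function. A minimal sketch, assuming only the system prompt and plain text content blocks need translating: it maps an Anthropic /v1/messages body onto the OpenAI /v1/chat/completions shape llama-server understands. A production shim (LiteLLM's Anthropic-compat layer, or one of the small OSS proxies) also has to map tool_use/tool_result blocks and re-shape the SSE stream on the way back.

```python
def anthropic_to_openai(body):
    """Translate a minimal Anthropic /v1/messages request body into
    an OpenAI /v1/chat/completions body.

    Sketch only: handles the system prompt and text content blocks,
    not tool calls or streaming re-shaping.
    """
    messages = []
    if "system" in body:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": body["system"]})
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):  # Anthropic content-array shape
            content = "".join(b["text"] for b in content
                              if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": body.get("model", "local"),
        "max_tokens": body.get("max_tokens", 1024),
        "stream": body.get("stream", False),
        "messages": messages,
    }
```

Note that the system prompt stays at the front of the message list, which is exactly what lets llama-server's prefix cache recognize it across turns.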

What llama-server's boot log looks like after b8920

Before b8920, llama-server booted silently on Metal - if it worked, there was no log line confirming it. After b8920, you get an identifier for the GPU, its family, and a few capability flags. If you wire llama-server behind a shim and notice throughput collapsing, this is the first thing you grep for.

llama-server startup on M3 Max under b8920+

The two fields that matter: GPU family (should show MTLGPUFamilyApple9 on M3 and newer) and hasUnifiedMemory = true. If either is missing, the build fell back to a CPU or software path and prompt-processing will not keep up with a Mac agent workload.
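Checking for those two fields is greppable, so it is scriptable. A small Python sketch; the sample log lines are an assumption about the b8920 output format based on the description above, not a verbatim capture:

```python
def metal_log_ok(log_text):
    """Return True if a llama-server boot log shows both an
    Apple-family Metal GPU and unified memory.

    The exact line format is assumed; the check only cares that
    both markers appear somewhere in the log.
    """
    lines = log_text.splitlines()
    has_family = any("MTLGPUFamilyApple" in line for line in lines)
    has_unified = any("hasUnifiedMemory = true" in line for line in lines)
    return has_family and has_unified

# Illustrative sample of what a b8920+ boot log might contain:
sample = """\
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9
ggml_metal_init: hasUnifiedMemory = true
"""
```

Wire that into whatever launches llama-server behind your shim and a CPU fallback becomes a startup failure instead of a mystery slowdown three turns in.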

End-to-end: four commands to swap the backend

This is the minimum set of shell commands to take a default Fazm install, point it at a local llama-server, and have the next turn of conversation resolve on loopback. No recompile, no code fork.

Point Fazm at a local llama-server on 127.0.0.1

The defaults write line is the same UserDefaults key the app writes to when you use the Settings UI. They are interchangeable. Use the UI unless you are scripting this for a fleet.

The request-reply path, with prefix caching as the hot spot

The reason the Anthropic-style prefix-caching fix in the April 2026 server builds matters for a desktop agent: every turn prepends a stable system block and a current accessibility tree. The system block is the same across turns. If the cache sees it, the prefill step is skipped entirely.
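The arithmetic behind "skipped entirely" is worth making concrete. A toy model, with token lists standing in for the real tokenizer output: the server only has to prefill the tokens after the longest shared prefix with the previous request, so a stable 2,000-token system block reduces a 2,500-token prompt to a 500-token prefill.

```python
def cached_prefill(prev_tokens, next_tokens):
    """Toy model of prefix caching: count the longest shared prefix
    with the previous request, which the cache can reuse, and the
    remainder, which must be prefilled. Returns (reused, to_prefill).
    """
    shared = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        shared += 1
    return shared, len(next_tokens) - shared

# Two consecutive agent turns: same system block, different AX tree.
system = ["sys"] * 2000            # stable system prompt, ~2k tokens
turn1 = system + ["tree-v1"] * 500  # turn 1 appends its AX snapshot
turn2 = system + ["tree-v2"] * 500  # turn 2 appends a fresh snapshot
```

Before the April fix, the Anthropic content-array shape defeated this matching and every turn prefilled all 2,500 tokens; after it, only the fresh tree does.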

One Fazm turn against a local llama-server

Fazm (Swift) → ACP Bridge: sendMessage + AX tree
ACP Bridge → Anthropic shim: POST /v1/messages
Anthropic shim → llama-server: POST /v1/chat/completions
llama-server: prefix cache hit (stable system)
llama-server → Anthropic shim: streaming tokens
Anthropic shim → ACP Bridge: re-shaped to Anthropic SSE
ACP Bridge → Fazm: onContentBlockDelta

Cause and effect, release by release

Every other write-up of this build range lists commits. The only thing that matters for a Mac agent is which ones change behavior you can actually feel. These are the six to account for, including two that explicitly do not apply.

b8920: metal: print GPU description

You can finally verify which GPU llama-server picked up, without attaching a profiler. Two lines in the boot log: GPU name and GPU family. Most useful change for Mac users in the whole April range.

Server: Anthropic-style prefix caching

A desktop agent prepends a stable system block every turn. Before the fix, the cache missed when the client used Anthropic content arrays. After, prefills collapse to zero on repeat turns.

llama-cli --endpoint

Turns llama-cli into a remote client. Adjacent to Fazm rather than on the critical path, but it is the one-liner for sanity-checking a llama-server before wiring it into anything.

b8925: structured-output parser fix

Tool-calling agents get malformed JSON sometimes. This release fixed a crash in the parser on specific bad outputs. Matters for Fazm because every turn runs at least one tool call.

b8922: WebGPU tile flash attention

Fallback for browsers without subgroup matrix. Does not apply to a native Mac agent. Listed here for completeness so readers do not chase it.

Hexagon changes (b8914, b8924)

Qualcomm DSP performance work. Not relevant on Apple Silicon. Again listed so they are accounted for in the April build range.

Why this works on Fazm and not on screenshot-based Mac agents

Any Mac agent can, in principle, swap its cloud backend for a local server. The question is whether the latency budget survives the swap. For a screenshot-based pipeline, it does not. The vision encode step dominates before the language model even starts. For an accessibility-tree pipeline, the budget is mostly prompt processing, which is exactly what the April 2026 prefix caching fix accelerates.

Feature | Screenshot agent + local llama-server | Fazm (accessibility tree + local llama-server)
Input to the model each turn | 1-3 MB JPEG, re-OCR'd | 2-10 KB structured AXUIElement tree
Effect of prefix caching fix | Every screenshot is different, no cache hit | Stable system block cache-hits every turn
Works with any Mac app | Mostly browser tabs and Electron shells | Yes, accessibility API is system-wide
Local model size that stays usable | Needs a vision model, doubles the footprint | 13B-30B Q4/Q5 on a 48 GB Mac is workable
How to swap the backend | Usually a full pipeline rewrite | One UserDefaults key, no rebuild
Consumer-friendly or dev framework | Python, uv, API keys, docker-compose | Signed, notarized Mac app

A note on what Fazm ships by default

By default the Custom API Endpoint field is empty and Fazm talks to Anthropic directly. That is deliberate. Claude Opus 4.7 on a 1M context window still outperforms every open model on long-horizon desktop automation, which is what the app is actually optimizing for. The custom endpoint exists for users who want privacy control or who want to experiment. Everything about this page is a hook you can turn on, not the default path.

Fraction of outbound agent traffic that respects the customApiEndpoint setting when set: 100%. Nothing about the bridge bypasses it.

Want to see the customApiEndpoint path live?

Fifteen minutes, screen-shared, walking through the two lines in ACPBridge.swift that do the whole backend swap, and a working llama-server loopback.

Questions about running llama.cpp behind a Mac agent

What actually shipped in llama.cpp in April 2026?

Roughly the b8800 through b8925 build range. The ones that matter for a Mac agent are b8920 (metal: print GPU description on server boot, so you can verify which GPU is doing the work), b8922 (WebGPU flash attention enhancements with a fallback tile path for older browsers), a prefix-caching fix in the server for Anthropic-style clients, a new --endpoint option on llama-cli so it can act as a client to a remote llama-server, and b8925 (a structured-output parser fix). The Hexagon and KleidiAI work in b8914 and b8924 is not load-bearing on an Apple Silicon Mac.

How do I point Fazm at a local llama-server instead of Anthropic?

Fazm reads one UserDefaults key named customApiEndpoint. The ACP bridge picks that value up at process start and exports it as the ANTHROPIC_BASE_URL environment variable for the Claude agent subprocess it spawns. The actual code path is in Desktop/Sources/Chat/ACPBridge.swift at lines 380 and 381. If your local llama-server exposes an Anthropic-compatible surface (or sits behind a shim that does), put that URL in Fazm Settings under Advanced, AI Chat, Custom API Endpoint. Fazm itself does not care whether the other side is Anthropic's real API or a loopback address.

Does that mean Fazm runs fully local?

Not out of the box. Fazm ships with a cloud Claude backend by default because Claude still beats every local model at long-horizon desktop automation. The custom endpoint is a hook for users who want to experiment or who have compliance constraints. The reads from your Mac (accessibility tree, file index, audio transcription) are already on-device. What crosses the network is the model request. Pointing customApiEndpoint at 127.0.0.1 moves that leg to your machine too. This is why the v2.3.2 changelog (April 16, 2026) rewrote the privacy language from 'nothing leaves your device' to 'local-first' - honest, not marketing.

Why does the b8920 'metal: print GPU description' change matter?

Because on a Mac you often have more than one GPU path available (the integrated Apple GPU, an eGPU, or in Asahi Linux setups, a discrete card). Up until b8920, llama.cpp booted silently if Metal initialization succeeded. After b8920 the server prints the GPU name, core count, and a few capability flags at startup. If you are wiring this into Fazm via a reverse proxy and throughput drops, that one log line tells you whether llama-server actually picked up the M-series GPU or fell back to a slower path. This is the single most useful debugging affordance the April builds added for Mac users.

What changed in the llama.cpp server for Anthropic-style clients?

The April 2026 weekly report flags a prefix-caching fix in the server for the Anthropic API path. Before the fix, repeated long-system-prompt calls (exactly the kind of traffic a desktop agent generates, because the accessibility tree snapshot is prepended every turn) were cache-missing when the client used the Anthropic-style content array shape. After the fix, the prefix cache recognizes the stable system block across turns. For a Mac agent that sends a 3-10 KB tree every message, the difference is measurable - cache hits on the prefix skip the prefill step entirely.

What is the --endpoint flag on llama-cli?

It turns llama-cli into a thin client that talks to a remote llama-server over HTTP instead of loading the model locally. Useful if you keep a big-RAM machine on your network running the server and want a laptop-side CLI that chats to it. For Fazm, this is adjacent rather than direct - Fazm goes through the ACP bridge, not llama-cli. But if you want to sanity-check your llama-server is working before wiring it into Fazm, llama-cli --endpoint http://127.0.0.1:8080 is the one-liner.

How big a model can a Mac actually run for a desktop agent workload?

On an M3 Max or M4 Max with 48 to 96 GB of unified memory, you can run a Q4_K_M quant of a 70B class model with reasonable prompt-processing speed. For the specific shape of Fazm's traffic (short system context, a 2-10 KB accessibility tree, tool-call-heavy turns), latency matters more than raw throughput because every user turn is round-tripped. A 30B class instruct-tuned model at Q5 is usually the sweet spot in practice. Below 13B, tool-calling accuracy on structured UI trees drops off fast.
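The back-of-envelope math for those size claims, as a sketch. The bits-per-weight figures below are rough conventions for the named quants, not exact; real GGUF files vary with the per-layer precision mix, and the KV cache adds on top of the weights.

```python
def weight_gb(params_billions, bits_per_weight):
    """Back-of-envelope GGUF weight footprint in GB.

    Assumed averages, not exact file sizes: mixed-precision layers
    and KV-cache memory are not accounted for.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Assumed averages: Q4_K_M ~ 4.8 bits/weight, Q5_K_M ~ 5.7 bits/weight
seventy_b_q4 = weight_gb(70, 4.8)   # 70B class at a Q4_K_M-ish quant
thirty_b_q5 = weight_gb(30, 5.7)    # 30B class at a Q5-ish quant
```

Roughly 42 GB of weights for the 70B case is why it only fits the 48-96 GB unified-memory machines, and roughly 21 GB for the 30B case is why that is the practical sweet spot on a 48 GB Mac once the KV cache and the rest of the system want their share.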

Does the WebGPU flash-attention work in b8922 help on macOS?

Only if you run llama.cpp's WebGPU backend in a browser, which is not the path a native Mac agent takes. The Metal backend is still the right target on Apple Silicon. The WebGPU work is interesting for hosted demo pages, not for a signed, notarized Mac app talking to a loopback server.

What breaks if I point Fazm at an OpenAI-compatible llama-server directly?

The Claude agent Fazm spawns speaks the Anthropic content-array shape, so pointing it at a raw OpenAI-schema endpoint will return 400s for the first tool-heavy turn (content blocks with tool_use will not deserialize). You need a shim in between - something like LiteLLM's Anthropic-compat layer, or one of the small OSS proxies that translate between the two. The address you put in customApiEndpoint should be the shim, not llama-server directly. The fact that llama.cpp's April builds fix prefix caching specifically for the Anthropic API path is what makes this round-trip cheap enough to be worth doing.

Why does the accessibility-tree approach matter for running on a local model?

Because the cheapest thing a local model can consume is structured text. Fazm reads the real macOS accessibility tree via AXUIElement for whatever app is under the cursor and passes that tree in. A 7 KB tree is a few thousand tokens. A screenshot of the same screen, even at tight JPEG quality, is 10x the token cost after vision encoding plus significantly more latency. Screenshot-based Mac agents can technically swap their cloud backend for a local model too, but the vision step destroys the latency budget well before the language model does. With an accessibility-tree pipeline, local inference is actually tractable.
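The "10x the token cost" claim is a back-of-envelope estimate, and the envelope is simple. A sketch with loudly assumed constants: ~4 bytes per text token as an average, and a per-tile vision-encoder cost that is a ballpark rather than any specific model's figure.

```python
def text_tokens(n_bytes, bytes_per_token=4):
    """Rough text token count; ~4 bytes/token is an assumed average."""
    return n_bytes // bytes_per_token

def image_tokens(width, height, tile=512, tokens_per_tile=1600):
    """Rough vision-encoder cost for a tiled image; the per-tile
    token count is an assumed ballpark, not a measured figure."""
    tiles_x = -(-width // tile)    # ceiling division
    tiles_y = -(-height // tile)
    return tiles_x * tiles_y * tokens_per_tile

tree = text_tokens(7 * 1024)       # a 7 KB AX tree
shot = image_tokens(2560, 1600)    # one Retina-ish screenshot
```

Under these assumptions the screenshot comes out well past 10x the tree, before adding the vision-encode latency itself, which is the budget a local model cannot absorb.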