llama.cpp release May 2026, read from inside a Mac agent
Fifty builds shipped between May 8 and May 12, 2026. One Metal change matters on Apple Silicon. Three server changes matter for a desktop agent. The rest is Adreno, SYCL, Hexagon, and Vulkan work that does not touch a Mac. This is what to actually read out of the changelog if you have a Mac AI agent on one side of an HTTP socket and llama-server on the other.
Direct answer (verified 2026-05-12)
Source: github.com/ggml-org/llama.cpp/releases
llama.cpp shipped builds b9070 through b9127 between May 8 and May 12, 2026. The four builds a Mac AI agent owner should care about:
- b9077 (May 8) - server gains a Vertex-AI-compatible API surface
- b9101 (May 10) - server prints a warning when HTTP timeouts are exceeded
- b9114 (May 12) - Metal mul_mv/mul_mm batch divisors moved to function constants (Apple Silicon prefill speedup)
- b9124 (May 12) - server exposes per-model modalities at /v1/models
Everything else in the window is Adreno, SYCL, Hexagon, or Vulkan work that does not run on a Mac. There is no v0.x-style version tag in this period; llama.cpp ships per-commit build numbers.
The shape of the window
The first half of May 2026 was, going by the GitHub release page, almost entirely backend work for hardware Apple does not ship. Adreno q4_0 and q4_1 MoE kernels (b9070, b9113). Hexagon HVX work for L2_NORM and gated delta net (b9082, b9084). SYCL allocation tuning (b9087, b9088, b9089). Vulkan shader fixes (b9072, b9106, b9118, b9119). Three Apple Silicon Macs running the same binary will not notice any of it.
What does land on the Mac is the small set of changes covered below. They are spread across three thin layers: a single Metal kernel optimization (b9114), a handful of server-API surface changes (b9077, b9101, b9124), and one or two sampling and speculative-decoding improvements that travel through both (b9100, b9109).
For an agent driving local apps, those layers map onto a fixed integration shape. The agent talks to a shim, the shim talks to llama-server, llama-server runs the Metal kernels. A change in one layer affects exactly one thing the agent might care about, and the rest is invisible. The bulk of this page is about which May change shows up at which layer.
Build-by-build, what shipped
The sixteen builds below are the ones whose one-line summary mentions Apple Silicon, the server, sampling, multimodal support, or speculative decoding. The other thirty-odd builds in the window are GPU-backend or model-loader work on non-Apple paths.
b9070 - May 8, 05:20 UTC, opencl: add q4_0 MoE GEMM for Adreno
Adreno mobile GPU path for mixture-of-experts at 4-bit quant. Not on a Mac code path, but the same week's GPU work shows up in this build's binary.
b9076 - May 8, 18:53 UTC, server: router exposes child model information
The server's model router now reports child-model metadata. Useful for shims that route between several loaded models on the same server.
b9077 - May 8, 19:29 UTC, server: support Vertex AI compatible API
Vertex-shape surface added next to the existing Anthropic and OpenAI shapes. Removes one shim layer for users coming from a Google Cloud workload.
b9082 - May 8, 22:21 UTC, Hexagon L2_NORM HVX kernel
Qualcomm Hexagon DSP work, not a Mac path. Listed for completeness.
b9084 - May 9, 03:27 UTC, Hexagon HTP kernel for gated delta net
Hexagon-only, but the kernel shape (gated delta net) is the same one some long-context attention variants use, so an Apple Silicon port of the same idea is plausible later.
b9085 - May 9, 05:18 UTC, Flash attention MMA/Tiles support for MiMo-V2.5
Adds the matrix-multiply-accumulate tile path for the MiMo v2.5 family. Sets up the multimodal model b9116 will plug into.
b9089 - May 9, 11:03 UTC, SYCL flash attention allocation overhead reduction
Intel oneAPI path. Not on the Mac, but illustrates how much of this release window is non-Apple work.
b9093 - May 9, 21:02 UTC, sarvam_moe architecture support
New MoE architecture supported on the model-loader side. Independent of Metal.
b9100 - May 10, 20:06 UTC, sampling: return post-sampling probabilities
Sampler can now report the probability distribution after sampling, not just the sampled token. Useful for any agent that wants to expose confidence to the user or to its own retry logic.
b9101 - May 10, 20:27 UTC, server: warn when HTTP timeouts are exceeded
The single most useful debugging affordance the May builds added for Mac users. Slow llama-server decodes now log a visible timeout instead of silently hanging the client.
b9109 - May 11, 21:12 UTC, spec : parallel drafting support
Multiple draft models can run in parallel for speculative decoding. Trades memory for higher draft acceptance. On Apple Silicon, you will hit unified-memory bandwidth before you see most of the gain.
b9114 - May 12, 07:47 UTC, metal: promote mul_mv/mul_mm batch divisors to function constants
The headline Apple Silicon change in this window. Integer divisors inside the Metal matrix-multiply kernels are now compile-time constants, so the kernel can fold power-of-two divisors into shifts. Small prefill speedup on M3 and M4.
b9116 - May 12, 12:46 UTC, mtmd: add MiMo v2.5 vision
Vision support for the MiMo v2.5 family lands. Combined with b9085's flash-attention kernel, this is the build where you can actually serve MiMo v2.5 with vision through llama-server.
b9119 - May 12, 15:49 UTC, vulkan: fix Windows Intel GPU BF16 regression
Not relevant on macOS. Listed because users sometimes ask whether the Vulkan path can be used as a fallback on Apple Silicon. The answer is still no.
b9124 - May 12, 19:18 UTC, mtmd, server, common: expose modalities to /v1/models
Clients can ask /v1/models which modalities each loaded model supports. The first time a Mac agent can feature-detect vision and audio support on the server it is about to talk to, without trial-and-error.
b9127 - May 12, 22:29 UTC, opencl: add opt-in Adreno xmem F16xF32 GEMM
Final build of the window, OpenCL only. Marks the end of the May 2026 sprint as of this writing.
b9114, the only Metal change in the window
The Metal kernels in question are mul_mv and mul_mm, matrix-vector and matrix-matrix multiply. Both are batched. Before b9114, the batch divisor was passed in as a regular uniform argument, which means every kernel invocation had to load it and divide by it at runtime. After b9114, the divisor is declared as a Metal function constant, which the Metal compiler resolves at pipeline-creation time.
The practical effect is two things. First, when the divisor is a power of two, the compiler folds the division into a right-shift inside the kernel. Second, even when it is not, the divisor stops being a memory load and becomes an immediate value baked into the kernel binary. On an M3 or M4 doing long-context prefill (the part of a chat turn where a long system block is processed before generation starts), the kernel runs are shorter and the dispatch is cheaper.
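For readers who want the mechanism in concrete terms, here is a minimal host-side sketch of what a function-constant divisor looks like with the Metal framework in Swift. The kernel name, constant index, and divisor are placeholders rather than llama.cpp's actual identifiers; the point is only that the value is supplied when the pipeline is created and specialized into the compiled kernel, instead of being read from a buffer on every dispatch.

```swift
import Metal

// Illustrative only: the function name, constant index, and divisor below
// are placeholders, not llama.cpp's actual identifiers.
func makeSpecializedPipeline(device: MTLDevice,
                             library: MTLLibrary,
                             batchDivisor: UInt32) throws -> MTLComputePipelineState {
    // The divisor is handed to the Metal compiler as a function constant...
    var divisor = batchDivisor
    let constants = MTLFunctionConstantValues()
    constants.setConstantValue(&divisor, type: .uint, index: 0)

    // ...so the specialized function has it baked in as an immediate.
    // A power-of-two divisor can be folded into a shift; a runtime
    // buffer argument could not be.
    let fn = try library.makeFunction(name: "kernel_mul_mm", constantValues: constants)
    return try device.makeComputePipelineState(function: fn)
}
```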
The size of the win depends on the model, the quant, and the context length. For the workload an accessibility-tree-based Mac agent generates (a 3 to 10 KB system block prepended every turn, mostly stable across turns, very low generation length per turn), the prefill path is where most of the wall-clock time goes, so the change actually shows up. The April work on Anthropic-style prefix caching (covered on the April page) and this Metal kernel change stack: the prefix cache cuts how often you pay prefill, this kernel change cuts what prefill costs when you do.
One caveat. The release notes list this change against the unified Metal backend, not against any specific Apple Silicon generation. In testing, the absolute speedup on the M1 generation is hard to measure because the kernel was already cheap relative to memory bandwidth. The change is visible on M3 and M4 because their compute is faster relative to bandwidth, so a smaller kernel actually shows up at the timeline level.
The three server changes that matter for a desktop agent
A Mac agent does not link llama.cpp. It talks to llama-server over HTTP, usually through an Anthropic-shape shim. So the things that affect the agent are the things that change the server's wire-level behavior. Three did.
b9077 - Vertex AI compatible API
llama-server already exposed an OpenAI-shape surface and, more recently, an Anthropic-Messages-shape surface. b9077 adds a Vertex AI shape. For a Mac agent that uses an Anthropic-shape shim (which is what the Fazm ACP bridge expects), this is not the path you will use. It is useful only if your team already runs a Vertex AI workload in production and wants to dogfood the same client shape against llama-server in dev. The net is: one less reason to put LiteLLM in front of llama-server for users in that specific corner.
b9101 - HTTP timeout warning
Quietly the most useful change of the week if you actually use llama-server from a Mac client. Before b9101, when the server's internal HTTP timeout fired (default is generous but finite), the client got a connection drop with no server-side breadcrumbs. After b9101, the server logs a visible warning at the timeout instant. The change is in the logging layer, not the behavior, but it cuts the debugging loop on the most common Mac-side llama-server failure (a long decode under load) from minutes to seconds.
b9124 - modalities exposed at /v1/models
Clients can now poll /v1/models and learn, per model, which input modalities the server accepts. Text only, text plus image, text plus audio, all three. For agents that ship a single binary and let the user point at any llama-server, this finally enables feature detection. Before b9124, the only way to know if a model accepted vision was to send it an image and watch what happened. After b9124, a startup probe answers the question cleanly. For an accessibility-tree-based agent like Fazm, this matters less than for a screenshot-based one, but it still affects the moment a user wires in a vision-capable shim or hands the agent an image-bearing PDF.
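A startup probe along these lines is all a client needs. This is a hedged sketch: the exact key under which b9124 reports modalities (spelled `modalities` here) and its placement in the /v1/models payload are assumptions, so treat the decoding keys as placeholders to verify against your server's actual response.

```swift
import Foundation

// Hedged sketch: the "modalities" key and its placement are assumptions
// about b9124's /v1/models payload, not a documented contract.
struct ModelEntry: Decodable {
    let id: String
    let modalities: [String]?   // e.g. ["text", "image", "audio"]
}

struct ModelList: Decodable {
    let data: [ModelEntry]
}

func serverSupportsVision(baseURL: URL) async throws -> Bool {
    let (body, _) = try await URLSession.shared.data(
        from: baseURL.appendingPathComponent("v1/models"))
    let models = try JSONDecoder().decode(ModelList.self, from: body)
    // Feature-detect instead of sending an image and watching for a 400.
    return models.data.contains { $0.modalities?.contains("image") == true }
}
```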
The four lines of Swift that make any of this matter for Fazm
Fazm is a native Mac app. It reads the macOS accessibility tree, drives apps via AppleScript and AX events, and runs a Claude agent in a child process to plan and step. The Claude agent talks Anthropic Messages over HTTP. There is exactly one configuration field that decides which HTTP endpoint that traffic goes to.
The field is a UserDefaults key called customApiEndpoint, surfaced in Settings under Advanced > AI Chat > Custom API Endpoint. The four lines that pick it up and pass it down to the agent subprocess live in Desktop/Sources/Chat/ACPBridge.swift at lines 467 through 470.
That is the entire integration surface. The Claude agent reads ANTHROPIC_BASE_URL from its environment and uses it instead of the default Anthropic API origin. Whatever sits at that URL needs to speak the Anthropic Messages content-array shape, including the tool_use and tool_result content blocks. llama-server does not natively. So the address you put in the field is the address of a shim (LiteLLM with the Anthropic-compat adapter, claude-code-router, or a small FastAPI translator), and the shim talks to llama-server in whichever shape they share.
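The wiring is small enough to sketch. The following is not the literal code at ACPBridge.swift:467-470, just an illustration of the shape it takes in Swift: read the UserDefaults key, and if it is non-empty, export it as ANTHROPIC_BASE_URL on the child environment before launching the bridge.

```swift
import Foundation

// Illustrative sketch of the integration shape, not the literal
// ACPBridge.swift lines: read the custom endpoint and hand it to the
// agent subprocess as ANTHROPIC_BASE_URL.
func launchBridge(bridgeExecutable: URL) throws -> Process {
    let process = Process()
    process.executableURL = bridgeExecutable
    var env = ProcessInfo.processInfo.environment

    if let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint"),
       !endpoint.isEmpty {
        // Whatever sits at this URL must speak the Anthropic Messages shape,
        // so in practice it is a shim in front of llama-server.
        env["ANTHROPIC_BASE_URL"] = endpoint
    }
    process.environment = env
    try process.run()
    return process
}
```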
Every May 2026 build crosses that four-line block the same way. b9114's Metal speedup is invisible to the bridge, b9101's timeout warning shows up in the shim's upstream logs, b9124's modality field shows up if the shim copies it through. The bridge does not change.
What the request path actually looks like
One turn of a Fazm chat against a local llama-server, through the May 2026 build. Each numbered step is a real HTTP hop or a real subprocess boundary.
One chat turn, Fazm to local llama-server
1. Fazm app reads the accessibility tree. Native Swift reads AXUIElement for the focused window and serializes 3-10 KB of text. No network.
2. ACP bridge subprocess spawns. ACPBridge.swift checks customApiEndpoint, sets ANTHROPIC_BASE_URL on the child env, launches the Node ACP bridge.
3. Anthropic-shape POST to the shim. The bridge sends Messages-shaped JSON with tool_use blocks to LiteLLM or claude-code-router at the configured URL.
4. Shim translates to llama-server. The shim converts the Anthropic content-array shape to the OpenAI completions shape (or, post-b9077, the Vertex shape) and posts to localhost:8080.
5. llama.cpp b9127 server runs the kernels. Metal mul_mv/mul_mm with b9114's function-constant divisors. Prefill on the accessibility tree, decode of the tool call.
6. Response flows back, shim translates. The OpenAI-shape response becomes Anthropic-shape. b9101 logs any timeout en route; b9124 surfaces modalities if the shim copies them.
The shortest May 2026 setup
What to actually run to wire b9127 into a Fazm chat. Each step has a verifiable output you can check before moving on.
1. Build or pull b9127. git clone llama.cpp, git checkout b9127, make. Or pull the prebuilt macos-arm64 binary from the release page. Confirm with `llama-server --version`.
2. Run llama-server. llama-server -m /path/to/model.gguf -c 16384 --host 127.0.0.1 --port 8080. On boot, the April b8920 line prints the Metal GPU description. If you see your M-series chip name, you are running on the GPU.
3. Front it with a shim. litellm --config litellm.yaml --port 4000, with a single model entry mapping anthropic/claude-sonnet to localhost:8080. Or claude-code-router for a smaller footprint.
4. Paste the shim URL in Fazm. Download from fazm.ai, install, Settings > Advanced > AI Chat > Custom API Endpoint, http://127.0.0.1:4000, save. The next chat turn flows through your local b9127.
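Once the three pieces are up, a one-shot request through the shim verifies the whole path before Fazm is involved at all. A minimal smoke test in Swift, assuming the shim exposes an Anthropic-style /v1/messages route at the URL from step 4; the route, model name, and header values are assumptions that depend on how the shim is configured.

```swift
import Foundation

// Minimal end-to-end check: POST one Anthropic-shaped message to the shim
// and print the raw response. The /v1/messages path, model name, and api
// key value are assumptions that depend on how the shim is configured.
func smokeTest(shimURL: URL = URL(string: "http://127.0.0.1:4000")!) async throws {
    var request = URLRequest(url: shimURL.appendingPathComponent("v1/messages"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("local", forHTTPHeaderField: "x-api-key")

    let payload: [String: Any] = [
        "model": "claude-sonnet",   // whatever your shim maps to llama-server
        "max_tokens": 64,
        "messages": [["role": "user", "content": "Say hello in five words."]]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: payload)

    let (body, response) = try await URLSession.shared.data(for: request)
    // Watch llama-server's stdout while this runs; the request should land there.
    print((response as? HTTPURLResponse)?.statusCode ?? -1,
          String(data: body, encoding: .utf8) ?? "")
}
```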
April 2026 vs May 2026, only what changed for Mac users
If you read the April page on this site and have not upgraded since, this is what is new for you on a Mac. Non-Mac paths (Adreno, SYCL, Hexagon, Vulkan, CUDA) are omitted.
| Feature | April 2026 (b8800-b8925) | May 2026 (b9070-b9127) |
|---|---|---|
| Metal backend | b8920 prints Metal GPU description on server boot | b9114 promotes mul_mv/mul_mm batch divisors to function constants |
| Server timeout behavior | Silent timeout drop | b9101 logs a warning when HTTP timeouts fire |
| Modality feature detection | Trial and error, send and watch | b9124 exposes per-model modalities at /v1/models |
| Server API shapes | Anthropic and OpenAI only | Anthropic, OpenAI, plus b9077 Vertex AI |
| Sampling probabilities | Sampled token only | b9100 returns post-sampling probabilities to clients |
| Speculative decoding | Single draft model | b9109 supports parallel drafting (multiple draft models) |
| Fazm integration | Same four lines at ACPBridge.swift:467-470, unchanged | Same four lines at ACPBridge.swift:467-470, unchanged |
Frequently asked questions
What did llama.cpp ship in May 2026, in one paragraph?
Roughly fifty builds, from b9070 on May 8 through b9127 on May 12. Most of them are GPU-backend or model-architecture work that does not touch a Mac path (Adreno q4_0 and q4_1 MoE for OpenCL, SYCL flash-attention allocation, Hexagon HVX kernels, Vulkan shader fixes). Six changes do matter for a Mac AI agent: b9077 (server gains a Vertex AI compatible API), b9100 (sampling can return post-sampling probabilities), b9101 (server prints a warning when an HTTP timeout is exceeded instead of hanging), b9109 (parallel drafting support for speculative decoding), b9114 (metal: promote mul_mv and mul_mm batch divisors to Metal function constants), and b9124 (mtmd, server, common: expose modalities at /v1/models so clients can feature-detect vision and audio before sending a request).
What does b9114 'metal: promote mul_mv/mul_mm batch divisors to function constants' actually mean for an M-series Mac?
It means the integer divisors used inside the matrix-vector and matrix-matrix Metal kernels are no longer plain runtime arguments. They are now Metal function constants, which the Metal compiler specializes at pipeline-creation time. The result is a smaller hot path: the divisor becomes a compile-time integer constant inside the kernel, so the GPU can fold the division into shifts where the divisor is a power of two and skip a load otherwise. For a Mac agent driving local llama-server, this lands on the prefill path that handles your accessibility-tree system block every turn. On the M1 generation the speedup is too small to see on a latency graph, but on M3 and M4 the practical effect is a few percent off prefill latency on long-context turns. Compared with the Hexagon and OpenCL work in the same week, this is the only change that hits Apple Silicon at all.
What is b9124 (modalities at /v1/models) and why should a desktop agent care?
It exposes, on the same /v1/models endpoint clients already poll, which input modalities each loaded model supports. So a client can ask the server, before sending a request, whether the model accepts text only, text plus images, text plus audio, or all three. For a Mac agent that ships a single binary and lets the user point it at any llama-server, this is the difference between sending a tool call with a base64 screenshot and getting a 400 because the model is text-only, vs. detecting modalities at startup and either degrading gracefully or surfacing a clear message. Fazm uses accessibility-tree text, not screenshots, so the modality question rarely bites in normal operation, but the moment the user wires a vision-capable shim in or hands the agent an image-bearing PDF, this matters.
What is b9077 'server: support Vertex AI compatible API' for, and does it change anything on macOS?
It adds a Vertex-AI-shaped surface to llama-server, mirroring the existing Anthropic and OpenAI shapes. If your Mac agent's API shim happens to speak Vertex (because it was built originally for a Google Cloud workload), you can now point that shim directly at llama-server without a third-party translator in the middle. Concretely, this matters for teams who run a Vertex AI workload in production, want to dogfood the same shape against a local llama.cpp instance for development, and only have a Mac. It does not change the Anthropic path, which is what Fazm's ACP bridge actually uses, so the headline is 'one fewer reason to stand up LiteLLM in front of llama-server'.
What is the b9101 server timeout warning change, and is it load-bearing for an agent?
Before b9101, when llama-server's internal HTTP timeout fired on a slow generation (think a long decode on a 70B at Q4 on an M2 Air), the server would just hang from the client's perspective. After b9101, it logs a visible warning the instant a timeout is exceeded. For a Mac agent driving an Anthropic-shaped shim that fronts llama-server, this is the difference between 'why did the chat freeze at turn 12' and 'oh, look in the server log, there is the timeout'. The change is in the logging layer, not the behavior, but it cuts the debugging loop for the most common Mac-side failure mode of a local llama-server setup roughly in half.
How does any of this connect to Fazm specifically?
Fazm reads one UserDefaults key, called customApiEndpoint, inside Desktop/Sources/Chat/ACPBridge.swift. When the value is non-empty, the ACP bridge subprocess is spawned with ANTHROPIC_BASE_URL set to it. That is the entire integration surface for swapping the backend, and it lives at lines 467 through 470 of ACPBridge.swift. The four lines do not care which version of llama.cpp is on the other side, only that whatever the URL points at speaks the Anthropic Messages shape. So every change in this May 2026 build window either flows through transparently (b9114 Metal speedups, b9100 sampling probabilities) or shows up the moment your shim asks llama-server for them (b9124 modalities, b9077 Vertex shape).
Do I need to upgrade past the April 2026 builds if my Mac setup is working?
Two reasons to consider it. First, b9114 is a small but real Metal prefill-path improvement, and it has zero cost to take. Second, b9101's timeout warning will save you the next time the server stalls under load and you cannot tell whether the bottleneck is the model or the network. The rest of the May builds are mostly non-Apple paths (Adreno, SYCL, Hexagon, Vulkan). If you are on a build older than April's b8920 (the metal-print-GPU-description change), you also lose a useful boot-time sanity log line by not upgrading. None of the May changes break the Anthropic shape that a Fazm-style agent depends on.
Does b9109 (parallel drafting for speculative decoding) help a Mac agent?
Only if your shim actually exposes the speculative-decoding knob to the model loader. The May change enables multiple draft models to run in parallel during speculative decoding, which trades a bit of memory for higher draft acceptance. On a Mac, the constraint is almost always unified-memory bandwidth, not draft throughput, so the win is smaller than on a discrete-GPU host. If you are loading a 30B target plus a 1B draft on a 64 GB M3 Max, you can probably fit two drafts and see a measurable improvement on tool-call-heavy turns. On a 36 GB M3, you will run out of memory before you see the gain. Treat it as an advanced setting, not a default.
Why are the WebGPU and Vulkan changes in this window irrelevant for macOS?
Because Apple does not ship Vulkan, and the WebGPU path inside llama.cpp targets browsers, not native applications. A signed and notarized Mac agent talks to llama-server over HTTP through the Metal backend, full stop. The Vulkan Windows-Intel BF16 regression fix in b9119 is genuinely useful, just not on this platform. Same for the OpenCL Adreno work, which targets mobile GPUs.
What is the shortest May 2026 setup that wires the latest llama.cpp into Fazm?
Three steps. One, build or pull llama.cpp at tag b9127 or later, run `llama-server -m <model> -c <ctx> --host 127.0.0.1 --port 8080`. Two, drop an Anthropic-shape shim in front; the simplest path is `litellm --config litellm.yaml --port 4000` with a single model entry mapping `anthropic/claude-sonnet` to your local OpenAI-shape llama-server. Three, install Fazm from fazm.ai, open Settings, click Advanced > AI Chat > Custom API Endpoint, paste http://127.0.0.1:4000 and save. The four lines at ACPBridge.swift:467-470 pick the URL up on the next chat turn and your local b9127 server is in the loop. Verify by watching llama-server's stdout the moment you send a message: you should see the request hit, and the Metal GPU description line (from April's b8920) will already have printed at boot.
Does the accessibility-tree approach change anything for these May 2026 builds?
It changes which gains you actually feel. Screenshot-based Mac agents put a vision-encoder step in front of every turn, so they care intensely about whether b9124's modality exposure matches their model's image input contract, and they pay the encoder latency even when nothing visual changed on screen. An accessibility-tree agent sends 3 to 10 KB of structured text per turn, mostly stable across turns, which is exactly the traffic pattern b9114's Metal mul kernel and the Anthropic prefix-cache work from April benefit. The same llama.cpp build that helps both agent shapes a tiny bit on raw decode helps the accessibility-tree shape a lot on prefill, because the prefill is what dominates the turn.
Want to wire local llama.cpp into a Mac agent?
If you are pointing a desktop agent at your own llama-server and the shim layer is fighting you, book a 30-minute call. We have done this on enough Macs that we can shortcut a week of guessing.
Keep reading
llama.cpp release April 2026 release notes for Mac agents
The April companion piece. b8800-b8925 build range, the b8920 Metal GPU description line, and the Anthropic prefix-caching fix that pairs with this month's kernel work.
vLLM release May 2026, why your Mac agent does not care
The vLLM v0.20.1 patch and v0.20.0 CUDA 13 jump, walked through the same lens. The integration story is the same four-line ACPBridge.swift block.
Run vLLM locally on a Mac with an AI agent
Practical walkthrough of vllm-mlx as an Anthropic-shape backend for a native Mac agent. Adjacent path to the llama-server + shim setup covered here.