llama.cpp release, June 2026: the version number is not the news

Most pages that rank for this just mirror the GitHub tag list. Here is the part that actually changes what you can build: a current llama-server now answers the Anthropic Messages API natively, which is the one protocol a native Mac coding agent needs to run on a model sitting on your own SSD.

Matthew Diakonov, Written with AI

Published June 19, 20268 min read

Direct answer, verified June 19, 2026

There is no single named “June 2026” release of llama.cpp. The project ships a build on nearly every merged commit, tagged bNNNN. The latest tag on June 19, 2026 was b0, and roughly a dozen builds were cut that day alone. Pin the highest bNNNN tag that has your platform binary and treat that as your version.

Source: github.com/ggml-org/llama.cpp/releases. The tag is a counter, not semantic versioning, so a higher number just means newer.

Why a rolling project has no “June release”

llama.cpp is not on a quarterly or monthly cadence. Continuous integration tags a build, attaches prebuilt binaries for Android, macOS, Ubuntu, and Windows across several CUDA configurations, and publishes it. That is why a search for a month-named release returns a wall of build numbers instead of release notes. The honest answer to “what is the June 2026 release” is a range, not a point: everything from the early-June builds up to b9723 and climbing.

If you came here looking for the May build-by-build walkthrough, the notable changes in that window are covered separately in the May 2026 build notes. This page is about the one capability that, by June, is stable enough to build a daily workflow on.

The endpoint the trackers skip: native Anthropic Messages API

The under-reported fact about current builds is in llama-server, the HTTP server bundled with llama.cpp. It now answers the Anthropic Messages API at /v1/messages with a companion /v1/messages/count_tokens route. The implementation (contributed in PR #17570 and documented by the ggml team) converts Anthropic’s request format into the existing OpenAI-style pipeline internally, so streaming, tool use, vision, and extended thinking all flow through it.

That matters because the wire protocol a Claude Code agent loop speaks is the Anthropic Messages API. So a June 2026 llama-server is not only an inference engine, it is a complete local backend for any client that lets you override the base URL. You set one environment variable and the agent talks to your Mac instead of a data center:

# serve a tool-capable model from a June build
./llama-server -m ./qwen3-coder.gguf --port 8080 --jinja

# any Anthropic-protocol client now points here
ANTHROPIC_BASE_URL=http://127.0.0.1:8080

Confirmed against the ggml team’s write-up: New in llama.cpp: Anthropic Messages API.

What Fazm actually does with that endpoint

Fazm is a native macOS app that wraps the Claude Code agent loop in a real window with persistent sessions, one-click forking, and no auto-compacting. It exposes a Custom API Endpoint field in Settings, and the wiring behind it is small and public. When you set an endpoint, the bridge code in Desktop/Sources/Chat/ACPBridge.swift does three concrete things:

// ACPBridge.swift — custom endpoint branch
env["ANTHROPIC_BASE_URL"]      = customEndpoint   // your llama-server
env["FAZM_CUSTOM_API_ENDPOINT"] = "true"
// never forward the bundled cloud key to your proxy;
// a placeholder keeps key-accepting local gateways happy
env["ANTHROPIC_API_KEY"]       = "sk-fazm-custom-endpoint"

The placeholder key is the detail no other write-up mentions. A local gateway that accepts any token stays on the simple API-key path instead of trying to start a cloud sign-in flow. And because the bundled Anthropic key is dropped in this mode, there is no silent fallback to the cloud: if your local server is down, the chat reports it rather than quietly re-routing your prompt off-device. The error handler even recognises the common local-server failures and tells you to load a model in LM Studio or Ollama, or to turn the endpoint off to return to built-in Claude.

One request, end to end

Here is the full path a single turn takes once the endpoint is set. Nothing in this chain touches a remote server when the endpoint is a loopback address.

A turn from your keyboard to a local June build and back

You type or speak in Fazm

The native window captures the prompt plus any screen or accessibility context the agent is allowed to see.

Agent loop emits an Anthropic Messages request

Claude Code formats the call and sends it to ANTHROPIC_BASE_URL, which now points at your machine.

llama-server answers /v1/messages

The June build converts the Anthropic format internally and runs inference on your loaded GGUF model, streaming tokens and tool calls.

Tool calls execute on your Mac

Browser actions, file edits, and native-app control run through Fazm's accessibility layer, then results feed back into the same window's full context.

Wiring a June build into Fazm in four steps

Get a current build

Download the latest bNNNN macOS binary from the releases page, or build from source. Any build new enough to expose /v1/messages will do, which by June 2026 is all of them.

Serve a tool-capable model

Run llama-server with a model trained for tool use (Qwen3 Coder, Kimi K2, MiniMax M2, Nemotron). Use --jinja so the chat template and tool calls render correctly, and give it the largest context your RAM allows.

Point Fazm at it

Open Settings, Advanced, AI Chat, Custom API Endpoint, and paste http://127.0.0.1:8080. Fazm rewrites ANTHROPIC_BASE_URL and disables its bundled cloud key for this session.

Run the loop, keep the UX

Chat, fork, and restart as usual. The model is now local, but persistent sessions, forking, no auto-compacting, voice, and Mac-wide control are unchanged.

The honest caveat: not every model survives the agent loop

The endpoint working does not mean every model works. An agent loop is mostly tool calls and long context, and that is where small or chat-only models fall apart. Before you blame the setup, check the model, not the wiring.

What a local model needs to drive the loop

Trained for tool or function calling, not just chat
A context window large enough to hold a working session
Served with --jinja so tool-call templates render
Enough RAM and bandwidth on your Mac for usable throughput
A 7B general chat model as your agent brain
Expecting frontier-level reliability on long multi-step plans

For the general mechanics of overriding the base URL, including the auth-token gotchas that bite people who set the wrong variable, see the dedicated custom base URL guide.

Want to point a Mac agent at your own llama.cpp build?

Bring your setup and we will walk through serving a tool-capable model and wiring it into a native session in a few minutes.

Questions people actually search

Frequently asked questions

Is there a llama.cpp June 2026 release I should download?

There is no single named 'June 2026' release. llama.cpp ships a fresh build on almost every merged commit, tagged with the form bNNNN. June 2026 produced hundreds of them. The newest at the time of writing is b9723, cut on June 19, 2026. If you want a stable point to pin, pick the highest bNNNN tag on the releases page that has the platform binary you need and treat that as your version.

What is the latest llama.cpp build number?

As of June 19, 2026 the latest tag is b9723. The repository was producing roughly a dozen builds in a single nine-hour window that day, so by the time you read this the number is almost certainly higher. The tag is a monotonic counter, not semantic versioning, so a higher number is simply newer, not a major or breaking release.

Why does the June build matter for a Mac AI agent specifically?

Because llama-server, the HTTP server that ships inside llama.cpp, now answers the Anthropic Messages API at /v1/messages and /v1/messages/count_tokens natively. That is the same wire protocol a Claude Code agent loop talks. So a current build is not just a faster inference engine, it is a drop-in local backend for any tool that lets you override the Anthropic base URL, including Fazm.

Do I need a proxy or shim like LiteLLM to use llama-server with Fazm?

Not anymore, for the basic case. Older guides put LiteLLM or a router in front to translate OpenAI format into the Anthropic Messages format. Since llama-server speaks /v1/messages itself, you can point Fazm's Custom API Endpoint straight at http://127.0.0.1:8080 and skip the translation layer. You only reach for a proxy now if you want routing, logging, or to mix several backends behind one URL.

Which local models actually work for the agent loop, not just chat?

The agent loop leans hard on tool calling and long context, which most small models handle poorly. The llama.cpp maintainers recommend tool-capable models such as Qwen3 Coder, Kimi K2, MiniMax M2, or Nemotron for agentic workloads. A 7B general chat model will answer questions but will frequently fail to emit a valid tool call, which shows up as the agent stalling or looping. Pick a model trained for tools and give it as much context window as your RAM allows.

Where exactly does Fazm send my requests once I set a custom endpoint?

Fazm writes your endpoint into the ANTHROPIC_BASE_URL environment variable for the agent process, sets FAZM_CUSTOM_API_ENDPOINT=true, and stops sending its bundled Anthropic key. It substitutes the placeholder key sk-fazm-custom-endpoint so a local gateway that accepts any key stays on the API-key path instead of triggering a cloud OAuth flow. You can read this in Desktop/Sources/Chat/ACPBridge.swift in the public repo.

Will my prompts leave my Mac if I run a local build?

No. When the endpoint points at 127.0.0.1, the request never leaves the machine. Fazm also disables its bundled cloud key in that mode, so there is no silent fallback to Anthropic. The model weights run under llama.cpp on your hardware and the agent loop, including the screen and accessibility context Fazm gathers, stays local.

What breaks when I swap cloud Claude for a local llama.cpp model?

Raw capability drops. Even a strong open model is not a frontier model, so multi-step plans, tricky refactors, and long tool chains are less reliable. Throughput also depends on your Mac. What does not break is the rest of Fazm: persistent sessions across restarts, one-click forking, no auto-compacting, voice input, and browser plus native-app control all behave the same regardless of which model answers the endpoint.