The llama.cpp latest release in 2026, and the new endpoint that lets a Mac client run on it
Two things are true about llama.cpp right now. It has no version number in the way you expect, and it quietly grew the one API surface that lets a native Claude Code app point at a fully local model. This page covers both, then shows the exact wiring.
The latest llama.cpp release is build b9741 (June 20, 2026).
llama.cpp does not ship semantic versions. It publishes a rolling stream of sequential build tags in the form b#### (b9739, b9740, b9741), and several can land in a single day. There is no 1.0, no LTS branch. Whatever number sits at the top of the releases page is the current release. By the time you read this it has almost certainly moved past b9741, and that is normal.
How to read llama.cpp release tags
People search for a "latest version" and expect a number that means something semantic. llama.cpp does not work that way, and trying to force it into that mental model is where most confusion comes from. The practical rules:
- The tag is just a counter. b9741 is the 9,741st CI build, not "version 9.741". A higher number is newer, that is the only thing it tells you.
- Each tag ships prebuilt binaries. macOS (arm64 and x64), Linux, Windows, Android, and accelerator variants for Metal, CUDA, Vulkan, ROCm, and SYCL. You rarely need to build from source unless you want a custom backend.
- Pin when you need reproducibility. If a workflow depends on exact behavior, record the b#### you tested and install that one, because behavior does change between builds.
- Read the release body, not the number. The interesting part of any tag is the commit list attached to it. That is where features like the one below first show up.
The change worth caring about: a native Anthropic Messages API
Buried in the recent build stream is the feature that actually changes what you can do with a local model. llama-server now exposes a native Anthropic Messages API. It serves POST /v1/messages and POST /v1/messages/count_tokens, converting Anthropic-format requests into its internal pipeline. The ggml-org write-up went up January 19, 2026.
Before this, connecting an Anthropic-style client to a local model meant running a translation proxy: the client spoke /v1/messages (Anthropic format), the server only spoke /v1/chat/completions (OpenAI format), and something like LiteLLM sat in the middle. With the native endpoint, on a recent build, the proxy disappears. That single fact is why this release matters more than the model-loading speedups that get all the changelog attention.
Start the server with a tool-capable model and it is listening on the Anthropic path immediately:
Why this connects to a native Mac app
Here is the part no other write-up on this topic makes. The Anthropic SDK that Claude Code uses builds its request URL by appending /v1/messages to whatever ANTHROPIC_BASE_URL is set to. Recent llama-server serves exactly that path. So if you set the base URL to http://127.0.0.1:8080, the entire Claude Code agent loop runs against your local model. No code changes, no fork of Claude Code, just an environment variable.
fazm is a native macOS app that wraps that exact agent loop over ACP, and it exposes the base URL as a settings field instead of an environment variable. Open Settings, Advanced, AI Chat, and paste the server URL into Custom API Endpoint. The relevant code, verbatim from the app source:
Two details in that snippet are the anchor of this whole approach. First, fazm replaces its bundled Anthropic key with the placeholder sk-fazm-custom-endpoint the moment you set a custom endpoint, so your subscription key never reaches a local or third-party server. Second, the endpoint validator only accepts an absolute http(s) URL with a host, which is why http://127.0.0.1:8080 works and a bare localhost:8080 is silently rejected and falls back to the default. fazm also detects local-server-specific failures (it recognizes an LM Studio or Ollama "no models loaded" error) and tells you to load a model instead of blaming the app.
What happens to a request, end to end
llama-server (local)
A b#### build of llama.cpp serving a tool-capable GGUF model. Recent builds expose POST /v1/messages natively, so it answers Anthropic-format requests on http://127.0.0.1:8080.
ANTHROPIC_BASE_URL
fazm's Custom API Endpoint setting. ACPBridge.swift writes your URL into ANTHROPIC_BASE_URL and swaps the bundled key for sk-fazm-custom-endpoint so nothing leaks to Anthropic.
Claude Code agent loop
The same agent loop fazm always runs, over ACP. The Anthropic SDK appends /v1/messages to your base URL, which lands on llama-server. Tool calls, file edits, MCP servers all still work.
Native macOS UI
Persistent windows that survive a restart, one-click chat forking, no auto-compacting. Now backed by a model running entirely on your Mac.
Local model in a terminal vs. local model in fazm
The model is the same either way. What differs is everything around the agent loop.
| Feature | Raw Claude Code | fazm |
|---|---|---|
| Where the model runs | Anthropic's servers (cloud round-trip every turn) | Your Mac, via local llama-server on 127.0.0.1 |
| Setup to switch backends | Export ANTHROPIC_BASE_URL, relaunch the terminal | Paste the URL into Settings > Advanced > AI Chat, no env juggling |
| Session after a restart | Gone; you re-establish context by hand | Auto-restored window with full history intact |
| Forking a conversation | Manual session-id dance, if at all | One click; new window with full prior context |
| Long-session context | Auto-compacting silently drops earlier decisions | No auto-compacting for the life of the window |
fazm wraps the real Claude Code loop, so the agent capabilities are identical. The differences are in session durability and UX, not in what the agent can do.
The honest caveats
Routing an agent at a local model is genuinely useful, but it is not free. A few things to set expectations on before you switch:
- You need a tool-capable model. The Claude Code loop calls tools to read files, run commands, and edit code. A chat-only GGUF will connect and answer, but it will stumble on multi-step work. Use a function-calling model such as a recent Qwen3 quant.
- Quality is a real tradeoff. At a size and quantization that fits on a laptop, a local model usually will not match a frontier Claude model on hard refactors. The reason to do this is privacy, offline use, and cost, not raw capability.
- Build age matters. The native
/v1/messagesendpoint is recent. On an older b#### you will not have it, and you are back to a LiteLLM-style shim in front of the OpenAI route. - You can keep both. The Custom API Endpoint is a toggle. Leave fazm on your Claude Pro or Max account for the hard work and flip to the local server when you want everything to stay on the machine.
If the base URL part is what you came for, the dedicated walkthrough on setting a Claude Code custom base URL goes deeper, and the broader map of AI agents for macOS covers where a native-UI agent fits among the alternatives.
Want help wiring a local model into a real agent UI?
Bring your llama-server setup and we will get fazm pointed at it, with persistent sessions and forking, in a few minutes.
llama.cpp release and local-agent FAQ
What is the latest llama.cpp release in 2026?
As of June 20, 2026, the latest tagged release on GitHub is build b9741. llama.cpp does not use semantic versions like 1.0 or 2.3.1. It ships a rolling stream of sequential build tags in the form b#### (b9739, b9740, b9741, and so on), and several can land on a single day. To get the truly current one, read the top entry at github.com/ggml-org/llama.cpp/releases; whatever number is there is the latest.
Why does llama.cpp not have a normal version number?
It is a fast-moving inference engine, not a packaged product. Each merged batch of commits gets a CI-built tag (b####) with prebuilt binaries for macOS arm64/x64, Linux, Windows, Android, and the GPU backends (Metal, CUDA, Vulkan, ROCm, SYCL). There is no release train and no LTS line. You pin to a specific b#### if you need reproducibility, otherwise you take the newest.
What changed recently that matters for agent use?
llama-server gained a native Anthropic Messages API. It exposes POST /v1/messages and POST /v1/messages/count_tokens, converting Anthropic-format requests to its internal OpenAI-style pipeline. That means any client that speaks the Anthropic Messages API, including Claude Code, can talk to a local model by setting ANTHROPIC_BASE_URL, with no translation proxy in between. The ggml-org write-up went up January 19, 2026.
How do I point Claude Code at a local llama.cpp model?
Start llama-server with a tool-capable GGUF model, then set ANTHROPIC_BASE_URL to the server address (default http://127.0.0.1:8080). The Anthropic SDK that Claude Code uses appends /v1/messages to that base URL, which is exactly the path recent llama-server serves. You also set a throwaway ANTHROPIC_API_KEY because local servers accept any key.
Does fazm run on llama.cpp?
Not by default. fazm wraps the real Claude Code (and Codex) agent loop, which normally talks to Anthropic. But fazm has a Custom API Endpoint setting under Settings > Advanced > AI Chat. When you set it, ACPBridge.swift writes that value into ANTHROPIC_BASE_URL and replaces the bundled key with the placeholder sk-fazm-custom-endpoint, so the whole agent loop is routed to your endpoint instead. Point it at a local llama-server and the agent runs on your GGUF model.
Do I need a translation proxy like LiteLLM?
On a recent llama.cpp build, no. The native /v1/messages endpoint speaks Anthropic format directly. On older builds, or with a backend that only exposes the OpenAI /v1/chat/completions route, you put an Anthropic-to-OpenAI shim such as LiteLLM in front and point the base URL at the shim. fazm does not care which one it is; it just forwards whatever URL you set into ANTHROPIC_BASE_URL.
What kind of model do I need for the agent to actually work?
A model with tool-calling (function-calling) support. The Claude Code loop is agentic: it reads files, runs commands, and calls tools, which requires the model to emit structured tool calls. The llama.cpp Claude Code example uses a tool-capable model such as Qwen3 quantized to GGUF. A plain chat-only model will connect but will not drive multi-step edits well.
Will a local model match Claude on coding quality?
Honestly, usually not at the same size and quantization. The point of routing fazm at a local model is privacy, offline use, cost, and control, not beating a frontier model. A practical pattern is to keep fazm on your Claude Pro or Max account for hard work and flip the Custom API Endpoint on when you want everything to stay on the machine.
Comments (••)
Leave a comment to see what others are saying.Public and anonymous. No signup.