vLLM release notes, April 2026: which v0.19.0 changes survive the Anthropic-shape shim a Mac agent has to put in front
vLLM v0.19.0 shipped 448 commits on April 3, 2026. Every other roundup transcribes the changelog. This one re-reads each release-notes item from the seat of someone driving the server with a desktop Mac AI agent through an Anthropic-shape translator, and sorts each line into one of three buckets: reaches you, dies at the shim, or irrelevant for an interactive agent loop.
The vLLM v0.19.0 numbers, then the numbers that matter for a Mac agent operator
The first number, 448 commits, is what the SERP transcribes from the vLLM release notes. The other three are the shape of the surface those commits actually have to drive when the consumer is a Mac AI agent and not an OpenAI client: one env var line that does the routing, seven obligations the Anthropic-shape shim has to satisfy, and a five-MCP-server tool surface on the agent side. None of the existing roundups describe them.
“Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)”
Desktop/Sources/Chat/ACPBridge.swift lines 379-381, April 2026
The anchor: one env var line is the entire bridge between vLLM v0.19.0 and a Mac agent
Fazm's bundled ACP subprocess is Anthropic-native. It speaks POST /v1/messages with content_block_delta streaming. To swap the brain out for a self-hosted vLLM v0.19.0, exactly one env var line in the subprocess environment changes.
The matching settings UI sits in SettingsPage.swift, lines 906-952. The text field placeholder is literally https://your-proxy:8766, and the helper text reads: "Route API calls through a custom endpoint (e.g. corporate proxy, GitHub Copilot bridge). Leave empty to use the default Anthropic API." That field is the entire seat for plugging vLLM v0.19.0 into a Mac AI agent. None of the v0.19.0 release-notes roundups describe the contract on the other end of that field, which is why most of them mis-rank which v0.19.0 features actually reach an agent.
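The routing decision behind that field is small enough to sketch. The real bridge is Swift; this is a hypothetical Python illustration of the contract described above (a non-empty Custom API Endpoint injects ANTHROPIC_BASE_URL into the subprocess environment; an empty field falls back to the default Anthropic API), not Fazm's actual code:

```python
import os

def bridge_env(custom_endpoint):
    """Build the ACP subprocess environment.

    Hypothetical sketch of what ACPBridge.swift does around line 381:
    a non-empty Custom API Endpoint injects ANTHROPIC_BASE_URL so the
    Anthropic-native subprocess talks to the shim instead of
    api.anthropic.com.
    """
    env = dict(os.environ)
    if custom_endpoint:
        # Non-empty field: route every POST /v1/messages at the shim.
        env["ANTHROPIC_BASE_URL"] = custom_endpoint.rstrip("/")
    else:
        # Empty field means "use the default Anthropic API".
        env.pop("ANTHROPIC_BASE_URL", None)
    return env

# Pointing the subprocess at a local shim on port 8766:
env = bridge_env("http://127.0.0.1:8766/")
```

One dictionary entry is the whole swap; everything else in the agent loop is unchanged.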
Where the Anthropic-shape shim has to sit
Two sources on the left, one hub in the middle, two destinations on the right. The hub is the only piece that mediates the shape mismatch between an Anthropic-native Mac agent and an OpenAI-shape vLLM v0.19.0.
Fazm + ACPBridge -> Anthropic-shape shim -> vllm serve v0.19.0
Reading the diagram: the /v1/chat/completions/batch endpoint is shown on the right because it lives on vllm serve. The hub does not forward it. That is what "dies at the shim" looks like geometrically.
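The hub's core job can be sketched in a few lines. This is a minimal, assumption-laden request translation (Anthropic /v1/messages body in, OpenAI /v1/chat/completions body out) covering only the system prompt, text content blocks, and the stream flag; real translators like claude-relay or LiteLLM's anthropic adapter also map tool definitions, tool_result blocks, and images:

```python
def messages_to_chat_completions(body):
    """Translate an Anthropic POST /v1/messages request body into the
    OpenAI /v1/chat/completions shape that vllm serve speaks natively.
    Text-only sketch; tool and image blocks are out of scope here.
    """
    messages = []
    if body.get("system"):
        # Anthropic carries the system prompt as a top-level field.
        messages.append({"role": "system", "content": body["system"]})
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):
            # Anthropic content is a list of typed blocks; flatten the text ones.
            content = "".join(
                block.get("text", "")
                for block in content
                if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": body.get("model", ""),
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "stream": body.get("stream", False),
    }
```

Note what is absent: there is no branch for /batch. The hub never sees that path, which is the geometric point above.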
The vLLM v0.19.0 release-notes items, sorted for a Mac agent seat
Same release notes everyone else has. Different rubric. Reaches you means the change shows up in your agent loop the day you point Fazm at the shim. Dies at the shim means the change lives on vllm serve but the Anthropic-shape translator does not forward it, so the agent never sees it. Irrelevant means it is server-side ops or hardware-only.
Gemma 4 family support
Reaches you. The model itself is what runs behind vLLM. The Anthropic-shape shim sits in front and is independent of which model you point vLLM at. Requires transformers>=5.5.0 in your vLLM environment.
Online MXFP8 quantization
Reaches you. Lower latency per token shows up directly in agent loop response time, which is the variable a Mac agent operator can feel.
CPU KV cache offloading
Reaches you. Long multi-turn Ask Fazm sessions blow past per-GPU VRAM eventually. Pluggable eviction policies let you keep recent tool_result blocks resident.
Model Runner V2 maturation
Reaches you. Piecewise CUDA graphs for pipeline parallelism plus rejection sampler support land in your throughput numbers regardless of which API shape sits in front.
/v1/chat/completions/batch
Dies at the shim. The Anthropic-shape shim listens on /v1/messages with streaming SSE, not on /batch. Useful for offline eval runs you trigger separately, irrelevant to interactive Ask Fazm turns.
New tool parsers (Gemma 4, GigaChat, Kimi-K2.5)
Reaches you only if the shim is thin. If your shim parses tool calls itself from the model's raw text, vLLM's native parser is bypassed. Configure the shim to forward vLLM's parsed structured output verbatim.
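The "forward verbatim" path is mechanical once vLLM's native parser has done the work. A sketch, assuming the thin-shim configuration described above and the two public API shapes (OpenAI tool_calls with a JSON-string arguments field; Anthropic tool_use with an object input field):

```python
import json

def tool_calls_to_tool_use(tool_calls):
    """Re-shape vLLM's parsed OpenAI-style tool_calls into Anthropic
    tool_use content blocks without re-parsing the model's raw text.
    The v0.19.0 native tool parser did the parsing; the shim only
    renames and re-nests fields.
    """
    blocks = []
    for call in tool_calls:
        fn = call["function"]
        blocks.append({
            "type": "tool_use",
            "id": call["id"],
            "name": fn["name"],
            # OpenAI carries arguments as a JSON string; Anthropic wants an object.
            "input": json.loads(fn["arguments"] or "{}"),
        })
    return blocks
```

If the shim instead regexes tool calls out of raw completion text, the new Gemma 4 / GigaChat / Kimi-K2.5 parsers buy you nothing.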
NVIDIA B300/GB300 support
Reaches you only if you are on that hardware. AllReduce fusion for SM 10.3 (Blackwell Ultra) lands in vllm serve startup; the shim and Fazm are unchanged.
Vision encoder CUDA graphs
Mostly irrelevant. Fazm's main loop reads the macOS accessibility tree as text (AppState.swift line 439), not pixels. Multimodal speedups land only if your shim is configured to forward Anthropic image content blocks.
One end-to-end turn: from Ask Fazm to vLLM v0.19.0 and back
Five frames, each a concrete step. This is the real shape of a single Ask Fazm turn when the model behind the curtain is your self-hosted vLLM v0.19.0. The five-MCP server surface and accessibility-tree text input are unchanged from the hosted setup; only the brain in the middle is swapped.
One Ask Fazm turn against vLLM v0.19.0 through an Anthropic-shape shim
1. You upgrade vllm serve to v0.19.0 on your dedicated GPU box
2. The Anthropic-shape shim, on its own port, keeps translating POST /v1/messages onto vLLM's /v1/chat/completions
3. You type an Ask Fazm query; the subprocess sends the turn, accessibility-tree text included, to ANTHROPIC_BASE_URL
4. The model's reply streams back through the shim as content_block_delta events; any tool_use block is dispatched against the five-MCP server surface
5. tool_result content blocks round-trip through the shim to the model, and the final answer lands in the chat
Two ways to point a client at vLLM v0.19.0, side by side
The standard OpenAI-shape path is what almost every v0.19.0 release-notes article tacitly assumes. The Anthropic-shape path via a translator is what a Mac AI agent has to take. The release-notes items reach the two seats differently.
| Feature | Standard OpenAI client path (direct) | Mac agent path (Anthropic-shape via shim) |
|---|---|---|
| API shape Fazm expects on the wire | OpenAI /v1/chat/completions with delta chunks (vLLM's default) | Anthropic /v1/messages with content_block_delta streaming |
| Translator required | No (if your client is OpenAI-shape) | Yes. claude-relay, LiteLLM's anthropic adapter, or one-file FastAPI shim |
| Where Fazm points the Custom API Endpoint | OpenAI clients point straight at vllm serve | At the translator, on a separate port from vllm serve |
| Env var that does the routing | OPENAI_API_BASE in OpenAI client config | ANTHROPIC_BASE_URL, set by ACPBridge.swift line 381 |
| Which v0.19.0 release-notes items reach you | Almost all of them, including /batch and structured-output flags | Engine performance, model support, quantization, KV cache; not /batch, not most server tuning |
| Tool-use JSON path | vLLM tool parser → OpenAI tool_calls → client dispatches | vLLM tool parser → shim translates to Anthropic tool_use envelope → Fazm dispatches |
| What breaks if you skip the shim | Nothing, this is the default vLLM path | Streaming format mismatch, 4xx on every turn, content_block_delta absent |
The v0.19.x release-notes timeline an operator should know about
v0.19.0 is the headline. The two patches that landed in April 2026 are the parts you actually want pinned in production. The zero-bubble + speculative decoding regression in particular is the kind of thing that release-notes summaries miss because it shipped as a hotfix.
March 31, 2026 — vLLM v0.18.1 patch
Fixed SM100 MLA prefill issues and DeepGEMM accuracy problems for Qwen3.5. Bundled forward into v0.19.0, so straight-line upgrades inherited the fix silently.
April 3, 2026 — vLLM v0.19.0 release
448 commits from 197 contributors. Gemma 4 family, Model Runner V2 maturation, /v1/chat/completions/batch, CPU KV cache offloading, NVIDIA B300, online MXFP8, new tool parsers for Gemma 4, GigaChat, Kimi-K2.5.
April 8, 2026 — vLLM v0.19.1 hotfix
Patched a regression in zero-bubble async scheduling when speculative decoding was enabled simultaneously. This is one of the headline new combinations from v0.19.0, so the regression hit fast.
April 14, 2026 — vLLM v0.19.2 patch
Cleaned up an MXFP8 numerical edge case for Llama-shape models. Stable as of this writing.
Standing up vLLM v0.19.0 + shim + Fazm in one terminal session
One vllm serve on port 8000, one claude-relay on port 8766, one defaults write that flips Fazm's Custom API Endpoint, and one Ask Fazm turn that exercises the loop. If anything breaks, the port separation tells you which side dropped the request.
The MXFP8 quantization line in the vllm serve output is the v0.19.0 release-notes item that actually reaches the agent loop. The shim line in claude-relay's output is the one that none of the other v0.19.0 roundups describe.
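Concretely, the session the section describes looks like the following. The vllm serve and claude-relay invocations are the ones cited elsewhere in this guide; the defaults domain is a placeholder (the real Fazm bundle identifier is not stated here), and you can equally paste the URL into Settings by hand:

```shell
# 1. vLLM v0.19.0 on the GPU box (port 8000), model per this guide's example
vllm serve google/gemma-4-31b-it --port 8000

# 2. Anthropic-shape translator on its own port (8766), flags as cited above
claude-relay --backend vllm --base-url http://127.0.0.1:8000 --port 8766

# 3. Flip Fazm's Custom API Endpoint (AppStorage key: customApiEndpoint).
#    <fazm-bundle-id> is a placeholder, not the real defaults domain.
defaults write <fazm-bundle-id> customApiEndpoint "http://127.0.0.1:8766"

# 4. If a turn fails, the port split isolates which side dropped it:
curl -s http://127.0.0.1:8000/v1/models   # does vLLM answer directly?
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8766/v1/messages -X POST   # does the shim answer?
```

Two processes, two ports, one setting: that separation is the whole debugging story.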
The seven things the Anthropic-shape shim has to do
This is the contract a translator in front of vLLM v0.19.0 has to satisfy for Fazm to drive the model end to end. Most one-file FastAPI shims people write get the first two right and miss the rest, then wonder why the agent stalls mid-session.
Anthropic-shape shim contract for vLLM v0.19.0
- Listens on POST /v1/messages and translates to vLLM's POST /v1/chat/completions
- Streams as Anthropic content_block_delta SSE events, not OpenAI delta chunks
- Forwards Anthropic tool_use input_schema as vLLM response_format / structured output
- Returns tool_result content blocks that round-trip through the shim back to the model on the next turn
- Maps Anthropic 429 rate-limit semantics so ACPBridge retries cleanly instead of stalling
- Either forwards vLLM's native tool parser output verbatim, or parses raw text into Anthropic tool_use envelopes itself, but not both
- Runs on a different port from vLLM so a single curl localhost test isolates which side dropped a request
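The second obligation, the one most home-rolled shims get half right, is worth spelling out. A minimal sketch of re-emitting one OpenAI streaming chunk (vLLM's native SSE shape) as an Anthropic content_block_delta event; text path only, and a full shim also emits message_start / content_block_start framing plus input_json_delta events for streamed tool_use arguments:

```python
import json

def delta_to_content_block_delta(chunk, index=0):
    """Map one OpenAI-shape streaming chunk onto an Anthropic
    content_block_delta SSE event. Covers only the text_delta case;
    framing events and tool-argument streaming are omitted.
    """
    text = chunk["choices"][0]["delta"].get("content") or ""
    event = {
        "type": "content_block_delta",
        "index": index,
        "delta": {"type": "text_delta", "text": text},
    }
    # SSE wire format: event line, data line, blank line.
    return "event: content_block_delta\ndata: %s\n\n" % json.dumps(event)
```

A shim that forwards OpenAI delta chunks unmodified satisfies bullet one and still stalls the agent, because ACPBridge is waiting for events shaped like this one.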
Drive your Mac with your self-hosted vLLM v0.19.0
Install Fazm, stand up an Anthropic-shape shim in front of your vllm serve, paste the shim URL into Custom API Endpoint, and the next Ask Fazm turn runs against your own vLLM. No app rebuild. No pixel piping.
Download Fazm →
Frequently asked questions
What did the vLLM v0.19.0 release notes ship on April 3, 2026?
vLLM v0.19.0 landed on April 3, 2026 with 448 commits from 197 contributors. The headline items in the release notes: full Gemma 4 family support (MoE, multimodal, reasoning, tool-use, requires transformers>=5.5.0), Model Runner V2 maturation with piecewise CUDA graphs for pipeline parallelism, zero-bubble async scheduling now compatible with speculative decoding, CPU KV cache offloading with pluggable eviction policies, a new /v1/chat/completions/batch endpoint for offline batch processing, NVIDIA B300/GB300 (Blackwell Ultra) support with AllReduce fusion for SM 10.3, online MXFP8 quantization for both MoE and dense models, NVFP4 accuracy fixes, the new QeRL quantization method, vision encoder CUDA graphs, DBO microbatch generalization across model types, and new model architectures including Cohere ASR, ColQwen3.5, and Granite 4.0 Speech, with new tool parsers for GigaChat, Kimi-K2.5, and Gemma 4. The follow-up patch releases v0.19.1 and v0.19.2 landed mid-April with regression fixes around SM100 MLA prefill and DeepGEMM accuracy on Qwen3.5.
Why does the vLLM v0.19.0 release reach a desktop Mac AI agent differently than it reaches an OpenAI client?
vLLM serves OpenAI-shape API on /v1/chat/completions by default. A consumer Mac agent like Fazm injects ANTHROPIC_BASE_URL into the bundled Claude Code/ACP subprocess (Desktop/Sources/Chat/ACPBridge.swift line 381) and expects the Anthropic POST /v1/messages contract with content_block_delta streaming. The two shapes do not line up. So the v0.19.0 release notes apply to a Mac agent only after an Anthropic-shape translator (claude-relay, LiteLLM's anthropic adapter, or a one-file FastAPI shim that maps /v1/messages onto vLLM's /v1/chat/completions) sits in front of vllm serve on a different port. Every release-notes line item then sorts into one of three buckets: reaches you through the shim, dies at the shim because the shim does not translate it, or is irrelevant for an interactive Mac agent loop.
Which vLLM v0.19.0 release-notes items reach a Mac agent through the Anthropic-shape shim and which die at the shim?
Reaches you: Gemma 4 model support (the model itself is what runs, the shape in front of it is independent), Model Runner V2 throughput improvements, online MXFP8 quantization (lower latency per token), NVIDIA B300 / Blackwell Ultra support if you are on that hardware, CPU KV cache offloading (longer context stays usable), DBO microbatch generalization, vision encoder CUDA graphs (only matters if your shim translates Anthropic image content blocks into vLLM multimodal input). Dies at the shim unless the shim explicitly forwards them: the new /v1/chat/completions/batch endpoint (the shim listens on /v1/messages, not on /batch), the new tool parsers (GigaChat, Kimi-K2.5, Gemma 4 native parsing) only matter if the shim hands the model's raw text back through Anthropic's tool_use envelope rather than parsing it itself, structured-output JSON Schema enforcement (the shim has to forward Anthropic's tool input_schema as vLLM's response_format), guided decoding flags. Irrelevant for an interactive Mac agent: most server-side tuning knobs and operations metrics, since the agent loop is pure request/response from the agent's seat.
How do I plug a self-hosted vLLM v0.19.0 server into Fazm's Custom API Endpoint?
Three pieces in front of one open-source model. First, run vllm serve with v0.19.0 (vllm serve google/gemma-4-31b-it --port 8000 or whatever model you want). Second, run an Anthropic-shape translator on a different port: claude-relay --backend vllm --base-url http://127.0.0.1:8000 --port 8766 is the simplest one-liner, or use LiteLLM's anthropic adapter as a long-running proxy. Third, open Fazm Settings, find AI Chat > Custom API Endpoint, flip the toggle on, paste http://127.0.0.1:8766. The setting is backed by the AppStorage key customApiEndpoint (Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 840) and the placeholder text in the field is exactly https://your-proxy:8766. When you save, ChatProvider.restartBridgeForEndpointChange runs and the next subprocess inherits ANTHROPIC_BASE_URL pointing at your shim. From that turn forward, every Ask Fazm query routes through your local vLLM v0.19.0.
Does vLLM v0.19.0's new /v1/chat/completions/batch endpoint help an interactive Mac agent at all?
No, not directly. The new batch endpoint is async by design: you submit a batch of prompts and poll for results, similar to OpenAI's batch API. A Mac agent like Fazm runs a tight interactive loop where each turn needs a streaming response within a couple of seconds because the user is watching. The shim that sits in front of vLLM listens on /v1/messages with streaming SSE; the batch endpoint is a separate path that the shim has no reason to forward. So the headline release-notes item that closes a real gap with SGLang for offline workloads is one of the v0.19.0 features that does not change how a Mac agent feels at all. Where it does help indirectly: if you run agent traces overnight to evaluate a new model, you can hit /v1/chat/completions/batch directly without going through the shim, then read the results in the morning.
Which vLLM v0.19.0 release-notes item is most useful in practice for a Mac agent operator?
Gemma 4 native tool-parser support, paired with online MXFP8 quantization. The Mac agent loop fires tool_use JSON content blocks every turn against a five-MCP server surface (fazm_tools, playwright, macos-use, whatsapp, google-workspace). When the model's tool-use JSON is malformed, the Claude Code runtime rejects the turn and the session stalls. v0.19.0's new Gemma 4 tool parser means vLLM itself parses the structured-output blocks, so the shim does not have to invent a brittle text-to-tool-use translation. Combined with MXFP8 dropping per-token latency on the same B-series hardware, the practical effect is that a 31B Gemma 4 dense behind a self-hosted vLLM v0.19.0 plus a thin Anthropic-shape shim feels closer to a hosted API than v0.18.x ever did.
What broke between v0.18.x and v0.19.0 that the release notes flag?
Three to know. v0.19.0 requires transformers>=5.5.0 for Gemma 4, so any environment that pinned an older transformers will fail model loading on first launch (the v0.19.0 release notes call this out under breaking changes). The DBO microbatch optimization generalization changed behavior for some custom model architectures that relied on the older single-architecture path. The v0.18.1 patch from March 31 fixed SM100 MLA prefill issues and DeepGEMM accuracy problems for Qwen3.5, and those fixes are bundled into v0.19.0, so if you skipped 0.18.1 and went straight to 0.19.0 you also picked up that fix silently. The v0.19.1 hotfix on April 8 then patched a regression in zero-bubble async scheduling when speculative decoding was enabled simultaneously, which is one of the headline new combinations. v0.19.2 on April 14 cleaned up an MXFP8 numerical edge case for Llama-shape models.
How does Fazm read the vLLM v0.19.0 release-notes page while you are mid-upgrade?
The Fazm desktop app captures the macOS accessibility tree of the foreground window through AXUIElementCreateApplication (Desktop/Sources/AppState.swift line 439) and feeds the structured tree to the model as text on every turn. While you are upgrading, you can have the GitHub release-notes page open in a browser and Terminal.app showing your vllm serve logs side by side, then ask Fazm to compare them. Because Fazm reads each app through accessibility APIs and Playwright (not by sending screenshots), it gets verbatim text from the release page and exact text of the log output, which means the model can pick out which release-notes items are relevant to the actual flags your vLLM is launched with.
Why is no other April 2026 vLLM roundup written from this seat?
Two reasons. First, the SERP for vllm release notes april 2026 is dominated by transcribed changelogs aimed at vLLM operators serving production traffic to OpenAI-shape clients, which is most of vLLM's userbase. Second, the small overlap of people who both run vLLM and want to drive a desktop Mac AI agent through it has, until recently, mostly been DIYing through OpenAI-shape clients (which means a different agent than Fazm) or through the hosted Anthropic API directly (which means no vLLM at all). Fazm's Custom API Endpoint setting plus the maturation of Anthropic-shape translators in early 2026 is what makes this seat a real one. None of the existing v0.19.0 roundups cover it.
What hardware do I need to run vLLM v0.19.0 well enough to drive a Mac agent in real time?
It depends on the model behind vLLM, not on vLLM itself. For interactive Ask Fazm latency (the agent feels live below about 2 seconds per turn), Gemma 4 9B or Qwen 3 7B on a single H100 with MXFP8 quantization is comfortable for a small group of users. Gemma 4 31B dense fits on one H100 80 GB and gives stronger tool-use JSON stability, at the cost of per-turn latency. The 754B-class GLM-5.1 is not a self-host story for a Mac agent operator, period. The v0.19.0 release notes also add Blackwell Ultra B300/GB300 support which matters if you are on that hardware tier, otherwise the SM 10.3 paths do not affect you. CPU KV cache offloading (also new in v0.19.0) is useful when you push past the per-GPU VRAM budget on long multi-turn agent sessions.
Source anchors for this guide, all verifiable in the Fazm desktop codebase as of April 2026: Desktop/Sources/Chat/ACPBridge.swift line 381 (ANTHROPIC_BASE_URL injection), Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 840-952 (Custom API Endpoint setting and the https://your-proxy:8766 placeholder), Desktop/Sources/Providers/ChatProvider.swift line 2101 (restartBridgeForEndpointChange that re-launches the ACP subprocess on save), and the v0.19.0, v0.19.1, and v0.19.2 release notes published by the vLLM project on April 3, April 8, and April 14, 2026.