vLLM release notes 2026, and the one toggle that turns a vLLM server into a Mac agent
Every release note page lists the server-side wins. gRPC serving, GPU speculative decoding, Gemma 4 support, async scheduler on by default, CVE-2026-0994 patched. None of them finish the sentence. This guide walks the 2026 changelog with real version numbers, then points at the single field in an MIT-licensed Mac agent that turns whatever you serve on port 8000 into real clicks and keystrokes inside Mail, Finder, Chrome, or any other Mac app.
THE CHANGELOG, IN ORDER
Every vLLM 2026 release that matters, with dates
Four rows. Two majors, one release candidate, one security patch. Drawn from the official vllm-project/vllm releases page and the NVIDIA vLLM Release Notes PDF dated March 2026.
v0.18.0 (late March 2026)
Native gRPC serving via --grpc, alongside the existing HTTP/REST and OpenAI-compatible endpoints. NGram speculative decoding moved from CPU to GPU and became compatible with the async scheduler. KV cache offloading got smarter: FlexKV was added as a new backend, multiple KV groups landed, and only frequently reused blocks get offloaded to CPU. vllm launch render let multimodal preprocessing run on CPU-only nodes separated from GPU inference. Production teams serving at scale were the primary audience.
v0.19.0 (April 2, 2026)
Full Google Gemma 4 architecture. All four variants: E2B (effective 2B), E4B (effective 4B), 26B MoE, 31B Dense. Native MoE routing, multimodal inputs, reasoning traces, tool-use handled inside vLLM. The async scheduler, which overlaps engine scheduling with GPU execution, is now on by default with no configuration. Model Runner V2 landed. Intel XPU picked up CUDA graph support and GPUDirect RDMA via NIXL. BF16 cross-compilation for ARM CPU, FP16 + vector intrinsics for s390x (IBM Z), and prefix caching for ppc64le (POWER) all shipped.
v0.19.1rc0 (April 3, 2026)
Release candidate with stability fixes stacked on v0.19.0. If you are early-adopting Gemma 4 or the async-scheduler-by-default path, this is the version to reach for before v0.19.1 stable.
Security patch: CVE-2026-0994 (April 2026)
A critical deserialization vulnerability in the prompt_embeds handling of the Completions API endpoint, affecting vLLM 0.10.2 and later. Patched in the April release cycle. If you run vLLM in production and expose /v1/completions, upgrade. If you only expose /v1/chat/completions and you strip prompt_embeds at the gateway, exposure is smaller but not zero.
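A gateway-side mitigation can be sketched as a pure filter that drops the field before the request reaches vLLM. This is a minimal illustration, not a hardened proxy: the prompt_embeds field name comes from the CVE description, and the function name and shapes are invented here.

```python
import json

def strip_prompt_embeds(raw_body: bytes) -> bytes:
    """Drop prompt_embeds from a /v1/completions body before it reaches vLLM.

    Narrows exposure to CVE-2026-0994 at the gateway; it is not a
    substitute for upgrading to the patched release.
    """
    payload = json.loads(raw_body)
    payload.pop("prompt_embeds", None)  # remove the vulnerable field if present
    return json.dumps(payload).encode()

cleaned = strip_prompt_embeds(b'{"model": "m", "prompt": "hi", "prompt_embeds": "AAAA"}')
assert b"prompt_embeds" not in cleaned
```

A filter like this belongs in whatever reverse proxy already fronts the server; the point is that it reduces exposure, not that it removes the need to patch.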
The numbers that characterize vLLM in 2026
Drawn from the April 2026 release data. Use these when someone asks 'is vLLM worth standing up for my team' in 2026 terms, not 2024.
The commit and contributor volume behind v0.19.0 alone marks vLLM as a high-velocity server project, not an experimental one. Production deployments in 2026 are a reasonable default, not an edge case.
THE GAP IN THE COVERAGE
Every release note ends at 'now run it.' This page keeps going.
Read the vLLM v0.19.0 GitHub release. You will learn about async-by-default scheduling, Gemma 4 MoE routing, and NIXL-backed GPUDirect RDMA. What you will not learn is how to take whatever model vLLM is serving and make it drive Finder, Mail, or a Chrome tab on your laptop. That question has nothing to do with the inference engine and everything to do with the substrate between the model and the operating system.
That substrate has to read app state as structured data (not pixels), expose a tool-calling surface the model can drive, and be swappable at the model endpoint. Almost nothing in the consumer space does all three. Fazm is the one I know about. It is MIT-licensed at github.com/mediar-ai/fazm, reads the macOS accessibility tree directly, and its model endpoint is a single field in Settings that maps cleanly onto ANTHROPIC_BASE_URL.
The rest of this page is the second half of the story the vLLM release notes stop telling.
“The release note tells you what got faster. It does not tell you what you can now do on a Mac with the endpoint it just produced.”
Fazm Settings > Advanced > AI Chat > Custom API Endpoint
THE ANCHOR
The exact lines in Fazm that accept your vLLM endpoint
No marketing, no recompile. The entire swap lives in ACPBridge.swift at lines 379 to 382, which read the customApiEndpoint UserDefault and set it as ANTHROPIC_BASE_URL on the Node bridge process. Put an Anthropic-compatible shim in front of your vLLM deployment and the value you paste here is the shim URL. Fazm talks to the shim, the shim talks to vLLM, vLLM talks to the model.
The default the bridge falls back to when nothing is configured is a single constant in the Node bridge source.
The Settings UI the user actually touches, preserved from the Swift source at SettingsPage.swift lines 906 to 952.
How your vLLM server becomes a Mac agent
From a running vLLM instance to a model clicking buttons on your laptop, the hops are short. vLLM sits on the left. An Anthropic-compatible shim sits in the middle, translating /v1/messages shape into vLLM's OpenAI-compatible /v1/chat/completions shape. Fazm sits on the right, listening on ANTHROPIC_BASE_URL.
vLLM 0.19 + Anthropic shim + Fazm on your Mac
The middle hop is the piece you bring. Both ends already ship: vLLM is the server, Fazm is the desktop agent. 2026 is the year both ends got mature enough that the middle hop is a copy-paste exercise, not a weekend project.
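The core of that middle hop can be sketched in a few lines of Python. This is an illustrative reduction, not LiteLLM's or claude-code-router's implementation: it maps the Anthropic Messages fields this page discusses (system, messages, max_tokens) onto the OpenAI chat-completions shape vLLM serves, and deliberately skips streaming and tool calls.

```python
from typing import Any

def anthropic_to_openai(body: dict[str, Any]) -> dict[str, Any]:
    """Map an Anthropic /v1/messages body to an OpenAI /v1/chat/completions body."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message in the list.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    for msg in body["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten the text ones.
        if isinstance(content, list):
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": body["model"],            # vLLM routes on the served model name
        "messages": messages,
        "max_tokens": body["max_tokens"],  # required by Anthropic, passed through
        "temperature": body.get("temperature", 1.0),
    }

openai_body = anthropic_to_openai({
    "model": "gemma-4-26b-moe",
    "max_tokens": 128,
    "system": "You are a Mac agent.",
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Open Mail"}]}],
})
assert openai_body["messages"][0]["role"] == "system"
assert openai_body["messages"][1]["content"] == "Open Mail"
```

The response direction, plus streaming and tool_use blocks, is where real shims earn their keep; this sketch only shows why the request side is a copy-paste exercise.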
The request path in sequence form
If you prefer to see it as who-talks-to-whom, here is the round trip of one chat turn.
One turn: Fazm -> shim -> vLLM -> shim -> Fazm
What it looks like end to end, in terminal form
This assumes vLLM is already running on localhost:8000 and an Anthropic-compatible shim on localhost:4000; the round trip is one real chat turn, no mockup.
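Under those assumptions, one turn against the stack can be sketched in Python. The ports and model name are illustrative; the request body is the Anthropic Messages shape the shim must accept.

```python
import json
import urllib.request

def build_turn(shim_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the Anthropic-shaped POST that a client of the shim would send."""
    body = {
        "model": model,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{shim_url}/v1/messages",  # Anthropic Messages path served by the shim
        data=json.dumps(body).encode(),
        headers={"content-type": "application/json"},
        method="POST",
    )

req = build_turn("http://localhost:4000", "gemma-4-26b-moe",
                 "List the windows in the frontmost app")
assert req.full_url == "http://localhost:4000/v1/messages"

# To fire the turn for real (requires vLLM and the shim to be running):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```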
Why reading the accessibility tree matters for vLLM-served models
vLLM v0.19.0's Gemma 4 can handle multimodal input, which means you could send it screenshots. You should not, if you have a choice. A screenshot is a lossy raster. The same UI state already exists as a typed tree of roles, titles, values, and positions via the macOS Accessibility API. Fazm reads that tree directly; that is what AppState.swift line 439 does.
The smaller the input, the fewer tokens the model spends reconstructing state from pixels. That saves context window, cuts latency, and leaves more budget for reasoning. On a self-hosted vLLM deployment where you are paying for every GPU-second, the difference compounds.
Screenshot-based agents vs. accessibility-tree agents
A screenshot goes into the context window. The model pays tokens to OCR 'Send' on a button. It also has to infer click coordinates in pixel space. A UI refresh or a font change can break everything.
- Lossy raster input
- Model wastes context on OCR
- Coordinate guessing, pixel-level
- Breaks on UI redesign or zoom
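To make the size argument concrete, here is a toy Python comparison. The three-node subtree is invented, and real AX trees are much larger, but they grow with the number of UI elements rather than with screen resolution.

```python
import base64
import json

# A hypothetical serialization of a three-node AX subtree: role, title,
# value, position. Structured, typed, and tiny.
ax_subtree = json.dumps([
    {"role": "AXButton", "title": "Send", "pos": [712, 540]},
    {"role": "AXTextField", "title": "To:", "value": "a@b.com"},
    {"role": "AXTextArea", "title": "Body", "value": "Hi"},
])

# Even a tiny 100x100 RGB screenshot, base64-encoded for a JSON API,
# dwarfs the structured version before the model reads a single token.
fake_screenshot = base64.b64encode(bytes(100 * 100 * 3))

assert len(ax_subtree) < len(fake_screenshot) // 100
```

Real screenshots are compressed, so the ratio varies, but the structured tree wins by orders of magnitude long before the model starts paying tokens to OCR it.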
Default Fazm vs. Fazm pointed at your vLLM box
| Feature | Default Fazm | Fazm -> shim -> vLLM |
|---|---|---|
| Model family | Claude Sonnet 4.6 (default), Opus selectable | Whatever vLLM is serving (Gemma 4, Llama 4, DeepSeek, Qwen3, ...) |
| Configured via | Model picker in floating bar | Settings > Advanced > AI Chat > Custom API Endpoint |
| Underlying env var | None required | ANTHROPIC_BASE_URL (set by Fazm from customApiEndpoint UserDefault) |
| Middle hop | None | Anthropic-to-OpenAI shim (LiteLLM, claude-code-router, custom) |
| Tool-calling fidelity | First-class (native Anthropic) | Depends on the shim's tool_use <-> function-calling mapping |
| Cost model | Per-token to Anthropic | GPU-seconds you already own, or managed vLLM provider pricing |
| Data exfil risk | Prompts leave your machine | Prompts can stay entirely on your network if vLLM is local |
| Recompile required | No | No |
End-to-end quality of the vLLM path is a function of your shim plus the model you load, not of Fazm itself.
THE SPECIFIC FILES TO GREP
Verify every claim on this page from the source
Fazm is MIT-licensed. Clone github.com/mediar-ai/fazm and search for these exact strings. No marketing, no hand-waving.
- Desktop/Sources/Chat/ACPBridge.swift:379. Four lines that read customApiEndpoint from UserDefaults and export ANTHROPIC_BASE_URL to the Node subprocess environment before spawning it.
- Desktop/Sources/MainWindow/Pages/SettingsPage.swift:906. The Settings card titled 'Custom API Endpoint', with the 'https://your-proxy:8766' placeholder and a toggle that clears the value on disable.
- acp-bridge/src/index.ts:1245. DEFAULT_MODEL = "claude-sonnet-4-6". The one constant that defines what runs when no endpoint is set.
- Desktop/Sources/AppState.swift:439. AXUIElementCreateApplication + kAXFocusedWindowAttribute. This is how the model sees your Mac: as a typed accessibility tree, not a screenshot.
What 2026 changed about the vLLM + agent pairing
Four shifts, each concrete. Pulled from the v0.18.0 and v0.19.0 release notes and the MIT Fazm source.
Async scheduler is default in v0.19.0
No more manual --enable-async-output-proc flag. Spec decoding and scheduling overlap GPU execution automatically. For interactive agent turns with tool calls, this is the single biggest latency win of 2026.
Gemma 4 MoE, natively
26B total, roughly 8B active. Dense-model reasoning at MoE cost. A good default for self-hosted agent backends in 2026.
gRPC for fleets, HTTP for you
v0.18.0's --grpc mode is for team infra. Single-Mac users keep the plain OpenAI-compatible HTTP path. Both work behind an Anthropic shim.
CVE-2026-0994 is real
If you expose /v1/completions, upgrade. Do not skip the security patch because your changelog summary buried it under Gemma 4.
One field in Fazm closes the loop
Custom API Endpoint in Settings > Advanced > AI Chat. Set it to your shim URL. Done. No recompile, no fork, no .env file editing. The 2026 version of 'plug in any model.'
The shims that actually work in front of vLLM
Fazm speaks Anthropic Messages. vLLM speaks OpenAI chat completions. The shim is the middle hop: LiteLLM in Anthropic proxy mode, claude-code-router, or a small custom bridge. None of these ship with Fazm; they are independent projects. Each one presents an Anthropic-shaped endpoint and routes to whatever you put behind it, vLLM included.
Test the shim with your real tasks, not a benchmark. Streaming, parallel tool calls, and large tool results are where shim quality diverges, and they are exactly what matters in an agent loop.
Four-step quickstart
From vLLM running to Fazm driving apps
1. Serve a model in vLLM v0.19.0. vllm serve <model> --port 8000. Async scheduler is on by default. For Gemma 4, any of E2B, E4B, 26B MoE, 31B Dense.
2. Front it with an Anthropic shim. LiteLLM --config anthropic.yaml, or claude-code-router. Point upstream at localhost:8000.
3. Install Fazm, grant Accessibility. Download from fazm.ai. Grant macOS Accessibility permission when prompted. Fazm needs it to read AX trees.
4. Paste the shim URL in Custom API Endpoint. Settings > Advanced > AI Chat > Custom API Endpoint. Toggle on, paste, hit return. Done.
The honest tradeoffs
The vLLM path is not free. If you are shipping Fazm to production users with an opinion about model quality, this is where you need to be clear-eyed.
What you gain, what you pay
Gain: model sovereignty. You choose the model family, prompts can stay entirely on your network, and the cost model is GPU-seconds you already own instead of per-token billing.
Pay: shim dependency. Tool-calling fidelity now rides on the shim's tool_use <-> function-calling mapping, and a chat turn spans three processes instead of one when something breaks.
Running vLLM and want to wire it into a real Mac agent?
Talk through your shim choice, tool-calling edge cases, and whether your deployment shape fits Fazm's Custom API Endpoint pattern.
Book a call →

FREQUENTLY ASKED QUESTIONS
What are the vLLM release notes for 2026 so far?
Two major releases have landed in 2026: v0.18.0 in late March and v0.19.0 on April 2, 2026. v0.18.0 introduced native gRPC serving via the --grpc flag (running alongside HTTP/REST), moved NGram speculative decoding from CPU to GPU, and shipped a smarter KV cache offloading system with FlexKV as a new backend. v0.19.0 added full Google Gemma 4 architecture support (E2B, E4B, 26B MoE, 31B Dense), turned the async scheduler on by default, and updated Model Runner V2. v0.19.1rc0 followed on April 3, 2026 as a release candidate. A critical CVE-2026-0994 was patched in the April cycle, affecting the Completions API endpoint in versions 0.10.2 and later.
What is CVE-2026-0994 and do I need to upgrade?
CVE-2026-0994 is a deserialization vulnerability in the prompt_embeds handling of vLLM's Completions API endpoint, affecting vLLM 0.10.2 and later. If you are running any production vLLM deployment that exposes the Completions API, you should upgrade to the patched v0.19.x release. If the server is already behind a gateway that strips prompt_embeds, the exposure is smaller, but upgrading is still the right answer. Do not rely on patch-level isolation alone.
Why does vLLM gRPC serving in v0.18.0 matter for desktop agent workflows?
It probably does not matter for you as a single Mac user. gRPC is a server-to-server protocol and a latency optimization for fleets: binary payloads, HTTP/2 multiplexing, less serialization overhead than JSON. If you are running one vLLM instance on a workstation and routing a single laptop through it, the vanilla OpenAI-compatible HTTP endpoint at /v1/chat/completions is fine. The gRPC path is useful when you are standing up vLLM as shared infrastructure for a team or a product, and an Anthropic-compatible shim like LiteLLM sits in the middle talking HTTP to the client and either gRPC or HTTP to vLLM.
Can I actually use vLLM as the backend for Fazm?
Yes, with one caveat. Fazm's chat engine speaks the Anthropic Messages API shape, and vLLM's first-class API is OpenAI-compatible. You need an Anthropic-to-OpenAI shim between them. LiteLLM running in Anthropic proxy mode, claude-code-router, or a small custom FastAPI bridge all work. Point the shim at your vLLM server, then paste the shim URL into Fazm's Custom API Endpoint setting. Fazm's ACPBridge.swift at lines 379 to 382 reads that value from UserDefaults and exports it as ANTHROPIC_BASE_URL on the Node subprocess it spawns. No recompile, no fork. The whole switch is one field in Settings.
Where exactly is the Fazm code that enables this?
Three files. Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382 contain the four-line block that reads customApiEndpoint from UserDefaults and sets env["ANTHROPIC_BASE_URL"] on the spawned Node process. Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 906 to 952 define the Settings card titled 'Custom API Endpoint' with the placeholder 'https://your-proxy:8766' and a toggle that clears the value on disable. acp-bridge/src/index.ts line 1245 declares DEFAULT_MODEL = 'claude-sonnet-4-6', which is what the bridge warms up when nothing overrides it. All three are in the MIT-licensed repository at github.com/mediar-ai/fazm.
What is the practical latency overhead of this stack compared to calling vLLM directly?
Most of the measurable overhead is the Anthropic-to-OpenAI translation, not the network. If your shim is on the same machine as vLLM, add a few milliseconds for JSON reshaping. If the shim is on a different box, add a normal HTTP hop. The real variability comes from tool-calling translation: Anthropic tool_use blocks have a slightly different shape than OpenAI function calling, and a sloppy shim can double-round-trip on parallel tool calls. Start with LiteLLM or claude-code-router; both are reasonable defaults in 2026. Only hand-roll a shim if you are measuring a specific loss.
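The tool_use <-> function-calling translation reduces to a small mapping. A sketch of one direction, an OpenAI tool call out of vLLM into the Anthropic tool_use block the client expects; the shapes follow the two public API formats, and the function name here is invented. The key detail is that OpenAI serializes arguments as a JSON string while Anthropic wants a parsed object.

```python
import json

def openai_tool_call_to_anthropic(call: dict) -> dict:
    """Turn one OpenAI-style tool call into an Anthropic tool_use content block."""
    return {
        "type": "tool_use",
        "id": call["id"],
        "name": call["function"]["name"],
        # OpenAI ships arguments as a JSON string; Anthropic expects an
        # object. A shim that skips this parse hands the agent a string
        # where it expects a dict, and the turn degrades or retries.
        "input": json.loads(call["function"]["arguments"]),
    }

block = openai_tool_call_to_anthropic({
    "id": "call_1",
    "type": "function",
    "function": {"name": "click", "arguments": '{"x": 120, "y": 88}'},
})
assert block["input"] == {"x": 120, "y": 88}
```

The reverse direction (Anthropic tool_result back into an OpenAI tool message) is the other half a shim must get right, especially for parallel calls.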
Does Gemma 4 support in vLLM v0.19.0 help or hurt agent use cases?
It helps. The Gemma 4 lineup in v0.19.0 covers E2B (effective 2B), E4B (effective 4B), 26B MoE, and 31B Dense, with native MoE routing, multimodal inputs, reasoning traces, and tool-use handled inside vLLM. For agent workflows that live and die by tool use and structured output, the 26B MoE variant is particularly interesting on self-hosted hardware because you get dense-model reasoning quality at MoE active-parameter cost. Paired with Fazm's accessibility-tree substrate (structured state rather than screenshots) the model does not have to burn capacity on OCR.
Why do the top vLLM release notes sources not mention desktop agents?
Because vLLM is server infrastructure and its audience is ML ops. The official changelog is written for people operating inference clusters. None of the mainstream release note summaries cover what to do once you have vLLM serving, beyond 'point your app at it'. The missing half of the story is the consumer substrate: something running on the user's own Mac that turns model output into actual clicks and keystrokes in real apps. Fazm's Custom API Endpoint field is a consumer-grade answer to that question, and it happens to plug straight into a vLLM deployment via any Anthropic-compatible shim.
What should I read alongside the vLLM 2026 release notes?
The vllm-project/vllm GitHub releases page for the raw changelog and commit lists. NVIDIA's vLLM Release Notes PDF for NGC-packaged versions and container compatibility. The vLLM blog at blog.vllm.ai for design rationale on larger changes. Fazm's April 2026 vLLM update post for release commentary. And Fazm's own source tree at github.com/mediar-ai/fazm to see how a consumer-side agent hooks into the endpoint that vLLM exposes.
Is this only useful for self-hosting, or does it work for managed vLLM too?
Either. Anyscale, Modal, RunPod, Fireworks, and Together all offer managed vLLM-style deployments in 2026, typically behind an OpenAI-compatible endpoint. The Anthropic-to-OpenAI shim sits in front of whatever URL your provider gives you. The only thing Fazm cares about is that the endpoint behind ANTHROPIC_BASE_URL speaks Anthropic Messages. How you got there, self-hosted vLLM on a single H100, a multi-node gRPC cluster, or a managed provider, does not change the Fazm side.