VERSION LOOKUP / VERIFIED 2026-05-17

vLLM v0.16.0 shipped on February 25, 2026

Branch cut February 8, tag pushed February 25. 440 commits from 203 contributors. The headline is a WebSocket Realtime API at /v1/realtime built on Voxtral. The throughput claim is async scheduling plus pipeline parallelism, 30.8 percent E2E and 31.8 percent TPOT on that specific path. The thing no other writeup is honest about is that the Realtime API does not slot into a voice-first Mac agent the way the post titles imply, because desktop agents do their transcription on-device and send text over HTTP. Four lines of Swift in Fazm's macOS source absorb the whole upgrade. The Realtime WebSocket is not one of them.

Matthew Diakonov, Written with AI

Published May 17, 20267 min read

4.9from Cross-checked against the vllm-project release page, PR #33187, the vLLM blog, and the Fazm source tree

v0.16.0 tagged 2026-02-25, branch cut 2026-02-08

440 commits, 203 contributors, 7 new

WebSocket Realtime API at /v1/realtime, PR #33187

Async scheduling + PP, 30.8% E2E / 31.8% TPOT

ACPBridge.swift:527-530 only forwards HTTP

Direct answer, verified 2026-05-17

vLLM v0.16.0 was tagged on February 25, 2026.

Sources I cross-checked today: github.com/vllm-project/vllm/releases/tag/v0.16.0, pypi.org/project/vllm, the Realtime API PR at vllm-project/vllm#33187, and the design post at blog.vllm.ai/2026/01/31/streaming-realtime.html. All four agree on the date, the headline feature, and the breaking removals listed below.

PyPI install: pip install vllm==0.16.0. Container: vllm/vllm-openai:v0.16.0. The current latest tag has since moved to v0.21.0 on May 15, 2026; v0.16.0 is the February cut.

v0.16.0 - February 25, 2026

Branch cut - February 8, 2026

440 commits

203 contributors

7 new contributors

WebSocket Realtime API

PR #33187

/v1/realtime endpoint

Voxtral Mini 4B day-0

Async scheduling + PP

30.8% E2E throughput

31.8% TPOT improvement

Unified Parallel Drafting

BitBlas removed

Marlin 24 removed

reasoning_content removed

CVE-2026-0994 patched

The WebSocket Realtime trap, and why every voice-agent post about v0.16.0 glosses past it

The post titles you will read on v0.16.0 are some variation of "vLLM ships Realtime API for voice agents." That is technically true and practically misleading for most readers running a desktop agent on macOS. The Realtime API is a WebSocket protocol at /v1/realtime, mirroring the OpenAI Realtime interface. It opens a single long-lived connection and streams audio frames in and audio (or text) frames out. The client on the other end has to speak that protocol natively. There is no HTTP-only fallback.

A voice-first Mac agent does not work that way. The agent's microphone path runs on-device: hold a hotkey, capture audio frames, run them through a local speech-to-text model (the WhisperKit family is the common choice on Apple Silicon), then hand a plain text string to the chat engine. The chat engine then sends a standard POST /v1/messages (or its OpenAI Chat Completions equivalent) over plain HTTP. The audio never leaves the laptop. The server only ever sees text.

That architectural choice is not a Fazm quirk. It is what every privacy-conscious desktop agent does, and it is also what most agents that want low first-token latency do, because round-tripping audio over the public internet to a hosted vLLM server is slower than running tiny STT locally. The result is that v0.16.0's headline feature is interesting if you are building a phone-style voice product against your own vLLM box, and irrelevant if you are using a desktop agent that already does the transcription itself.

Two distinct paths, only one of which matters for a Mac agent

The throughput numbers, in context

The release page reports a 30.8 percent end-to-end throughput improvement and a 31.8 percent improvement in time-per-output-token. These are not whole-server numbers. They apply to the path that was completed in this release: async scheduling stacked with pipeline parallelism. If your deployment does not run pipeline parallelism, the line that changes is the one your traffic does not touch.

The honest read is two-pronged. For a multi-node serving cluster that already runs pipeline parallelism (typical for the largest open weights, where a single node cannot hold the model), v0.16.0 is a real upgrade. For a single A100 or H100 running a 32B-or-smaller model with tensor parallelism only, the change is a no-op. Read your launch flags before deciding whether the number is real for you.

0commits in v0.16.0

0contributors (7 new)

0%E2E throughput on async + PP

0%TPOT gain on the same path

What v0.16.0 removed (and what that breaks)

The breaking changes are the part you want to read before you pull the tag. None of them affect a downstream Anthropic-shaped client, because the chat surface is unchanged. They all affect either the server operator's quantization choice, an Intel XPU deployment, or any tool that parsed the deprecated reasoning_content field from streamed responses. That last one is the most common silent break: small wrappers and tools that started life on the v0.12 line and never moved.

BitBlas quantization, removed

If your serving config referenced BitBlas as a quantization backend, v0.16.0 will refuse to load. Move to one of the supported quant paths (AWQ, GPTQ, FP8) before the upgrade.

Marlin 24 quantization, removed

Marlin 24 quantized weights no longer load. The Marlin path itself stays for the supported precisions; only the 2:4 sparse variant is gone.

reasoning_content message field, removed

If any downstream consumer parsed the deprecated reasoning_content field on streamed responses, that consumer breaks on v0.16.0. The replacement has been the standard tool/output channel for cycles.

Deprecated pooling items, removed

The older pooling endpoints flagged deprecated in the v0.14 line are gone. Embeddings consumers should already be on the current pooling path; this just removes the fallback.

VLLM_ALL2ALL_BACKEND env var, removed

The all-to-all dispatch is now selected by the new dispatcher and the env var no longer takes effect. If your launch scripts still set it, they will look fine and silently do nothing.

IPEX deprecated for Intel XPU

Intel XPU users move to vllm-xpu-kernels, which gains MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE. IPEX still works in this release but is on the deprecation clock.

The anchor fact: four lines of Swift that ignore which vLLM you run

The reason a vLLM upgrade is a non-event for a Mac agent like Fazm is the same reason the Realtime API does not slot in: the wiring between the agent and the server is one HTTP base URL, and that is it. Open Desktop/Sources/Chat/ACPBridge.swift in the MIT-licensed Fazm repo, scroll to line 527, and this is the entire integration. The block reads a UserDefaults string ("customApiEndpoint") that the Settings page lets the user paste, and if it is non-empty it sets ANTHROPIC_BASE_URL on the Node ACP subprocess that wraps Claude Code.

That subprocess speaks Anthropic Messages over HTTP. It does not open WebSockets. It does not stream audio. It does not negotiate a Realtime session. So the version of vLLM behind the URL can move from v0.14 to v0.16.0 to v0.21.0 and the Swift side notices nothing, provided whatever Anthropic-to-OpenAI shim you have in the middle (LiteLLM, claude-code-router, or a small custom FastAPI bridge) keeps speaking the same HTTP shape.

Desktop/Sources/Chat/ACPBridge.swift

So should you upgrade?

The honest decision tree is short. If your stack still uses BitBlas, Marlin 24, or the reasoning_content field, do the migration before you upgrade; v0.16.0 will break you on launch. If you serve Voxtral or another audio model and you actually have a client that speaks WebSocket, the new /v1/realtime endpoint is the reason to pull the tag. If you run multi-node pipeline parallelism on a large model, the throughput line is real for you. In every other case, v0.16.0 is a fine but optional upgrade, and your downstream Mac agent will not notice the difference.

The post-v0.16 line has moved on. v0.18.0 in March added gRPC serving. v0.19.0 in April added Gemma 4 and patched CVE-2026-0994. v0.20.0 in late April raised the dependency floor to CUDA 13, PyTorch 2.11, and Transformers v5. v0.21.0 in May made the CUDA 13.0 wheel the PyPI default. If you are picking a version to pin in production today, v0.16.0 is one of the older supported tags; weigh the Realtime API against five months of further patching before pinning it.

Need a sanity check on which vLLM tag to pin?

Fifteen minutes on a call to walk through your stack: which vLLM cut, which shim, and how a Mac agent reaches it. No pitch attached.

Frequently asked questions

When did vLLM v0.16.0 actually ship?

February 25, 2026 on the vllm-project/vllm GitHub releases page, with the branch cut on February 8, 2026. Anything merged to main between February 8 and February 25 did not make the cut and landed in v0.17.x. The release totals 440 commits from 203 contributors, 7 of whom were new. Confirmed against github.com/vllm-project/vllm/releases/tag/v0.16.0 on May 17, 2026.

What is the headline feature of v0.16.0?

A new WebSocket-based Realtime API at /v1/realtime, mirroring the OpenAI Realtime interface. It was merged via PR #33187 and is built on the Voxtral realtime infrastructure that landed earlier in the cycle. The Realtime API speaks WebSocket and streams audio frames; it is not a new variant of the existing OpenAI Chat Completions or Anthropic Messages HTTP endpoints. The vLLM blog post 'Streaming Requests & Realtime API in vLLM' from January 31, 2026 covers the intent in detail at blog.vllm.ai/2026/01/31/streaming-realtime.html.

Did v0.16.0 actually move performance, or is the headline marketing?

It moved real numbers, but they apply to a narrow path. Async scheduling with pipeline parallelism is now fully supported, and the release notes cite a 30.8 percent E2E throughput improvement and a 31.8 percent TPOT (time-per-output-token) improvement on that path. The qualifier matters: those gains are measured for workloads that use pipeline parallelism, not for a single-GPU single-tensor-parallel setup. If you run one card and one shard, the change is invisible. If you run multi-node pipeline parallelism for a large model, the change is real.

What did v0.16.0 break or remove?

Five removals worth pinning before you upgrade. BitBlas quantization is gone. Marlin 24 quantization is gone. The deprecated reasoning_content message field is gone. Deprecated pooling items are gone. The VLLM_ALL2ALL_BACKEND environment variable is removed in favor of the new dispatch path. Separately, IPEX is deprecated for the Intel XPU backend in favor of vllm-xpu-kernels, which gains MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE support. If your service still reads reasoning_content from streamed responses, that breaks on v0.16.0.

Was there a security fix in v0.16.0?

Yes, a protobuf-related CVE (tracked as CVE-2026-0994 in the vLLM advisory feed) was patched in the v0.16.0 cycle. The earlier prompt_embeds deserialization issue that affected the Completions API in 0.10.2 and later was patched in the v0.19.x cycle, not here. If your production server exposes the Completions endpoint or is on the v0.10 to v0.15 line, the v0.19.x cycle is the more important upgrade for that specific issue.

Will the new Realtime API let me speak to my Mac agent and have it talk to my own vLLM server?

Probably not the way you are picturing it. The Realtime API at /v1/realtime is a WebSocket protocol that streams audio frames in and audio (or text) frames out. A voice-first desktop agent like Fazm does its speech-to-text on-device, then sends plain text to the model over Anthropic Messages HTTP. Those are two different protocols on two different connection types. The Fazm-to-vLLM path runs through an Anthropic-to-OpenAI shim (LiteLLM, claude-code-router, or a small custom FastAPI bridge) and only forwards HTTP requests speaking the Anthropic Messages shape. There is no place in that path to inject a WebSocket. To use the Realtime API end-to-end you need a client that opens a WebSocket directly to vLLM and streams microphone frames, which is a different application than a desktop coding agent.

What is the literal Swift wiring that decides which vLLM your Mac agent reaches?

Four lines at Desktop/Sources/Chat/ACPBridge.swift, lines 527 to 530, in the MIT-licensed Fazm repo at github.com/mediar-ai/fazm. The block reads UserDefaults.standard.string(forKey: 'customApiEndpoint'); if the string is non-empty it sets env['ANTHROPIC_BASE_URL'] on the Node ACP subprocess Fazm spawns. The Node subprocess is the @agentclientprotocol/claude-agent-acp bridge wrapping Claude Code. Because the block only sets ANTHROPIC_BASE_URL, the spawned bridge speaks HTTP Anthropic Messages to whatever URL you paste. v0.16.0 did not change that contract, so the four lines do not change either.

If I run a Mac agent against my own vLLM server, do I need to upgrade to v0.16.0?

Only if one of three things is true. (1) You serve Voxtral or a similar audio model and you want the streaming WebSocket path. (2) You run multi-node pipeline parallelism and care about the async-scheduling throughput gain. (3) Something in your stack still uses BitBlas, Marlin 24, reasoning_content, or VLLM_ALL2ALL_BACKEND, in which case you need to plan a migration before pulling the tag. If none of those apply, the upgrade is fine but optional, and your downstream agent (which only consumes the HTTP endpoint) will not notice the difference.

Where can I verify this myself instead of trusting one page?

Four sources I cross-checked on May 17, 2026. The GitHub release page at github.com/vllm-project/vllm/releases/tag/v0.16.0 has the tag, the date, and the changelog. PyPI lists v0.16.0 under pypi.org/project/vllm. The PR that introduced the Realtime API lives at github.com/vllm-project/vllm/pull/33187. The companion vLLM blog post on the design intent is at blog.vllm.ai/2026/01/31/streaming-realtime.html. The Red Hat day-zero writeup of Voxtral Mini 4B on vLLM is at developers.redhat.com/articles/2026/02/06/run-voxtral-mini-4b-realtime-vllm-red-hat-ai.