The latest vLLM version in 2026 is v0.20.2
Tagged on May 10, 2026 on the vllm-project/vllm GitHub releases page and pushed to PyPI the same day. A six-commit patch on top of v0.20.0 (April 27, 2026), with v0.20.1 (May 4, 2026) in between. If you came here looking for the literal version number, that is it. The rest of this page is the four fixes that ship inside v0.20.2, whether any of them matter to you, and the two-line block in the Fazm macOS source that makes every one of those patches invisible to the agent driving your Mac apps.
Direct answer, verified 2026-05-12
vLLM v0.20.2, released on May 10, 2026.
Sources I re-checked today: github.com/vllm-project/vllm/releases/tag/v0.20.2, pypi.org/project/vllm, and vllm.ai/releases. All three agree. The PyPI install is pip install vllm==0.20.2. The container is vllm/vllm-openai:v0.20.2.
Two days since release as of this check. Six commits. Six contributors. No API surface change. No new model support. Strictly a stability and correctness patch.
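One sanity check after pinning, in the same shape as the verification commands further down. The only assumption is that you are checking the environment you just installed into; the package reports its own version as vllm.__version__.

# Pin, then confirm what actually landed in the environment.
pip install "vllm==0.20.2"
python -c "import vllm; print(vllm.__version__)"
# 0.20.2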
What is actually inside v0.20.2
Six commits, four user-visible fixes, no new model. The release notes call it a maintenance tag. If you read the PR list on the GitHub release page and ignore the test-only commits, you are left with these four. Three of them only matter to the operator of the inference box; one of them matters if you happen to be fronting a multimodal agent with a Qwen3-VL endpoint.
The four fixes in v0.20.2
- PR #41665 - DeepSeek V4 sparse attention: re-enable the persistent topk path on Hopper, run the memset kernel at CUDA graph capture time, resolves MTP=1 hangs. CUDA-side, server-only.
- PR #41282 - V1 engine KV cache: fix the 'failure to allocate KV blocks' error in the V1 engine path. CUDA-side, server-only.
- PR #42002 - gpt-oss MXFP4 + torch.compile: pass hidden_dim_unpadded through the moe_forward fake op so MXFP4 stays functional under torch.compile. Compile-time, server-only.
- PR #40932 - Qwen3-VL: remove an invalid deepstack boundary validation that triggered errors under high load. This one can surface to a downstream consumer if Qwen3-VL is the model behind your agent.
Should you actually chase v0.20.2?
Most articles on this question say "yes, always pin the latest." That is bad advice for inference servers. The version you run is a function of three inputs: which models you serve, which kernels they use, and what your downstream consumer is. For a Mac-agent operator, the downstream consumer is a consumer-side desktop app talking to your server through an Anthropic-shaped shim. That shape narrows the upgrade question a lot.
Walk it left to right.
The upgrade decision in four steps
- 1. Are you serving Qwen3-VL? If yes, the deepstack patch (#40932) is the reason to pull v0.20.2: under load, your server is otherwise vulnerable to a known crash path. If no, this fix is irrelevant. (A probe sketch at the end of this section shows how to check what your box is actually serving.)
- 2. Are you on Hopper with DeepSeek V4 + MTP=1? If yes, the persistent topk fix (#41665) closes a hang you may already be seeing. If you do not run DeepSeek V4 on Hopper at all, pin whatever; this fix never touches your traffic.
- 3. Are you on the V1 engine path? If yes, the KV block allocation fix (#41282) prevents a startup-time failure mode. Most production fleets are on V1 by now; if you locked to a legacy engine, you are unaffected.
- 4. Are you running gpt-oss under torch.compile? If yes, MXFP4 works again under compile. If you are not running gpt-oss, or you have torch.compile off, ignore it. Most Mac-agent operators do not run gpt-oss as the backend.
“Most Mac-agent operators answer 'no' to all four of those questions. That is the working answer: v0.20.2 is the latest version, and most consumers of vLLM should not chase it. Stay on whatever v0.20.x you already pinned. Only chase the patch when a specific symptom matches one of the four fixes.”
Verified against vLLM v0.20.2 release notes, May 10, 2026
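Step 1 and the version check do not have to be answered from memory. vLLM's OpenAI-compatible server lists what it is serving at /v1/models, and the API server also exposes a bare /version route. The host and port below assume a local server on the default 8000; adjust to your deployment.

# Step 1 against a live box: what is it actually serving?
curl -fsSL http://127.0.0.1:8000/v1/models | jq -r '.data[].id'

# And which vLLM is answering.
curl -fsSL http://127.0.0.1:8000/version

A Qwen3-VL model id in the first output plus anything below 0.20.2 in the second is the one combination that makes this patch urgent.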
The anchor fact: two lines of Swift that do not care which vLLM you run
The thing every page about vLLM and Mac agents skips is what happens on the consumer side when you upgrade your server. The honest answer is: nothing should happen, and on Fazm nothing does, because the wiring between the Mac agent and your vLLM endpoint is two lines of Swift that read a UserDefaults string and stuff it into an environment variable. That block lives at lines 468 to 469 of Desktop/Sources/Chat/ACPBridge.swift. I just opened the file and pasted the surrounding context verbatim.
// Desktop/Sources/Chat/ACPBridge.swift, lines 467 to 470
// The two lines that make every vLLM patch release invisible to the Mac agent.
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint
}

When you set Settings > AI Chat > Custom API Endpoint to, say, http://127.0.0.1:4000 (your LiteLLM shim in front of vLLM), Fazm's ACP bridge subprocess spawns with ANTHROPIC_BASE_URL set to that value. Every model call from the agent then routes to your shim, and from there to whatever vLLM version is running. Upgrading the server from v0.20.1 to v0.20.2 does not touch this code path. Upgrading from v0.19.x to v0.20.0 with the CUDA 13 jump did not either. The Anthropic Messages protocol is the contract; v0.20.2 did not change that contract.
That is the practical reason most Mac-agent users do not need to track vLLM patch releases. The decoupling is one shim process away.
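You can replay the bridge's side of that contract by hand. A minimal sketch, assuming the LiteLLM shim from above is listening on 127.0.0.1:4000 and exposes an Anthropic-style /v1/messages route (LiteLLM's proxy does; other shims may differ); the model name and key are placeholders for whatever your shim is configured with.

# The same call the agent makes once ANTHROPIC_BASE_URL points at the shim.
curl -fsS http://127.0.0.1:4000/v1/messages \
  -H "x-api-key: sk-local-placeholder" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "qwen3-vl", "max_tokens": 64,
       "messages": [{"role": "user", "content": "ping"}]}'

If that returns a message body, the agent will too, and nothing in the request says which vLLM patch is running behind the shim.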
Quick verification commands, if you want to check the version yourself
# PyPI: what does pip see as the latest vllm?
pip index versions vllm | head -1
# vllm (0.20.2)
# GitHub: what does the release machinery report?
curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest \
| jq -r '.tag_name'
# v0.20.2
# Docker Hub: what is the latest tagged container?
docker pull vllm/vllm-openai:latest
docker inspect vllm/vllm-openai:latest --format '{{json .RepoTags}}'
# ["vllm/vllm-openai:latest","vllm/vllm-openai:v0.20.2"]
# vllm.ai/releases: the human-readable index page
open https://vllm.ai/releases

Run any one of these on May 12, 2026 and you should see v0.20.2. If you see something later by the time you read this, the version above is stale; cross-check the vllm-project release page.
What "latest" means on Apple Silicon, which is a separate question
Two things are commonly confused. Vanilla vLLM (the vllm-project/vllm repo) is CPU-only on macOS Apple Silicon through the requirements/cpu.txt build path, and that path tracks the main release cadence. v0.20.2 builds on M-series CPUs the same way v0.20.1 did.
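For reference, the CPU-only path is a plain source build against the CPU requirements file. This is a sketch of the shape documented in the repo's CPU installation guide, not a verified transcript; re-check the docs for your tag before running it.

# CPU-only vLLM on an M-series Mac: slow to serve with, but it
# tracks the vanilla release cadence.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.20.2
pip install -r requirements/cpu.txt
pip install -e .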
The community Apple Silicon options that give you actual GPU acceleration on a Mac are the vllm-project/vllm-metal plugin (Metal as the attention backend) and the waybarrios/vllm-mlx fork (MLX with an Anthropic and OpenAI compatible server out of the gate). Neither of those tracks vanilla vLLM patch-for-patch. They cherry-pick the changes that matter for their backend (FA4, MLA prefill, MoE routing) and skip server-infra patches like the V1 KV block allocation fix. v0.20.2's headline fix is the Hopper persistent topk path, which is not relevant on Apple Silicon at all. Expect those forks to skip v0.20.2 and merge to a future v0.20.x cut.
So "the latest version of vLLM in 2026" on your Mac practically means one of three things: pip install vllm==0.20.2 for CPU-only, vllm-project/vllm-metal main for Metal-backed GPU, or waybarrios/vllm-mlx main for MLX-backed GPU. Pick based on what you are actually serving, not on which has the higher version number.
Want help wiring v0.20.2 (or any version) into a Mac agent?
Book 20 minutes. We'll walk through which version actually makes sense for your workload, the shim choice, and the one Settings field in Fazm that absorbs every vLLM patch.
Frequently asked questions
What is the latest version of vLLM right now?
vLLM v0.20.2, tagged on May 10, 2026, available on PyPI and on the vllm-project/vllm GitHub releases page. It is a six-commit patch on top of v0.20.0 (April 27, 2026), with v0.20.1 (May 4, 2026) as the intermediate patch. Three v0.20.x tags in 13 days; the cadence is unusual but reflects the DeepSeek V4 stabilization push. Confirmed by re-checking github.com/vllm-project/vllm/releases/tag/v0.20.2 and pypi.org/project/vllm on May 12, 2026.
What did v0.20.2 actually change?
Four fixes. (1) DeepSeek V4 sparse attention, PR #41665, re-enables the persistent topk path on Hopper and ensures the memset kernel runs at CUDA graph capture time, resolving MTP=1 hangs. (2) V1 engine KV cache, PR #41282, fixes a 'failure to allocate KV blocks' error in the V1 engine path. (3) gpt-oss MXFP4 plus torch.compile, PR #42002, passes hidden_dim_unpadded through the moe_forward fake op so MXFP4 functions under torch.compile. (4) Qwen3-VL, PR #40932, removes an invalid deepstack boundary validation that triggered errors under high load. Three of those four are CUDA-host server fixes; only Qwen3-VL touches anything a Mac-agent operator might be running through their server.
If I run a Mac agent against my vLLM server, do I need to upgrade to v0.20.2?
Almost certainly not, unless you specifically front your agent with Qwen3-VL or DeepSeek V4 on a Hopper card. The other three fixes are corrections to CUDA-side server behavior that only matter to the operator of the vLLM box, not the consumer of its API. The reason this is true is that the Fazm app does not talk to vLLM. It talks to an Anthropic-shaped shim (LiteLLM, claude-code-router, or vllm-mlx for the all-in-one case) that talks to vLLM. The shim absorbs every server-side change as long as the OpenAI Chat Completions protocol stays stable, which v0.20.x has not broken. The two-line block in ACPBridge.swift that points the bridge at the shim's URL has not had to change once across v0.18, v0.19, v0.20.0, v0.20.1, and v0.20.2.
Why is there a v0.20.2 only six days after v0.20.1?
Because the DeepSeek V4 work landed in v0.20.0 as 'shipping with known constraints on Hopper', the v0.20.1 release on May 4 stabilized the multi-stream pre-attention GEMM path, and v0.20.2 on May 10 was needed to fix the persistent topk regression that v0.20.1 did not catch. The trigger for v0.20.2 was customer reports of MTP=1 hangs on Hopper after pulling v0.20.1. The release was prepared by six contributors and shipped as a maintenance-only tag. There is no new model support, no API change, no kernel rewrite; it is the kind of patch you skip if you are not running into the specific symptom.
Is v0.20.2 available on PyPI, in Docker, and on the vllm-project release page?
Yes to all three. pip install vllm==0.20.2 resolves on PyPI as of May 10, 2026. The vllm/vllm-openai:v0.20.2 container image is published. The GitHub release page at github.com/vllm-project/vllm/releases/tag/v0.20.2 has the source tag and the changelog. The vllm.ai/releases index lists it under 'all versions'. None of the community Apple Silicon forks have caught up yet, as is normal for patch cadence; expect vllm-project/vllm-metal and waybarrios/vllm-mlx to pick up the relevant Qwen3-VL fix on their next planned cut, not within hours.
If I am still on v0.19.x or earlier, what is the practical reason to move?
Two reasons. First, CVE-2026-0994 (deserialization vulnerability in the prompt_embeds handling of the Completions API, affecting versions 0.10.2 and later) was patched in the v0.19.x cycle. If your server exposes Completions to anything you do not fully trust, that is a real upgrade trigger. Second, v0.20.0 raised the dependency floor: CUDA 13.0.2, PyTorch 2.11.0, Transformers v5, Python 3.14 supported. If your host machine is still CUDA 12, you cannot pull v0.20.x without rebuilding the wheel, and that is a meaningful migration. There is no other operational reason to upgrade past v0.19.x unless you specifically need DeepSeek V4, Hunyuan v3 preview, or FA4 as default MLA prefill.
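If the CUDA floor is the deciding factor, check the host before planning anything. Two standard NVIDIA probes, nothing vLLM-specific:

# Driver version (the nvidia-smi banner also prints the highest
# CUDA version the driver supports).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Toolkit version on the build host.
nvcc --version | grep release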
Will v0.20.2 work on Apple Silicon directly?
Vanilla vLLM still runs CPU-only on macOS Apple Silicon via the requirements/cpu.txt build path, and the v0.20.2 patches are GPU-side fixes that do not touch the CPU backend. The CPU build path works but is slow. For practical GPU use on a Mac you want the community vllm-project/vllm-metal plugin or the waybarrios/vllm-mlx fork. Neither of those tracks vanilla vLLM patch-for-patch; they cherry-pick the changes that matter (FA4, MLA prefill, MoE routing) and skip server-infra patches like the V1 engine KV block fix. v0.20.2 specifically is unlikely to land in either fork until they have a v0.20-tier merge to do anyway.
What is the cleanest version to pin in production right now if I run a Mac agent against my own vLLM box?
If you are on a CUDA 13 host and want the new model support, pin v0.20.2 and call it done; the fixes are strictly additive over v0.20.1. If you cannot move off CUDA 12, pin v0.19.1 and accept that you do not get DeepSeek V4 or FA4 prefill. If you are running the Hugging Face transformers stack on Apple Silicon and serving locally to your laptop, pin nothing; pull from vllm-project/vllm-metal main and accept that 'latest' means what their last green CI run produced, not what is on the vanilla GitHub releases page. None of those choices change the line in Fazm that wires your shim into the agent: ACPBridge.swift:468-469 reads customApiEndpoint and sets ANTHROPIC_BASE_URL, regardless.
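The three pins from that answer, as literal commands. The two PyPI pins are exact; the Metal line is a guess at the install shape, since an out-of-tree plugin packages itself however it likes, so verify against that repo's README first.

pip install "vllm==0.20.2"   # CUDA 13 host, newest model support
pip install "vllm==0.19.1"   # CUDA 12 host, no DeepSeek V4 or FA4 prefill
# Hypothetical install line for the Metal plugin; check its README.
pip install "git+https://github.com/vllm-project/vllm-metal@main"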
Where do I verify the version myself if I do not trust this page?
pip index versions vllm prints the resolvable PyPI versions. curl -fsSL https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r '.tag_name' prints the tag on the GitHub release machinery, currently v0.20.2 as of May 12, 2026. docker pull vllm/vllm-openai:latest and docker inspect with a jq filter on RepoTags shows what the latest tag points at. The vllm.ai/releases page is the human-readable index. All four agree.
Related vLLM guides
vLLM release May 2026: v0.20.1 patch and the v0.20.0 dep-floor jump
The May 2026 timeline with version numbers and dates, plus the four lines of Swift inside Fazm that make a vLLM server upgrade invisible to your Mac agent.
vLLM release notes 2026: v0.18 and v0.19, and the toggle that wires vLLM into a Mac agent
What actually shipped in v0.18.0 and v0.19.0, the CVE-2026-0994 patch, and the Anthropic-shim wiring that makes any vLLM endpoint the brain of a Mac agent.
Run vLLM locally on Mac and plug it into an AI agent
From curl localhost:8000 to an agent driving Finder, Calendar, WhatsApp. The one Settings field that rewrites ANTHROPIC_BASE_URL to your local vLLM server.