APRIL 2026 - THREE VLLM RELEASES, 14 DAYS, ZERO MAC BUILDS

vLLM release notes, April 2026. Your Mac still can't run any of them.

vLLM shipped v0.18.0, v0.19.0, and v0.19.1rc0 inside a 14-day window in April 2026. gRPC. GPU NGram speculative decoding. Day-one Gemma 4. Async scheduler flipped on by default. Every release-notes roundup lists the bullets. None of them name the thing a Mac user notices first: vLLM does not run natively on macOS. This guide decodes each release, states that truth honestly, and shows the single Swift line inside the Fazm desktop app that lets a consumer point a Mac agent at a vLLM endpoint running somewhere else, no app update required.

Matthew Diakonov
10 min read
Written from the April 2026 vLLM release notes and the MIT-licensed Fazm source tree
Three vLLM releases in April 2026, decoded feature-by-feature
Why vLLM is CUDA-first and what that means for a Mac user
The single Swift line that points Fazm at any vLLM endpoint
Exact file paths: ACPBridge.swift:381 and SettingsPage.swift:840
What accessibility-tree context adds when vLLM is serving the model

The cadence, in one paragraph

vLLM went three releases deep inside two weeks. v0.18.0 opened the window with gRPC serving and stabilized GPU NGram speculative decoding. v0.19.0 was the one you noticed: day-one Gemma 4 across four variants, the async scheduler enabled by default, and a handful of smaller scheduler and prefix-cache improvements that matter for agent-style workloads. v0.19.1rc0 shipped the following day with small fixes. The cluster coincided with public-weight drops of Llama 4, Qwen 3.5, Gemma 4, and DeepSeek, which is why every local-LLM roundup this month reads like a model launch press release.

3 vLLM releases in April 2026
14 days from v0.18.0 to v0.19.1rc0
4 Gemma 4 variants at day one
0 of them ship a Mac GPU build

release by release

Three releases, read honestly

1. v0.18.0 - gRPC and GPU NGram speculative decoding

Adds a first-class gRPC serving path alongside the HTTP one, which matters for multi-tenant and low-latency setups where HTTP framing cost is a measurable share of per-request time. GPU NGram speculative decoding stabilizes: a running n-gram index on the GPU proposes a few tokens ahead, the main model verifies in a single pass, and tokens-per-second climbs on long-form completions. Agent workloads that emit short, structured tool calls see a smaller share of that gain, but they still see some.

Biggest wins: throughput on long completions; cleaner integration for services that already speak gRPC.

2. v0.19.0 - day-one Gemma 4, async scheduler default

The headline release. Gemma 4 support lands on the same day the weights go public, across E2B, E4B, 26B MoE, and 31B Dense. The async scheduler, which had been behind a flag, flips on by default: time-to-first-token drops for typical workloads because scheduling and execution overlap rather than serialize. For desktop agents that feel latency-bound on every turn, this is the single most important change of the month.

Biggest wins: agent snappiness (lower TTFT), first-day inference on Gemma 4 26B MoE.

3. v0.19.1rc0 - small fixes, shipped the next day

A release candidate that followed v0.19.0 by a day with scheduler fixes and miscellaneous polish. Worth upgrading to if you hit edge cases under the new async-scheduler default; otherwise v0.19.0 is the substantive release of the month.

Biggest wins: stability on top of v0.19.0. Nothing load-bearing changes.

the Mac truth

vLLM does not run natively on macOS

vLLM is CUDA-first. The benchmarks in the release notes, the speculative-decoding paths, the paged KV cache, the async scheduler: all of it is designed around an NVIDIA GPU. Experimental CPU and Apple Silicon paths exist, but performance parity with the happy CUDA path is not the project's goal. If you are on a Mac and you want the April 2026 gains, there are two honest paths: run vLLM on a Linux box with a GPU and talk to it over the network, or run Ollama or LM Studio on your Mac and accept a different (and less serverlike) throughput profile.

Feature | Ollama / LM Studio (local Mac) | vLLM (remote)
Where it runs | On your Mac, Apple Silicon directly | Linux + CUDA, usually a remote box
Async scheduler (v0.19.0) | Different scheduling model; not comparable | On by default
Day-one new checkpoints | Usually within days, via tool updates | Yes, as model support lands in vLLM
Throughput under concurrency | Single-user oriented, fine for one agent | High; designed for multi-request serving
API shape | OpenAI-compatible (both tools) | OpenAI-compatible (chat completions)
Fazm integration path | Custom API Endpoint + Anthropic shim | Custom API Endpoint + Anthropic shim

the seam

One Swift line is the whole Mac-to-vLLM bridge

Fazm is a consumer macOS app. It talks Anthropic Messages to a Node child process over ACP. Anywhere you can point that child process at a different base URL, you can point Fazm at something else. That seam exists and it is small: a single UserDefaults key on one side, a single env-var export on the other. Here is the code, unchanged, straight from the MIT-licensed repo.

Desktop/Sources/Chat/ACPBridge.swift
Desktop/Sources/MainWindow/Pages/SettingsPage.swift
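Reconstructed here as a sketch from the claims in this article's FAQ (line numbers per the article; the surrounding guard and declaration details are assumptions, not a verbatim copy of the repo):

```swift
// Desktop/Sources/MainWindow/Pages/SettingsPage.swift:840 (per the article)
// The Settings field persists to UserDefaults under "customApiEndpoint".
@AppStorage("customApiEndpoint") private var customApiEndpoint: String = ""

// Desktop/Sources/Chat/ACPBridge.swift:380-381 (per the article)
// The stored endpoint, when non-empty, is exported onto the Node child
// process environment before launch.
if !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint
}
```

Everything downstream of that environment variable, shim, server, model, lives outside the app.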
v2.2.0

Custom API Endpoint shipped in Fazm v2.2.0 on 2026-04-11. That is the release that made every vLLM build after it addressable from the desktop.

CHANGELOG.json

How the pieces connect

vLLM speaks OpenAI-compatible chat completions. Fazm speaks Anthropic Messages. The two interlock through a translation shim that anyone can run on any machine. Fazm's accessibility layer reads the tree of the focused app and sends the result as the tool-call payload, so whatever model vLLM is serving gets structured labels and roles instead of pixels. The diagram below is the end-to-end flow.

Fazm on macOS, vLLM on the other side of the network

Fazm.app
Accessibility tree
Your MCP tools
ACP bridge
Anthropic shim
vLLM server
Your app

what each release actually unlocks

The bullets, translated to what a Mac agent feels

gRPC serving (v0.18.0)

If your Anthropic-to-OpenAI shim speaks gRPC, you skip HTTP framing for every turn. On short agent calls, framing is a measurable share of wall time. Most shims in the wild still use HTTP, so treat this as headroom, not a day-one win.

GPU NGram speculative decoding (v0.18.0)

Throughput gain on long completions without the memory cost of a draft model. Agent tool calls are short, so the gain is smaller here than for chat. Still, it is free once you enable it, and it compounds with v0.19.0's scheduler.

Day-one Gemma 4 (v0.19.0)

Four variants supported the moment Gemma 4 weights went public: E2B, E4B, 26B MoE, 31B Dense. The 26B MoE is the current sweet spot for agent reasoning per unit of active-parameter cost on a GPU with 24-48 GB of memory.

Async scheduler default (v0.19.0)

The single most important line in the April 2026 notes for agent feel. Scheduling and execution overlap instead of serializing. TTFT drops on typical workloads. Agents emit many short tool calls in a loop, so every TTFT improvement multiplies.

v0.19.1rc0 polish

Fixes on top of the async-scheduler default. Worth the upgrade if you hit a scheduler edge case. Nothing to rewrite around.

What did not change

vLLM is still CUDA-first. No official macOS GPU path. If you are shopping for a Mac build, skip these notes and look at Ollama or LM Studio. If you want vLLM's serving profile on a Mac client, route to a remote vLLM from Fazm.

try it

Start a vLLM server and point Fazm at it

Assume a Linux box with an NVIDIA GPU you can reach from your Mac. The steps below are what actually happens, no pseudo-code: launch vLLM with an OpenAI-compatible API, run an Anthropic-to-OpenAI shim in front of it, paste the shim URL into Fazm's Custom API Endpoint field. Restart the chat. Done.
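A minimal sketch of those three steps. The model id and the shim command are illustrative placeholders, not specific recommendations; `vllm serve` is the real OpenAI-compatible entry point, but substitute whatever checkpoint and shim you actually run:

```shell
# 1) GPU box: serve an OpenAI-compatible endpoint.
#    Model id is illustrative -- use the checkpoint you actually run.
vllm serve your-org/gemma-4-26b-moe --host 0.0.0.0 --port 8000

# 2) Any machine: run an Anthropic-to-OpenAI shim pointed at the vLLM URL.
#    Command is a placeholder; several open-source shims exist.
anthropic-openai-shim --upstream http://gpu-box:8000/v1 --listen 0.0.0.0:8402

# 3) Mac: Fazm > Settings > Advanced > Custom API Endpoint
#    Paste http://<shim-host>:8402 and restart the chat.
```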

gpu-box -> mac client

the stack of numbers

Two numbers that matter more than the rest

14 days

between vLLM v0.18.0 and v0.19.1rc0 in April 2026. The cadence is the story: models landed, vLLM absorbed them, scheduling improved, and every agent hitting a vLLM endpoint got faster without changing a line of client code.

1 line

of Swift is the Mac-to-vLLM seam inside Fazm: env["ANTHROPIC_BASE_URL"] = customEndpoint, at ACPBridge.swift line 381. Everything else (shim, server, model, scheduler) is on the other side of that variable.

Have a vLLM box and a Mac you want to drive from it?

We walk through the shim, the endpoint setting, and the accessibility-tree payload on a quick call.

Book a call

vLLM April 2026 and the Mac side, answered

What did vLLM release in April 2026?

Three releases inside roughly two weeks. v0.18.0 added gRPC serving and GPU NGram speculative decoding as a stable feature. v0.19.0 arrived with day-one Gemma 4 support across E2B, E4B, 26B MoE, and 31B Dense variants, and it flipped the async scheduler on by default, which lowers time-to-first-token for most workloads. v0.19.1rc0 shipped the next day as a release candidate with small fixes. The practical story is that vLLM kept pace with the open-weight model wave so the newest checkpoints were servable the moment they landed.

Can I run vLLM on my Mac?

Not comfortably, and not the way you run Ollama or LM Studio. vLLM is CUDA-first. The project has experimental CPU and Apple Silicon paths but the performance story, and therefore every vLLM release note, is written for NVIDIA GPUs. If you searched for vLLM April 2026 release notes from a Mac hoping to upgrade your local setup, the honest answer is that you should run vLLM on a Linux box with a GPU and talk to it over the network, or run Ollama or LM Studio on your Mac instead.

Then why would a Mac user care about vLLM release notes at all?

Because the model throughput and agent quality you get at your desk depend on whatever is serving the weights, whether that is a small server in your closet, a vLLM endpoint at work, or a hosted Anthropic-compatible proxy. A 14-day release cadence with the async scheduler flipped on by default is the difference between a tool call that returns in 700 ms and one that returns in 2.4 s. It is also why Gemma 4 served on vLLM v0.19.0 starts answering faster than the same weights served on v0.18.0.

How do I point Fazm at a vLLM server?

Settings > Advanced > Custom API Endpoint, paste the URL, restart the chat. Behind the scenes that field is persisted as a UserDefaults key named customApiEndpoint at Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 840, then exported as ANTHROPIC_BASE_URL onto the Node child process at Desktop/Sources/Chat/ACPBridge.swift lines 380 to 381. vLLM does not speak the Anthropic Messages API directly; you need a small shim in front of it that translates Messages to OpenAI-compatible, which is what vLLM exposes. The Fazm side is protocol-agnostic; the shim is where you wire it to vLLM.

Does vLLM speak the Anthropic Messages API that Fazm uses?

No. vLLM exposes an OpenAI-compatible /v1/chat/completions surface by default. Fazm speaks Anthropic Messages. The two interlock through an Anthropic-to-OpenAI shim, of which several open-source options exist. You run the shim on any machine, point it at your vLLM URL, and hand the shim URL to Fazm's Custom API Endpoint. ANTHROPIC_BASE_URL on Fazm's end, OpenAI-compatible on vLLM's end, shim in the middle.
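The core of such a shim is one translation function. A minimal sketch (text-only messages, string system prompt, no tools or streaming; real shims handle far more of the Messages surface):

```python
def anthropic_to_openai(body: dict) -> dict:
    """Translate an Anthropic Messages request body into an
    OpenAI chat-completions request body."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    for m in body["messages"]:
        content = m["content"]
        # Anthropic allows a list of content blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(
                b["text"] for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }
```

Wrap that in any HTTP server listening on an Anthropic-shaped route, forward the result to vLLM's /v1/chat/completions, and translate the response back the same way.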

Which vLLM release is the most important for a desktop agent use case?

v0.19.0. The async scheduler flipped on by default reduces time-to-first-token, which is the metric that dominates agent feel because an agent emits many short tool calls in a loop. Every one of those calls is a fresh TTFT bill. Faster scheduling means the agent feels snappier regardless of model size. The Gemma 4 26B MoE support is the second-most-important item if you care about reasoning per active parameter, which is where MoE math pays off.

What is GPU NGram speculative decoding from v0.18.0 and why did it matter?

Speculative decoding uses a small draft path to propose a few tokens ahead of the main model, then verifies them in a single pass. GPU NGram specifically proposes continuations from a running n-gram index on the GPU, so you do not have to run an entire separate draft model. It is a speed win on long-form completion without the memory cost of a draft-and-verify two-model pipeline. For agents that emit short structured tool calls, the gain is smaller than it is for chat-completion workloads.
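The proposal step can be illustrated with a toy CPU version of the idea: look up the most recent earlier occurrence of the last n tokens and propose what followed it. (vLLM maintains this index on the GPU and verifies proposals with the main model; this sketch shows only the lookup.)

```python
def ngram_propose(tokens: list[int], n: int = 2, k: int = 3) -> list[int]:
    """Propose up to k draft tokens by matching the trailing n-gram
    against an earlier occurrence in the context."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards through the context, excluding the tail itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no earlier match: nothing to speculate
```

When a match exists, the main model verifies all k proposed tokens in one forward pass instead of k sequential ones, which is where the long-form throughput gain comes from.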

Does Fazm actually need vLLM, or is this just a compatibility story?

Fazm itself does not need vLLM. It talks Anthropic Messages to a Node child process over ACP, and that child process can be pointed at any Anthropic-shaped endpoint via ANTHROPIC_BASE_URL. If you want vLLM's throughput, you wire a shim in front of it and put the shim URL in Settings. If you prefer Ollama or LM Studio, the same seam works; both of those expose an OpenAI-compatible endpoint natively and you still put a Messages shim in front of them. The design is that the serving stack is yours to choose; Fazm is a consumer Mac app, not a model server.

What does Fazm send to the model that is different from a typical desktop agent?

Not a screenshot. Fazm ships a bundled macOS binary called mcp-server-macos-use that returns the accessibility tree (AXUIElement roles, labels, values, window hierarchy) for the focused app as a structured document. That document goes to the model inside the tool-call response, so even a locally served 7B to 13B-class model has a realistic shot at picking the right button. The Fazm onboarding tells you this too; the feature shipped the same month as the vLLM v0.19.0 release and they compose.

Where can I verify these claims in the source?

Four files in the MIT-licensed repo. Desktop/Sources/Chat/ACPBridge.swift lines 380 to 381 for the ANTHROPIC_BASE_URL export. Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 840 for the @AppStorage("customApiEndpoint") declaration and line 936 for the TextField that writes to it. Desktop/Sources/Providers/ChatProvider.swift line 2103 for the consumer reading the same key. CHANGELOG.json for v2.2.0 on 2026-04-11, which is the release that shipped the Custom API Endpoint setting into the public Fazm build.

If I run vLLM at home, should I bother with accessibility-tree context on the Mac side?

Yes. The throughput gain from vLLM is about tokens per second; the agent-quality gain on a Mac is about whether the model can identify the right element. A fast local model that receives a screenshot and hallucinates the Save button is worse than a slower model that receives a structured accessibility tree and names it. The two gains stack. vLLM v0.19.0 plus accessibility-tree context is roughly where a consumer Mac agent becomes actually useful, not a demo.