vLLM v0.19.0 + v0.19.1 (April 3 and April 18, 2026) and the macOS agent layer

vLLM release April 2026 changelog: two tags inside the inference server, zero that moved the Mac-agent layer above it

v0.19.0 shipped April 3, 2026 with Gemma 4, zero-bubble async scheduling plus spec decode, Model Runner V2 piecewise CUDA graphs, ViT full CUDA graph capture, CPU KV offload, B300 / GB300 support, and transformers v5 compatibility. v0.19.1 followed on April 18 with Gemma 4 stabilization, an Eagle3 drafter, quantized MoE, a LoRA loading fix, and a CVE-2026-0994 patch. Every item is a change at or below localhost:8000. None of them touch the Mac-desktop action layer above it.

Fazm
12 min read
  • Uses real accessibility APIs, not screenshots
  • Works on any Mac app, not just the browser
  • Consumer app, no vLLM config to touch

What's inside April 2026, at a glance

Two main tags and one edge-of-April patch. Each chip is a change at or below the token boundary on localhost:8000.

  • Tags: v0.18.1 (2026-03-31), v0.19.0 (2026-04-03), v0.19.1 (2026-04-18)
  • Gemma 4
  • Zero-bubble async + spec decode
  • Model Runner V2
  • ViT CUDA graph
  • CPU KV offload
  • B300 / GB300 SM 10.3
  • Blackwell SM120 FP8 GEMM
  • transformers v5.5.4
  • Eagle3 drafter
  • quantized MoE
  • CVE-2026-0994

What every April 2026 vLLM roundup covers, and what they miss

The top ten results for "vllm release april 2026 changelog" all cover the same surface: the GitHub releases page, the NVIDIA NGC vLLM release notes, the official vLLM blog, the vllm-ascend docs, vllm.ai/releases, PyPI, deps.dev, discuss.vllm.ai, and the AMD ROCm vLLM benchmark docs. Each of them lists v0.19.0, v0.19.1, the commit counts, the headline features (Gemma 4, zero-bubble spec decode, Model Runner V2, ViT CUDA graphs, CPU KV offload), the device matrix (NVIDIA B300 and GB300, Blackwell SM120, Intel XPU, ARM BF16, s390x FP16, ppc64le prefix caching), and the security patch (CVE-2026-0994). All of that is correct and useful.

What those articles skip, because it is not in scope for a vLLM release, is the layer above vLLM. The release notes tell you what the inference server can do with your GPU. They do not tell you what a local model served by vLLM can see or touch on your actual Mac. That second question is where a real Mac agent lives, and its answer is not in any vLLM version number.

This guide walks the v0.19.0 and v0.19.1 changelogs, then drops into the exact shipping code where the Mac-desktop action boundary sits. The code is in Fazm. The anchor fact is a binary at Fazm.app/Contents/MacOS/mcp-server-macos-use registered with args: [] and env: [], which is what makes the boundary portable across inference backends.

The v0.19.0 and v0.19.1 changelog, entry by entry

Nine groups of line items from the April 2026 tags. Read the whole list and notice what category is missing.

Gemma 4 (v0.19.0)

PRs #38826 and #38847. Multimodal, reasoning, and tool-use variants. Requires transformers >= 5.5.0. This is the v0.19.0 headline.

Zero-bubble async scheduling + spec decode

PR #32951. Overlaps scheduling with speculative decoding. Cuts first-token latency on any request shape, including short AX-tree tool-call turns.

Model Runner V2 piecewise CUDA graphs

PR #35162. Piecewise graphs for pipeline parallelism, streaming inputs, EPLB, and a spec-decode rejection sampler. The V2 runner is no longer experimental.

ViT full CUDA graph capture

PR #35963. Matters for multimodal Gemma 4 runs. A screenshot-free AX-tree Mac agent barely touches this path.

CPU KV cache offloading

PRs #37160 and #37874. General CPU KV offload with a pluggable cache policy. Lets long sessions with dozens of tool-call turns stay resident.

NVIDIA B300 / GB300 (SM 10.3)

PRs #37755 and #37756. Allreduce fusion for Blackwell-next generation parts. Blackwell SM120 CUTLASS FP8 GEMM also landed in the same tag.

Transformers v5 compatibility

A sweep across v0.19.0 for the transformers v5 API. v0.19.1 bumped to transformers v5.5.4. Bring your own HF cache and it mostly just works.

Gemma 4 stability (v0.19.1)

PRs #38992 (streaming tool-call JSON fix), #39450 (Eagle3 drafter), #39045 (quantized MoE), #38844 (LoRA loading). The April 18 patch is Gemma 4 hardening.

CVE-2026-0994 patch

Completions API endpoint patch in the April cycle. Affects vLLM >= 0.10.2. Unrelated to feature work. Upgrade past the patched tag.

Missing category: anything that touches the Mac-desktop GUI surface. No accessibility API work. No CGEvent synthesis. No window-frame resolution. vLLM's scope stops at the token boundary on port 8000, and both April tags respect that scope.

What a v0.19.1 server looks like on boot

The server launches with log lines that map directly to the April release notes. Zero-bubble async, Model Runner V2, ViT CUDA graphs, CPU KV offload. Then it serves /v1/chat/completions and goes to sleep waiting for clients.

vllm serve

Notice the endpoint that vLLM is now ready to answer on. Everything the boot log describes is inside vLLM's scope. Everything required to click inside a Mac application sits above it.
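As a concrete starting point, here is a minimal launch sketch. The model ID and tool-call parser name are illustrative assumptions; exact flag spellings should be checked against your vLLM build:

```shell
# Launch a v0.19.1 server on the default port (8000).
# Model ID and parser name are illustrative, not prescriptive.
vllm serve google/gemma-4-9b-it \
  --enable-auto-tool-choice \
  --tool-call-parser gemma \
  --port 8000
```

After this command, POST /v1/chat/completions is the whole surface vLLM offers. Nothing in the command, and nothing in the boot log it produces, touches the Mac GUI layer.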

Where v0.19.x stops and where the Mac-agent layer begins

vLLM's surface is the OpenAI-compatible REST on localhost:8000. Clients plug into it from the left. Downstream of it, if you want a Mac agent rather than a chat completion, you need a perception layer and an action layer. Those live on the right.

vLLM v0.19.x boundary (left of the hub) versus the Mac-desktop agent layer (right of the hub):

  • Left, vLLM v0.19.x: Gemma 4 tool use, zero-bubble async + spec decode, Model Runner V2, ViT full CUDA graph, CPU KV offload
  • Hub: localhost:8000
  • Right, Mac-agent layer: perception (AX tree walk), action (CGEvent synthesis), mcp-server-macos-use, six _and_traverse tools

v0.19.0 and v0.19.1 additions live on the left of the hub. Fazm's mcp-server-macos-use lives on the right of the hub. The hub is the token boundary where the two stacks meet.

Anchor fact: the registration code is already provider-agnostic

This is the block in Fazm that decides how the Mac-desktop action boundary is wired. The path is resolved on line 63. The registration block is lines 1057 through 1064. The default model identifier is on line 1245. The authoritative built-in MCP list is on line 1266. Open acp-bridge/src/index.ts at any of those line numbers and you will find the corresponding code.

acp-bridge/src/index.ts

Why this is the uncopyable part: the registration passes args: [] and env: [] to the binary. No Claude-specific flag, no Anthropic-specific environment variable, and nothing that would have to be renamed or rewired on the day vLLM becomes the backend. The binary speaks MCP over stdio, and the MCP tool_use / tool_result shape is defined by the model-provider layer above the binary, not by the binary itself. The only line that names a provider is DEFAULT_MODEL at line 1245.
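The pattern can be sketched in a few lines of TypeScript. This is a hedged reconstruction modeled on the code the guide quotes (an `existsSync` guard plus a `servers.push`), not the actual Fazm source; the `McpServerEntry` type and the function name are assumptions:

```typescript
import { existsSync } from "node:fs";

// Shape of one entry in the bridge's MCP server list (assumed for
// illustration; it mirrors the fields the quoted registration passes).
interface McpServerEntry {
  name: string;
  command: string;
  args: string[];
  env: string[];
}

// Register the macOS action binary if and only if it exists on disk.
// args: [] and env: [] carry no provider-specific flags or variables,
// which is what makes the boundary portable across inference backends.
function registerMacosUse(servers: McpServerEntry[], macosUseBinary: string): void {
  if (existsSync(macosUseBinary)) {
    servers.push({ name: "macos-use", command: macosUseBinary, args: [], env: [] });
  }
}
```

Swapping Anthropic for a local vLLM endpoint would leave a function like this untouched; only the model loop above it changes.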

vLLM on its own versus vLLM behind a Mac agent

Same tool_use block. Two very different outcomes depending on whether anything hosts the model above the token boundary.

Before a client exists above vLLM, versus after

# A tool-use turn AGAINST vLLM v0.19.1 alone

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-9b-it",
    "messages": [
      {"role": "user",
       "content": "Click Send in the Mail compose window."}
    ],
    "tools": [ /* whatever you pass in */ ]
  }'

# The response is a tool_use JSON block.
# vLLM has now done its job: tokens out.
#
# vLLM does not click anything.
# vLLM does not have a screen to click on.
# vLLM does not know what "Mail compose window" is.
#
# Above this line is where a Mac agent has to exist.

From vLLM endpoint to Mac CGEvent, hop by hop

1. vLLM v0.19.1 starts on localhost:8000

OpenAI-compatible REST: POST /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings. That is the top of vLLM's stack.

Anything that calls into this endpoint (a Python script, a coding CLI, a macOS agent) sits above the vLLM process boundary.

2. A client decides what the model reasons over

For a chat UI: user messages. For a coding CLI: stdin. For a Mac agent: the accessibility tree as a few kilobytes of text per turn.

The choice of observation format is a client-side decision, not a vLLM-side one. vLLM does not ship a perception layer.

3. The model emits a tool_use block

For Gemma 4 tool-use variants (with v0.19.0 auto-tool-choice and the v0.19.1 streaming JSON fix), this is native tool-call mode. JSON with a tool name and arguments.

vLLM's job is done at this point: tokens out. Where those tokens go next is a client concern.

4. An action binary translates the tool call into OS events

In Fazm this is mcp-server-macos-use: the six _and_traverse tools wrap AXPress, CGEvent mouse, CGEvent keyboard, and Core Graphics scroll.

Registration: acp-bridge/src/index.ts lines 1057-1064 with args: [] and env: []. No model-specific flags. No vLLM-specific flags. No Anthropic-specific flags.

5. The same binary re-walks the tree and returns it

The response in the MCP tool_result carries both the action result and the post-action accessibility tree.

Observe-act-observe collapses to one round trip. This property is a function of the MCP tool schema, not of vLLM or any model provider.
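The five hops can be condensed into the shape of a single turn. A minimal sketch, assuming an illustrative `TraverseToolResult` type (the real MCP schema is not reproduced here):

```typescript
// One tool_result carries both halves of observe-act-observe:
// the action outcome and the re-walked accessibility tree.
interface TraverseToolResult {
  action: string;   // e.g. 'pressed AXButton "Send"'
  axTree: string[]; // one UTF-8 text line per element, post-action
}

// The client forwards both halves to the model as a single message,
// so no separate "observe" request ever crosses the wire.
function renderTurn(result: TraverseToolResult): string {
  return [result.action, ...result.axTree].join("\n");
}
```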

vLLM v0.19.x scope vs the Mac-desktop agent layer

A line-by-line accounting of what the April 3 and April 18 release notes cover, versus what the layer above vLLM has to answer on its own.

Feature | vLLM v0.19.0 + v0.19.1 release notes | Fazm Mac-desktop agent layer
Scope of the April 2026 release notes | inference-server layer: kernels, scheduler, quantization, device support | application layer above the inference server: perception + action on macOS
Default endpoint the release targets | POST http://localhost:8000/v1/chat/completions | MCP tool_use to mcp-server-macos-use over stdio
What a v0.19.0 / v0.19.1 server sees on the screen | nothing; its surface is HTTP and tokens | 441 elements from AXUIElementCreateApplication as text
How the release notes let a model click in Mail | they do not; vLLM has no click primitive | mcp-server-macos-use synthesizes a CGEvent click by role + title
Provider coupling in the binary above | n/a; vLLM is the inference layer | zero: registered with args: [] and env: [] at index.ts:1057-1064
Single-point swap to try a new backend | vllm serve <new-model-id> | DEFAULT_MODEL string at acp-bridge/src/index.ts:1245
Observation payload per Mac-agent turn | whatever the calling client chose to include | a few kilobytes of UTF-8 AX-tree text per tool response
Round trip shape | chat completion request -> SSE token stream | MCP tool_use -> action + re-walked tree in one tool_result

The numbers across the April 2026 boundary

These are not benchmarks. The first two come from the v0.19.0 release page. The last two come from a real traversal of a Fazm Dev window on an M-series Mac.

  • 448 commits in v0.19.0 (April 3, 2026)
  • 197 contributors to v0.19.0
  • 441 elements in a real AX tree walk
  • 0.72 s walk + serialize time

Compare against a base64-encoded 4K screenshot observation, which typically runs to several hundred kilobytes of text in the request body. On a 9B Gemma 4 served by vLLM at a 32K context, that is the gap between one step and several dozen.

One line of what any vLLM-served model would read

The binary emits one text line per AX element. Role, title, frame, visibility. This is the exact format that lands in the MCP tool_result. A model served by vLLM on localhost:8000 substring-searches for the word it wants, reads x/y/w/h off the same line, and passes those four values into the next tool call:

[AXButton (button)] "Send" x:6272 y:-1754 w:56 h:28 visible

Nothing about this line is Claude-specific. A Gemma 4 9B or Llama 3.1 instruct or DeepSeek-R1 tool-use-capable model served by vLLM on port 8000 would read the same UTF-8 text.
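To make the substring-and-coordinates step concrete, here is a hedged sketch of a parser for that line format. The regex and field names are assumptions derived from the example line above, not Fazm code:

```typescript
// One AX-tree line: [Role (subrole)] "Title" x:N y:N w:N h:N visible|hidden
const AX_LINE =
  /^\[(\w+) \((\w+)\)\] "([^"]*)" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+) (visible|hidden)$/;

function parseAxLine(line: string) {
  const m = AX_LINE.exec(line);
  if (!m) return null;
  return {
    role: m[1],
    subrole: m[2],
    title: m[3],
    // Coordinates can be negative on multi-display setups,
    // as the y:-1754 in the example line shows.
    x: Number(m[4]),
    y: Number(m[5]),
    w: Number(m[6]),
    h: Number(m[7]),
    visible: m[8] === "visible",
  };
}
```

A model does this implicitly by reading the text; a test harness or a thin client can do it explicitly to feed x/y/w/h into the next tool call.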

What it takes to drive the same Mac binary from a local vLLM

Six steps. Five of them are on the vLLM side and fit inside the v0.19.1 release notes. The sixth is the single-line adapter seam on the client side.

From pip install to an AX-tree click, end to end

  • Install vllm==0.19.1 and transformers==5.5.4 on a GPU host.
  • vllm serve a tool-use-capable model (e.g., Gemma 4 9B Instruct with --enable-auto-tool-choice).
  • Confirm /v1/models returns the model ID and /v1/chat/completions responds to a health check.
  • In the Mac client, set DEFAULT_MODEL to the vLLM model ID and route /v1/messages through an OpenAI-compatible adapter.
  • Keep acp-bridge/src/index.ts lines 1057-1064 exactly as they are; the action layer does not need vLLM-specific plumbing.
  • Let the model emit tool_use JSON; the Mac binary translates it to a CGEvent and returns the re-walked AX tree.
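The adapter seam in the fourth step can be sketched as a single translation function. `OpenAIToolCall` mirrors the OpenAI chat-completions tool_calls shape a vLLM server emits; `McpToolUse` is an assumed stand-in for the bridge-side shape, used here only for illustration:

```typescript
// What an OpenAI-compatible endpoint (vLLM included) returns for a tool call:
// arguments arrive as a JSON-encoded string.
interface OpenAIToolCall {
  id: string;
  function: { name: string; arguments: string };
}

// An assumed MCP-style tool_use shape: arguments as a structured object.
interface McpToolUse {
  id: string;
  name: string;
  input: Record<string, unknown>;
}

// The whole adapter seam for one tool call: parse the argument string
// and re-wrap it. Everything below this function (the binary, its six
// _and_traverse tools) stays unchanged.
function toMcpToolUse(call: OpenAIToolCall): McpToolUse {
  return {
    id: call.id,
    name: call.function.name,
    input: JSON.parse(call.function.arguments),
  };
}
```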

What the top 10 "vllm release april 2026 changelog" results do and do not cover

What the SERP actually says

  • github.com/vllm-project/vllm/releases lists v0.19.0 and v0.19.1 in full, with every PR number and device matrix entry.
  • docs.nvidia.com vLLM release notes map NGC container versions to vLLM tags for monthly alignment.
  • vllm.ai/blog covered Gemma 4 on April 2, prefill-decode disaggregation on MI300X on April 7, and the Korea meetup on April 14.
  • vllm-ascend release notes enumerate NPU kernel and op coverage; ROCm docs enumerate MI300X throughput deltas.
  • deps.dev and pypi.org list the packaging metadata and dependency graph for the April tags.
  • discuss.vllm.ai covers the bi-weekly release cadence at a process level.

Each of those pages is correct inside vLLM's scope. None of them describe a Mac-agent layer above the inference server, because vLLM's release notes, by design, never include a layer above the token boundary.

How to read future vLLM release notes on a Mac

Every vLLM release will grow the set of models, devices, and kernels served on localhost:8000. Expect more entries like Gemma 5, Model Runner V3, more speculative-decoding variants, and more accelerators. Expect occasional security backports. Expect Blackwell and B300 follow-ups.

What you should not expect, because it is not what vLLM ships, is a line item that makes a local model click inside Mail, type into Notes, or pick an option in System Settings. Those capabilities live one layer up, in the code that hosts vLLM as a backend. The release-day question for Mac users is not "did vLLM grow". It is "did the client that hosts vLLM grow".

Fazm's answer today is that the client hosts Anthropic Claude, not a local vLLM endpoint. But the action layer beneath the client, registered at acp-bridge/src/index.ts lines 1057 to 1064, is already the layer a vLLM-hosted client would use verbatim. That is the release-note-independent part of the stack.

Want to see the Mac-agent boundary above vLLM running live?

Thirty minutes on a call. We open acp-bridge/src/index.ts at line 1057, point at the mcp-server-macos-use binary, and run a workflow end-to-end with a real local model in the loop.

Book a call

Frequently asked questions

Which vLLM versions shipped in April 2026?

Two tags land inside April 2026. v0.19.0 was cut on April 3, 2026, with 448 commits from 197 contributors (54 of them new). v0.19.1 was cut on April 18, 2026, as a patch release. A third tag, v0.18.1, was cut on March 31, 2026, so it shows up in some April summaries because its effects reach April users; the headline items in v0.18.1 are a revert of the SM100 MLA prefill default to TRT-LLM, a FlashInfer header pre-download fix, and a DeepGemm E8M0 accuracy fix for Qwen3.5 FP8 on Blackwell.

What are the headline features of vLLM v0.19.0?

Gemma 4 support (PRs #38826 and #38847, requires transformers >= 5.5.0), zero-bubble async scheduling combined with speculative decoding (PR #32951), Model Runner V2 maturation with piecewise CUDA graphs for pipeline parallelism (PR #35162), ViT full CUDA graph capture (PR #35963), general CPU KV cache offloading with a pluggable cache policy (PRs #37160 and #37874), Dual-Batch Overlap generalized to arbitrary models (PR #37926), NVIDIA B300 and GB300 SM 10.3 allreduce fusion (PRs #37755 and #37756), Blackwell SM120 CUTLASS FP8 GEMM, and a transformers v5 compatibility sweep.

What does vLLM v0.19.1 change on top of v0.19.0?

Transformers v5.5.4 upgrade, a Gemma 4 streaming tool-call JSON fix (PR #38992), an Eagle3 drafter for Gemma 4 (PR #39450), quantized MoE for Gemma 4 (PR #39045), and a LoRA loading fix for Gemma 4 (PR #38844). The direction of travel inside v0.19.1 is clear: Gemma 4 went from working to working reliably under tool use, LoRA adapters, and Eagle-style speculative drafting.

Is there a security fix in the April 2026 vLLM cycle?

Yes. CVE-2026-0994 was patched during the April release cycle. It affects the Completions API endpoint in vLLM versions >= 0.10.2. If you run vLLM with Completions enabled and expose it to anything that is not strictly localhost, upgrade past the patched tag in the April cycle. That is unrelated to the April feature work; it is a straight security backport.

Does any of v0.19.0 or v0.19.1 add a way to click inside Mac apps?

No. Every item in both tags lives at or below the token boundary on `localhost:8000`. v0.19.0 is a GPU-kernel, scheduler, model-runner, and device-support release. v0.19.1 is a Gemma 4 stabilization patch. vLLM's scope is the inference server: requests come in as `POST /v1/chat/completions`, tokens go out as SSE chunks or a completed JSON body. Everything above that boundary, including every pixel and every CGEvent on a Mac screen, lives in the code that hosts vLLM as a backend, not inside vLLM itself.

Where does the Mac-desktop action layer above vLLM actually live in shipping code?

In Fazm's case it lives in a 21 MB ARM64 Mach-O at `Fazm.app/Contents/MacOS/mcp-server-macos-use`. The ACP bridge, a Node process, registers that binary as a local MCP server. The registration block is in `acp-bridge/src/index.ts` at lines 1057 through 1064: a single `existsSync(macosUseBinary)` guard, then `servers.push({ name: "macos-use", command: macosUseBinary, args: [], env: [] })`. Zero provider-specific arguments. Zero Anthropic-specific environment variables. The binary speaks MCP over stdio and exposes six `_and_traverse` tools that walk the frontmost app's accessibility tree via `AXUIElementCreateApplication(pid)` and return the re-walked tree in the same response as the action result.

What would it take for a local vLLM endpoint to drive that same Mac binary?

At the code level it is a single-line swap at `acp-bridge/src/index.ts` line 1245 where `DEFAULT_MODEL = "claude-sonnet-4-6"` lives today, plus an inference-loop adapter that speaks vLLM's OpenAI-compatible `POST http://localhost:8000/v1/chat/completions` instead of Anthropic's `POST /v1/messages`, and translates the tool-call JSON shape between the two. That adapter sits where `ClaudeAcpAgent` sits in the ACP SDK today. The perception and action primitives, the `mcp-server-macos-use` binary and its six `_and_traverse` tools, do not change at all because they were registered with `args: []` and `env: []` on purpose.

Which of the April 2026 vLLM changes matter most if you run it behind a Mac agent?

Three. First, zero-bubble async scheduling plus speculative decoding (PR #32951) cuts first-token latency, which is what a Mac agent feels on every `click_and_traverse` round trip. Second, ViT full CUDA graph capture (PR #35963) matters for multimodal models if you are running a vision-capable Gemma 4 variant, though note that a screenshot-free AX-tree Mac agent barely needs the vision path. Third, CPU KV cache offloading (PRs #37160, #37874) lets a longer running session with many AX-tree observations stay resident without falling off the GPU, which is relevant when a session has dozens of tool-call turns.

What does the observation payload look like that a vLLM-served model would read on a Mac turn?

One UTF-8 text line per AX element. Role, accessible title, frame, visibility flag. A real line from a Fazm Dev window: `[AXButton (button)] "Send" x:6272 y:-1754 w:56 h:28 visible`. A full window walk is about 441 elements and serializes to roughly 0.72 seconds of wall time. The model substring-searches for the word it wants, reads x, y, w, h off the same line, and passes those four numbers as arguments to the next `_and_traverse` tool call. The payload is a few kilobytes of UTF-8, not a 500 KB base64-encoded screenshot. That shape is what makes a local 9B or 12B model on vLLM fit the job without running out of context.

Is there a version where vLLM's release notes will directly affect the Fazm codebase?

The interesting trigger is not a specific version number, it is a shape change. vLLM already exposes OpenAI-compatible `/v1/chat/completions` and `/v1/completions`. The day a user wants to point Fazm at a local endpoint, the layer that moves is the ACP SDK Fazm wraps, not the MCP server below it. The MCP server was already provider-agnostic at `acp-bridge/src/index.ts:1057-1064`, because the registration never contained a provider-specific flag in the first place. vLLM release notes themselves will keep iterating inside the inference server, and that is the correct place for them.

Where can I inspect the Fazm facts this guide cites?

All three anchor points are in one file: `acp-bridge/src/index.ts` inside the Fazm desktop source tree. Line 63 resolves `macosUseBinary` to `Fazm.app/Contents/MacOS/mcp-server-macos-use`. Lines 1057 through 1064 are the `existsSync` guard plus the `servers.push({ name: "macos-use", command: macosUseBinary, args: [], env: [] })` registration. Line 1245 declares `DEFAULT_MODEL = "claude-sonnet-4-6"` and line 1246 aliases it as `SONNET_MODEL`. Line 1266 is the `BUILTIN_MCP_NAMES` set: `fazm_tools`, `playwright`, `macos-use`, `whatsapp`, `google-workspace`. For end-user verification, right-click Fazm.app, Show Package Contents, open Contents/MacOS, and run `file mcp-server-macos-use`; the output reports `Mach-O 64-bit executable arm64`.
