Shipped April 3, 2026

vLLM v0.19.0 from the viewpoint of a Mac tool-calling agent

Every other write-up of the April 2026 release ranks its features by throughput. I spent the week reading it through a different lens: which of these features actually move the needle for a desktop agent that makes twenty tool calls per user turn, and how exactly do you point one at a local vLLM server. The answer is one environment variable.

Fazm
11 min read
vLLM v0.19.0 release walk-through
Read against live Fazm source
Written for agent builders on macOS

In the v0.19.0 release notes

Gemma 4 full family · Model Runner V2 · Piecewise CUDA graphs · Zero-bubble async · Speculative decoding combo · /v1/chat/completions/batch · CPU KV cache offload · Pluggable eviction · NVIDIA B300 / GB300 · Online MXFP8 · NVFP4 accuracy fixes · QeRL quantization · Cohere ASR · ColQwen3.5 · Granite 4.0 Speech · Qwen3-ForcedAligner · Gemma 4 tool parser · Kimi-K2.5 tool parser · DBO generalization · Vision encoder CUDA graphs

The honest read

Most vLLM release coverage ships on day one and reads like a compressed changelog. Throughput up, tail latency down, new models supported, new hardware supported. All true. All insufficient. Nobody in those posts tells you what any of it means if your actual workload is a long-horizon tool-calling loop that spends most of its wall time not inside the model but between model calls.

Fazm is a Mac desktop agent. A single user turn typically looks like: read the focused window's accessibility tree, plan, call a tool, receive the tool result, re-read the tree, call the next tool, and keep going until the user's intent is discharged. Twenty tool calls in a turn is ordinary. Fifty is not unusual. In that shape, the things that matter in an inference engine are the things that compound across sequential calls: scheduling efficiency, contention under concurrency, how cleanly the tool-parser survives structured output at scale.
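The shape of that loop can be sketched in a few lines. Every name below (the model callable, the `read_tree` tool, the message shapes) is an illustrative assumption, not Fazm's actual implementation; the point is only the structure, where wall time accumulates between model calls as much as inside them.

```python
# Illustrative sketch of a tool-calling turn loop. All names here are
# hypothetical, not Fazm's real source. The loop alternates model calls
# and tool executions until the model stops asking for tools.

def run_turn(model, tools, user_intent, max_calls=50):
    """Run one user turn: alternate model and tool calls until done."""
    context = [{"role": "user", "content": user_intent}]
    for _ in range(max_calls):
        reply = model(context)                        # one inference round-trip
        if reply.get("tool") is None:                 # intent discharged
            return reply["content"], context
        result = tools[reply["tool"]](reply["args"])  # execute the tool locally
        context.append({"role": "tool", "content": result})
    raise RuntimeError("turn did not converge within max_calls")
```

Twenty iterations of this loop means twenty round-trips through whatever sits behind the model callable, which is why inter-call overhead dominates.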

v0.19.0 is the first vLLM release where the features line up specifically for that shape. Not by accident, and not because the release is targeting desktop agents (it is not) but because the same underlying problem (sequential calls dominated by inter-call overhead) shows up in every agent framework that has scaled past a toy loop. Let me walk through what changed and what it means.

How Fazm routes to a local vLLM v0.19.0 server

Fazm desktop app
ACP bridge (Node)
Claude Code SDK
ANTHROPIC_BASE_URL
Anthropic ↔ OpenAI proxy
vLLM /v1/chat/completions
Gemma 4 / Qwen3.5 local

The hub in that diagram is the single environment variable the ACP bridge writes. Swap its value and every downstream LLM call leaves your machine for the local vLLM server instead of api.anthropic.com. The model name ("claude-sonnet-4-6") is still what the app sends; the translator layer remaps it to whatever vLLM v0.19.0 is serving.

The two lines of Fazm source that make vLLM a drop-in

A Mac agent being swappable onto a local inference server is not a marketing claim; it is a grep. Here are the two exact places in Fazm's source where the routing happens, both straight out of the public repo.

Desktop/Sources/Chat/ACPBridge.swift (lines 378-380)
acp-bridge/src/index.ts (line 1126)

Everything downstream of those two declarations respects them. The Claude Code SDK reads ANTHROPIC_BASE_URL; the ACP bridge forwards the user's chosen model name. If the endpoint you set speaks the Anthropic Messages API and forwards tool calls to vLLM v0.19.0 (claude-code-router, LiteLLM in translation mode, or a thin custom proxy), the loop keeps running and the only observable change is that inference happens on your GPU.
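The remap the translator layer performs is small enough to sketch. This is a minimal illustration of the idea, not the actual code of claude-code-router or LiteLLM; the `MODEL_MAP` contents and helper name are assumptions, and the vLLM-side model name is a placeholder.

```python
# Minimal sketch of the model-name remap an Anthropic-to-OpenAI translator
# performs. MODEL_MAP contents and the function name are illustrative
# assumptions; real proxies each have their own config format.

MODEL_MAP = {
    # what the app sends     -> what vLLM is actually serving (hypothetical)
    "claude-sonnet-4-6": "google/gemma-4-27b-it",
}

def translate_request(anthropic_body: dict) -> dict:
    """Reshape an Anthropic Messages request into an OpenAI chat request."""
    return {
        "model": MODEL_MAP.get(anthropic_body["model"], anthropic_body["model"]),
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in anthropic_body["messages"]
        ],
        "max_tokens": anthropic_body.get("max_tokens", 1024),
    }
```

The app keeps sending "claude-sonnet-4-6"; only the translator knows the real serving name, which is what makes the swap invisible to the loop.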

1 env var

Fazm's ACP bridge writes exactly one environment variable to redirect the agent's brain. Point that variable at an Anthropic-to-OpenAI translator in front of vLLM v0.19.0 and the rest of the loop does not care.

Desktop/Sources/Chat/ACPBridge.swift:380

Which v0.19.0 features actually move the needle for a tool loop

Ranked by how much they change day-to-day agent work, not by what sits highest in the release notes.

Zero-bubble async scheduling + speculative decoding

The single most consequential change for long-horizon tool loops. Before v0.19.0 you chose between speculative decoding (pull down single-token latency) and zero-bubble async scheduling (pull down queueing across concurrent turns). They now compose. For a Fazm session with twenty tool calls per user turn, stacking these translates into a measurable wall-time reduction that is not a benchmark number; it is the user waiting less between clicks.

/v1/chat/completions/batch endpoint

Exactly shaped for agent-trace replay. Record a real session against a frontier model, reshape turns into batch requests, ship the batch to a local vLLM v0.19.0 overnight. That is how you build evals and fine-tuning data for an open model without a cloud bill.
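The reshape step can be sketched as follows. This assumes an OpenAI-style batch JSONL layout (one request object per line with `custom_id`, `method`, `url`, `body`); verify the exact shape against the vLLM v0.19.0 docs before relying on it.

```python
import json

def turns_to_batch_jsonl(turns, model, path):
    """Reshape recorded agent turns into one batch request per line.

    Assumes an OpenAI-style batch JSONL shape; check the exact format
    vLLM's /v1/chat/completions/batch expects before submitting.
    """
    with open(path, "w") as f:
        for i, turn in enumerate(turns):
            f.write(json.dumps({
                "custom_id": f"turn-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": model, "messages": turn["messages"]},
            }) + "\n")
```

The `custom_id` is what lets you join batch outputs back to the original frontier-model turns when you score the replay.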

CPU KV cache offloading + pluggable eviction

On a 48GB M-series or a single H100, you can now hold far more in-flight agent contexts than before. The pluggable eviction policy is the part that matters for agents: you can keep the running session's KV blocks resident and evict speculative replays to CPU instead of the other way around.

Model Runner V2 with piecewise CUDA graphs

Tool-calling turns are heterogeneous in shape. MRV1 over-padded or fell back to eager; MRV2 captures distinct graphs per shape and stops paying the transition cost. Solid win for any pipeline-parallel deployment and for agents with high turn-to-turn variance.

Gemma 4 + Kimi-K2.5 tool parsers

This is how you get tool calls out of open models under a vLLM server. The generic Qwen3 parser already existed. v0.19.0 adds Gemma 4 and Kimi-K2.5 parsers so a translator layer can hand you clean JSON tool calls instead of regex-scraping model output.

Online MXFP8 quantization (caution for tool loops)

Great for chat-completion throughput. Not an automatic win for tool-calling workloads; the arguments field of a tool call is exactly where structured-output degradation shows up first. Eval with your real agent traces before flipping it on.
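The cheapest version of that eval is a structural one: parse the `arguments` string of every tool call in your replayed traces and count how many fail to parse, on both the full-precision and quantized servers. A sketch, assuming OpenAI-shaped tool-call records:

```python
import json

def malformed_argument_rate(tool_calls):
    """Fraction of tool calls whose `arguments` string is not valid JSON.

    `tool_calls` is a list of dicts shaped like OpenAI tool calls. Run this
    over traces from both the full-precision and MXFP8 servers and compare
    the two rates before switching quantization on for the tool loop.
    """
    bad = 0
    for call in tool_calls:
        try:
            json.loads(call["function"]["arguments"])
        except (json.JSONDecodeError, KeyError, TypeError):
            bad += 1
    return bad / len(tool_calls) if tool_calls else 0.0
```

A rising malformed rate under quantization is the early-warning signal; semantic drift in well-formed arguments needs a deeper eval, but this catches the loud failures first.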

The full routing: start a vLLM v0.19.0 server, wire Fazm to it

Three processes, one env var. A real end-to-end setup for pointing the agent's brain at the April 2026 release.

vLLM v0.19.0 + Anthropic translator + Fazm

The numbers that framed my reading

448 Commits in vLLM v0.19.0 (April 3, 2026)
197 Contributors on the v0.19.0 release
1 Env var Fazm sets to redirect the agent's brain
380 Line number in ACPBridge.swift where that env var is written

The 448/197 counts are from the v0.19.0 release notes. The 380 is a straight grep of Desktop/Sources/Chat/ACPBridge.swift. The single env var is what makes a local vLLM install a drop-in for Fazm's brain; no rewrite, no rebuild, no custom SDK.

v0.18.x vs v0.19.0 for a tool-calling agent specifically

The release notes compare v0.19.0 to v0.18.x on throughput and supported hardware. Those are the right metrics for a serving team. A Mac desktop agent cares about a different set. Here is the same delta framed for that workload.

| Feature | vLLM v0.18.x | vLLM v0.19.0 |
| --- | --- | --- |
| Speculative decoding + async scheduling | Mutually exclusive | Composable in one config |
| Model Runner for pipeline parallelism | MRV1, monolithic or eager fallback on shape change | MRV2 with piecewise CUDA graphs |
| Gemma 4 tool parser | Not shipped | Built in |
| Offline batch endpoint | Real-time only | /v1/chat/completions/batch |
| CPU KV cache offload policy | Fixed LRU or disabled | Pluggable eviction |
| Blackwell Ultra (B300/GB300) | Not supported | AllReduce fusion for SM 10.3 |
| MXFP8 on MoE and dense | Dense only, offline recipes | Online quantization |
| Transformers dependency | >=5.1.0 | >=5.5.0 (required for Gemma 4) |

The first row is the one I keep coming back to. A tool-calling loop has two latency axes: per-token latency inside each call and scheduling latency between calls. vLLM used to force you to optimize one at the cost of the other. v0.19.0 stops forcing the choice.
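Some back-of-envelope arithmetic makes the compounding concrete. Every number here is made up for illustration, not a v0.19.0 benchmark; the point is only that with twenty sequential calls, shaving the inter-call scheduling gap matters on the same order as shaving per-token latency.

```python
# Illustrative wall-time arithmetic for one agent turn. All numbers are
# invented for the sketch, not measured against vLLM v0.19.0.

def turn_wall_time(n_calls, tokens_per_call, s_per_token, inter_call_gap_s):
    inference = n_calls * tokens_per_call * s_per_token   # time inside calls
    scheduling = n_calls * inter_call_gap_s               # time between calls
    return inference + scheduling

# One axis optimized at a time (the pre-v0.19.0 forced choice):
before = turn_wall_time(20, 150, 0.020, 0.250)
# Both axes stacked (hypothetical post-v0.19.0 numbers):
after = turn_wall_time(20, 150, 0.012, 0.080)
```

Whatever the real constants are on your hardware, the scheduling term scales with call count, which is exactly the dimension a tool loop maximizes.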

A practical playbook for using v0.19.0 with a desktop agent

The obvious path (replace your frontier model with a local open model) is the weakest. The stronger path treats v0.19.0 as a sidecar for replay, eval, and cheap-lane tool calls.

1

Keep the primary loop on a frontier model

The real-time loop needs tool-call reliability across 20+ sequential invocations per turn. Fazm's default of claude-sonnet-4-6 exists for that reason. Do not change it first.

2

Stand up vLLM v0.19.0 with Gemma 4 or Qwen3.5

Enable MRV2, zero-bubble async scheduling, and speculative decoding. Turn on the Gemma 4 tool parser if you are serving Gemma 4. Leave MXFP8 off for the first pass; eval it separately.

3

Record real Fazm sessions

Every turn, every tool call, every accessibility-tree fragment the model was shown. Fazm's ACP bridge already emits the log stream in a shape that reshapes cleanly into agent-trace JSONL.

4

Replay sessions through /v1/chat/completions/batch

Reshape turns into batch requests. Submit the batch to the local vLLM instance overnight. Use the batch output as eval ground truth: does the open model take the same tool-call sequence the frontier model did on your real workflows?
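The scoring half of step 4 can be sketched too. This compares the sequence of tool names the open model chose against the frontier trace; the function names and trace shapes are illustrative assumptions, and a stricter eval would also diff the arguments, not just the names.

```python
def tool_sequence_match(frontier_turn, replay_turn):
    """True if the replayed turn invoked the same tools in the same order.

    Both arguments are lists of tool-call names extracted from a trace.
    """
    return frontier_turn == replay_turn

def eval_replay(frontier_traces, replay_traces):
    """Fraction of turns where the open model matched the frontier sequence."""
    matches = sum(
        tool_sequence_match(f, r)
        for f, r in zip(frontier_traces, replay_traces)
    )
    return matches / len(frontier_traces) if frontier_traces else 0.0
```

A per-turn-class breakdown of this score is what feeds step 5: you route through the local server only the classes where the match rate holds up.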

5

Route only the safe tool calls through the local server

Once the local model passes on your eval set for a class of turn (tree reads, short summaries, classification), use the custom API endpoint to route just that class through vLLM v0.19.0 while the primary loop stays hosted. Cost falls, latency stays acceptable, and the loop keeps running.
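The split itself is a few lines of routing. The class names, the local URL, and the function name below are all illustrative assumptions; the substance is that only eval-cleared turn classes leave the hosted lane.

```python
# Sketch of the cheap-lane split: turn classes that passed your eval set
# go to the local vLLM server, everything else stays hosted. Class names
# and URLs are illustrative assumptions.

LOCAL_SAFE_CLASSES = {"tree_read", "short_summary", "classification"}

def pick_backend(turn_class: str) -> str:
    if turn_class in LOCAL_SAFE_CLASSES:
        return "http://localhost:8000/v1"   # local vLLM v0.19.0
    return "https://api.anthropic.com"      # hosted frontier model
```

Keeping the safe set explicit (rather than a confidence threshold) makes the routing auditable: you can point at the eval run that justified each entry.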

Three numbers to remember about the release

v0.19.0

The tag. Shipped April 3, 2026. The first vLLM release where speculative decoding and zero-bubble async scheduling are compatible in one config.

448

Commits landed in v0.19.0 from 197 contributors. The highest-volume vLLM release of the year so far.

1

Environment variable Fazm sets (ANTHROPIC_BASE_URL, at ACPBridge.swift line 380) to redirect the agent loop onto a local vLLM server.

Where Fazm sits in the April 2026 inference picture

Fazm is the consumer-friendly app for Mac automation. Not a developer framework, not a Python library. It reads the live accessibility tree through AXUIElement, hands that structured state to a tool-calling frontier model, and executes native clicks, keystrokes, and menu selections against the same APIs Apple uses for VoiceOver and Switch Control.

The v0.19.0 release does not change Fazm's screen-reading path (that stays on the OS) and it does not change the default model (that stays on a hosted frontier tool-caller). What it changes is what is reachable from the same config surface. With one env var and a translator layer, a local vLLM v0.19.0 server is now a reasonable sidecar for replay, eval, and cheap-lane traffic. The door was ajar before. April opened it.

Try the ANTHROPIC_BASE_URL swap on your own Mac

Fazm is a consumer-friendly app, not a developer framework. The custom API endpoint field lives in Settings -> AI Chat.

Download Fazm

Frequently asked questions

What shipped in vLLM's April 2026 release (v0.19.0)?

vLLM v0.19.0 was tagged on April 3, 2026 with 448 commits from 197 contributors. The headline items are Gemma 4 support across the full family, Model Runner V2 maturation with piecewise CUDA graphs for pipeline parallelism, zero-bubble async scheduling that is finally compatible with speculative decoding, a new /v1/chat/completions/batch endpoint for offline workloads, CPU KV cache offloading with pluggable eviction policies, online MXFP8 quantization for MoE and dense models, and NVIDIA B300 and GB300 support. There are also new architectures (Cohere ASR, ColQwen3.5, Granite 4.0 Speech) and new tool parsers (GigaChat, Kimi-K2.5, Gemma 4).

Why does vLLM v0.19.0 matter for a Mac desktop agent that runs tool calls?

Three features in v0.19.0 move the needle specifically for tool-calling loops. First, zero-bubble async scheduling + speculative decoding can now ship together. Before this release you had to pick one; speculative decoding pulled latency down on individual tokens but gave up the scheduling gains across concurrent requests. For an agent that round-trips through model -> tool -> model -> tool on every user turn, inter-call latency dominates. Second, the new /v1/chat/completions/batch endpoint fits agent-trace replay workloads cleanly; you can run a session against a frontier model, capture the trace, then batch-replay the prompts through a local open model to generate training data or evals without real-time latency constraints. Third, CPU KV cache offloading means a single 48GB M-series box can hold far more in-flight contexts than before, which is relevant for running an eval suite locally.

Can I point Fazm at a local vLLM v0.19.0 server?

Indirectly, yes. Fazm's ACP bridge reads one environment variable to redirect the agent's brain: at Desktop/Sources/Chat/ACPBridge.swift line 380 it writes env["ANTHROPIC_BASE_URL"] = customEndpoint. Set Settings -> AI Chat -> Custom API Endpoint to the URL of an Anthropic-to-OpenAI translator (claude-code-router, LiteLLM proxy, or similar) that forwards Messages API calls to vLLM's /v1/chat/completions. Fazm continues to call whatever model name is selected (the default at acp-bridge/src/index.ts line 1126 is claude-sonnet-4-6), the translator maps the name to a vLLM-served model, and the rest of the loop is unchanged. Tool calling survives because v0.19.0 ships a Gemma 4 tool parser and the generic Qwen3 tool parser, which most translator layers can map to the Anthropic tool_use shape.

Which v0.19.0 feature is a trap for agent workloads?

Online MXFP8 quantization on a model that is actively used for tool calling is a trap. Quantization degrades structured-output fidelity unevenly; the arguments field of tool calls is exactly where hallucinated or malformed JSON shows up first. The v0.19.0 implementation is solid for chat completions, but you want to eval your specific agent against the quantized version before switching it in as the tool-calling driver. Benchmarks that report accuracy on MMLU or HumanEval tell you very little about what happens at the fourteenth tool call in a session.

Does Model Runner V2 in v0.19.0 affect latency for a tool-calling loop?

Yes, favorably in most cases. Piecewise CUDA graphs mean the runner can capture distinct execution paths for different request shapes without monolithic recompilation. For a tool-calling agent, turns have heterogeneous shapes (short acknowledgments, long accessibility-tree reads, long tool-result summaries). Under MRV1 the runner would either over-pad or re-enter eager mode; under MRV2 each shape gets its own captured graph and you stop paying the transition cost. The caveat is that MRV2 is maturing, not the default in every config, and some pipeline parallelism setups need config adjustments when you opt in.

Can a local vLLM v0.19.0 install replace Claude Sonnet as Fazm's brain?

Not yet for the primary loop. The default at acp-bridge/src/index.ts line 1126 is claude-sonnet-4-6 because that model has the tool-call throughput and reliability to survive twenty-plus sequential tool invocations per user turn without the loop collapsing. Gemma 4 26B MoE and Qwen3.5 in v0.19.0 get closer, but sustained structured tool-calling over long accessibility-tree payloads still favors hosted frontier models. The interesting role for a local v0.19.0 install today is as the replay engine for eval and the cheaper sidecar for guardrail classification, not as the main brain.

What is the right way to use the new /v1/chat/completions/batch endpoint with an agent?

The agent itself stays on its real-time backend (frontier model for reliability). Separately, you record sessions: every turn, every tool call, every tool result, every accessibility-tree fragment the model was shown. You reshape those traces into batch-API requests and submit them to your local vLLM v0.19.0 instance. The batch endpoint lets you reprocess thousands of turns through an open model overnight, building eval corpora or fine-tuning data without blowing up your hosted-model bill. This is the workflow the release notes do not spell out but the feature is clearly shaped for.

Every release matters a little. v0.19.0 matters to this loop.

The hard part of a desktop agent is not which model it calls; it is the shape of the loop around the model. vLLM's April 2026 release is the first one that fits that shape cleanly.

Try Fazm free
fazm. AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
