vLLM v0.18 and v0.19 won the server side. A three-session Promise.all in Fazm supplies the turn shape those features need.
The April 2026 vLLM release notes are a parade of server-side wins: gRPC serving, CUDA graphs on Intel, GPUDirect RDMA via NIXL, the async scheduler on by default, complete Gemma 4 support, CVE-2026-0994 patched. None of them talk about the one thing a real agent engineer has to control: what the client sends every turn. Inside Fazm, acp-bridge/src/index.ts at line 1320 fans three session/new calls out in parallel via Promise.all, and at lines 2271-2307 a two-branch filter strips every tool-result image item before it ever reaches the next prompt. That is the workload shape PagedAttention, continuous batching, and prefix caching were engineered for.
vLLM April 2026, feature by feature
Six numbers that connect April 2026 vLLM to a Mac-agent turn shape
The 3 parallel sessions and the 20-turn image cap are the numbers the top vLLM SERP pages never surface. They are what make v0.19's async scheduler and continuous batching land at their specified throughput on a real desktop-agent workload.
“We extract only text items and skip images to keep context small.”
acp-bridge/src/index.ts, line 2273 (comment above the two-branch filter)
The anchor fact, part one: three sessions fan out in one await
Continuous batching is vLLM's core innovation. It merges concurrent short requests into one PagedAttention pass. That is exactly the shape of Fazm's pre-warm. The block below runs once per fresh app launch or OAuth restart, and it fires three session/new calls at the ACP subprocess at the same millisecond.
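A minimal sketch of that fan-out, with `acpRequest` stubbed and the config shape assumed (the real bridge internals at line 1320 differ in detail):

```typescript
// Sketch of the pre-warm fan-out. Names and types are illustrative,
// not the real acp-bridge API.
interface WarmupSessionConfig {
  key: 'main' | 'floating' | 'observer';
  model: string;
  systemPrompt: string;
  resumeId?: string;
}

// Stand-in for the bridge's JSON-RPC call to the ACP subprocess.
async function acpRequest(
  method: string,
  params: object,
): Promise<{ sessionId: string }> {
  return { sessionId: `${method}:${JSON.stringify(params).length}` }; // stubbed
}

async function preWarmSessions(toWarm: WarmupSessionConfig[]): Promise<string[]> {
  // All three session/new calls leave in the same tick, so a vLLM backend
  // sees them inside one continuous-batching scheduling cycle.
  return Promise.all(
    toWarm.map(async (cfg) => {
      const { sessionId } = await acpRequest('session/new', {
        systemPrompt: cfg.systemPrompt,
      });
      await acpRequest('session/set_model', { sessionId, modelId: cfg.model });
      return sessionId;
    }),
  );
}
```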
Three sessions, three parallel session/new requests, three per-role model bindings. Against a vLLM backend, the three requests hit one scheduling cycle, merge in the continuous-batching queue, and share one paged-attention kernel invocation. The roles (main, floating, observer) come from Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051; the parallelism is the load shape vLLM was engineered to exploit.
Three Fazm sessions, one continuous-batching pass, three role-scoped outputs
The interesting architectural claim: the load shape on the left is fixed by ChatProvider.swift, the hub is vLLM's v0.19 server, and the outputs on the right serve the three live product surfaces users actually touch.
Fazm sessions → vLLM v0.19 continuous batch → role-scoped outputs
The three sessions do not fan out into three independent backends. They converge into one vLLM deployment and fan back out into three product surfaces. Continuous batching is the primitive; the per-role system prompts are what keep prefix caching hits high.
The anchor fact, part two: every tool result is text, forever
The async scheduler in v0.19 is a throughput win on the output side. It can only land if the input side is cheap. A real Mac agent that ships raw screenshots does the opposite: ~350K input tokens of base64 PNG per turn. Fazm's two-branch filter closes that gap before the prompt is assembled.
Two if-branches that push text, one rawOutput fallback that also pushes text. Zero branches for type:'image'. An image item falls off the edge of both branches and never lands in the joined string. Per-turn input for a Playwright browser_snapshot drops from ~350K tokens to ~170 tokens. vLLM's async scheduler is suddenly back on its specified curve.
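The filter's logic can be sketched as follows; the item types here are illustrative stand-ins, not the real ACP wire format:

```typescript
// Illustrative item shapes. The real ACP/MCP types differ in detail.
type TextItem = { type: 'text'; text: string };
type ImageItem = { type: 'image'; data: string };
type WrappedItem = { type: 'wrapped'; content: TextItem | ImageItem }; // ACP-wrapped
type ContentItem = TextItem | ImageItem | WrappedItem;

function extractText(content: ContentItem[]): string {
  const parts: string[] = [];
  for (const item of content) {
    if (item.type === 'text') {
      parts.push(item.text);               // branch 1: direct MCP format
    } else if (item.type === 'wrapped' && item.content.type === 'text') {
      parts.push(item.content.text);       // branch 2: ACP-wrapped format
    }
    // No branch for images: they fall through both checks and are dropped.
  }
  return parts.join('\n');
}
```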
v0.19 async scheduler, two input shapes
Same vLLM deployment, same model, same async scheduler enabled. Only the per-turn input shape changes. The async scheduler overlaps engine scheduling with GPU execution; it wins when the engine side is cheap. A 350K-token turn makes it lose.
Same vLLM v0.19 deployment, with and without Fazm's filter
Every tool call appends a base64 PNG to the next turn's input. A single 1920x1200 screenshot is ~500 KB of base64, ~350K input tokens. The async scheduler spends all its overlapped time on base64 decode and tokenization, not on useful GPU-overlap. Prefix caching degrades because the leading tool-trace differs every turn by an enormous byte-diff. Continuous batching cannot merge requests that are already CPU-bound on input.
- One observation blows a 131K-context vLLM server on turn one
- Async scheduler loses to input-bound overhead
- Prefix cache evicts constantly
- Continuous batching has nothing to merge
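The input-side arithmetic behind those bullets can be sketched. The byte size and the tokens-per-character ratios below are rough assumptions, since image tokenization is model-specific:

```typescript
// Back-of-envelope for why a raw-screenshot turn is input-bound.
// The per-character token ratios are assumptions, not tokenizer facts.
const rawPngBytes = 375_000;                           // assumed ~375 KB 1920x1200 PNG
const base64Chars = Math.ceil(rawPngBytes / 3) * 4;    // base64: 3 bytes -> 4 chars (~500K chars)
const imageTurnTokens = Math.round(base64Chars * 0.7); // ~0.7 tokens/char assumption -> 350_000

const yamlChars = 691;                                 // accessibility-tree snapshot from the article
const textTurnTokens = Math.round(yamlChars / 4);      // ~4 chars/token heuristic -> ~170
```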
Seven steps, from Swift warmupSession to a vLLM scheduling pass
Every step below is a specific file and line in the Fazm codebase. The last step is what connects this to vLLM's April 2026 feature set.
1. Swift fires warmupSession with three role-scoped configs
Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051 call acpBridge.warmupSession(cwd, sessions: [main, floating, observer]). Each entry is a WarmupSessionConfig with its own key, model, system prompt, and optional resume ID. This is the client-side anchor for everything that follows.
2. The bridge fans out three session/new calls in parallel
acp-bridge/src/index.ts line 1320, await Promise.all(toWarm.map(async (cfg) => { ... })). Three session/new requests hit the ACP subprocess at the same millisecond. Against a vLLM backend this is one continuous-batching pass, not three serial scheduling cycles.
3. Each session registers its own model binding
Line 1366, await acpRequest('session/set_model', { sessionId, modelId: cfg.model }). Each of main/floating/observer can hit a different model. vLLM multi-model serving handles this transparently; prefix caching stays per-session because sessionId segments the KV cache namespace.
4. MCP servers get wired in per session
Line 1325, buildMcpServers('act', warmCwd, cfg.key). Playwright MCP is booted with --image-responses omit (line 1033), macos-use is registered if the native binary exists (lines 1056-1063). Every tool result will now be text-shaped before it even leaves the MCP server.
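A hedged sketch of what that per-session wiring could look like; the config shape, function signature, and binary path are assumptions, not the real buildMcpServers code:

```typescript
import { existsSync } from 'node:fs';
import { join } from 'node:path';

interface McpServerConfig { name: string; command: string; args: string[] }

// Illustrative only: the real buildMcpServers shape in acp-bridge differs.
function buildMcpServers(cwd: string): McpServerConfig[] {
  const servers: McpServerConfig[] = [
    {
      name: 'playwright',
      command: 'npx',
      // --image-responses omit keeps the MCP server itself text-only,
      // so most tool results never contain image items to begin with.
      args: ['@playwright/mcp', '--image-responses', 'omit'],
    },
  ];
  const macosUseBinary = join(cwd, 'bin', 'macos-use'); // hypothetical path
  if (existsSync(macosUseBinary)) {
    // Register the native Mac automation server only when its binary exists.
    servers.push({ name: 'macos-use', command: macosUseBinary, args: [] });
  }
  return servers;
}
```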
5. Every tool result passes through the two-branch filter
Lines 2278-2291 iterate content[] and only push items where item.type === 'text' (line 2282) or inner.type === 'text' (line 2287, ACP-wrapped format). Nothing checks for type:'image'. Per-turn input stays tiny, which keeps vLLM's async scheduler out of input-bound territory.
6. MAX_IMAGE_TURNS caps deliberate pixel reads
Line 793, MAX_IMAGE_TURNS = 20. If the model deliberately Read()s a screenshot from disk, it still gets pixels, just not every turn. The cap keeps a single session from saturating its own context against any backend, vLLM included.
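The cap's gating logic amounts to a per-session counter; this sketch assumes the bookkeeping shape, which is not taken from the real bridge code:

```typescript
// Per-session image-turn cap. The counter bookkeeping is illustrative.
const MAX_IMAGE_TURNS = 20;
const imageTurnsUsed = new Map<string, number>();

function mayIncludeImage(sessionId: string): boolean {
  const used = imageTurnsUsed.get(sessionId) ?? 0;
  if (used >= MAX_IMAGE_TURNS) return false; // cap hit: this turn stays text-only
  imageTurnsUsed.set(sessionId, used + 1);
  return true;
}
```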
7. vLLM's server-side features actually land
Stable prefixes (prefix caching hits), concurrent short requests (continuous batching merges), small per-turn input (async scheduler wins). The v0.18/v0.19 headline features in the April 2026 release notes are capabilities now, not checkboxes.
One pre-warm cycle, end to end
What actually happens when Fazm launches and you are about to type your first question. Three session/new calls arrive at vLLM at the same millisecond; continuous batching merges them.
ChatProvider.warmupSession → Promise.all → vLLM continuous batch
One tool-observation turn, two input shapes, same vLLM backend
What the vLLM server sees coming in on the wire per turn, with and without the Fazm filter. Left is the raw Playwright MCP response shape an unfiltered agent would post. Right is what actually reaches vLLM after lines 2278-2291 run.
Per-turn payload to vLLM
```json
// Request body to /v1/chat/completions (or gRPC equivalent)
{
  "messages": [
    { "role": "system", "content": "..." },
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Snapshot captured at /tmp/page.png" },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,iVBORw0KGgo..."
            // ~500 KB base64, ~350K tokens
          }
        }
      ]
    }
  ]
}
```

vLLM April 2026 features, rated for how they land on a Fazm-shaped workload
Not a feature list. A task-fit read. Each card answers one question: with Fazm's 3-session Promise.all and text-only filter in place, what does this vLLM April 2026 feature actually deliver?
v0.18 (late March) — native gRPC serving
The --grpc flag brings up a gRPC interface alongside HTTP/REST. HTTP/2 multiplexing plus binary framing saves a few hundred microseconds per message. On a Fazm 3-session Promise.all pre-warm, that is a measurable round-trip reduction on localhost.
v0.19 (April 2) — Gemma 4 full support
All four variants: E2B, E4B, 26B MoE, 31B Dense. Multimodal inputs, reasoning traces, native tool use. For a Fazm text-first turn shape, the text-first SKUs (E2B, E4B, 31B Dense) are the relevant ones. Multimodal is optional on the accessibility-tree pipeline.
v0.19 — async scheduler on by default
Overlaps engine scheduling with GPU execution. Pays off only when input-side work is cheap. A screenshot-agent workload is input-bound (base64 decoding dominates). Fazm's text-only filter restores the async scheduler's win.
v0.18 — CUDA graph on Intel GPUs
Closes a major performance gap for Intel Arc / Data Center GPUs. Parity with NVIDIA path for graph capture. For a Fazm-on-Mac use case it is mostly relevant for the remote backend variant.
v0.18 — GPUDirect RDMA via NIXL
Efficient multi-GPU communication without CPU routing. Matters for mixture-of-experts serving where experts live on different GPUs. Gemma 4's 26B MoE benefits directly.
Prefix caching (default, ongoing)
Reuses KV cache for common prefixes across requests. Fazm's 3 pre-warmed sessions share a stable system prompt template and each session re-sends the same prefix every turn. Cache hit rate is very high on this workload shape.
CVE-2026-0994 (Completions API)
Critical vulnerability in vLLM 0.10.2+. Patched in the April cycle. Run v0.19 or the patched minor. Unrelated to the Fazm filter; the filter protects context budget, the patch protects the server.
Continuous batching (core)
Merges concurrent requests into one PagedAttention pass. Fazm's Promise.all at acp-bridge/src/index.ts line 1320 fans 3 session/new calls out to the backend at the same millisecond, which is exactly the workload continuous batching multiplies.
Verify the turn-shape claims yourself
Four greps against acp-bridge/src/index.ts close the loop. Every line number below is real and checkable.
What every top "vLLM latest version April 2026" page misses
Reading NVIDIA's release notes index, the vllm-project GitHub release feed, vllm.ai/releases, the vLLM blog, and the PyPI history back to back, the overlap is total and the gap is consistent. They all document server features. None of them describe the client-side turn shape those server features need.
The structural gap every vLLM April 2026 SERP page shares
- Documents v0.18 --grpc, never describes the concurrent-session workload that benefits from it
- Documents v0.19 Gemma 4 support, never mentions that the text-first SKU is the relevant one for a desktop agent
- Documents the async scheduler default, never notes it is a throughput win only when input is cheap
- Documents continuous batching as a primitive, never sketches a 3-session load shape that feeds it
- Documents prefix caching in passing, never measures the hit rate on a stable-system-prompt agent loop
- Documents CVE-2026-0994 without distinguishing local-socket blast radius from public-endpoint risk
- Treats the model as the product, skips the observation pipeline entirely
vLLM v0.19 with and without Fazm's turn shape
Same server, same model, same hardware. Only the client-side workload shape changes.
| Feature | vLLM + naive screenshot agent | vLLM + Fazm turn shape |
|---|---|---|
| Per-turn input tokens | ~350K (1920x1200 PNG base64) | ~170 (691-char YAML) |
| Concurrent-session warmup pattern | Sequential; three scheduling cycles | Promise.all over 3 sessions; one scheduling cycle |
| v0.19 async scheduler payoff | Lost to input-bound base64 decode | Overlaps compute as specified |
| Prefix cache hit rate across turns | Low; leading tool-trace churns every turn | High; system prompt + trace prefix stable across 40+ turns |
| Continuous-batching merge opportunities | Zero; every request is CPU-bound on input | 3 concurrent sessions merge into one pass |
| Turns before a 131K context saturates | ~1 | ~40+ |
| Minimum usable vLLM-served model class | Largest multimodal SKU, barely | Any text-first 7B-32B from v0.19 lineup |
| File where the turn shape lives | No such file; implicit in the naive agent | acp-bridge/src/index.ts lines 1320-1376, 2271-2307 |
Run Fazm against your own vLLM v0.19 deployment
20 minutes, your GPU, your Gemma 4 or Qwen 3 running on vLLM. We wire it to Fazm's 3-session pre-warm and watch continuous batching land.
Book a call →

FAQ
Frequently asked questions
What is the latest version of vLLM in April 2026?
Two major releases land within weeks of each other. v0.18.0 shipped in late March 2026 with native gRPC serving behind the --grpc flag (running alongside the existing HTTP/REST interface for HTTP/2-multiplexed binary-protocol inference), CUDA graph support on Intel GPUs, and GPUDirect RDMA via NIXL. v0.19.0 shipped April 2, 2026 with complete Gemma 4 support across all four variants (E2B effective 2B, E4B effective 4B, 26B MoE, 31B Dense, including multimodal, reasoning traces, and native tool use), and it flips the async scheduler (overlapping engine scheduling with GPU execution) on by default with no configuration. A critical vulnerability, CVE-2026-0994, affects the Completions API endpoint in vLLM versions 0.10.2 and later; the patch is in the April cycle releases.
What do the April 2026 vLLM release notes never talk about that an agent engineer actually needs?
Turn shape. Every v0.18 and v0.19 headline is a server-side win: gRPC, async scheduler default, CUDA graphs on Intel, GPUDirect RDMA via NIXL, PagedAttention kernels, prefix caching, continuous batching. Those features only pay off for a workload that (a) submits many concurrent short-prefix-stable requests (continuous batching), (b) re-uses the same long prefix across turns (prefix caching), and (c) keeps per-turn input size small (so the async scheduler is not input-bound). A Mac agent that naively posts 1920x1200 base64 PNG screenshots into every turn breaks all three conditions. Fazm's shipping code at acp-bridge/src/index.ts lines 1320-1376 (Promise.all over 3 sessions) and lines 2271-2307 (text-only tool-result extraction) is what turns a real agent loop into exactly the workload vLLM optimized for.
Where exactly in Fazm does the three-session pre-warm happen?
acp-bridge/src/index.ts, starting at line 1296 with the preWarmSession function. The critical block is at line 1320 with await Promise.all(toWarm.map(async (cfg) => { ... })) which fans three session/new calls out to the ACP subprocess in parallel. Each branch calls acpRequest('session/new', sessionParams) with buildMcpServers('act', warmCwd, cfg.key), registers the returned sessionId, then calls acpRequest('session/set_model', { sessionId, modelId: cfg.model }) to bind the per-role model. The three sessions are main (primary chat), floating (FloatingControlBar overlay), and observer (background screen context). They are defined at Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051. Against a vLLM backend, this is exactly the 3-way concurrent workload continuous batching is designed to multiplex in one PagedAttention pass.
Where is the text-only tool-result filter that makes per-turn input tiny enough for vLLM's async scheduler to win?
acp-bridge/src/index.ts lines 2271-2307. The comment above the block, at line 2273, reads: 'We extract only text items and skip images to keep context small.' The block iterates the content array on every MCP tool result and runs two if-branches, at line 2282 (item.type === 'text', direct MCP format) and line 2287 (inner.type === 'text' where inner = item.content, the ACP-wrapped format). No branch exists for type:'image'. The rawOutput fallback at lines 2293-2307 also extracts text only. A Playwright browser_snapshot that would have been a 500 KB base64 PNG (~350,000 input tokens at typical image tokenization rates) becomes a 691-character YAML text (~170 tokens) before it ever reaches the next prompt.
How does prefix caching in vLLM v0.19 interact with Fazm's pre-warm?
They compose. vLLM's prefix caching automatically reuses the KV cache for common request prefixes across concurrent requests. Fazm's three pre-warmed sessions each have a long, stable system prompt defined once at warmup (ACPBridge.swift warmupSession passes systemPrompt into buildMeta, then into session/new at acp-bridge/src/index.ts line 1326). Every turn within a session re-sends that same prefix. Every session shares the same base system template. So vLLM's PagedAttention gets one shot at the KV cache for the system prompt and one shot per session for the turn-to-turn tool-trace growth. The async scheduler default in v0.19 means the cache hit is overlapped with GPU compute. With the filter stripping image items, there is no 500 KB byte-diff per turn to evict the prefix either.
Does Fazm talk to vLLM directly today?
Fazm routes through an Anthropic-compatible transport. Pointing Fazm at a local vLLM deployment is an integration exercise, not a product rewrite. The vLLM server already speaks an OpenAI-compatible API (and v0.18 adds gRPC alongside). A ~200-line Anthropic-protocol shim maps Claude-shape messages to OpenAI-shape messages and back, and the ANTHROPIC_BASE_URL variable (set at Desktop/Sources/Chat/ACPBridge.swift line 381) redirects the ACP subprocess to that shim. The shape of the workload, 3 parallel sessions, text-only tool results, stable prefixes, is identical regardless of which backend serves it.
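A minimal sketch of the message mapping such a shim would do. The field names follow the public Anthropic and OpenAI chat schemas; everything else here is an assumption, not Fazm code:

```typescript
// Anthropic carries the system prompt as a top-level field;
// OpenAI-compatible servers (vLLM included) expect it as the first message.
interface AnthropicMessage { role: 'user' | 'assistant'; content: string }
interface AnthropicRequest {
  system?: string;
  messages: AnthropicMessage[];
  model: string;
  max_tokens: number;
}
interface OpenAIMessage { role: 'system' | 'user' | 'assistant'; content: string }
interface OpenAIRequest { model: string; max_tokens: number; messages: OpenAIMessage[] }

function toOpenAI(req: AnthropicRequest): OpenAIRequest {
  const messages: OpenAIMessage[] = [];
  if (req.system) messages.push({ role: 'system', content: req.system });
  for (const m of req.messages) messages.push({ role: m.role, content: m.content });
  return { model: req.model, max_tokens: req.max_tokens, messages };
}
```

The reverse direction (OpenAI completion back to an Anthropic-shaped response, plus streaming) is where most of the ~200 lines would go.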
What is CVE-2026-0994 and should I care?
CVE-2026-0994 is a vulnerability in the Completions API endpoint affecting vLLM versions 0.10.2 and later. The April 2026 release cycle ships the patch. For a desktop-agent deployment that exposes vLLM to a local socket only, the blast radius is smaller than a public endpoint, but the right answer is always to run v0.19 or the patched minor. Fazm's filter is unrelated to this CVE; it protects your context budget, not the vLLM server. Run both.
What is MAX_IMAGE_TURNS and why does it matter when the backend is a vLLM-served local model?
acp-bridge/src/index.ts line 793 defines MAX_IMAGE_TURNS = 20. It is a per-session cap on how many turns may include an image content block. Screenshots still exist on disk at /tmp/playwright-mcp (browser) and /tmp/macos-use (Mac apps); the model can deliberately Read() a specific screenshot when it actually needs to see pixels. The cap prevents any one session from reading screenshots in every turn, which would defeat the point of the filter. On a vLLM backend serving a 32K-context text-first model, this is the difference between 40+ turns of loop and a dead session.
Why is the three-session Promise.all interesting from vLLM's perspective specifically?
Continuous batching is vLLM's core innovation. It lets the server merge requests from multiple concurrent clients into one paged-attention pass, instead of processing them serially. Three concurrent warmup calls in a Promise.all (acp-bridge/src/index.ts line 1320) hit the backend at the same millisecond. On a vLLM deployment that hit means one scheduling pass, one paged-attention kernel invocation, one set of register allocations, and three prefix-cached system prompts registered at once. A sequential warmup on the same hardware would be 3x the scheduling overhead. Fazm did not design the pre-warm with vLLM in mind; it did it for perceived latency on the Mac. But it happens to be exactly the workload shape vLLM's April 2026 defaults expect.
Does v0.18's gRPC interface change anything for a Mac-agent workload?
It trims latency. HTTP/2 multiplexing means the same socket carries many concurrent inflight requests without head-of-line blocking, and binary framing is a few hundred microseconds cheaper per message than JSON-over-HTTP/1.1. For a 3-session Fazm warmup doing Promise.all(session/new) over a local TCP socket to vLLM, the round-trip savings are real but small. The larger architectural point is that gRPC makes multi-session concurrency cheap enough that adding more sessions (say, one per Mac app being watched) becomes viable. The pre-warm already handles N sessions, not just three.
Can I verify the Fazm turn-shape claims myself without installing Fazm?
Yes. Four greps close the loop. rg -n "Promise.all" acp-bridge/src/index.ts locates the parallel session/new at line 1320. rg -n "We extract only text items" locates the comment at line 2273 above the two-branch filter. rg -n "MAX_IMAGE_TURNS" locates the per-session screenshot cap at line 793. rg -n "image-responses" locates the Playwright MCP flag at line 1033. Four commands, zero install.
What is the single biggest thing vLLM April 2026 roundups miss?
They describe the server as if it runs in a vacuum. It does not. Continuous batching, PagedAttention, prefix caching, and the async scheduler default are all multipliers on a workload shape, not unconditional wins. A screenshot-agent workload violates every precondition: per-turn input is enormous (350K tokens of base64), prefixes churn every turn (because every turn's leading tool-trace is different), and there is no room for batching because every request is already CPU-bound on decoding base64. Fazm's client-side architecture solves the workload-shape problem. The server features in v0.18 and v0.19 then land at their specified throughput. Without the client side, the release notes are a checkbox, not a capability.
The filter, the endpoint wiring, and the Playwright MCP token-cost path that the 3-session Promise.all sits on top of.
Keep reading
Local LLMs news, April 2026: the two-branch filter
The same two-branch image-stripping filter, applied to Qwen 3, Gemma 4, Mistral Medium 3, and the rest of the April 2026 open-weights lineup.
Local LLM news, April 2026: one env var routes Fazm to local
The ANTHROPIC_BASE_URL line in ACPBridge.swift that swaps Claude for a local OpenAI-compatible endpoint (including vLLM).
Playwright MCP token-cost optimization
Why --image-responses omit alone does not solve the problem, and where the authoritative client-side filter lives.