vLLM latest version, April 2026: v0.18 gRPC, v0.19 Gemma 4, async scheduler default · acp-bridge/src/index.ts 1320-1376

vLLM v0.18 and v0.19 won the server side. A three-session Promise.all in Fazm supplies the turn shape those features need.

The April 2026 vLLM release notes are a parade of server-side wins: gRPC serving, CUDA graphs on Intel, GPUDirect RDMA via NIXL, the async scheduler on by default, complete Gemma 4 support, CVE-2026-0994 patched. None of them talk about the one thing a real agent engineer has to control: what the client sends every turn. Inside Fazm, acp-bridge/src/index.ts at line 1320 fans three session/new calls out in parallel via Promise.all, and at lines 2271-2307 a two-branch filter strips every tool-result image item before it ever reaches the next prompt. That is the workload shape PagedAttention, continuous batching, and prefix caching were engineered for.

Fazm · 13 min read · 4.9 from 200+ ratings
  • Every vLLM server-side claim traced back to the April 2026 v0.18 and v0.19 release notes
  • Every Fazm client-side claim traced to a line number in acp-bridge/src/index.ts and Desktop/Sources/Providers/ChatProvider.swift
  • The one question the top SERP skips: what turn shape does vLLM's April 2026 feature set actually need?

vLLM April 2026, feature by feature

| v0.18.0 (late March 2026) | v0.19.0 (April 2, 2026) | Core (ongoing) |
|---|---|---|
| Native gRPC serving (--grpc) | Gemma 4 E2B / E4B / 26B MoE / 31B Dense | PagedAttention |
| CUDA graph on Intel GPUs | Async scheduler on by default | Continuous batching |
| GPUDirect RDMA via NIXL | Multimodal Gemma 4 inputs | Automatic prefix caching |
| | Native tool-use for Gemma 4 | OpenAI-compatible /v1/chat/completions |
| | Reasoning traces in Gemma 4 | |
| | CVE-2026-0994 patched | |

Six numbers that connect April 2026 vLLM to a Mac-agent turn shape

  • 3: concurrent sessions fanned out via Promise.all (acp-bridge/src/index.ts line 1320)
  • 1320: line where Promise.all(toWarm.map(...)) ships the parallel session/new
  • 1366: line where session/set_model binds a per-role model to each pre-warmed session
  • 2273: line with the comment 'We extract only text items and skip images to keep context small'
  • ~350K: input tokens a single 1920x1200 base64 PNG would add per turn, before the filter
  • 20: MAX_IMAGE_TURNS, per-session cap on deliberate screenshot Read() calls (line 793)

The 3 parallel sessions and the 20-turn image cap are the numbers the top vLLM SERP pages never surface. They are what make v0.19's async scheduler and continuous batching land at their specified throughput on a real desktop-agent workload.

350K → 170 tokens per turn

We extract only text items and skip images to keep context small.

acp-bridge/src/index.ts, line 2273 (comment above the two-branch filter)

The anchor fact, part one: three sessions fan out in one await

Continuous batching is vLLM's core innovation. It merges concurrent short requests into one PagedAttention pass. That is exactly the shape of Fazm's pre-warm. The block below runs once per fresh app launch or OAuth restart, and it fires three session/new calls at the ACP subprocess at the same millisecond.

acp-bridge/src/index.ts, lines 1320-1376
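The actual block is not reproduced here; the following is a minimal sketch of its shape, assuming an `acpRequest` helper for the bridge's JSON-RPC calls to the ACP subprocess (the type names below are illustrative, not Fazm's exact code):

```typescript
// Sketch of the three-session pre-warm fan-out (illustrative, not Fazm's code).
type WarmupConfig = { key: string; model: string; systemPrompt: string };
type AcpRequest = (method: string, params: Record<string, unknown>) => Promise<any>;

async function preWarmSessions(
  toWarm: WarmupConfig[],
  acpRequest: AcpRequest,
  warmCwd: string,
): Promise<Map<string, string>> {
  const sessions = new Map<string, string>();
  // All session/new requests leave in the same tick; against vLLM they
  // land in one continuous-batching scheduling cycle instead of three.
  await Promise.all(
    toWarm.map(async (cfg) => {
      const { sessionId } = await acpRequest("session/new", {
        cwd: warmCwd,
        systemPrompt: cfg.systemPrompt,
      });
      // Bind the per-role model once the session exists.
      await acpRequest("session/set_model", { sessionId, modelId: cfg.model });
      sessions.set(cfg.key, sessionId);
    }),
  );
  return sessions;
}
```

The essential property is that the map callback is async and the awaits sit inside it, so all three session/new calls are in flight before any response returns.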

Three sessions, three parallel session/new requests, three per-role model bindings. Against a vLLM backend, the three requests hit one scheduling cycle, merge in the continuous-batching queue, and share one paged-attention kernel invocation. The roles (main, floating, observer) come from Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051; the parallelism is the load shape vLLM was engineered to exploit.

Three Fazm sessions, one continuous-batching pass, three role-scoped outputs

The interesting architectural claim: the load shape on the left is fixed by ChatProvider.swift, the hub is vLLM's v0.19 server, and the outputs on the right serve the three live product surfaces users actually touch.

Fazm sessions → vLLM v0.19 continuous batch → role-scoped outputs

main (chat)        →  vLLM v0.19  →  Chat answers
floating (overlay) →  vLLM v0.19  →  FloatingControlBar pill
observer (bg)      →  vLLM v0.19  →  Proactive context hints
                      (session KV reused across turns)

The three sessions do not fan out into three independent backends. They converge into one vLLM deployment and fan back out into three product surfaces. Continuous batching is the primitive; the per-role system prompts are what keep prefix caching hits high.

The anchor fact, part two: every tool result is text, forever

The async scheduler in v0.19 is a throughput win on the output side. It can only land if the input side is cheap. A real Mac agent that ships raw screenshots does the opposite: ~350K input tokens per turn of base64 PNG. Fazm's two-branch filter closes that gap before the prompt is assembled.

acp-bridge/src/index.ts, lines 2274-2307
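The filter itself is not reproduced here; below is a minimal sketch of the two-branch shape described in the article (type definitions are illustrative, not Fazm's exact ones):

```typescript
// Sketch of the text-only tool-result extraction (illustrative, not Fazm's code).
type ContentItem =
  | { type: "text"; text: string }                          // direct MCP format
  | { type: "image"; data: string }                         // never matched below
  | { type: "content"; content: { type: string; text?: string } }; // ACP-wrapped

function extractToolResultText(content: ContentItem[]): string {
  const parts: string[] = [];
  for (const item of content) {
    if (item.type === "text") {
      // Branch 1: direct MCP text item.
      parts.push(item.text);
    } else if (item.type === "content" && item.content.type === "text" && item.content.text) {
      // Branch 2: ACP-wrapped text item.
      parts.push(item.content.text);
    }
    // No branch for type:'image' -> image items never reach the joined string.
  }
  return parts.join("\n");
}
```

The image exclusion is structural: there is no branch to delete or misconfigure, so an image item simply has nowhere to land.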

Two if-branches that push text, one rawOutput fallback that also pushes text. Zero branches for type:'image'. An image item falls off the edge of both branches and never lands in the joined string. Per-turn input for a Playwright browser_snapshot drops from ~350K tokens to ~170 tokens. vLLM's async scheduler is suddenly back on its specified curve.

v0.19 async scheduler, two input shapes

Same vLLM deployment, same model, same async scheduler enabled. Only the per-turn input shape changes. The async scheduler overlaps engine scheduling with GPU execution; it wins when the engine side is cheap. A 350K-token turn makes it lose.

Same vLLM v0.19 deployment, with and without Fazm's filter

Every tool call appends a base64 PNG to the next turn's input. A single 1920x1200 screenshot is ~500 KB of base64, ~350K input tokens. The async scheduler spends all its overlapped time on base64 decode and tokenization, not on useful GPU overlap. Prefix caching degrades because the leading tool-trace differs every turn by an enormous byte-diff. Continuous batching cannot merge requests that are already CPU-bound on input.

  • One observation saturates a 131K-context vLLM server on turn one
  • Async scheduler loses to input-bound overhead
  • Prefix cache evicts constantly
  • Continuous batching has nothing to merge

Seven steps, from Swift warmupSession to a vLLM scheduling pass

Every step below is a specific file and line in the Fazm codebase. The last step is what connects this to vLLM's April 2026 feature set.


1. Swift fires warmupSession with three role-scoped configs

Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051 call acpBridge.warmupSession(cwd, sessions: [main, floating, observer]). Each entry is a WarmupSessionConfig with its own key, model, system prompt, and optional resume ID. This is the client-side anchor for everything that follows.


2. The bridge fans out three session/new calls in parallel

acp-bridge/src/index.ts line 1320, await Promise.all(toWarm.map(async (cfg) => { ... })). Three session/new requests hit the ACP subprocess at the same millisecond. Against a vLLM backend this is one continuous-batching pass, not three serial scheduling cycles.


3. Each session registers its own model binding

Line 1366, await acpRequest('session/set_model', { sessionId, modelId: cfg.model }). Each of main/floating/observer can hit a different model. vLLM multi-model serving handles this transparently; prefix caching stays per-session because sessionId segments the KV cache namespace.


4. MCP servers get wired in per session

Line 1325, buildMcpServers('act', warmCwd, cfg.key). Playwright MCP is booted with --image-responses omit (line 1033), macos-use is registered if the native binary exists (lines 1056-1063). Every tool result will now be text-shaped before it even leaves the MCP server.


5. Every tool result passes through the two-branch filter

Lines 2278-2291 iterate content[] and only push items where item.type === 'text' (line 2282) or inner.type === 'text' (line 2287, ACP-wrapped format). Nothing checks for type:'image'. Per-turn input stays tiny, which keeps vLLM's async scheduler out of input-bound territory.


6. MAX_IMAGE_TURNS caps deliberate pixel reads

Line 793, MAX_IMAGE_TURNS = 20. If the model deliberately Read()s a screenshot from disk, it still gets pixels, just not every turn. The cap keeps a single session from saturating its own context against any backend, vLLM included.


7. vLLM's server-side features actually land

Stable prefixes (prefix caching hits), concurrent short requests (continuous batching merges), small per-turn input (async scheduler wins). The v0.18/v0.19 headline features in the April 2026 release notes are capabilities now, not checkboxes.

One pre-warm cycle, end to end

What actually happens when Fazm launches and you are about to type your first question. Three session/new calls arrive at vLLM at the same millisecond; continuous batching merges them.

ChatProvider.warmupSession → Promise.all → vLLM continuous batch

ChatProvider (Swift) → ACPBridge: warmupSession(sessions=[main, floating, observer])
ACPBridge → acp-bridge index.ts: ACP JSON warmup payload
acp-bridge index.ts: Promise.all(toWarm.map(...)) at line 1320
acp-bridge index.ts → ACP subprocess: session/new x3 (same millisecond)
ACP subprocess → vLLM v0.19: 3 concurrent /v1/chat/completions (or gRPC) streams
vLLM v0.19: continuous batching merges → 1 PagedAttention pass
vLLM v0.19 → ACP subprocess: 3 stream responses (shared prefix KV cached)
ACP subprocess → acp-bridge index.ts: session IDs x3
acp-bridge index.ts → ACP subprocess: session/set_model x3 at line 1366
acp-bridge index.ts → ChatProvider (Swift): warmup complete

One tool-observation turn, two input shapes, same vLLM backend

What the vLLM server sees coming in on the wire per turn, with and without the Fazm filter: first, the raw Playwright MCP response shape an unfiltered agent would post, versus what actually reaches vLLM after lines 2278-2291 run.

Per-turn payload to vLLM

// Request body to /v1/chat/completions (or gRPC equivalent)
{
  "messages": [
    { "role": "system", "content": "..." },
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Snapshot captured at /tmp/page.png" },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,iVBORw0KGgo..."
            // ~500 KB base64, ~350K tokens
          }
        }
      ]
    }
  ]
}
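For contrast, a sketch of the post-filter payload for the same turn (contents illustrative; the real text is whatever YAML accessibility snapshot the MCP server emitted):

```json
{
  "messages": [
    { "role": "system", "content": "..." },
    {
      "role": "user",
      "content": "Snapshot captured at /tmp/page.png\n- page: ...691-char YAML accessibility snapshot, ~170 tokens..."
    }
  ]
}
```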

vLLM April 2026 features, rated for how they land on a Fazm-shaped workload

Not a feature list. A task-fit read. Each card answers one question: with Fazm's 3-session Promise.all and text-only filter in place, what does this vLLM April 2026 feature actually deliver?

v0.18 (late March) — native gRPC serving

The --grpc flag brings up a gRPC interface alongside HTTP/REST. HTTP/2 multiplexing plus binary framing saves a few hundred microseconds per message. On a Fazm 3-session Promise.all pre-warm, that is a measurable round-trip reduction on localhost.

v0.19 (April 2) — Gemma 4 full support

All four variants: E2B, E4B, 26B MoE, 31B Dense. Multimodal inputs, reasoning traces, native tool use. For a Fazm text-first turn shape, the text-first SKUs (E2B, E4B, 31B Dense) are the relevant ones. Multimodal is optional on the accessibility-tree pipeline.

v0.19 — async scheduler on by default

Overlaps engine scheduling with GPU execution. Pays off only when input-side work is cheap. A screenshot-agent workload is input-bound (base64 decoding dominates). Fazm's text-only filter restores the async scheduler's win.

v0.18 — CUDA graph on Intel GPUs

Closes a major performance gap for Intel Arc / Data Center GPUs. Parity with NVIDIA path for graph capture. For a Fazm-on-Mac use case it is mostly relevant for the remote backend variant.

v0.18 — GPUDirect RDMA via NIXL

Efficient multi-GPU communication without CPU routing. Matters for mixture-of-experts serving where experts live on different GPUs. Gemma 4's 26B MoE benefits directly.

Prefix caching (default, ongoing)

Reuses KV cache for common prefixes across requests. Fazm's 3 pre-warmed sessions share a stable system prompt template and each session re-sends the same prefix every turn. Cache hit rate is very high on this workload shape.

CVE-2026-0994 (Completions API)

Critical vulnerability in vLLM 0.10.2+. Patched in the April cycle. Run v0.19 or the patched minor. Unrelated to the Fazm filter; the filter protects context budget, the patch protects the server.

Continuous batching (core)

Merges concurrent requests into one PagedAttention pass. Fazm's Promise.all at acp-bridge/src/index.ts line 1320 fans 3 session/new calls out to the backend at the same millisecond, which is exactly the workload continuous batching multiplies.

Verify the turn-shape claims yourself

Four greps against acp-bridge/src/index.ts close the loop. Every line number below is real and checkable.

rg against acp-bridge/src/index.ts
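Collected here are the four commands, with paths, patterns, and line numbers as cited in this article (run from a checkout containing acp-bridge/):

```shell
# Four greps against acp-bridge/src/index.ts (line numbers per this article)
rg -n "Promise.all" acp-bridge/src/index.ts               # parallel session/new, line 1320
rg -n "We extract only text items" acp-bridge/src/index.ts # filter comment, line 2273
rg -n "MAX_IMAGE_TURNS" acp-bridge/src/index.ts           # screenshot cap, line 793
rg -n "image-responses" acp-bridge/src/index.ts           # Playwright MCP flag, line 1033
```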

What every top "vLLM latest version April 2026" page misses

Reading NVIDIA's release notes index, the vllm-project GitHub release feed, vllm.ai/releases, the vLLM blog, and the PyPI history back to back, the overlap is total and the gap is consistent. They all document server features. None of them describe the client-side turn shape those server features need.

The structural gap every vLLM April 2026 SERP page shares

  • Documents v0.18 --grpc, never describes the concurrent-session workload that benefits from it
  • Documents v0.19 Gemma 4 support, never mentions that the text-first SKU is the relevant one for a desktop agent
  • Documents the async scheduler default, never notes it is a throughput win only when input is cheap
  • Documents continuous batching as a primitive, never sketches a 3-session load shape that feeds it
  • Documents prefix caching in passing, never measures the hit rate on a stable-system-prompt agent loop
  • Documents CVE-2026-0994 without distinguishing local-socket blast radius from public-endpoint risk
  • Treats the model as the product, skips the observation pipeline entirely

vLLM v0.19 with and without Fazm's turn shape

Same server, same model, same hardware. Only the client-side workload shape changes.

| Feature | vLLM + naive screenshot agent | vLLM + Fazm turn shape |
|---|---|---|
| Per-turn input tokens | ~350K (1920x1200 PNG base64) | ~170 (691-char YAML) |
| Concurrent-session warmup pattern | Sequential; three scheduling cycles | Promise.all over 3 sessions; one scheduling cycle |
| v0.19 async scheduler payoff | Lost to input-bound base64 decode | Overlaps compute as specified |
| Prefix cache hit rate across turns | Low; leading tool-trace churns every turn | High; system prompt + trace prefix stable across 40+ turns |
| Continuous-batching merge opportunities | Zero; every request is CPU-bound on input | 3 concurrent sessions merge into one pass |
| Turns before a 131K context saturates | ~1 | ~40+ |
| Minimum usable vLLM-served model class | Largest multimodal SKU, barely | Any text-first 7B-32B from v0.19 lineup |
| File where the turn shape lives | No such file; implicit in the naive agent | acp-bridge/src/index.ts lines 1320-1376, 2271-2307 |

Run Fazm against your own vLLM v0.19 deployment

20 minutes, your GPU, your Gemma 4 or Qwen 3 running on vLLM. We wire it to Fazm's 3-session pre-warm and watch continuous batching land.

Book a call

FAQ

Frequently asked questions

What is the latest version of vLLM in April 2026?

Two major releases land within weeks of each other. v0.18.0 shipped in late March 2026 with native gRPC serving behind the --grpc flag (running alongside the existing HTTP/REST interface for HTTP/2-multiplexed binary-protocol inference), CUDA graph support on Intel GPUs, and GPUDirect RDMA via NIXL. v0.19.0 shipped April 2, 2026 with complete Gemma 4 support across all four variants (E2B effective 2B, E4B effective 4B, 26B MoE, 31B Dense, including multimodal, reasoning traces, and native tool use), and it flips the async scheduler (overlapping engine scheduling with GPU execution) on by default with no configuration. A critical vulnerability, CVE-2026-0994, affects the Completions API endpoint in vLLM versions 0.10.2 and later; the patch is in the April cycle releases.

What do the April 2026 vLLM release notes never talk about that an agent engineer actually needs?

Turn shape. Every v0.18 and v0.19 headline is a server-side win: gRPC, async scheduler default, CUDA graphs on Intel, GPUDirect RDMA via NIXL, PagedAttention kernels, prefix caching, continuous batching. Those features only pay off for a workload that (a) submits many concurrent short-prefix-stable requests (continuous batching), (b) re-uses the same long prefix across turns (prefix caching), and (c) keeps per-turn input size small (so the async scheduler is not input-bound). A Mac agent that naively posts 1920x1200 base64 PNG screenshots into every turn breaks all three conditions. Fazm's shipping code at acp-bridge/src/index.ts lines 1320-1376 (Promise.all over 3 sessions) and lines 2271-2307 (text-only tool-result extraction) is what turns a real agent loop into exactly the workload vLLM optimized for.

Where exactly in Fazm does the three-session pre-warm happen?

acp-bridge/src/index.ts, starting at line 1296 with the preWarmSession function. The critical block is at line 1320 with await Promise.all(toWarm.map(async (cfg) => { ... })) which fans three session/new calls out to the ACP subprocess in parallel. Each branch calls acpRequest('session/new', sessionParams) with buildMcpServers('act', warmCwd, cfg.key), registers the returned sessionId, then calls acpRequest('session/set_model', { sessionId, modelId: cfg.model }) to bind the per-role model. The three sessions are main (primary chat), floating (FloatingControlBar overlay), and observer (background screen context). They are defined at Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051. Against a vLLM backend, this is exactly the 3-way concurrent workload continuous batching is designed to multiplex in one PagedAttention pass.

Where is the text-only tool-result filter that makes per-turn input tiny enough for vLLM's async scheduler to win?

acp-bridge/src/index.ts lines 2271-2307. The comment above the block, at line 2273, reads: 'We extract only text items and skip images to keep context small.' The block iterates the content array on every MCP tool result and runs two if-branches, at line 2282 (item.type === 'text', direct MCP format) and line 2287 (inner.type === 'text' where inner = item.content, the ACP-wrapped format). No branch exists for type:'image'. The rawOutput fallback at lines 2293-2307 also extracts text only. A Playwright browser_snapshot that would have been a 500 KB base64 PNG (~350,000 input tokens at typical image tokenization rates) becomes a 691-character YAML text (~170 tokens) before it ever reaches the next prompt.

How does prefix caching in vLLM v0.19 interact with Fazm's pre-warm?

They compose. vLLM's prefix caching automatically reuses the KV cache for common request prefixes across concurrent requests. Fazm's three pre-warmed sessions each have a long, stable system prompt defined once at warmup (ACPBridge.swift warmupSession passes systemPrompt into buildMeta, then into session/new at acp-bridge/src/index.ts line 1326). Every turn within a session re-sends that same prefix. Every session shares the same base system template. So vLLM's PagedAttention gets one shot at the KV cache for the system prompt and one shot per session for the turn-to-turn tool-trace growth. The async scheduler default in v0.19 means the cache hit is overlapped with GPU compute. With the filter stripping image items, there is no 500 KB byte-diff per turn to evict the prefix either.

Does Fazm talk to vLLM directly today?

Fazm routes through an Anthropic-compatible transport. Pointing Fazm at a local vLLM deployment is an integration exercise, not a product rewrite. The vLLM server already speaks an OpenAI-compatible API (and v0.18 adds gRPC alongside). A ~200-line Anthropic-protocol shim maps Claude-shape messages to OpenAI-shape messages and back, and the ANTHROPIC_BASE_URL variable (set at Desktop/Sources/Chat/ACPBridge.swift line 381) redirects the ACP subprocess to that shim. The shape of the workload, 3 parallel sessions, text-only tool results, stable prefixes, is identical regardless of which backend serves it.
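A sketch of the message-mapping core such a shim would need, assuming the standard shapes of the two public chat APIs (tool use, streaming, and the reverse direction omitted; all type names are illustrative):

```typescript
// Sketch of the Anthropic->OpenAI message mapping at the heart of a shim
// (illustrative; not a complete protocol translation).
type Block = { type: "text"; text: string } | { type: "image"; source: string };
type AnthropicMsg = { role: "user" | "assistant"; content: string | Block[] };
type OpenAIMsg = { role: "system" | "user" | "assistant"; content: string };

function toOpenAI(system: string | undefined, messages: AnthropicMsg[]): OpenAIMsg[] {
  const out: OpenAIMsg[] = [];
  // Anthropic carries the system prompt as a top-level field;
  // OpenAI expects it as the first message in the array.
  if (system) out.push({ role: "system", content: system });
  for (const m of messages) {
    const text =
      typeof m.content === "string"
        ? m.content
        : m.content
            .filter((b): b is Extract<Block, { type: "text" }> => b.type === "text")
            .map((b) => b.text)
            .join("\n");
    out.push({ role: m.role, content: text });
  }
  return out;
}
```

On a text-only turn shape like Fazm's, this mapping is nearly lossless, which is what keeps the shim small.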

What is CVE-2026-0994 and should I care?

CVE-2026-0994 is a vulnerability in the Completions API endpoint affecting vLLM versions 0.10.2 and later. The April 2026 release cycle ships the patch. For a desktop-agent deployment that exposes vLLM to a local socket only, the blast radius is smaller than a public endpoint, but the right answer is always to run v0.19 or the patched minor. Fazm's filter is unrelated to this CVE; it protects your context budget, not the vLLM server. Run both.

What is MAX_IMAGE_TURNS and why does it matter when the backend is a vLLM-served local model?

acp-bridge/src/index.ts line 793 defines MAX_IMAGE_TURNS = 20. It is a per-session cap on how many turns may include an image content block. Screenshots still exist on disk at /tmp/playwright-mcp (browser) and /tmp/macos-use (Mac apps); the model can deliberately Read() a specific screenshot when it actually needs to see pixels. The cap prevents any one session from reading screenshots in every turn, which would defeat the point of the filter. On a vLLM backend serving a 32K-context text-first model, this is the difference between 40+ turns of loop and a dead session.
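What a per-session cap like this amounts to can be sketched as a small budget object (the bookkeeping below is illustrative; Fazm's actual implementation may differ):

```typescript
// Sketch of a per-session image-turn budget in the spirit of MAX_IMAGE_TURNS = 20
// (illustrative, not Fazm's code).
const MAX_IMAGE_TURNS = 20;

class ImageTurnBudget {
  private used = new Map<string, number>();

  // Returns true if this session may still include an image block this turn,
  // and charges the budget if so. Each session is tracked independently.
  allowImage(sessionId: string): boolean {
    const n = this.used.get(sessionId) ?? 0;
    if (n >= MAX_IMAGE_TURNS) return false;
    this.used.set(sessionId, n + 1);
    return true;
  }
}
```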

Why is the three-session Promise.all interesting from vLLM's perspective specifically?

Continuous batching is vLLM's core innovation. It lets the server merge requests from multiple concurrent clients into one paged-attention pass, instead of processing them serially. Three concurrent warmup calls in a Promise.all (acp-bridge/src/index.ts line 1320) hit the backend at the same millisecond. On a vLLM deployment that hit means one scheduling pass, one paged-attention kernel invocation, one set of register allocations, and three prefix-cached system prompts registered at once. A sequential warmup on the same hardware would be 3x the scheduling overhead. Fazm did not design the pre-warm with vLLM in mind; it did it for perceived latency on the Mac. But it happens to be exactly the workload shape vLLM's April 2026 defaults expect.

Does v0.18's gRPC interface change anything for a Mac-agent workload?

It trims latency. HTTP/2 multiplexing means the same socket carries many concurrent inflight requests without head-of-line blocking, and binary framing is a few hundred microseconds cheaper per message than JSON-over-HTTP/1.1. For a 3-session Fazm warmup doing Promise.all(session/new) over a local TCP socket to vLLM, the round-trip savings are real but small. The larger architectural point is that gRPC makes multi-session concurrency cheap enough that adding more sessions (say, one per Mac app being watched) becomes viable. The pre-warm already handles N sessions, not just three.

Can I verify the Fazm turn-shape claims myself without installing Fazm?

Yes. Four greps close the loop. rg -n "Promise.all" acp-bridge/src/index.ts locates the parallel session/new at line 1320. rg -n "We extract only text items" locates the comment at line 2273 above the two-branch filter. rg -n "MAX_IMAGE_TURNS" locates the per-session screenshot cap at line 793. rg -n "image-responses" locates the Playwright MCP flag at line 1033. Four lines, zero install.

What is the single biggest thing vLLM April 2026 roundups miss?

They describe the server as if it runs in a vacuum. It does not. Continuous batching, PagedAttention, prefix caching, and the async scheduler default are all multipliers on a workload shape, not unconditional wins. A screenshot-agent workload violates every precondition: per-turn input is enormous (350K tokens of base64), prefixes churn every turn (because every turn's leading tool-trace is different), and there is no room for batching because every request is already CPU-bound on decoding base64. Fazm's client-side architecture solves the workload-shape problem. The server features in v0.18 and v0.19 then land at their specified throughput. Without the client side, the release notes are a checkbox, not a capability.

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
