The vLLM 2026 release arc is a persistence story the server changelog refuses to tell.
Every top SERP result for vllm release 2026 walks through server-side deltas: v0.18 gRPC serving, KV cache offloading, CUDA graphs on Intel, v0.19 async scheduler default, Gemma 4 across four variants, CVE-2026-0994 patched. All true, all server-side, all half the picture. Prefix caching, continuous batching, KV offloading, and the async scheduler are multipliers on a workload shape the client has to produce. Fazm produces that shape with three specific client-side primitives, each traceable to a file and a line number.
The 2026 release arc, feature by feature
Six numbers that connect the 2026 release line to a client
The 10-turn-pair batch size and the persisted sessionIds are the two client-side numbers the vLLM 2026 release notes never surface. They are what make v0.18's KV offloading and v0.19's async scheduler compound across app restarts rather than reset on every launch.
const CHAT_OBSERVER_BATCH_SIZE = 10; // Send batch every N turn pairs
acp-bridge/src/index.ts line 1156 (verbatim)
Anchor one: session/resume with a persisted sessionId
The single most under-discussed client-side behavior for vLLM 2026 is how the client behaves across app restarts. vLLM's prefix cache, KV offloading, and continuous-batching slots all reward long-lived session identity. Fazm encodes that identity in two lines of Swift and one if-branch in the bridge.
Two lines pull prior sessionIds from UserDefaults. Three warmup entries bind them to role-scoped sessions. main and floating resume if the IDs exist; observer always starts fresh, because it is a batching background analyzer, not a user-facing conversation. The IDs are persisted on first successful session/new and replayed forever.
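The persist-once, replay-forever pattern can be sketched in a few lines of TypeScript. This is an illustrative reconstruction, not Fazm's actual code: the `store` map stands in for UserDefaults, and the key names only mirror the roles described above.

```typescript
// Hypothetical sketch of persist-on-first-new, replay-on-every-launch.
// `store` stands in for UserDefaults; keys mirror the persisted roles.
const store = new Map<string, string>();

function resumeIdFor(key: "main" | "floating" | "observer"): string | undefined {
  // observer always starts fresh: it is a background analyzer, not a conversation
  if (key === "observer") return undefined;
  return store.get(`${key}SessionIdKey`);
}

function persistSessionId(key: string, sessionId: string): void {
  // written once, on the first successful session/new, then replayed forever
  store.set(`${key}SessionIdKey`, sessionId);
}
```

The asymmetry is the point: two roles replay identity forever, one role is structurally exempt.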
Line 1333 is the one the SERP never mentions. That call tells the ACP subprocess (and by extension, any Anthropic-shim-or-vLLM backend that honors session identity) that this is not a new conversation. The server-side KV cache lineage for this sessionId is reusable. The prior tool trace is still in the cache. Byte-identical prefix. One call to session/set_model right after (line 1341) because the ACP SDK would otherwise snap the model binding back to the default.
Three prefixes, one KV cache, three slots reused every launch
On the left, the three client-side primitives. In the middle, vLLM's 2026 server. On the right, the server mechanisms that only light up when the primitives on the left hold.
Fazm client primitives → vLLM 2026 hub → server mechanisms that compound
Server features are not self-activating. The client has to ship session identity, prompt stability, and batched cadence. Every beam on this diagram has a client-side precondition and a server-side mechanism.
Anchor two: the ten-turn-pair observer flush
The observer session is the one Fazm uses to build persistent memory from your main conversation. It reads every turn pair and decides what to save. A naive implementation would fire one request per turn. That is the workload vLLM's continuous batching is least designed for, because background QPS would saturate user-interactive slots. Fazm buffers ten turn pairs, flushes once, and keeps the observer footprint sub-linear.
One constant. One buffer. One flush function. The batch size picks the operating point on a simple tradeoff: smaller batches mean faster memory-write latency; larger batches mean lower QPS and a higher prefix-cache hit rate. Ten is the chosen operating point, and it is what lets vLLM's continuous-batching slots stay free for the two user-interactive sessions.
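The tradeoff is back-of-envelope arithmetic. The sketch below is illustrative, not Fazm's code; `batchSize` is in turn pairs, matching the `CHAT_OBSERVER_BATCH_SIZE = 10` semantics described above.

```typescript
// Back-of-envelope: observer requests generated per run of main-session turn pairs.
// batchSize is in turn *pairs*, matching CHAT_OBSERVER_BATCH_SIZE = 10.
function observerRequests(turnPairs: number, batchSize: number): number {
  return Math.floor(turnPairs / batchSize);
}

// Per-turn observer (batchSize 1): 100 turn pairs produce 100 background requests.
// Batched observer (batchSize 10): the same 100 turn pairs produce 10, a 10x QPS cut.
```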
Two clients on the same vLLM 2026 deployment
Same v0.19 server, same model, same KV offloading, same async scheduler. Only the client-side behavior changes. The server features compound on the right and collapse on the left.
Cold-start client vs Fazm-shaped client on vLLM v0.19
Each app launch creates a fresh sessionId via session/new. vLLM has no cache lineage to reuse; the system prompt prefix is new every launch. The observer fires one request per turn, so background QPS scales 1:1 with main QPS, eating continuous-batching slots. KV offloading cannot help because there is no persistent identity to tie offloaded tensors to. The async scheduler stays idle because the engine side is bound on tokenizing fresh prefixes.
- Fresh sessionId on every launch = prefix cache miss
- Observer QPS = main QPS, saturates batching slots
- No persistent identity = KV offload cannot target
- Async scheduler gains erased by cold prefix tokenization
The seven-step lifecycle from app launch to vLLM cache hit
Every step is a specific file and line. The compounding happens because step 1 restarts clean and step 7 restarts byte-identical to the previous step 6.
1. App launch reads saved sessionIds from UserDefaults
Desktop/Sources/Providers/ChatProvider.swift lines 1040-1041. savedFloatingSessionId = UserDefaults.standard.string(forKey: floatingSessionIdKey) and savedMainSessionId = UserDefaults.standard.string(forKey: mainSessionIdKey). These are the persistent anchors that map app launches to server-side KV-cache lineages.
2. warmupSession is called with three role-scoped configs
Lines 1047-1051 in the same file. Three .init(key:model:systemPrompt:resume:) entries: main, floating, observer. main and floating get resume IDs if they exist; observer always starts fresh (it is a background analyzer, not a conversation).
3. The bridge fans out into three parallel resume/new calls
acp-bridge/src/index.ts line 1320, await Promise.all(toWarm.map(async (cfg) => { ... })). Inside the map: line 1331 branches on cfg.resume. If resume is present, line 1333 calls session/resume. Otherwise line 1345 or 1354 calls session/new. Either way, line 1366 binds the per-session model via session/set_model.
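The branch shape described in step 3 can be sketched as follows. This is a hedged reconstruction, not the bridge's actual code: `acpRequest` here is a stub standing in for the real ACP round-trip, and the config shape only mirrors the description above.

```typescript
// Hedged sketch of the three-way pre-warm fan-out. `acpRequest` is a stub
// standing in for the real ACP bridge call; only the branch shape is the point.
type WarmCfg = { key: string; model: string; resume?: string };

async function acpRequest(method: string, params: object): Promise<{ sessionId: string }> {
  // stub: a real bridge would round-trip to the ACP subprocess here
  return { sessionId: (params as { sessionId?: string }).sessionId ?? `new-${method}` };
}

async function warmup(toWarm: WarmCfg[]): Promise<Map<string, string>> {
  const sessions = new Map<string, string>();
  await Promise.all(toWarm.map(async (cfg) => {
    // resume when a persisted ID exists, otherwise fall back to session/new
    const { sessionId } = cfg.resume
      ? await acpRequest("session/resume", { sessionId: cfg.resume })
      : await acpRequest("session/new", {});
    // rebind the per-role model; the SDK would otherwise reset it to the default
    await acpRequest("session/set_model", { sessionId, model: cfg.model });
    sessions.set(cfg.key, sessionId);
  }));
  return sessions;
}
```

All three sessions warm in parallel, and the model rebind happens on both branches, which is why resumed sessions keep their role-scoped model instead of snapping back to the default.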
4. vLLM's prefix cache sees byte-identical prefixes
After resume the client sends the same system prompt prefix and the same prior-turn tool trace that it sent before the restart. Automatic prefix caching hits immediately; there is no cold-start cache miss on the system prompt, and no per-turn prefix churn from a new session ID.
5. User types in main; floating is standby; observer buffers
Every main-session turn triggers bufferChatObserverTurn (acp-bridge/src/index.ts line 1158) on both role/text sides, so each round adds two entries to chatObserverBuffer. The CHAT_OBSERVER_BATCH_SIZE = 10 constant at line 1156 gates the flush.
6. Ten turn-pairs later, observer flush fires one batched request
flushChatObserverBatch at line 1170 splices the entire buffer (line 1183), joins with join('\n\n') at line 1184, wraps in a fixed meta-prompt, and sends exactly one observer request for the whole batch. QPS for the observer stays at ~1 per 10 main turns, continuous-batching-friendly.
7. On next app launch, steps 1-6 repeat with byte-identical prefixes
Because sessionIds persist in UserDefaults and vLLM's prefix cache persists in HBM (or CPU RAM under KV offloading), the second launch compounds on the first. Prefix-cache hit rate stays high. KV offload hit rate stays high. The async scheduler is never bottlenecked on input-side work.
One app launch, eleven messages, one cache-hit decision
Three actors on the client, two on the server. The key message is the fifth: when session/resume arrives with a known sessionId, the prefix-cache lookup is a hit and the KV-offload reload kicks in. When it misses, the whole lifecycle degrades to cold-start.
App launch → Promise.all → session/resume → vLLM prefix cache hit
What the code would have looked like without persistence
Fazm's resume path is not the easier code path. It is an if-branch that could have been skipped. Here is the skip-it version on the left, and the ship-it version on the right.
Cold-start every launch vs resume + set_model
// Warmup: always create a fresh session. Simple.
await Promise.all(toWarm.map(async (cfg) => {
const sessionParams = {
cwd: warmCwd,
mcpServers: buildMcpServers("act", warmCwd, cfg.key),
...buildMeta(cfg.systemPrompt, cfg.key),
};
const { sessionId } = await acpRequest(
"session/new",
sessionParams,
);
registerSession(cfg.key, { sessionId, cwd: warmCwd, model: cfg.model });
}));The 2026 release line rated by how much it compounds on the Fazm client shape
Eight release-line items, scored against whether a cold-start client can exploit them. The pattern is consistent: every one of them wants long-lived identity, stable prefixes, and predictable cadence. Fazm ships all three.
v0.18.0 --grpc, with resume
HTTP/2 multiplexing plus binary framing over one long-lived gRPC channel. Fazm's three resumed sessions ride the same channel, so the server sees three concurrent request streams with stable prefixes rather than three fresh TCP connections.
v0.18.0 KV offload
Idle session KV tensors move to CPU RAM or disk. Only pays off when the server can identify idle sessions. Fazm's persistent sessionIds (UserDefaults at ChatProvider.swift line 1040-1041) make that identification trivial.
v0.18.0 GPU NGram spec decode
Draft next tokens from an on-GPU ngram of prior context. Works best when the prior context is stable across turns. Fazm's stable role-scoped prompts keep that context stable.
v0.19.0 async scheduler default
Overlaps engine scheduling with GPU execution. Only wins when input-side work is cheap. A 10-turn-pair observer batch is one request worth of input work for ten turns of observation. Observer batching (line 1156) keeps the scheduler winning.
v0.19.0 Gemma 4
Four variants (E2B, E4B, 26B MoE, 31B Dense) under one Apache 2.0 license. Three Fazm sessions with session/set_model per role can split workload across variants without tearing down the pre-warm.
v0.19.0 zero-bubble spec decode
Speculative decoding composed with the async scheduler and no pipeline bubbles. Observer batch flushes are long, predictable, and low-priority, which is exactly the shape spec decode was built for.
Prefix caching (core)
Automatic KV cache reuse for common prefixes. Fazm's three role-scoped stable system prompts (ChatProvider.swift lines 1047-1051) give vLLM three cache lineages to hit, not N ephemeral prompts to evict.
CVE-2026-0994 (Completions API)
Critical vulnerability affecting 0.10.2+. Patched in the April cycle. Orthogonal to the Fazm mechanisms; run the patched release regardless of client shape.
Verify the three anchors with three grep commands
No install, no build, no clone. If you have access to the Fazm repository, three rg commands (listed in the FAQ below) close the loop on every claim in this guide.
What a vLLM-2026-friendly client actually has to do
Eight preconditions. Release notes focus on server features; these are the client invariants that let those features compound. Every item is checked against code in this repository.
Client-side preconditions for vLLM 2026 to land at spec
- Persist sessionIds across app restarts (ChatProvider.swift 1040-1041)
- Call session/resume before session/new on every launch (index.ts 1333)
- Rebind per-role models right after resume (index.ts 1341)
- Define a small fixed set of stable system prompts (ChatProvider.swift 1047-1051)
- Scope each stable prompt to a session key (main / floating / observer)
- Batch background sessions rather than firing per-turn requests (index.ts 1156)
- Splice the buffer atomically on flush so overlap is impossible (index.ts 1183)
- Log the resume vs new outcome per session for observability (index.ts 1339)
Where the SERP stops and where this guide starts
| Feature | Top vllm release 2026 SERP | This guide (client-side) |
|---|---|---|
| What drives the page angle | Server-side version-to-version deltas (gRPC, Gemma 4, async scheduler) | Client-side persistence and cadence (session/resume, stable prompts, batch size 10) |
| v0.18.0 KV cache offloading framing | Headline feature with memory-pressure graphs | Only useful with persistent sessionIds; anchored at ChatProvider.swift 1040-1041 |
| Prefix caching framing | Generic explanation of KV reuse | Three role-scoped stable prompts at ChatProvider.swift 1047-1051 give the cache three lineages |
| Continuous batching framing | PagedAttention and multi-request slots | CHAT_OBSERVER_BATCH_SIZE = 10 at line 1156 keeps observer QPS low enough to leave slots free |
| v0.19.0 async scheduler framing | Overlaps engine scheduling with GPU execution | Only wins on cheap input. Observer batches collapse 10 turns into one input pass |
| Session lifecycle framing | Implicit; server handles it | Explicit: session/resume at line 1333, session/set_model at line 1341, register at line 1365 |
| App-restart behavior | Not discussed | sessionIds persist in UserDefaults; resume on next launch yields byte-identical prefixes |
| Observability | vLLM Prometheus metrics | logErr at line 1339 prints Pre-warm resumed session: <id> (key=..., model=...) per session |
| Where to verify | Release notes, GitHub tags, PyPI | rg -n in acp-bridge/src/index.ts and Desktop/Sources/Providers/ChatProvider.swift |
None of the above is a vLLM critique. The server releases are excellent. The SERP just assumes a client exists that produces the right workload, and for most 2026 inference stacks, it does not.
Want the Fazm team to map this to your vLLM deployment?
Bring your v0.18 or v0.19 server and your target workload. We'll walk the client-side primitives and what you'd change to compound them.
Book a call →
Frequently asked questions
What does vllm release 2026 refer to, specifically?
The 2026 release arc is anchored by two minor releases. v0.18.0 shipped late March 2026 (the v0.18.1 patch followed March 31) and introduced native gRPC serving behind --grpc, GPU NGram speculative decoding, and KV cache offloading to CPU or disk. v0.19.0 shipped April 2, 2026 (the v0.19.1 patch on April 18) with async scheduling flipped on by default (overlapping engine scheduling with GPU execution), complete Gemma 4 architecture support across the E2B, E4B, 26B MoE, and 31B Dense variants, speculative decoding composed with the async scheduler for zero-bubble overlap, and a patch for CVE-2026-0994, which affects the Completions API endpoint on 0.10.2 and later.
What does every vllm release 2026 SERP page miss that this guide fills?
Client-side persistence and cadence. The server features in v0.18 and v0.19 (prefix caching, KV offload, continuous batching, async scheduler) all optimize for a workload shape the client must produce: stable long prefixes, low per-request QPS, and many concurrent long-lived sessions. A client that reopens with a fresh session on every launch, ships one-off prompts, and hits the server once per mouse click cannot hit the sweet spot those features were built for. Fazm ships three client-side mechanisms that produce exactly the shape: session/resume with persisted session IDs at acp-bridge/src/index.ts line 1333, three role-scoped stable system prompts at Desktop/Sources/Providers/ChatProvider.swift lines 1047-1051, and CHAT_OBSERVER_BATCH_SIZE = 10 at acp-bridge/src/index.ts line 1156.
Where exactly is the session/resume call in Fazm's code?
acp-bridge/src/index.ts line 1333, inside the Promise.all pre-warm loop. The call is await acpRequest('session/resume', { sessionId: cfg.resume, cwd: warmCwd, mcpServers: buildMcpServers('act', warmCwd, cfg.key) }). It fires only when cfg.resume is defined (line 1331, if (cfg.resume)). The resume IDs come from UserDefaults: savedMainSessionId at Desktop/Sources/Providers/ChatProvider.swift line 1041 (UserDefaults.standard.string(forKey: mainSessionIdKey)) and savedFloatingSessionId at line 1040. Line 1338 sets sessionId = cfg.resume and line 1341 calls session/set_model right after, because the ACP SDK would otherwise reset the model binding on resume.
Why does session/resume matter for vLLM specifically?
vLLM's automatic prefix caching (on by default in the 2026 release line) reuses the KV cache when a new request shares a prefix with a previous one. When Fazm calls session/resume with a persisted sessionId, the entire prior conversation is preserved: same system prompt prefix, same tool trace prefix, same role tokens. A cold-start client that creates a fresh session on every app launch breaks that assumption, because every launch produces a new conversation preamble. With resume, the prefix a session sends after app restart is byte-identical to the prefix it sent before app restart (up to the prior message), which is exactly the repeated-prefix shape prefix caching was designed to exploit.
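The "byte-identical up to the prior message" claim can be made concrete with a tiny helper. Everything here is hypothetical illustration: the prompts are toy strings, not Fazm's actual prompts.

```typescript
// Hypothetical helper: measure the shared byte prefix between the prompt a
// session sent before an app restart and the one it sends after resume.
function commonPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const systemPrompt = "You are the main session.\n";
const history = "user: hello\nassistant: hi\n";
const beforeRestart = systemPrompt + history + "user: next";
const afterResume = systemPrompt + history + "user: another";

// everything up to the new user turn is shared, i.e. a prefix-cache hit
const shared = commonPrefixLength(beforeRestart, afterResume);
```

With a fresh sessionId and a rebuilt preamble, `shared` collapses toward zero and every launch is a cold miss; with resume, only the newest turn falls outside the cached prefix.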
What is CHAT_OBSERVER_BATCH_SIZE = 10 and why does it matter for vLLM?
It is a constant declared at acp-bridge/src/index.ts line 1156 with the comment Send batch every N turn pairs. The bufferChatObserverTurn function at line 1158 pushes each role/text turn into chatObserverBuffer (line 1159). The condition at line 1162 fires flushChatObserverBatch when Math.floor(chatObserverBuffer.length / 2) >= 10. The flush at line 1183 splices the entire buffer and line 1184 joins it with join('\n\n') before sending one observer prompt that covers ten turn pairs. For vLLM's continuous batching, this matters because the observer is a low-priority background session; batching 10 turn pairs into one request reduces the observer's QPS by 10x, keeps its prefix stable (same chatObserverSystemPrompt at line 1050), and leaves the continuous-batching slot free for the main and floating sessions that are user-interactive.
How does a 2026-era vLLM deployment actually see the three Fazm sessions?
As three long-lived clients with very stable prefixes. main and floating are driven by user keystrokes; observer is driven by the 10-turn-pair batch flush. Every Fazm session holds its sessionId across app restarts (via UserDefaults at ChatProvider.swift lines 1040-1041, then replayed via session/resume at index.ts line 1333). vLLM's prefix-cache hit rate on this workload is high because (a) the system-prompt prefix is re-sent every turn byte-identical, (b) the prior-turn trace is byte-identical after resume, (c) the observer batches collapse ten distinct turns into one prompt that still starts with the same chatObserverSystemPrompt prefix. None of the 2026 vLLM blog posts or release notes describe what it takes to produce this shape; they describe the server's reaction to it.
What does KV cache offloading in v0.18 do for a persistent-session client like Fazm?
KV cache offloading moves inactive KV entries out of GPU HBM into CPU RAM or disk. For a single-shot REST client, offloading is mostly a memory-pressure knob. For a Fazm-shaped client, it is structural: the observer session sits idle for 10 turn pairs at a time, the floating session sits idle while the user focuses on main, and main sits idle while the user reads the previous response. A 2026 vLLM server with KV offloading can evict those idle sessions' KV tensors to CPU RAM, then reload them when the session sends its next request (identified by sessionId). Without persistent sessionIds on the client, the server has to guess which sessions are reusable. Fazm's session/resume mechanism makes that guess trivial.
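Why persistent identity makes offload targeting trivial can be sketched as server-side bookkeeping. This models the idea only; it is emphatically not vLLM's actual implementation, and the types and thresholds are invented for illustration.

```typescript
// Hedged sketch of idle-session identification for KV offload. This models
// the *server's* bookkeeping in the abstract; it is not vLLM's actual code.
type KvSlot = { sessionId: string; lastActiveMs: number; tier: "gpu" | "cpu" };

function offloadIdle(slots: KvSlot[], nowMs: number, idleAfterMs: number): KvSlot[] {
  return slots.map((s) =>
    s.tier === "gpu" && nowMs - s.lastActiveMs > idleAfterMs
      ? { ...s, tier: "cpu" } // evict idle KV tensors to CPU RAM, keyed by sessionId
      : s
  );
}
```

The sessionId is the eviction key: without a persistent identity on the client, there is nothing stable to reload against when the session comes back.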
What is Gemma 4 support in v0.19 and how does it compose with Fazm's pre-warm?
v0.19 ships full Google Gemma 4 support across four variants: E2B (effective 2B), E4B (effective 4B), 26B MoE, and 31B Dense. All variants support multimodal inputs, reasoning traces, and native tool-use. Fazm's three pre-warmed sessions at ChatProvider.swift lines 1047-1051 each bind their own model at index.ts line 1366 via session/set_model. A 2026 Fazm deployment could assign main to 31B Dense (best tool-use), floating to E4B (fastest for overlay snippets), and observer to E2B (cheapest for background batches), all served by one vLLM instance. The async scheduler in v0.19 then overlaps their scheduling into one GPU pass, and the prefix cache is partitioned by sessionId.
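The per-role split described above can be sketched as a role-to-model map replayed through session/set_model. The model identifiers below are illustrative placeholders, and `setModel` is a stub for the real round-trip.

```typescript
// Hypothetical per-role model split across the Gemma 4 variants named in the
// text; the identifiers are placeholders and `setModel` stubs session/set_model.
const roleModels: Record<string, string> = {
  main: "gemma-4-31b-dense", // best tool-use for the user-facing session
  floating: "gemma-4-e4b", // fastest for overlay snippets
  observer: "gemma-4-e2b", // cheapest for background batches
};

const bound: Array<[string, string]> = [];
function setModel(sessionId: string, model: string): void {
  bound.push([sessionId, model]); // stub: would issue session/set_model here
}

for (const [role, model] of Object.entries(roleModels)) {
  setModel(`${role}-session`, model);
}
```

One vLLM instance serves all three bindings; the point is that the split happens per session, not per deployment.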
Is CVE-2026-0994 a problem for a Fazm-on-vLLM deployment?
CVE-2026-0994 is a vulnerability in vLLM's Completions API endpoint affecting 0.10.2 and later. The patch ships in the April 2026 release cycle. For a desktop-agent deployment where vLLM is bound to localhost only, the blast radius is smaller than for a public-internet deployment, but the right answer is always to run v0.19.1 or the backported patch. Fazm's client-side mechanisms (session/resume, three stable prompts, observer batching) are orthogonal to this CVE; they protect the client-side context budget and server-side cache-hit rate, not the server's request-parsing surface. Run both.
How do I verify the three client-side anchors myself, without installing Fazm?
Three greps close the loop. rg -n 'session/resume' acp-bridge/src/index.ts locates the resume call at line 1333. rg -n 'CHAT_OBSERVER_BATCH_SIZE' acp-bridge/src/index.ts locates the batching constant at line 1156 and its use at line 1162. rg -n 'warmupSession' Desktop/Sources/Providers/ChatProvider.swift locates the three-role warmup at line 1047, and the three .init(key:...) entries span lines 1048-1050. Two files, zero install.
Does Fazm already run against vLLM in production, or is this a compatibility claim?
Fazm's default transport is Anthropic-protocol. The workload shape described in this guide (three concurrent sessions, stable system prompts, persistent sessionIds, observer batching) is the same shape regardless of backend. Pointing Fazm at a local vLLM deployment is an integration exercise; vLLM already exposes an OpenAI-compatible /v1/chat/completions endpoint (and in v0.18 adds gRPC alongside). An Anthropic-to-OpenAI protocol shim plus setting ANTHROPIC_BASE_URL in the ACP subprocess environment is enough. What is permanent is the client-side shape: that code is in git today and it produces a vLLM-friendly workload whether the backend is Claude, a local vLLM, or both.
What is the single biggest gap between the vllm release 2026 SERP and reality?
The SERP treats the server as if it runs in a vacuum. It doesn't. Prefix caching needs prefixes that repeat across requests. Continuous batching needs multiple concurrent long-lived clients. KV offloading needs persistent sessionIds so the server knows which KV slots are reusable. The async scheduler needs per-turn input small enough that engine-side work is not the bottleneck. Any one client failure breaks the whole compounding stack. Fazm's three mechanisms (resume, three prompts, batch size 10) are not features; they are preconditions. v0.18 and v0.19 land at spec only on workloads that satisfy them.
Companion guides that trace other parts of the turn-shape stack to specific file and line anchors in the Fazm codebase.
Keep reading on the vLLM 2026 client side
vLLM latest version, April 2026: three sessions, one PagedAttention pass
The companion to this guide: what the per-turn input shape has to look like (Promise.all pre-warm, text-only filter) for v0.18 and v0.19 to land at spec.
Local LLMs news, April 2026
The same two-branch image-stripping filter applied to Qwen 3, Gemma 4, Mistral Medium 3 and the rest of the April 2026 open-weights lineup.
New LLM model release, April 2026
How session/set_model lets a 17-turn floating-bar investigation upgrade to Opus 4.7 mid-conversation with every prior message preserved.