vLLM · April 2026 · Client-side rewire · Mac

vLLM Update 2026: the client-side dance nobody writes about

v0.19.0 landed on April 3, 2026. Gemma 4 Day 0 on three accelerator families, Model Runner V2 matured, zero-bubble async scheduling. Every roundup covers the server side. This page covers the other half: what your Mac desktop agent has to do locally when you bump the backend. Concretely, a 7-line Swift function called restartBridgeForEndpointChange at ChatProvider.swift lines 2101 to 2107, the single .onSubmit that triggers it, and the ANTHROPIC_BASE_URL injection at ACPBridge.swift line 381 that hot-swaps the endpoint without quitting the app.

Fazm · 10 min read
  • All vLLM release dates pulled from the project's own blog and GitHub releases
  • Fazm source references map to real files and line numbers you can verify
  • One setting (Custom API Endpoint) routes every tool-using chat to your vLLM

The numbers behind the April update

448 commits in v0.19.0, from 197 contributors
3 accelerator families with Day 0 Gemma 4 support
7 lines in Fazm's restartBridgeForEndpointChange
2101 is the line in ChatProvider.swift where the function starts

The first two numbers are from vLLM's April 3, 2026 release announcement. The last two you can verify yourself by opening Desktop/Sources/Providers/ChatProvider.swift in the Fazm repo.

The seven lines of Swift that absorb every vLLM update

Every vLLM shipment in 2026 (Iris in January, v0.18 in March, v0.19 in April) changes something on the server. Fazm's client side does not change at all. Here is why: the entire rewire logic lives in a function called restartBridgeForEndpointChange. Seven lines, one UserDefaults read, one bridge stop, one flag flip. Read it and then we'll walk through what actually happens on the next chat.

Desktop/Sources/Providers/ChatProvider.swift, lines 2101-2107
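In sketch form, assuming only the names this article cites (restartBridgeForEndpointChange, the customApiEndpoint UserDefaults key, acpBridge.stop(), and the acpBridgeStarted flag), the function reads roughly like this. Everything else, including the log line, is an assumption, not a verbatim copy of the file:

```swift
// Sketch based on this article's description, not verbatim source.
func restartBridgeForEndpointChange() {
    // Re-read the endpoint the user just submitted in Settings.
    let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint")
    // Stop the running ACP subprocess; do not relaunch it yet.
    acpBridge.stop()
    // Flag the bridge as down so the next sendMessage re-spawns it
    // with a freshly built environment.
    acpBridgeStarted = false
    print("ACP bridge stopped for endpoint change: \(endpoint ?? "default")")
}
```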

The pairing on the Settings side is three lines. A TextField, an onSubmit, a task. When the user types a new URL and hits Enter, this fires.

Desktop/Sources/MainWindow/Pages/SettingsPage.swift, lines 933-938
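A hedged SwiftUI sketch of that trigger. The article names the TextField, the .onSubmit, and the task; the binding name and the chatProvider reference are assumptions for illustration:

```swift
// Sketch of the Settings-side trigger; binding and provider names
// are assumptions, the .onSubmit wiring is what the article describes.
TextField("Custom API Endpoint", text: $customApiEndpoint)
    .onSubmit {
        Task { await chatProvider.restartBridgeForEndpointChange() }
    }
```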

And the actual env injection, on the next bridge spawn, is another three lines of Swift in ACPBridge. This is where the v0.19.0 endpoint you typed in Settings becomes the base URL every tool-using chat gets routed through.

Desktop/Sources/Chat/ACPBridge.swift, lines 378-381
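A sketch of that injection, assuming a Foundation Process is being configured for the spawn. The ANTHROPIC_BASE_URL env key and the customApiEndpoint UserDefaults key are the ones this article cites; the surrounding code is an assumption:

```swift
// Sketch of the env injection before the ACP subprocess is spawned.
var env = ProcessInfo.processInfo.environment
if let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint"),
   !endpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = endpoint  // every tool-using chat routes here
}
process.environment = env  // applied on the next spawn, not the live process
```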

Hot-swap vs app restart, for someone who upgrades vLLM twice a month

Most desktop agents that let you change the model endpoint require an app relaunch. Fazm's bridge-stop pattern sets a flag and lets the next sendMessage call re-spawn the subprocess with fresh env. The operational difference is small but it compounds across a year of shipments.

Same config change, two workflows

You change the endpoint in Settings. A restart-only agent makes you quit and reopen. Every active chat is killed. Pre-warmed Opus and Sonnet session pools burn. If any browser-extension login is mid-flight, it restarts. You lose your place. Multiply by every vLLM release in 2026 so far (Iris, v0.18, v0.18.1, v0.19) and you are restarting the app roughly once per month just to track the backend.

  • Quit and reopen the app
  • All open chats killed
  • Session pool re-warmup from cold
  • Any interactive flow in progress restarts

With Fazm's hot-swap, the same change is one field submit:

  • Re-submit the endpoint in Settings (or toggle it off and on)
  • Bridge stops, acpBridgeStarted flips to false
  • The next sendMessage re-spawns the subprocess with fresh env
  • Session history and pre-warmed pools survive

The actual v0.18.1 to v0.19.0 upgrade sequence, Mac-side

Everything that has to happen between “pip install -U vllm” on the server and “Fazm is talking to v0.19.0 end-to-end” on the Mac. Seven steps, none of them quit the app.


1. Pin transformers>=5.5.0

v0.19.0 requires transformers 5.5.0 or later for Gemma 4 weights. Pin this before you upgrade vllm itself, not after, or the first serve command will fail on import.


2. pip install -U vllm on the server

Stop the current process first. v0.18.1 to v0.19.0 is a wheel swap, not a config migration. Old launch scripts still work.


3. Relaunch vllm serve with the flags you actually want

--enable-zero-bubble-async composes with NGram speculative decoding from v0.18. --cpu-kv-cache-offload is new and has a pluggable eviction policy you can tune.


4. curl /v1/models to confirm the server is live

Before you touch the client, verify the backend came back. If you are routing through a proxy, curl through it too so you know the shim is awake.


5. Re-submit Custom API Endpoint in Fazm Settings

Settings > Advanced > Custom API Endpoint. Either hit Enter on the field or toggle off/on. The .onSubmit at SettingsPage.swift:936 fires restartBridgeForEndpointChange, which stops the ACP bridge and marks it for restart.


6. Start a new chat in Fazm to re-spawn the bridge

The next sendMessage calls ensureBridgeStarted, which relaunches the ACP subprocess with ANTHROPIC_BASE_URL freshly injected from UserDefaults. Session history is preserved.
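The lazy re-spawn behind this step can be sketched as a guard. The article names ensureBridgeStarted and acpBridgeStarted; the body below is an assumption about how they plausibly fit together, not the actual implementation:

```swift
// Sketch: sendMessage calls this before each message. After the flag
// flip in step 5, the guard falls through and the bridge re-spawns
// with the freshly injected environment.
func ensureBridgeStarted() async throws {
    guard !acpBridgeStarted else { return }
    try await acpBridge.start()   // re-reads UserDefaults, injects env
    acpBridgeStarted = true
}
```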


7. Run a tool-using prompt end-to-end

Something like 'summarize my foreground Terminal window.' That exercises accessibility-tree capture, the bridge, the proxy, and the v0.19.0 serve path in one round-trip. If that works, the upgrade is clean.

What you see on the server when v0.19.0 comes up

A sample of the v0.19.0 serve log so you know what a clean upgrade looks like. The “Model Runner V2 initialized” and “zero-bubble=on” lines are the tells that v0.19 is actually live, not just pip-installed.

vllm serve on v0.19.0 (first 12 seconds after launch)

v0.18.1 vs v0.19.0, as they affect a Mac-side operator

The server engineering effort between these two versions is enormous. For someone running a single vLLM instance behind a Mac desktop agent, the differences that actually show up in daily use are smaller, and they cluster around latency and config ergonomics.

Feature | vLLM v0.18.1 | vLLM v0.19.0
Gemma 4 support | Not supported | Day 0, GPU + TPU + XPU
Zero-bubble async + speculative decoding | Incompatible, must pick one | Compatible
/v1/chat/completions/batch endpoint | No | Yes
CPU KV cache offloading | Fixed policy | Pluggable eviction
NVIDIA B300 / GB300 hardware | No | Yes
Model Runner V2 maturity | Opt-in experimental | Default path
transformers version required | >= 5.3.0 | >= 5.5.0
gRPC serving (--grpc) | Yes | Yes

Three things the client-side rewire buys you

Not speed. Not cost. Those are properties of the vLLM server. The hot-reload buys you time, continuity, and the ability to treat vLLM's release cadence as background noise.

Upgrade ergonomics

Three vLLM releases in 90 days (Iris, v0.18, v0.19) is three endpoint toggles, not three app restarts. The difference is small per event and noticeable over a quarter.

Chat continuity

ACP session history survives the bridge stop. You finish the conversation you were on when you decided to upgrade. No lost context, no re-uploaded files, no mid-flow restart of an interactive login.

Backend decoupling

The client does not know or care that the backend shipped a new version. The contract is the ANTHROPIC_BASE_URL env variable. Fazm updates only when the Fazm team has something new to say, not when vLLM does.

Verify the claims in this page yourself

Everything above references real files and line numbers in the Fazm desktop app source tree. If you have the repo locally, here is the shortest way to confirm each anchor fact. If you do not, you can still read the function signatures in the Fazm open-source releases.

Verify the 7 lines

Try the client-side half on your own vLLM deployment

Fazm is a Mac desktop agent that reads accessibility trees (not screenshots) and ships with Claude Sonnet 4.6 by default. The Custom API Endpoint setting in Advanced lets you route every tool-using chat through your vLLM v0.19.0 server, via an Anthropic-shape shim of your choice. Hot-reload included, app restart optional.

Download Fazm

Frequently asked questions

What is the latest vLLM update in 2026?

v0.19.0, released April 3, 2026. It shipped with 448 commits from 197 contributors. Headline changes: Day 0 Gemma 4 support across NVIDIA GPUs, Google TPUs, and the Intel XPU backend; Model Runner V2 maturation with piecewise CUDA graphs for pipeline parallelism; zero-bubble async scheduling compatible with speculative decoding; CPU KV cache offloading with pluggable eviction policies; a new POST /v1/chat/completions/batch endpoint; and NVIDIA B300/GB300 hardware support. The previous release, v0.18.0 from late March 2026, introduced gRPC serving via the --grpc flag and GPU-accelerated NGram speculative decoding. Semantic Router v0.1 codename Iris shipped on January 5, 2026 as a separate repo.

Should I upgrade my vLLM server to v0.19.0 today?

If you serve Gemma 4, yes, Day 0 Gemma 4 support is the single biggest pull. If you run multimodal models and have not split preprocessing, land v0.18.0 first for the vllm launch render subcommand, then v0.19.0. If you skipped v0.18.1, the SM100 MLA prefill and DeepGEMM Qwen3.5 accuracy fixes are already in v0.19. One gotcha: v0.19 requires transformers>=5.5.0 for Gemma 4 weights. Pin that before the upgrade, not after.

What does a vLLM update mean for a Mac desktop AI agent like Fazm?

Essentially a handshake. Fazm ships with Claude Sonnet 4.6 by default, but the Custom API Endpoint setting lets you point every tool-using chat at a compatible server, including a vLLM deployment fronted by an Anthropic-shape proxy. When you upgrade the vLLM server, the Fazm bridge has to re-read its environment so it connects with fresh settings. Fazm does this with a 7-line function called restartBridgeForEndpointChange at ChatProvider.swift lines 2101 to 2107. It stops the ACP bridge and marks it for relaunch on the next sendMessage, so you get a soft reconnect rather than an app restart.

Where is the Fazm function that hot-swaps the vLLM endpoint?

Desktop/Sources/Providers/ChatProvider.swift, lines 2101 to 2107. It reads customApiEndpoint from UserDefaults, calls acpBridge.stop(), sets acpBridgeStarted = false, and logs the event. The trigger is a TextField .onSubmit binding in Desktop/Sources/MainWindow/Pages/SettingsPage.swift, lines 936 to 938. The env variable injection happens in Desktop/Sources/Chat/ACPBridge.swift, lines 378 to 381, where env[ANTHROPIC_BASE_URL] is set from the same UserDefaults key before the ACP subprocess is spawned. Three files, about twelve lines, end to end.

Why hot-reload the bridge instead of restarting the Fazm app?

Conversation history and session state. A full app restart kills every open chat, forces the user back through any interactive login, and re-warms the Claude Sonnet 4.6 session pool (which is pre-warmed in parallel with Opus at startup). The soft reconnect pattern at ChatProvider.swift lines 2101 to 2107 sets acpBridgeStarted = false and lets ensureBridgeStarted in the next sendMessage call do a full warmup with session resume. You keep your threads, you change the backend. For someone upgrading their vLLM server twice a month, the difference between a restart and a reconnect is the difference between losing their place and not.

What is the operational sequence for running the v0.18.1 to v0.19.0 upgrade from a Mac?

Seven steps. 1) Pin transformers>=5.5.0 on the server. 2) Stop the vLLM process, pip install -U vllm. 3) Relaunch with vllm serve and any new flags you want from v0.19 (zero-bubble async, CPU KV offload). 4) Confirm the server came back with curl /v1/models. 5) In Fazm on your Mac, open Settings > Advanced and either re-submit the same Custom API Endpoint value or toggle it off and on. That fires the .onSubmit at SettingsPage.swift:936. 6) Start a new chat. The ACP bridge re-spawns with ANTHROPIC_BASE_URL injected from the fresh UserDefaults read. 7) Run a simple tool-calling prompt to verify end-to-end, then upgrade the rest of your fleet.

Is there a built-in vLLM integration in Fazm?

No, and that is the point. Fazm is model-agnostic at the bridge layer. There is no vllm.swift, no v0.19 adapter, no special handling. The entire integration surface is the customApiEndpoint UserDefaults key plus the ANTHROPIC_BASE_URL injection at ACPBridge.swift:381. That gives you a stable contract the Fazm team does not touch when vLLM ships a new version, which in 2026 is often: three releases in roughly 90 days (Iris, v0.18, v0.19). If the contract were tighter, every vLLM release would be a Fazm release too.

What is Semantic Router Iris and does it affect my desktop agent?

Iris is vLLM's new routing fabric: it sits in front of one or more model backends and picks which one serves a given request, based on shape, cost, or intent. It shipped January 5, 2026 as v0.1, the first major release after a September 2025 experimental launch with 600+ merged PRs. From a Mac desktop agent's point of view, Iris is transparent. You point Fazm's Custom API Endpoint at your Iris URL (fronted by an Anthropic-shape shim) and Iris picks the backend per request. That means you can swap which model answers a given Fazm chat without touching Fazm at all.

Does Fazm send screenshots to my vLLM server?

No. Fazm reads macOS accessibility trees directly via AXUIElementCreateApplication. The capture entry point is in Desktop/Sources/AppState.swift around line 439. The tree arrives as structured text lines with role, content, and coordinates, not pixels. That is the payload sent to whichever model the bridge is routed at, whether that is Claude Sonnet 4.6 on api.anthropic.com or your v0.19.0 deployment at your-proxy:8766. This matters for the upgrade story because a text-only vLLM server (no vision encoder) can handle Fazm chats without dropping context, where a screenshot-based agent would need a multimodal serve path.

The quiet half of the 2026 vLLM story

Every vLLM update in 2026 has a loud server-side headline. v0.19.0 added Day 0 Gemma 4 on TPUs. v0.18 split multimodal preprocessing. Iris moved routing into the serving layer. Those changes carry the narrative because they are the work.

The quiet half is that if you pair vLLM with a Mac desktop agent, you want a client that treats each new release as a no-op on its side. Seven lines of Swift do that in Fazm. One UserDefaults read, one bridge stop, one flag, and the next message you send routes through the new backend with session history intact. That is the client-side dance. Now the only thing left to upgrade is the server.

fazm.AI · Computer Agent for macOS
© 2026 fazm. All rights reserved.
