vLLM updates land weekly. Your Mac agent does not need to notice.
Every release notes page summarizes the server wins. None finish the sentence on the client. If you run vLLM behind a desktop agent, you actually want to know one thing: when a new vLLM ships and you upgrade the server, what do you have to rebuild on your Mac? This guide answers that with a file path and a line number. The answer is nothing, and the six lines of Swift that make the answer "nothing" live at ChatProvider.swift:2101.
What the 14-day update window actually looked like
This is the part the SERP covers well, so I am going to keep it short. Three things landed. The server-ops press treats them as the whole story. The interesting question starts on the other side.
v0.18.0
Native gRPC serving behind --grpc. NGram speculative decoding moved from CPU to GPU. FlexKV arrived as a new KV cache offload backend. Relevant if you are running vLLM as shared team infrastructure, less so if it is one box serving one laptop.
v0.19.0
Day-one Gemma 4 support across E2B, E4B, 26B MoE, and 31B Dense. Async scheduler on by default, which overlaps engine scheduling with GPU execution and drops time-to-first-token. Model Runner V2. This is the release most agent workloads actually feel.
v0.19.1rc0
Release candidate dropped the day after v0.19.0. Stability candidate rather than new surface. If you are running agents in April 2026, pin this or the final v0.19.1, not v0.19.0.
the uncopyable part
The six lines that make a vLLM upgrade a non-event
Fazm's Custom API Endpoint setting shipped in v2.2.0 on April 11, 2026. The implementation is intentionally small. When you change the endpoint in Settings (or flip the toggle off), the app calls one function on ChatProvider: restartBridgeForEndpointChange(), six lines at ChatProvider.swift:2101-2106.
The trick is what this function does not do. It does not cancel the app. It does not tear down the UI. It does not invalidate the chat history. It stops the ACP bridge process and flips a boolean. The bridge respawns lazily on the next outgoing query, and when it respawns the child process reads the current value of customApiEndpoint from UserDefaults and exports it as ANTHROPIC_BASE_URL on the Node subprocess. That is the only contract between the Mac app and whatever is serving inference. Change vLLM, keep the URL, nothing on this side moves.
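The bridge itself is a Node process, so the contract is easy to sketch in TypeScript. Everything below is illustrative, not Fazm source: `BridgeManager`, `readEndpoint`, and the synchronous stand-in for awaiting the bridge's stop() are hypothetical names for the stop-and-respawn pattern described above.

```typescript
// Illustrative sketch of the stop-and-respawn contract (hypothetical
// names, not Fazm's actual Swift source).
type Bridge = { env: Record<string, string> };

class BridgeManager {
  private bridge: Bridge | null = null;
  private started = false;

  constructor(private readEndpoint: () => string | null) {}

  // Mirrors restartBridgeForEndpointChange(): stop the child, flip the
  // flag, return. No UI teardown, no chat-history invalidation.
  restartForEndpointChange(): void {
    if (!this.started) return; // no-op if the bridge never started
    this.bridge = null;        // stand-in for awaiting the bridge's stop()
    this.started = false;      // next outgoing query respawns lazily
  }

  // Called on the next outgoing query. The endpoint is read fresh at
  // spawn time and exported into the child's environment.
  ensureBridge(): Bridge {
    if (!this.started || this.bridge === null) {
      const endpoint = this.readEndpoint();
      this.bridge = { env: endpoint ? { ANTHROPIC_BASE_URL: endpoint } : {} };
      this.started = true;
    }
    return this.bridge;
  }
}
```

Swap vLLM behind the same URL and `ensureBridge()` hands the respawned child the same environment; change the URL in Settings and the next query picks it up. Either way, nothing else in the client moves.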
What actually happens during a vLLM upgrade
The upgrade flow, command by command
This is what an upgrade from v0.18.0 to v0.19.1rc0 looks like if you are self-hosting one vLLM process behind a Fazm chat. No magic. The client-side step is literally "keep typing."
Update the server package
Run pip install -U vllm inside the same venv you launched vllm serve from. Patch releases do not usually require a CUDA toolchain change in 2026.
Restart the server process
Stop the old vllm serve cleanly (SIGTERM, not SIGKILL; you want clean checkpoint unloads). Start the new one on the same host and same port.
Smoke-test the endpoint
Hit /v1/models with curl. You want a 200 and the expected model ID in the response. That is your entire server-side acceptance check.
Do nothing on the Mac
Seriously. Fazm keeps the bridge process alive until its next query, at which point it re-reads the endpoint from UserDefaults. Your chat history, window layout, and open conversation are unchanged.
Send the next message
Bridge respawns, shim translates Anthropic Messages to OpenAI, new vLLM answers. You feel a slightly lower time-to-first-token because v0.19.0 ships the async scheduler on by default.
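If you want the smoke test as a script rather than a one-off curl, the whole check reduces to one predicate over the /v1/models body. The response shape below is the stock OpenAI-compatible list shape; the model ID is illustrative.

```typescript
// A passing smoke test is: HTTP 200 from /v1/models, and a body whose
// data array lists the model you expect to be serving.
type ModelsResponse = { data: { id: string }[] };

function modelIsServed(body: ModelsResponse, expectedId: string): boolean {
  return body.data.some((m) => m.id === expectedId);
}

// e.g. feed it the output of `curl -s http://localhost:8000/v1/models`:
//   modelIsServed(JSON.parse(raw), "gemma-4-26b-moe")
```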
Where the fan-out pattern saves you
The URL is the whole contract. That is what lets the same Mac app point at a local vLLM for offline work, a team cluster for shared compute, or a managed vLLM-style provider for burst, and switch between them without code changes.
One client, many vLLM backends, one env var
Typical agent stack vs. the Fazm + vLLM pattern
What you touch on your client when the server ships a new vLLM version.
| Feature | Typical forked-client agent | Fazm + vLLM |
|---|---|---|
| Client rebuild after vLLM patch | Re-pin SDK, rebuild Electron bundle, ship update | None. URL is the contract, not the SDK surface |
| Config changes in the UI | Often model IDs, sometimes base URLs | Zero. customApiEndpoint in UserDefaults is unchanged |
| In-flight chat session | Usually reset with the client update | Preserved. Bridge respawns lazily on next query |
| Model swap on the server | May require client model-registry update | Handled at the shim. Fazm sends model name, shim maps it |
| Switch from local vLLM to managed provider | Rebuild or re-ship with new SDK integration | Change one string in Settings. Bridge respawns |
| Rollback to previous vLLM version | Revert package + redeploy client + hope | Reinstall old vllm, restart server, keep typing |
The shim in the middle is what lets the Anthropic Messages surface stay fixed while the OpenAI-compatible backend underneath moves.
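As a toy sketch of that translation (field names follow the public Anthropic Messages and OpenAI Chat Completions request shapes; the model map is hypothetical, and a real shim also handles system prompts, tool blocks, and streaming):

```typescript
// Toy Anthropic-Messages-to-OpenAI translation of the kind the shim
// performs. The model map is illustrative.
type AnthropicMsg = { role: "user" | "assistant"; content: string };
type AnthropicReq = { model: string; max_tokens: number; messages: AnthropicMsg[] };
type OpenAIReq = {
  model: string;
  max_tokens: number;
  messages: { role: string; content: string }[];
};

const MODEL_MAP: Record<string, string> = {
  // hypothetical mapping: Anthropic-style name -> checkpoint vLLM serves
  "claude-sonnet-4-6": "gemma-4-26b-moe",
};

function toOpenAI(req: AnthropicReq): OpenAIReq {
  return {
    model: MODEL_MAP[req.model] ?? req.model,
    max_tokens: req.max_tokens,
    messages: req.messages.map((m) => ({ role: m.role, content: m.content })),
  };
}
```

The client keeps speaking one fixed dialect; only this mapping layer knows which checkpoint is actually behind the URL.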
post-upgrade sanity checklist
- curl /v1/models returns 200 with the expected model ID
- Server log mentions async scheduler enabled (v0.19+ default)
- First response in Fazm streams normally, no connection errors
- Bridge log shows customApiEndpoint respected on respawn
- Tool calls round-trip cleanly (Anthropic tool_use → OpenAI function_call)
- Time-to-first-token is at least as fast as before the upgrade
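The tool-call item on that list is the one worth a sketch, because the two surfaces disagree on one detail: Anthropic's tool_use block carries `input` as an object, while OpenAI's tool_calls entry carries `function.arguments` as a JSON string. A minimal translation of that leg (shapes follow the public APIs; nothing here is Fazm or shim source):

```typescript
// One leg of the tool-call round trip: an Anthropic tool_use content
// block becomes an OpenAI tool_calls entry. The mismatch to get right:
// `input` is an object, `function.arguments` is a JSON *string*.
type ToolUse = { type: "tool_use"; id: string; name: string; input: object };
type OpenAIToolCall = {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
};

function toolUseToToolCall(block: ToolUse): OpenAIToolCall {
  return {
    id: block.id,
    type: "function",
    function: { name: block.name, arguments: JSON.stringify(block.input) },
  };
}
```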
what the user actually feels
Async scheduler is the one 2026 vLLM change that your Mac agent notices without being told.
For a conventional chat UI, a few milliseconds of time-to-first-token are invisible. For an agent that runs tool calls in a loop (read accessibility tree, plan action, emit tool call, act, read new tree, repeat), TTFT compounds. Every turn adds latency you feel. v0.19.0 turned async scheduling on by default. The effect on a Mac agent loop is snappier screen actions, particularly when the model is burning parallel reasoning tokens before it emits the next tool call.
None of that required a code change on the Fazm side. The app reads the bridge's streamed tokens the same way it did before. It just gets them sooner.
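The compounding is just arithmetic. With illustrative numbers (not benchmarks), per-task dead time is turns times TTFT plus the other per-turn overhead:

```typescript
// Why TTFT compounds in an agent loop. Numbers are illustrative.
function waitPerTaskMs(turns: number, ttftMs: number, otherMs: number): number {
  return turns * (ttftMs + otherMs);
}

// A 12-turn screen-automation task, 150 ms of non-TTFT overhead per turn:
const before = waitPerTaskMs(12, 300, 150); // 5400 ms
const after = waitPerTaskMs(12, 200, 150);  // 4200 ms
// Shaving 100 ms of TTFT per turn removes 1.2 s of dead time per task.
```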
“restartBridgeForEndpointChange() is six lines. That is the blast radius of a vLLM swap on the Fazm side.”
Desktop/Sources/Providers/ChatProvider.swift, lines 2101-2106
What is on the other end of that URL in 2026
Counting the cost of the 14-day release window on the client
Three vLLM releases, each one an upgrade on the server. For a typical forked-client agent stack, each would be a client rebuild and a user update to ship. For Fazm, each was a pip install and a restart on the server, and nothing shipped on the client.
Where this falls apart
The honest version of any story like this includes the ways the promise leaks. Three of them here.
- Mid-stream restarts cut the stream. If you bounce the vLLM process while the client is receiving tokens, the HTTP stream dies with it. The UI will see a disconnect. Bounce between user turns, not during.
- Breaking OpenAI-API changes still bite. vLLM is careful about surface compatibility, but if a future patch changes the shape of tool_calls on the OpenAI side, your shim needs to know. That is a shim update, not a Fazm update.
- Model registry drift on the shim. If you load a new Gemma 4 variant on the server but the shim still maps sonnet to the old checkpoint, you will keep getting the old model. Check the shim's model map after any non-trivial server upgrade.
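The third leak is cheap to catch mechanically: after any non-trivial upgrade, diff the shim's model map against what /v1/models actually reports. A sketch of that check (map and model IDs are illustrative):

```typescript
// Returns the shim mappings whose target checkpoint the server no
// longer serves. Run it against the live /v1/models list post-upgrade.
function staleMappings(
  modelMap: Record<string, string>,
  servedIds: string[],
): string[] {
  const served = new Set(servedIds);
  return Object.entries(modelMap)
    .filter(([, checkpoint]) => !served.has(checkpoint))
    .map(([alias, checkpoint]) => `${alias} -> ${checkpoint}`);
}
```

An empty return means every alias still lands on a live checkpoint; anything else is the drift described above.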
Self-hosting vLLM behind a Mac agent?
Talk to the Fazm team about wiring your vLLM endpoint into the Custom API Endpoint setting, including the Anthropic-to-OpenAI shim that sits in front of it.
Book a call →

Frequently asked questions
What vLLM releases actually shipped in 2026 so far, and how often do they land?
Three releases in a 14-day window. v0.18.0 landed in late March 2026 with gRPC serving behind the new --grpc flag, GPU-based NGram speculative decoding, and a new KV cache offload backend named FlexKV. v0.19.0 landed on April 2 with day-one Gemma 4 support across E2B, E4B, 26B MoE, and 31B Dense, plus async scheduler on by default and Model Runner V2. v0.19.1rc0 followed on April 3 as a release candidate. That is a tight cadence by any standard. If you run vLLM as the backend for a desktop agent, you will be upgrading often in 2026.
When I upgrade the vLLM server, what do I have to rebuild on the client side of my Fazm setup?
Nothing. The client side is a single string in UserDefaults under the key customApiEndpoint. Fazm's ACP bridge reads that value at spawn time inside Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382 and exports it as ANTHROPIC_BASE_URL on the Node subprocess. vLLM changes behind that URL, the bridge does not. When you run pip install -U vllm on the server and restart the server process on the same port, Fazm keeps working with no app update, no recompile, no Settings change.
Does in-flight chat state survive a vLLM upgrade?
Yes, with a caveat. The Fazm UI keeps the conversation history alive in the app. The bridge process gets torn down and respawned lazily on the next query, which is what restartBridgeForEndpointChange() at Desktop/Sources/Providers/ChatProvider.swift lines 2101 to 2106 does. The caveat is that any streaming token generation in progress at the moment you bounce the vLLM server dies with the old connection, as you would expect from any long-lived HTTP stream. Restart the vLLM server between user turns, not mid-response.
What is the six-line function this page keeps referencing?
restartBridgeForEndpointChange(), an async Swift method on ChatProvider. It guards on acpBridgeStarted so it is a no-op if the bridge never started. It reads customApiEndpoint from UserDefaults, writes a log line naming the new endpoint for debuggability, calls await acpBridge.stop(), flips acpBridgeStarted to false, and returns. That is it. The bridge then respawns on the next outgoing query, re-reading UserDefaults at spawn time. It is deliberately small. Most of the complexity of safe endpoint-swapping lives in the stop-and-respawn model, not in the function itself.
Do gRPC serving in v0.18.0 and the async scheduler in v0.19.0 change anything for a single-user Mac agent?
One helps, one does not. gRPC is server-to-server and matters when vLLM sits behind a shared gateway for a fleet. For a single laptop routing through a single vLLM process, vanilla OpenAI-compatible HTTP is fine and you will see no difference. The async scheduler matters. It overlaps engine scheduling with GPU execution, so time-to-first-token drops. On an agent that emits tool calls in a loop, lower TTFT compounds: the UI feels snappier because each tool-call round trip is shorter. Turning it on is now the default, so most users will get this by upgrading and doing nothing else.
What about the Completions API CVE, does the Fazm integration care?
Indirectly. CVE-2026-0994 was a deserialization issue in vLLM's Completions API prompt_embeds handling. Fazm does not call vLLM's Completions endpoint directly; it goes through an Anthropic-to-OpenAI shim that usually translates to /v1/chat/completions, which has a different surface. You should still upgrade the vLLM server to a patched v0.19.x build. Never treat 'I use a different endpoint than the vulnerable one' as a substitute for patching.
Gemma 4 shipped with vLLM v0.19.0 on day one. Is it a good model for a Mac agent backend?
The 26B MoE variant is the interesting one. MoE active-parameter cost with dense-style reasoning quality is a sweet spot for self-hosted agent workloads, especially for structured tool use. The E4B effective-4B variant is small enough to run on a single consumer GPU and strong enough to drive a responsive agent. Fazm is model-agnostic at the protocol level: it speaks Anthropic Messages on the client, an OpenAI-compatible shim translates, and vLLM serves whatever is behind it. Gemma 4 is not special from Fazm's side, it is just a better point on the cost/quality curve than what you had in March.
What is the upgrade procedure with the fewest moving parts?
Five commands. On the server: pip install -U vllm. Stop the vllm serve process cleanly. Start it again with the same model and same port. Verify with a single curl to /v1/models. Return to Fazm and continue the conversation. If your shim is running in the same venv as vLLM, restart the shim too. If the shim is a separate service, it does not need to be touched unless the OpenAI API surface it depends on changed, which almost never happens patch-to-patch in vLLM.
Where is the Fazm source that guarantees all of this?
Three files. Desktop/Sources/Chat/ACPBridge.swift lines 379 to 382, the four-line read of customApiEndpoint from UserDefaults and the export as ANTHROPIC_BASE_URL. Desktop/Sources/MainWindow/Pages/SettingsPage.swift lines 906 to 952, the Settings card titled 'Custom API Endpoint' with placeholder https://your-proxy:8766. Desktop/Sources/Providers/ChatProvider.swift lines 2101 to 2106, the six-line restart function. CHANGELOG.json dated 2026-04-11 records the release (v2.2.0) that first shipped the Custom API Endpoint setting. The full tree is MIT-licensed at github.com/mediar-ai/fazm.
How does Fazm know which model to ask for on the new endpoint?
It does not hard-code one. acp-bridge/src/index.ts line 1245 declares DEFAULT_MODEL = 'claude-sonnet-4-6' as a warmup seed, but when you send a message the bridge uses whatever the app passes as msg.model or falls back to DEFAULT_MODEL if nothing is set. In the Anthropic-to-OpenAI shim in front of vLLM, model names are translated (sonnet goes to whatever you configured, for example a Gemma 4 26B MoE checkpoint). The Fazm side cares about the Messages API shape, not the model string.
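That fallback reduces to a one-liner. DEFAULT_MODEL matches the constant the answer cites; resolveModel is an illustrative name for the selection the bridge performs:

```typescript
// No hard-coded model: use whatever the app passes, fall back to the
// warmup seed only when nothing is set.
const DEFAULT_MODEL = "claude-sonnet-4-6";

function resolveModel(msg: { model?: string }): string {
  return msg.model ?? DEFAULT_MODEL;
}
```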
Why is this story missing from every vLLM release notes summary in the top search results?
Because vLLM is server infrastructure and its audience is ML ops. Every mainstream release notes page is written for people operating inference clusters. The question 'what do I rebuild on my desktop when I run pip install -U vllm on my server' is asked by agent builders, not platform operators. Those audiences read different pages. This guide is the overlap: vLLM's 2026 release cadence, read through the lens of a consumer Mac agent whose source tree has a six-line function saying 'the answer is nothing'.