
vLLM on Windows in 2026: what officially works, what doesn't, and what to do once it's serving

Short version: vLLM does not have official Windows support, has never had it, and there is no public roadmap for it as of May 2026. The three paths that work today are WSL2, Docker Model Runner with the WSL2 backend (added December 2025), and the community fork SystemPanic/vllm-windows (v0.20.0 shipped April 30 2026, native Python 3.12 + CUDA 13 + PyTorch 2.11 wheels). The rest of this page is the comparison between those three, plus the question every other guide on this topic skips: once the server is serving, what is actually calling it.

Matthew Diakonov
8 min read

Direct answer (verified 2026-05-12)

No, vLLM does not officially support Windows. The official installation docs at docs.vllm.ai list NVIDIA CUDA, AMD ROCm, Intel XPU, and Apple Silicon for GPU, plus Intel/AMD x86, ARM AArch64, Apple Silicon, and IBM Z for CPU. Windows is not in either list.

The three paths that do work in 2026 are:

  • WSL2: the team's recommended path. Install Ubuntu inside Windows and use the official Linux build.
  • Docker Model Runner: added vLLM on Windows via the WSL2 backend on December 11 2025. Requires Docker Desktop 4.54+, WSL2, and an NVIDIA GPU with compute capability 8.0 or higher.
  • SystemPanic/vllm-windows (community fork): v0.20.0 shipped April 30 2026 with native Windows wheels built for Python 3.12, CUDA 13, and PyTorch 2.11. First release with NCCL plus tensor and pipeline parallelism on Windows.

The three working paths

  1. WSL2: official Linux build, inside Windows
  2. Docker Model Runner: Dec 11 2025, WSL2 backend
  3. SystemPanic/vllm-windows: v0.20.0, Apr 30 2026, native wheels
  4. Same OpenAI API from all three: localhost:8000/v1

Path 1: WSL2 (still the team's recommendation)

WSL2 is a real Linux kernel running inside Windows, with GPU passthrough to your NVIDIA card via the modern WSL CUDA driver. The vLLM team's stance, expressed across multiple GitHub issues over 2024 and 2025, is that maintaining a separate Windows build with full CUDA, Triton, and Flash-Attention kernels has high engineering cost relative to the size of the Windows-first inference audience, most of whom can use WSL2 instead. So WSL2 has remained the canonical Windows path.
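
Under WSL2 the install is just the standard Linux one. A minimal sketch, assuming an Ubuntu distribution, a current NVIDIA Windows driver (which provides the WSL CUDA stack), and a placeholder model name you would swap for whatever you actually serve:

# In an elevated PowerShell on Windows: install Ubuntu under WSL2
wsl --install -d Ubuntu

# Inside the Ubuntu shell: confirm the GPU is visible through the WSL CUDA driver
nvidia-smi

# Standard Linux install of vLLM, kept in a virtualenv
python3 -m venv ~/vllm-env && source ~/vllm-env/bin/activate
pip install vllm

# Serve a model on port 8000; the endpoint is the usual OpenAI-shaped one
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000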

What you give up: somewhere between 20 and 40 percent of native throughput on a GPU-bound workload, depending on model size, batch size, and how chatty the host-to-guest filesystem traffic is. Most of that is GPU passthrough overhead and CPU-side scheduling under Hyper-V. For a single 7B model on a single 5090 you may not notice. For batched serving on a multi-GPU box, the tax shows up.

What you gain: zero patches to track, identical behavior to a Linux server, and the entire vLLM extras ecosystem (Flash-Attention 3, FP8 quantization, Speculative Decoding) without anyone having to port kernels.

Path 2: Docker Model Runner (the easiest one)

On December 11 2025, Docker shipped vLLM as a first-class backend for Docker Model Runner on Windows. The setup is shorter than any of the other paths:

# Enable Docker Model Runner (Docker Desktop 4.54+, WSL2 enabled)
docker desktop enable model-runner --no-tcp

# Install the vLLM backend with CUDA
docker model install-runner --backend vllm --gpu cuda

# Pull and run any model that Docker Model Runner supports
docker model run ai/smollm2-vllm "Hello, Windows."

Requirements per Docker's own announcement: Docker Desktop for Windows 4.54 or newer, WSL2 backend enabled, an NVIDIA GPU with compute capability 8.0 or higher (so Ampere, Ada Lovelace, Hopper, or Blackwell), and current NVIDIA drivers. AMD and Intel GPUs are served through Vulkan in Docker Model Runner generally, but the vLLM backend specifically is NVIDIA-only.

This is still technically WSL2 underneath, so it inherits the same 20 to 40 percent virtualization tax. The win is operational: no Python environment to manage, no CUDA toolkit to install by hand, no Triton wheel hunt. If you are evaluating vLLM on Windows for the first time, this is the path that costs you the least time.
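
If you want to sanity-check the backend after those three commands, the Model Runner CLI has status and list subcommands; the exact subcommand names may shift between Docker Desktop releases, so treat this as a sketch:

# Confirm Docker Model Runner is up
docker model status

# List the models it has pulled locally
docker model list

# One-shot prompt against the pulled model (same run command as above)
docker model run ai/smollm2-vllm "Summarize what vLLM is in one sentence."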

Path 3: SystemPanic/vllm-windows native wheels

The community fork at github.com/SystemPanic/vllm-windows maintains pre-built Windows wheels of vLLM plus the necessary Windows-specific kernel patches. The current release, v0.20.0, shipped on April 30 2026. Three things changed in v0.20.0 that matter:

  • NCCL on Windows. Up through v0.19.0 the Windows builds were single-GPU only because Gloo was the only working distributed backend on Windows and tensor/pipeline parallelism needed NCCL.
  • Tensor and pipeline parallelism. First Windows release with multi-GPU serving. This is the build that makes a 70B model on a Windows workstation with two RTX cards actually plausible.
  • CUDA 13 + Blackwell support. Pre-built against Python 3.12, CUDA 13, and PyTorch 2.11. The release notes specifically call out Blackwell GPU compatibility, which the upstream Linux build also added in the same window.

The install is a single pip command against the downloaded wheel that matches your Python and CUDA versions. There is no automated installer; you pick the wheel manually from the releases page. Older v0.19.0 and v0.17.0 wheels are still up if you are pinned to CUDA 12.4 or running an earlier PyTorch.
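
As a sketch, with an illustrative wheel filename (copy the exact name from the releases page and match it to your Python 3.12 / CUDA 13 / PyTorch 2.11 environment):

# Windows PowerShell, inside an environment that already has PyTorch 2.11 built for CUDA 13
# The filename below is illustrative; use the real one from the releases page
pip install .\vllm-0.20.0+cu130-cp312-cp312-win_amd64.whl

# Smoke test: serve a small model and confirm the OpenAI endpoint answers
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
curl http://localhost:8000/v1/models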

The risk you accept: the fork is one developer, not the vLLM project. New upstream features land here some days or weeks later. Security fixes are best-effort. For a Windows-only team that genuinely cannot use WSL2 (corporate AV, driver constraints, or a kiosk-style deployment), the latency is acceptable. For a team that can use WSL2, the calculus usually pushes back toward the official path.

v0.20.0

Released April 30 2026, built for Python 3.12, CUDA 13, and PyTorch 2.11. First Windows release with NCCL plus tensor and pipeline parallelism, plus Blackwell GPU support.

github.com/SystemPanic/vllm-windows/releases

Pick the right path in under a minute

  1. Just want it to work: Docker Model Runner. One install command. Pays a small WSL2 tax. NVIDIA only.
  2. Want full control of the env: WSL2 + Ubuntu + official vLLM. Same tax as Docker, more knobs.
  3. Cannot use WSL2: SystemPanic/vllm-windows wheels. Native speed. Community-maintained.
  4. Mac and Windows in the same shop: serve vLLM on the Windows box, point a Mac-side agent at it over the LAN.

The part every other page on this skips

Every guide we found follows the same arc: pick a path, run the install commands, curl http://localhost:8000/v1/chat/completions, print the JSON, and end. That is enough if you only wanted to confirm the GPU works. It is not enough if you wanted vLLM for a reason.

The reason people put up with running vLLM on Windows is usually one of two things. Either they have a beefy gaming or workstation GPU on a Windows box and want to use it for something more interesting than benchmarks, or they have a private corpus or sensitive workflow they cannot send to a cloud provider. Neither use case ends at a curl. Both end with a client that does real work.

That client is what Windows is missing in 2026. Native consumer desktop agents that read the OS accessibility tree, drive other apps, and treat the local LLM as the brain still don't exist on Windows the way they do on Mac. The category is small everywhere, but on Windows it is genuinely thin. Most of what you find is browser-only, screenshot-based, or developer toolkits aimed at engineers willing to script their own agent loop. Real consumer-facing desktop automation that uses the model you just stood up is rare.

What a useful Windows + vLLM stack looks like

[Flow: the user asks the agent client to "do this for me" → the agent client sends an Anthropic Messages call to the shim → the shim forwards it as OpenAI Chat Completions to vLLM on Windows → vLLM returns a tool_use block → the shim hands back Anthropic-shaped tool_use → the agent client reads the screen and drives desktop apps through the accessibility tree → done.]

A real cross-machine pattern: Windows brain, Mac hands

If you already have vLLM serving on a Windows box and you also use a Mac, the most productive thing to do with both is to treat them as separate layers. Run inference on the Windows GPU, expose the OpenAI Chat Completions endpoint on port 8000 at the Windows machine's LAN address, and run the agent on the Mac side, where the desktop automation surface is more developed.

Fazm on the Mac exposes this directly. In Settings, AI Chat, Custom API Endpoint, you paste the URL of your Windows vLLM server (usually fronted by a thin Anthropic-shaped shim like LiteLLM, because vLLM speaks OpenAI Chat Completions and the agent's bridge talks Anthropic Messages). The Swift code that picks up that field lives in Desktop/Sources/Chat/ACPBridge.swift around lines 468 to 469:

if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
  env["ANTHROPIC_BASE_URL"] = customEndpoint
}

When that env var is set, every model call the agent makes goes to your Windows vLLM endpoint instead of api.anthropic.com. The Mac side handles the accessibility-tree work (clicking around in Finder, Calendar, Mail, Sheets), and the Windows GPU is the brain. The whole thing stays on your network. That is the realistic version of "running vLLM on Windows for an agent" in 2026, because the agent half does not exist natively on Windows yet.
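
The shim itself is a few lines of configuration. A sketch assuming LiteLLM's proxy, a hypothetical Windows box at 192.168.1.50, and whatever model name your vLLM server actually registered (check /v1/models); verify the config keys against the LiteLLM docs for your version:

# config.yaml for the LiteLLM proxy (runs on the Mac, or any box on the LAN)
# "model_name" is what the agent asks for; "api_base" is the Windows vLLM endpoint
model_list:
  - model_name: local-brain
    litellm_params:
      model: openai/Qwen/Qwen2.5-7B-Instruct   # "openai/" prefix = generic OpenAI-compatible backend
      api_base: http://192.168.1.50:8000/v1
      api_key: "sk-noauth"                     # vLLM ignores this unless started with --api-key

# Then: pip install 'litellm[proxy]'
#       litellm --config config.yaml --port 4000
# Point the agent's custom endpoint (ANTHROPIC_BASE_URL) at http://<proxy-host>:4000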

What to verify before you call your Windows vLLM setup done

  • Pick exactly one of WSL2, Docker Model Runner, or SystemPanic/vllm-windows. Mixing two will lead to driver fights.
  • Confirm GPU compute capability: NVIDIA 8.0 or higher for Docker Model Runner. nvidia-smi --query-gpu=compute_cap --format=csv answers it.
  • Curl http://localhost:8000/v1/models from the Windows box. Then curl from another machine on the LAN. The second test is the one that catches Windows Firewall and WSL2 port-forwarding issues (a sketch of both checks follows this list).
  • If you plan to use the SystemPanic fork, match the wheel exactly: Python 3.12 + CUDA 13 + PyTorch 2.11 for v0.20.0. v0.19.0 is the right pick if you are stuck on CUDA 12.4.
  • Put an Anthropic-shaped shim in front (LiteLLM is the usual choice) if you plan to drive the endpoint from a tool that speaks Anthropic Messages.
  • Decide where the agent runs. Windows-side options are limited in 2026. A Mac-side agent pointed at your Windows endpoint is often the shortest path to something useful.
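
The two curl checks from the list above, plus the Windows-side plumbing that usually makes the second one fail, look roughly like this. The firewall rule name and the addresses are illustrative, and if you are on mirrored WSL networking the portproxy step is unnecessary:

# 1. From the Windows box itself
curl http://localhost:8000/v1/models

# 2. From another machine on the LAN (replace with the Windows box's address)
curl http://192.168.1.50:8000/v1/models

# If test 2 fails and vLLM runs under WSL2's default NAT networking:
# forward the port and open the firewall (elevated PowerShell; get the WSL address with `wsl hostname -I`)
netsh interface portproxy add v4tov4 listenport=8000 listenaddress=0.0.0.0 connectport=8000 connectaddress=172.20.240.10
New-NetFirewallRule -DisplayName "vLLM 8000" -Direction Inbound -LocalPort 8000 -Protocol TCP -Action Allow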

Numbers worth knowing

  • 40%: upper bound of the native-over-WSL2 throughput gain reported in community benchmarks
  • 8.0: NVIDIA compute capability floor for the Docker Model Runner vLLM backend
  • 13: CUDA major version targeted by SystemPanic/vllm-windows v0.20.0
  • 0: official Windows platforms in the vLLM supported-platform list

One last honest take

vLLM on Windows in 2026 is a solved problem at the inference layer. Three working paths, all NVIDIA-friendly, all reaching the same OpenAI-shaped endpoint. If your only question is "can I get vLLM to serve a model on my Windows box," the answer is yes: pick whichever of the three paths matches your tolerance for unofficial tooling.

What is not solved on Windows is the layer above the model: a consumer-grade desktop agent that uses the model. The pages that rank for this topic do not say so, but it is the part that matters once you are past the install. The Mac side has more there. The cross-machine Windows-brain Mac-hands pattern is the realistic way to combine them today, and it costs you exactly one field in the agent's settings.

Got vLLM serving on a Windows box and a Mac on the same network?

Book 20 minutes. We will walk through the LAN bridge, the Anthropic shim choice, and how the Mac-side agent talks to your Windows GPU live.

Frequently asked questions

Does vLLM officially support Windows in 2026?

No. The official vLLM installation docs at docs.vllm.ai list NVIDIA CUDA, AMD ROCm, Intel XPU, and Apple Silicon for GPU, plus Intel/AMD x86, ARM AArch64, Apple Silicon, and IBM Z for CPU. Windows is not in either list. The vLLM team's recommended path for Windows users is WSL2 (Ubuntu running inside Windows), which lets you install the official Linux build without any patches.

What changed for Windows users in late 2025 and early 2026?

Two things. On December 11 2025, Docker added vLLM to Docker Model Runner on Windows, gated by Docker Desktop 4.54 or newer, the WSL2 backend, and an NVIDIA GPU with compute capability 8.0 or higher. That made a single 'docker model install-runner --backend vllm --gpu cuda' the easiest path. Then on April 30 2026, the community fork SystemPanic/vllm-windows shipped v0.20.0, the first native Windows release with NCCL plus tensor and pipeline parallelism, built for Python 3.12, CUDA 13, and PyTorch 2.11.

Is the SystemPanic/vllm-windows fork safe to use in production?

It is community-maintained, not blessed by the vLLM project itself. The pre-built wheels work on the published Python/CUDA/PyTorch matrix and pass the project's own tests, but you take on the risk that comes with any unofficial fork: lag behind upstream, occasional incompatibilities with the latest models, and no SLA on security fixes. Most teams running production Windows inference still go through WSL2 or Docker Model Runner for those reasons. The fork is useful when you genuinely cannot use WSL2 (corporate policy, driver constraints) or you need throughput that WSL2's virtualization tax eats into.

How much performance do you lose by running vLLM under WSL2 versus native Windows?

Community benchmarks throughout late 2025 and early 2026 land in a 20 to 40 percent range, with native Windows builds claiming up to 40 percent throughput improvements over WSL2-based deployments. Most of the delta is GPU passthrough overhead and CPU-side scheduling. For a small model on a single H100 or RTX 5090 you may not notice; for batched serving on a multi-GPU box the gap shows up.

Can I run vLLM on Windows without an NVIDIA GPU?

CPU-only inference is possible by building vLLM from source against the CPU backend, but on Windows the practical answer is 'use WSL2 to do this.' AMD GPUs are unsupported through any of the three working paths today; vLLM's ROCm backend targets Linux only. If you must run on CPU on Windows, look at llama.cpp before vLLM. vLLM's architectural sweet spot is high-throughput batched serving on data center GPUs, not low-throughput single-prompt CPU work.
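
If you do go the CPU route anyway, the upstream CPU backend builds from source behind a target-device flag. A rough sketch inside WSL2 Ubuntu; the requirements file layout moves around between vLLM releases, so check the repo before running it:

# Inside WSL2 Ubuntu: build vLLM against the CPU backend (no CUDA required)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt          # path varies by release
VLLM_TARGET_DEVICE=cpu pip install -e .

# Serving works the same way, just slowly
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000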

Does any desktop AI agent run on Windows the way Fazm does on Mac?

Not at parity, no. The category of consumer-facing computer-use agents that read the OS accessibility tree (rather than taking screenshots and pattern-matching pixels) is small even on Mac. On Windows in 2026 the comparable surface is UI Automation, which works but is less consistent across Win32, WinUI, and packaged UWP apps. Most Windows-side computer-use experiments still rely on screenshot loops. If you have already invested in running vLLM on a Windows machine, the better-than-nothing pattern is to expose your vLLM endpoint on the LAN and point a Mac-side agent at it: the brain runs on your Windows GPU, the hands run on a Mac.

Will vLLM ever get first-class Windows support?

There is no public roadmap for it as of May 2026, and the official vLLM installation page treats Linux as the canonical platform. The project's reasoning, expressed across multiple GitHub issues, is that the engineering cost of maintaining a separate Windows build with full CUDA, Triton, and Flash-Attention kernels is high relative to the size of the Windows-first inference audience, most of whom can use WSL2. Expect the WSL2 path to stay the recommended one. Expect Docker Model Runner and the SystemPanic fork to absorb the rest of the demand.

What does a real end-to-end Windows + vLLM agent stack look like in 2026?

Three layers. (1) Inference: vLLM running on the Windows box, either via WSL2, Docker Model Runner, or SystemPanic/vllm-windows, exposing the OpenAI Chat Completions endpoint on port 8000 of the Windows machine's LAN address. (2) Translation: a thin shim that converts whatever protocol your agent speaks (Anthropic Messages, ACP, MCP) into OpenAI Chat Completions. LiteLLM is the usual choice. (3) Agent: the part that actually does work, sees the screen, reads documents, drives apps. This is the part Windows is missing. Most teams either roll their own browser-only agent, or run the agent layer on a Mac that points at the Windows vLLM endpoint over the LAN.
