vLLM Update April 2026: v0.18, v0.19, Gemma 4, and gRPC Serving

Matthew Diakonov · 9 min read

vLLM Update April 2026: What Shipped and What It Means

April 2026 has been one of the busiest months for vLLM since the project's inception. Two major releases (v0.18.0 and v0.19.0) landed within weeks of each other, adding gRPC serving, GPU-accelerated speculative decoding, full Gemma 4 support, and a critical security patch. This post covers every significant change, who it affects, and how to upgrade.

April 2026 Release Timeline

| Date | Version | Headline Change | Who Should Care |
|---|---|---|---|
| Late March | v0.18.0 | gRPC serving, GPU NGram spec decode, KV cache offloading | Production teams serving at scale |
| April 2, 2026 | v0.19.0 | Gemma 4 architecture, async scheduling default, Model Runner V2 | Anyone deploying Gemma 4 or MoE models |
| April 3, 2026 | v0.19.1rc0 | Release candidate with stability fixes | Early adopters testing v0.19 |
| Early April | Security patch | CVE-2026-0994 fix for prompt_embeds deserialization | Everyone running vLLM in production |

v0.18.0: gRPC Serving and Speculative Decoding on GPU

gRPC as a First-Class Serving Protocol

vLLM v0.18.0 introduced native gRPC serving through the --grpc flag. This runs alongside the existing HTTP/REST interface, giving teams a binary protocol with HTTP/2 multiplexing for lower-latency, higher-throughput inference calls.

For microservice architectures where the inference server sits behind internal load balancers, gRPC reduces serialization overhead compared to JSON over HTTP. Teams already using gRPC for other services can now integrate vLLM without an intermediate proxy.

# Launch vLLM with gRPC enabled
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E \
  --grpc \
  --port 8000
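The serialization argument is easy to see with a back-of-the-envelope sketch. The snippet below compares a JSON-encoded request against a packed binary encoding of the same payload; it is a toy illustration of why binary protocols like gRPC's protobuf framing cost less per request, not vLLM's actual wire format.

```python
import json
import struct

# A toy request: 512 token IDs plus two sampling parameters.
token_ids = [100000 + i for i in range(512)]
request = {"token_ids": token_ids, "temperature": 0.7, "max_tokens": 128}

# JSON over HTTP: every token ID is serialized as text, plus separators.
json_bytes = json.dumps(request).encode("utf-8")

# A packed binary encoding (a stand-in for protobuf): 4 bytes per token ID
# plus a small fixed-size header for the scalar fields.
binary_bytes = struct.pack(f"<{len(token_ids)}i", *token_ids)
binary_bytes += struct.pack("<fi", 0.7, 128)

print(len(json_bytes), len(binary_bytes))  # binary is roughly half the size
```

On top of the size difference, the binary form skips text parsing entirely on the receiving end, which is where most of the JSON overhead actually goes.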

GPU-Accelerated NGram Speculative Decoding

Speculative decoding predicts multiple tokens at once, then verifies them in a single forward pass. In v0.18.0, NGram speculative decoding moved from CPU to GPU and became compatible with the async scheduler. The result is significantly reduced overhead for speculative decode workflows, which translates directly to lower latency for interactive applications.
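To make the draft-and-verify idea concrete, here is a minimal pure-Python sketch of NGram (prompt-lookup) drafting. The real vLLM implementation runs this matching on GPU and verifies all drafts in one batched forward pass; here the target model's predictions are stubbed out as a plain list.

```python
def ngram_draft(tokens, n=3, k=4):
    """Propose up to k draft tokens by matching the last n tokens
    against an earlier occurrence of the same n-gram in the sequence."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Search backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []

def verify(drafts, target_next):
    """Accept draft tokens until the first mismatch with the target
    model's own next-token predictions (stubbed here as target_next)."""
    accepted = []
    for d, t in zip(drafts, target_next):
        if d != t:
            break
        accepted.append(d)
    return accepted

seq = [1, 2, 3, 4, 5, 2, 3, 4]        # the trigram (2, 3, 4) appeared earlier
drafts = ngram_draft(seq, n=3, k=3)    # proposes the tokens that followed it
accepted = verify(drafts, [5, 2, 9])   # target agrees on the first two drafts
print(drafts, accepted)                # [5, 2, 3] [5, 2]
```

Every accepted draft token is one decode step the target model did not have to run serially, which is where the latency win comes from.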

KV Cache Offloading Gets Smarter

The new KV cache offloading system in v0.18.0 stores only frequently reused blocks on CPU, rather than offloading everything indiscriminately. FlexKV was added as a new offloading backend, and support for multiple KV groups landed as well. For long-context workloads, this means you can handle more concurrent requests without running out of GPU memory.
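The reuse-aware policy can be pictured with a toy eviction hook: only blocks that have been hit often enough to amortize the PCIe transfer get parked on CPU. Block IDs, the threshold, and the class shape are all invented for illustration; this is not vLLM's or FlexKV's actual interface.

```python
from collections import Counter

class SmartOffloader:
    """Toy model of reuse-aware KV offloading: when a block is evicted
    from GPU HBM, keep it in CPU memory only if it has been reused
    enough to justify the transfer cost; otherwise just drop it."""

    def __init__(self, min_reuse=2):
        self.min_reuse = min_reuse
        self.hits = Counter()
        self.cpu_cache = {}

    def access(self, block_id):
        self.hits[block_id] += 1

    def evict(self, block_id, block_data):
        if self.hits[block_id] >= self.min_reuse:
            self.cpu_cache[block_id] = block_data  # hot block: keep on CPU
        # else: dropped -- recomputing a cold block beats shuttling it

off = SmartOffloader(min_reuse=2)
for blk in ["sys-prompt", "sys-prompt", "sys-prompt", "user-123"]:
    off.access(blk)
off.evict("sys-prompt", b"kv-bytes")  # reused 3x -> offloaded to CPU
off.evict("user-123", b"kv-bytes")    # seen once -> dropped
print(sorted(off.cpu_cache))          # ['sys-prompt']
```

The intuition matches the common case in production: shared system prompts get reused across requests and are worth keeping, while per-user context usually is not.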

GPU-less Render Serving

The vllm launch render command lets you separate multimodal preprocessing from GPU inference entirely. Preprocessing (tokenization, image encoding) runs on CPU-only nodes, and the GPU nodes focus purely on inference. For multimodal workloads at scale, this separation can cut GPU costs significantly.
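The split is a classic producer/consumer pipeline, sketched below with a plain queue. The "tokenizer" and "forward pass" are stand-ins (a real render node would run the model's actual preprocessor); the point is that nothing on the producer side needs a GPU.

```python
from queue import Queue

def render_worker(raw_requests, work_q):
    """CPU-only render node: tokenization / image encoding, no GPU needed."""
    for req in raw_requests:
        tokens = [ord(c) for c in req]  # stand-in for a real tokenizer
        work_q.put(tokens)
    work_q.put(None)                    # sentinel: no more work

def gpu_worker(work_q, results):
    """GPU node: consumes preprocessed inputs, runs inference only."""
    while (item := work_q.get()) is not None:
        results.append(len(item))       # stand-in for a forward pass

q, out = Queue(), []
render_worker(["hello", "world!"], q)
gpu_worker(q, out)
print(out)  # one result per request
```

Because the two sides only share a queue, they scale independently: cheap CPU nodes absorb preprocessing spikes while expensive GPU nodes stay saturated with inference.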

vLLM April 2026: Architecture Overview (diagram)

  • Client layer: HTTP/REST (existing), gRPC (v0.18+), OpenAI-compatible API
  • Render node (CPU): multimodal preprocessing, tokenization
  • GPU inference node: Model Runner V2, CUDA graphs, TP/PP
  • Async scheduler (default): zero-bubble overlap, spec decode compatibility
  • KV cache layer: GPU HBM, smart CPU offload (FlexKV), prefix caching (<1% overhead)
  • Supported models: Gemma 4 (MoE + dense), LLaMA 4, Qwen 3, DeepSeek V3, 100+ architectures

v0.19.0: Gemma 4 Support and Async Scheduling by Default

Full Gemma 4 Architecture Support

vLLM v0.19.0 shipped with complete Gemma 4 support, covering all four model variants: E2B (effective 2B), E4B (effective 4B), 26B MoE, and 31B Dense. The implementation handles MoE routing, multimodal inputs, reasoning traces, and tool-use capabilities natively.

Google DeepMind released Gemma 4 on April 2, 2026, and vLLM had day-one support. The recommended deployment path is the pre-built Docker image:

# Deploy Gemma 4 26B MoE with vLLM (recommended Docker approach)
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:gemma4 \
  --model google/gemma-4-27b-it \
  --tensor-parallel-size 2

# Or install from PyPI
pip install vllm==0.19.0
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-27b-it \
  --tensor-parallel-size 2

Async Scheduling Enabled by Default

The async scheduler, which overlaps engine scheduling with GPU execution, is now on by default. No configuration needed. This zero-bubble overlap approach means the scheduler prepares the next batch while the current batch is still executing on the GPU, eliminating idle time between steps.

Combined with speculative decoding compatibility, this delivers measurable throughput improvements without any code changes on the user's side.
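A toy timing model shows where the win comes from. The per-step costs below are made up for illustration; real scheduling and execution times vary by model and batch size.

```python
def total_time(schedule_ms, execute_ms, steps, overlap):
    """Wall-clock time for `steps` engine iterations.
    Without overlap the GPU idles while the next batch is scheduled;
    with overlap, scheduling step i+1 hides behind executing step i."""
    if not overlap:
        return steps * (schedule_ms + execute_ms)
    # Only the first schedule is exposed; afterwards scheduling overlaps
    # with execution as long as it is not the bottleneck.
    return schedule_ms + steps * max(execute_ms, schedule_ms)

seq = total_time(2, 20, steps=100, overlap=False)  # 2200 ms
ovl = total_time(2, 20, steps=100, overlap=True)   # 2002 ms
print(seq, ovl)
```

With these numbers the scheduler bubble costs about 9% of wall-clock time; the shorter the per-step GPU work (small models, speculative decoding), the bigger the relative win from hiding it.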

Model Runner V2 Improvements

Model Runner V2 gained piecewise CUDA graphs for pipeline parallelism, multimodal embedding support for speculative decoding, streaming inputs, and EPLB (Expert Parallelism Load Balancer) support for MoE models. Vision encoders now support full CUDA graph capture, reducing per-request overhead for multimodal inference.

Auto Max Model Length

The --max-model-len auto flag automatically determines the maximum context length that fits in available GPU memory. This eliminates the common failure mode where vLLM starts up, tries to allocate KV cache for the full context window, and immediately OOMs.

# Let vLLM figure out the right context length for your hardware
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E \
  --max-model-len auto
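The sizing logic behind such a flag can be approximated with simple arithmetic: each token stores one key and one value vector per layer per KV head. The model dimensions below are hypothetical, and a real implementation also has to budget for weights, activations, and CUDA graph memory.

```python
def estimate_max_model_len(free_gpu_bytes, num_layers, num_kv_heads,
                           head_dim, dtype_bytes=2, kv_fraction=0.9):
    """Rough upper bound on context length for a single sequence:
    per token, the KV cache stores a key and a value vector
    for every layer and every KV head."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    budget = int(free_gpu_bytes * kv_fraction)
    return budget // bytes_per_token

# e.g. 20 GB free, a 32-layer model with 8 KV heads of dim 128, fp16 cache
max_len = estimate_max_model_len(20 * 1024**3, 32, 8, 128, dtype_bytes=2)
print(max_len)
```

Grouped-query attention is why this number is as large as it is: with 8 KV heads instead of 64 query heads, the cache is 8x smaller per token.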

Hardware Platform Updates

Both releases expanded hardware support significantly:

| Platform | What Changed | Version |
|---|---|---|
| Intel XPU | CUDA graph support, GPUDirect RDMA via NIXL, TORCH_SDPA/TRITON_ATTN ViT backends | v0.18.0+ |
| ARM CPU | BF16 cross-compilation support | v0.18.0+ |
| s390x (IBM Z) | FP16 support, vector intrinsics for attention kernels | v0.18.0+ |
| ppc64le (POWER) | Prefix caching support | v0.18.0+ |
| Apple Silicon | vLLM-Metal v0.1.0 automated releases (April 6) | Separate plugin |
| CPU (general) | AVX2 and AVX512 builds in official releases | v0.18.0+ |

The Intel XPU improvements are particularly notable. CUDA graph support on Intel GPUs closes a major performance gap, and GPUDirect RDMA via NIXL enables efficient multi-GPU communication without routing through the CPU.

Security: CVE-2026-0994

A critical vulnerability was patched in the April release cycle. CVE-2026-0994 affects the Completions API endpoint in vLLM versions 0.10.2 and later. The issue: when processing user-supplied prompt embeddings, vLLM loads serialized tensors using torch.load() without sufficient validation.

Due to a change in PyTorch 2.8.0 that disabled sparse tensor integrity checks by default, maliciously crafted tensors can bypass bounds checks and trigger out-of-bounds memory writes during to_dense() calls. This can crash vLLM and potentially enable remote code execution.

Action Required

If you are running vLLM in production with the Completions API exposed, upgrade immediately. The vulnerability requires no authentication or elevated privileges to exploit. An attacker only needs the ability to send embedding payloads to your server.
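If you deserialize tensors from untrusted sources anywhere in your own stack, prefer torch.load(..., weights_only=True) and validate declared shapes before parsing bytes at all. The sketch below shows that shape-consistency check in plain Python; the limits and field names are invented for illustration, and this is not vLLM's actual patch.

```python
MAX_TOKENS = 8192
MAX_HIDDEN = 16384

def validate_embed_header(num_tokens, hidden_size, payload_len, dtype_bytes=2):
    """Reject prompt-embedding payloads whose declared shape is absurd or
    inconsistent with the actual byte count, *before* any deserialization.
    (Illustrative server-side check, not vLLM's actual fix.)"""
    if not (0 < num_tokens <= MAX_TOKENS):
        return False
    if not (0 < hidden_size <= MAX_HIDDEN):
        return False
    # Declared shape must match the bytes actually sent.
    return payload_len == num_tokens * hidden_size * dtype_bytes

ok = validate_embed_header(512, 4096, 512 * 4096 * 2)
bad = validate_embed_header(512, 4096, 16)  # truncated/inconsistent payload
print(ok, bad)  # True False
```

Checks like this are defense in depth, not a substitute for upgrading: the underlying out-of-bounds write lives in the deserializer itself, so the patched release is the real fix.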

Deprecations and Removals

Several features were removed or deprecated in the v0.18/v0.19 cycle:

  • BitBlas quantization and Marlin 24 quantization methods removed
  • The reasoning_content message field is deprecated (use structured output instead)
  • Deprecated pooling items API removed
  • The VLLM_ALL2ALL_BACKEND environment variable is gone (replaced by automatic selection)
  • huggingface-hub updates prepare for Transformers v5, with compatibility fixes across multiple model architectures

If your deployment scripts reference any of these, update them before upgrading.

Upgrade Path

The recommended approach for upgrading to v0.19.0:

# 1. Check your current version
pip show vllm | grep Version

# 2. Review breaking changes if jumping from v0.17 or earlier
# Key: BitBlas/Marlin24 removal, reasoning_content deprecation

# 3. Upgrade
pip install vllm==0.19.0

# 4. Test with auto context length to catch memory issues early
python -m vllm.entrypoints.openai.api_server \
  --model your-model-here \
  --max-model-len auto \
  --port 8000

# 5. Verify the security patch is included
python -c "import vllm; print(vllm.__version__)"

For Docker deployments, pull the latest image:

docker pull vllm/vllm-openai:latest
# Or for Gemma 4 specifically
docker pull vllm/vllm-openai:gemma4

Performance Benchmarks

The v0.18/v0.19 releases include several measurable performance improvements:

| Optimization | Improvement | Context |
|---|---|---|
| Pipeline parallel async send/recv | 2.9% end-to-end throughput | Multi-GPU setups |
| Pooling maxsim optimization | 13.9% throughput | Embedding models |
| Triton ViT attention backend | Reduced vision encoder overhead | Multimodal models |
| Prefix caching overhead | Less than 1% at 0% hit rate | All deployments |
| Pooling model copy optimization | 1.8% throughput | Embedding models |
| Zero-bubble async scheduling | Variable (workload dependent) | All deployments |

These are incremental but they compound. A production deployment running multi-GPU pipeline parallel with prefix caching and async scheduling will see meaningful aggregate throughput gains over v0.17.
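The arithmetic of compounding is worth spelling out: independent gains on the same request path multiply rather than add. As a toy example, take two of the figures above (note that in reality they apply to different workload types, so this is only the multiplication principle):

```python
# Multiplicative compounding of independent speedups on one path.
gains = [1.029, 1.018]  # +2.9% and +1.8% from the table above

combined = 1.0
for g in gains:
    combined *= g

print(f"{(combined - 1) * 100:.1f}%")  # slightly more than 2.9 + 1.8
```

Stack five or six such improvements across a release cycle and the aggregate difference versus v0.17 becomes hard to ignore, even though no single item looks dramatic.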

What to Watch Next

Based on the current trajectory and community discussions:

  • vLLM 0.20 will likely expand Transformers v5 compatibility across all supported architectures
  • Speculative decoding continues to improve; expect more draft model options and better acceptance rates
  • MoE routing optimizations for Gemma 4 and DeepSeek V3 style models are active areas of development
  • vLLM-Metal for Apple Silicon is maturing fast, with automated releases starting in April 2026

Bottom Line

April 2026 brought two releases that meaningfully advance vLLM's production readiness. The gRPC support in v0.18 and async scheduling default in v0.19 reduce deployment friction. Gemma 4 day-one support demonstrates the project's ability to keep pace with new model architectures. The CVE-2026-0994 patch is the most urgent action item: if you run vLLM in production, upgrade now.

The project shipped 445 commits from 213 contributors in v0.19 alone, with 61 first-time contributors. The open source inference serving space is consolidating around vLLM, and April 2026 shows why.

Fazm is an open source AI agent for macOS that helps you automate desktop tasks using voice and text. Built with Swift, runs locally, and connects to your tools through MCP.
