vLLM 0.8.2 Release Date, Changelog, and Upgrade Guide

Matthew Diakonov · 6 min read

vLLM 0.8.2 Release Date: March 25, 2025

vLLM 0.8.2 was released to PyPI on March 25, 2025, with the GitHub tag created on March 23, 2025. This was a critical patch release focused on fixing a V1 engine memory usage bug that could cause out-of-memory crashes under production load. The vLLM team explicitly recommended upgrading from 0.8.1 immediately.

Release Timeline: vLLM 0.8.x Series

| Version | Release Date | Type | Key Change |
|---|---|---|---|
| v0.8.0 | March 2025 | Major | V1 engine improvements, structured output backends |
| v0.8.1 | Mid-March 2025 | Patch | Bug fixes, stability improvements |
| v0.8.2 | March 25, 2025 | Critical patch | V1 engine memory fix, FP8 KV cache |
| v0.8.3 | Late March 2025 | Patch | Additional fixes |
| v0.8.4 | April 2025 | Patch | Continued stability work |
| v0.8.5 | April 2025 | Patch | Final 0.8.x release |

What Changed in vLLM 0.8.2

The release addressed a V1 engine memory leak that could cause crashes during sustained serving. Beyond the critical fix, it shipped several notable features.

V1 Engine Memory Fix

The primary reason for this release: the V1 engine had a memory management bug that caused growing memory consumption over time. Under sustained load, this would eventually trigger OOM kills. The fix corrected how the engine tracked and released memory allocations during request lifecycle management.

FP8 KV Cache Support

vLLM 0.8.2 added FP8 (8-bit floating point) support for the KV cache, reducing memory usage for cached key-value pairs by roughly 50% compared to FP16. This is significant for long-context serving where KV cache dominates GPU memory consumption.

# Enable FP8 KV cache in vLLM 0.8.2+
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-70b-chat-hf \
  --kv-cache-dtype fp8
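To make the ~50% figure concrete, here is a back-of-the-envelope sketch of KV cache sizing. The model shape assumes Llama-2-70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128); actual vLLM allocation differs somewhat due to paged attention block granularity.

```python
# Rough KV cache sizing: 2 tensors (K and V) per layer,
# each num_kv_heads * head_dim elements per token.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    elems_per_token = 2 * num_layers * num_kv_heads * head_dim
    return elems_per_token * bytes_per_elem * seq_len

# Assumed Llama-2-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
fp16 = kv_cache_bytes(80, 8, 128, seq_len=4096, bytes_per_elem=2)
fp8 = kv_cache_bytes(80, 8, 128, seq_len=4096, bytes_per_elem=1)

print(f"FP16 KV cache, 4096-token sequence: {fp16 / 2**30:.2f} GiB")  # 1.25 GiB
print(f"FP8  KV cache, 4096-token sequence: {fp8 / 2**30:.2f} GiB")   # 0.62 GiB
```

At these numbers, halving the per-token footprint roughly doubles how many concurrent long-context sequences fit in the same KV cache budget.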

Speculative Decoding Expansion

Top-p and top-k sampling support was added for speculative decoding workflows, broadening the sampling strategies available beyond greedy and beam search. N-gram speculative decoding also received improved defaults for better out-of-the-box latency.
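For intuition, N-gram (prompt-lookup) speculation drafts tokens by matching the most recent n-gram against earlier context and copying what followed the match. The helper below is a hypothetical illustration of that lookup step, not vLLM's actual implementation:

```python
def ngram_propose(tokens, n=2, k=3):
    """Propose up to k draft tokens by finding the most recent earlier
    occurrence of the last n tokens and copying what followed it."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for a prior occurrence of the suffix n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []

# The earlier "the" is followed by "cat sat", so those become the drafts.
print(ngram_propose(["the", "cat", "sat", "on", "the"], n=1, k=2))
# → ['cat', 'sat']
```

The target model then verifies the drafted tokens in a single forward pass; per the release notes, 0.8.2 extended such draft-and-verify flows to top-p/top-k sampling on the target model.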

TPU Support Improvements

The release expanded TPU support with ragged attention, tensor parallelism, and a new MHA Pallas backend. Teams running vLLM on Google Cloud TPUs gained meaningfully better performance and feature parity with the CUDA path.

Other Notable Changes

  • fastsafetensors loader: Faster model weight loading via integrated safetensors support
  • Tool calling and reasoning parser: Support for function calling patterns in the serving layer
  • Tele-FLM model support: New model architecture added
  • Pipeline parallel for TransformersModel: Broader multi-GPU support
  • CUDA graph support for Llama 3.2 vision models: Reduced per-request overhead for multimodal inference
  • OpenVINO removed: Moved to an external plugin architecture
  • Triton attention backend on NVIDIA GPUs: the ROCm-oriented Triton attention backend enabled on NVIDIA hardware, broadening backend coverage
Summary of vLLM 0.8.2's key improvements: OpenAI-compatible serving layer with new tool calling and reasoning parser; V1 engine memory leak and duplicate-request-ID crash fixed; speculative decoding with top-p/top-k sampling and improved N-gram defaults; new FP8 KV cache (~50% memory reduction vs FP16); hardware coverage across TPU (Pallas), NVIDIA, ROCm, and CPU; fastsafetensors loading, pipeline parallelism, and CUDA graphs for vision models. The release had 65+ contributors, including 16 first-time contributors, and upgrading was recommended as critical.

How to Install vLLM 0.8.2

For teams that need this specific version (for compatibility or reproducibility), pin it directly:

pip install vllm==0.8.2

For most users today, the latest stable release is recommended instead:

pip install vllm --upgrade

As of April 2026, vLLM is at v0.19.x with significantly more features including gRPC serving, Gemma 4 support, and async scheduling by default. See the full April 2026 vLLM update for details on the latest releases.

Should You Still Use vLLM 0.8.2?

In most cases, no. vLLM 0.8.2 is over a year old and the project has shipped major improvements since then. However, there are valid reasons to run this version:

  • Reproducibility: If your production benchmarks and validation were done on 0.8.2, upgrading introduces variables
  • Dependency constraints: Some deployment environments pin transitive dependencies that conflict with newer vLLM versions
  • Hardware compatibility: Certain older CUDA toolkit versions may work better with 0.8.x

If you are still on 0.8.2, the upgrade path to current versions involves checking for breaking changes in each minor release. The most impactful changes landed in v0.9 (scheduler rewrite), v0.12 (V1 engine default), and v0.18 (OpenVINO fully removed, gRPC added).

vLLM Version History: 0.8.x in Context

The 0.8.x series was a transitional period for vLLM. The V1 engine was being stabilized, hardware support was expanding rapidly, and the structured output system was maturing. The 0.8.2 release was notable because it fixed a critical memory bug that had been causing production outages for teams running the V1 engine at scale.

| Milestone | Version | When |
|---|---|---|
| Initial V1 engine | v0.7.x | Early 2025 |
| V1 memory stability | v0.8.2 | March 2025 |
| V1 as default engine | v0.12.x | Mid 2025 |
| gRPC serving | v0.18.0 | Late March 2026 |
| Async scheduling default | v0.19.0 | April 2026 |

The full release changelog is available on the vLLM GitHub releases page.
