llama.cpp Releases in April 2026: Tensor Parallelism, 1-Bit Quantization, and More
April 2026 brought over 170 incremental releases to llama.cpp (builds b8607 through b8779), with several headline features that change how you run local models. The project, now hosted under ggml-org/llama.cpp on GitHub, shipped backend-agnostic tensor parallelism, 1-bit quantization, day-one Gemma 4 support, and hardware backends for AMD CDNA4 and Qualcomm Hexagon.
This post covers every significant change, organized by feature area, with the build numbers and dates you need to track what landed when.
Release Timeline Overview
| Build | Date | Headline Feature |
|---|---|---|
| b8607 | Apr 1 | Walsh-Hadamard KV cache rotation, Flash Attention head dim 512 |
| b8634 | Apr 2 | Granite 4.0 chat template with tool-calling |
| b8641 | Apr 2 | Gemma 4 vision + MoE support (launch day) |
| b8658 | Apr 4 | KV cache clearing for idle slots (--clear-idle) |
| b8682 | Apr 6 | Q1_0 1-bit quantization (CPU) |
| b8705 | Apr 8 | Step3-VL-10B, fused QKV tensors |
| b8738 | Apr 9 | Backend-agnostic tensor parallelism (NCCL/RCCL) |
| b8739 | Apr 9 | AMD Instinct MI350X/MI355X CDNA4 support |
| b8755 | Apr 11 | Hexagon/Snapdragon backend for Linux |
| b8762 | Apr 11 | MERaLiON-2 speech QA models |
| b8766 | Apr 12 | Gemma 4 audio with Conformer encoder |
| b8769 | Apr 12 | Qwen3 audio support |
| b8779 | Apr 13 | Vulkan Flash Attention DP4A shader |
Backend-Agnostic Tensor Parallelism (b8738)
The single biggest feature of April 2026. Build b8738, merged on April 9, adds true tensor parallelism across multiple GPUs without being locked to a specific vendor.
Previous multi-GPU support in llama.cpp relied on layer splitting, where each GPU handled a subset of transformer layers. Tensor parallelism splits individual operations across GPUs, which means every GPU stays busy on every token. The result is significantly lower latency per token compared to layer splitting.
The implementation uses NCCL (NVIDIA) and RCCL (AMD) for topology-aware communication, automatically detecting NVLink and PCIe connections between GPUs. It works with CUDA, ROCm, and other backends.
Benchmarks from the PR show 3-4x performance gains over standard layer-parallel methods, with GPUs pegged at 100% utilization. Models with MoE architectures (Qwen 3 MoE, Llama 4) benefit the most because expert routing distributes work unevenly across layers, which hurts layer splitting but not tensor parallelism.
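The distinction can be sketched in a few lines of plain Python. This is a conceptual illustration, not llama.cpp code: each "device" holds a slice of a weight matrix's rows, computes its slice of the output independently, and the slices are concatenated (the all-gather that NCCL/RCCL performs across real GPUs).

```python
def matvec(W, x):
    """Dense y = W @ x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, n_devices=2):
    """Split W's rows across devices: each computes its slice of y
    independently, then an all-gather concatenates the slices."""
    chunk = len(W) // n_devices
    shards = [W[i * chunk:(i + 1) * chunk] for i in range(n_devices)]
    partials = [matvec(shard, x) for shard in shards]  # parallel on real GPUs
    return [y for part in partials for y in part]      # the "all-gather"

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
assert tensor_parallel_matvec(W, x) == matvec(W, x)  # [3, 7, 11, 15]
```

Layer splitting, by contrast, gives each device whole layers, so during token generation only one device works at a time while the others wait; splitting every matrix keeps all devices busy on every token, which is where the latency win comes from.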
Q1_0: 1-Bit Quantization (b8682)
Build b8682 (April 6) introduced Q1_0, a new 1-bit quantization format. This is not a general-purpose format for arbitrary models. It targets models specifically trained for 1-bit inference, such as the "Bonsai" family of models that use binary weight training.
Backend support rolled out progressively through April:
| Backend | Build | Date |
|---|---|---|
| CPU | b8682 | Apr 6 |
| Metal (Apple Silicon) | b8712 | Apr 8 |
| Vulkan | b8742 | Apr 9 |
| SYCL (Intel) | b8771 | Apr 12 |
The practical use case is running capable models on extremely constrained hardware. A 7B-parameter model with Q1_0 quantization fits in under 1 GB of memory, making it viable on devices that cannot handle even Q4 quantized models.
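To make the memory math concrete: at one sign bit per weight, 7 billion parameters occupy about 7e9 / 8 ≈ 0.87 GB before per-block scales. The sketch below shows the general shape of block-wise 1-bit quantization; it is an illustration only, and the actual Q1_0 layout in ggml (block size, scale format) will differ.

```python
def quantize_1bit(weights, block_size=32):
    """Per block: one mean-absolute scale plus one sign bit per weight."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = sum(abs(w) for w in block) / len(block)
        bits = 0
        for j, w in enumerate(block):
            if w >= 0:
                bits |= 1 << j        # bit set -> positive weight
        blocks.append((scale, bits))
    return blocks

def dequantize_1bit(blocks, block_size=32):
    """Every weight is reconstructed as +scale or -scale."""
    out = []
    for scale, bits in blocks:
        for j in range(block_size):
            out.append(scale if (bits >> j) & 1 else -scale)
    return out

q = quantize_1bit([0.5, -0.25, 0.75, -0.5], block_size=4)
assert dequantize_1bit(q, block_size=4) == [0.5, -0.5, 0.5, -0.5]
```

The assert makes the information loss explicit: all magnitudes within a block collapse to a single scale, which is why the format only suits models trained with binary weights in mind.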
Walsh-Hadamard KV Cache Rotation (b8607)
Build b8607 (April 1) merged PR #21038, which applies Walsh-Hadamard matrix rotation to Q, K, and V activations before KV cache quantization. The inverse rotation is applied after retrieval. This is a mathematical trick that spreads information more evenly across dimensions, which means quantization loses less critical data.
The impact on reasoning benchmarks is dramatic:
| KV Cache Format | AIME25 Before | AIME25 After | Improvement |
|---|---|---|---|
| Q4_0 | 0.0% | 21.7% | From broken to usable |
| Q8_0 | 31.7% | 37.1% | +5.4 percentage points |
| FP16 (baseline) | 38.3% | 38.3% | No change |
Q4_0 KV cache was previously unusable for reasoning tasks because quantization destroyed the information the model needed for multi-step logic. With Hadamard rotation, Q4_0 becomes viable, reducing KV cache VRAM usage to roughly a quarter of FP16's.
The feature is backend-agnostic and enabled by default. If you need to disable it (for debugging or benchmarking), set the `LLAMA_ATTN_ROT_DISABLE` environment variable. The only constraint: model head dimensions must be divisible by 64 for Metal compatibility.
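The intuition is easy to demonstrate. In the toy sketch below (not the llama.cpp kernel), a vector with one outlier is quantized directly and then again inside an orthonormal Walsh-Hadamard rotation; because the transform spreads the outlier across all dimensions, the absmax quantizer no longer wastes its entire range on a single value.

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; being symmetric and
    orthonormal, it is its own inverse. len(v) must be a power of 2."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(len(v))
    return [x * s for x in v]

def quantize(v, levels=16):
    """Crude absmax round-to-nearest quantizer (4-bit-like)."""
    step = max(abs(x) for x in v) / (levels / 2)
    return [round(x / step) * step for x in v]

def sq_err(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

x = [8.0] + [0.4] * 7              # one outlier dominates the range
direct = quantize(x)               # the seven small values collapse to 0.0
rotated = fwht(quantize(fwht(x)))  # rotate -> quantize -> rotate back

assert sq_err(rotated, x) < sq_err(direct, x)
```

The power-of-2 transform size in the sketch mirrors the head-dimension divisibility constraint mentioned above, since the rotation is applied at a fixed block size.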
Gemma 4: Day-One Support Through Audio
Google released Gemma 4 on April 2, and llama.cpp had vision and MoE support ready on launch day (PR #21309). Support deepened throughout the month:
- b8641 (Apr 2): Chat template fixes for Gemma 4 conversation format
- b8662 (Apr 4): Final logit softcapping parameter reading
- b8665 (Apr 5): Tool-call parser with JSON output and interleaved thinking
- b8678 (Apr 6): Byte token handling in BPE detokenizer
- b8744 (Apr 10): Reasoning budget sampler
- b8766 (Apr 12): Audio support with Conformer encoder (E2B/E4B variants)
- b8775 (Apr 13): Audio model switched to causal attention
The audio support in b8766 is notable. It implements a 12-layer USM-style Conformer architecture with FFN, self-attention, causal Conv1D, and 128-bin HTK mel preprocessing. This means you can run Gemma 4 with voice input locally, without any cloud API.
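For reference, the HTK mel scale used by that preprocessing step is mel(f) = 2595 · log10(1 + f/700). The sketch below computes the edge frequencies for a 128-filter mel bank; the frequency-range values are illustrative assumptions, not the exact Gemma 4 front-end parameters.

```python
import math

def hz_to_mel(f):
    """HTK mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_mels=128, f_min=0.0, f_max=8000.0):
    """n_mels triangular filters need n_mels + 2 edge frequencies,
    spaced uniformly on the mel scale between f_min and f_max."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

edges = mel_filter_edges()
assert len(edges) == 130
assert abs(edges[0] - 0.0) < 1e-9 and abs(edges[-1] - 8000.0) < 1e-6
```

Uniform spacing in mel space means the filters get progressively wider in Hz toward high frequencies, matching how human pitch perception compresses.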
AMD MI350X and MI355X Support (b8739)
Build b8739 (April 9) added support for AMD's Instinct MI350X and MI355X accelerators, which use the CDNA4 architecture (gfx950). The implementation includes MFMA instruction routing optimized for the new silicon.
Early benchmarks on the MI355X with ROCm 7.0.1 show strong numbers: 40,013 tokens per second for prompt processing and 254 tokens per second for text generation on Qwen2.5-1.5B with Q4_K_M quantization.
This matters for anyone building local inference infrastructure on AMD hardware. The MI350X series is AMD's answer to NVIDIA's H100/H200 for datacenter inference, and same-week llama.cpp support means you can run GGUF models on this hardware without waiting for vendor-specific toolchains.
New Model Support
April brought first-class support for several new model families:
| Model | Build | Notes |
|---|---|---|
| Gemma 4 (vision + MoE) | b8641 | Launch-day support |
| Gemma 4 Audio (E2B/E4B) | b8766 | Conformer encoder, mel preprocessing |
| Granite 4.0 | b8634 | Chat template with tool-calling |
| HunyuanOCR | b8670 | Perceiver-based vision projector |
| Step3-VL-10B | b8705 | Fused QKV tensors |
| MERaLiON-2 (3B/10B) | b8762 | Whisper large-v2 audio encoder, speech QA |
| Qwen3 Audio | b8769 | Audio preprocessing pipeline |
| Qwen3-Next MoE | b8738 | Tensor parallelism compatible |
The trend is clear: llama.cpp is no longer just a text inference engine. Audio and vision model support is becoming a standard expectation, and the project is keeping pace with model releases from Google, Alibaba, and IBM.
Hardware Backend Updates
Hexagon/Snapdragon for Linux (b8755)
Build b8755 (April 11) added Linux support for Qualcomm's Hexagon NPU, targeting Snapdragon-powered laptops and edge devices. Earlier builds in April added cumulative sum (b8628) and argsort optimization (b8672) operations for the Hexagon backend, with b8754 bringing operation request batching and buffer/cache management improvements.
WebGPU Improvements
The WebGPU backend received steady improvements throughout April:
- b8607: Quantized buffers for wider browser/device support
- b8639: Vectorized flash attention
- b8660: Single buffer with offsets (replacing parameter pools)
- b8683: MUL_MAT_ID operations for MoE models
- b8749: F16 numerical stability, NaN canonicalization
- b8750: Non-square subgroup matrix support for Intel GPUs
Intel OpenVINO Backend
OpenVINO 2026.1 (announced April 8) includes a preview backend for llama.cpp, enabling optimized inference across Intel CPUs, GPUs, and NPUs. Validated models include Llama 3.2 1B, Phi-3 mini, Qwen 2.5 1.5B, and Mistral 7B in GGUF format. It works on Intel Core Ultra Series 1/2 AI PCs and Arc Pro B70 32GB GPUs.
Performance Highlights
Several targeted optimizations shipped across backends:
| Optimization | Build | Impact |
|---|---|---|
| SYCL Q8_0 reorder | b8685 | ~3x throughput on Intel Arc (4.88 to 15.24 t/s) |
| CUDA Flash Attention stream_k | b8680 | Improved kernel scheduling |
| CUDA multiplication fusion | b8740 | Fused ops for better throughput |
| CUDA ds_read_b128 | b8701 | Vectorized LDS loads, gains on MI50/RX6800XT |
| CUDA graph optimizations | b8702 | Hash-based property checking |
| Flash Attention head dim 512 | b8609 (CUDA), b8724 (SYCL) | Support for models with larger head dims |
| Vulkan FA dequantization | b8690 | Q4_1, Q5_0, Q5_1, IQ4_NL support |
The SYCL Q8_0 reorder in b8685 stands out: it pushed bandwidth utilization from 21% to 66% on Intel Arc GPUs, translating directly to a 3x throughput increase for Qwen3.5-27B.
Server and Infrastructure Changes
- b8658: `--clear-idle` flag for KV cache clearing on idle slots, useful for VRAM optimization in multi-user setups
- b8748: Model alias conflict fix for preset configurations
- b8756: Structured output JSON schema `$ref` resolution fix
- b8777: Router mode build info endpoint
- b8778: Download cancellation and temp file cleanup
- b8752: Download progress callback interface
- b8625: WebUI API key bypass for static assets
What This Means for Local AI
April 2026 marks a shift in what llama.cpp can do. Tensor parallelism makes multi-GPU setups practical without vendor lock-in. Q1_0 quantization opens the door to models on extremely constrained devices. Audio and vision model support means local multimodal inference is becoming routine rather than experimental.
If you are running local models for desktop agent workflows, the combination of tensor parallelism (for speed) and Walsh-Hadamard KV cache rotation (for memory efficiency) is the most impactful upgrade path this month. Both features work with existing GGUF model files and do not require re-quantization.
For the latest builds and full changelogs, check the releases page at github.com/ggml-org/llama.cpp/releases.