llama.cpp Releases in April 2026: Tensor Parallelism, 1-Bit Quantization, and More
April 2026 brought over 170 incremental releases to llama.cpp (builds b8607 through b8779), with several headline features that change how you run local models. The project, now hosted under ggml-org/llama.cpp on GitHub, shipped backend-agnostic tensor parallelism, 1-bit quantization, day-one Gemma 4 support, and hardware backends for AMD CDNA4 and Qualcomm Hexagon.
This post covers every significant change, organized by feature area, with the build numbers and dates you need to track what landed when.
Release Timeline Overview
| Build | Date | Headline Feature |
|---|---|---|
| b8607 | Apr 1 | Walsh-Hadamard KV cache rotation, Flash Attention head dim 512 |
| b8634 | Apr 2 | Granite 4.0 chat template with tool-calling |
| b8641 | Apr 2 | Gemma 4 vision + MoE support (launch day) |
| b8658 | Apr 4 | KV cache clearing for idle slots (--clear-idle) |
| b8682 | Apr 6 | Q1_0 1-bit quantization (CPU) |
| b8705 | Apr 8 | Step3-VL-10B, fused QKV tensors |
| b8738 | Apr 9 | Backend-agnostic tensor parallelism (NCCL/RCCL) |
| b8739 | Apr 9 | AMD Instinct MI350X/MI355X CDNA4 support |
| b8755 | Apr 11 | Hexagon/Snapdragon backend for Linux |
| b8762 | Apr 11 | MERaLiON-2 speech QA models |
| b8766 | Apr 12 | Gemma 4 audio with Conformer encoder |
| b8769 | Apr 12 | Qwen3 audio support |
| b8779 | Apr 13 | Vulkan Flash Attention DP4A shader |
Backend-Agnostic Tensor Parallelism (b8738)
The single biggest feature of April 2026. Build b8738, merged on April 9, adds true tensor parallelism across multiple GPUs without being locked to a specific vendor.
Previous multi-GPU support in llama.cpp relied on layer splitting, where each GPU handled a subset of transformer layers. Tensor parallelism splits individual operations across GPUs, which means every GPU stays busy on every token. The result is significantly lower latency per token compared to layer splitting.
The implementation uses NCCL (NVIDIA) and RCCL (AMD) for topology-aware communication, automatically detecting NVLink and PCIe connections between GPUs. It works with CUDA, ROCm, and other backends.
Benchmarks from the PR show 3-4x performance gains over standard layer-parallel methods, with GPUs pegged at 100% utilization. Models with MoE architectures (Qwen 3 MoE, Llama 4) benefit the most because expert routing distributes work unevenly across layers, which hurts layer splitting but not tensor parallelism.
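The distinction can be sketched in a few lines of plain Python. This is a conceptual illustration, not llama.cpp code: each "device" holds a slice of a weight matrix's rows, computes its slice of the output independently, and the slices are concatenated (the all-gather that NCCL/RCCL performs across real GPUs).

```python
def matvec(W, x):
    """Dense y = W @ x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, n_devices=2):
    """Split W's rows across devices: each computes its slice of y
    independently, then an all-gather concatenates the slices."""
    chunk = len(W) // n_devices
    shards = [W[i * chunk:(i + 1) * chunk] for i in range(n_devices)]
    partials = [matvec(shard, x) for shard in shards]  # parallel on real GPUs
    return [y for part in partials for y in part]      # the "all-gather"

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
assert tensor_parallel_matvec(W, x) == matvec(W, x)  # [3, 7, 11, 15]
```

Layer splitting, by contrast, gives each device whole layers, so during token generation only one device works at a time while the others wait; splitting every matrix keeps all devices busy on every token, which is where the latency win comes from.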
Q1_0: 1-Bit Quantization (b8682)
Build b8682 (April 6) introduced Q1_0, a new 1-bit quantization format. This is not a general-purpose format for arbitrary models. It targets models specifically trained for 1-bit inference, such as the "Bonsai" family of models that use binary weight training.
Backend support rolled out progressively through April:
| Backend | Build | Date |
|---|---|---|
| CPU | b8682 | Apr 6 |
| Metal (Apple Silicon) | b8712 | Apr 8 |
| Vulkan | b8742 | Apr 9 |
| SYCL (Intel) | b8771 | Apr 12 |
The practical use case is running capable models on extremely constrained hardware. A 7B-parameter model with Q1_0 quantization fits in under 1 GB of memory, making it viable on devices that cannot handle even Q4 quantized models.
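To make the memory math concrete: at one sign bit per weight, 7 billion parameters occupy about 7e9 / 8 ≈ 0.87 GB before per-block scales. The sketch below shows the general shape of block-wise 1-bit quantization; it is an illustration only, and the actual Q1_0 layout in ggml (block size, scale format) will differ.

```python
def quantize_1bit(weights, block_size=32):
    """Per block: one mean-absolute scale plus one sign bit per weight."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = sum(abs(w) for w in block) / len(block)
        bits = 0
        for j, w in enumerate(block):
            if w >= 0:
                bits |= 1 << j        # bit set -> positive weight
        blocks.append((scale, bits))
    return blocks

def dequantize_1bit(blocks, block_size=32):
    """Every weight is reconstructed as +scale or -scale."""
    out = []
    for scale, bits in blocks:
        for j in range(block_size):
            out.append(scale if (bits >> j) & 1 else -scale)
    return out

q = quantize_1bit([0.5, -0.25, 0.75, -0.5], block_size=4)
assert dequantize_1bit(q, block_size=4) == [0.5, -0.5, 0.5, -0.5]
```

The assert makes the information loss explicit: all magnitudes within a block collapse to a single scale, which is why the format only suits models trained with binary weights in mind.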
Walsh-Hadamard KV Cache Rotation (b8607)
Build b8607 (April 1) merged PR #21038, which applies Walsh-Hadamard matrix rotation to Q, K, and V activations before KV cache quantization. The inverse rotation is applied after retrieval. This is a mathematical trick that spreads information more evenly across dimensions, which means quantization loses less critical data.
The impact on reasoning benchmarks is dramatic:
| KV Cache Format | AIME25 Before | AIME25 After | Improvement |
|---|---|---|---|
| Q4_0 | 0.0% | 21.7% | From broken to usable |
| Q8_0 | 31.7% | 37.1% | +5.4 percentage points |
| FP16 (baseline) | 38.3% | 38.3% | No change |
Q4_0 KV cache was previously unusable for reasoning tasks because quantization destroyed the information the model needed for multi-step logic. With Hadamard rotation, Q4_0 becomes viable, reducing KV cache VRAM usage to roughly a quarter of FP16's.
The feature is backend-agnostic and enabled by default. If you need to disable it (for debugging or benchmarking), set the `LLAMA_ATTN_ROT_DISABLE` environment variable. The only constraint: model head dimensions must be divisible by 64 for Metal compatibility.
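The intuition is easy to demonstrate. In the toy sketch below (not the llama.cpp kernel), a vector with one outlier is quantized directly and then again inside an orthonormal Walsh-Hadamard rotation; because the transform spreads the outlier across all dimensions, the absmax quantizer no longer wastes its entire range on a single value.

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; being symmetric and
    orthonormal, it is its own inverse. len(v) must be a power of 2."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(len(v))
    return [x * s for x in v]

def quantize(v, levels=16):
    """Crude absmax round-to-nearest quantizer (4-bit-like)."""
    step = max(abs(x) for x in v) / (levels / 2)
    return [round(x / step) * step for x in v]

def sq_err(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

x = [8.0] + [0.4] * 7              # one outlier dominates the range
direct = quantize(x)               # the seven small values collapse to 0.0
rotated = fwht(quantize(fwht(x)))  # rotate -> quantize -> rotate back

assert sq_err(rotated, x) < sq_err(direct, x)
```

The power-of-2 transform size in the sketch mirrors the head-dimension divisibility constraint mentioned above, since the rotation is applied at a fixed block size.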
Gemma 4: Day-One Support Through Audio
Google released Gemma 4 on April 2, and llama.cpp had vision and MoE support ready on launch day (PR #21309). Support deepened throughout the month:
- b8641 (Apr 2): Chat template fixes for Gemma 4 conversation format
- b8662 (Apr 4): Final logit softcapping parameter reading
- b8665 (Apr 5): Tool-call parser with JSON output and interleaved thinking
- b8678 (Apr 6): Byte token handling in BPE detokenizer
- b8744 (Apr 10): Reasoning budget sampler
- b8766 (Apr 12): Audio support with Conformer encoder (E2B/E4B variants)
- b8775 (Apr 13): Audio model switched to causal attention
The audio support in b8766 is notable. It implements a 12-layer USM-style Conformer architecture with FFN, self-attention, causal Conv1D, and 128-bin HTK mel preprocessing. This means you can run Gemma 4 with voice input locally, without any cloud API.
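For reference, the HTK mel scale used by that preprocessing step is mel(f) = 2595 · log10(1 + f/700). The sketch below computes the edge frequencies for a 128-filter mel bank; the frequency-range values are illustrative assumptions, not the exact Gemma 4 front-end parameters.

```python
import math

def hz_to_mel(f):
    """HTK mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_mels=128, f_min=0.0, f_max=8000.0):
    """n_mels triangular filters need n_mels + 2 edge frequencies,
    spaced uniformly on the mel scale between f_min and f_max."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

edges = mel_filter_edges()
assert len(edges) == 130
assert abs(edges[0] - 0.0) < 1e-9 and abs(edges[-1] - 8000.0) < 1e-6
```

Uniform spacing in mel space means the filters get progressively wider in Hz toward high frequencies, matching how human pitch perception compresses.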
AMD MI350X and MI355X Support (b8739)
Build b8739 (April 9) added support for AMD's Instinct MI350X and MI355X accelerators, which use the CDNA4 architecture (gfx950). The implementation includes MFMA instruction routing optimized for the new silicon.
Early benchmarks on the MI355X with ROCm 7.0.1 show strong numbers: 40,013 tokens per second for prompt processing and 254 tokens per second for text generation on Qwen2.5-1.5B with Q4_K_M quantization.
This matters for anyone building local inference infrastructure on AMD hardware. The MI350X series is AMD's answer to NVIDIA's H100/H200 for datacenter inference, and same-week llama.cpp support means you can run GGUF models on this hardware without waiting for vendor-specific toolchains.
New Model Support
April brought first-class support for several new model families:
| Model | Build | Notes |
|---|---|---|
| Gemma 4 (vision + MoE) | b8641 | Launch-day support |
| Gemma 4 Audio (E2B/E4B) | b8766 | Conformer encoder, mel preprocessing |
| Granite 4.0 | b8634 | Chat template with tool-calling |
| HunyuanOCR | b8670 | Perceiver-based vision projector |
| Step3-VL-10B | b8705 | Fused QKV tensors |
| MERaLiON-2 (3B/10B) | b8762 | Whisper large-v2 audio encoder, speech QA |
| Qwen3 Audio | b8769 | Audio preprocessing pipeline |
| Qwen3-Next MoE | b8738 | Tensor parallelism compatible |
The trend is clear: llama.cpp is no longer just a text inference engine. Audio and vision model support is becoming a standard expectation, and the project is keeping pace with model releases from Google, Alibaba, and IBM.
Hardware Backend Updates
Hexagon/Snapdragon for Linux (b8755)
Build b8755 (April 11) added Linux support for Qualcomm's Hexagon NPU, targeting Snapdragon-powered laptops and edge devices. Earlier builds in April added cumulative sum (b8628) and argsort optimization (b8672) operations for the Hexagon backend, with b8754 bringing operation request batching and buffer/cache management improvements.
WebGPU Improvements
The WebGPU backend received steady improvements throughout April:
- b8607: Quantized buffers for wider browser/device support
- b8639: Vectorized flash attention
- b8660: Single buffer with offsets (replacing parameter pools)
- b8683: MUL_MAT_ID operations for MoE models
- b8749: F16 numerical stability, NaN canonicalization
- b8750: Non-square subgroup matrix support for Intel GPUs
Intel OpenVINO Backend
OpenVINO 2026.1 (announced April 8) includes a preview backend for llama.cpp, enabling optimized inference across Intel CPUs, GPUs, and NPUs. Validated models include Llama 3.2 1B, Phi-3 mini, Qwen 2.5 1.5B, and Mistral 7B in GGUF format. It works on Intel Core Ultra Series 1/2 AI PCs and Arc Pro B70 32GB GPUs.
Performance Highlights
Several targeted optimizations shipped across backends:
| Optimization | Build | Impact |
|---|---|---|
| SYCL Q8_0 reorder | b8685 | ~3x throughput on Intel Arc (4.88 to 15.24 t/s) |
| CUDA Flash Attention stream_k | b8680 | Improved kernel scheduling |
| CUDA multiplication fusion | b8740 | Fused ops for better throughput |
| CUDA ds_read_b128 | b8701 | Vectorized LDS loads, gains on MI50/RX6800XT |
| CUDA graph optimizations | b8702 | Hash-based property checking |
| Flash Attention head dim 512 | b8609 (CUDA), b8724 (SYCL) | Support for models with larger head dims |
| Vulkan FA dequantization | b8690 | Q4_1, Q5_0, Q5_1, IQ4_NL support |
The SYCL Q8_0 reorder in b8685 stands out: it pushed bandwidth utilization from 21% to 66% on Intel Arc GPUs, translating directly to a 3x throughput increase for Qwen3.5-27B.
Server and Infrastructure Changes
- b8658: `--clear-idle` flag for KV cache clearing on idle slots, useful for VRAM optimization in multi-user setups
- b8748: Model alias conflict fix for preset configurations
- b8756: Structured output JSON schema `$ref` resolution fix
- b8777: Router mode build info endpoint
- b8778: Download cancellation and temp file cleanup
- b8752: Download progress callback interface
- b8625: WebUI API key bypass for static assets
What This Means for Local AI
April 2026 marks a shift in what llama.cpp can do. Tensor parallelism makes multi-GPU setups practical without vendor lock-in. Q1_0 quantization opens the door to models on extremely constrained devices. Audio and vision model support means local multimodal inference is becoming routine rather than experimental.
If you are running local models for desktop agent workflows, the combination of tensor parallelism (for speed) and Walsh-Hadamard KV cache rotation (for memory efficiency) is the most impactful upgrade path this month. Both features work with existing GGUF model files and do not require re-quantization.
For the latest builds and full changelogs, check the releases page at github.com/ggml-org/llama.cpp/releases.