LLM quantization updates 2026
A short reference of the formats and milestones that actually changed this year, with dates and primary sources for each. At the end, the one Fazm setting that lets a quantized local model drive a real Mac desktop agent.
What is new in LLM quantization in 2026
Six things, in rough order of impact for someone running models locally:
- NVFP4 (Nvidia Blackwell-native FP4) merged into llama.cpp through PRs from late March into April 2026; the Blackwell tensor-core dispatch path is PR #22196.
- MXFP4 support shipped in LLM Compressor 0.9.0 (Jan 2026) with hardware support starting to appear in edge silicon including Apple A19 Pro.
- AWQ batched calibration in LLM Compressor 0.9.0 cuts calibration time by ~3x at batch_size=32.
- TurboQuant (Zandieh et al., ICLR 2026) pushes aggressive KV-cache quantization, prototyped against llama.cpp in Discussion #20969.
- BitNet b1.58 (Microsoft) is production-ready up to 8B parameters, ternary weights trained natively from scratch.
- MXFP6 mixed precision with MXFP4 is being explored as the next accuracy lever (Discussion #22498).
Authoritative sources: llama.cpp Discussion #22042, LLM Compressor 0.9.0 release notes, microsoft/BitNet.
Why this list looks short
Most year-in-review pieces on this topic conflate three different things: weight format, calibration algorithm, and runtime support. They list two dozen acronyms and leave the reader unable to tell which ones moved in the last twelve months. The actual 2026 surface area is small. Six bullets cover what is genuinely new since January.
The frame I find useful: weight format (NVFP4, MXFP4, MXFP6, ternary) is a hardware question. Calibration (AWQ, GPTQ, GGUF k-quants) is a software question. KV-cache quantization (TurboQuant) is a memory question. Each axis moved in 2026, but they moved separately.
The six updates, with dates
NVFP4
Late March to April 2026. Nvidia Blackwell-native 4-bit floating point. 16-value blocks, two-level scale (E4M3 per block plus FP32 tensor-wide). Mainline llama.cpp has the merged kernels (CUDA dp4a, MMQ, SYCL, Vulkan); the Blackwell tensor-core dispatch is in PR #22196 and depends on SM120 support.
3.5x footprint reduction vs FP16, 1.8x vs FP8, under 1 percent accuracy loss.
Real speed needs Blackwell. On consumer Apple Silicon, NVFP4 is a software fallback path with no native dispatch.
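To make the two-level scale concrete, here is a small Python sketch of NVFP4-style encoding. It illustrates the block structure only: the per-block scale is kept as a plain float rather than E4M3, and rounding is simplified, so do not read it as the llama.cpp kernel.

```python
import numpy as np

# Illustrative NVFP4-style two-level block scaling (simplified).
# Real NVFP4 stores the per-block scale in E4M3; here it is a plain float.

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quantize_block(block, tensor_scale):
    scaled = block / tensor_scale                        # second level: FP32 tensor-wide scalar
    block_scale = np.abs(scaled).max() / FP4_GRID.max()  # first level: one scale per 16 values
    codes = np.array([np.argmin(np.abs(FP4_GRID - abs(v) / block_scale)) for v in scaled])
    return np.sign(scaled) * FP4_GRID[codes] * block_scale * tensor_scale

weights = np.random.randn(64).astype(np.float32)
tensor_scale = float(np.abs(weights).max()) / FP4_GRID.max()
recon = np.concatenate([fake_quantize_block(weights[i:i+16], tensor_scale)
                        for i in range(0, weights.size, 16)])
print("max abs reconstruction error:", np.abs(weights - recon).max())
```

The point is the structure: one coarse FP32 scalar for the whole tensor, then a fine-grained scale for every 16 values.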
MXFP4
Jan 2026 (LLM Compressor 0.9.0). OCP-standard microscaling FP4. 32-value blocks with a power-of-two E8M0 scaling factor. LLM Compressor 0.9.0 added the MXFP4 preset and an MXFP4PackedCompressor that packs weights and scales as uint8 tensors. Hardware support is starting to appear in edge silicon, including Apple A19 Pro.
Open standard, simpler to implement than NVFP4, hardware path on consumer Macs is materializing.
Larger block size than NVFP4 means slightly more quantization error in the same nominal bit budget.
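For contrast with the NVFP4 sketch above, here is the MXFP4 equivalent under the same simplifications: 32-value blocks and a single power-of-two scale per block, no tensor-wide second level. Again a sketch of the scheme, not a real kernel.

```python
import numpy as np

# Illustrative MXFP4-style scaling: 32-value blocks, one power-of-two (E8M0)
# scale per block, and no second-level tensor scalar. Rounding is simplified.

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quantize_mx_block(block):
    # E8M0 scale: round the needed scale up to the nearest power of two
    scale = 2.0 ** np.ceil(np.log2(np.abs(block).max() / FP4_GRID.max()))
    codes = np.array([np.argmin(np.abs(FP4_GRID - abs(v) / scale)) for v in block])
    return np.sign(block) * FP4_GRID[codes] * scale

weights = np.random.randn(64).astype(np.float32)
recon = np.concatenate([fake_quantize_mx_block(weights[i:i+32])
                        for i in range(0, weights.size, 32)])
print("max abs reconstruction error:", np.abs(weights - recon).max())
```

The power-of-two scale and the wider block are what make MXFP4 cheaper to implement in hardware, and also why it gives up a little accuracy to NVFP4 at the same nominal 4 bits.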
AWQ batched calibration
Jan 2026 (LLM Compressor 0.9.0). AWQ as an algorithm did not change. The calibration pass did. Batch size of 32 yields roughly a 3x speedup on large models because AWQ depends on many onloaded forward passes per layer.
Faster recalibration of fine-tunes. AWQ remains the default INT4 format for production inference in 2026.
It is a calibration speedup, not a quality jump. Output quality versus an existing AWQ checkpoint is the same.
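A toy cost model shows why batching helps and why the win is roughly 3x rather than 32x: every forward pass pays a fixed per-layer onloading cost, and batching amortizes that cost over more samples while the per-sample compute stays. The constants here are arbitrary, chosen only to make the shape of the trade-off visible.

```python
# Toy cost model for AWQ calibration wall time. Constants are made up; the
# point is that batching amortizes the fixed per-pass onloading cost.

def calibration_time(n_samples, n_layers, batch_size,
                     onload_cost=1.0, per_sample_cost=0.5):
    n_passes = -(-n_samples // batch_size)   # ceil division
    return n_layers * n_passes * (onload_cost + batch_size * per_sample_cost)

baseline = calibration_time(n_samples=512, n_layers=80, batch_size=1)
for bs in (1, 8, 32):
    t = calibration_time(n_samples=512, n_layers=80, batch_size=bs)
    print(f"batch_size={bs:>2}: {baseline / t:.1f}x faster than unbatched")
```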
TurboQuant (KV cache)
ICLR 2026 paper; llama.cpp Discussion #20969 active. Extreme KV-cache quantization rather than weight quantization. KV cache is what dominates memory growth in long context, and earlier KV-quant schemes (Q4 KV, Q8 KV) topped out before they hit useful compression on consumer hardware.
Long agent sessions stay in memory on a 24-32 GB Mac instead of OOM after the conversation grows.
Not yet merged into mainline llama.cpp. Treat as a forward-looking lever, not something you turn on today.
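Rough arithmetic makes the point. For a hypothetical 13B-class model with full multi-head KV (no GQA), the KV cache alone at 32k tokens of context looks like this; the model shape is invented for illustration, but the scaling with bits per value is the part that matters.

```python
# Back-of-envelope KV-cache size as a function of bits per cached value.
# The model shape is a hypothetical 13B-class config, not a real checkpoint.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    total_bits = 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value  # K and V
    return total_bits / 8 / 1024**3

cfg = dict(n_layers=40, n_kv_heads=40, head_dim=128, seq_len=32_000)
for label, bits in [("FP16", 16), ("Q8 KV", 8), ("Q4 KV", 4), ("2-bit", 2)]:
    print(f"{label:<6} KV cache: {kv_cache_gib(**cfg, bits_per_value=bits):5.1f} GiB")
```

At FP16 that cache alone exceeds the usable memory of a 24 GB Mac before the weights are even loaded, which is why KV quantization, not weight quantization, is the structural fix for long sessions.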
BitNet b1.58
Production 2026 (Microsoft/BitNet repo). Ternary weights (-1, 0, +1) trained natively from scratch. log2(3) ≈ 1.58 bits per parameter. Multiplications collapse to additions, which lets it run usefully on CPU.
BitNet b1.58 2B4T benchmarks within 1-2 points of full-precision peers on MMLU, GSM8K, HumanEval+. Memory drops from ~2 GB to 0.4 GB at 2B parameters.
Has to be trained ternary from scratch. You cannot quantize a Llama 70B down to 1.58 bits. Public scale ceiling is 8B parameters.
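The "multiplications collapse to additions" claim is easy to see in a few lines. A pure-Python sketch of a ternary matrix-vector product; real BitNet kernels pack the weights into dense bit encodings, but the arithmetic is the same.

```python
# With weights in {-1, 0, +1}, a matrix-vector product only ever adds or
# subtracts activations. Pure-Python sketch for clarity.

import random

def ternary_matvec(W, x):
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1: add the activation
            elif w == -1:
                acc -= xi      # -1: subtract it
            # 0: skip entirely
        out.append(acc)
    return out

W = [[random.choice((-1, 0, 1)) for _ in range(8)] for _ in range(4)]
x = [random.uniform(-1, 1) for _ in range(8)]
print(ternary_matvec(W, x))
```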
MXFP6 mixed precision
Research / Discussion #22498. Use MXFP6 in places MXFP4 hurts, keep MXFP4 elsewhere. The microscaling FP6 format has more headroom than FP4 with similar compute efficiency on the same hardware path.
Mixed-precision MXFP4 plus MXFP6 has shown better accuracy than pure MXFP4 in published experiments while keeping memory footprint close.
No tooling default uses it yet. This is the 'where is the field heading' bullet, not the 'what to ship today' bullet.
GGUF Q4_K_M is still the default for a reason
Despite all of the above, the practical answer for most people running a local model on a Mac in mid 2026 is still a Q4_K_M or Q5_K_M GGUF on llama.cpp or its MLX equivalent. The k-quants system pairs integer block quantization with per-block scales and a small amount of outlier handling, and it remains the default download target on Hugging Face for community fine-tunes.
NVFP4 wins when you have Blackwell. MXFP4 wins when you have A19 Pro or other dedicated FP4 silicon. BitNet wins when you have a model that was trained ternary from the start. For everyone else with a plain M-series Mac and a model from the public commons, k-quants is still the format that ships.
| Feature | NVFP4 | GGUF Q4_K_M |
|---|---|---|
| Bits per weight | 4 (FP4, 16-value blocks) | ~4.5 effective |
| Hardware path on consumer Mac | Software-fallback only on non-Blackwell | Native llama.cpp + MLX |
| Quality vs FP16 | Under 1% on language tasks | Small but measurable drop |
| Tooling maturity (mid 2026) | Merged into llama.cpp, Blackwell PR #22196 open | Default for community fine-tunes |
| Best for | DGX Spark and Blackwell hardware paths | Today, on any Mac |
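Plugging the table's bits-per-weight figures into a footprint estimate shows why the k-quants default keeps winning on plain M-series machines. Parameter counts below are round numbers, and the estimate covers weights only, not KV cache or runtime overhead.

```python
# Weight-only footprint at the bits-per-weight figures from the table above.

def weight_gib(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for params in (7, 13, 30):
    fp16 = weight_gib(params, 16)     # unquantized baseline
    q4km = weight_gib(params, 4.5)    # ~4.5 effective bits for Q4_K_M
    print(f"{params:>2}B params: FP16 {fp16:5.1f} GiB -> Q4_K_M ~{q4km:4.1f} GiB")
```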
“3.5x reduction in model memory footprint relative to FP16, less than 1 percent degradation on key language modeling tasks for some models”
Nvidia developer blog, NVFP4 announcement
Where Fazm fits, the one setting that matters
Fazm is a desktop agent for macOS, not a quantization toolkit. It does not bundle llama.cpp, it does not ship its own quantized weights, and it does not pretend to know which format is best for your machine. What it does do, and the part that matters for this page, is expose a setting that lets you point the agent at any Anthropic Messages-API-compatible endpoint, including a local one serving a quantized model.
The setting lives at Settings, AI Chat, Custom API Endpoint. Toggle it on, paste a URL like https://your-proxy:8766 and the next bridge restart will route the agent there. Internally, the @AppStorage key is declared at SettingsPage.swift:885, the toggle plus text field render at lines 954-999, and the value is exported as the ANTHROPIC_BASE_URL environment variable on the bridge subprocess at ACPBridge.swift:468-469. The agent process itself does not change. The reasoner behind it is whatever you pointed the bridge at.
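Before flipping the toggle, it is worth confirming that whatever sits at the URL answers the Messages API shape the bridge expects. A minimal check from Python, assuming your shim or gateway is already listening; the model name and key are placeholders, and many local shims ignore the key entirely.

```python
import requests

BASE_URL = "https://your-proxy:8766"   # whatever you will paste into the setting

resp = requests.post(
    f"{BASE_URL}/v1/messages",
    headers={
        "x-api-key": "placeholder-key",        # most local shims ignore this
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "local-quantized-model",      # whatever name your shim routes on
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Reply with the word ok."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])
```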
A few endpoints that drop in:
- llama.cpp server hosting a Q4_K_M or Q5_K_M GGUF, with a thin Anthropic shim in front (sketched after this list). This is the most common path on a plain M-series Mac.
- MLX server hosting a 4-bit group quantized model, for users who want Apple's unified-memory path natively.
- vLLM with NVFP4 weights for users on a Blackwell box, fronted by an Anthropic-compatible proxy.
- OpenRouter Anthropic mode or any commercial Anthropic-compatible gateway, if you want the agent loop without running quantization locally at all.
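For the first bullet, the shim can be very small. A minimal sketch built on FastAPI and httpx, translating Anthropic Messages requests into llama-server's OpenAI-compatible chat endpoint; it handles plain-text turns only, with no streaming, no tool_use blocks, and no image content, all of which a real agent session will eventually want. Ports and the model name are placeholders.

```python
# Minimal Anthropic Messages-API shim in front of a local llama.cpp server
# (llama-server) via its OpenAI-compatible /v1/chat/completions endpoint.
# Text-only sketch: no streaming, no tool_use, no images.

import httpx
from fastapi import FastAPI, Request

LLAMA_SERVER = "http://127.0.0.1:8080"   # where llama-server is listening
app = FastAPI()

def flatten(content):
    # Anthropic message content may be a plain string or a list of content blocks.
    if isinstance(content, str):
        return content
    return "\n".join(b.get("text", "") for b in content if b.get("type") == "text")

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    oai_messages = []
    if body.get("system"):
        oai_messages.append({"role": "system", "content": flatten(body["system"])})
    for m in body.get("messages", []):
        oai_messages.append({"role": m["role"], "content": flatten(m["content"])})

    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            f"{LLAMA_SERVER}/v1/chat/completions",
            json={
                "messages": oai_messages,
                "max_tokens": body.get("max_tokens", 512),
                "temperature": body.get("temperature", 0.7),
            },
        )
    text = r.json()["choices"][0]["message"]["content"]

    # Shape the reply the way the Anthropic Messages API does.
    return {
        "id": "msg_local",
        "type": "message",
        "role": "assistant",
        "model": body.get("model", "local-gguf"),
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
        "usage": {"input_tokens": 0, "output_tokens": 0},
    }
```

Run it with something like `uvicorn shim:app --port 8766`, then paste that address into the Custom API Endpoint field.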
What this means for the agent loop
The honest framing is that quantization improvements help an agent loop in a narrower way than they help a chat session. Chat is one prompt. An agent loop re-sends the system prompt, the tool schema, the conversation so far, and the current screen state on every turn. That input is dominated by prefill, not generation. Most 2026 quantization advances target weight memory and decode tok/s. The single advance that helps long-running agents in a structural way is KV-cache quantization, which is exactly what TurboQuant addresses.
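Some illustrative numbers make the prefill point concrete; the token counts and throughputs below are assumptions picked to be plausible for a mid-size quantized model on an M-series Mac, not measurements.

```python
# Per-turn latency split for an agent loop. All numbers are illustrative.

system_prompt = 2_000     # tokens, re-sent every turn
tool_schema   = 1_500
screen_state  = 3_000     # current UI snapshot
history       = 8_000     # conversation so far
reply         = 300       # tokens actually generated

prefill_tokens = system_prompt + tool_schema + screen_state + history
prefill_tps, decode_tps = 400, 30   # assumed prompt-processing and decode speeds

print(f"prefill: {prefill_tokens / prefill_tps:.1f}s, decode: {reply / decode_tps:.1f}s")
```

Prefix caching claws back the parts that repeat verbatim across turns, which is why the setup below turns it on.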
If you are reading this in mid 2026 and want a working setup today, the boring answer is the right one: a 13B to 30B class instruct model with strong tool-use training, GGUF Q4_K_M or Q5_K_M, served by llama.cpp with prefix caching enabled, behind an Anthropic shim, pointed at by Fazm's Custom API Endpoint setting. Watch NVFP4 land on Blackwell, watch MXFP4 mature on Apple A19 Pro, watch TurboQuant merge upstream. None of those need to ship before a local agent is usable. They just make a usable thing better.
Pointing Fazm at a local quantized model on your Mac?
Bring your runtime, format, and use case. We will spend the call making the bridge see your endpoint.
Frequently asked questions
What is the single biggest 2026 quantization shift for someone running models locally on a Mac?
MXFP4 picking up real hardware support. The Open Compute Project's microscaling FP4 format used to be a software-only idea on consumer machines. With Apple's A19 Pro the MXFP4 path has dedicated silicon, which collapses the dequantization overhead that made earlier 4-bit floating-point formats slower than INT4 on consumer Apple chips. That hardware path makes MXFP4 a serious option for an on-device runtime in a way it was not in 2025.
How do NVFP4 and MXFP4 differ if they are both 4-bit floating point?
Block size and scaling. MXFP4 (the OCP standard) uses 32-value blocks with a single E8M0 power-of-two scaling factor per block. NVFP4 (Nvidia's Blackwell-native format) uses 16-value blocks with a two-level scaling strategy: a fine-grained E4M3 scaling factor per block plus a second-level FP32 tensor-wide scalar. Smaller blocks plus the higher-precision scale mean NVFP4 retains a bit more accuracy, at the cost of needing actual Blackwell tensor-core dispatch to run efficiently. Per the Nvidia developer blog, NVFP4 cuts memory footprint 3.5x versus FP16 and 1.8x versus FP8 with under 1 percent accuracy loss on language tasks.
Is BitNet b1.58 actually usable in production yet?
Usable, not a drop-in replacement. Microsoft's BitNet repo on GitHub is MIT-licensed and ships an inference framework. The catch is that 1.58-bit weights have to be trained that way from scratch, which means you do not get to take an existing 70B model and crush it down. The flagship public checkpoint is BitNet b1.58 2B4T: at 2B parameters trained on 4T tokens, it benchmarks within 1 to 2 points of full-precision peers on MMLU, GSM8K, and HumanEval+, and reduces footprint from roughly 2 GB to 0.4 GB. That is real. It is also still capped at the 8B scale ceiling for natively trained models.
What does TurboQuant change about long-context inference?
It pushes the KV cache, not the weights, down to extreme quantization levels. KV-cache memory is what blows up first as you scale context length, and weight quantization does nothing to fix it. TurboQuant (Zandieh et al., ICLR 2026) is being prototyped against llama.cpp via Discussion #20969 and is the first KV-quantization scheme aggressive enough to consider for consumer hardware running long agent sessions. If you have ever watched a local agent OOM after the conversation grows past a few thousand turns, this is the lever.
Did anything change for AWQ in 2026 or is it the same as it was?
Same algorithm, much faster calibration. The 0.9.0 release of LLM Compressor added batched calibration for AWQ, with batch_size=32 yielding roughly a 3x speedup on large models because AWQ relies on many onloaded forward passes during calibration. AWQ remained the default INT4 production-inference format through 2025, and faster calibration is what made it tractable to recompute scales for new fine-tunes without renting a small cluster.
How do I run a 4-bit quantized local model behind Fazm on my Mac?
Fazm has a Custom API Endpoint setting under Settings, AI Chat. The toggle and text field are defined in SettingsPage.swift lines 954-999 with placeholder 'https://your-proxy:8766'. The value is stored as the customApiEndpoint UserDefault and exported as ANTHROPIC_BASE_URL on the bridge subprocess at ACPBridge.swift lines 468-469. Anything that speaks Anthropic's Messages API can sit at that URL: a llama.cpp server with an Anthropic shim, an MLX server hosting a 4-bit group quantized model, an Ollama-fronted GGUF, or an OpenRouter Anthropic mode endpoint. Fazm itself does not ship a quantization engine. It ships the hook that points the agent at one.
Is the GGUF k-quants system being replaced by FP4 formats?
Not yet, and the dominant production usage on consumer Macs is still GGUF k-quants (Q4_K_M, Q5_K_S, Q6_K). The k-quants format mixes integer block quantization with per-block scales and minor outlier handling and remains the default download target on Hugging Face for community fine-tunes. The 2026 movement is layered on top: AWQ scales applied before GGUF quantization to protect salient weights, and FP4 formats (NVFP4, MXFP4) rolling in for hardware that supports them. For most Mac users in mid 2026 the practical answer is a Q4_K_M or Q5_K_M GGUF on llama.cpp, with FP4 reserved for users on Apple A19 Pro or Blackwell discrete GPUs.
What about MLX, does Apple's framework get its own format?
MLX uses 4-bit group quantization as its native format, optimized for Apple's unified memory architecture. There is no MLX-specific format announcement in 2026 in the same sense as NVFP4 from Nvidia. The interesting MLX side is that MXFP4 with hardware support on A19 Pro now competes with MLX's own 4-bit group format, and the open question is whether Apple's stack will adopt MXFP4 as a first-class quantization target or keep MLX's group quantization as the default.
Related guides
Local LLM desktop agent throughput
Why prompt-processing of screen state, not generation tok/s, is the bottleneck for a local agent loop.
Local LLM runtime vs agent loop
What a runtime gives you and what an agent loop has to add on top before a desktop agent works.
Custom API endpoint guide
How to point a desktop agent at a non-Anthropic endpoint, and what the bridge expects on the other side.