mlx-optiq
Research · foundation

Not All Layers Are Equal:
Mixed-precision quantization for weights and KV cache on Apple Silicon.

When you quantize an LLM to run locally on a Mac, the standard approach is uniform 4-bit — every layer gets the same treatment. We wanted to know: what happens if you measure which layers actually need precision and which don't?

The answer surprised us. Some layers are 56× more sensitive to quantization than others. And the KV cache — which nobody talks about — becomes the dominant memory cost at long contexts, where uniform quantization completely breaks down.

Sensitivity varies wildly and predictably. Measurement beats heuristics.

Here's what we found building a sensitivity-driven quantization pipeline for MLX. Everything that became mlx-optiq traces back to this work.

The setup

We worked with Qwen3-0.6B-base on an M3 Max, using Apple's MLX framework. The approach is straightforward: for each layer, temporarily quantize it, run calibration data (WikiText-2) through the full model, and measure the KL divergence of the output logits against a reference. This gives a per-layer sensitivity score at each candidate bit-width.
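
In sketch form, the measurement loop looks roughly like this. It is illustrative rather than the mlx-optiq source: it assumes the model is callable on a batch of calibration token ids and returns logits, and that `layer_paths` holds dotted paths to the Linear modules being probed.

```python
import mlx.core as mx
import mlx.nn as nn


def kl_divergence(ref_logits, test_logits):
    """Mean KL(reference || quantized) over all token positions."""
    ref_logp = ref_logits - mx.logsumexp(ref_logits, axis=-1, keepdims=True)
    test_logp = test_logits - mx.logsumexp(test_logits, axis=-1, keepdims=True)
    return mx.mean(mx.sum(mx.exp(ref_logp) * (ref_logp - test_logp), axis=-1))


def _resolve(model, path):
    """Walk a dotted path like 'model.layers.5.mlp.down_proj' to (parent, attr)."""
    *parents, leaf = path.split(".")
    node = model
    for part in parents:
        node = node[int(part)] if part.isdigit() else getattr(node, part)
    return node, leaf


def layer_sensitivity(model, calib_tokens, layer_paths, bits=4, group_size=64):
    """KL divergence of the output logits when one layer at a time is quantized."""
    ref_logits = model(calib_tokens)  # full-precision reference pass
    mx.eval(ref_logits)

    scores = {}
    for path in layer_paths:
        parent, attr = _resolve(model, path)
        original = getattr(parent, attr)
        quantized = nn.QuantizedLinear.from_linear(
            original, group_size=group_size, bits=bits
        )

        setattr(parent, attr, quantized)   # temporarily quantize just this layer
        scores[path] = kl_divergence(ref_logits, model(calib_tokens)).item()
        setattr(parent, attr, original)    # restore full precision before moving on
    return scores
```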

A greedy knapsack algorithm then assigns bits — more to sensitive layers, fewer to robust ones — to hit a target average bits-per-weight.
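
A two-bit-width version of that allocation step might look like the sketch below. The helper is hypothetical: it assumes `sensitivity[path][bits]` holds the KL scores measured above and `num_params[path]` each layer's parameter count; the real pipeline considers more than two candidate bit-widths.

```python
def allocate_bits(sensitivity, num_params, low=4, high=8, target_avg_bits=4.5):
    """Greedy knapsack: start every layer at `low` bits, then upgrade the layers
    that buy the most KL reduction per extra bit of storage until the
    average-bits-per-weight budget is spent."""
    assignment = {path: low for path in sensitivity}
    budget = (target_avg_bits - low) * sum(num_params.values())  # extra bits to hand out

    def gain(path):
        saved_kl = sensitivity[path][low] - sensitivity[path][high]
        extra_bits = (high - low) * num_params[path]
        return saved_kl / extra_bits

    for path in sorted(sensitivity, key=gain, reverse=True):
        cost = (high - low) * num_params[path]
        if cost <= budget:
            assignment[path] = high
            budget -= cost
    return assignment
```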

We tested this on three model types: LLMs (Qwen3-0.6B), vision-language models (Qwen3-VL-2B), and speech recognition (Qwen3-ASR-0.6B).

Result 1 — weight quantization, +17 pp on math reasoning

With per-layer 4 / 8-bit allocation at 4.5 average bits-per-weight, the mixed-precision model scores 51 % on GSM8K (100-question math reasoning) compared to 34 % for uniform 4-bit — a 17-percentage-point improvement at only 11 % more model size.

Model              Size      BPW    Perplexity   GSM8K
Uniform 4-bit      320 MB    4.0    19.2         34 %
Mixed 4 / 8-bit    355 MB    4.5    17.4         51 %
Mixed 3 / 4-bit    266 MB    3.5    32.7         9 %

The sensitivity analysis reveals a clear structure: lm_head is 8× more sensitive than the median layer, layers in the last transformer block reach up to 6.8×, and some early layers, whose errors get amplified as they propagate downstream, reach 10.4×. Meanwhile, query / key projections sit at roughly 1.0× and can safely take the minimum bit-width.

The mixed-precision allocation dominates the entire Pareto frontier between uniform 3-bit and uniform 8-bit. At every model size, it matches or beats the uniform baseline.

[Figure: Quality-vs-size Pareto frontier. Mixed-precision matches or beats uniform quantization at every model size.]

Result 2 — KV cache, the hidden memory problem

This is where things get interesting. The KV cache stores key / value projections for all past tokens during generation. For Qwen3-0.6B at 4 K context, the FP16 KV cache is 448 MB — already 140 % of the model weights (320 MB). At 16 K context it's 1.8 GB, or 5.6× the weights.
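
The arithmetic is easy to check; a quick sketch assuming Qwen3-0.6B's configuration (28 layers, 8 KV heads, head dimension 128) and 2-byte FP16 elements:

```python
def kv_cache_mb(tokens, layers=28, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # keys + values, per layer, per head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**20

print(kv_cache_mb(4 * 1024))    # 448.0 MB at 4 K context
print(kv_cache_mb(16 * 1024))   # 1792.0 MB (~1.8 GB) at 16 K context
```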

MLX supports uniform KV cache quantization, but when we measured per-layer sensitivity, we found Layer 0's KV cache is 56× more sensitive than average. Uniform 4-bit KV doesn't know this.

The result is catastrophic:

[Figure: Uniform 4-bit KV sends perplexity from 21 to 507; keeping the few sensitive layers at 8-bit recovers most of the memory savings without breaking quality.]
KV config          Perplexity   KV memory savings
FP16 (reference)   21.2
Uniform 8-bit      21.6         44 %
Mixed 5-bit        31.1         62 %
Uniform 4-bit      507.5        69 %

Uniform 4-bit KV quantization sends perplexity from 21 to 507. The fix: keep the 7 sensitive layers (out of 28) at 8-bit and the rest at 4-bit. Relative to the FP16 baseline, that gives up only 7 percentage points of KV-memory savings compared with uniform 4-bit (62 % vs 69 %) while cutting perplexity by roughly 16× (31.1 vs 507.5).
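
As a sketch of what that looks like at inference time, assuming mlx_lm's QuantizedKVCache and the prompt_cache argument to generate (constructor arguments and import paths may differ between mlx_lm versions, and the layer set below is a placeholder rather than the measured one):

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import QuantizedKVCache

# Placeholder: only Layer 0 is named in the text; the other six sensitive
# layers come out of the measurement, not from a fixed recipe.
SENSITIVE_KV_LAYERS = {0}

model, tokenizer = load("Qwen/Qwen3-0.6B-Base")  # assumed model path
prompt_cache = [
    QuantizedKVCache(group_size=64, bits=8 if i in SENSITIVE_KV_LAYERS else 4)
    for i in range(len(model.layers))
]
text = generate(model, tokenizer, prompt="Explain KV caches.", prompt_cache=prompt_cache)
```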

The per-layer sensitivity chart shows why — most layers are robust, but Layer 0 is an extreme outlier that must stay at higher precision.

[Figure: Per-layer KV sensitivity. Layer 0 is an extreme outlier; everything else is roughly equal.]

Result 3 — the full stack, 57 % memory savings at 16 K context

Combining mixed-precision weights and mixed-precision KV cache, the savings compound at long contexts:

Context length   Default (U4 + FP16 KV)   Full mixed   Savings
4 K tokens       768 MB                   495 MB       35 %
16 K tokens      2,112 MB                 915 MB       57 %
32 K tokens      3,904 MB                 1,475 MB     62 %
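
A simple memory model reproduces the table to within rounding; the ~5-bit average for the mixed KV cache and the flat weight sizes are a simplification that ignores quantization-scale overhead.

```python
KV_FP16_MB_PER_K_TOKENS = 112  # Qwen3-0.6B: ~112 MB of FP16 KV per 1 K tokens (see above)

def total_mb(context_k_tokens, weights_mb, kv_bits):
    kv_fp16_mb = KV_FP16_MB_PER_K_TOKENS * context_k_tokens
    return weights_mb + kv_fp16_mb * kv_bits / 16

for ctx in (4, 16, 32):
    default = total_mb(ctx, weights_mb=320, kv_bits=16)  # uniform 4-bit weights, FP16 KV
    mixed = total_mb(ctx, weights_mb=355, kv_bits=5)     # mixed 4/8 weights, ~5-bit mixed KV
    print(f"{ctx} K: {default:.0f} MB -> {mixed:.0f} MB ({1 - mixed / default:.0%} saved)")
```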

At 16 K context, mixed-precision saves 1.2 GB. On a base MacBook with 8 GB unified memory, this is the difference between fitting and not fitting.

[Figure: Memory footprint vs. context length. The mixed-precision gap widens with context; the longer the prompt, the more KV-cache compression matters.]

An interesting finding: KV cache sensitivity depends on weight quantization. Different weight bit-widths produce different KV sensitivity profiles — 10 of 28 layers get different KV allocations depending on the weight quantization. You can't just measure KV sensitivity on the FP16 model and call it done.

[Figure: Joint weight + KV sensitivity across configurations. The two are coupled; measuring them in isolation misses the interaction.]

Result 4 — works across modalities

The same approach works for vision-language and speech models:

  • Qwen3-VL-2B — Mixed-precision recovers 32 % of the quantization accuracy gap on AI2D diagram understanding (41 % vs 35 % for uniform 4-bit) while being 25 % smaller. 85 of 104 vision encoder layers need 8-bit, while 195 of 197 language model layers are safe at 4-bit.
  • Qwen3-ASR-0.6B — Mixed-precision reduces output divergence by 51.5 % vs uniform 4-bit on speech recognition, with WER improvement on LibriSpeech.

What we learned

Sensitivity varies wildly and predictably. The most sensitive layers are always at the boundaries — first / last transformer blocks, lm_head, and embedding layers. Middle MLP layers are consistently robust. This structure holds across LLMs, VLMs, and speech models.

KV cache is the real bottleneck. At long contexts, it dominates total memory. Quantizing weights while leaving the KV cache in FP16 misses the biggest cost. But uniform KV quantization is a trap — it looks like it saves memory (69 % reduction!) until you check quality and find it's completely broken.

The interaction matters. Weight quantization and KV cache quantization aren't independent. The same model with different weight precision has different KV sensitivity profiles. Joint optimization over both gives better results than optimizing each in isolation.

Measurement beats heuristics. Static recipes (like "keep first and last layer at high bits") get the direction right but miss the magnitude. Layer 0's KV cache is 56× more sensitive, not 2×. Data-driven allocation captures this.

From research to production

Everything in this post is now in mlx-optiq: the per-layer sensitivity pass, the greedy knapsack, the joint weight + KV optimization, all extended to the Qwen3.5 / 3.6 and Gemma-4 families and packaged for one-line install via pip install mlx-optiq. Browse the 10 pre-built quants or read the methodology docs for the productionized story.