Research · serving

When 4-bit KV cache uses more memory than fp16

Topic Research Related Not all layers are equal

Stock mlx-lm's --kv-bits 4 on a 24 GB Mac at long context actually peaks higher than fp16 KV. Users who turn it on hoping to save memory get an OOM. This post traces where the spike comes from, how OptiQ avoids it, and why the resulting path runs at fp16 speed parity with better long-reasoning accuracy.

The diagnosis

Run granite-4.1-8b-4bit on a 24 GB M-class Mac at 32k context. Compare peak GPU memory:

fp16 KV cache:                  11.51 GB
stock mlx-lm u4 KV cache:       16.35 GB   ← worse than fp16
stock mlx-lm u4 at 48k:         OOMs the process

Two compounding spikes inside the mlx-lm path explain this.

Conversion spike. maybe_quantize_kv_cache in mlx-lm 0.31.x enqueues every layer's c.to_quantized(...) as a lazy MLX op, then evals them all in a single batch. At the batch eval point MLX holds (fp16_all_layers + quantized_all_layers) co-resident in memory. On a 9 B model with 40 attention layers at 32k context, this transient is about 5 GB on its own.

Prefill scores-matrix spike. mlx-lm's unfused quantized_scaled_dot_product_attention computes one big Q @ K^T scores matrix per call. With GQA expansion the shape is (B, n_kv_heads, n_repeats, L_q, N). At prefill chunk-size 2048 and N=16k on Granite-4.1-8b that is about 4.3 GB of fp16, plus the softmax output adds another 4.3 GB co-resident at the softmax step. Roughly 8 GB of transient just to compute attention scores once.

Add these together and the "4-bit cache saves memory" story collapses on tight RAM. The persistent cache really is 4× smaller; the transients eat the savings and then some.

The fix: two pieces, both default-on

OptiQ patches both spikes whenever you pass --kv-bits or --kv-config to optiq serve. No flags to set, no decisions for the user.

Streaming per-layer converter (optiq.runtime.streaming_kv_quant). Replaces the batched maybe_quantize_kv_cache with a per-layer loop: quantize one layer's K, mx.eval, drop the fp16 reference, mx.clear_cache, same for V, repeat. The transient drops from "all layers worth of fp16 K and V" to roughly one layer's worth (~150 MB on a 9B at 32k).

FlashAttention-2 N-tiled SDPA (optiq.runtime.fused_quant_sdpa). Replaces the unfused quantized_scaled_dot_product_attention with the standard online-softmax algorithm: tile the K/V N axis into chunks of 512, compute partial scores via mx.quantized_matmul (Apple's tuned Metal kernel), apply an online softmax update, accumulate into the output. The full (B, H_q, L_q, N) scores matrix is never materialized; the largest transient is a single tile of (B, H_q, L_q, 512), about 270 MB at our scale.

Same FlashAttention-2 algorithm that fused fp16 SDPA uses; the inner matmul is Apple's existing kernel. The contribution is the orchestration that makes long-context KV-quant fit at all.

Memory result

granite-4.1-8b-4bit on M4 24 GB, NIAH retrieval at depth 0.5:

Context	fp16 peak	Stock mlx-lm u4	OptiQ u4
16k	8.59 GB	spike-prone	6.71 GB
32k	11.51 GB	16.35 GB	7.60 GB
48k	16+ GB	OOM	8.5 GB

At 32k the OptiQ path is 34% below fp16 and 53% below stock mlx-lm's broken u4. The same patches unlock contexts on Qwen3.5-9B where fp16 KV is tight (11.5 GB at 64k, 18.9 GB at 96k) but OptiQ u4 stays under 10 GB across the whole range.

Speed result

On Qwen3.5-9B (matched 25 trials, ctx=10k chars), the OptiQ path runs at fp16 KV cache speed within ±2%:

Qwen3.5-9B, ctx=10k chars	gen_tps	vs fp16
fp16 KV	15.47	baseline
OptiQ u4 KV	15.16	−2.0%
OptiQ mixed-precision KV	15.29	−1.2%

The speed lift comes from upstream mx.quantized_matmul; our FlashAttention orchestration is what carries it through to long-context inference without an OOM.

Hash-hop is the right benchmark for KV precision

NIAH (needle-in-a-haystack) is too easy a test for KV precision: models tolerate noise because the answer is a single phrase the model can recover approximately. Hash-hop forces exact 16-character hash retrieval through a multi-hop chain. Any KV noise that flips even one character of the final hash is a total failure. That makes it sensitive to KV quality in a way NIAH is not.

Methodology

All KV-mode comparisons below run on the same weight-quantized model: mlx-community/Qwen3.5-9B-OptiQ-4bit (OptiQ-quantized weights, ~4-bit average, with the most sensitive layers held at higher precision). The labels fp16, u4, and mixed in the tables refer to KV-cache precision only, not weight precision. The weights are constant; we're isolating the effect of KV-cache quantization.

fp16: OptiQ-quantized weights + fp16 KV cache (the upper bound on quality at this weight quant)
u4: OptiQ-quantized weights + uniform 4-bit KV cache (mlx-lm's --kv-bits 4 path)
mixed: OptiQ-quantized weights + OptiQ mixed-precision KV cache (kv_config.json with the two most sensitive full-attention layers held at 8 bits, rest at 4 bits)

For the "stock vs OptiQ stack" comparison further down, we also bench uniform 4-bit weights (mlx-community/Qwen3.5-9B-4bit) paired with uniform 4-bit KV (the typical "just turn 4-bit on for everything" user setup) against the OptiQ-weights + OptiQ-mixed-KV stack.

Difficulty curve on Qwen3.5-9B at ctx=10k chars (25 trials per cell, matched conditions):

Hops	fp16 KV	u4 uniform	mixed-precision (OptiQ)
1 (easy)	25/25 (100%)	25/25 (100%)	25/25 (100%)
2	22/25 (88%)	25/25 (100%)	25/25 (100%)
3 (the differentiator)	9/25 (36%)	6/25 (24%)	8/25 (32%)

Read across the difficulty curve:

hops=1 and 2 are saturated. The model handles single-hop and 2-hop chains perfectly regardless of KV precision. Some sampling-path variation (all three modes at or above 88%) but no meaningful differentiation.
hops=3 is the differentiation window. The model is partly capable but stressed by reasoning and KV precision together. Uniform 4-bit drops to 67% of fp16's accuracy. OptiQ mixed-precision retains 89% of fp16, 8 percentage points above u4 in absolute terms, 33% better in relative terms (32% vs 24%).

The mixed-precision win comes from keeping the two most KV-sensitive full-attention layers (layer 3 and the last attention layer) at 8 bits while letting the rest run at 4 bits. Total KV memory is essentially the same as uniform u4 (avg ~5 bits), but the layers that matter most stay precise. That's the OptiQ-specific value: a single --kv-config file that closes most of the accuracy gap.

Stacked: stock 4-bit everything vs the OptiQ stack

The fairer end-to-end question: what does a user actually pick on a 24 GB Mac when they want long context with quantized memory? Two setups, both real shippable choices:

Stock 4-bit-everything: mlx-community/Qwen3.5-9B-4bit (uniform 4-bit weights) + --kv-bits 4 uniform 4-bit KV cache. The minimal-memory option.
OptiQ stack: mlx-community/Qwen3.5-9B-OptiQ-4bit (OptiQ mixed-precision weights, with the most sensitive linear layers held at higher precision) + --kv-config OptiQ mixed-precision KV (avg ~5 bits, layers 3 and 31 at 8 bits, rest at 4 bits). Larger memory, but spends those bits where the model can use them.

Same hash-hop sweep, 25 trials per cell:

Setup	Active GB at load	gen_tps	hops=1	hops=2	hops=3
Stock 4-bit-everything	5.04	18.4	100%	84%	20%
OptiQ stack (weights + KV)	7.10	13.9	100%	100%	32%

At the differentiator (hops=3), the OptiQ stack scores 60% better than stock 4-bit-everything (32% vs 20%, or +12 percentage points absolute). The trade-off is +2 GB of weights memory and ~25% slower decode (14 tps vs 18 tps). Whether that's worth it depends on the workload: short prompts in a chat UI probably don't notice the accuracy difference at hops=3, but anything involving multi-hop retrieval, code symbol following, or long structured reasoning benefits substantially. For the kind of long-context serving where OptiQ is targeted, the OptiQ stack is the right pick.

An intermediate choice some users will want: OptiQ weights + plain fp16 KV. That uses OptiQ's higher-quality weight quant without the extra KV-cache infrastructure. At hops=3 it lands at 36% (the highest of any 4-bit-weights setup we tested), but the persistent KV cache is 4× larger than the OptiQ stack and OOMs sooner at long context. The OptiQ stack is the right pick when long context is the goal; OptiQ-weights + fp16-KV is the right pick when short-prompt accuracy is the goal and you have headroom.

Where on the model size axis this matters

Hash-hop at ctx=10k chars across the Qwen3.5 sizes (25 trials per cell, hops=2 and 3):

Size	fp16 tps	u4 tps	mixed tps	Speed gap	hops=2 fp16/u4/mixed	hops=3 fp16/u4/mixed
0.8B	121	107	108	−12%	0% / 0% / 4%	0% / 0% / 8%
2B	58.7	55.1	54.8	−7%	0% / 0% / 4%	12% / 8% / 0%
4B	27.5	26.2	26.2	−4.7%	44% / 48% / 44%	40% / 40% / 40%
9B	14.0	14.0	13.9	−0.6%	88% / 100% / 100%	36% / 24% / 32%

Two observations.

First, the speed gap shrinks as the model grows: 12% at 0.8B, 7% at 2B, 4.7% at 4B, and 0.6% at 9B. Below 4B the per-token quantize overhead is a bigger fraction of total cost. At 9B the cost is amortized, so you essentially get the memory savings free.

Second, hash-hop has a minimum model size for the task itself. At 0.8B and 2B no mode answers reliably (the model is the bottleneck, not the KV cache). At 4B the model is at its capability cliff (~44% on hops=2, ~40% on hops=3 in all modes, no KV-mode signal above noise). Mixed-precision differentiation only appears once the model is reliably capable, which on Qwen3.5 is 9B+.

How to use it

# Default. Both patches install automatically with --kv-bits.
$ optiq serve --kv-bits 4 --model mlx-community/Qwen3.5-9B-OptiQ-4bit

# Mixed-precision KV (the differentiator). Generate a kv_config first.
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit --target-bits 5.0
$ optiq serve --kv-config optiq_output/kv_cache/kv_config.json \
              --model mlx-community/Qwen3.5-9B-OptiQ-4bit

The serve docs cover request routing, adapter mounting, and the OpenAI/Anthropic/Responses endpoints that share this same KV-cache pipeline.