Sensitivity-aware LoRA — one signal, two optimizations.
Standard LoRA gives every adapted layer the same rank — the same number of trainable parameters per layer, regardless of how much that layer matters to the model's output. mlx-optiq already knows which layers matter: they got more bits during quantization. Reusing that signal at training time turns out to be a meaningful win.
The same layers we kept at 8-bit during quantization also get more adapter rank during fine-tuning. Lower validation loss at the same parameter budget.
The mechanic
When you quantize an mlx-optiq model, we record the per-layer bit assignments in a sidecar — optiq_metadata.json, shipped inside every quant on Hugging Face. Every layer's quantization bit-width is right there. optiq lora train reads it and uses it to scale adapter rank:
- 4-bit layers (the model's robust ones) → base rank.
- 8-bit layers (the model's sensitive ones) → 2× base rank.
At --rank 8 --rank-scaling by_bits, that's rank-8 for 4-bit layers and rank-16 for 8-bit layers — same total parameter budget as constant rank-10, but capacity is moved toward the layers that demonstrably affect output more.
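To make the mechanic concrete, here is a minimal sketch of the rank-assignment step in Python. The schema of optiq_metadata.json shown (a layer_bits mapping) and the assign_ranks helper are assumptions for illustration, not the trainer's actual internals:

```python
import json

def assign_ranks(metadata_path: str, base_rank: int = 8) -> dict:
    """Map each adapted layer to a LoRA rank based on its quantization bit-width.

    Assumes the sidecar looks roughly like
    {"layer_bits": {"model.layers.0.self_attn.q_proj": 4, ...}};
    the real optiq_metadata.json schema may differ.
    """
    with open(metadata_path) as f:
        meta = json.load(f)

    ranks = {}
    for layer_name, bits in meta["layer_bits"].items():
        # 8-bit layers were flagged as sensitive at quantization time -> 2x rank.
        ranks[layer_name] = base_rank * 2 if bits == 8 else base_rank
    return ranks

# ranks = assign_ranks("optiq_metadata.json", base_rank=8)
# -> rank 8 for the 4-bit layers, rank 16 for the 8-bit layers
```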
Why this should work, intuitively
LoRA's premise is that fine-tuning updates lie in a low-rank subspace — but it doesn't say anything about which layers need more capacity. In practice, the layers that need more capacity to fit a target task are not uniform. They tend to correlate with output-distribution sensitivity: a layer that, when perturbed, shifts the logits a lot is also a layer where small changes during fine-tuning produce large behavioral changes. Both phenomena are about the gain between a layer's weights and the model's output.
So the same signal that tells us "this layer is fragile under quantization" also tells us "this layer is responsive to fine-tuning." Allocating more rank where that gain is larger makes sense.
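One way to make the gain intuition concrete is a crude per-layer probe: perturb one layer's weights slightly and measure how far the output distribution moves. This is only an illustration of the idea, not the scoring mlx-optiq runs during quantization; forward, weights, and the KL metric are stand-ins:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Mean KL(p || q) over the vocabulary axis."""
    eps = 1e-9
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def layer_sensitivity(forward, weights: dict, layer_name: str,
                      calib_batch, noise_scale: float = 1e-3) -> float:
    """Output-distribution shift caused by a small perturbation of one layer.

    `forward(weights, batch)` is a stand-in returning softmax probabilities.
    Layers with a high score are the "sensitive" ones: the same layers that
    keep 8 bits at quantization time and get 2x rank at fine-tuning time.
    """
    baseline = forward(weights, calib_batch)

    w = weights[layer_name]
    perturbed = dict(weights)
    perturbed[layer_name] = w + noise_scale * np.std(w) * np.random.randn(*w.shape)

    return kl_divergence(baseline, forward(perturbed, calib_batch))
```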
The empirical result
On a small GSM8K fine-tuning A/B (Qwen3.5-9B-OptiQ-4bit, identical hyperparameters, identical seeds, identical data), by_bits rank scaling gives:
| Method | Trainable params | Val loss @ iter 50 |
|---|---|---|
| Constant rank-10 | ~3.2 M | 2.41 |
| by_bits (rank 8 / 16) | ~3.2 M | 2.12 (−12 %) |
−12 % validation loss at the same parameter budget. Not earth-shattering, but free — same training time, same memory, just a smarter allocation.
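The "same budget" claim is easy to check for any model, since a LoRA module of shape (d_out, d_in) at rank r adds r × (d_in + d_out) trainable parameters. A quick helper, with module shapes supplied by you (the dict layout here is illustrative):

```python
def lora_param_count(ranks: dict, shapes: dict) -> int:
    """Total trainable LoRA params: each module adds A (r x d_in) plus B (d_out x r)."""
    return sum(r * (shapes[name][0] + shapes[name][1]) for name, r in ranks.items())

# shapes:   {module_name: (d_out, d_in)} read from the base model
# constant = lora_param_count({name: 10 for name in shapes}, shapes)
# by_bits  = lora_param_count(assign_ranks("optiq_metadata.json"), shapes)
```

Whether the two totals match exactly depends on the fraction f of 8-bit modules in your model: with uniform module shapes the by_bits average rank is 8 + 8f, so f = 0.25 reproduces a constant rank-10 budget.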
The training-ceiling map
What follows is the empirical story we wish we'd had when we started fine-tuning mlx-optiq quants on a 36 GB Mac. All entries verified end-to-end against a real Hermes-traces dataset with the default config (q_proj, v_proj, num_layers=16, rank=8, rank_scaling=by_bits): training stable across iterations, zero memory drift.
| Model | Max seq len | Peak mem | Tokens / sec | Time / iter |
|---|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 | 96 s |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 | 63 s |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 | 84 s |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 | 65 s |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB | 11.4 | 45 s |
| gemma-4-26B-A4B | 512 | 27.6 GB | 22.2 | 32 s |
| Qwen3.5/3.6-35B-A3B | 128 | 25.3 GB | 32.2 | 17 s |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 | 11 s |
Two cliffs, not one
Pushing past these ceilings hits two distinct failure modes that show up at different points:
- Memory cliff (~27–28 GB peak). When peak crosses the system-default GPU-wired cap, macOS absorbs the overflow via compressed memory. It works, but throughput drops 9–30 % depending on the activation/static-weight ratio. 9 B is most sensitive (−30 % at 28.0 GB); 27 B is least (−9 % at 29.2 GB) since most of its footprint is static weights.
- MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously-bound MTLResources at 499 K, and per-iter resource count grows with both num_layers and seq_len. 2 B at T = 3,200 hits a hard kIOGPUCommandBufferCallbackErrorOutOfMemory at iter 1 even though peak memory is only 22 GB. Don't extrapolate "more headroom in GB" → "can push T further."
The numbers in the table above are the conservative recipes — proven safe, proven reproducible. Pushing them further is possible if you know which cliff you're approaching.
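If you do push past the table, it helps to check which cliff you are approaching after a single warm-up iteration rather than discovering it at iter 40. A rough sketch of such a guard: mlx exposes peak-memory introspection, but the exact call has moved between releases, so it is passed in here as a stand-in, and the 27.5 GB threshold is just the neighbourhood of the wired cap described above:

```python
MEMORY_CLIFF_GB = 27.5  # approximate point where macOS starts compressing (see above)

def check_memory_cliff(run_one_iteration, get_peak_memory_bytes) -> None:
    """Run one warm-up iteration, then report how close peak memory is to the cliff.

    `run_one_iteration` is a zero-arg callable doing one full forward/backward/update;
    `get_peak_memory_bytes` is whatever peak-memory call your mlx version exposes.
    """
    run_one_iteration()
    peak_gb = get_peak_memory_bytes() / 1024**3

    if peak_gb >= MEMORY_CLIFF_GB:
        print(f"peak {peak_gb:.1f} GB: past the compressed-memory cliff; "
              "expect a 9-30% throughput hit, or reduce seq_len / num_layers")
    else:
        print(f"peak {peak_gb:.1f} GB: {MEMORY_CLIFF_GB - peak_gb:.1f} GB of headroom, "
              "but the MTLResource-count cliff is independent of bytes, "
              "so raise seq_len in small steps")
```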
Output is PEFT-compatible
Adapter output is standard adapter_config.json + adapters.safetensors — loadable by peft, mlx-lm, or any tool in the LoRA ecosystem. mlx-optiq adds one extra file: optiq_lora_config.json records the per-layer rank distribution so you can inspect what by_bits actually picked.
```
$ optiq lora info ./my_adapter
# mlx-optiq LoRA adapter
# base model: mlx-community/Qwen3.5-9B-OptiQ-4bit
# total params: 2.9M trainable
# rank distribution:
#   rank 8 (4-bit layers): 96 modules
#   rank 16 (8-bit layers): 16 modules
```
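Loading the adapter outside of optiq follows the usual paths. A sketch using mlx-lm's adapter_path argument; the repo name comes from the example above, and the generate signature may differ slightly across mlx-lm releases:

```python
from mlx_lm import load, generate

# Mount the adapter directory (adapter_config.json + adapters.safetensors) on the OptiQ base.
model, tokenizer = load(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    adapter_path="./my_adapter",
)

print(generate(model, tokenizer, prompt="A farmer has 12 sheep...", max_tokens=64))
```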
Hot-swap at serve time
Once you have an adapter, the mounted-LoRA primitive lets you keep N of them resident on a single base and switch per request via a ContextVar the server flips. ~50 MB per extra adapter on top of one base — vs ~5.6 GB per full model copy. Details in the serve docs.
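The per-request switch is plain Python contextvars. A minimal sketch of the pattern, not the mlx-optiq server code; the adapter registry and the way the adapter delta is applied are stand-ins:

```python
from contextvars import ContextVar

# Which mounted adapter the current request should use; None means plain base model.
active_adapter = ContextVar("active_adapter", default=None)

class AdapterMount:
    """Keeps several LoRA adapters resident alongside a single base model."""

    def __init__(self, base_model):
        self.base = base_model
        self.adapters = {}  # name -> loaded adapter weights (~50 MB each)

    def mount(self, name, adapter_weights):
        self.adapters[name] = adapter_weights

    def forward(self, batch):
        name = active_adapter.get()
        lora = self.adapters.get(name) if name is not None else None
        # Stand-in: the real forward would add the selected adapter's low-rank delta.
        return self.base(batch, lora=lora)

# Per request, the server flips the ContextVar around the model call:
# token = active_adapter.set("gsm8k-math")
# try:
#     out = mount.forward(batch)
# finally:
#     active_adapter.reset(token)
```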
Full reference for the trainer, all rank-scaling modes, and the data format is in the LoRA fine-tuning guide.
— the mlx-optiq team