Sensitivity-aware LoRA: one signal, two optimizations.
Sensitivity-aware LoRA matches constant rank-16 accuracy on a logical-puzzles reasoning task using 39% fewer trainable parameters by reusing the same per-layer bit signal mlx-optiq computes during quantization. Standard LoRA gives every adapted layer the same rank regardless of which layers matter most to the model's output. mlx-optiq already knows which layers matter: they got more bits during quantization. Reusing that signal at training time produces a measurable Pareto improvement.
The same layers we kept at 8-bit during quantization also get more adapter rank during fine-tuning. The result is lower validation loss at a comparable parameter budget.
The mechanic
When you quantize an mlx-optiq model, we record the per-layer bit assignments in a sidecar: optiq_metadata.json, shipped inside every quant on Hugging Face. Every layer's quantization bit-width is right there. optiq lora train reads it and uses it to scale adapter rank:
- 4-bit layers (the model's robust ones) → base rank.
- 8-bit layers (the model's sensitive ones) → 2× base rank.
At --rank 8 --rank-scaling by_bits, that's rank-8 for 4-bit layers and rank-16 for 8-bit layers. The total parameter budget is about the same as constant rank-10, but capacity moves toward the layers that demonstrably affect output more.
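The bit-to-rank mapping described above can be sketched in a few lines. This is a minimal illustration, not mlx-optiq's implementation: the `rank_for_bits` helper, the layer names, and the `{layer name: bits}` dictionary shape are all assumptions made for the example, and the actual layout of optiq_metadata.json may differ.

```python
def rank_for_bits(bits: int, base_rank: int = 8) -> int:
    """Hypothetical helper: 8-bit (sensitive) layers get 2x the base rank.

    Only the 4-bit and 8-bit cases are described in the text; any other
    bit-width falls back to the base rank in this sketch.
    """
    return base_rank * 2 if bits == 8 else base_rank


# Assumed shape: layer name -> quantization bit-width (names are illustrative).
layer_bits = {
    "layers.0.self_attn.q_proj": 8,
    "layers.0.self_attn.v_proj": 4,
    "layers.1.self_attn.q_proj": 4,
    "layers.1.self_attn.v_proj": 8,
}

# Per-layer adapter rank derived from the quantization signal.
layer_ranks = {name: rank_for_bits(bits) for name, bits in layer_bits.items()}

for name, rank in layer_ranks.items():
    print(f"{name}: {layer_bits[name]}-bit -> rank {rank}")
```

With a base rank of 8, this reproduces the rank-8 / rank-16 split that --rank 8 --rank-scaling by_bits is described as producing.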
Why this should work, intuitively
LoRA's premise is that fine-tuning updates lie in a low-rank subspace, but it doesn't say anything about which layers need more capacity. In practice, the capacity needed to fit a target task is not uniform across layers. It tends to correlate with output-distribution sensitivity: a layer that, when perturbed, shifts the logits a lot is also a layer where small changes during fine-tuning produce large behavioral changes. Both phenomena are about the gain between a layer's weights and the model's output.
So the same signal that tells us "this layer is fragile under quantization" also tells us "this layer is responsive to fine-tuning." Allocating more rank where there's more signal-amplitude makes sense.
The empirical result
The comparison is head-to-head on a 6-category logical-puzzles reasoning dataset on which the base model produces correct chain-of-thought but never closes with the dataset's \boxed{...} answer format. The setup is Qwen3.5-4B-OptIQ-4bit, 1 epoch over 200 training samples, evaluated on 100 disjoint test samples by extracting the answer from the boxed expression.
| Config | Trainable params | Val loss | Test accuracy |
|---|---|---|---|
| Base (no LoRA) | — | — | 0 / 100 = 0% |
| Constant rank-8 | 11.58 M | 0.328 | 27 / 100 = 27% |
| by_bits (rank 8 / 16) | 13.49 M (+16%) | 0.319 | 35 / 100 = 35% |
| Constant rank-16 | 22.20 M (+92%) | 0.360 | 36 / 100 = 36% |
by_bits matches constant rank-16 on accuracy (35% vs 36%, within noise at n=100) using 39% fewer trainable parameters. Versus the rank-8 baseline at a similar parameter budget, by_bits gains 8 accuracy points (27 → 35) for 16% more params. Both comparisons land by_bits on the Pareto frontier.
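The percentages quoted above follow directly from the table, and the "within noise" remark can be sanity-checked with a rough binomial standard error. The sketch below uses only the table's numbers; the helper names are illustrative, and the standard-error formula is a simple approximation that treats each test sample as an independent trial.

```python
import math

# Trainable parameter counts (millions) and correct answers out of n = 100,
# copied from the results table.
params_m = {"rank-8": 11.58, "by_bits": 13.49, "rank-16": 22.20}
correct = {"rank-8": 27, "by_bits": 35, "rank-16": 36}
n = 100


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def std_err(k: int, n: int) -> float:
    """Approximate binomial standard error of an accuracy estimate k/n."""
    p = k / n
    return math.sqrt(p * (1.0 - p) / n)


saving_vs_rank16 = -pct_change(params_m["by_bits"], params_m["rank-16"])
extra_vs_rank8 = pct_change(params_m["by_bits"], params_m["rank-8"])

print(f"by_bits vs rank-16: {saving_vs_rank16:.1f}% fewer params")
print(f"by_bits vs rank-8:  {extra_vs_rank8:.1f}% more params")
for name in ("by_bits", "rank-16"):
    se_points = 100.0 * std_err(correct[name], n)
    print(f"{name}: {correct[name]}% accuracy, std err ~{se_points:.1f} points")
```

At n=100 the standard error on each accuracy figure is several points, so a one-point difference between 35% and 36% is not distinguishable.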
The base model scores 0/100 because it never emits the dataset's \boxed{...} format. It produces valid step-by-step reasoning but doesn't recognise the answer convention. All three LoRA configs learn the format from one epoch over 200 examples; the differences in accuracy are about how cleanly each preserves the underlying reasoning while bolting on the format.
Per-category breakdown across the three configs:
| Category | n | rank-8 | rank-16 | by_bits |
|---|---|---|---|---|
| Numeral Conversion | 39 | 21 | 31 | 31 |
| Unit Conversion | 13 | 2 | 0 | 4 |
| Equation Transformation | 35 | 4 | 5 | 0 |
| Bit Manipulation | 8 | 0 | 0 | 0 |
| Text Encryption | 3 | 0 | 0 | 0 |
| Gravitational Constant | 2 | 0 | 0 | 0 |
by_bits ties rank-16 on Numeral Conversion (31/39 each, vs 21 for rank-8) and leads on Unit Conversion (4/13 vs 2 and 0); these are the structured pattern-matching tasks. It comes within one point of the larger rank-16 adapter on the overall total, but loses on Equation Transformation (0/35 vs 4 and 5 for the constant-rank configs). The categories where all three score 0 (Bit Manipulation, Text Encryption, Gravitational Constant) are dataset-coverage limits: only ~10–25 training samples each out of 200 mixed-type samples, not enough to learn from in one epoch. The Equation Transformation gap for by_bits on the same data is a real category-specific trade-off worth flagging: concentrating extra rank on the high-bit (mostly attention) layers appears to help format-style tasks and hurt symbolic-manipulation ones. Real-world recipe: pair by_bits with the right dataset balance for your task.
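As a quick consistency check, the per-category counts can be summed to confirm they reproduce the headline totals in the first results table. The sketch below only re-enters the table's numbers; the variable names are illustrative.

```python
# Correct answers per category, copied from the per-category table.
# Tuple order: (n, rank-8, rank-16, by_bits)
per_category = {
    "Numeral Conversion": (39, 21, 31, 31),
    "Unit Conversion": (13, 2, 0, 4),
    "Equation Transformation": (35, 4, 5, 0),
    "Bit Manipulation": (8, 0, 0, 0),
    "Text Encryption": (3, 0, 0, 0),
    "Gravitational Constant": (2, 0, 0, 0),
}

# Sum each column across categories.
totals = [sum(column) for column in zip(*per_category.values())]
n_total, rank8_total, rank16_total, by_bits_total = totals

print(f"test samples: {n_total}")
print(f"rank-8: {rank8_total}, rank-16: {rank16_total}, by_bits: {by_bits_total}")
```

The column sums come to 100 test samples and 27, 36, and 35 correct answers, matching the overall accuracy figures.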
The training-ceiling map
What follows is the empirical story we wish we'd had when we started fine-tuning mlx-optiq quants on a 36 GB Mac. All entries verified end-to-end against a real Hermes-traces dataset with the default config (q_proj, v_proj, num_layers=16, rank=8, rank_scaling=by_bits). Both iters stable, zero memory drift.
| Model | Max seq len | Peak mem | Tokens / sec | Time / iter |
|---|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 | 96 s |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 | 63 s |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 | 84 s |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 | 65 s |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB | 11.4 | 45 s |
| gemma-4-26B-A4B | 512 | 27.6 GB | 22.2 | 32 s |
| Qwen3.5/3.6-35B-A3B | 128 | 25.3 GB | 32.2 | 17 s |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 | 11 s |
Two cliffs, not one
Pushing past these ceilings hits two distinct failure modes that show up at different points:
- Memory cliff (~27–28 GB peak). When peak crosses the system-default GPU-wired cap, macOS absorbs the overflow via compressed memory. It works, but throughput drops 9–30 % depending on the activation/static-weight ratio. 9 B is most sensitive (−30 % at 28.0 GB); 27 B is least (−9 % at 29.2 GB) since most of its footprint is static weights.
- MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously-bound MTLResources at 499 K, and per-iter resource count grows with both num_layers and seq_len. 2 B at T = 3,200 hits a hard kIOGPUCommandBufferCallbackErrorOutOfMemory at iter 1 even though peak memory is only 22 GB. Don't extrapolate "more headroom in GB" → "can push T further."
The numbers in the table above are the conservative recipes: proven safe, proven reproducible. Pushing them further is possible if you know which cliff you're approaching.
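One way to use the table is as a pre-flight check before launching a run. The sketch below is a hypothetical helper, not part of mlx-optiq: `MAX_SEQ_LEN` and `fits_ceiling` are names invented for the example, the combined Qwen3.5/3.6 rows are expanded into separate keys on the assumption that both variants share the listed ceiling, and the values apply only to the 36 GB machine and default config described above.

```python
# Conservative max sequence lengths from the training-ceiling table
# (36 GB Mac, default config). Keys are the model names as listed there.
MAX_SEQ_LEN = {
    "Qwen3.5-0.8B": 2800,
    "Qwen3.5-2B": 2400,
    "Qwen3.5-4B": 1600,
    "Qwen3.5-9B": 1400,
    "Qwen3.5-27B": 512,
    "Qwen3.6-27B": 512,
    "gemma-4-26B-A4B": 512,
    "Qwen3.5-35B-A3B": 128,
    "Qwen3.6-35B-A3B": 128,
    "gemma-4-31B-it": 32,
}


def fits_ceiling(model: str, seq_len: int) -> bool:
    """Hypothetical check: True if seq_len is within the tested ceiling."""
    if model not in MAX_SEQ_LEN:
        raise KeyError(f"no tested ceiling recorded for {model!r}")
    return seq_len <= MAX_SEQ_LEN[model]


print(fits_ceiling("Qwen3.5-2B", 2400))  # the tested recipe
print(fits_ceiling("Qwen3.5-2B", 3200))  # the length reported to fail at iter 1
```

A check like this compares against sequence length rather than peak memory, which matters because the resource-count cliff can be hit while there is still headroom in GB.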
Output is PEFT-compatible
Adapter output is standard adapter_config.json + adapters.safetensors, loadable by peft, mlx-lm, or any tool in the LoRA ecosystem. mlx-optiq adds one extra file: optiq_lora_config.json records the per-layer rank distribution so you can inspect what by_bits actually picked.
    $ optiq lora info ./my_adapter
    # mlx-optiq LoRA adapter
    # base model: mlx-community/Qwen3.5-9B-OptiQ-4bit
    # total params: 2.9M trainable
    # rank distribution:
    # rank 8 (4-bit layers): 96 modules
    # rank 16 (8-bit layers): 16 modules
Hot-swap at serve time
Once you have an adapter, the mounted-LoRA primitive lets you keep N of them resident on a single base and switch per request via a ContextVar the server flips. ~50 MB per extra adapter on top of one base, vs ~5.6 GB per full model copy. Details in the serve docs.
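The per-request switching pattern can be illustrated with Python's standard `contextvars` module. This is a toy sketch of the general idea, not the mlx-optiq server: `ADAPTERS`, `active_adapter`, `generate`, and `handle_request` are hypothetical names, and strings stand in for loaded adapter weights.

```python
from contextvars import ContextVar
from typing import Optional

# Hypothetical registry of resident adapters; strings stand in for weights.
ADAPTERS = {"support": "adapter-A weights", "sql": "adapter-B weights"}

# None means "serve the base model with no adapter applied".
active_adapter: ContextVar[Optional[str]] = ContextVar(
    "active_adapter", default=None
)


def generate(prompt: str) -> str:
    """Stand-in for the forward pass: reads the adapter chosen for this request."""
    name = active_adapter.get()
    label = name if name is not None else "base"
    return f"[{label}] {prompt}"


def handle_request(prompt: str, adapter: Optional[str] = None) -> str:
    """Select an adapter for the duration of one request, then restore."""
    if adapter is not None and adapter not in ADAPTERS:
        raise KeyError(f"adapter {adapter!r} is not mounted")
    token = active_adapter.set(adapter)
    try:
        return generate(prompt)
    finally:
        # Restore the previous value so other requests are unaffected.
        active_adapter.reset(token)


print(handle_request("hello", adapter="sql"))
print(handle_request("hello"))
```

Because a ContextVar is scoped to the current context, concurrent requests in separate threads or asyncio tasks each see their own selection while sharing one set of base weights.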
Full reference for the trainer, all rank-scaling modes, and the data format is in the LoRA fine-tuning guide.
— the mlx-optiq team