Sensitivity-aware LoRA — one signal, two optimizations.
Standard LoRA gives every adapted layer the same rank — the same number of trainable parameters per layer, regardless of how much that layer matters to the model's output. mlx-optiq already knows which layers matter: they got more bits during quantization. Reusing that signal at training time turns out to be a meaningful win.
The same layers we kept at 8-bit during quantization also get more adapter rank during fine-tuning. Lower validation loss at the same parameter budget.
The mechanic
When you quantize an mlx-optiq model, we record the per-layer bit assignments in a sidecar — optiq_metadata.json, shipped inside every quant on Hugging Face. Every layer's quantization bit-width is right there. optiq lora train reads it and uses it to scale adapter rank:
- 4-bit layers (the model's robust ones) → base rank.
- 8-bit layers (the model's sensitive ones) → 2× base rank.
At --rank 8 --rank-scaling by_bits, that's rank-8 for 4-bit layers and rank-16 for 8-bit layers — same total parameter budget as constant rank-10, but capacity is moved toward the layers that demonstrably affect output more.
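To make the mechanic concrete, here is a minimal sketch of the rank-assignment step in Python. The schema of optiq_metadata.json shown (a layer_bits mapping) and the assign_ranks helper are assumptions for illustration, not the trainer's actual internals:

```python
import json

def assign_ranks(metadata_path: str, base_rank: int = 8) -> dict:
    """Map each adapted layer to a LoRA rank based on its quantization bit-width.

    Assumes the sidecar looks roughly like
    {"layer_bits": {"model.layers.0.self_attn.q_proj": 4, ...}};
    the real optiq_metadata.json schema may differ.
    """
    with open(metadata_path) as f:
        meta = json.load(f)

    ranks = {}
    for layer_name, bits in meta["layer_bits"].items():
        # 8-bit layers were flagged as sensitive at quantization time -> 2x rank.
        ranks[layer_name] = base_rank * 2 if bits == 8 else base_rank
    return ranks

# ranks = assign_ranks("optiq_metadata.json", base_rank=8)
# -> rank 8 for the 4-bit layers, rank 16 for the 8-bit layers
```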
Why this should work, intuitively
LoRA's premise is that fine-tuning updates lie in a low-rank subspace — but it doesn't say anything about which layers need more capacity. In practice, the layers that need more capacity to fit a target task are not uniform. They tend to correlate with output-distribution sensitivity: a layer that, when perturbed, shifts the logits a lot is also a layer where small changes during fine-tuning produce large behavioral changes. Both phenomena are about the gain between a layer's weights and the model's output.
So the same signal that tells us "this layer is fragile under quantization" also tells us "this layer is responsive to fine-tuning." Allocating more rank where that gain is larger makes sense.
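One way to make the gain intuition concrete is a crude per-layer probe: perturb one layer's weights slightly and measure how far the output distribution moves. This is only an illustration of the idea, not the scoring mlx-optiq runs during quantization; forward, weights, and the KL metric are stand-ins:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Mean KL(p || q) over the vocabulary axis."""
    eps = 1e-9
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def layer_sensitivity(forward, weights: dict, layer_name: str,
                      calib_batch, noise_scale: float = 1e-3) -> float:
    """Output-distribution shift caused by a small perturbation of one layer.

    `forward(weights, batch)` is a stand-in returning softmax probabilities.
    Layers with a high score are the "sensitive" ones: the same layers that
    keep 8 bits at quantization time and get 2x rank at fine-tuning time.
    """
    baseline = forward(weights, calib_batch)

    w = weights[layer_name]
    perturbed = dict(weights)
    perturbed[layer_name] = w + noise_scale * np.std(w) * np.random.randn(*w.shape)

    return kl_divergence(baseline, forward(perturbed, calib_batch))
```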
The empirical result
On a small GSM8K fine-tuning A/B (Qwen3.5-9B-OptiQ-4bit, identical hyperparameters, identical seeds, identical data), by_bits rank scaling gives:
| Method | Trainable params | Val loss @ iter 50 |
|---|---|---|
| Constant rank-10 | ~3.2 M | 2.41 |
| by_bits (rank 8 / 16) | ~3.2 M | 2.12 (−12 %) |
−12 % validation loss at the same parameter budget. Not earth-shattering, but free — same training time, same memory, just a smarter allocation.
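The "same budget" claim is easy to check for any model, since a LoRA module of shape (d_out, d_in) at rank r adds r × (d_in + d_out) trainable parameters. A quick helper, with module shapes supplied by you (the dict layout here is illustrative):

```python
def lora_param_count(ranks: dict, shapes: dict) -> int:
    """Total trainable LoRA params: each module adds A (r x d_in) plus B (d_out x r)."""
    return sum(r * (shapes[name][0] + shapes[name][1]) for name, r in ranks.items())

# shapes:   {module_name: (d_out, d_in)} read from the base model
# constant = lora_param_count({name: 10 for name in shapes}, shapes)
# by_bits  = lora_param_count(assign_ranks("optiq_metadata.json"), shapes)
```

Whether the two totals match exactly depends on the fraction f of 8-bit modules in your model: with uniform module shapes the by_bits average rank is 8 + 8f, so f = 0.25 reproduces a constant rank-10 budget.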
The training-ceiling map
What follows is the empirical story we wish we'd had when we started fine-tuning mlx-optiq quants on a 36 GB Mac. All entries verified end-to-end against a real Hermes-traces dataset with the default config (q_proj, v_proj, num_layers=16, rank=8, rank_scaling=by_bits): training stable across iterations, zero memory drift.
| Model | Max seq len | Peak mem | Tokens / sec | Time / iter |
|---|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 | 96 s |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 | 63 s |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 | 84 s |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 | 65 s |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB | 11.4 | 45 s |
| gemma-4-26B-A4B | 512 | 27.6 GB | 22.2 | 32 s |
| Qwen3.5/3.6-35B-A3B | 128 | 25.3 GB | 32.2 | 17 s |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 | 11 s |
Two cliffs, not one
Pushing past these ceilings hits two distinct failure modes that show up at different points:
- Memory cliff (~27–28 GB peak). When peak crosses the system-default GPU-wired cap, macOS absorbs the overflow via compressed memory. It works, but throughput drops 9–30 % depending on the activation/static-weight ratio. 9 B is most sensitive (−30 % at 28.0 GB); 27 B is least (−9 % at 29.2 GB) since most of its footprint is static weights.
- MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously-bound MTLResources at 499 K, and per-iter resource count grows with both num_layers and seq_len. 2 B at T = 3,200 hits a hard kIOGPUCommandBufferCallbackErrorOutOfMemory at iter 1 even though peak memory is only 22 GB. Don't extrapolate "more headroom in GB" → "can push T further."
The numbers in the table above are the conservative recipes — proven safe, proven reproducible. Pushing them further is possible if you know which cliff you're approaching.
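If you do push past the table, it helps to check which cliff you are approaching after a single warm-up iteration rather than discovering it at iter 40. A rough sketch of such a guard: mlx exposes peak-memory introspection, but the exact call has moved between releases, so it is passed in here as a stand-in, and the 27.5 GB threshold is just the neighbourhood of the wired cap described above:

```python
MEMORY_CLIFF_GB = 27.5  # approximate point where macOS starts compressing (see above)

def check_memory_cliff(run_one_iteration, get_peak_memory_bytes) -> None:
    """Run one warm-up iteration, then report how close peak memory is to the cliff.

    `run_one_iteration` is a zero-arg callable doing one full forward/backward/update;
    `get_peak_memory_bytes` is whatever peak-memory call your mlx version exposes.
    """
    run_one_iteration()
    peak_gb = get_peak_memory_bytes() / 1024**3

    if peak_gb >= MEMORY_CLIFF_GB:
        print(f"peak {peak_gb:.1f} GB: past the compressed-memory cliff; "
              "expect a 9-30% throughput hit, or reduce seq_len / num_layers")
    else:
        print(f"peak {peak_gb:.1f} GB: {MEMORY_CLIFF_GB - peak_gb:.1f} GB of headroom, "
              "but the MTLResource-count cliff is independent of bytes, "
              "so raise seq_len in small steps")
```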
Output is PEFT-compatible
Adapter output is standard adapter_config.json + adapters.safetensors — loadable by peft, mlx-lm, or any tool in the LoRA ecosystem. mlx-optiq adds one extra file: optiq_lora_config.json records the per-layer rank distribution so you can inspect what by_bits actually picked.
```
$ optiq lora info ./my_adapter
# mlx-optiq LoRA adapter
# base model: mlx-community/Qwen3.5-9B-OptiQ-4bit
# total params: 2.9M trainable
# rank distribution:
#   rank 8 (4-bit layers): 96 modules
#   rank 16 (8-bit layers): 16 modules
```
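Loading the adapter outside of optiq follows the usual paths. A sketch using mlx-lm's adapter_path argument; the repo name comes from the example above, and the generate signature may differ slightly across mlx-lm releases:

```python
from mlx_lm import load, generate

# Mount the adapter directory (adapter_config.json + adapters.safetensors) on the OptiQ base.
model, tokenizer = load(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    adapter_path="./my_adapter",
)

print(generate(model, tokenizer, prompt="A farmer has 12 sheep...", max_tokens=64))
```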
Hot-swap at serve time
Once you have an adapter, the mounted-LoRA primitive lets you keep N of them resident on a single base and switch per request via a ContextVar the server flips. ~50 MB per extra adapter on top of one base — vs ~5.6 GB per full model copy. Details in the serve docs.
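The per-request switch is plain Python contextvars. A minimal sketch of the pattern, not the mlx-optiq server code; the adapter registry and the way the adapter delta is applied are stand-ins:

```python
from contextvars import ContextVar

# Which mounted adapter the current request should use; None means plain base model.
active_adapter = ContextVar("active_adapter", default=None)

class AdapterMount:
    """Keeps several LoRA adapters resident alongside a single base model."""

    def __init__(self, base_model):
        self.base = base_model
        self.adapters = {}  # name -> loaded adapter weights (~50 MB each)

    def mount(self, name, adapter_weights):
        self.adapters[name] = adapter_weights

    def forward(self, batch):
        name = active_adapter.get()
        lora = self.adapters.get(name) if name is not None else None
        # Stand-in: the real forward would add the selected adapter's low-rank delta.
        return self.base(batch, lora=lora)

# Per request, the server flips the ContextVar around the model call:
# token = active_adapter.set("gsm8k-math")
# try:
#     out = mount.forward(batch)
# finally:
#     active_adapter.reset(token)
```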
Full reference for the trainer, all rank-scaling modes, and the data format is in the LoRA fine-tuning guide.
— the mlx-optiq team