mlx-optiq
Engineering · April 8, 2026

Sensitivity-aware LoRA: one signal, two optimizations.

Sensitivity-aware LoRA matches constant rank-16 accuracy on a logical-puzzles reasoning task using 39% fewer trainable parameters by reusing the same per-layer bit signal mlx-optiq computes during quantization. Standard LoRA gives every adapted layer the same rank regardless of which layers matter most to the model's output. mlx-optiq already knows which layers matter: they got more bits during quantization. Reusing that signal at training time produces a measurable Pareto improvement.

The same layers we kept at 8-bit during quantization also get more adapter rank during fine-tuning. The result is lower validation loss than either constant-rank baseline, at a parameter budget close to rank-8.

The mechanic

When you quantize an mlx-optiq model, we record the per-layer bit assignments in a sidecar: optiq_metadata.json, shipped inside every quant on Hugging Face. Every layer's quantization bit-width is right there. optiq lora train reads it and uses it to scale adapter rank:

  • 4-bit layers (the model's robust ones) → base rank.
  • 8-bit layers (the model's sensitive ones) → 2× base rank.

At --rank 8 --rank-scaling by_bits, that's rank-8 for 4-bit layers and rank-16 for 8-bit layers. That is roughly the total parameter budget of constant rank-9 or rank-10 (+16% over rank-8 in the run below), but capacity is moved toward the layers that demonstrably affect output more.
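The mapping above can be sketched in a few lines. This is a minimal illustration, not the optiq lora train implementation: the bits_by_layer dictionary stands in for the contents of optiq_metadata.json, whose actual schema is not shown in this post, and the layer names and function names are hypothetical.

```python
def rank_for_bits(bits: int, base_rank: int) -> int:
    """by_bits rule as described in the text: 4-bit -> base rank, 8-bit -> 2x base rank."""
    if bits == 4:
        return base_rank
    if bits == 8:
        return 2 * base_rank
    # The post only describes 4-bit and 8-bit layers, so anything else is rejected here.
    raise ValueError(f"unsupported bit-width: {bits}")


def assign_ranks(bits_by_layer: dict, base_rank: int = 8) -> dict:
    """Map every layer name to its adapter rank."""
    return {name: rank_for_bits(bits, base_rank) for name, bits in bits_by_layer.items()}


# Hypothetical per-layer bit assignments (assumed shape, not the real sidecar schema).
bits_by_layer = {
    "layers.0.self_attn.q_proj": 8,
    "layers.0.self_attn.v_proj": 4,
    "layers.1.self_attn.q_proj": 4,
    "layers.1.self_attn.v_proj": 4,
}

ranks = assign_ranks(bits_by_layer, base_rank=8)
for name, rank in ranks.items():
    print(f"{name}: rank {rank}")
```

With base rank 8, the single 8-bit layer in this toy input gets rank 16 and the three 4-bit layers stay at rank 8.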

Why this should work, intuitively

LoRA's premise is that fine-tuning updates lie in a low-rank subspace, but it doesn't say anything about which layers need more capacity. In practice, the layers that need more capacity to fit a target task are not uniform. They tend to correlate with output-distribution sensitivity: a layer that, when perturbed, shifts the logits a lot is also a layer where small changes during fine-tuning produce large behavioral changes. Both phenomena are about the gain between a layer's weights and the model's output.

So the same signal that tells us "this layer is fragile under quantization" also tells us "this layer is responsive to fine-tuning." Allocating more rank where there's more signal-amplitude makes sense.

The empirical result

The comparison is head-to-head on a 6-category logical-puzzles reasoning dataset where the base model produces correct chain-of-thought but never closes with the dataset's \boxed{...} answer format. The model is Qwen3.5-4B-OptIQ-4bit, trained for 1 epoch over 200 training samples and evaluated on 100 disjoint test samples by extracting the answer from the boxed expression.

| Config | Trainable params | Val loss | Test accuracy |
|---|---|---|---|
| Base (no LoRA) | | | 0 / 100 = 0% |
| Constant rank-8 | 11.58 M | 0.328 | 27 / 100 = 27% |
| by_bits (rank 8 / 16) | 13.49 M (+16%) | 0.319 | 35 / 100 = 35% |
| Constant rank-16 | 22.20 M (+92%) | 0.360 | 36 / 100 = 36% |

by_bits matches constant rank-16 on accuracy (35% vs 36%, within noise at n=100) using 39% fewer trainable parameters. Versus the rank-8 baseline, by_bits gains 8 accuracy points (27 → 35) for 16% more parameters. Both comparisons place by_bits on the Pareto frontier.
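The percentages above follow directly from the table, and the "within noise" remark can be made concrete with a binomial standard error. The sketch below is a back-of-the-envelope check, assuming each test item is an independent pass/fail trial; it is not part of the original evaluation.

```python
import math

# Figures copied from the results table.
params_m = {"rank8": 11.58, "by_bits": 13.49, "rank16": 22.20}  # millions of trainable params
correct = {"rank8": 27, "by_bits": 35, "rank16": 36}             # correct answers out of n
n = 100

savings_vs_rank16 = 1 - params_m["by_bits"] / params_m["rank16"]
overhead_vs_rank8 = params_m["by_bits"] / params_m["rank8"] - 1
print(f"by_bits vs rank-16: {savings_vs_rank16:.0%} fewer params")
print(f"by_bits vs rank-8:  {overhead_vs_rank8:.0%} more params")


def std_err(k: int, n: int) -> float:
    """Standard error of an accuracy estimate k/n under a binomial model."""
    p = k / n
    return math.sqrt(p * (1 - p) / n)


for name, k in correct.items():
    print(f"{name}: {k / n:.0%} +/- {std_err(k, n):.1%}")
```

Each accuracy estimate carries a standard error of a little under 5 points at n=100, so the 1-point gap between 35% and 36% is well inside sampling noise.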

The base model scores 0/100 because it never emits the dataset's \boxed{...} format. It produces valid step-by-step reasoning but doesn't recognise the answer convention. All three LoRA configs learn the format from one epoch over 200 examples; the differences in accuracy are about how cleanly each preserves the underlying reasoning while bolting on the format.

Per-category breakdown across the three configs:

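| Category | n | rank-8 | rank-16 | by_bits |
|---|---|---|---|---|
| Numeral Conversion | 39 | 21 | 31 | 31 |
| Unit Conversion | 13 | 2 | 0 | 4 |
| Equation Transformation | 35 | 4 | 5 | 0 |
| Bit Manipulation | 8 | 0 | 0 | 0 |
| Text Encryption | 3 | 0 | 0 | 0 |
| Gravitational Constant | 2 | 0 | 0 | 0 |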

by_bits beats both constant-rank configs on Unit Conversion and ties rank-16 on Numeral Conversion (the structured pattern-matching tasks), and it roughly ties the rank-16 config on the overall total (35 vs 36). It loses on Equation Transformation (0/35 vs 4 and 5 for the constant-rank configs). The categories where all three score 0 (Bit, Text, Gravitational) are dataset-coverage limits: only ~10–25 training samples each in 200 mixed-type total, not enough to learn from one epoch. The Equation gap on by_bits at the same data is a real category-specific trade-off worth flagging: concentrating extra rank on the high-bit (mostly attention) layers appears to help format-style tasks and disadvantage symbolic-manipulation ones. Real-world recipe: pair by_bits with the right dataset balance for your task.

The training-ceiling map

What follows is the empirical story we wish we'd had when we started fine-tuning mlx-optiq quants on a 36 GB Mac. All entries verified end-to-end against a real Hermes-traces dataset with the default config (q_proj, v_proj, num_layers=16, rank=8, rank_scaling=by_bits). Both iters stable, zero memory drift.

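| Model | Max seq len | Peak mem | Tokens / sec | Time / iter |
|---|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 | 96 s |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 | 63 s |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 | 84 s |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 | 65 s |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB | 11.4 | 45 s |
| gemma-4-26B-A4B | 512 | 27.6 GB | 22.2 | 32 s |
| Qwen3.5/3.6-35B-A3B | 128 | 25.3 GB | 32.2 | 17 s |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 | 11 s |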

Two cliffs, not one

Pushing past these ceilings hits two distinct failure modes that show up at different points:

  • Memory cliff (~27–28 GB peak). When peak crosses the system-default GPU-wired cap, macOS absorbs the overflow via compressed memory. It works, but throughput drops 9–30 % depending on the activation/static-weight ratio. 9 B is most sensitive (−30 % at 28.0 GB); 27 B is least (−9 % at 29.2 GB) since most of its footprint is static weights.
  • MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously-bound MTLResources at 499 K, and per-iter resource count grows with both num_layers and seq_len. 2 B at T = 3,200 hits a hard kIOGPUCommandBufferCallbackErrorOutOfMemory at iter 1 even though peak memory is only 22 GB. Don't extrapolate "more headroom in GB" → "can push T further."
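One way to keep the two failure modes apart when reading a slow or failed run is a small triage helper like the one below. It is a hypothetical sketch: the function name, the 27 GB default (taken from the low end of the range quoted above), and the idea of passing in the error string are assumptions for illustration, not part of mlx-optiq.

```python
from typing import Optional

# Error name quoted in the text for the hard failure.
HARD_OOM = "kIOGPUCommandBufferCallbackErrorOutOfMemory"


def diagnose_run(peak_gb: float, error: Optional[str] = None, wired_cap_gb: float = 27.0) -> str:
    """Label which cliff a run most likely hit (hypothetical triage helper)."""
    hard_failure = error is not None and HARD_OOM in error
    if hard_failure and peak_gb < wired_cap_gb:
        # Failed outright with bytes to spare: points at resource count, not memory size.
        return "resource-count cliff: lower seq_len or num_layers"
    if hard_failure:
        return "hard failure at or above the memory cap: cause is ambiguous"
    if peak_gb >= wired_cap_gb:
        # The run still completes, but overflow goes through compressed memory.
        return "memory cliff: run completes but expect reduced throughput"
    return "within ceiling"


print(diagnose_run(22.0, HARD_OOM))  # the 2 B, T = 3,200 case from the text
print(diagnose_run(28.0))            # the 9 B case at 28.0 GB peak
print(diagnose_run(25.4))            # the 9 B recipe from the table
```

The point of the ordering is that a hard error at low peak memory is classified first, so spare gigabytes are never read as permission to raise the sequence length.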

The numbers in the table above are the conservative recipes: proven safe, proven reproducible. Pushing them further is possible if you know which cliff you're approaching.

Output is PEFT-compatible

Adapter output is standard adapter_config.json + adapters.safetensors, loadable by peft, mlx-lm, or any tool in the LoRA ecosystem. mlx-optiq adds one extra file: optiq_lora_config.json records the per-layer rank distribution so you can inspect what by_bits actually picked.

```bash
$ optiq lora info ./my_adapter
# mlx-optiq LoRA adapter
# base model:  mlx-community/Qwen3.5-9B-OptiQ-4bit
# total params: 2.9M trainable
# rank distribution:
#   rank 8  (4-bit layers): 96 modules
#   rank 16 (8-bit layers): 16 modules
```
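The rank distribution in that summary amounts to a tally of modules per rank. The sketch below reproduces the tally from a per-module rank mapping built in memory. The actual schema of optiq_lora_config.json is not shown in this post, so the mapping and the module names are made up; only the 96 and 16 module counts come from the sample output.

```python
from collections import Counter

# Hypothetical per-module ranks: 16 modules at rank 16 and 96 at rank 8,
# matching the counts in the sample output. Module names are placeholders.
rank_by_module = {f"module_{i}": (16 if i < 16 else 8) for i in range(112)}

distribution = Counter(rank_by_module.values())
for rank in sorted(distribution):
    print(f"rank {rank}: {distribution[rank]} modules")

# Module-weighted average rank: a rough proxy for the equivalent constant rank,
# exact only if every adapted module has the same input and output dimensions.
average_rank = sum(rank_by_module.values()) / len(rank_by_module)
print(f"average rank: {average_rank:.2f}")
```

For this split the average works out to about 9.14, which gives a quick feel for how far by_bits moves the budget away from a constant rank-8.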

Hot-swap at serve time

Once you have an adapter, the mounted-LoRA primitive lets you keep N of them resident on a single base and switch per request via a ContextVar the server flips. ~50 MB per extra adapter on top of one base, vs ~5.6 GB per full model copy. Details in the serve docs.
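The per-request switch can be illustrated with Python's standard contextvars module. This is a toy sketch of the general pattern, not the mlx-optiq server: MOUNTED, forward, and handle_request are hypothetical names, and the "adapters" are plain strings standing in for resident adapter weights.

```python
import asyncio
from contextvars import ContextVar

# Name of the adapter the current request should use; None means base model only.
active_adapter = ContextVar("active_adapter", default=None)

# Stand-ins for adapters kept resident on top of a single base model.
MOUNTED = {"support-bot": "adapter A weights", "sql-helper": "adapter B weights"}


def forward(prompt: str) -> str:
    """Pretend forward pass that looks up whichever adapter this request selected."""
    name = active_adapter.get()
    if name is not None and name not in MOUNTED:
        raise KeyError(f"adapter not mounted: {name}")
    return f"[{name or 'base'}] {prompt}"


async def handle_request(prompt: str, adapter_name) -> str:
    token = active_adapter.set(adapter_name)  # flip the switch for this request only
    try:
        await asyncio.sleep(0)  # yield so concurrent requests interleave
        return forward(prompt)
    finally:
        active_adapter.reset(token)


async def main():
    return await asyncio.gather(
        handle_request("hello", "support-bot"),
        handle_request("SELECT 1", "sql-helper"),
        handle_request("plain", None),
    )


results = asyncio.run(main())
print(results)
```

Because asyncio runs each gathered coroutine as a task with its own copy of the context, one request setting the variable does not leak into another, which is what makes a ContextVar a natural fit for per-request adapter selection.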

Full reference for the trainer, all rank-scaling modes, and the data format is in the LoRA fine-tuning guide.

— the mlx-optiq team