# LoRA fine-tuning
mlx-optiq ships a LoRA trainer that reads its own per-layer sensitivity assignments and gives sensitive layers proportionally more adapter capacity. Output is PEFT-compatible (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar describing the per-layer rank distribution.
The same layers mlx-optiq kept at 8-bit during quantization also get more adapter rank during fine-tuning. One signal, two optimizations.
## The basic recipe
```bash
# Train on JSONL data — same format as mlx-lm's lora trainer
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --max-seq-length 1400 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 1000 \
    -o ./my_adapter

# Show the per-layer rank distribution
$ optiq lora info ./my_adapter
```
## Data format
Standard JSONL — one example per line, with a `text` field (or `messages` for chat). Same format as `mlx_lm.lora`:
{"text": "Q: What is 2+2? A: 4"}
{"text": "Q: ... A: ..."}
{"messages": [{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}]}
Layout on disk:
```
my_training_data/
├── train.jsonl
└── valid.jsonl   # optional, used for validation loss
```
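If you're assembling the data in Python, the standard `json` module is all you need. A minimal sketch that produces the layout above (the example rows and the 90/10 split are illustrative, not requirements):

```python
import json
from pathlib import Path

examples = [
    {"text": "Q: What is 2+2? A: 4"},
    {"messages": [{"role": "user", "content": "What is 2+2?"},
                  {"role": "assistant", "content": "4"}]},
]

out = Path("my_training_data")
out.mkdir(exist_ok=True)

# 90/10 train/validation split (arbitrary; adjust to your dataset size)
split = max(1, int(len(examples) * 0.9))
for name, rows in [("train.jsonl", examples[:split]),
                   ("valid.jsonl", examples[split:])]:
    with open(out / name, "w") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```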
## Sensitivity-aware rank scaling
`--rank-scaling by_bits` gives each layer an adapter rank proportional to its quantization bit-width. With `--rank 8`:
- Layers mlx-optiq quantized at 4-bit get rank 8.
- Layers mlx-optiq quantized at 8-bit get rank 16.
The total parameter budget stays roughly the same as with a constant rank, but capacity moves to where it matters. On a GSM8K subset, this consistently produced lower validation loss than constant rank by iteration 50.
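The by_bits arithmetic is simple enough to sanity-check by hand. A minimal sketch, assuming rank scales linearly with bit-width against the 4-bit baseline; the helper function and the layer names are illustrative, not mlx-optiq API:

```python
# Illustrative only; not the mlx-optiq API. Assumes rank scales
# linearly with bit-width relative to a 4-bit baseline.
def adapter_rank(base_rank: int, layer_bits: int, base_bits: int = 4) -> int:
    return base_rank * layer_bits // base_bits

# Hypothetical per-layer bit assignments from quantization:
layer_bits = {"layers.0.self_attn.q_proj": 4, "layers.3.self_attn.v_proj": 8}
for name, bits in layer_bits.items():
    print(name, "->", adapter_rank(8, bits))
# layers.0.self_attn.q_proj -> 8
# layers.3.self_attn.v_proj -> 16
```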
Other scaling modes:
```bash
# Constant rank (mlx-lm default behavior)
$ optiq lora train ... --rank-scaling constant

# Scale by raw KL sensitivity (more aggressive)
$ optiq lora train ... --rank-scaling by_kl

# Scale by sensitivity quantile
$ optiq lora train ... --rank-scaling by_quantile
```
## Training-ceiling map (36 GB Mac)
Empirical sequence-length and peak-memory ceilings at the system-default `iogpu.wired_limit_mb=0` (no manual override needed; recommended for safety on long runs). Default config: `q_proj`, `v_proj`, `num_layers=16`, `rank=8`, `rank_scaling=by_bits`. All runs were stable across iterations, with zero memory drift.
| Model | Max seq len | Peak mem | Tokens/sec | Time/iter |
|---|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 | 96 s |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 | 63 s |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 | 84 s |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 | 65 s |
| Qwen3.5-27B | 512 | 27.7 GB | 11.4 | 45 s |
| Qwen3.6-27B | 512 | 27.7 GB | 11.4 | 45 s |
| gemma-4-26B-A4B | 512 | 27.6 GB | 22.2 | 32 s |
| Qwen3.5-35B-A3B | 128 | 25.3 GB | 32.2 | 17 s |
| Qwen3.6-35B-A3B | 128 | 25.3 GB | 32.3 | 17 s |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 | 11 s |
- Memory cliff (~27–28 GB peak). When peak crosses the system-default GPU-wired cap, macOS absorbs the overflow via compressed memory — it works, but throughput drops 9–30%. (A sketch for checking the cap on your machine follows this list.)
- MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously-bound MTLResources at 499 K. Per-iter resource count grows with both `num_layers` and `seq_len`. The 2B model at T = 3,200 hits a hard `kIOGPUCommandBufferCallbackErrorOutOfMemory` at iter 1 even though peak memory is only 22 GB. Don't extrapolate "more headroom in GB" → "can push T further."
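Before a long run it can be worth confirming the wired cap on your machine. A minimal sketch, assuming the `iogpu.wired_limit_mb` sysctl is available (recent Apple Silicon macOS; older releases exposed it under `debug.iogpu.*`), with a hypothetical pre-flight check against the table above:

```python
import subprocess

def wired_limit_mb() -> int:
    """Read the GPU wired-memory cap (Apple Silicon macOS).

    Returns 0 when the system default is in effect (no manual override).
    """
    out = subprocess.run(
        ["sysctl", "-n", "iogpu.wired_limit_mb"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

# Hypothetical pre-flight check against the ceilings in the table above.
planned_peak_gb = 25.4  # e.g. Qwen3.5-9B at max seq len 1,400
limit = wired_limit_mb()
if limit and planned_peak_gb * 1024 > limit:
    print(f"planned peak {planned_peak_gb} GB exceeds wired cap of {limit} MB")
```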
## Adapter output
After training:
```
my_adapter/
├── adapter_config.json      # PEFT-compatible config
├── adapters.safetensors     # PEFT-compatible weights
└── optiq_lora_config.json   # mlx-optiq sidecar with per-layer ranks
```
Inspect the per-layer rank distribution:
```bash
$ optiq lora info ./my_adapter
# mlx-optiq LoRA adapter
# base model: mlx-community/Qwen3.5-9B-OptiQ-4bit
# total params: 2.9M trainable
# rank distribution:
#   rank 8  (4-bit layers): 96 modules
#   rank 16 (8-bit layers): 16 modules
```
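The sidecar is plain JSON, so the rank map is also scriptable. A sketch under an assumed schema (a `ranks` mapping of module name to rank; the key names are guesses, not a documented layout):

```python
import json
from collections import Counter

with open("my_adapter/optiq_lora_config.json") as f:
    sidecar = json.load(f)

# Assumed schema: a module-name -> rank mapping under a "ranks" key.
# These key names are guesses; check the file on disk for the real layout.
ranks = sidecar.get("ranks", {})
print(Counter(ranks.values()))  # e.g. Counter({8: 96, 16: 16})
```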
## Loading an adapter
```python
from mlx_lm import load, generate

model, tok = load(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    adapter_path="./my_adapter",
)
print(generate(model, tok, prompt="...", max_tokens=200))
```
## Hot-swap adapters at serve time
mlx-optiq's mounted-LoRA primitive lets you keep N adapters resident on one base, switching per-request. See the serving guide.