mlx-optiq
LoRA fine-tuning

mlx-optiq ships a LoRA trainer that reads its own per-layer sensitivity assignments and gives sensitive layers proportionally more adapter capacity. Output is PEFT-compatible (adapter_config.json + adapters.safetensors) plus an mlx-optiq sidecar describing the per-layer rank distribution.

The same layers mlx-optiq kept at 8-bit during quantization also get more adapter rank during fine-tuning. One signal, two optimizations.

The basic recipe

train.sh (bash)
# Defaults: all transformer blocks adapted, all 7 trainable linears per
# block (Unsloth-aligned), rank 8 with by_bits sensitivity overlay,
# alpha = rank, mask_prompt enabled, max_seq=512, lr=2e-4.
$ optiq lora train mlx-community/Qwen3.5-4B-OptiQ-4bit \
    --data ./my_training_data \
    --iters 200 \
    -o ./my_adapter

# Show the per-layer rank distribution
$ optiq lora info ./my_adapter

Preset bundles for quick rank selection (--preset): small (r=8, α=16), default (r=8, α=8), medium (r=16, α=16), large (r=32, α=32), xl (r=64, α=64), xxl (r=128, α=128). A preset sets the base rank only; with --rank-scaling by_bits (the default), per-layer rank still scales up on layers mlx-optiq kept at higher bit-widths.

Data format

JSONL with either a messages list (chat format) or prompt/completion pairs. Use one of these two formats rather than bare text: a bare-text record has no prompt/response boundary, so prompt masking falls back to full-sequence loss, which degrades quality on tasks where the base model is already competent.

data.jsonl (json)
{"messages": [{"role": "user", "content": "..."},
              {"role": "assistant", "content": "..."}]}
{"prompt": "...", "completion": "..."}

The chat template is applied automatically, so do not pre-template the data. mask_prompt is on by default, so loss is computed only on the response tokens (the assistant turn or the completion).

Layout on disk:

directory layout
my_training_data/
├── train.jsonl
└── valid.jsonl   # optional, used for validation loss
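
As a rough illustration of producing this layout, the sketch below writes question/answer pairs as chat-format records into train.jsonl and valid.jsonl. The helper name write_chat_dataset, the 10% validation split, and the fixed seed are assumptions made for this example; they are not part of mlx-optiq.

```python
import json
import random
from pathlib import Path


def write_chat_dataset(pairs, out_dir, valid_fraction=0.1, seed=0):
    """Write (question, answer) pairs as chat-format JSONL files.

    Produces train.jsonl and, if the validation split is non-empty,
    valid.jsonl. Messages are stored raw: no chat template is applied.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    records = [
        {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for question, answer in pairs
    ]
    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(records)

    n_valid = int(len(records) * valid_fraction)
    splits = {"train.jsonl": records[n_valid:]}
    if n_valid:
        splits["valid.jsonl"] = records[:n_valid]

    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for row in rows:
                # One JSON object per line; keep non-ASCII text readable.
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

Calling write_chat_dataset(pairs, "./my_training_data") with 20 pairs would leave 18 records in train.jsonl and 2 in valid.jsonl, ready to pass to --data.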

Sensitivity-aware rank scaling

--rank-scaling by_bits (default) gives each layer an adapter rank proportional to its quantization bit-width. With --rank 8:

  • Layers mlx-optiq quantized at 4-bit get rank 8.
  • Layers mlx-optiq quantized at 8-bit get rank 16.
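
The two cases above are consistent with a simple linear rule, rank = base rank × bits / 4. The sketch below assumes that rule; the function name by_bits_rank, the 4-bit reference, and the layer names are illustrative only, and mlx-optiq's actual implementation may differ (for example in how it treats other bit-widths).

```python
def by_bits_rank(base_rank: int, bits: int, reference_bits: int = 4) -> int:
    """Scale the adapter rank linearly with the layer's bit-width.

    Assumption: rank = base_rank * bits / reference_bits, with 4-bit as
    the reference. This reproduces the two documented cases only.
    """
    return max(1, (base_rank * bits) // reference_bits)


# Hypothetical per-layer bit assignment, for illustration only.
layer_bits = {
    "layers.0.self_attn.q_proj": 8,
    "layers.0.mlp.down_proj": 4,
    "layers.1.self_attn.q_proj": 4,
}

ranks = {name: by_bits_rank(8, bits) for name, bits in layer_bits.items()}
print(ranks)
```

Under this assumption, a larger base rank (say 32) would map 8-bit layers to rank 64 while 4-bit layers stay at 32.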

Head-to-head on a 6-category logical-puzzles reasoning dataset (Qwen3.5-4B-OptiQ-4bit, 1 epoch over 200 training samples, 100-sample test split):

| Config | Trainable params | Test accuracy |
|---|---|---|
| Constant rank-8 | 11.58 M | 27 % |
| by_bits (rank 8 / 16) | 13.49 M (+16%) | 35 % |
| Constant rank-16 | 22.20 M (+92%) | 36 % |

by_bits matches constant rank-16 on accuracy (35% vs 36%, within noise at n=100) while using 39% fewer trainable parameters. Against constant rank-8, it gains 8 absolute accuracy points for 16% more parameters. The full per-category breakdown is in the sensitivity-aware LoRA blog post.
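
The percentages quoted here follow directly from the table. The short check below recomputes them and adds a rough binomial standard error for a 100-sample test split. The standard-error formula assumes independent test samples; it is offered only as a sanity check on the "within noise" remark and is not taken from the benchmark itself.

```python
import math


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Trainable parameters (millions) and test accuracy (%) from the table.
r8_params, r8_acc = 11.58, 27
bb_params, bb_acc = 13.49, 35
r16_params, r16_acc = 22.20, 36

extra_vs_r8 = pct_change(bb_params, r8_params)       # by_bits vs rank-8
r16_vs_r8 = pct_change(r16_params, r8_params)        # rank-16 vs rank-8
saved_vs_r16 = -pct_change(bb_params, r16_params)    # by_bits vs rank-16

# Rough standard error of an accuracy near 35% measured on n=100 samples.
n = 100
p = bb_acc / 100
std_err_points = 100 * math.sqrt(p * (1 - p) / n)

print(f"by_bits vs rank-8 params:  +{extra_vs_r8:.0f}%")
print(f"rank-16 vs rank-8 params:  +{r16_vs_r8:.0f}%")
print(f"by_bits vs rank-16 params: -{saved_vs_r16:.0f}%")
print(f"accuracy gain vs rank-8:   +{bb_acc - r8_acc} points")
print(f"std. error at n={n}:       ~{std_err_points:.1f} points")
```

A standard error of several points at n=100 is why the 1-point gap between by_bits and constant rank-16 should not be read as a real difference.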

Other scaling modes:

scaling.sh (bash)
# Constant rank (matches Unsloth / PEFT default behaviour)
$ optiq lora train ... --rank-scaling constant

# Scale by raw KL sensitivity (more aggressive than by_bits)
$ optiq lora train ... --rank-scaling by_kl

Training-ceiling map (36 GB Mac)

Empirical sequence-length and peak-memory ceilings at the system-default iogpu.wired_limit_mb=0, measured under a conservative recipe: num_layers=16 (only the last 16 transformer blocks adapted), target_modules=q_proj,v_proj, rank=8. The current default (num_layers=-1, all 7 target modules) adapts ~3× more LoRA modules, so the achievable sequence length is proportionally lower for the same model on the same machine. Lower num_layers (to recover headroom for very long contexts) or max_seq_length to stay within these ceilings.

Conservative recipes (peak ≤ 27.7 GB, no overflow penalty)
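| Model | Max seq len | Peak mem | Tokens/sec | Time/iter |
|---|---|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB | 29.2 | 96 s |
| Qwen3.5-2B | 2,400 | 19.3 GB | 38.3 | 63 s |
| Qwen3.5-4B | 1,600 | 24.8 GB | 19.1 | 84 s |
| Qwen3.5-9B | 1,400 | 25.4 GB | 21.6 | 65 s |
| Qwen3.5-27B | 512 | 27.7 GB | 11.4 | 45 s |
| Qwen3.6-27B | 512 | 27.7 GB | 11.4 | 45 s |
| gemma-4-26B-A4B | 512 | 27.6 GB | 22.2 | 32 s |
| Qwen3.5-35B-A3B | 128 | 25.3 GB | 32.2 | 17 s |
| Qwen3.6-35B-A3B | 128 | 25.3 GB | 32.3 | 17 s |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 | 11 s |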

Two cliffs, not one

Pushing past these ceilings hits two distinct failure modes at different points:
  • Memory cliff (~27–28 GB peak). When peak crosses the system-default GPU-wired cap, macOS absorbs overflow via compressed memory. It works, but throughput drops 9–30%.
  • MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously bound MTLResources at 499 K. The per-iteration resource count grows with both num_layers and seq_len. Qwen3.5-2B at a sequence length of T = 3,200 hits a hard kIOGPUCommandBufferCallbackErrorOutOfMemory at iteration 1 even though peak memory is only 22 GB. Spare memory in GB does not imply that T can be pushed further.
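
One way to use the table before launching a run is a simple lookup against the measured ceilings. The sketch below is illustrative only: the helper check_seq_len is hypothetical, the numbers are copied from the table above, and they apply only to the conservative recipe on a 36 GB Mac. Because of the resource-count cliff, staying under a memory budget alone is not a sufficient check.

```python
# Measured max sequence lengths from the table above (36 GB Mac,
# conservative recipe: num_layers=16, q_proj/v_proj, rank 8).
MAX_SEQ_LEN = {
    "Qwen3.5-0.8B": 2800,
    "Qwen3.5-2B": 2400,
    "Qwen3.5-4B": 1600,
    "Qwen3.5-9B": 1400,
    "Qwen3.5-27B": 512,
    "Qwen3.6-27B": 512,
    "gemma-4-26B-A4B": 512,
    "Qwen3.5-35B-A3B": 128,
    "Qwen3.6-35B-A3B": 128,
    "gemma-4-31B-it": 32,
}


def check_seq_len(model: str, seq_len: int) -> str:
    """Compare a planned sequence length with the measured ceiling."""
    ceiling = MAX_SEQ_LEN.get(model)
    if ceiling is None:
        return f"{model}: no measurement available"
    if seq_len <= ceiling:
        return f"{model}: {seq_len} is within the measured ceiling ({ceiling})"
    return f"{model}: {seq_len} exceeds the measured ceiling ({ceiling})"


print(check_seq_len("Qwen3.5-2B", 3200))
print(check_seq_len("Qwen3.5-4B", 1024))
```

With the default recipe (all blocks, all 7 target modules), treat these values as upper bounds rather than targets.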

Adapter output

After training:

directory layout
my_adapter/
├── adapter_config.json       # PEFT-compatible config
├── adapters.safetensors      # PEFT-compatible weights
└── optiq_lora_config.json    # mlx-optiq sidecar with per-layer ranks

Inspect the per-layer rank distribution:

terminal (bash)
$ optiq lora info ./my_adapter
# OptIQ LoRA adapter
#   base: mlx-community/Qwen3.5-4B-OptiQ-4bit
#   rank: 8 (scaling: by_bits)
#   scale: 1.0  dropout: 0.0
#   rank distribution: {8: 101, 16: 27} across 128 adapted modules
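
To read the rank distribution line at a glance, the small calculation below summarizes the histogram from the sample output. It assumes the mapping is adapter rank to number of adapted modules, as the output suggests; the mean is weighted by module count, not by parameter count.

```python
# Histogram copied from the sample `optiq lora info` output:
# assumed to map adapter rank -> number of adapted modules.
rank_histogram = {8: 101, 16: 27}

total_modules = sum(rank_histogram.values())
base_rank = min(rank_histogram)
boosted = total_modules - rank_histogram[base_rank]
mean_rank = sum(r * n for r, n in rank_histogram.items()) / total_modules

print(f"modules: {total_modules}")
print(f"above base rank: {boosted} ({100 * boosted / total_modules:.0f}%)")
print(f"mean rank: {mean_rank:.2f}")
```

In this sample, roughly a fifth of the adapted modules sit on layers kept at the higher bit-width and therefore receive the doubled rank.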

Loading an adapter

load_adapter.py (python)
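from mlx_lm import load, generate

# Load the same base the adapter was trained on (the `base:` line of
# `optiq lora info`); a different base will not match the adapter's shapes.
model, tok = load(
    "mlx-community/Qwen3.5-4B-OptiQ-4bit",
    adapter_path="./my_adapter",
)
print(generate(model, tok, prompt="...", max_tokens=200))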

Hot-swap adapters at serve time

mlx-optiq's mounted-LoRA primitive lets you keep N adapters resident on one base, switching per-request. See the serving guide.