
LoRA fine-tuning

mlx-optiq ships a LoRA trainer that reads its own per-layer sensitivity assignments and gives sensitive layers proportionally more adapter capacity. Output is PEFT-compatible (adapter_config.json + adapters.safetensors) plus an mlx-optiq sidecar describing the per-layer rank distribution.

The same layers mlx-optiq kept at 8-bit during quantization also get more adapter rank during fine-tuning. One signal, two optimizations.

The basic recipe

train.sh

# Train on JSONL data — same format as mlx-lm's lora trainer
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --max-seq-length 1400 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 1000 \
    -o ./my_adapter

# Show the per-layer rank distribution
$ optiq lora info ./my_adapter

Data format

Standard JSONL — one example per line, with a text field (or messages for chat). Same format as mlx_lm.lora:

data.jsonl

{"text": "Q: What is 2+2? A: 4"}
{"text": "Q: ... A: ..."}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
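A minimal sketch of producing such a file with Python's standard library. The helper name to_jsonl and the example records are placeholders for illustration, not part of mlx-optiq or mlx-lm; the only thing taken from the format description is that each record sits on its own line and carries either a text field or a messages list.

```python
import json
from pathlib import Path


def to_jsonl(examples):
    """Serialize records as JSONL: one JSON object per line."""
    lines = []
    for example in examples:
        # Each record needs either a "text" field or a "messages" list.
        if "text" not in example and "messages" not in example:
            raise ValueError(f"record has neither 'text' nor 'messages': {example!r}")
        # json.dumps never emits raw newlines, so each record stays on one line.
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Placeholder records, for illustration only.
examples = [
    {"text": "Q: What is 2+2? A: 4"},
    {"text": "Q: What is 3+5? A: 8"},
]

if __name__ == "__main__":
    out = Path("my_training_data/train.jsonl")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(to_jsonl(examples), encoding="utf-8")
```

Serializing each record with json.dumps avoids the most common mistake in hand-written JSONL: a record that wraps across several lines.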

Layout on disk:

directory layout

my_training_data/
├── train.jsonl
└── valid.jsonl   # optional, used for validation loss

Sensitivity-aware rank scaling

--rank-scaling by_bits gives each layer an adapter rank proportional to its quantization bit-width. With --rank 8:

Layers mlx-optiq quantized at 4-bit get rank 8.
Layers mlx-optiq quantized at 8-bit get rank 16.

The total parameter budget stays roughly the same as with constant rank, but capacity moves to the layers where it matters. On a GSM8K subset, by_bits consistently shows lower validation loss than constant rank at iteration 50.
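The by_bits mapping can be sketched as a linear scaling of the base rank by bit-width. This is an illustration only: the function name rank_by_bits, the 4-bit reference point, and the layer names are assumptions chosen to reproduce the 4-bit to rank 8 and 8-bit to rank 16 example above. The actual rule inside mlx-optiq may differ.

```python
def rank_by_bits(base_rank, bits, base_bits=4):
    """Scale adapter rank linearly with quantization bit-width.

    Assumption: layers at `base_bits` (4-bit here) receive `base_rank`,
    so an 8-bit layer receives twice the base rank.
    """
    return base_rank * bits // base_bits


# Hypothetical per-layer bit assignment, for illustration only.
layer_bits = {"layers.0": 4, "layers.1": 8, "layers.2": 4}

ranks = {name: rank_by_bits(8, bits) for name, bits in layer_bits.items()}
print(ranks)  # {'layers.0': 8, 'layers.1': 16, 'layers.2': 8}
```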

Other scaling modes:

scaling.sh

# Constant rank (mlx-lm default behavior)
$ optiq lora train ... --rank-scaling constant

# Scale by raw KL sensitivity (more aggressive)
$ optiq lora train ... --rank-scaling by_kl

# Scale by sensitivity quantile
$ optiq lora train ... --rank-scaling by_quantile

Training-ceiling map (36 GB Mac)

Empirical sequence-length and peak-memory ceilings at the system-default iogpu.wired_limit_mb=0 (no manual override needed; recommended for safety on long runs). Default config: adapters on q_proj and v_proj, num_layers=16, rank=8, rank_scaling=by_bits. Both measured iterations were stable, with zero memory drift.

Conservative recipes — peak ≤ 27.7 GB, no overflow penalty

Model	Max seq len	Peak mem	Tokens/sec	Time/iter
Qwen3.5-0.8B	2,800	23.4 GB	29.2	96 s
Qwen3.5-2B	2,400	19.3 GB	38.3	63 s
Qwen3.5-4B	1,600	24.8 GB	19.1	84 s
Qwen3.5-9B	1,400	25.4 GB	21.6	65 s
Qwen3.5-27B	512	27.7 GB	11.4	45 s
Qwen3.6-27B	512	27.7 GB	11.4	45 s
gemma-4-26B-A4B	512	27.6 GB	22.2	32 s
Qwen3.5-35B-A3B	128	25.3 GB	32.2	17 s
Qwen3.6-35B-A3B	128	25.3 GB	32.3	17 s
gemma-4-31B-it	32	21.4 GB	30.9	11 s

Two cliffs, not one

Pushing past these ceilings hits two distinct failure modes at different points:

Memory cliff (~27–28 GB peak). When peak memory crosses the system-default GPU-wired cap, macOS absorbs the overflow via compressed memory. Training still works, but throughput drops 9–30%.
MTLResource-count cliff (independent of bytes). Apple Silicon GPUs cap simultaneously bound MTLResources at 499 K. The per-iteration resource count grows with both num_layers and seq_len. Qwen3.5-2B at sequence length T = 3,200 hits a hard kIOGPUCommandBufferCallbackErrorOutOfMemory at iter 1 even though peak memory is only 22 GB. Don't extrapolate from "more headroom in GB" to "can push T further."
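Because spare gigabytes do not predict the second cliff, the safest pre-flight check is against the measured sequence lengths rather than memory headroom. The sketch below is a hypothetical helper, not part of mlx-optiq. Its numbers are copied from the table above and apply only to a 36 GB Mac with the default config.

```python
# Conservative max sequence lengths from the table above
# (36 GB Mac, q_proj + v_proj, num_layers=16, rank=8, by_bits).
MAX_SEQ_LEN = {
    "Qwen3.5-0.8B": 2800,
    "Qwen3.5-2B": 2400,
    "Qwen3.5-4B": 1600,
    "Qwen3.5-9B": 1400,
    "Qwen3.5-27B": 512,
    "Qwen3.6-27B": 512,
    "gemma-4-26B-A4B": 512,
    "Qwen3.5-35B-A3B": 128,
    "Qwen3.6-35B-A3B": 128,
    "gemma-4-31B-it": 32,
}


def check_seq_len(model, seq_len):
    """Compare a planned --max-seq-length against the measured ceiling."""
    ceiling = MAX_SEQ_LEN.get(model)
    if ceiling is None:
        return "unknown: model not in the measured table"
    if seq_len <= ceiling:
        return f"ok: within the measured ceiling of {ceiling}"
    # Past the ceiling, either the memory cliff or the
    # MTLResource-count cliff may be hit; GB headroom does not tell which.
    return f"risky: exceeds the measured ceiling of {ceiling}"


print(check_seq_len("Qwen3.5-9B", 1400))
print(check_seq_len("Qwen3.5-2B", 3200))
```

The second call corresponds to the Qwen3.5-2B failure described above: it is flagged as risky even though its peak memory would suggest room to spare.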

Adapter output

After training:

directory layout

my_adapter/
├── adapter_config.json       # PEFT-compatible config
├── adapters.safetensors      # PEFT-compatible weights
└── optiq_lora_config.json    # mlx-optiq sidecar with per-layer ranks
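A quick sanity check before loading or sharing an adapter is to confirm that all three files are present. The helper below is a hypothetical sketch using only the standard library. It assumes nothing about the contents of the files, only the names listed above.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED_FILES = (
    "adapter_config.json",
    "adapters.safetensors",
    "optiq_lora_config.json",
)


def missing_adapter_files(adapter_dir):
    """Return the expected adapter files that are absent from adapter_dir."""
    adapter_dir = Path(adapter_dir)
    return [name for name in EXPECTED_FILES if not (adapter_dir / name).is_file()]
```

An empty list from missing_adapter_files("./my_adapter") means the directory has the full layout.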

Inspect the per-layer rank distribution:

terminal

$ optiq lora info ./my_adapter
# mlx-optiq LoRA adapter
# base model:  mlx-community/Qwen3.5-9B-OptiQ-4bit
# total params: 2.9M trainable
# rank distribution:
#   rank 8  (4-bit layers): 96 modules
#   rank 16 (8-bit layers): 16 modules

Loading an adapter

load_adapter.py

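from mlx_lm import load, generate

# Load the quantized base model and apply the trained adapter on top of it.
model, tok = load(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    adapter_path="./my_adapter",
)
# Generate with the adapted model; replace the prompt with your own.
print(generate(model, tok, prompt="...", max_tokens=200))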

Hot-swap adapters at serve time

mlx-optiq's mounted-LoRA primitive lets you keep N adapters resident on one base, switching per-request. See the serving guide.