mlx-optiq
Documentation

mlx-optiq ships three things in one PyPI package: mixed-precision quantization, sensitivity-aware LoRA fine-tuning, and an OpenAI-compatible inference server. They share one signal — per-layer KL-divergence sensitivity, computed once on calibration data — and reuse it across the stack.
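To make the shared signal concrete, here is a minimal sketch of per-layer KL-divergence sensitivity: compare the reference model's next-token distribution against the quantized layer's on a calibration batch. All names and the averaging scheme here are illustrative assumptions, not mlx-optiq's actual implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def layer_sensitivity(reference_logits, quantized_logits):
    """Mean KL divergence between reference and quantized
    next-token distributions over a calibration batch."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(reference_logits)
    q = softmax(quantized_logits)
    return float(np.mean([kl_divergence(pi, qi) for pi, qi in zip(p, q)]))

# Toy calibration batch: 4 "tokens" over a 5-way vocabulary.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 5))
sensitivity = layer_sensitivity(ref, ref + 0.1 * rng.normal(size=(4, 5)))
```

Layers with high sensitivity are the ones whose quantization most distorts the model's output, which is why the same score can drive bit allocation, LoRA rank scaling, and KV-cache precision.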

This site is the canonical reference. Every page is self-contained: code examples are copy-paste runnable on a stock Mac with Python 3.11+ and 16 GB+ RAM.

Pick a path

I want to use a pre-built quant

Start with Installation, then jump to your model family — Qwen3.5, Qwen3.6, or Gemma-4. Each has a 5-minute hello-world plus model-specific tips (chat template, sampling defaults, recommended context length).

I want to quantize my own model

Read How sensitivity works to understand the algorithm, then the convert CLI reference. The --reference auto flag picks bf16 when it fits and a uniform-4-bit baseline when it doesn't.
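The auto-selection rule can be paraphrased as a one-liner; this sketch uses an illustrative 2-bytes-per-parameter memory estimate, which is an assumption about the heuristic rather than the library's actual accounting.

```python
def pick_reference(n_params: int, free_bytes: int) -> str:
    """Mirror of the `--reference auto` rule described above:
    use a bf16 reference when it fits in memory, otherwise
    fall back to a uniform-4-bit baseline. The threshold
    arithmetic here is illustrative only."""
    bf16_bytes = n_params * 2  # 2 bytes per parameter in bf16
    return "bf16" if bf16_bytes <= free_bytes else "uniform-4bit"

# A 7B-parameter model on a machine with 16 GB free fits in bf16.
choice = pick_reference(7_000_000_000, 16 * 1024**3)
```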

I want to fine-tune with LoRA

The LoRA fine-tuning guide covers PEFT-compatible adapter output, sensitivity-aware rank scaling, and the empirical training-ceiling map for a 36 GB Mac across all 10 supported models.
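As a rough intuition for sensitivity-aware rank scaling: layers with higher KL sensitivity receive proportionally larger LoRA ranks. The normalization, rounding, and clamping below are assumptions for illustration, not mlx-optiq's actual policy.

```python
def scale_ranks(sensitivities: dict[str, float],
                base_rank: int = 8,
                min_rank: int = 4,
                max_rank: int = 32) -> dict[str, int]:
    """Illustrative rank schedule: give more LoRA capacity to
    layers whose quantization hurts the output most, clamped
    to a sane range."""
    mean = sum(sensitivities.values()) / len(sensitivities)
    ranks = {}
    for name, s in sensitivities.items():
        r = round(base_rank * s / mean)
        ranks[name] = max(min_rank, min(max_rank, r))
    return ranks

# Hypothetical per-layer sensitivities from a calibration pass.
ranks = scale_ranks({"layers.0": 0.02, "layers.1": 0.08, "layers.2": 0.05})
```

The net effect is that adapter capacity lands where quantization error is largest instead of being spread uniformly.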

I want to serve an LLM

The KV-quant serving guide covers running optiq serve behind an OpenAI-compatible API with mixed-precision KV cache and hot-swappable LoRA adapters.
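Because the server speaks the OpenAI chat-completions wire format, any standard client works against it. This sketch only builds the request body; the port and model name are placeholders, not defaults documented here — substitute whatever optiq serve reports on startup.

```python
import json

# Assumed local endpoint; check the server's startup output.
base_url = "http://localhost:8080/v1"

# Standard OpenAI chat-completions request body.
payload = {
    "model": "local-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Say hello in five words."}
    ],
    "max_tokens": 32,
}
body = json.dumps(payload)
```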

For agents and IDEs

The full library reference is also published as a single Markdown file: /llms.txt. Drop it into Claude Code, Cursor, or any agent context window — it packs the complete mlx-optiq reference into about 12 KB.

Get involved

Everything is on PyPI: pypi.org/project/mlx-optiq. Quants are on the mlx-community Hugging Face org — try a model, build something with it, and ship.