mlx-optiq
Documentation

Using mlx-optiq quants

mlx-optiq-quantized models are standard MLX checkpoints. They load with the unmodified mlx_lm.load function and generate with mlx_lm.generate. The only difference from a uniform-4-bit checkpoint is the per-layer bit-width recorded in metadata.

One-shot generation

oneshot.py
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok,
    prompt="Why is mixed-precision quantization a good idea?",
    max_tokens=300)
print(out)
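
If you just want the text printed as it is produced, generate also accepts verbose=True in current mlx_lm releases, which echoes the output along with prompt and generation throughput.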

Streaming generation

For interactive UIs and CLIs, stream tokens as they come:

streaming.py
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

for response in stream_generate(
    model, tok,
    prompt="Write a haiku about Apple Silicon.",
    max_tokens=200,
    sampler=sampler,
):
    print(response.text, end="", flush=True)
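
The final GenerationResponse also carries run statistics. A short follow-up sketch, assuming your mlx_lm version exposes the generation_tps and peak_memory fields on the response object:

# after the loop, the last response holds the stats
print()  # newline after the streamed text
print(f"{response.generation_tps:.1f} tok/s, peak memory {response.peak_memory:.2f} GB")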

Chat templates

Instruction-tuned models (Qwen3.5-*-Instruct, Qwen3.6-*, Gemma-4-*-it) need their chat template applied. Always do this for chat-style use:

chat.py
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.6-27B-OptiQ-4bit")

messages = [
    {"role": "system", "content": "You are a concise expert."},
    {"role": "user", "content": "Explain RoPE in 3 bullet points."},
]
prompt = tok.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=400)
print(out)

Reasoning models: Qwen3.5 and Qwen3.6 instruct variants have a built-in <think>...</think> reasoning channel. Pass enable_thinking=False to apply_chat_template to skip it (much faster, slightly less accurate on math/logic), or leave it on for best quality.
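
A minimal sketch of turning the reasoning channel off, assuming the checkpoint's chat template accepts enable_thinking as a keyword (apply_chat_template forwards extra keyword arguments to the template):

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # skip <think>...</think>; faster, slightly weaker on math/logic
)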

Multi-turn chat loop

chat_loop.py
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
messages = [{"role": "system", "content": "You are helpful."}]

while True:
    user = input("> ")
    if not user:
        break
    messages.append({"role": "user", "content": user})
    prompt = tok.apply_chat_template(messages,
        tokenize=False, add_generation_prompt=True)
    reply = generate(model, tok, prompt=prompt, max_tokens=800)
    print(reply)
    messages.append({"role": "assistant", "content": reply})

Inspecting mlx-optiq metadata

Each mlx-optiq checkpoint records its per-layer bit assignment in config.json under the quantization field, plus a sidecar optiq_metadata.json with the full sensitivity table:

inspect.py
import json
from huggingface_hub import snapshot_download

local = snapshot_download(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    allow_patterns=["optiq_metadata.json", "config.json"],
)
with open(f"{local}/optiq_metadata.json") as f:
    meta = json.load(f)

print("Method:        ", meta["method"])
print("Achieved BPW:  ", round(meta["achieved_bpw"], 3))
print("8-bit layers:  ", meta["n_high_bits"])
print("4-bit layers:  ", meta["n_low_bits"])

Memory ceilings on Apple Silicon Macs

Approximate working-set sizes during single-prompt inference at moderate context (≤8 k tokens). Add ~2 GB for the framework and OS reserve.

Approximate footprint at 8 k context

Model                Disk      Inference RAM (≤8 k)    Recommended Mac
Qwen3.5-0.8B         0.5 GB    ~1.5 GB                 8 GB+
Qwen3.5-2B           1.4 GB    ~3 GB                   16 GB+
Qwen3.5-4B           2.8 GB    ~5 GB                   16 GB+
Qwen3.5-9B           5.6 GB    ~9 GB                   24 GB+
Qwen3.5-27B          15.7 GB   ~22 GB                  36 GB+
Qwen3.6-27B          15.7 GB   ~22 GB                  36 GB+
Qwen3.5-35B-A3B      20.1 GB   ~26 GB                  36 GB+
Qwen3.6-35B-A3B      20.1 GB   ~26 GB                  36 GB+
gemma-4-26B-A4B-it   14.9 GB   ~21 GB                  32 GB+
gemma-4-31B-it       18.1 GB   ~24 GB                  36 GB+
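
To check the budget on your own machine, you can ask MLX for the Metal device limits; a minimal sketch, assuming a recent MLX that reports max_recommended_working_set_size via mx.metal.device_info():

import mlx.core as mx

info = mx.metal.device_info()
budget_gb = info["max_recommended_working_set_size"] / 1e9
print(f"GPU working-set budget: {budget_gb:.1f} GB")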

For 64 k context inference on the 9 B and larger models, use mixed-precision KV serving — it both shrinks the cache and speeds decode.
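
One way to get this with stock mlx_lm is its quantized KV cache; a minimal sketch, assuming your mlx_lm version accepts the kv_bits, kv_group_size, and quantized_kv_start generation options:

from mlx_lm import load, stream_generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

for r in stream_generate(
    model, tok,
    prompt="Summarize the trade-offs of mixed-precision quantization.",
    max_tokens=500,
    kv_bits=4,                # quantize cached keys/values to 4 bits
    kv_group_size=64,
    quantized_kv_start=8192,  # start quantizing once the cache passes 8 k tokens
):
    print(r.text, end="", flush=True)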

Next: dive into how sensitivity works, or jump to your model family's getting-started guide.