mlx-optiq
Family guide · Diffusion LLM

The Diffusion LLM family

Diffusion language models are a different shape of model: instead of decoding left-to-right one token at a time, they fill a block of tokens and iteratively un-mask it over a handful of denoising steps. OptIQ's Diffusion LLM family has two members at opposite extremes — a 26B frontier image-text model and a 250M tri-mode model you can fine-tune on a laptop. Both are custom architectures stock mlx-lm can't load; OptIQ ships vendored, dependency-free decoders, so pip install mlx-optiq is all you need.

The family · two models
Model | Params | Shape | Built for
DiffusionGemma-26B-A4B-it | 26B · A4B | block-diffusion MoE, image-text | frontier diffusion generation
dhara-250m | 250M | tri-mode: AR + diffusion + self-spec | a tiny base to fine-tune
Requires mlx-optiq ≥ 0.2.3. DiffusionGemma's decoder landed in v0.2.3; the dhara-250m port and its prefix-cached decode are new in v0.2.4. Upgrade with pip install -U mlx-optiq.

DiffusionGemma-26B-A4B-it

Google's DiffusionGemma-26B-A4B-it is a block-diffusion, 128-expert MoE, image-text-to-text model — the founding member of the family. It fills a fixed-size 256-token canvas and un-masks it over a few denoising steps. It is not loadable by stock mlx-lm or mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it.

The quant

OptIQ measures per-layer quantization sensitivity on the denoising-canvas logits and spends an 8-bit budget where it helps most. At the same ~4.66 bits-per-weight as the standard published 4-bit (mlx-vlm's hand-coded recipe), OptIQ moves the 8-bit budget off the dense MLP and onto the early-layer attention and routers that the measurement flags as more sensitive. The result is a higher Capability Score on a smaller artifact.

Capability Score · 6-benchmark mean
Model | mlx-optiq size | Capability | Δ vs published 4-bit
diffusiongemma-26B-A4B-it-OptiQ-4bit | 14,000 MB | 59.90 | +0.07

OptIQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks (MMLU +2.9, HumanEval +1.2) while being 0.5 GB smaller. HashHop is ~0 for both — the fixed 256-token canvas can't do 12k-context retrieval.
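
The idea of spending a fixed 8-bit budget on the most sensitive layers can be sketched as a greedy allocation. This is a minimal illustration, not OptIQ's actual algorithm: the layer names, parameter counts, and sensitivity scores are invented, and the bits-per-weight figure ignores per-group scale overhead.

```python
def allocate_bits(layers, budget_bpw, low=4, high=8):
    """Greedily upgrade the most sensitive layers from `low` to `high` bits.

    layers: list of (name, n_params, sensitivity) tuples (hypothetical values).
    budget_bpw: maximum average bits per weight across all layers.
    """
    total = sum(n for _, n, _ in layers)
    bits = {name: low for name, _, _ in layers}
    spent = low * total  # total bits if every layer stays at `low`
    # Visit layers by sensitivity per parameter, highest first.
    ranked = sorted(layers, key=lambda layer: layer[2] / layer[1], reverse=True)
    for name, n, _ in ranked:
        extra = (high - low) * n
        if (spent + extra) / total <= budget_bpw:
            bits[name] = high
            spent += extra
    return bits, spent / total


# Invented numbers, chosen only to make the behaviour visible.
layers = [
    ("attn.early", 100, 9.0),
    ("router", 10, 2.0),
    ("mlp.dense", 400, 2.0),
    ("attn.late", 100, 1.0),
]
bits, bpw = allocate_bits(layers, budget_bpw=4.8)
print(bits, round(bpw, 2))
```

With these made-up inputs the small, sensitive router and early attention are upgraded, while the large dense MLP stays at 4 bits because upgrading it would exceed the budget.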

Hello world — text and image

generate.py · python
from optiq.vlm.diffusion_gemma import load, generate
from PIL import Image

model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))

# image + text
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))

Image preprocessing reuses the Gemma-4 SigLIP path (bit-exact to mlx-vlm). The canvas starts from random noise, so an occasional empty decode is retried automatically.

Inference speed — pick the sampler

DiffusionGemma decodes by un-masking a 256-token canvas; the sampler dominates speed. OptIQ defaults to confidence-threshold, which is 4.6–5× faster than the model's entropy-bound default (≈58 tok/s on code vs 12.7), with no quality loss.

python
generate(model, tokenizer, prompt, sampler="confidence-threshold")  # the default
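
To make the sampler's role concrete, here is a toy sketch of confidence-threshold un-masking. It is not OptIQ's sampler: `toy_predict` is a hypothetical stand-in for the denoiser, the canvas is 8 slots rather than 256, and the 0.9 threshold is an arbitrary illustrative value. The point is only that committing every confident slot per step finishes in far fewer steps than committing one slot at a time.

```python
MASK = None  # placeholder for a still-masked canvas slot
TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]


def toy_predict(canvas):
    """Stand-in denoiser: returns {position: (token, confidence)} for masked slots.

    Confidence grows as more of the canvas is filled in.
    """
    filled = sum(tok is not MASK for tok in canvas)
    guesses = {}
    for i, tok in enumerate(canvas):
        if tok is MASK:
            base = 0.95 if i % 2 == 0 else 0.6  # even slots are "easy"
            guesses[i] = (TARGET[i], min(1.0, base + 0.1 * filled))
    return guesses


def unmask(canvas, predict, threshold=0.9, max_steps=64):
    """Fill every MASK slot; return (canvas, denoising_steps_used)."""
    canvas = list(canvas)
    steps = 0
    while MASK in canvas and steps < max_steps:
        guesses = predict(canvas)
        # Commit every slot the model is confident about.
        chosen = [p for p, (_, conf) in guesses.items() if conf >= threshold]
        # Always commit at least the single best guess, to guarantee progress.
        if not chosen:
            chosen = [max(guesses, key=lambda p: guesses[p][1])]
        for p in chosen:
            canvas[p] = guesses[p][0]
        steps += 1
    return canvas, steps


fast = unmask([MASK] * 8, toy_predict)                 # threshold sampler
slow = unmask([MASK] * 8, toy_predict, threshold=2.0)  # never confident: one slot per step
print(fast[1], slow[1])
```

Both runs produce the same canvas; only the number of denoising steps differs.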

LoRA fine-tuning

OptIQ trains LoRA with the model's native denoising objective — corrupt the target tokens to a random noise level, predict the clean tokens — not autoregressive cross-entropy (which mlx-lm's tuner uses, and which can't even load a diffusion model).

train.py · python
from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora

model_path = "mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"  # assumed: the quant loaded above

# data/train.jsonl holds one {prompt, completion} record per line
train_diffusion_lora(model_path, "data/", "adapter/", rank=8)
model, tok = load_diffusion_lora(model_path, "adapter/")
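
The denoising objective described above (corrupt the target to a random noise level, predict the clean tokens) can be sketched as a data-preparation step. This is an illustration under stated assumptions, not the trainer's code: `MASK_ID`, the uniform noise range, and the choice to corrupt by masking rather than by random-token replacement are all hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id for a corrupted slot


def corrupt(prompt_ids, completion_ids, rng):
    """Return (inputs, targets, loss_mask) for one denoising training example."""
    t = rng.uniform(0.05, 1.0)  # random noise level for this example
    noisy = [MASK_ID if rng.random() < t else tok for tok in completion_ids]
    # Corrupt at least one position so the loss is never empty.
    if MASK_ID not in noisy:
        noisy[rng.randrange(len(noisy))] = MASK_ID
    inputs = list(prompt_ids) + noisy
    targets = list(prompt_ids) + list(completion_ids)
    # The prompt stays clean; only corrupted completion slots are scored.
    loss_mask = [0] * len(prompt_ids) + [int(tok == MASK_ID) for tok in noisy]
    return inputs, targets, loss_mask


rng = random.Random(0)
inputs, targets, loss_mask = corrupt([11, 12, 13], [21, 22, 23, 24, 25], rng)
print(inputs, loss_mask)
```

Unlike autoregressive cross-entropy, the model here sees the whole partially corrupted completion at once and is scored only on the slots it has to restore.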

Serving

optiq serve auto-detects DiffusionGemma and routes the OpenAI/Anthropic-compatible server through the vendored decode with the fast confidence-threshold sampler.

bash
optiq serve --model mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit
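
Because the server is OpenAI-compatible, a client can talk to it with a standard chat-completions request. The sketch below only builds the request using the standard library. The base URL is an assumption (use whatever address optiq serve reports on startup), and the /v1/chat/completions path is the usual OpenAI convention rather than something this guide specifies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: replace with the server's real address


def build_chat_request(prompt, model="mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"):
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Write a haiku about Apple Silicon.")

# To send it once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```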

What doesn't apply

Diffusion is non-autoregressive, so a few autoregressive-only OptIQ features have no analog here: MTP / speculative drafting (the parallel canvas un-masking is the native version of that speedup) and KV-cache quantization (the fixed canvas means the cache only holds the prompt). The model has no MTP head or draft model.


dhara-250m — the tri-mode tiny model

The family's second member is its opposite extreme: codelion/dhara-250m, a 250M-parameter model you can fine-tune on a laptop. It is tri-mode — one set of weights that decodes three ways: standard autoregressive (left-to-right), block-diffusion (the canvas un-masking above), and self-speculation (draft a block with the diffusion forward, verify it with the AR forward). Like DiffusionGemma it is a custom architecture stock mlx-lm can't load — it adds Canon depthwise-conv layers, QK-norm after RoPE, and a logit soft-cap. OptIQ ships a vendored, mlx-native port that is bit-exact to the reference, so pip install mlx-optiq loads it and the full pipeline — convert, LoRA, eval, serve, KV-quant — works through the autoregressive path.
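
For readers unfamiliar with a logit soft-cap, the widely used tanh form is sketched below. The exact formula and cap value dhara uses are not stated here, so both the form and `cap=30.0` are assumptions for illustration.

```python
import math


def soft_cap(logits, cap=30.0):
    """Squash logits smoothly into (-cap, cap) using cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


capped = soft_cap([-200.0, -1.0, 0.0, 1.0, 200.0])
print([round(x, 3) for x in capped])
```

Small logits pass through almost unchanged while extreme ones saturate near the cap, which keeps the output distribution from becoming arbitrarily peaked.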

Built to be fine-tuned. At 250M, dhara is a base to specialize, the way Google's Gemma-270M is: small enough to LoRA on-device for one task, not a general assistant. Its benchmark scores sit near the small-model floor; the point below is that quantization preserves them intact.

The quant — 4-bit is lossless here

dhara is small enough that the weights aren't the bottleneck, so OptIQ's win is size, not a capability rescue. We measured the full 6-benchmark Capability Score three ways — full-precision bf16, naive uniform 4-bit, and OptIQ measured mixed-precision — and all three land within run-to-run noise. The extra 8-bit budget OptIQ spends over uniform 4-bit buys no measurable capability on a model this robust; the measurement instead certifies that 4-bit is safe.

Capability Score · 6-benchmark mean · bf16 reference
Variant | Size | bpw | Capability | MMLU | IFEval
bf16 (reference) | 460 MB | 16 | 8.34 | 24.7 | 23.3
uniform 4-bit | 130 MB | 4.0 | 8.79 | 24.3 | 27.2
dhara-250m-OptiQ-4bit | 170 MB | 4.86 | 8.54 | 24.9 | 25.0

All three are within the IFEval noise band, so the quantized variants are 2.7× to 3.5× smaller at full quality (460 → 170 or 130 MB). GSM8K, HumanEval, BFCL, and HashHop sit at the 250M floor for every variant; we verified this is genuine, not a harness artifact: the model can't yet do multi-step math or tool calls, which we confirmed by inspecting raw generations with the model's own repetition penalty. The number to take away: quantization costs nothing here.
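
The size ratios follow directly from the table; a quick check using only the sizes reported there:

```python
sizes_mb = {"bf16": 460, "uniform 4-bit": 130, "OptiQ-4bit": 170}  # from the table above

ratios = {
    name: sizes_mb["bf16"] / mb
    for name, mb in sizes_mb.items()
    if name != "bf16"
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x smaller than bf16")
```

Uniform 4-bit gives the larger reduction, and the mixed-precision quant gives up some of it for the extra 8-bit layers.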

Three ways to decode — self-speculation is the default

dhara decodes three ways from one set of weights, and the recommended default is self-speculation (--mtp). It drafts a block in one parallel forward and verifies it autoregressively (two forwards per round, no commit pass), so the emitted output is identical to plain AR decode while committing ~3–4 tokens per round: AR accuracy at ~1.4× the speed of token-by-token AR. The model is overhead-bound (a 32-token forward costs about the same as a 1-token forward), and the 4-bit and bf16 weights decode at the same speed, so quantization buys size, not throughput.

Mode | Speed · M3 Max | Character
self-speculation (--mtp) | ~1.4× AR | recommended: output identical to AR, committing several tokens per round
autoregressive | ~130 tok/s | the exact reference; pair with a repetition penalty (greedy can loop)
block-diffusion | parallel | prefix-cached; bidirectional (infilling), trades denoising steps for speed

Self-speculation guarantees AR-identical output because the autoregressive verify decides every token — the speedup is free accuracy-wise, and largest for fine-tuned models decoded greedily (the deployment case for a model like this). Self-spec and block-diffusion are prefix-cached (KV + Canon-conv state), so each step processes only the new block — O(block) per step, not O(sequence). Quantization leaves the Canon convs, QK-norm, and soft-cap at bf16 automatically (they aren't Linear modules), so only the attention and MLP projections are quantized.
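
The AR-identical guarantee can be seen in a toy draft-and-verify loop. Everything here is a stand-in: `ar_next` and `draft_block` are hypothetical functions over integer tokens, not dhara's forwards, and the verify step is written token by token for clarity where a real implementation would verify the whole block in one forward.

```python
def ar_next(context):
    """Toy 'AR forward': a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context, k):
    """Toy 'diffusion forward': guesses k tokens, deliberately wrong now and then."""
    out, ctx = [], list(context)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % 50  # inject a draft mistake
        out.append(tok)
        ctx.append(tok)
    return out


def self_speculate(prompt, n_tokens, k=4):
    """Draft k tokens per round, then let the AR rule decide every emitted token."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        draft = draft_block(out, k)
        rounds += 1
        for tok in draft:
            verified = ar_next(out)  # the AR verify has the final say
            out.append(verified)
            # Stop at the first mismatch (keeping AR's token) or when done.
            if verified != tok or len(out) - len(prompt) == n_tokens:
                break
    return out[len(prompt):], rounds


def plain_ar(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(ar_next(out))
    return out[len(prompt):]


spec_tokens, rounds = self_speculate([1, 2, 3], 12)
print(spec_tokens == plain_ar([1, 2, 3], 12), rounds)
```

Every appended token comes from the AR rule, so draft mistakes can only cost speed, never change the output; the gain is that a round with a good draft commits several tokens at once.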

Hello world

generate.py · python
import optiq  # registers dhara_ar with mlx-lm
from mlx_lm import load, generate

model, tok = load("mlx-community/dhara-250m-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain the Mediterranean climate."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tok, prompt))

Block-diffusion and self-speculation run through optiq.mlx_lm_patches.dhara_decode. optiq serve --model mlx-community/dhara-250m-OptiQ-4bit serves the OpenAI/Anthropic-compatible API; the --mtp flag routes generation through the self-speculative path, and LoRA fine-tuning uses the standard optiq lora train autoregressive trainer.

License + provenance

Both quants are Apache-2.0, derived from google/diffusiongemma-26B-A4B-it and codelion/dhara-250m. OptIQ ships a native decoder for each, so pip install mlx-optiq is the only dependency.