The Diffusion LLM family
Diffusion language models are a different shape of model: instead of decoding left-to-right one token at a time, they fill a block of tokens and iteratively un-mask it over a handful of denoising steps. OptIQ's Diffusion LLM family has two members at opposite extremes — a 26B frontier image-text model and a 250M tri-mode model you can fine-tune on a laptop. Both are custom architectures stock mlx-lm can't load; OptIQ ships vendored, dependency-free decoders, so pip install mlx-optiq is all you need.
| Model | Params | Shape | Built for |
|---|---|---|---|
| DiffusionGemma-26B-A4B-it | 26B · A4B | block-diffusion MoE, image-text | frontier diffusion generation |
| dhara-250m | 250M | tri-mode: AR + diffusion + self-spec | a tiny base to fine-tune |
pip install -U mlx-optiq
DiffusionGemma-26B-A4B-it
Google's DiffusionGemma-26B-A4B-it is a block-diffusion, 128-expert MoE, image-text-to-text model — the founding member of the family. It fills a fixed-size 256-token canvas and un-masks it over a few denoising steps. It is not loadable by stock mlx-lm or mlx-vlm; OptIQ ships a vendored, dependency-free decoder for it.
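To make the canvas un-masking concrete, here is a minimal pure-Python sketch of one way to commit positions over a few steps, using a confidence threshold. It is an illustration under stated assumptions, not OptIQ's decoder: `toy_denoiser`, `MASK`, the 16-slot canvas, and the 0.9 threshold are all hypothetical stand-ins for the real model and its 256-token canvas.

```python
MASK = -1  # hypothetical placeholder for a still-masked canvas slot


def toy_denoiser(canvas, step):
    """Hypothetical stand-in for one model forward over the canvas.

    Returns {position: (token, confidence)} for every masked slot.
    Confidence rises with the step count so the loop always finishes.
    """
    preds = {}
    for i, tok in enumerate(canvas):
        if tok == MASK:
            confidence = min(1.0, ((i * 37) % 10) / 10 + 0.25 * step)
            preds[i] = ((i * 7) % 100, confidence)
    return preds


def unmask(canvas_size=16, threshold=0.9, max_steps=8):
    """Fill a fully masked canvas, committing confident slots each step."""
    canvas = [MASK] * canvas_size
    steps = 0
    while MASK in canvas and steps < max_steps:
        preds = toy_denoiser(canvas, steps)
        # Commit every slot whose confidence clears the threshold.
        commit = [i for i, (_, conf) in preds.items() if conf >= threshold]
        if not commit:
            # Always make progress: commit the single most confident slot.
            commit = [max(preds, key=lambda i: preds[i][1])]
        for i in commit:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps


canvas, steps = unmask()
print(f"filled {len(canvas)} slots in {steps} steps")
```

The point of the sketch is the step count: many slots are committed per forward pass, so the canvas fills in far fewer steps than it has tokens.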
The quant
OptIQ measures per-layer quantization sensitivity on the denoising-canvas logits and spends an 8-bit budget where it helps most. At the same ~4.66 bits-per-weight as the standard published 4-bit (mlx-vlm's hand-coded recipe), OptIQ moves the 8-bit budget off the dense-MLP and onto the early-layer attention + routers the measurement flags as more sensitive — a higher Capability Score on a smaller artifact.
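The budget idea can be sketched as a greedy allocation: start every layer at 4-bit, then upgrade layers to 8-bit in order of measured sensitivity per parameter until the average bits-per-weight budget is used up. This is a rough sketch, not OptIQ's actual allocator; the layer names, parameter counts, and sensitivity values below are made up for illustration, and the arithmetic ignores per-group scale overhead.

```python
def allocate_bits(layers, budget_bpw, low=4, high=8):
    """Greedy mixed-precision allocation (illustrative only).

    layers: list of (name, n_params, sensitivity) tuples, where
    sensitivity is the measured damage from quantizing at `low` bits.
    Returns ({name: bits}, achieved average bits-per-weight).
    """
    total = sum(n for _, n, _ in layers)
    bits = {name: low for name, _, _ in layers}
    spent = low * total  # total bits used with every layer at `low`
    # Upgrade the layers that buy the most sensitivity per extra bit first.
    ranked = sorted(layers, key=lambda layer: layer[2] / layer[1], reverse=True)
    for name, n, _ in ranked:
        extra = (high - low) * n
        if spent + extra <= budget_bpw * total:
            bits[name] = high
            spent += extra
    return bits, spent / total


# Hypothetical measurements: (name, parameter count, sensitivity)
layers = [
    ("layers.0.attn", 4_000_000, 0.90),
    ("layers.0.router", 500_000, 0.60),
    ("layers.0.mlp", 40_000_000, 0.30),
    ("layers.20.attn", 4_000_000, 0.10),
    ("layers.20.mlp", 40_000_000, 0.05),
]
bits, avg_bpw = allocate_bits(layers, budget_bpw=4.66)
for name, b in bits.items():
    print(f"{name}: {b}-bit")
print(f"average bpw: {avg_bpw:.2f}")
```

With these made-up numbers the small attention and router layers take the 8-bit budget while the large MLPs stay at 4-bit, which mirrors the direction of the shift described above.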
| Model | mlx-optiq size | Capability | Δ vs published 4-bit |
|---|---|---|---|
| diffusiongemma-26B-A4B-it-OptiQ-4bit | 14,000 MB | 59.90 | +0.07 |
OptIQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks (MMLU +2.9, HumanEval +1.2) while being 0.5 GB smaller. HashHop is ~0 for both — the fixed 256-token canvas can't do 12k-context retrieval.
Hello world — text and image
from optiq.vlm.diffusion_gemma import load, generate
from PIL import Image
model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))
# image + text
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))
Image preprocessing reuses the Gemma-4 SigLIP path (bit-exact to mlx-vlm). The canvas starts from random noise, so an occasional empty decode is retried automatically.
Inference speed — pick the sampler
DiffusionGemma decodes by un-masking a 256-token canvas; the sampler dominates speed. OptIQ defaults to confidence-threshold — 4.6–5× faster than the model's entropy-bound default (≈58 tok/s on code vs 12.7), with no quality loss.
generate(model, tokenizer, prompt, sampler="confidence-threshold") # the default
LoRA fine-tuning
OptIQ trains LoRA with the model's native denoising objective — corrupt the target tokens to a random noise level, predict the clean tokens — not autoregressive cross-entropy (which mlx-lm's tuner uses, and which can't even load a diffusion model).
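As a rough picture of the corruption step in that objective, the sketch below draws a random noise level, corrupts that fraction of the completion tokens, and records which clean tokens the loss would be computed on. It assumes corruption means replacing tokens with a mask id, which the text does not confirm; `MASK_ID` and `corrupt` are hypothetical names, not part of OptIQ's API.

```python
import random

MASK_ID = -1  # hypothetical mask id; real token ids are assumed non-negative


def corrupt(prompt_ids, completion_ids, rng):
    """Corrupt the completion to a random noise level; keep the prompt clean.

    Returns (model inputs, [(position, clean token)] loss targets, noise level).
    """
    noise_level = rng.uniform(0.0, 1.0)
    noisy = []
    targets = []
    for offset, tok in enumerate(completion_ids):
        if rng.random() < noise_level:
            noisy.append(MASK_ID)
            # The model is trained to predict the clean token at this slot.
            targets.append((len(prompt_ids) + offset, tok))
        else:
            noisy.append(tok)
    return list(prompt_ids) + noisy, targets, noise_level


rng = random.Random(0)
prompt_ids = [101, 7, 42]
completion_ids = [5, 9, 13, 21, 34, 55]
inputs, targets, noise_level = corrupt(prompt_ids, completion_ids, rng)
print(f"noise level {noise_level:.2f}, {len(targets)} of {len(completion_ids)} tokens corrupted")
```

Contrast this with autoregressive cross-entropy, where every position predicts the next token from a clean left context and no noise level is sampled.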
from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora
train_diffusion_lora(model_path, "data/", "adapter/", rank=8) # data/train.jsonl: {prompt, completion}
model, tok = load_diffusion_lora(model_path, "adapter/")
Serving
optiq serve auto-detects DiffusionGemma and routes the OpenAI/Anthropic-compatible server through the vendored decode with the fast confidence-threshold sampler.
optiq serve --model mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit
What doesn't apply
Diffusion is non-autoregressive, so a few autoregressive-only OptIQ features have no analog here: MTP / speculative drafting (the parallel canvas un-masking is the native version of that speedup) and KV-cache quantization (the fixed canvas means the cache only holds the prompt). The model has no MTP head or draft model.
dhara-250m — the tri-mode tiny model
The family's second member is its opposite extreme: codelion/dhara-250m, a 250M-parameter model you can fine-tune on a laptop. It is tri-mode — one set of weights that decodes three ways: standard autoregressive (left-to-right), block-diffusion (the canvas un-masking above), and self-speculation (draft a block with the diffusion forward, verify it with the AR forward). Like DiffusionGemma it is a custom architecture stock mlx-lm can't load — it adds Canon depthwise-conv layers, QK-norm after RoPE, and a logit soft-cap. OptIQ ships a vendored, mlx-native port that is bit-exact to the reference, so pip install mlx-optiq loads it and the full pipeline — convert, LoRA, eval, serve, KV-quant — works through the autoregressive path.
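For readers unfamiliar with logit soft-capping, the usual form squashes logits smoothly into a fixed range with a scaled tanh. The sketch below shows that common formulation; whether dhara uses exactly this form, and what cap value it uses, is an assumption here (the `cap=30.0` default is purely illustrative).

```python
import math


def soft_cap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for small values."""
    return cap * math.tanh(logit / cap)


for x in (0.5, 10.0, 100.0, -1000.0):
    print(f"{x:>8} -> {soft_cap(x):.3f}")
```

Small logits pass through almost unchanged, while extreme ones saturate at the cap instead of growing without bound.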
The quant — 4-bit is lossless here
dhara is small enough that the weights aren't the bottleneck, so OptIQ's win is size, not a capability rescue. We measured the full 6-benchmark Capability Score three ways — full-precision bf16, naive uniform 4-bit, and OptIQ measured mixed-precision — and all three land within run-to-run noise. The extra 8-bit budget OptIQ spends over uniform 4-bit buys no measurable capability on a model this robust; the measurement instead certifies that 4-bit is safe.
| Variant | Size | bpw | Capability | MMLU | IFEval |
|---|---|---|---|---|---|
| bf16 (reference) | 460 MB | 16 | 8.34 | 24.7 | 23.3 |
| uniform 4-bit | 130 MB | 4.0 | 8.79 | 24.3 | 27.2 |
| dhara-250m-OptiQ-4bit | 170 MB | 4.86 | 8.54 | 24.9 | 25.0 |
The differences between the three variants fall within the IFEval noise band, so the quantized models are 2.7–3.5× smaller at full quality (460 → 130–170 MB). GSM8K, HumanEval, BFCL, and HashHop sit at the 250M floor for every variant; we verified this is genuine, not a harness artifact: inspecting raw generations with the model's own repetition penalty confirms the model can't yet do multi-step math or tool calls. The number to take away: quantization costs nothing here.
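The size ratios follow directly from the table; a quick check, using only the sizes listed above:

```python
# Sizes in MB, copied from the table above.
sizes_mb = {
    "bf16 (reference)": 460,
    "uniform 4-bit": 130,
    "dhara-250m-OptiQ-4bit": 170,
}

reference = sizes_mb["bf16 (reference)"]
ratios = {name: reference / mb for name, mb in sizes_mb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller than bf16")
```

The uniform 4-bit build gives the larger ratio; the OptIQ mixed-precision build spends some of that saving on its extra 8-bit budget.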
Three ways to decode — self-speculation is the default
dhara decodes three ways from one set of weights, and the recommended default is self-speculation (--mtp): it drafts a block in one parallel forward and verifies it autoregressively (two forwards per round, no commit pass), so the emitted output is identical to plain AR decode while committing ~3–4 tokens per round — AR accuracy at ~1.4× the speed of token-by-token AR. The model is overhead-bound (a 32-token forward costs about the same as a 1-token forward), and the 4-bit and bf16 weights decode at the same speed — so quantization buys size, not throughput.
| Mode | Speed · M3 Max | Character |
|---|---|---|
| self-speculation (--mtp) | ~1.4× AR | recommended — output identical to AR, committing several tokens per round |
| autoregressive | ~130 tok/s | the exact reference; pair with a repetition penalty (greedy can loop) |
| block-diffusion | parallel | prefix-cached; bidirectional (infilling), trades denoising steps for speed |
Self-speculation guarantees AR-identical output because the autoregressive verify decides every token — the speedup is free accuracy-wise, and largest for fine-tuned models decoded greedily (the deployment case for a model like this). Self-spec and block-diffusion are prefix-cached (KV + Canon-conv state), so each step processes only the new block — O(block) per step, not O(sequence). Quantization leaves the Canon convs, QK-norm, and soft-cap at bf16 automatically (they aren't Linear modules), so only the attention and MLP projections are quantized.
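The AR-identical guarantee is easy to see in a toy version of the draft-then-verify loop. The sketch below is a minimal illustration, not the code in OptIQ: `ar_next` and `draft_block` are hypothetical stand-ins for the autoregressive and diffusion forwards, and the drafter is deliberately wrong at every fourth slot so that rejection is exercised.

```python
VOCAB = 50


def ar_next(context):
    """Toy stand-in for a greedy autoregressive forward."""
    return (sum(context) * 31 + len(context) * 7) % VOCAB


def ar_decode(prompt, n_tokens):
    """Plain token-by-token greedy decode: the reference output."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(ar_next(out))
    return out[len(prompt):]


def draft_block(context, block_size):
    """Toy stand-in for the parallel draft; wrong at every 4th slot."""
    ctx = list(context)
    block = []
    for i in range(block_size):
        tok = ar_next(ctx)
        if i % 4 == 3:
            tok = (tok + 1) % VOCAB  # deliberate draft error
        block.append(tok)
        ctx.append(tok)
    return block


def self_spec_decode(prompt, n_tokens, block_size=8):
    """Draft a block, then let the AR verify decide every committed token."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        draft = draft_block(out, block_size)
        # A real model scores the whole drafted block in one AR forward;
        # the toy recomputes the AR target slot by slot instead.
        for tok in draft:
            target = ar_next(out)
            out.append(target)  # the AR token is what gets committed
            if tok != target:
                break  # first mismatch: discard the rest of the draft
        rounds += 1
    return out[len(prompt):len(prompt) + n_tokens], rounds


prompt = [3, 14, 15]
tokens, rounds = self_spec_decode(prompt, 20)
print(f"{len(tokens)} tokens in {rounds} rounds")
print("identical to AR:", tokens == ar_decode(prompt, 20))
```

Because every committed token is the AR target, a bad draft can only cost speed, never change the output; a better drafter simply commits more tokens per round.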
Hello world
import optiq # registers dhara_ar with mlx-lm
from mlx_lm import load, generate
model, tok = load("mlx-community/dhara-250m-OptiQ-4bit")
prompt = tok.apply_chat_template(
[{"role": "user", "content": "Explain the Mediterranean climate."}],
tokenize=False, add_generation_prompt=True)
print(generate(model, tok, prompt))
Block-diffusion and self-speculation run through optiq.mlx_lm_patches.dhara_decode. optiq serve --model mlx-community/dhara-250m-OptiQ-4bit serves the OpenAI/Anthropic-compatible API; the --mtp flag routes generation through the self-speculative path, and LoRA fine-tuning uses the standard optiq lora train autoregressive trainer.
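Since the server speaks the OpenAI-compatible API, a client request might look like the sketch below. The `/v1/chat/completions` path is the standard OpenAI chat endpoint; the host and port (`http://localhost:8080`) are assumptions, so check what optiq serve prints on startup. `build_chat_request` is a hypothetical helper, not part of OptIQ.

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:8080"):
    """Build an OpenAI-style chat completion request (base_url is assumed)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "Explain the Mediterranean climate.",
    "mlx-community/dhara-250m-OptiQ-4bit",
)
print(req.get_method(), req.full_url)

# To send it once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library should work the same way once it is pointed at the server's base URL.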
License + provenance
Both quants are Apache-2.0, derived from google/diffusiongemma-26B-A4B-it and codelion/dhara-250m. OptIQ ships a native decoder for each, so pip install mlx-optiq is the only dependency.