Family guide · Diffusion LLM

The Diffusion LLM family

Diffusion language models are a different shape of model: instead of decoding left-to-right one token at a time, they fill a block of tokens and iteratively un-mask it over a handful of denoising steps. OptiQ's Diffusion LLM family has two very different members, a 26B frontier image-text model and a 250M tri-mode model you can fine-tune on a laptop. Both are custom architectures stock mlx-lm can't load; OptiQ ships vendored, dependency-free decoders, so pip install mlx-optiq is all you need.

The family · two models

Model	Params	Shape	Built for
DiffusionGemma-26B-A4B-it	26B · A4B	block-diffusion MoE, image-text	frontier diffusion generation
dhara-250m	250M	tri-mode: AR + diffusion + self-spec	a tiny base to fine-tune

Requires mlx-optiq ≥ 0.2.3. OptiQ ships native decoders for DiffusionGemma and the dhara-250m port, including prefix-cached decode. Upgrade with pip install -U mlx-optiq.

DiffusionGemma-26B-A4B-it

Google's DiffusionGemma-26B-A4B-it is a block-diffusion, 128-expert MoE, image-text-to-text model, the founding member of the family. It fills a fixed-size 256-token canvas and un-masks it over a few denoising steps. It is not loadable by stock mlx-lm or mlx-vlm; OptiQ ships a vendored, dependency-free decoder for it.

The quant

OptiQ measures per-layer quantization sensitivity on the denoising-canvas logits, sampled at several points along the denoising schedule the model actually walks, and spends 8-bit precision where the measurement says it matters. The SigLIP vision tower is not quantized at all: it rides at bf16 in a sidecar, so the image path keeps full precision and the whole bit budget goes to the language tower. The result is 247 of 299 language tensors at 8-bit, at 4.685 bits-per-weight.
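To make the idea of spending bits where sensitivity is highest concrete, here is a minimal sketch of a greedy allocator. It is not OptiQ's optimizer: the `Layer` fields, the layer names, the sensitivity numbers, and the greedy rule are all hypothetical, and group-scale overhead is ignored.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    n_params: int
    sensitivity: float  # hypothetical: damage measured when this layer sits at the low tier

def allocate_bits(layers, target_bpw, low=4, high=8):
    """Greedy sketch: promote the most sensitive layers to `high` bits while the
    average bits-per-weight stays at or under `target_bpw`."""
    total_params = sum(l.n_params for l in layers)
    bits = {l.name: low for l in layers}
    spent = low * total_params
    budget = target_bpw * total_params
    # Layers that avoid the most damage per extra bit spent go first.
    order = sorted(layers, key=lambda l: l.sensitivity / l.n_params, reverse=True)
    for layer in order:
        extra = (high - low) * layer.n_params
        if spent + extra <= budget:
            bits[layer.name] = high
            spent += extra
    return bits, spent / total_params

# Made-up layers: three small tensors and one very large expert tensor.
layers = [
    Layer("attn.q_proj", 1_000_000, 0.030),
    Layer("attn.o_proj", 1_000_000, 0.004),
    Layer("moe.experts", 20_000_000, 0.050),
    Layer("mlp.router", 100_000, 0.020),
]
bits, bpw = allocate_bits(layers, target_bpw=4.5)
print(bits, round(bpw, 3))
```

In this toy, most tensors end up at 8-bit while the average stays close to 4 bits, because the one large tensor stays at the low tier. Whether the shipped quant has that shape is not something this sketch can tell you.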

Capability Score · 6-benchmark mean

Model	mlx-optiq size	Capability	Δ vs published 4-bit
diffusiongemma-26B-A4B-it-OptiQ-4bit	17.8 GB	68.25	+4.12

Per-benchmark: MMLU 56.1, GSM8K 94.4, IFEval 72.5, BFCL 69.0, HumanEval 87.2, HashHop 34.0.

Because the vision tower is bf16 and outside the bit budget, image understanding is not degraded by the quant. Asked to describe photos in one sentence it names the animal, reads an upside-down stop sign, and picks out the remote controls lying next to the cats.

Hello world, text and image

generate.py · python

from optiq.vlm.diffusion_gemma import load, generate
from PIL import Image

model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))

# image + text
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))

Image preprocessing reuses the Gemma-4 SigLIP path (bit-exact to mlx-vlm). The canvas starts from random noise, so an occasional empty decode is retried automatically.

Inference speed, pick the sampler

DiffusionGemma decodes by un-masking a 256-token canvas; the sampler dominates speed. OptiQ defaults to confidence-threshold, 4.6–5× faster than the model's entropy-bound default (≈58 tok/s on code vs 12.7), with no quality loss.

python

generate(model, tokenizer, prompt, sampler="confidence-threshold")  # the default
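As an illustration of what a confidence-threshold sampler does, the toy below un-masks a small canvas. It is a sketch under assumptions, not the vendored decoder: `toy_predict` stands in for the model with random confidences, and the threshold value, the canvas length, and the "always commit the best guess" rule are hypothetical choices.

```python
import random

MASK = None  # placeholder for a canvas position that is still masked

def toy_predict(canvas, rng):
    """Stand-in for the denoiser: one (token, confidence) guess per masked slot.
    A real model would condition on the prompt and the already committed tokens."""
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(canvas) if t is MASK}

def confidence_threshold_decode(canvas_len=16, threshold=0.7, max_steps=64, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    steps = 0
    while MASK in canvas and steps < max_steps:
        guesses = toy_predict(canvas, rng)
        # Commit every position whose confidence clears the threshold.
        accepted = [i for i, (_, conf) in guesses.items() if conf >= threshold]
        # If nothing clears it, commit the single best guess so each step makes progress.
        if not accepted:
            accepted = [max(guesses, key=lambda i: guesses[i][1])]
        for i in accepted:
            canvas[i] = guesses[i][0]
        steps += 1
    return canvas, steps

canvas, steps = confidence_threshold_decode()
print(f"filled {len(canvas)} positions in {steps} steps")
```

Because several positions can be committed in one step, the number of forward passes can be much smaller than the number of tokens, which is where the speedup comes from.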

LoRA fine-tuning

OptiQ trains LoRA with the model's native denoising objective (corrupt the target tokens to a random noise level, predict the clean tokens), not the autoregressive cross-entropy that mlx-lm's tuner uses. mlx-lm's tuner can't even load a diffusion model.

train.py · python

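from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora

# Base checkpoint to adapt (assumed here: the repo from the hello-world example).
model_path = "mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"

# data/train.jsonl holds one {prompt, completion} record per line.
train_diffusion_lora(model_path, "data/", "adapter/", rank=8)
model, tok = load_diffusion_lora(model_path, "adapter/")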
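The denoising objective described above can be pictured with a toy data-preparation step. This is a sketch, not OptiQ's trainer: corrupting by random-token replacement, drawing one noise level per example, and restricting the loss to completion positions are all assumptions, and `corrupt_completion` is a hypothetical helper.

```python
import random

def corrupt_completion(prompt_ids, completion_ids, vocab_size, rng):
    """Build one toy training example for a denoising objective.

    Returns (inputs, targets, loss_mask, noise_level). The prompt stays clean;
    each completion token is replaced by a random token with probability
    `noise_level`.
    """
    noise_level = rng.random()  # one random noise level per example
    noisy = [
        rng.randrange(vocab_size) if rng.random() < noise_level else tok
        for tok in completion_ids
    ]
    inputs = list(prompt_ids) + noisy
    targets = list(prompt_ids) + list(completion_ids)
    # Only completion positions contribute to the loss in this sketch.
    loss_mask = [0] * len(prompt_ids) + [1] * len(completion_ids)
    return inputs, targets, loss_mask, noise_level

rng = random.Random(0)
inputs, targets, loss_mask, level = corrupt_completion([5, 6, 7], [10, 11, 12, 13], 100, rng)
print(inputs, targets, loss_mask, round(level, 2))
```

The model is then trained to predict `targets` from `inputs` at the positions where `loss_mask` is 1, which is a different setup from next-token cross-entropy.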

Serving

optiq serve auto-detects DiffusionGemma and routes the OpenAI/Anthropic-compatible server through the vendored decode with the fast confidence-threshold sampler.

bash

optiq serve --model mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit

What doesn't apply

Diffusion is non-autoregressive, so a few autoregressive-only OptiQ features have no analog here: MTP / speculative drafting (the parallel canvas un-masking is the native version of that speedup) and KV-cache quantization (the fixed canvas means the cache only holds the prompt). The model has no MTP head or draft model.

dhara-250m, the tri-mode tiny model

The family's second member is very different: codelion/dhara-250m, a 250M-parameter model you can fine-tune on a laptop. It is tri-mode, meaning one set of weights decodes three ways: standard autoregressive (left-to-right), block-diffusion (the canvas un-masking above), and self-speculation (draft a block with the diffusion forward, verify it with the AR forward). Like DiffusionGemma, it is a custom architecture that stock mlx-lm can't load: it adds Canon depthwise-conv layers, QK-norm after RoPE, and a logit soft-cap. OptiQ ships a vendored, mlx-native port that is bit-exact to the reference, so pip install mlx-optiq loads it and the full pipeline (convert, LoRA, eval, serve, KV-quant) works through the autoregressive path.

Built to be fine-tuned. At 250M, dhara is a base to specialize (the way Google's Gemma-270M is): small enough to LoRA on-device for one task, not a general assistant. Its benchmark scores sit near the small-model floor; the point below is that quantization preserves them intact.

The quant, 8-bit mixed precision

At 250M there is no redundancy to spend, so dhara is quantized for fidelity to the reference rather than for the smallest file. --candidate-bits 8,16 gives the optimizer a lossless tier (16 is not a quantized format, since MLX has none, so it means leave this layer at bf16) and the sensitivity sweep decides which layers get it. This quant holds 125 tensors at bf16 and drops 99 to 8-bit, 10.25 bits-per-weight.
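The bits-per-weight figures can be sanity-checked with a little arithmetic. The sketch below assumes group-wise affine quantization with MLX's default group size of 64 and a 16-bit scale plus a 16-bit bias per group, and it treats every non-bf16 weight as 8-bit. It is a simplified model for intuition, not a reproduction of how OptiQ computes its number.

```python
def effective_bits(bits, group_size=64, overhead_bits=32):
    """Bits per weight for group-wise affine quantization: each group of
    `group_size` weights also stores a 16-bit scale and a 16-bit bias."""
    return bits + overhead_bits / group_size

def mixed_bpw(frac_bf16, quant_bits=8):
    """Average bits per weight when `frac_bf16` of the weights stay at 16 bits."""
    return 16 * frac_bf16 + effective_bits(quant_bits) * (1 - frac_bf16)

def bf16_fraction_for(target_bpw, quant_bits=8):
    """Invert mixed_bpw: the share of weights that must stay at bf16."""
    q = effective_bits(quant_bits)
    return (target_bpw - q) / (16 - q)

print(effective_bits(4), effective_bits(8))   # 4.5 8.5
print(round(bf16_fraction_for(10.25), 3))     # 0.233
```

Under these assumptions, 10.25 bits-per-weight corresponds to roughly 23% of the weights staying at bf16. That is a share of weights, not of tensors, so it is not directly comparable to the 125-of-224 tensor count.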

KL divergence vs the bf16 reference · lower is closer to the original model

Variant	Size	bpw	KL vs bf16	Reproduces bf16 output
bf16 (reference)	460 MB	16	—	—
uniform 4-bit	130 MB	4.53	0.0608	no
uniform 8-bit	266 MB	8.52	0.0007	partly
dhara-250m-OptiQ-8bit	357 MB	10.25	0.0005	yes
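For readers unfamiliar with the KL column, here is a minimal sketch of how a divergence between a reference model and a quantized one can be computed from next-token logits. The direction KL(reference || quantized), the natural-log units, and averaging over positions are assumptions; the logits are made up.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for one next-token distribution."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(ref_rows, test_rows):
    """Average the per-position KL over a sequence of logit rows."""
    return sum(kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)) / len(ref_rows)

ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
close = [[2.01, 0.49, -1.0], [0.1, 0.21, 3.0]]   # small perturbation
far = [[0.5, 2.0, -1.0], [3.0, 0.2, 0.1]]        # top token flipped
print(mean_kl(ref, close), mean_kl(ref, far))
```

A value near zero means the quantized model assigns almost the same probabilities as the reference at every position, regardless of whether either model's answer is correct.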

Decoded with the reference implementation's settings (greedy, repetition penalty 1.3), autoregressive and self-speculative decode are byte-identical to bf16. Block-diffusion matches at 0.87 similarity; more precision does not tighten it (at 12.5 bpw it gets worse), because the residual comes from confidence-threshold un-masking, not from quantization.

Capability Score · 6-benchmark mean · why we do not gate on it here

Variant	KL vs bf16	Capability	MMLU	GSM8K	IFEval
bf16 (reference)	—	8.07	24.7	1.6	22.2
dhara-250m-OptiQ-8bit	0.0005	7.83	24.5	1.8	20.7
uniform 4-bit (broken)	0.0594	8.53	24.3	2.0	25.0

Read the last row carefully. The uniform 4-bit build diverges from its own bf16 reference 120× more than the quant we ship, and it outscores both that quant and the bf16 model it was made from. On a 250M model the Capability Score is not merely blind to quantization damage, it is anti-correlated with it: rank these three by benchmark average and you ship the broken one on purpose.

BFCL, HumanEval and HashHop return a hard 0 for every variant including bf16, because a 250M model cannot do tool-calling, code or 12k-context retrieval at all. GSM8K sits at 1.6–2.0, effectively zero as well. With four of the six channels dead, the score rests on MMLU (at chance, 24–25%, ±2.7pp) and IFEval (±3.5pp). One noisy benchmark moving four points reorders everything, and degrading the weights happened to move IFEval up. So the bit budget for a model this small is set by KL against bf16, which tracks correctness, and the release contract enforces that with a hard KL bar.
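The ranking problem can be reproduced directly from the table's two summary columns. In the sketch below the KL and Capability values are copied from the table; `KL_BAR` is a hypothetical threshold chosen for illustration, not the release contract's actual value.

```python
# KL vs bf16 and Capability, copied from the table above.
# The reference has KL 0 against itself by definition.
variants = {
    "bf16 (reference)":      {"kl": 0.0,    "capability": 8.07},
    "dhara-250m-OptiQ-8bit": {"kl": 0.0005, "capability": 7.83},
    "uniform 4-bit":         {"kl": 0.0594, "capability": 8.53},
}
quants = {k: v for k, v in variants.items() if k != "bf16 (reference)"}

# Ranking by benchmark average picks the build that diverged most.
by_capability = max(quants, key=lambda k: quants[k]["capability"])
# Ranking by KL picks the build closest to the reference.
by_kl = min(quants, key=lambda k: quants[k]["kl"])

KL_BAR = 0.001  # hypothetical gate value, for illustration only
passes_gate = [k for k, v in quants.items() if v["kl"] <= KL_BAR]

print("best by capability:", by_capability)
print("best by KL:", by_kl)
print("passes KL gate:", passes_gate)
```

Selecting on Capability chooses the uniform 4-bit build; selecting or gating on KL chooses the quant that tracks bf16.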

Three ways to decode, self-speculation is the default

dhara decodes three ways from one set of weights, and the recommended default is self-speculation (--mtp). It drafts a block in one parallel forward and verifies it autoregressively (two forwards per round, no commit pass), so the emitted output is identical to plain AR decode while each round commits ~3–4 tokens: AR accuracy at ~1.4× the speed of token-by-token AR. The model is overhead-bound (a 32-token forward costs about the same as a 1-token forward), so the quant decodes at bf16 speed: quantization reduces size without changing throughput here.

Mode	Speed · M3 Max	Character
self-speculation (`--mtp`)	~1.4× AR	recommended, output identical to AR, committing several tokens per round
autoregressive	~130 tok/s	the exact reference; pair with a repetition penalty (greedy can loop)
block-diffusion	parallel	prefix-cached; bidirectional (infilling), trades denoising steps for speed

Self-speculation guarantees AR-identical output because the autoregressive verify decides every token. The speedup therefore costs no accuracy, and it is largest for fine-tuned models decoded greedily (the deployment case for a model like this). Self-spec and block-diffusion are prefix-cached (KV + Canon-conv state), so each step processes only the new block: O(block) per step, not O(sequence). Quantization leaves the Canon convs, QK-norm, and soft-cap at bf16 automatically (they aren't Linear modules), so only the attention and MLP projections are quantized.
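The draft-then-verify loop can be sketched with a toy model to show why the output matches plain AR decode. Everything here is hypothetical: `ar_next` is an arbitrary deterministic rule standing in for the AR forward, `draft_block` is a stand-in drafter with deliberate errors, and the sequential verify loop emulates what a real model would do in one parallel forward over the drafted block.

```python
def ar_next(context):
    """Toy autoregressive 'model': a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 17

def draft_block(context, block_size):
    """Toy drafter: agrees with the AR rule except when the context length
    is a multiple of 4, where it deliberately drafts a wrong token."""
    ctx, block = list(context), []
    for _ in range(block_size):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % 17  # deliberate draft error
        block.append(tok)
        ctx.append(tok)
    return block

def ar_decode(prompt, n_tokens):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.append(ar_next(out))
    return out[len(prompt):]

def self_spec_decode(prompt, n_tokens, block_size=4):
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        draft = draft_block(out, block_size)   # forward 1: draft a block
        ctx = list(out)
        for tok in draft:                      # forward 2: AR verify
            correct = ar_next(ctx)
            ctx.append(correct)                # the AR token is always the one kept
            if tok != correct:
                break                          # first mismatch ends the round
        out = ctx
        rounds += 1
    return out[len(prompt):][:n_tokens], rounds

spec_tokens, rounds = self_spec_decode([1, 2, 3], 20)
print(spec_tokens == ar_decode([1, 2, 3], 20), rounds)
```

Each round keeps the drafted prefix that the AR rule agrees with plus one AR-chosen token at the first disagreement, so every emitted token is one the AR path would have produced. The draft only affects how many tokens a round commits.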

Hello world

generate.py · python

import optiq  # registers dhara_ar with mlx-lm
from mlx_lm import load, generate

model, tok = load("mlx-community/dhara-250m-OptiQ-8bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain the Mediterranean climate."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tok, prompt))

Block-diffusion and self-speculation run through optiq.mlx_lm_patches.dhara_decode. To serve the OpenAI/Anthropic-compatible API, run optiq serve --model mlx-community/dhara-250m-OptiQ-8bit; the --mtp flag routes generation through the self-speculative path. LoRA fine-tuning uses the standard optiq lora train autoregressive trainer.