mlx-optiq

Vision (image input)

As of v0.2.0, mlx-optiq answers image-plus-text prompts on the Gemma-4 family. The language tower is still quantized with OptIQ mixed precision and decoded by mlx-lm; the vision tower is vendored into mlx-optiq (no mlx-vlm runtime dependency) and kept at bf16 in a sidecar file stored alongside the quantized weights.

At a glance: Vision support covers the Gemma-4 family (e2b, e4b, and the larger variants). The vision and audio towers stay at bf16; only the language tower is quantized. Audio (speech) input is not supported.

One artifact, two ways to load it

OptIQ stores the vision and audio towers, at bf16, in a sidecar file named optiq_vision.safetensors next to the quantized language shards. mlx-lm selects its weights with glob("model*.safetensors"), so it never matches the sidecar. The result is a single published repo that loads two ways:

Loader        | Reads                                          | You get
stock mlx-lm  | model*.safetensors                             | Text-only model (sidecar ignored)
OptIQ         | model*.safetensors + optiq_vision.safetensors  | Full image + text

There is no separate vision build. Vision stays at bf16 because 4-bit vision degrades OCR and fine detail; the language tower, where almost all of the size lives, is still fully quantized.
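The filename argument above can be checked with the standard library alone. This is a minimal sketch, assuming an illustrative file listing (the shard names are made up for the example); it only shows that the pattern quoted above does not match the sidecar name, and it is not mlx-lm's loading code.

```python
from fnmatch import fnmatch

# Illustrative listing of one published repo; the shard names are assumptions.
files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "optiq_vision.safetensors",
]

PATTERN = "model*.safetensors"        # pattern quoted in the text above
SIDECAR = "optiq_vision.safetensors"  # sidecar name quoted in the text above

# What a "model*.safetensors" match would pick up: the language shards only.
language_shards = [name for name in files if fnmatch(name, PATTERN)]

# A loader that knows about the sidecar can simply look for it by name.
has_sidecar = SIDECAR in files

print(language_shards)
print(has_sidecar)
```

Because the sidecar name does not start with "model", the same directory serves both loaders without any renaming or filtering step.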

Serving images

When a model carries the sidecar, optiq serve turns on image support automatically. Send an OpenAI-style image_url content part (a data URL or an http(s) URL):

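```bash
# Terminal 1: start the server (image support turns on when the sidecar is present)
optiq serve --model mlx-community/gemma-4-e2b-it-OptiQ-4bit

# Terminal 2: send an image plus a question; "..." stands for the base64-encoded image
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":[
        {"type":"image_url","image_url":{"url":"data:image/png;base64,..."}},
        {"type":"text","text":"What is in this image?"}]}]}'
```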

Text-only requests are unchanged: the vision path runs only when a request actually carries an image, so MTP speculation, mounted LoRA adapters, KV-cache quantization, and plain text generation all behave exactly as they do on a model without the sidecar.
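The data URL in the request body can be assembled in Python rather than by hand. The sketch below uses only the standard library; the helper names image_part and chat_payload are hypothetical, and the payload contains just the fields shown in the curl example, so any extra fields a given server expects are not covered here.

```python
import base64
import json

def image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

def chat_payload(image_bytes: bytes, question: str, mime: str = "image/png") -> str:
    """Build a JSON body with one user turn: an image followed by a question."""
    messages = [{"role": "user", "content": [
        image_part(image_bytes, mime),
        {"type": "text", "text": question},
    ]}]
    return json.dumps({"messages": messages})

# Placeholder bytes (only the PNG signature), not a decodable image.
# In real use, read the file instead, e.g. Path("cat.png").read_bytes().
body = chat_payload(b"\x89PNG\r\n\x1a\n", "What is in this image?")
print(body)
```

The resulting string can be sent as the request body with any HTTP client, with Content-Type set to application/json as in the curl call.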

In the Lab

The Lab's Chat tab takes image uploads directly. Run optiq lab --model <sidecar-equipped quant>, open Chat, click attach, drop in a picture, and ask a question.

[Figure: OptIQ Lab analyzing an uploaded image of shapes. Caption: gemma-4-e2b at 4-bit, reading an uploaded image in the Lab.]

Python API

The engine takes images= (paths, URLs, data URLs, or PIL images) or a full messages= list with image_url parts:

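```python
from mlx_lm import load
from optiq.runtime.engine import OptiqEngine

REPO = "mlx-community/gemma-4-e2b-it-OptiQ-4bit"

# Load the quantized language tower with stock mlx-lm.
model, tok = load(REPO)

# Wrap it in the OptIQ engine, passing the repo so the engine can find the sidecar.
eng = OptiqEngine.from_loaded(model, tok, REPO)

# images= accepts paths, URLs, data URLs, or PIL images.
st = eng.generate("What is in this image?", images=["cat.jpg"], max_tokens=128)
print(st.text)
```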

Adding the sidecar to a quant

If you have an existing OptIQ language quant and the bf16 base it came from, attach a vision sidecar with one call. It extracts the bf16 vision and audio towers, writes optiq_vision.safetensors into the quant directory, and restores the multimodal config keys:

```python
from optiq.vlm import build_vision_sidecar

build_vision_sidecar(
    base="google/gemma-4-e2b-it",             # bf16 base with the towers
    quant_dir="./gemma-4-e2b-it-OptiQ-4bit",  # existing OptIQ language quant
)
```
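Conceptually, the extraction step amounts to partitioning the base checkpoint's tensors by name. The following is only a sketch of that idea and not the implementation of build_vision_sidecar: the key prefixes, the split_towers helper, and the stand-in tensors are all hypothetical, and the real tower names in a Gemma-4 checkpoint may differ.

```python
# Hypothetical key prefixes; real checkpoint key names may differ.
TOWER_PREFIXES = ("vision_tower.", "audio_tower.")

def split_towers(weights: dict) -> tuple[dict, dict]:
    """Partition a name -> tensor mapping into (tower weights, everything else)."""
    towers = {k: v for k, v in weights.items() if k.startswith(TOWER_PREFIXES)}
    rest = {k: v for k, v in weights.items() if not k.startswith(TOWER_PREFIXES)}
    return towers, rest

# Stand-in checkpoint: strings take the place of real bf16 tensors.
base = {
    "vision_tower.patch_embed.weight": "bf16 tensor",
    "audio_tower.conv.weight": "bf16 tensor",
    "language_model.layers.0.mlp.weight": "bf16 tensor",
}

towers, rest = split_towers(base)
print(sorted(towers))  # the entries that would go into the sidecar file
print(sorted(rest))    # the entries that belong to the language tower
```

In this picture the tower entries are what ends up in optiq_vision.safetensors, while the language-tower entries are already covered by the existing quantized shards.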

How it works

The vision front-end preprocesses the pixels, runs the vendored Gemma-4 SigLIP tower, projects the result into the language model's hidden space, and scatters those soft tokens into the text-embedding sequence at the image-placeholder positions. The merged embeddings go to mlx-lm's gemma4_text through its input_embeddings and per_layer_inputs hooks, and decode proceeds with the same quantized weights, KV cache, and sampler as text.
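The scatter step can be illustrated without any array library. This is a dependency-free sketch under stated assumptions: IMAGE_TOKEN_ID is a hypothetical placeholder id (the real one comes from the tokenizer), merge_image_features is an illustrative name, and plain lists stand in for the tensors the runtime actually uses.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, for illustration only

def merge_image_features(token_ids, text_embeds, image_feats):
    """Replace each placeholder row of text_embeds with the next image feature row."""
    positions = [i for i, t in enumerate(token_ids) if t == IMAGE_TOKEN_ID]
    if len(positions) != len(image_feats):
        raise ValueError("placeholder count must match the number of soft tokens")
    merged = [row[:] for row in text_embeds]  # copy so the input is left untouched
    for pos, feat in zip(positions, image_feats):
        merged[pos] = list(feat)
    return merged

# Four tokens, two of them image placeholders; embeddings have width 2.
token_ids = [5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 9]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.9, 0.9]]
image_feats = [[1.0, 2.0], [3.0, 4.0]]  # projected soft tokens from the vision tower

merged = merge_image_features(token_ids, text_embeds, image_feats)
print(merged)
```

The text rows keep their positions and only the placeholder rows change, which is why the merged sequence can be handed to the language model in place of the ordinary token embeddings.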

The vendored code is validated against mlx-vlm tensor for tensor: feeding mlx-vlm's own pixel values through mlx-optiq's preprocessing, vision tower, and projection reproduces mlx-vlm's outputs with a maximum absolute difference of zero. Two details are needed for that to hold. First, gemma4_text always rescales the incoming embeddings by embed_scale, so the vision features are pre-divided by embed_scale to compensate. Second, the image-token positions in the per-layer inputs are zeroed before projection.
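The embed_scale compensation is simple arithmetic: dividing by the scale first means the later multiplication restores the original features. The sketch below assumes the common Gemma convention that the scale is the square root of the hidden size, and HIDDEN_SIZE is a hypothetical value; neither is taken from mlx-optiq's code. Because this toy uses Python floats, the round trip is checked with a tolerance rather than for exact equality.

```python
import math

HIDDEN_SIZE = 1536                    # hypothetical width, for illustration only
EMBED_SCALE = math.sqrt(HIDDEN_SIZE)  # assumed convention: sqrt(hidden size)

def pre_divide(features, scale):
    """Divide vision features by the scale before handing them to the text model."""
    return [[x / scale for x in row] for row in features]

def text_model_rescale(embeds, scale):
    """Stand-in for the rescaling the text model applies to every incoming embedding."""
    return [[x * scale for x in row] for row in embeds]

vision = [[0.25, -1.5], [2.0, 0.75]]
seen_by_layers = text_model_rescale(pre_divide(vision, EMBED_SCALE), EMBED_SCALE)

# The layers see the vision features at their intended magnitude again.
ok = all(
    math.isclose(a, b, rel_tol=1e-12)
    for row_a, row_b in zip(seen_by_layers, vision)
    for a, b in zip(row_a, row_b)
)
print(ok)
```

Without the pre-division, the image rows would reach the first layer multiplied by the scale while the text rows would not be affected any differently than usual, so the two kinds of rows would no longer be on a common scale.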

See also: the release write-up, "mlx-optiq can see". For family details and sampling defaults, see the Gemma-4 family guide.