Vision (image input)
As of v0.2.0, mlx-optiq answers image and text prompts on the Gemma-4 family. The language tower is still OptIQ mixed-precision quantized and decoded by mlx-lm; the vision tower is vendored into mlx-optiq (no mlx-vlm runtime dependency) and kept at bf16 in a sidecar that rides alongside the quantized weights.
One artifact, two ways to load it
OptIQ stores the vision and audio towers, at bf16, in a sidecar file named optiq_vision.safetensors next to the quantized language shards. mlx-lm selects its weights with glob("model*.safetensors"), so it never matches the sidecar. The result is a single published repo that loads two ways:
| Loader | Reads | You get |
|---|---|---|
| Stock mlx-lm | model*.safetensors | Text-only model (sidecar ignored) |
| OptIQ | model*.safetensors + optiq_vision.safetensors | Full image + text |
There is no separate vision build. The vision tower stays at bf16 because quantizing it to 4 bits degrades OCR and fine-detail recognition; the language tower, where almost all of the size lives, is still fully quantized.
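The file-selection behaviour described above can be checked with a small sketch. It creates an empty stand-in repo directory and applies the same `model*.safetensors` pattern; `split_weight_files` and the shard names are hypothetical, chosen only for illustration, and are not part of mlx-lm or mlx-optiq.

```python
import tempfile
from pathlib import Path

SIDECAR_NAME = "optiq_vision.safetensors"  # sidecar filename from the text above


def split_weight_files(repo_dir):
    """Hypothetical helper: list the shards a `model*.safetensors` glob sees,
    and report whether the vision sidecar sits next to them."""
    repo = Path(repo_dir)
    language_shards = sorted(p.name for p in repo.glob("model*.safetensors"))
    has_sidecar = (repo / SIDECAR_NAME).is_file()
    return language_shards, has_sidecar


# Build a stand-in repo with empty files (the shard names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
        SIDECAR_NAME,
    ):
        (Path(tmp) / name).touch()
    shards, has_sidecar = split_weight_files(tmp)

print(shards)       # only the two language shards match the glob
print(has_sidecar)  # the sidecar is present but never matched by the glob
```

Because the sidecar's name does not start with `model`, a text-only loader that globs this pattern never sees it, while a loader that looks for the sidecar by name can pick it up.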
Serving images
When a model carries the sidecar, optiq serve turns on image support automatically. Send an OpenAI-style image_url content part (a data URL or an http(s) URL):
    optiq serve --model mlx-community/gemma-4-e2b-it-OptiQ-4bit

    curl http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":[
            {"type":"image_url","image_url":{"url":"data:image/png;base64,..."}},
            {"type":"text","text":"What is in this image?"}]}]}'
Text-only requests are unchanged: the vision path runs only when a request actually carries an image, so MTP speculation, mounted LoRA adapters, KV-cache quantization, and plain text generation all behave exactly as they do without the sidecar.
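For clients that build the request in code, the sketch below assembles the same OpenAI-style body in Python using only the standard library. The helper names `image_bytes_to_data_url` and `build_image_chat_payload` are hypothetical, and the sample bytes are just the 8-byte PNG signature standing in for a real image file.

```python
import base64
import json


def image_bytes_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_chat_payload(image_bytes, question, mime_type="image/png"):
    """Build a chat-completions body with one image part and one text part."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_bytes_to_data_url(image_bytes, mime_type)
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ]
    }


# Placeholder bytes: the PNG file signature only, not a decodable image.
sample_bytes = b"\x89PNG\r\n\x1a\n"
payload = build_image_chat_payload(sample_bytes, "What is in this image?")
body = json.dumps(payload)  # JSON string to send as the POST body
```

In real use, `sample_bytes` would come from reading an image file in binary mode, and `body` would be POSTed to the `/v1/chat/completions` endpoint shown above with a `Content-Type: application/json` header.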
In the Lab
The Lab's Chat tab takes image uploads directly. Run optiq lab --model <sidecar-equipped quant>, open Chat, click attach, drop in a picture, and ask a question.
Python API
The engine takes images= (paths, URLs, data URLs, or PIL images) or a full messages= list with image_url parts:
    from mlx_lm import load
    from optiq.runtime.engine import OptiqEngine

    model, tok = load("mlx-community/gemma-4-e2b-it-OptiQ-4bit")
    eng = OptiqEngine.from_loaded(model, tok, "mlx-community/gemma-4-e2b-it-OptiQ-4bit")
    st = eng.generate("What is in this image?", images=["cat.jpg"], max_tokens=128)
    print(st.text)
Adding the sidecar to a quant
If you have an existing OptIQ language quant and the bf16 base it came from, attach a vision sidecar with one call. It extracts the bf16 vision and audio towers, writes optiq_vision.safetensors into the quant directory, and restores the multimodal config keys:
    from optiq.vlm import build_vision_sidecar

    build_vision_sidecar(
        base="google/gemma-4-e2b-it",              # bf16 base with the towers
        quant_dir="./gemma-4-e2b-it-OptiQ-4bit",   # existing OptIQ language quant
    )
How it works
The vision front-end preprocesses the pixels, runs the vendored Gemma-4 SigLIP tower, projects the result into the language model's hidden space, and scatters those soft tokens into the text-embedding sequence at the image-placeholder positions. The merged embeddings go to mlx-lm's gemma4_text through its input_embeddings and per_layer_inputs hooks, and decode proceeds with the same quantized weights, KV cache, and sampler as text.
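The scatter step can be illustrated with a toy sketch. It uses plain Python lists instead of MLX arrays so it runs anywhere; `scatter_soft_tokens`, the token id `99`, and the two-dimensional embeddings are illustrative assumptions, not mlx-optiq's actual code or values.

```python
def scatter_soft_tokens(token_ids, text_embeddings, image_features, image_token_id):
    """Return a copy of text_embeddings in which each image-placeholder
    position is overwritten, in order, by one projected image feature row."""
    positions = [i for i, tok in enumerate(token_ids) if tok == image_token_id]
    if len(positions) != len(image_features):
        raise ValueError(
            f"{len(positions)} placeholders but {len(image_features)} image features"
        )
    merged = [list(row) for row in text_embeddings]  # leave the input untouched
    for pos, feature in zip(positions, image_features):
        merged[pos] = list(feature)
    return merged


IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, for illustration only

token_ids = [5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 7]
text_embeddings = [[0.1, 0.2], [0.0, 0.0], [0.0, 0.0], [0.3, 0.4]]
image_features = [[1.0, 1.5], [2.0, 2.5]]  # already projected to hidden size 2

merged = scatter_soft_tokens(token_ids, text_embeddings, image_features, IMAGE_TOKEN_ID)
print(merged)
```

The text positions keep their original embeddings and only the placeholder rows change, which is why the rest of the decode path can treat the merged sequence like any other embedded prompt.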
The vendoring is validated against mlx-vlm tensor for tensor: feeding mlx-vlm's own pixel values through mlx-optiq's preprocessing, vision tower, and projection reproduces its outputs exactly (maximum absolute difference of zero). Two details matter for that to hold. First, gemma4_text always rescales the incoming embeddings by embed_scale, so the vision features are pre-divided by embed_scale to compensate. Second, the per-layer inputs zero the image-token positions before projection.
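The embed_scale compensation can be shown as a small round-trip check. This is a toy sketch with plain Python lists, and it assumes embed_scale equals the square root of the hidden size, which the text above does not state; the function names are hypothetical. A hidden size of 4 is chosen so that the scale is exactly 2.0 and the round trip is bit-exact; for other scale values, a divide-then-multiply round trip in floating point is not guaranteed to be exact.

```python
import math


def prescale_vision_features(features, embed_scale):
    """Divide vision features by embed_scale before handing them over."""
    return [[v / embed_scale for v in row] for row in features]


def apply_text_tower_scale(embeddings, embed_scale):
    """Mimic a text tower that multiplies all incoming embeddings by embed_scale."""
    return [[v * embed_scale for v in row] for row in embeddings]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


hidden_size = 4                        # toy value, not the real model's size
embed_scale = math.sqrt(hidden_size)   # assumption: scale is sqrt(hidden_size)

vision = [[0.5, -1.25, 2.0, 0.0], [3.0, 0.75, -0.5, 1.5]]
roundtrip = apply_text_tower_scale(
    prescale_vision_features(vision, embed_scale), embed_scale
)
diff = max_abs_diff(vision, roundtrip)
print(diff)
```

Without the pre-division, the vision rows would reach the first decoder layer multiplied by embed_scale, while the text rows would carry the scale they were meant to have.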