Engineering · June 20, 2026

Fine-tuning a vision model on a Mac.

OptiQ now fine-tunes the language tower of a quantized vision-language model on image+text data, entirely on a 24 GB Mac. A LoRA on Qwen3.5-0.8B-OptiQ-4bit trained on ChartQA lifts strict exact-match from 26% to 40% and output similarity from 0.39 to 0.60 on held-out charts. The vision tower stays frozen; only the language tower learns. No GPU, no cloud, and you can run the whole flow (build the dataset, train the LoRA) from the OptiQ Lab.

The result

80 held-out ChartQA questions, base versus the LoRA, both with images letterboxed to a 512px canvas, scored three ways: ChartQA relaxed accuracy (substring or numeric-within-5%), strict exact-match, and a similarity ratio against the ground-truth answer string.

Metric	Base	+ LoRA	Δ
Relaxed accuracy	50.0%	55.0%	+5.0 pp
Exact match	26.2%	40.0%	+13.8 pp
Similarity	0.385	0.598	+0.21

The samples show it. Asked for a value, the base model answers "There are 10 food items shown in the bar graph" (right idea, wrong format, so it fails strict matching); the fine-tuned model answers "3". Exact match rises by about half, from 26.2% to 40.0%. Relaxed accuracy moves a smaller 5 points: the bigger win is format and consistency, with a real gain in correctness on top.
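The three scores above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script behind the table: the normalization rules are assumptions, and the similarity ratio is assumed here to be `difflib.SequenceMatcher.ratio()`, which the post does not confirm.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def _to_number(text: str):
    """Parse answers like '3', '12.5', '1,200' or '40%'; return None otherwise."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def exact_match(pred: str, gold: str) -> bool:
    """Strict scoring: the normalized strings must be identical."""
    return _normalize(pred) == _normalize(gold)


def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Relaxed scoring: numeric within 5% of the gold value, else substring."""
    p, g = _to_number(pred), _to_number(gold)
    if p is not None and g is not None:
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tol
    return _normalize(gold) in _normalize(pred)


def similarity(pred: str, gold: str) -> float:
    """Character-level similarity ratio in [0, 1] against the gold string."""
    return SequenceMatcher(None, _normalize(pred), _normalize(gold)).ratio()
```

With a made-up pair such as prediction "There are 10 items" and gold answer "10", `relaxed_match` passes on the substring while `exact_match` fails and `similarity` is low. That is the pattern a verbose but correct answer produces, and it is why a format-focused fine-tune moves exact match more than relaxed accuracy.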

Why this matters: OptiQ already runs images through quantized VLMs. This is the other half: adapting one to your data. The vision tower is frozen and the language tower trains, which is the common, low-risk VLM fine-tune (domain VQA, OCR, captioning), and it now fits on a 24 GB Mac.

Build the dataset in the Lab

The OptiQ Lab's dataset builder gained a VLM image+text template. Point it at any image+text dataset on the Hub (here, ChartQA), map the columns, and it standardizes and exports the JSONL the trainer reads.

[Screenshot: the OptiQ Lab Build dataset page with the VLM image+text template, filled in for HuggingFaceM4/ChartQA: image column 'image', question column 'query', answer column 'label', a 'Standardize to 512px square' field, and a row cap.]

The VLM image+text template. Note 'Standardize to 512px square': every image is letterboxed to one fixed canvas. That single choice is what keeps training memory bounded, so the Lab makes it the default.

[Screenshot: OptiQ Lab dataset build complete, showing 'Dataset written to' a local path, a 'Point Fine-tune at this dataset' note, and a push-to-Hugging-Face panel.]

The builder streams the dataset, letterboxes each image, and writes `{image, prompt, completion}` JSONL locally (optionally pushed to the Hub).
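To make the letterboxing and the row format concrete, here is a small sketch of the geometry and of one JSONL line. It illustrates the idea and is not the Lab's own code. The helper names are hypothetical, and storing `image` as a file path is an assumption, since the post names the three keys but not how the image is encoded.

```python
import json


def letterbox_geometry(width: int, height: int, canvas: int = 512):
    """Return ((new_w, new_h), (pad_left, pad_top)) for fitting an image
    inside a canvas x canvas square without changing its aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so the longer side exactly fills the canvas.
    scale = min(canvas / width, canvas / height)
    new_w = min(canvas, max(1, round(width * scale)))
    new_h = min(canvas, max(1, round(height * scale)))
    # Center the resized image; the remaining border is padding.
    pad_left = (canvas - new_w) // 2
    pad_top = (canvas - new_h) // 2
    return (new_w, new_h), (pad_left, pad_top)


def make_row(image_path: str, prompt: str, completion: str) -> str:
    """Serialize one training example as a single JSONL line.
    Assumption: the image is referenced by path."""
    return json.dumps(
        {"image": image_path, "prompt": prompt, "completion": completion}
    )


if __name__ == "__main__":
    # An 800x600 chart becomes 512x384, centered with 64px bars top and bottom.
    print(letterbox_geometry(800, 600))
    # Hypothetical example row.
    print(make_row("charts/0001.png", "How many bars are shown?", "3"))
```

Whatever the input aspect ratio, the model always sees a 512x512 image, so every training step processes the same number of image tokens.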

Then fine-tune, also in the Lab

The Fine-tune wizard gained a Vision objective. Pick it, point at the dataset you just built, and the vision-safe defaults are pre-filled: a 512px canvas, scale 8, gradient checkpointing on.

[Screenshot: the OptiQ Lab Fine-tune wizard, Hyperparameters step. Training objective is 'Vision, image+text LoRA'. A green banner explains the frozen vision tower and uniform-canvas requirement. Image canvas 512, rank 8, scale 8.]

Selecting Vision sets scale to 8 automatically (the Qwen3.5/3.6 hybrid family collapses at the text-SFT default of 20) and surfaces the image-canvas field. Start training, and the loss chart updates live as training runs.
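For readers unfamiliar with the scale setting, the sketch below shows what it does in a generic LoRA layer. It assumes the usual convention in which scale multiplies the low-rank update directly (y = Wx + scale · B(Ax)). OptiQ's internals are not shown in this post, and the tiny matrices are made-up values for illustration only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale):
    """Compute y = W x + scale * B (A x).

    W is the frozen (d_out x d_in) weight, A is (rank x d_in) and
    B is (d_out x rank). Only A and B would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Made-up rank-1 adapter on a 2x2 identity weight.
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[0.25, 0.25]]
    B = [[0.5], [-0.5]]
    x = [1.0, 2.0]
    print(lora_forward(x, W, A, B, scale=8))   # [4.0, -1.0]
    print(lora_forward(x, W, A, B, scale=20))  # [8.5, -5.5]
```

The same adapter weights push the output 2.5 times further at scale 20 than at scale 8. The sketch does not explain why the hybrid family is sensitive to this, only what the setting controls.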

Or three commands

terminal · dataset, train, serve

$ pip install 'mlx-optiq'

# 1. prep an image+text jsonl (one row per {image, prompt, completion})
#    images letterboxed to a uniform canvas; see scripts/prep_chartqa.py

# 2. train the LoRA on the prepared jsonl
$ optiq lora train mlx-community/Qwen3.5-0.8B-OptiQ-4bit \
    --vision --data ./chartqa/train.jsonl \
    --rank 8 --iters 800 --learning-rate 5e-5 \
    --output ./chartqa-lora

# 3. serve the base model with the trained adapter
$ optiq serve --model mlx-community/Qwen3.5-0.8B-OptiQ-4bit \
    --adapter ./chartqa-lora

--vision auto-engages when the model ships an optiq_vision sidecar. The vision tower is frozen; LoRA trains the language tower's attention and MLP projections with gradient checkpointing on.

How the training fits on a Mac

Two defaults keep a VLM LoRA inside 24 GB, both on automatically. Every image is letterboxed to a uniform square canvas, which holds the per-step memory constant. Gradient checkpointing recomputes each decoder block's activations in the backward pass instead of storing them, which lets the backward pass through the Qwen3.5 hybrid (gated-delta) attention fit in a few gigabytes. The vision defaults also set gradient clipping and a 5e-5 learning rate, which keep training stable on the short answers that chart datasets use. A full ChartQA run keeps more than 6 GB of memory free throughout.
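A back-of-the-envelope sketch shows why these two defaults bound memory. All the numbers in it are placeholders: the patch size, merge factor, block count, and per-block activation count are hypothetical and are not Qwen3.5's confirmed values. The point is the shape of the arithmetic, not the totals.

```python
def vision_tokens(canvas: int, patch: int = 16, merge: int = 2) -> int:
    """Image tokens the language tower sees for one square canvas.

    Assumes a ViT-style encoder that cuts the image into patch x patch
    tiles and then merges merge x merge neighbours into one token.
    Both defaults are hypothetical.
    """
    if canvas % (patch * merge) != 0:
        raise ValueError("canvas must be a multiple of patch * merge")
    per_side = canvas // patch // merge
    return per_side * per_side


def stored_activation_units(num_blocks: int, per_block: int, checkpoint: bool) -> int:
    """Rough count of activation tensors kept alive for the backward pass.

    Without checkpointing, every block keeps all of its intermediates.
    With it, each block keeps only its input, and one block at a time
    is recomputed in full during the backward pass.
    """
    if checkpoint:
        return num_blocks + per_block
    return num_blocks * per_block


if __name__ == "__main__":
    # A fixed 512px canvas always yields the same token count ...
    print(vision_tokens(512))   # 256
    # ... while a 1024px image would cost four times as many.
    print(vision_tokens(1024))  # 1024
    # Hypothetical 24 blocks with 10 stored intermediates each.
    print(stored_activation_units(24, 10, checkpoint=False))  # 240
    print(stored_activation_units(24, 10, checkpoint=True))   # 34
```

Under these assumptions, the fixed canvas pins the sequence length of every step, and checkpointing trades one extra forward pass per block for a several-fold cut in stored activations.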

Try it

The VLM image+text dataset template and the Vision fine-tune objective ship in the OptiQ Lab, and optiq lora train --vision is in the CLI. Point either at a 0.8B VLM quant and your own image+text data, and you have a local vision fine-tune that fits on a Mac.

the mlx-optiq team