Tools in OptIQ Lab chat
A small quantized model running locally is great for chat. It is much more useful if it can also search the web, run Python, and execute shell commands. OptIQ Lab v0.1.0 ships all three, gated behind a single sandbox layer, with a healer for the malformed tool calls that quantized open-weight models routinely emit.

Three tools, one sandbox
The chat surface offers three tools the model can call. None of them need an API key, none of them phone home, and all three share the same isolation layer.
- web_search. Pass
{"query": "..."}to get the top DuckDuckGo results back as title / URL / snippet triples, or{"url": "..."}to fetch a single page as compact markdown viahtml2text. We use theddgslibrary (no API key, no rate limit ceremony), cap page bodies at 8 KB so a tool reply stays inside the context budget, and refuse non-http(s) schemes. - python. Runs Python source in a sandbox with a 30 s wall clock and a 1 GB memory cap. The AST is pre-scanned for
os.system,subprocess.*, signal tampering (signal.signal,signal.alarm, etc.), and direct network access (socket.socket,urllib.request.urlopen,requests.*). Anything that clears the AST check runs in the same sandbox the HumanEval evaluator uses. - terminal. Runs a bash one-liner in the same sandbox, with a per-call temporary working directory. Dangerous commands (
rm,dd,sudo,curl,ssh, plus the usual suspects) are blocked, but only when they appear in command position.echo "do not use sudo"still works, because the blocker walks tokens withshlexand tracks shell separators, command prefixes, and assignment-precedes-command patterns.
The three-tier sandbox
The Python and terminal tools share an isolation chain that picks the strongest layer available on the host:
| Tier | Used when | Isolation |
|---|---|---|
container | apple/container is installed | Full VM, no network, alpine:3.20 base |
sandbox-exec | macOS host (default) | (deny default) profile, file-write only to a temp dir |
subprocess | Linux / fallback | POSIX setrlimit + socket.socket patched out |
The chat UI shows which tier is active under Model & params → Sandbox. The same sandbox runs HumanEval programs during evaluation, so when you see sandbox: sandbox-exec rc=0 in a tool card, that is the same code path that scored the model's pass@1.
Healing malformed tool calls
Tool calling is theoretically a structured field in the assistant message: "tool_calls": [{"function": {"name": "...", "arguments": "{...}"}}]. Frontier models hit this every time. Small quantized open-weight models, in practice, often emit something else.
We track six recurring broken shapes:
| Shape | Looks like |
|---|---|
| Hermes / Qwen tags | <tool_call>{"name": "python", "arguments": {...}}</tool_call> |
| Fenced JSON | ```json\n{"name": "...", "arguments": {...}}\n``` |
| Bare JSON in content | {"name": "terminal", "arguments": {"command": "ls"}} |
| Trailing commas | {"name": "python", "arguments": {"code": "1",},} |
| Function-call form | python({"code": "..."}) |
| Key-is-tool-name | {"python": {"code": "..."}} |
The healer walks each variant, runs progressive JSON cleanups (trailing-comma stripping, fancy-quote replacement, embedded-object extraction), and synthesizes the standard tool_calls array. The model name is matched against the live tool registry, so the healer cannot hallucinate a tool that does not exist: {"name": "nuke", "arguments": {}} falls through to plain content and the user sees the literal JSON, rather than the orchestrator executing something undefined.
Healed calls are flagged so the UI can show a small "healed" chip on the tool card. It is a hint to the user that the model's tool-call output is off-spec, not an error per se.
The orchestration loop
When tools are on, the browser hits a server-side SSE endpoint instead of streaming directly from mlx-lm. The orchestrator does the work the JS cannot safely do:
- POST the conversation to
/v1/chat/completionswithtools=[...]andstream=false. Local generation is fast enough that buffering one assistant turn is cheaper than streaming + parsing tool deltas. - Heal the response. If
tool_callsis non-empty after healing, execute each call via the tool registry, capturing stdout / stderr / sandbox kind / elapsed time. - Append a
role=toolmessage per result and loop. Cap at six turns so a model that refuses to stop calling tools gets cut off rather than burning the host's RAM. - When the model finally produces a plain text reply, stream it back as a
tokenSSE event. The UI renders it the same way it would the non-tools path.
The wire format is small: event: session, event: tool_call, event: tool_result, event: token, event: assistant, event: cancelled, event: error, event: done. Each frame is one JSON line. The UI uses this to render tool cards in real time, with arguments and output collapsed by default and a per-call elapsed timer.
Tuning for local-model failure modes
A 6-turn cap was the first instinct: chats normally need one or two tool calls, six is plenty. It also turned out to be wrong. After watching Qwen3.5-9B hit the cap on a research task that needed five searches plus a Python summary, we audited what Unsloth Studio does in the same situation. Their default is 25 turns, paired with three small refinements we adopted:
- Budget-exhausted re-prompt. When the model hits the turn cap, the orchestrator doesn't return an error. It appends a user-role message that says "stop calling tools and answer now using only the information you have", removes the
toolsfield from the request, and sends one more call. The model has to commit to a text reply. This is strictly better than failing the chat outright. - Duplicate-call de-dup, but only on successful calls. Consecutive identical tool calls are common when a model gets stuck. We detect
(name, arguments)matches and substitute a "you already called this" nudge so the sandbox doesn't burn a second execution. The "only on success" filter is the subtle bit: if a tool errored last turn, the same arguments are allowed to run again so the model can iterate on a fix. Without that filter, a model trying to recover from a blocked Python call would get permanently stuck. - Recovery nudge on tool errors. When a tool result starts with a recognized error prefix (
Error:,Blocked:,Exit code,Search failed,sandbox: rejected, etc.) the orchestrator appends a short instruction telling the model to try a different approach or different arguments. The UI shows anerrorchip on the card so the user knows what happened.
Stop button
A 25-turn budget is no good if you have to wait for it to elapse before getting your terminal back. The Lab supports cancellation end-to-end: clicking Stop in the composer hits a /api/chat/cancel endpoint, the orchestrator polls a threading.Event between turns, and any currently-running tool subprocess gets SIGKILL'd at its process group. The change inside the sandbox was the substantive part: we replaced the blocking subprocess.run with a Popen + watcher loop that polls the cancel event every 50 ms. A stuck sleep 30 in the terminal tool now dies inside half a second.
Inline charts from the python tool
The most useful thing a chat model with a Python tool can do is plot something. The challenge: the python sandbox is a transient subprocess in a temp directory, so any matplotlib output gets shredded when the workdir is cleaned up. We snapshot the workdir for image files (.png, .jpg, .svg, .webp) immediately after the subprocess exits, base64-encode the bytes, and surface them on a separate images field of the tool_result SSE event. The UI renders them as <img> tags inside the tool card.
The base64 payload never enters the model's context. The orchestrator strips an __IMAGES__: sentinel line out of the result string before the result is appended to the conversation; the model only sees a short "[1 image attached to the chat]" footnote. Otherwise a single 50 KB plot would consume 67 KB of context every turn, which is silly.
The sandbox-exec profile needed two patches to make matplotlib work: realpath the temp workdir before formatting the SBPL policy (otherwise macOS's /private/var/folders symlink path slips past the subpath check), and set MPLCONFIGDIR=$workdir in the environment so matplotlib's config-dir lookup doesn't try to write to ~/.matplotlib.
Grouping adjacent tool calls
A model doing real work makes a lot of tool calls. A research session might be search → fetch → fetch → python summarize → answer. Five tool cards in a row, stacked vertically in the thread, drowns out the user's question and the assistant's eventual answer. The Lab groups consecutive role: tool messages into a single "N tool calls" accordion with little chips showing what got called. The most-recent group is auto-expanded so the user sees what's currently happening; older groups auto-collapse when the assistant starts replying with text.
File attachments
The composer accepts text formats with no extraction round-trip: the JS reads the file, wraps it in a fenced code block, and inserts it into the input. PDF and DOCX bounce through a server-side endpoint backed by pypdf and docx2txt. We cap extraction at 100 K characters and 200 PDF pages so a textbook does not blow the context window.
Image and audio are out of scope for v0.1.0. The Lab's quantization pipeline is LLM-first; we would rather not pretend to handle VLM inputs that the rest of the stack does not yet support.
Try it
$ pip install --upgrade "mlx-optiq[lab]" $ optiq lab --model mlx-community/Qwen3.5-9B-OptiQ-4bit
Navigate to http://127.0.0.1:7860/chat, open Model & params, confirm Tools = enabled, and ask the model something that requires a calculation or a search. The Lab docs have the full reference.