mlx-optiq
Lab · Build dataset

Build dataset

Twelve templates to turn pairs / docs / code / seeds / target text / scenarios into JSONL the fine-tune workflow can read.

LLM-driven templates call the Lab's own API server, so generation runs against the model you have loaded. Reasoning is auto-disabled on the generation calls so a thinking model doesn't burn its budget on the <think> block.

  • SFT from QA pairs, DPO from preference pairs, Style transfer, Code completion, Self-instruct expansion, Format conversion.
  • Prompt reconstruction: work backwards from a target paragraph; the assistant target stays verbatim so facts and formatting are preserved.
  • Multi-turn chat synthesis, Tool-use traces (messages-format with a top-level tools field), RAG Q/A from documents.
  • Reasoning trace (CoT) synthesis and Verified code generation (asserts run in the sandbox; rows tagged verified: true/false).

Dataset templates

Datasets push to HF as repo_type="dataset".