nanochat d12 post-training (SFT + RL)

Tao Lin

Post-training pipeline for the d12 reference model: SFT → RL → eval.

Base model: the d12 H100 SSSL final checkpoint from the d12 baseline project (val_bpb = 0.854 at step 2205)

Stages:

  1. SFT — teach conversation format, tool use, multiple-choice, math (SmolTalk + MMLU×3 + GSM8K×4 + Identity + Spelling = ~1.07M rows)
  2. RL — GRPO on GSM8K (planned)
  3. Eval — HumanEval, MMLU, ARC, GSM8K (planned)
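The SFT mixture above oversamples the small datasets (MMLU×3, GSM8K×4) relative to SmolTalk. A minimal sketch of that kind of integer-repeat mixing; the function name and toy dataset sizes are assumptions, not the project's actual code:

```python
import random

def build_sft_mixture(datasets, repeats, seed=0):
    """Oversample datasets by integer repetition, then shuffle.

    datasets: dict name -> list of rows
    repeats:  dict name -> repeat count (default 1)
    """
    rows = []
    for name, data in datasets.items():
        rows.extend(data * repeats.get(name, 1))
    rng = random.Random(seed)
    rng.shuffle(rows)
    return rows

# Toy illustration of the MMLU x3 / GSM8K x4 oversampling
# (row counts here are made up, not the real ~1.07M-row mix):
toy = {
    "smoltalk": [{"src": "smoltalk", "i": i} for i in range(5)],
    "mmlu":     [{"src": "mmlu", "i": i} for i in range(2)],
    "gsm8k":    [{"src": "gsm8k", "i": i} for i in range(2)],
}
mix = build_sft_mixture(toy, {"mmlu": 3, "gsm8k": 4}, seed=0)
counts = {s: sum(r["src"] == s for r in mix) for s in toy}
# counts -> {"smoltalk": 5, "mmlu": 6, "gsm8k": 8}
```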

Checkpoint chain: pretrain (MLflow) → SFT (MLflow) → RL (MLflow)
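Stage 2 (planned) is GRPO on GSM8K, whose core step is normalizing each sampled completion's reward against its own group. A minimal sketch of that group-relative advantage, assuming binary correctness rewards; this is the generic formula, not code from this repo:

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) over the
    group of completions sampled for one prompt. With binary
    correctness rewards, correct samples in a mixed group get a
    positive advantage and incorrect ones a negative advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one GSM8K problem, two correct:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# adv ~ [1, -1, -1, 1]
```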


Results

SFT (completed 2026-03-31)

  • 88 optimizer steps, 1.24 min training on 1×H100
  • val_bpb: 0.854 (pretrain) → 0.5714 (SFT)
  • Cost: $1.79
  • Checkpoint: MLflow run 2bffbfabb3024458ae2778a12f320163
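The val_bpb numbers above are bits per byte, which converts the model's mean per-token cross-entropy (in nats) into bits and normalizes by raw byte count rather than tokenizer-dependent token count. A sketch of the conversion; the loss and bytes-per-token figures below are hypothetical, chosen only to land near the reported 0.5714:

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean per-token cross-entropy (nats) to bits per byte:
    total bits = loss * tokens / ln 2, divided by the raw byte count."""
    total_bits = mean_loss_nats * total_tokens / math.log(2)
    return total_bits / total_bytes

# Hypothetical numbers (not from this run): mean loss 1.85 nats/token,
# ~4.67 bytes per token on the validation text.
bpb = bits_per_byte(1.85, total_tokens=1_000_000, total_bytes=4_670_000)
# bpb ~ 0.57
```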
