Project
nanochat d12 post-training (SFT + RL)
Post-training pipeline for the d12 reference model: SFT → RL → eval.
Base model: d12 H100 SSSL final from d12 baseline project (val_bpb=0.854, step 2205)
Stages:
- SFT — teach conversation format, tool use, multiple-choice, math (SmolTalk + MMLU×3 + GSM8K×4 + Identity + Spelling = ~1.07M rows)
- RL — GRPO on GSM8K (planned)
- Eval — HumanEval, MMLU, ARC, GSM8K (planned)
Checkpoint chain: pretrain (MLflow) → SFT (MLflow) → RL (MLflow)
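The SFT mixture above combines several sources with some oversampled (reading ×3/×4 as repeat factors, which is an assumption). A minimal sketch of that assembly; the loader, source names, and toy rows here are hypothetical stand-ins, not the actual pipeline code:

```python
import random

def build_sft_mixture(datasets: dict, repeats: dict, seed: int = 0) -> list:
    """Concatenate SFT sources, repeating oversampled ones, then shuffle.

    datasets: source name -> list of conversation rows.
    repeats:  source name -> integer oversampling factor (default 1).
    """
    rows = []
    for name, data in datasets.items():
        rows.extend(data * repeats.get(name, 1))
    random.Random(seed).shuffle(rows)  # deterministic shuffle for reproducibility
    return rows

# Toy stand-ins for the real sources (SmolTalk, MMLU, GSM8K, ...).
mixture = build_sft_mixture(
    {"smoltalk": ["s1", "s2"], "mmlu": ["m1"], "gsm8k": ["g1"]},
    repeats={"mmlu": 3, "gsm8k": 4},
)
print(len(mixture))  # 2 + 1*3 + 1*4 = 9
```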
Results
SFT (completed 2026-03-31)
- 88 optimizer steps, 1.24 min training on 1×H100
- val_bpb: 0.854 (pretrain) → 0.5714 (SFT)
- Cost: $1.79
- Checkpoint: MLflow run 2bffbfabb3024458ae2778a12f320163
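For the planned RL stage: GRPO scores each completion relative to the other completions sampled for the same prompt, using the group's reward mean and std as the baseline instead of a learned value function. A minimal sketch of that advantage computation (the binary correctness rewards in the example are illustrative, not real GSM8K results):

```python
import math

def grpo_advantages(rewards, eps: float = 1e-8):
    """Group-relative advantages: (r - mean) / (std + eps) over one
    prompt's group of sampled completions, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled solutions for one GSM8K prompt, reward = 1 if the
# final answer is correct, else 0. Correct samples get positive advantage.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Advantages within a group always sum to zero, so updates push probability mass from below-average completions toward above-average ones.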