nanochat d12 post-training (SFT + RL)
UUID:090dee33-f032-4b1a-957e-daa2a4674904
Post-training pipeline for the d12 reference model: SFT → RL → eval.
Base model: d12 H100 SSSL final from d12 baseline project (val_bpb=0.854, step 2205)
Stages:
- SFT — teach conversation format, tool use, multiple-choice, math (SmolTalk + MMLU×3 + GSM8K×4 + Identity + Spelling = ~1.07M rows)
- RL — GRPO on GSM8K (blocked — see below)
- Eval — ARC, MMLU, GSM8K, HumanEval, SpellingBee (partial results)
Checkpoint chain: pretrain (MLflow) → SFT (MLflow) → RL (MLflow)
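The SFT mixture above (SmolTalk + MMLU×3 + GSM8K×4 + Identity + Spelling) can be sketched as a simple concatenate-with-multipliers step. This is an illustrative helper, not nanochat's actual data-loading code; the loader names and row contents are placeholders.

```python
def build_sft_mixture(datasets: dict[str, list], multipliers: dict[str, int]) -> list:
    """Concatenate datasets, repeating each by its multiplier
    (e.g. MMLU x3, GSM8K x4 in the mix above)."""
    rows = []
    for name, data in datasets.items():
        rows.extend(data * multipliers.get(name, 1))  # default multiplier is 1
    return rows

# Toy example with placeholder rows; the real loaders are nanochat's own.
mix = build_sft_mixture(
    {"smoltalk": ["s1", "s2"], "mmlu": ["m1"], "gsm8k": ["g1"]},
    {"mmlu": 3, "gsm8k": 4},
)
# 2 SmolTalk rows + 1 MMLU row x3 + 1 GSM8K row x4 = 9 rows
```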
Results
SFT (completed 2026-03-31)
- 88 optimizer steps, 1.24 min training on 1×H100
- val_bpb: 0.854 (pretrain) → 0.5714 (SFT)
- Cost: $1.79
- Checkpoint: MLflow run 2bffbfabb3024458ae2778a12f320163
Chat Eval — partial (2026-04-01)
Ran on the SFT checkpoint, 1×H100. Timed out at 1200s — only the categorical tasks completed; the generative tasks (GSM8K, HumanEval, SpellingBee) did not finish.
| Task | Accuracy | Baseline |
|---|---|---|
| ARC-Easy | 25.42% | 25% |
| ARC-Challenge | 26.96% | 25% |
| MMLU | 27.95% | 25% |
| GSM8K | — (timeout) | 0% |
| HumanEval | — (timeout) | 0% |
| SpellingBee | — (timeout) | 0% |
Categorical scores near random baseline — expected for a 135M param model. Cost: $4.42.
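Whether scores like 25.42% are distinguishable from the 25% chance level depends on the eval split size. A quick normal-approximation check (the n below is a placeholder, not the actual split size):

```python
import math

def z_vs_chance(acc: float, n: int, chance: float = 0.25) -> float:
    """Normal-approximation z-score of observed accuracy against chance level.
    Useful for judging whether e.g. 25.42% is really above 25%."""
    se = math.sqrt(chance * (1 - chance) / n)  # standard error under H0: acc == chance
    return (acc - chance) / se

# n=1000 is a placeholder; substitute the actual eval split size.
z = z_vs_chance(0.2542, 1000)  # ~0.31 at this n: indistinguishable from chance
```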
RL Probe — failed (2026-04-01)
Attempted 15-min probe to collect step timing data. Timed out at 900s with 0 training steps completed — startup (tokenizer training 60s + checkpoint download 10s + model load) plus initial eval consumed the entire budget.
Cost: $3.34.
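The failure mode above is a fixed-cost problem, and a simple budget check makes it concrete. The 60s tokenizer and 10s download figures are from the run log; the model-load, initial-eval, and per-step durations below are guesses.

```python
def steps_in_budget(total_s: float, fixed_s: float, step_s: float) -> int:
    """RL steps that fit in the time budget after fixed startup costs are paid."""
    return max(0, int((total_s - fixed_s) // step_s))

# 60s tokenizer + 10s download are from the log; model-load (120s),
# initial-eval (700s), and 60s/step are illustrative guesses.
fixed = 60 + 10 + 120 + 700
steps = steps_in_budget(900, fixed, 60)  # 0 steps, matching the observed failure
```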
Lessons Learned
- Tokenizer retrain is wasteful — 60s every run for deterministic output. Should upload tokenizer to MLflow alongside checkpoint.
- MLPatron dryrun model doesn't fit RL/eval — RL needs sampling (slow), eval has no training-length param. skip-dryrun + max_time_seconds is the workaround, but timeout estimation is tricky. Filed feature request (feedback 110b168c).
- num_iterations in chat_sft.py counts dataloader yields, not optimizer steps — with grad_accum=8, actual optimizer steps ≈ yields / 9.
- 1×H100 vs 8×H100 for RL — upstream uses 8 GPUs. With 1 GPU, each RL step does 8× more work (examples_per_rank=16 vs 2). Full epoch likely needs 2-3 hours on 1×H100.
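The yields-to-steps ratio can be made concrete. The "+1 yield per optimizer step" below is a hypothesis inferred from the observed ÷9 ratio at grad_accum=8, not verified against chat_sft.py:

```python
def approx_optimizer_steps(num_iterations: int, grad_accum: int = 8) -> int:
    """num_iterations counts dataloader yields; the observed ratio suggests each
    optimizer step consumes grad_accum + 1 yields, hence the '/ 9' rule of thumb."""
    return num_iterations // (grad_accum + 1)

# Under this model, the 88-step SFT run corresponds to ~792 yields.
assert approx_optimizer_steps(792) == 88
```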
Next Steps
- Upload tokenizer to MLflow in pretrain/SFT artifacts — eliminates 60s+800MB overhead per downstream run.
- Re-run eval with longer timeout (3600s) or run tasks individually to avoid single-run timeout.
- RL: skip step-0 eval (eval_every > num_steps) and use longer timeout (1800-3600s). Or consider using 8×H100 to match upstream.
- Update awesome-mlpatron-presets with eval and RL entry points once baseline runs complete successfully.
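Attaching the tokenizer to the existing pretrain/SFT run could look like the sketch below. The mlflow calls (start_run resuming by run_id, log_artifacts, runs:/ URIs) are the real MLflow API; the directory layout and artifact path are assumptions.

```python
TOKENIZER_ARTIFACT = "tokenizer"

def log_tokenizer(run_id: str, tokenizer_dir: str) -> None:
    """Attach the trained tokenizer to an existing MLflow run so downstream
    SFT/RL/eval jobs download it instead of retraining (saves 60s per run)."""
    import mlflow  # deferred import: the URI helper below works without mlflow installed
    with mlflow.start_run(run_id=run_id):
        mlflow.log_artifacts(tokenizer_dir, artifact_path=TOKENIZER_ARTIFACT)

def tokenizer_uri(run_id: str) -> str:
    """Artifact URI a downstream run would resolve to fetch the tokenizer."""
    return f"runs:/{run_id}/{TOKENIZER_ARTIFACT}"
```

Downstream runs can then fetch it with mlflow.artifacts.download_artifacts on that URI instead of retraining from scratch.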