nanochat d12 post-training (SFT + RL)

Tao Lin

Post-training pipeline for the d12 reference model: SFT → RL → eval.

Base model: the d12 H100 SSSL final checkpoint from the d12 baseline project (val_bpb = 0.854 at step 2205)

Stages:

  1. SFT — teach conversation format, tool use, multiple-choice, math (SmolTalk + MMLU×3 + GSM8K×4 + Identity + Spelling = ~1.07M rows)
  2. RL — GRPO on GSM8K (planned)
  3. Eval — HumanEval, MMLU, ARC, GSM8K (planned)
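The SFT mixture above oversamples the small datasets (MMLU×3, GSM8K×4) relative to SmolTalk. A minimal sketch of that kind of integer-repeat mixing; the function name and toy dataset sizes are assumptions, not the project's actual code:

```python
import random

def build_sft_mixture(datasets, repeats, seed=0):
    """Oversample datasets by integer repetition, then shuffle.

    datasets: dict name -> list of rows
    repeats:  dict name -> repeat count (default 1)
    """
    rows = []
    for name, data in datasets.items():
        rows.extend(data * repeats.get(name, 1))
    rng = random.Random(seed)
    rng.shuffle(rows)
    return rows

# Toy illustration of the MMLU x3 / GSM8K x4 oversampling
# (row counts here are made up, not the real ~1.07M-row mix):
toy = {
    "smoltalk": [{"src": "smoltalk", "i": i} for i in range(5)],
    "mmlu":     [{"src": "mmlu", "i": i} for i in range(2)],
    "gsm8k":    [{"src": "gsm8k", "i": i} for i in range(2)],
}
mix = build_sft_mixture(toy, {"mmlu": 3, "gsm8k": 4}, seed=0)
counts = {s: sum(r["src"] == s for r in mix) for s in toy}
# counts -> {"smoltalk": 5, "mmlu": 6, "gsm8k": 8}
```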

Checkpoint chain: pretrain (MLflow) → SFT (MLflow) → RL (MLflow)
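Stage 2 (planned) is GRPO on GSM8K, whose core step is normalizing each sampled completion's reward against its own group. A minimal sketch of that group-relative advantage, assuming binary correctness rewards; this is the generic formula, not code from this repo:

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) over the
    group of completions sampled for one prompt. With binary
    correctness rewards, correct samples in a mixed group get a
    positive advantage and incorrect ones a negative advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one GSM8K problem, two correct:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# adv ~ [1, -1, -1, 1]
```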


Results

SFT (completed 2026-03-31)

  • 88 optimizer steps, 1.24 min training on 1×H100
  • val_bpb: 0.854 (pretrain) → 0.5714 (SFT)
  • Cost: $1.79
  • Checkpoint: MLflow run 2bffbfabb3024458ae2778a12f320163
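The val_bpb numbers above are bits per byte, which converts the model's mean per-token cross-entropy (in nats) into bits and normalizes by raw byte count rather than tokenizer-dependent token count. A sketch of the conversion; the loss and bytes-per-token figures below are hypothetical, chosen only to land near the reported 0.5714:

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean per-token cross-entropy (nats) to bits per byte:
    total bits = loss * tokens / ln 2, divided by the raw byte count."""
    total_bits = mean_loss_nats * total_tokens / math.log(2)
    return total_bits / total_bytes

# Hypothetical numbers (not from this run): mean loss 1.85 nats/token,
# ~4.67 bytes per token on the validation text.
bpb = bits_per_byte(1.85, total_tokens=1_000_000, total_bytes=4_670_000)
# bpb ~ 0.57
```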
