Project

nanochat d12 post-training (SFT + RL)

UUID: 090dee33-f032-4b1a-957e-daa2a4674904
Tao Lin via 🐎 pentium · Created 45 days ago · Updated 43 days ago

nanochat d12 post-training

Post-training pipeline for the d12 reference model: SFT → RL → eval.

Base model: d12 H100 SSSL final from d12 baseline project (val_bpb=0.854, step 2205)

Stages:

  1. SFT — teach conversation format, tool use, multiple-choice, math (SmolTalk + MMLU×3 + GSM8K×4 + Identity + Spelling = ~1.07M rows)
  2. RL — GRPO on GSM8K (blocked — see below)
  3. Eval — ARC, MMLU, GSM8K, HumanEval, SpellingBee (partial results)

Checkpoint chain: pretrain (MLflow) → SFT (MLflow) → RL (MLflow)


Results

SFT (completed 2026-03-31)

  • 88 optimizer steps, 1.24 min training on 1×H100
  • val_bpb: 0.854 (pretrain) → 0.5714 (SFT)
  • Cost: $1.79
  • Checkpoint: MLflow run 2bffbfabb3024458ae2778a12f320163

Chat Eval — partial (2026-04-01)

Ran on the SFT checkpoint, 1×H100. Timed out at 1200 s: only the categorical tasks completed; the generative tasks (GSM8K, HumanEval, SpellingBee) did not finish.

| Task | Accuracy | Baseline |
| --- | --- | --- |
| ARC-Easy | 25.42% | 25% |
| ARC-Challenge | 26.96% | 25% |
| MMLU | 27.95% | 25% |
| GSM8K | — (timeout) | 0% |
| HumanEval | — (timeout) | 0% |
| SpellingBee | — (timeout) | 0% |

Categorical scores near random baseline — expected for a 135M param model. Cost: $4.42.

RL Probe — failed (2026-04-01)

Attempted a 15-min probe to collect step-timing data. It timed out at 900 s with 0 training steps completed: startup (60 s tokenizer training + 10 s checkpoint download + model load) plus the initial eval consumed the entire budget.

Cost: $3.34.
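
The timeout math above can be sanity-checked with a quick budget sketch. Only the 900 s timeout, 60 s tokenizer retrain, and 10 s checkpoint download come from the log; the model-load and initial-eval durations are assumptions for illustration:

```python
# Startup-overhead budget for the 15-min RL probe.
# Known from the log: 900 s timeout, 60 s tokenizer retrain, 10 s download.
# ASSUMED for illustration: model-load and initial-eval durations.
TIMEOUT_S = 900

startup = {
    "tokenizer_train": 60,   # deterministic retrain (see Lessons Learned #1)
    "ckpt_download": 10,
    "model_load": 120,       # assumption
    "initial_eval": 750,     # assumption: the step-0 eval dominates
}

remaining = TIMEOUT_S - sum(startup.values())
print(f"budget left for RL steps: {remaining} s")
# remaining <= 0 is consistent with zero training steps completing
```

Under these assumed durations the budget goes negative before the first RL step, which matches the observed outcome; skipping the step-0 eval (Next Steps #3) recovers most of it.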


Lessons Learned

  1. Tokenizer retraining is wasteful: 60 s on every run to reproduce a deterministic output. The tokenizer should be uploaded to MLflow alongside the checkpoint.
  2. MLPatron's dryrun model doesn't fit RL/eval: RL needs sampling (slow), and eval has no training-length parameter. skip-dryrun + max_time_seconds is the workaround, but timeout estimation is tricky. Filed a feature request (feedback 110b168c).
  3. num_iterations in chat_sft.py counts dataloader yields, not optimizer steps: with grad_accum=8, actual optimizer steps ≈ yields / 9.
  4. 1×H100 vs 8×H100 for RL: upstream uses 8 GPUs, so with 1 GPU each RL step does 8× more work (examples_per_rank=16 vs 2). A full epoch likely needs 2-3 hours on 1×H100.
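
Lesson 3 is easy to trip over, so a tiny helper makes the conversion explicit. Only the yields/9 ratio at grad_accum=8 is from the run; attributing the extra yield to one per accumulation cycle is a guess about where the off-by-one comes from:

```python
def optimizer_steps(num_iterations: int, grad_accum: int = 8) -> int:
    """Convert chat_sft.py's num_iterations (dataloader yields) into
    optimizer steps. Empirically steps ~= yields / (grad_accum + 1) here;
    one extra yield per accumulation cycle is an assumption."""
    return num_iterations // (grad_accum + 1)

# 792 yields / 9 -> 88 optimizer steps, the count reported for the SFT run
print(optimizer_steps(792))
```

Requesting a target number of optimizer steps therefore means multiplying by 9 before setting num_iterations, not by grad_accum=8.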

Next Steps

  1. Upload tokenizer to MLflow in pretrain/SFT artifacts — eliminates 60s+800MB overhead per downstream run.
  2. Re-run eval with longer timeout (3600s) or run tasks individually to avoid single-run timeout.
  3. RL: skip step-0 eval (eval_every > num_steps) and use longer timeout (1800-3600s). Or consider using 8×H100 to match upstream.
  4. Update awesome-mlpatron-presets with eval and RL entry points once baseline runs complete successfully.
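
For step 3, the re-run might look something like the following. The entry-point name and exact flag spellings are assumptions; only skip-dryrun, max_time_seconds, and eval_every are taken from the notes above:

```shell
# Hypothetical RL probe re-run; entry point and flag spellings are guesses.
# eval_every is set above num_steps so the step-0 eval never fires, and the
# timeout is doubled from 900 s to 1800 s.
python -m scripts.chat_rl \
  --skip-dryrun \
  --max-time-seconds 1800 \
  --eval-every 100000
```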

Experiment Runs
