Project
nanochat d12 baseline (reference model)
Goal
Find the cheapest, fastest way to train d12 (nanochat's reference model, roughly GPT-1 sized) at its Chinchilla-optimal token budget, using upstream defaults as much as possible.
Why d12?
d12 is nanochat's reference model: all hyperparameters are tuned at this scale, and the community uses it as the standard baseline for comparing experiments. val_bpb results at d12 are directly comparable across setups.
d12 Specs
- 768 model dim, 12 layers, 6 heads, ~135M total params
- 110M scaling params, Chinchilla-optimal: 1156M tokens, 2205 iterations
- total_batch_size = 524,288 (B_REF, measured empirically)
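The iteration count is just the token budget divided by the tokens per step; checking the arithmetic:

```python
# Sanity check: iterations = Chinchilla-optimal tokens / tokens per step.
TOTAL_BATCH_SIZE = 524_288           # tokens per optimizer step (B_REF)
CHINCHILLA_TOKENS = 1_156_000_000    # ~1156M tokens for 110M scaling params

print(round(CHINCHILLA_TOKENS / TOTAL_BATCH_SIZE))  # -> 2205
```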
Final Result: d12 H100 SSSL
| Metric | Value |
|---|---|
| val_bpb | 0.854 |
| CORE metric | 0.140 (~3min, 22 tasks) |
| Training | 35.1 min |
| Wall | 52.2 min (includes a redundant CORE eval at step 2000; the fixed config brings this to ~47 min) |
| MFU | 41.9% |
| Cost | $11.32 (on-demand) |
| Checkpoint | 793 MB model + 1246 MB optimizer state, uploaded |
| Per val eval | ~80 s |
SSSL vs L: val_bpb is nearly identical (0.854 vs 0.853). SSSL has lower MFU (42% vs 46%) but a faster step time (0.96 s vs 1.0 s).
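A rough cross-check of the eval overhead implied by the numbers above (assumes a val eval at step 0 and at the final step, which matches the curve below):

```python
# Estimate non-training wall time from the reported eval costs.
val_evals = 2205 // 250 + 2          # steps 250..2000 (8), plus step 0 and step 2205
core_runs = 2                        # redundant CORE at step 2000, plus final step
overhead_min = (val_evals * 80 + core_runs * 3 * 60) / 60
print(f"{overhead_min:.1f} min")     # -> 19.3 min vs. the observed 52.2 - 35.1 = 17.1 min gap
```

Close enough to confirm that evals account for essentially all of the non-training wall time.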
val_bpb curve (H100 SSSL)
| Step | val_bpb |
|---|---|
| 0 | 3.170 |
| 250 | 1.090 |
| 750 | 0.970 |
| 1250 | 0.915 |
| 1500 | 0.894 |
| 1750 | 0.876 |
| 2000 | 0.861 |
| 2205 | 0.854 |
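For reference, val_bpb is bits per byte: the summed cross-entropy over the validation tokens, converted from nats to bits and normalized by the byte length of the underlying text, which makes it tokenizer-independent. A minimal sketch with hypothetical figures:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: float) -> float:
    """Summed cross-entropy in nats over the val tokens, converted to
    bits and normalized by the byte length of the underlying text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Hypothetical figures for illustration: a mean loss of ~2.4 nats/token
# at ~4.05 bytes/token gives bits_per_byte(2.4, 4.05) ~= 0.85 bpb.
```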
CORE metric breakdown (step 2205, selected tasks)
| Task | Accuracy | Centered |
|---|---|---|
| hellaswag_zeroshot | 37.8% | +0.171 |
| arc_easy | 55.4% | +0.405 |
| piqa | 68.8% | +0.376 |
| lambada_openai | 28.0% | +0.280 |
| bigbench_cs_algorithms | 44.4% | +0.444 |
| bigbench_qa_wikidata | 29.8% | +0.298 |
| winograd | 59.0% | +0.180 |
| boolq | 51.8% | -0.268 |
| Average (CORE, all 22 tasks) | | 0.140 |
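The Centered column rescales raw accuracy against each task's chance (or majority-class) baseline, so 0 means baseline performance and 1 means perfect. A minimal sketch that reproduces the values above; the per-task baselines are inferred from the table (the ~0.62 boolq baseline in particular is an assumption, not from the source):

```python
# Per-task baselines: 0.25 for 4-way multiple choice, 0.5 for 2-way, 0 for
# open-ended generation; boolq's ~0.62 majority-class baseline is inferred.
BASELINES = {
    "hellaswag_zeroshot": 0.25, "arc_easy": 0.25,
    "piqa": 0.50, "winograd": 0.50,
    "lambada_openai": 0.0, "bigbench_cs_algorithms": 0.0,
    "bigbench_qa_wikidata": 0.0, "boolq": 0.62,
}

def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so that baseline -> 0 and perfect -> 1."""
    return (acc - baseline) / (1.0 - baseline)

assert round(centered(0.378, 0.25), 3) == 0.171    # hellaswag_zeroshot
assert round(centered(0.518, 0.62), 3) == -0.268   # boolq
```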
Sample output (step 2205)
- "The capital of France is Paris" ok
- "The chemical symbol of gold is Au" ok
- "The opposite of hot is cold" ok
- "If yesterday was Friday, then tomorrow will be Friday" wrong (should be Sunday)
- "If 5*x + 3 = 13, then x is 13" wrong (should be 2)
MLproject Config (entry point: h100_d12)
depth=12, num_iterations=2205, device_batch_size=32, max_seq_len=2048, num_data_shards=28, eval_every=250, core_metric_every=2500, sample_every=2500, upload_checkpoint=0 (default off; preemption always uploads), window_pattern=SSSL
Since core_metric_every=2500 exceeds the 2205 total iterations, CORE runs only at the final step (saves ~5 min versus also running at step 2000). num_data_shards was increased from 22 to 28: with 22 shards the run was confirmed to wrap into a second epoch. upload_checkpoint is off by default (the checkpoint is a ~2 GB upload); the preemption path always uploads.
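A sketch of launching this entry point programmatically through the MLflow Projects API, assuming the MLproject file sits at the repo root and declares these parameter names as shown:

```python
import mlflow

# Launch the h100_d12 entry point with the overrides above
# (hypothetical local project URI ".").
mlflow.projects.run(
    uri=".",
    entry_point="h100_d12",
    parameters={
        "depth": 12, "num_iterations": 2205,
        "device_batch_size": 32, "max_seq_len": 2048,
        "num_data_shards": 28, "eval_every": 250,
        "core_metric_every": 2500, "sample_every": 2500,
        "upload_checkpoint": 0, "window_pattern": "SSSL",
    },
)
```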
Earlier Results
H100 80GB L pattern (first run)
- val_bpb: 0.853, MFU: 46.3%, Wall: 49.7 min, Cost: $10.79
- No CORE metric, no SSSL; superseded by the SSSL run above
A100 40GB (dry run only)
- batch32 fits (28.8 GB peak, MFU 59%)
- 2.3x slower than H100 at the same cost, so d12 runs on H100 only
Cost Analysis
| | A100 on-demand | H100 on-demand | H100 spot |
|---|---|---|---|
| Wall time | ~134 min | ~47 min | ~47 min |
| Cost | $9.63 | $10.20 | ~$3.78 |
| Upstream defaults? | No (must use L) | Yes (SSSL+FA3) | Yes |
Decision: d12 on H100 only. A100 has no advantage.
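The hourly rates implied by the table (derived from the cost and wall-time cells above; handy for re-pricing against current provider rates):

```python
# Hourly rates implied by the (cost, wall minutes) pairs in the table above.
runs = {
    "A100 on-demand": (9.63, 134),
    "H100 on-demand": (10.20, 47),
    "H100 spot": (3.78, 47),
}
for name, (cost_usd, wall_min) in runs.items():
    print(f"{name}: ~${cost_usd / (wall_min / 60):.2f}/hr")
# -> ~$4.31/hr, ~$13.02/hr, ~$4.83/hr
```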
GPU Options (reference)
| GPU | Grad accum | Train time | Total wall | Cost (spot) |
|---|---|---|---|---|
| A100 40GB batch32 | 8 accum | 99min | ~134min | ~$2.50 |
| H100 80GB batch32 | 8 accum | 32min | ~47min | ~$2.50 |
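A back-of-envelope check on the H100 MFU figure using the 6N FLOPs-per-token approximation; this ignores attention FLOPs and assumes an H100 SXM bf16 dense peak of 989 TFLOPS, so it should land slightly under the reported 41.9%:

```python
# 6N approximation: training FLOPs ~= 6 * params * tokens (attention ignored).
N = 110e6                      # scaling params
tokens = 1.156e9               # Chinchilla-optimal token budget
train_s = 32 * 60              # 32 min train time from the table
peak = 989e12                  # assumed H100 SXM bf16 dense peak, FLOP/s

mfu = (6 * N * tokens) / train_s / peak
print(f"{mfu:.1%}")            # -> 40.2%, consistent with the reported low-40s MFU
```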