
nanochat d12 Baseline: Reference Model

Goal

Find the cheapest, fastest way to run d12 (nanochat's reference model, roughly GPT-1 sized) at the Chinchilla-optimal token budget, sticking to upstream defaults as much as possible.

Why d12?

d12 is nanochat's reference model: all hyperparameters are tuned at this scale, and the community uses it as the standard baseline for comparing experiments. val_bpb results at d12 are directly comparable across setups.

d12 Specs

  • 768 model dim, 12 layers, 6 heads, ~135M total params
  • 110M scaling params; Chinchilla-optimal budget: 1156M tokens, i.e. 2205 iterations (checked in the sketch below)
  • total_batch_size = 524,288 tokens (B_REF, measured empirically)
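
A quick sanity check of those numbers (a back-of-the-envelope script; variable names are ours):

```python
# Sanity-check the d12 token budget from the spec list above.
total_batch_size = 524_288      # tokens per optimizer step (B_REF)
num_iterations   = 2_205        # steps to the Chinchilla-optimal budget
scaling_params   = 110e6        # parameter count used for the scaling law

tokens = total_batch_size * num_iterations
print(f"{tokens / 1e6:.0f}M tokens")                 # ~1156M, matches the spec
print(f"{tokens / scaling_params:.1f} tokens/param") # ~10.5 at this scale
```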

Final Result: d12 H100 SSSL

| Metric      | Value |
| ----------- | ----- |
| val_bpb     | 0.854 |
| CORE metric | 0.140 (~3 min, 22 tasks) |
| Training    | 35.1 min |
| Wall        | 52.2 min (includes a redundant CORE eval at step 2000; ~47 min after the fix) |
| MFU         | 41.9% |
| Cost        | $11.32 (on-demand) |
| Checkpoint  | model 793 MB + optim 1246 MB uploaded |
| Each eval   | ~80 s |

SSSL vs L: val_bpb is nearly identical (0.854 vs 0.853). MFU is lower with SSSL (42% vs 46%), but step time is faster (0.96 s vs 1.0 s).
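
For orientation, window_pattern=SSSL (see the config below) assigns each layer either a short sliding-window or a long attention span. The sketch below only illustrates how such a pattern string could be tiled across the 12 layers; the helper and the window sizes are our assumptions, not nanochat's actual code:

```python
# Illustrative only: tile a window-pattern string (S = short sliding
# window, L = long/full context) across the model's layers. The window
# sizes here are hypothetical placeholders, not nanochat's real values.
def layer_windows(pattern: str, n_layers: int, short: int = 512, long_: int = 2048):
    sizes = {"S": short, "L": long_}
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layers)]

print(layer_windows("SSSL", 12))
# [512, 512, 512, 2048, 512, 512, 512, 2048, 512, 512, 512, 2048]
```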

val_bpb curve (H100 SSSL)

| Step | val_bpb |
| ---- | ------- |
| 0    | 3.170 |
| 250  | 1.090 |
| 750  | 0.970 |
| 1250 | 0.915 |
| 1500 | 0.894 |
| 1750 | 0.876 |
| 2000 | 0.861 |
| 2205 | 0.854 |
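
For reference, val_bpb is bits per byte: the validation negative log-likelihood converted to bits and normalized by raw byte count, which makes it tokenizer-independent. A minimal sketch of the standard conversion (the example numbers are ours):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Total validation NLL in nats, converted to bits per raw byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical example: a 10 MB validation set whose summed NLL is
# 5.92e6 nats lands at the final d12 value.
print(f"{bits_per_byte(5.92e6, 10_000_000):.3f} bpb")  # ~0.854
```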

CORE metric breakdown (step 2205)

| Task | Accuracy | Centered |
| ---- | -------- | -------- |
| hellaswag_zeroshot     | 37.8% | +0.171 |
| arc_easy               | 55.4% | +0.405 |
| piqa                   | 68.8% | +0.376 |
| lambada_openai         | 28.0% | +0.280 |
| bigbench_cs_algorithms | 44.4% | +0.444 |
| bigbench_qa_wikidata   | 29.8% | +0.298 |
| winograd               | 59.0% | +0.180 |
| boolq                  | 51.8% | -0.268 |
| Average (CORE, all 22 tasks) | | 0.140 |
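
The Centered column rescales raw accuracy so the task baseline maps to 0 and perfect accuracy to 1: centered = (acc - b) / (1 - b). The baselines below are backed out from the table itself (chance level for multiple choice, a majority-class baseline for boolq) and should be treated as illustrative:

```python
# Centered accuracy: (acc - baseline) / (1 - baseline), so the task's
# chance/majority baseline maps to 0 and perfect accuracy maps to 1.
# Baselines are backed out from the table above; treat as illustrative.
tasks = {
    "hellaswag_zeroshot":     (0.378, 0.25),  # 4-way multiple choice
    "arc_easy":               (0.554, 0.25),
    "piqa":                   (0.688, 0.50),  # 2-way
    "lambada_openai":         (0.280, 0.00),  # open-ended completion
    "bigbench_cs_algorithms": (0.444, 0.00),
    "bigbench_qa_wikidata":   (0.298, 0.00),
    "winograd":               (0.590, 0.50),
    "boolq":                  (0.518, 0.62),  # majority-class baseline
}
for name, (acc, base) in tasks.items():
    print(f"{name:24s} {(acc - base) / (1 - base):+.3f}")
# CORE (0.140) averages centered accuracy over all 22 tasks,
# of which only these 8 are shown in the table.
```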

Sample output (step 2205)

  • "The capital of France is Paris" ok
  • "The chemical symbol of gold is Au" ok
  • "The opposite of hot is cold" ok
  • "If yesterday was Friday, then tomorrow will be Friday" wrong (should be Sunday)
  • "If 5*x + 3 = 13, then x is 13" wrong (should be 2)

MLproject Config (entry point: h100_d12)

depth=12, num_iterations=2205, device_batch_size=32, max_seq_len=2048, num_data_shards=28, eval_every=250, core_metric_every=2500, sample_every=2500, upload_checkpoint=0 (default off; preemption always uploads), window_pattern=SSSL
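
Assuming a standard MLflow MLproject layout (which the entry-point naming suggests), the run could be launched programmatically like this; the local "." project URI is an assumption:

```python
import mlflow

# Launch the h100_d12 entry point with the parameters listed above.
# The "." project URI is an assumption about where MLproject lives.
mlflow.projects.run(
    uri=".",
    entry_point="h100_d12",
    parameters={
        "depth": 12,
        "num_iterations": 2205,
        "device_batch_size": 32,
        "max_seq_len": 2048,
        "num_data_shards": 28,
        "eval_every": 250,
        "core_metric_every": 2500,
        "sample_every": 2500,
        "upload_checkpoint": 0,
        "window_pattern": "SSSL",
    },
)
```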

Since core_metric_every=2500 exceeds the 2205 iterations, the CORE eval only triggers at the last step (saves ~5 min vs the previous every-2000-steps setting). num_data_shards was increased from 22 to 28 after confirming that 22 shards caused the run to wrap into a second epoch. upload_checkpoint is off by default (the upload is ~2 GB); the preemption path always uploads.
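
The scheduling trick works because periodic evals are a modulus check plus a forced eval at the final step, so any interval larger than num_iterations fires exactly once. A minimal sketch (not nanochat's actual loop):

```python
# Illustrative eval schedule, not nanochat's actual loop: run the CORE
# eval every `every` steps and always at the final step.
def runs_core_eval(step: int, num_iterations: int, every: int) -> bool:
    return step == num_iterations or (step > 0 and step % every == 0)

fires = [s for s in range(2206) if runs_core_eval(s, 2205, 2500)]
print(fires)  # [2205] -- fires only once, at the last step
```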

Earlier Results

H100 80GB L pattern (first run)

  • val_bpb: 0.853, MFU: 46.3%, Wall: 49.7 min, Cost: $10.79
  • No CORE metric, no SSSL; superseded by the SSSL run above

A100 40GB: Dry run only

  • batch 32 fits (28.8 GB peak, 59% MFU)
  • 2.3x slower than H100 at the same cost, so d12 is H100-only

Cost Analysis

|                    | A100 on-demand  | H100 on-demand  | H100 spot |
| ------------------ | --------------- | --------------- | --------- |
| Wall time          | ~134 min        | ~47 min         | ~47 min   |
| Cost               | $9.63           | $10.20          | ~$3.78    |
| Upstream defaults? | No (must use L) | Yes (SSSL+FA3)  | Yes       |
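
Backing hourly rates out of the table (derived numbers, not quoted prices):

```python
# Implied hourly rates from the wall times and per-run costs above.
runs = {
    "A100 on-demand": (134, 9.63),   # (minutes, dollars)
    "H100 on-demand": (47, 10.20),
    "H100 spot":      (47, 3.78),
}
for name, (minutes, dollars) in runs.items():
    print(f"{name:14s} ~${dollars / (minutes / 60):.2f}/hr")
# A100 ~$4.31/hr, H100 on-demand ~$13.02/hr, H100 spot ~$4.83/hr
```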

Decision: d12 on H100 only. A100 has no advantage.

GPU Options (reference)

| GPU       | Batch             | Train  | Total   | Cost (spot) |
| --------- | ----------------- | ------ | ------- | ----------- |
| A100 40GB | batch 32, 8 accum | 99 min | ~72 min | ~$2.50 |
| H100 80GB | batch 32, 8 accum | 32 min | ~47 min | ~$2.50 |
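
The "8 accum" column follows from the reference batch size, assuming single-GPU training and the usual tokens-per-step accounting:

```python
# Gradient-accumulation steps needed to reach B_REF on one GPU.
total_batch_size  = 524_288   # tokens per optimizer step (B_REF)
device_batch_size = 32        # sequences per micro-batch
max_seq_len       = 2_048     # tokens per sequence

grad_accum = total_batch_size // (device_batch_size * max_seq_len)
print(grad_accum)  # 8
```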
