Project
nanochat d12 baseline (reference model)
Goal
Find the cheapest, fastest way to train d12 (nanochat's reference model, roughly GPT-1 sized) at its Chinchilla-optimal token budget, using upstream defaults as much as possible.
Why d12?
d12 is nanochat's reference model: all hyperparameters are tuned at this scale, and the community uses it as the standard baseline for comparing experiments. val_bpb results at d12 are directly comparable across setups.
d12 Specs
- 768 model dim, 12 layers, 6 heads, ~135M total params
- 110M scaling params, Chinchilla-optimal: 1156M tokens, 2205 iterations
- total_batch_size = 524,288 (B_REF, measured empirically)
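The iteration count is just the token budget divided by the tokens per step; checking the arithmetic:

```python
# Sanity check: iterations = Chinchilla-optimal tokens / tokens per step.
TOTAL_BATCH_SIZE = 524_288           # tokens per optimizer step (B_REF)
CHINCHILLA_TOKENS = 1_156_000_000    # ~1156M tokens for 110M scaling params

print(round(CHINCHILLA_TOKENS / TOTAL_BATCH_SIZE))  # -> 2205
```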
Final Result: d12 H100 SSSL
| Metric | Value |
|---|---|
| val_bpb | 0.854 |
| CORE metric | 0.140 (~3min, 22 tasks) |
| Training | 35.1 min |
| Wall | 52.2 min (includes a redundant CORE eval at step 2000; the fixed config brings this to ~47 min) |
| MFU | 41.9% |
| Cost | $11.32 (on-demand) |
| Checkpoint | 793 MB model + 1246 MB optimizer state, uploaded |
| Per val eval | ~80 s |
SSSL vs L: val_bpb is nearly identical (0.854 vs 0.853). SSSL has lower MFU (42% vs 46%) but a faster step time (0.96 s vs 1.0 s).
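A rough cross-check of the eval overhead implied by the numbers above (assumes a val eval at step 0 and at the final step, which matches the curve below):

```python
# Estimate non-training wall time from the reported eval costs.
val_evals = 2205 // 250 + 2          # steps 250..2000 (8), plus step 0 and step 2205
core_runs = 2                        # redundant CORE at step 2000, plus final step
overhead_min = (val_evals * 80 + core_runs * 3 * 60) / 60
print(f"{overhead_min:.1f} min")     # -> 19.3 min vs. the observed 52.2 - 35.1 = 17.1 min gap
```

Close enough to confirm that evals account for essentially all of the non-training wall time.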
val_bpb curve (H100 SSSL)
| Step | val_bpb |
|---|---|
| 0 | 3.170 |
| 250 | 1.090 |
| 750 | 0.970 |
| 1250 | 0.915 |
| 1500 | 0.894 |
| 1750 | 0.876 |
| 2000 | 0.861 |
| 2205 | 0.854 |
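For reference, val_bpb is bits per byte: the summed cross-entropy over the validation tokens, converted from nats to bits and normalized by the byte length of the underlying text, which makes it tokenizer-independent. A minimal sketch with hypothetical figures:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: float) -> float:
    """Summed cross-entropy in nats over the val tokens, converted to
    bits and normalized by the byte length of the underlying text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Hypothetical figures for illustration: a mean loss of ~2.4 nats/token
# at ~4.05 bytes/token gives bits_per_byte(2.4, 4.05) ~= 0.85 bpb.
```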
CORE metric breakdown (step 2205, selected tasks)
| Task | Accuracy | Centered |
|---|---|---|
| hellaswag_zeroshot | 37.8% | +0.171 |
| arc_easy | 55.4% | +0.405 |
| piqa | 68.8% | +0.376 |
| lambada_openai | 28.0% | +0.280 |
| bigbench_cs_algorithms | 44.4% | +0.444 |
| bigbench_qa_wikidata | 29.8% | +0.298 |
| winograd | 59.0% | +0.180 |
| boolq | 51.8% | -0.268 |
| Average (CORE, all 22 tasks) | | 0.140 |
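The Centered column rescales raw accuracy against each task's chance (or majority-class) baseline, so 0 means baseline performance and 1 means perfect. A minimal sketch that reproduces the values above; the per-task baselines are inferred from the table (the ~0.62 boolq baseline in particular is an assumption, not from the source):

```python
# Per-task baselines: 0.25 for 4-way multiple choice, 0.5 for 2-way, 0 for
# open-ended generation; boolq's ~0.62 majority-class baseline is inferred.
BASELINES = {
    "hellaswag_zeroshot": 0.25, "arc_easy": 0.25,
    "piqa": 0.50, "winograd": 0.50,
    "lambada_openai": 0.0, "bigbench_cs_algorithms": 0.0,
    "bigbench_qa_wikidata": 0.0, "boolq": 0.62,
}

def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so that baseline -> 0 and perfect -> 1."""
    return (acc - baseline) / (1.0 - baseline)

assert round(centered(0.378, 0.25), 3) == 0.171    # hellaswag_zeroshot
assert round(centered(0.518, 0.62), 3) == -0.268   # boolq
```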
Sample output (step 2205)
- "The capital of France is Paris" ok
- "The chemical symbol of gold is Au" ok
- "The opposite of hot is cold" ok
- "If yesterday was Friday, then tomorrow will be Friday" wrong (should be Sunday)
- "If 5*x + 3 = 13, then x is 13" wrong (should be 2)
MLproject Config (entry point: h100_d12)
depth=12, num_iterations=2205, device_batch_size=32, max_seq_len=2048, num_data_shards=28, eval_every=250, core_metric_every=2500, sample_every=2500, upload_checkpoint=0 (default off; preemption always uploads), window_pattern=SSSL
Since core_metric_every=2500 exceeds the 2205 total iterations, CORE runs only at the final step (saves ~5 min versus also running at step 2000). num_data_shards was increased from 22 to 28: with 22 shards the run was confirmed to wrap into a second epoch. upload_checkpoint is off by default (the checkpoint is a ~2 GB upload); the preemption path always uploads.
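A sketch of launching this entry point programmatically through the MLflow Projects API, assuming the MLproject file sits at the repo root and declares these parameter names as shown:

```python
import mlflow

# Launch the h100_d12 entry point with the overrides above
# (hypothetical local project URI ".").
mlflow.projects.run(
    uri=".",
    entry_point="h100_d12",
    parameters={
        "depth": 12, "num_iterations": 2205,
        "device_batch_size": 32, "max_seq_len": 2048,
        "num_data_shards": 28, "eval_every": 250,
        "core_metric_every": 2500, "sample_every": 2500,
        "upload_checkpoint": 0, "window_pattern": "SSSL",
    },
)
```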
Earlier Results
H100 80GB L pattern (first run)
- val_bpb: 0.853, MFU: 46.3%, Wall: 49.7 min, Cost: $10.79
- No CORE metric, no SSSL; superseded by the SSSL run above
A100 40GB (dry run only)
- batch32 fits (28.8 GB peak, MFU 59%)
- 2.3x slower than H100 at the same cost, so d12 runs on H100 only
Cost Analysis
| | A100 on-demand | H100 on-demand | H100 spot |
|---|---|---|---|
| Wall time | ~134 min | ~47 min | ~47 min |
| Cost | $9.63 | $10.20 | ~$3.78 |
| Upstream defaults? | No (must use L) | Yes (SSSL+FA3) | Yes |
Decision: d12 on H100 only. A100 has no advantage.
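The hourly rates implied by the table (derived from the cost and wall-time cells above; handy for re-pricing against current provider rates):

```python
# Hourly rates implied by the (cost, wall minutes) pairs in the table above.
runs = {
    "A100 on-demand": (9.63, 134),
    "H100 on-demand": (10.20, 47),
    "H100 spot": (3.78, 47),
}
for name, (cost_usd, wall_min) in runs.items():
    print(f"{name}: ~${cost_usd / (wall_min / 60):.2f}/hr")
# -> ~$4.31/hr, ~$13.02/hr, ~$4.83/hr
```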
GPU Options (reference)
| GPU | Grad accum | Train time | Total wall | Cost (spot) |
|---|---|---|---|---|
| A100 40GB batch32 | 8 accum | 99min | ~134min | ~$2.50 |
| H100 80GB batch32 | 8 accum | 32min | ~47min | ~$2.50 |
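A back-of-envelope check on the H100 MFU figure using the 6N FLOPs-per-token approximation; this ignores attention FLOPs and assumes an H100 SXM bf16 dense peak of 989 TFLOPS, so it should land slightly under the reported 41.9%:

```python
# 6N approximation: training FLOPs ~= 6 * params * tokens (attention ignored).
N = 110e6                      # scaling params
tokens = 1.156e9               # Chinchilla-optimal token budget
train_s = 32 * 60              # 32 min train time from the table
peak = 989e12                  # assumed H100 SXM bf16 dense peak, FLOP/s

mfu = (6 * N * tokens) / train_s / peak
print(f"{mfu:.1%}")            # -> 40.2%, consistent with the reported low-40s MFU
```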