nanochat Baseline Search - COMPLETED
Final Production Baseline
depth=10, num_iterations=1500, device_batch_size=32
max_seq_len=2048, num_data_shards=18, eval_every=250
core_metric_every=-1, sample_every=-1
upload_checkpoint=0, tok_max_chars=2000000000
dryrun: 20 iterations
Entry point: a100_d10 (renamed from a100_preset)
Final validated result: val_bpb=0.897, wall_time=60.7min, MFU=55%, cost=$4.36
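The same configuration as a runnable reference, with the derived token budget (a minimal sketch; the dict is illustrative, not nanochat code, and the 524,288-token total batch is taken from the analysis below):

```python
# Final production baseline (values from this page).
config = dict(
    depth=10, num_iterations=1500, device_batch_size=32, max_seq_len=2048,
    num_data_shards=18, eval_every=250, core_metric_every=-1, sample_every=-1,
    upload_checkpoint=0, tok_max_chars=2_000_000_000,
)

TOTAL_BATCH_TOKENS = 524_288  # tokens per optimizer step (2**19, see Batch Size below)
grad_accum = TOTAL_BATCH_TOKENS // (config["device_batch_size"] * config["max_seq_len"])
total_tokens = TOTAL_BATCH_TOKENS * config["num_iterations"]
print(grad_accum, f"{total_tokens / 1e6:.0f}M tokens")  # 8 786M tokens
```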
Deep Analysis
Batch Size
- batch32 is optimal (MFU=55%, 21 GB peak)
- batch64 OOMs: tried to allocate 8 GB with only 697 MB free on A100 40GB
- batch48 impossible: grad_accum = 524288/(48*2048) = 5.33 must be an integer. Since 524288/2048 = 256 = 2^8, only power-of-2 batch sizes divide it evenly (checked in the sketch after this list).
- batch16 works but slower: MFU=53%, grad_accum=16
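A quick divisibility check makes the constraint concrete (plain Python, not nanochat code):

```python
TOTAL_BATCH = 524_288  # tokens per optimizer step (2**19)
SEQ_LEN = 2_048        # max_seq_len (2**11)

for b in (16, 32, 48, 64):
    accum, rem = divmod(TOTAL_BATCH, b * SEQ_LEN)
    if rem == 0:
        print(f"device_batch_size={b}: grad_accum={accum}")
    else:
        print(f"device_batch_size={b}: invalid ({TOTAL_BATCH / (b * SEQ_LEN):.2f})")
# 16 -> grad_accum=16, 32 -> 8, 48 -> 5.33 (invalid), 64 -> 4
```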
Data Sufficiency
- d10 needs 786M tokens (1500 steps x 524,288-token batch)
- Each shard is ~250M chars; at 3.5-5.0 chars/token that's 50-71M tokens/shard
- 14 shards (13 train) was borderline: could go multi-epoch if the compression ratio exceeds ~4.0 chars/token
- Fixed: 18 shards (17 train) guarantees a single epoch even at 5 chars/token (arithmetic sketched below)
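The shard arithmetic, as a sketch (the ~250M chars/shard figure is from the note above):

```python
tokens_needed = 1500 * 524_288  # 786,432,000 (~786M)
chars_per_shard = 250e6         # ~250M chars per shard

for chars_per_token in (3.5, 4.0, 4.5, 5.0):
    tokens_per_shard = chars_per_shard / chars_per_token
    print(f"{chars_per_token} chars/tok: {tokens_per_shard / 1e6:.0f}M tok/shard, "
          f"need {tokens_needed / tokens_per_shard:.1f} train shards")
# Worst case (5.0 chars/tok): ~15.7 train shards needed, so 17 train shards stays single-epoch
```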
Tokenizer
- tok_1B (current): 45s training time
- tok_2B (upstream default): 85s (+40s), MFU ~1% higher, val_bpb identical at 20 steps
- 1B is the right choice: marginal quality gain doesn't justify +40s overhead
- Upstream default 2B may help at larger scales, but for d10 baseline it's noise
Convergence
- LR warmdown starts at 35% of training (warmdown_ratio=0.65)
- Final LR = 5% of initial (final_lr_frac=0.05); see the schedule sketch after this list
- No early stopping: convergence is by schedule design
- val_bpb=0.897 at step 1500 was still slightly decreasing
- Chinchilla-optimal is 1404 iters; we run 1500 (1.07x) for safety
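In code, assuming a linear warmdown (the exact decay shape isn't recorded in these notes, so treat this as a sketch, not nanochat's implementation):

```python
def lr_frac(step, num_iterations=1500, warmdown_ratio=0.65, final_lr_frac=0.05):
    """Fraction of the base LR at a given step: flat, then linear warmdown."""
    warmdown_start = num_iterations * (1 - warmdown_ratio)  # step 525, i.e. 35% in
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (num_iterations - warmdown_start)
    return 1.0 + progress * (final_lr_frac - 1.0)  # linear: 1.0 down to 0.05

assert lr_frac(100) == 1.0 and abs(lr_frac(1500) - 0.05) < 1e-9
```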
Eval Cost
- Default eval_tokens=41.9M -> 640 micro-batches -> ~70s per eval
- Reducing to 10M saves ~52s/eval but increases noise
- eval_every=250 with 41.9M tokens: ~8 eval points, ~10min total overhead (arithmetic below)
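The overhead arithmetic (the "+2" for initial and final evals is an assumption made to match the ~8 points quoted above):

```python
eval_tokens = 41.9e6
micro_batch = 32 * 2048                    # 65,536 tokens per micro-batch
print(round(eval_tokens / micro_batch))    # 639, i.e. the ~640 micro-batches above
eval_points = 1500 // 250 + 2              # one eval every 250 steps, plus initial + final
print(f"{eval_points * 70 / 60:.1f} min")  # 9.3 min -> the ~10 min overhead
```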
Hardware Profile (A100-SXM4-40GB)
| Config | MFU | Step time | Grad accum | Peak mem | Status |
|---|---|---|---|---|---|
| d10 batch16 | 53% | 1820ms | 16 | 11.6 GB | works |
| d10 batch32 | 55% | 1770ms | 8 | 21.2 GB | optimal |
| d10 batch64 | - | - | 4 | >39 GB | OOM |
All Runs
| # | Title | val_bpb | Wall time | Cost | Result |
|---|---|---|---|---|---|
| 1 | d6-500 v1 | - | - | $0.53 | dryrun bug |
| 2 | d6-500 v2 | 1.096 | 16m | $1.12 | under-trained |
| 3 | d10-1500 eval@100 | 0.898 | 71m | $5.07 | eval overhead |
| 4 | d8-1764 | - | - | free | MLflow infra |
| 5 | d10-1500 eval@300 | 0.898 | 60m | $4.33 | good |
| 6 | d10 batch32 [dry] | - | - | ~$0.50 | MFU 56% |
| 7 | d10 tok_1B [dry] | - | - | ~$0.50 | feasible |
| 8 | d10 prod v1 | - | - | $1.54 | spot preemption |
| 9 | d10 prod v2 | 0.897 | 55m | $3.95 | production |
| 10 | d10 batch64 [dry] | - | - | ~$0.50 | OOM |
| 11 | d10 tok_2B [dry] | - | - | ~$0.50 | +40s, ~1% MFU |
Total d10 project: ~$25 spent.
Note: d12 baseline moved to separate project (nanochat d12 baseline). d12 on H100 achieves val_bpb=0.854 in ~47min. See that project for details.
Remaining Questions (low priority)
- Does tok_2B actually improve final val_bpb? (needs funded run, ~$4)
- Does reducing eval_tokens to 10M degrade eval quality? (needs funded run)
- d11 or different aspect_ratio? (architectural exploration)
- How does val_bpb scale with depth? (d8/d12 at Chinchilla-optimal)