
nanochat baseline search

UUID: 46050216-9ba8-4b7a-9784-5699097a4433

nanochat Baseline Search: COMPLETED

Final Production Baseline

depth=10, num_iterations=1500, device_batch_size=32
max_seq_len=2048, num_data_shards=18, eval_every=250
core_metric_every=-1, sample_every=-1
upload_checkpoint=0, tok_max_chars=2000000000
dryrun: 20 iterations

Entry point: a100_d10 (renamed from a100_preset)
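
For reference, the baseline boils down to something like the config below. This is a minimal sketch with made-up names (A100_D10, DRYRUN); the key names mirror the flags listed above, but the repo's actual preset format may differ.

```python
# Hypothetical sketch of the a100_d10 entry point as a plain config dict; key
# names mirror the flags listed above, the real preset format may differ.
A100_D10 = dict(
    depth=10,                      # transformer depth
    num_iterations=1500,           # optimizer steps
    device_batch_size=32,          # sequences per device per micro-batch
    max_seq_len=2048,              # tokens per sequence
    num_data_shards=18,            # data shards (17 used for training)
    eval_every=250,                # val_bpb evaluation interval, in steps
    core_metric_every=-1,          # disabled
    sample_every=-1,               # disabled
    upload_checkpoint=0,           # keep checkpoints local
    tok_max_chars=2_000_000_000,   # tokenizer training corpus size, in chars
)

DRYRUN = {**A100_D10, "num_iterations": 20}   # quick 20-iteration feasibility check
```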

Final validated result: val_bpb=0.897, wall_time=60.7min, MFU=55%, cost=$4.36

Deep Analysis

Batch Size

  • batch32 is optimal (MFU=55%, 21 GB peak)
  • batch64 OOMs: tried to allocate 8 GB with only 697 MB free on A100 40GB
  • batch48 impossible: grad_accum = 524288/(48*2048) = 5.33 is not an integer, so only power-of-2 batch sizes (divisors of 256) work; see the check after this list
  • batch16 works but slower: MFU=53%, grad_accum=16
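
A minimal check of the divisibility constraint, assuming the fixed 524,288-token total batch and 2048-token sequences (constant names below are illustrative):

```python
# Sketch: which device_batch_size values divide the fixed total batch of
# 524,288 tokens/step at max_seq_len=2048.
TOTAL_BATCH_TOKENS = 524_288
MAX_SEQ_LEN = 2_048
SEQS_PER_STEP = TOTAL_BATCH_TOKENS // MAX_SEQ_LEN   # 256 sequences per step

for device_batch_size in (16, 32, 48, 64):
    grad_accum, rem = divmod(SEQS_PER_STEP, device_batch_size)
    verdict = f"grad_accum={grad_accum}" if rem == 0 else "invalid (grad_accum not an integer)"
    print(f"batch{device_batch_size}: {SEQS_PER_STEP / device_batch_size:.2f} -> {verdict}")
# batch16 -> 16, batch32 -> 8, batch48 -> 5.33 (invalid), batch64 -> 4 (OOMs in practice)
```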

Data Sufficiency

  • d10 needs ~786M tokens (1500 steps * 524K-token batch)
  • Each shard ~ 250M chars. At 3.5-5.0 chars/token: 50-71M tokens/shard
  • 14 shards (13 train) was borderline; could go multi-epoch if compression ratio > 4.0 chars/token
  • Fixed: 18 shards (17 train) guarantees a single epoch even at 5 chars/token (see the sketch after this list)
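
The same arithmetic as a quick sketch, using the assumed 3.5-5.0 chars/token range and ~250M chars per shard (constant names are illustrative):

```python
# Sketch: token budget vs. shard supply for the d10 baseline.
TOKENS_NEEDED = 1500 * 524_288        # ~786M tokens for 1500 steps
CHARS_PER_SHARD = 250_000_000         # ~250M chars per shard

for chars_per_token in (3.5, 4.0, 5.0):
    tokens_per_shard = CHARS_PER_SHARD / chars_per_token
    shards_needed = TOKENS_NEEDED / tokens_per_shard
    print(f"{chars_per_token} chars/token: {tokens_per_shard/1e6:.0f}M tokens/shard, "
          f"need {shards_needed:.1f} train shards")
# 3.5 -> 11.0 shards, 4.0 -> 12.6 shards, 5.0 -> 15.7 shards:
# 13 train shards is borderline, 17 train shards stays single-epoch even at 5 chars/token.
```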

Tokenizer

  • tok_1B (current): 45s training time
  • tok_2B (upstream default): 85s (+40s), MFU ~1% higher, val_bpb identical at 20 steps
  • 1B is the right choice: marginal quality gain doesn't justify +40s overhead
  • Upstream default 2B may help at larger scales, but for d10 baseline it's noise

Convergence

  • LR warmdown starts at 35% of training (warmdown_ratio=0.65)
  • Final LR = 5% of initial (final_lr_frac=0.05); see the schedule sketch after this list
  • No early stopping; convergence is by schedule design
  • val_bpb=0.897 at step 1500 was still slightly decreasing
  • Chinchilla-optimal is 1404 iters; we run 1500 (1.07x) for safety
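
A minimal sketch of the implied schedule: flat until 35% of training, then a warmdown to 5% of the base LR. The linear warmdown shape is an assumption; nanochat's actual scheduler may differ in detail.

```python
# Sketch: LR multiplier over training, assuming a flat phase followed by a
# linear warmdown to final_lr_frac. The exact warmdown shape is an assumption.
NUM_ITERATIONS = 1500
WARMDOWN_RATIO = 0.65     # final 65% of training is warmdown (starts at step 525)
FINAL_LR_FRAC = 0.05      # end at 5% of the base LR

def lr_multiplier(step: int) -> float:
    warmdown_start = NUM_ITERATIONS * (1 - WARMDOWN_RATIO)
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (NUM_ITERATIONS - warmdown_start)
    return 1.0 - progress * (1.0 - FINAL_LR_FRAC)

print([round(lr_multiplier(s), 3) for s in (0, 525, 1000, 1500)])  # [1.0, 1.0, 0.537, 0.05]
```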

Eval Cost

  • Default eval_tokens=41.9M -> 640 micro-batches -> ~70s per eval
  • Reducing to 10M saves ~52s/eval but increases noise
  • eval_every=250 with 41.9M tokens: ~8 points, ~10min overhead (worked through in the sketch below)
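
The overhead estimate worked through as a quick sketch; assuming an eval also runs at step 0 and at the final step, which accounts for the ~8 points (constant names are illustrative):

```python
import math

# Sketch: per-eval cost and total eval overhead at eval_every=250.
EVAL_TOKENS = 41.9e6
DEVICE_BATCH_SIZE, MAX_SEQ_LEN = 32, 2048
SECONDS_PER_EVAL = 70    # measured on the A100 runs

micro_batches = math.ceil(EVAL_TOKENS / (DEVICE_BATCH_SIZE * MAX_SEQ_LEN))  # 640
eval_points = 1500 // 250 + 2     # every 250 steps, plus assumed step-0 and final evals: 8
overhead_min = eval_points * SECONDS_PER_EVAL / 60                          # ~9-10 minutes

print(f"{micro_batches} micro-batches/eval, {eval_points} evals, ~{overhead_min:.0f} min overhead")
```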

Hardware Profile (A100-SXM4-40GB)

Config        MFU   Step time   Grad accum   Peak mem   Status
d10 batch16   53%   1820 ms     16           11.6 GB    works
d10 batch32   55%   1770 ms     8            21.2 GB    optimal
d10 batch64   -     -           4            >39 GB     OOM

All Runs

 #   Title               val_bpb   Wall time   Cost     Result
 1   d6-500 v1           -         -           $0.53    dryrun bug
 2   d6-500 v2           1.096     16m         $1.12    under-trained
 3   d10-1500 eval@100   0.898     71m         $5.07    eval overhead
 4   d8-1764             -         -           free     MLflow infra
 5   d10-1500 eval@300   0.898     60m         $4.33    good
 6   d10 batch32 [dry]   -         -           ~$0.50   MFU 56%
 7   d10 tok_1B [dry]    -         -           ~$0.50   feasible
 8   d10 prod v1         -         -           $1.54    spot preemption
 9   d10 prod v2         0.897     55m         $3.95    production
10   d10 batch64 [dry]   -         -           ~$0.50   OOM
11   d10 tok_2B [dry]    -         -           ~$0.50   +40s, ~1% MFU

Total d10 project: ~$25 spent.

Note: the d12 baseline moved to a separate project (nanochat d12 baseline). d12 on H100 achieves val_bpb=0.854 in ~47min. See that project for details.

Remaining Questions (low priority)

  • Does tok_2B actually improve final val_bpb? (needs funded run, ~$4)
  • Does reducing eval_tokens to 10M degrade eval quality? (needs funded run)
  • d11 or different aspect_ratio? (architectural exploration)
  • How does val_bpb scale with depth? (d8/d12 at Chinchilla-optimal)
