nanochat Baseline Search - COMPLETED
Final Production Baseline
depth=10, num_iterations=1500, device_batch_size=32
max_seq_len=2048, num_data_shards=18, eval_every=250
core_metric_every=-1, sample_every=-1
upload_checkpoint=0, tok_max_chars=2000000000
dryrun: 20 iterations
Entry point: a100_d10 (renamed from a100_preset)
Final validated result: val_bpb=0.897, wall_time=60.7min, MFU=55%, cost=$4.36
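The same configuration as a runnable reference, with the derived token budget (a minimal sketch; the dict is illustrative, not nanochat code, and the 524,288-token total batch is taken from the analysis below):

```python
# Final production baseline (values from this page).
config = dict(
    depth=10, num_iterations=1500, device_batch_size=32, max_seq_len=2048,
    num_data_shards=18, eval_every=250, core_metric_every=-1, sample_every=-1,
    upload_checkpoint=0, tok_max_chars=2_000_000_000,
)

TOTAL_BATCH_TOKENS = 524_288  # tokens per optimizer step (2**19, see Batch Size below)
grad_accum = TOTAL_BATCH_TOKENS // (config["device_batch_size"] * config["max_seq_len"])
total_tokens = TOTAL_BATCH_TOKENS * config["num_iterations"]
print(grad_accum, f"{total_tokens / 1e6:.0f}M tokens")  # 8 786M tokens
```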
Deep Analysis
Batch Size
- batch32 is optimal (MFU=55%, 21 GB peak)
- batch64 OOMs: tried to allocate 8 GB with only 697 MB free on A100 40GB
- batch48 impossible: grad_accum = 524288/(48*2048) = 5.33 must be an integer. Since 524288/2048 = 256 = 2^8, only power-of-2 batch sizes divide it evenly (checked in the sketch after this list).
- batch16 works but slower: MFU=53%, grad_accum=16
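A quick divisibility check makes the constraint concrete (plain Python, not nanochat code):

```python
TOTAL_BATCH = 524_288  # tokens per optimizer step (2**19)
SEQ_LEN = 2_048        # max_seq_len (2**11)

for b in (16, 32, 48, 64):
    accum, rem = divmod(TOTAL_BATCH, b * SEQ_LEN)
    if rem == 0:
        print(f"device_batch_size={b}: grad_accum={accum}")
    else:
        print(f"device_batch_size={b}: invalid ({TOTAL_BATCH / (b * SEQ_LEN):.2f})")
# 16 -> grad_accum=16, 32 -> 8, 48 -> 5.33 (invalid), 64 -> 4
```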
Data Sufficiency
- d10 needs 786M tokens (1500 steps x 524,288-token batch)
- Each shard is ~250M chars; at 3.5-5.0 chars/token that's 50-71M tokens/shard
- 14 shards (13 train) was borderline: could go multi-epoch if the compression ratio exceeds ~4.0 chars/token
- Fixed: 18 shards (17 train) guarantees a single epoch even at 5 chars/token (arithmetic sketched below)
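The shard arithmetic, as a sketch (the ~250M chars/shard figure is from the note above):

```python
tokens_needed = 1500 * 524_288  # 786,432,000 (~786M)
chars_per_shard = 250e6         # ~250M chars per shard

for chars_per_token in (3.5, 4.0, 4.5, 5.0):
    tokens_per_shard = chars_per_shard / chars_per_token
    print(f"{chars_per_token} chars/tok: {tokens_per_shard / 1e6:.0f}M tok/shard, "
          f"need {tokens_needed / tokens_per_shard:.1f} train shards")
# Worst case (5.0 chars/tok): ~15.7 train shards needed, so 17 train shards stays single-epoch
```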
Tokenizer
- tok_1B (current): 45s training time
- tok_2B (upstream default): 85s (+40s), MFU ~1% higher, val_bpb identical at 20 steps
- 1B is the right choice: marginal quality gain doesn't justify +40s overhead
- Upstream default 2B may help at larger scales, but for d10 baseline it's noise
Convergence
- LR warmdown starts at 35% of training (warmdown_ratio=0.65)
- Final LR = 5% of initial (final_lr_frac=0.05); see the schedule sketch after this list
- No early stopping: convergence is by schedule design
- val_bpb=0.897 at step 1500 was still slightly decreasing
- Chinchilla-optimal is 1404 iters; we run 1500 (1.07x) for safety
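In code, assuming a linear warmdown (the exact decay shape isn't recorded in these notes, so treat this as a sketch, not nanochat's implementation):

```python
def lr_frac(step, num_iterations=1500, warmdown_ratio=0.65, final_lr_frac=0.05):
    """Fraction of the base LR at a given step: flat, then linear warmdown."""
    warmdown_start = num_iterations * (1 - warmdown_ratio)  # step 525, i.e. 35% in
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (num_iterations - warmdown_start)
    return 1.0 + progress * (final_lr_frac - 1.0)  # linear: 1.0 down to 0.05

assert lr_frac(100) == 1.0 and abs(lr_frac(1500) - 0.05) < 1e-9
```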
Eval Cost
- Default eval_tokens=41.9M -> 640 micro-batches -> ~70s per eval
- Reducing to 10M saves ~52s/eval but increases noise
- eval_every=250 with 41.9M tokens: ~8 eval points, ~10min total overhead (arithmetic below)
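The overhead arithmetic (the "+2" for initial and final evals is an assumption made to match the ~8 points quoted above):

```python
eval_tokens = 41.9e6
micro_batch = 32 * 2048                    # 65,536 tokens per micro-batch
print(round(eval_tokens / micro_batch))    # 639, i.e. the ~640 micro-batches above
eval_points = 1500 // 250 + 2              # one eval every 250 steps, plus initial + final
print(f"{eval_points * 70 / 60:.1f} min")  # 9.3 min -> the ~10 min overhead
```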
Hardware Profile (A100-SXM4-40GB)
| Config | MFU | Step time | Grad accum | Peak mem | Status |
|---|---|---|---|---|---|
| d10 batch16 | 53% | 1820ms | 16 | 11.6 GB | works |
| d10 batch32 | 55% | 1770ms | 8 | 21.2 GB | optimal |
| d10 batch64 | - | - | 4 | >39 GB | OOM |
All Runs
| # | Title | val_bpb | Wall time | Cost | Result |
|---|---|---|---|---|---|
| 1 | d6-500 v1 | - | - | $0.53 | dryrun bug |
| 2 | d6-500 v2 | 1.096 | 16m | $1.12 | under-trained |
| 3 | d10-1500 eval@100 | 0.898 | 71m | $5.07 | eval overhead |
| 4 | d8-1764 | - | - | free | MLflow infra |
| 5 | d10-1500 eval@300 | 0.898 | 60m | $4.33 | good |
| 6 | d10 batch32 [dry] | - | - | ~$0.50 | MFU 56% |
| 7 | d10 tok_1B [dry] | - | - | ~$0.50 | feasible |
| 8 | d10 prod v1 | - | - | $1.54 | spot preemption |
| 9 | d10 prod v2 | 0.897 | 55m | $3.95 | production |
| 10 | d10 batch64 [dry] | - | - | ~$0.50 | OOM |
| 11 | d10 tok_2B [dry] | - | - | ~$0.50 | +40s, ~1% MFU |
Total d10 project: ~$25 spent.
Note: d12 baseline moved to separate project (nanochat d12 baseline). d12 on H100 achieves val_bpb=0.854 in ~47min. See that project for details.
Remaining Questions (low priority)
- Does tok_2B actually improve final val_bpb? (needs funded run, ~$4)
- Does reducing eval_tokens to 10M degrade eval quality? (needs funded run)
- d11 or different aspect_ratio? (architectural exploration)
- How does val_bpb scale with depth? (d8/d12 at Chinchilla-optimal)