top | item 46716039

> On the infra side, training a 1.5B model in ~4 hours on 8×H100 is impressive.

It's hard to compare without more details about the training process and the dataset, but is it? Genuine question, because I had the opposite impression. For example, I recently did a full fine-tuning run on a 3B model over a 146k-entry dataset (116k of the entries have reasoning traces, so they're not short) in 7 hours on a single RTX 6000.
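To put rough numbers on that comparison, here is a minimal sketch. It assumes the common rule of thumb that dense-transformer training costs about 6 × parameters × tokens FLOPs, and it treats "~4 hours" as exactly 4. Neither run's token count is given, so the script only computes how many times more tokens the 8×H100 run would have to process to match the single-GPU run in FLOPs per GPU-hour. The function names are made up for illustration, and the result ignores that an H100 and an RTX 6000 are very different hardware.

```python
def gpu_hours(num_gpus: int, wall_hours: float) -> float:
    """Total GPU-hours consumed by a run."""
    return num_gpus * wall_hours


def breakeven_token_ratio(params_a: float, gpu_hours_a: float,
                          params_b: float, gpu_hours_b: float) -> float:
    """Token ratio D_a / D_b at which runs A and B do equal FLOPs per GPU-hour.

    Assumes FLOPs ~= 6 * params * tokens (a rule of thumb, not a measurement).
    Setting 6*Na*Da/Ha == 6*Nb*Db/Hb and solving gives
    Da/Db = (Nb/Hb) / (Na/Ha).
    """
    return (params_b / gpu_hours_b) / (params_a / gpu_hours_a)


# Run A: 1.5B model, ~4 hours on 8 GPUs (from the quoted comment).
h100_run = gpu_hours(8, 4.0)
# Run B: 3B model, 7 hours on 1 GPU (from this comment).
rtx_run = gpu_hours(1, 7.0)

ratio = breakeven_token_ratio(1.5e9, h100_run, 3e9, rtx_run)

print(f"Run A: {h100_run:.0f} GPU-hours, run B: {rtx_run:.0f} GPU-hours")
print(f"Run A needs {ratio:.2f}x the tokens of run B to match FLOPs per GPU-hour")
```

So the 1.5B run used 32 GPU-hours against 7 for the 3B run, and whether that is impressive depends almost entirely on how many tokens it saw, which is exactly the missing detail.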

unknown|1 month ago

[deleted]

kevinlu1248|1 month ago

Honestly, I think we can improve our training throughput drastically with a few more optimizations, but we've been spending most of our time on model quality improvements instead.