(no title)
kouteiheika | 1 month ago
It's hard to compare without more details about the training process and the dataset, but, is it? Genuine question, because I had the opposite impression. Like, for example, recently I did a full finetuning run on a 3B model chewing through a 146k entry dataset (with 116k entries having reasoning traces, so they're not short) in 7 hours on a single RTX 6000.
unknown|1 month ago
[deleted]
kevinlu1248|1 month ago