sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
absolutely!
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
diffusion is promising, but it's still an open question how data-efficient diffusion models are compared to AR. in practice, you can also train AR models forever with high enough regularization, so let's see.
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
yes! typically the optimizer that trains faster also gets better data efficiency. it may not be universally true, but that has been my observation so far. also see
https://arxiv.org/pdf/2510.09378 for second-order methods.
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
no, ensembling means training 8 models and, during inference, averaging the logits of all 8 models to make a prediction.
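a minimal sketch of what that logit averaging looks like, assuming each model is just a callable returning logits (the function name and the toy 3-model, 4-token setup are purely illustrative, not the Slowrun code):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the logits from every model, then pick the argmax token."""
    logits = np.mean([m(x) for m in models], axis=0)
    return int(np.argmax(logits))

# toy usage: 3 "models" returning fixed logits over a 4-token vocabulary
models = [
    lambda x: np.array([0.1, 2.0, 0.3, 0.0]),
    lambda x: np.array([0.2, 1.5, 0.9, 0.1]),
    lambda x: np.array([0.0, 1.0, 2.5, 0.2]),
]
print(ensemble_predict(models, None))  # averaged logits peak at index 1
```

in the actual setup you'd average over 8 trained models per token position; the mechanics are the same.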
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
yeah, we do incorporate some of the findings from the paper in our repo, like aggressive regularization and ensembling!
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. it's just that the kinds of algorithms it allows are somewhat constrained, because it optimizes for wall-clock time.
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
yes, good point. right now it's somewhat hard to overfit, because the meta-optimization extracts only tiny bits of information. but over time, we will switch the validation set to some other random subset of FineWeb, or even to entirely OOD datasets!
sdpmas
|
4 days ago
|
on: NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
hey, it's Samip (behind the Slowrun repo). yeah, that's a fair point, we will mention them in the blog. but there are a couple of major differences:
1. our emphasis is on using more compute to get better data efficiency. this is important because there are lots of hacky changes that will get lower loss, but when compared against general methods that leverage a lot of compute, they don't do so well. and you can already see how this emphasis on compute leads to different methods than BabyLM's!
2. the reasoning behind our repo has nothing to do with how much data a child sees, and our dataset is not tailored towards that either. it's simply pretraining on a random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding them.