My intuition is that with very small microbatch sizes you're very likely to end up in one of two modes: either the vast majority of the samples are aligned and thus pruned away, or they are not aligned. So effectively you're dropping a fraction of the samples, but without the advantage of removing the variance between samples that belong in different microbatches.
Yes. It's more of a class-spanning thing. I wanted the batch composition across the two microbatches to be the same. If you have classes 1, 2, 3 in batch one and classes 4, 5, 6 in batch two, I would fully expect the gradients to be roughly orthogonal (cosine distance near 1) or worse, and it could still be a good update. But if you have classes 1, 2, 3 in batch one and classes 1, 2, 3 in batch two, I would fully expect the gradients to be positively correlated, and if they're not you should skip. So you could bring this to microbatches of size 5, for example, as long as you keep the batch composition the same. This poses a big challenge in LLM training, honestly, because technically the number of classes is the vocab size. So I'd need one "a", one "b", etc., which is silly. This is why microbatch gradients in LLMs hit cosine distances of 2. So when you are sampling, you kind of need to ensure the microbatches are at least from the same task.
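(For concreteness, the skip rule described here can be sketched roughly like this; the function names and the threshold value are illustrative, not from the comment, and the rule is only meant to apply when the two microbatches have matching class composition.)

```python
import numpy as np

def cosine_distance(g1, g2):
    # Cosine distance = 1 - cosine similarity.
    # 0 means fully aligned gradients, 1 means orthogonal, 2 means exactly opposed.
    return 1.0 - np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))

def should_skip_update(grad_a, grad_b, threshold=1.0):
    # grad_a, grad_b: flattened gradient vectors from two microbatches that
    # were sampled with the same class composition. If they disagree more
    # than the (illustrative) threshold, skip this optimizer step.
    return cosine_distance(grad_a, grad_b) > threshold
```

For example, two aligned gradients give a distance near 0 (keep the update), while opposed gradients approach the distance-of-2 case mentioned above (skip).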
ithkuil|1 year ago
fchaubard|1 year ago