My intuition is that with very small microbatch sizes you're very likely to end up in one of two modes: either the vast majority of the samples are aligned and thus pruned away, or they are not aligned. So effectively you're dropping a fraction of the samples, but without the advantage of removing the variance between samples that belong in different microbatches.
Yes. It's more of a class-spanning thing. I wanted the batch composition across the two microbatches to be the same. If you have classes 1, 2, 3 in batch one and classes 4, 5, 6 in batch two, I would fully expect the gradients to be roughly orthogonal (cosine distance near 1) or worse, and it could still be a good update. But if you have classes 1, 2, 3 in batch one and classes 1, 2, 3 in batch two, I would fully expect the gradients to be positively correlated, and if they're not you should skip. So you could bring this to microbatches of size 5, for example, as long as you keep the batch composition the same. This poses a big challenge in LLM training, honestly, because technically the number of classes is the vocab size. So I'd need one "a", one "b", etc., which is silly. This is why microbatch gradients in LLMs hit cosine distances of 2. So when you are sampling, you kind of need to ensure the microbatches are at least from the same task.
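(For concreteness, the skip rule described here can be sketched roughly like this; the function names and the threshold value are illustrative, not from the comment, and the rule is only meant to apply when the two microbatches have matching class composition.)

```python
import numpy as np

def cosine_distance(g1, g2):
    # Cosine distance = 1 - cosine similarity.
    # 0 means fully aligned gradients, 1 means orthogonal, 2 means exactly opposed.
    return 1.0 - np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))

def should_skip_update(grad_a, grad_b, threshold=1.0):
    # grad_a, grad_b: flattened gradient vectors from two microbatches that
    # were sampled with the same class composition. If they disagree more
    # than the (illustrative) threshold, skip this optimizer step.
    return cosine_distance(grad_a, grad_b) > threshold
```

For example, two aligned gradients give a distance near 0 (keep the update), while opposed gradients approach the distance-of-2 case mentioned above (skip).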
ithkuil|1 year ago
fchaubard|1 year ago