fchaubard | 1 year ago

Yes. It’s more of a class-spanning thing: I wanted the class composition across the two microbatches to be the same. If you have classes 1, 2, 3 in microbatch one and classes 4, 5, 6 in microbatch two, I would fully expect the gradients to be orthogonal or worse, and it could still be a good update. But if you have classes 1, 2, 3 in microbatch one and classes 1, 2, 3 in microbatch two, I would fully expect the gradients to be positively correlated, and if they aren’t you should skip the update. So you could extend this to microbatches of size 5, for example, as long as you keep the same class composition (see the sketch below).

This poses a big challenge in LLM training, honestly, because technically the number of classes is the vocab size. So I’d need one “a”, one “b”, etc. in each microbatch, which is silly. This is why micro-gradients in LLMs hit a cosine distance of 2, i.e. fully anti-correlated. So when you are sampling, you kind of need to ensure the microbatches are at least of the same task.
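A minimal PyTorch sketch of that skip heuristic, assuming two (inputs, targets) microbatches drawn with the same class composition. The names model, loss_fn, mb1, mb2, and apply_update are placeholders, not from the original discussion:

    import torch
    import torch.nn.functional as F

    def flat_grad(model):
        # Flatten every parameter's .grad into one long vector.
        return torch.cat([p.grad.detach().flatten()
                          for p in model.parameters() if p.grad is not None])

    def per_microbatch_grads(model, loss_fn, microbatches):
        # Compute each microbatch's gradient independently.
        grads = []
        for inputs, targets in microbatches:
            model.zero_grad()
            loss_fn(model(inputs), targets).backward()
            grads.append(flat_grad(model))
        return grads

    g1, g2 = per_microbatch_grads(model, loss_fn, [mb1, mb2])
    # With matched class composition the gradients should be positively
    # correlated. Cosine distance = 1 - cosine similarity, so the
    # "distance of 2" case above is similarity = -1 (fully anti-correlated).
    if F.cosine_similarity(g1, g2, dim=0) > 0:
        apply_update(model, 0.5 * (g1 + g2))  # hypothetical optimizer step
    else:
        pass  # gradients conflict despite matched classes: skip this step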
