cubie | 1 year ago
Beyond what the others have said, namely that 1) ModernBERT-base has 149M parameters vs. BERT-base's 110M and 2) most LLMs are decoder-only models, also consider that alternating attention (local vs. global) only starts helping once you're processing longer texts. For sequences shorter than the local window, every token can attend to every other token anyway, so local attention is equivalent to global attention.
I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.
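A quick way to see the equivalence is to build the boolean attention masks directly: when the sequence fits inside the local window, the sliding-window mask is identical to the full (global) mask. A minimal sketch in PyTorch, assuming a 128-token local window like the one ModernBERT reports (the function name and sequence lengths here are just illustrative):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where token i may attend to token j.

    Each token sees at most window // 2 neighbors on either side.
    """
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2

# A short GLUE-style input: 64 tokens, well inside a 128-token window.
short = sliding_window_mask(seq_len=64, window=128)
# Local mask == global (all-True) mask: local attention changes nothing.
assert torch.equal(short, torch.ones(64, 64, dtype=torch.bool))

# A long input: 512 tokens. Now the window actually restricts attention.
long = sliding_window_mask(seq_len=512, window=128)
assert not torch.equal(long, torch.ones(512, 512, dtype=torch.bool))
```

So on short-text benchmarks the local layers behave exactly like global ones, and the alternating-attention design only pays off (in speed and memory) at longer lengths.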