bclavie | 1 year ago
For (1), it's because BERT has noticeably fewer parameters, and we're comparing at short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.
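To make that intuition concrete, here's a rough back-of-the-envelope sketch in Python. It only counts query-key pairs, and the 128-token window is an illustrative assumption, not a claim about any particular model's config:

    # Attention "cost" proxy: number of query-key pairs each token attends to.
    def full_attention_pairs(seq_len: int) -> int:
        # Every token attends to every token: O(n^2).
        return seq_len * seq_len

    def local_attention_pairs(seq_len: int, window: int = 128) -> int:
        # Each token attends to at most `window` tokens: O(n * w).
        # (window=128 is an assumption for illustration only.)
        return seq_len * min(window, seq_len)

    for n in (128, 512, 8192):
        ratio = local_attention_pairs(n) / full_attention_pairs(n)
        print(f"seq_len={n:>5}: local/full cost ratio = {ratio:.3f}")

    # seq_len=  128: local/full cost ratio = 1.000  <- short context: no savings
    # seq_len=  512: local/full cost ratio = 0.250
    # seq_len= 8192: local/full cost ratio = 0.016  <- long context: big savings

At a sequence length at or below the window, local attention degenerates into full attention, which is why the comparison at short context lengths shows so little benefit from it.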
For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there aren't a lot of LLMs in the ~100M parameter range in the first place!