

entilzha | 1 year ago

(Author Here)

Related thought: I think BPE is quite a good, cheap inductive bias to have in a model, which is part of why it was challenging to scale better than BPE. I also suspect this is part of why BPE is better with fewer training FLOPs (left side of Figure 1): BLT has to spend some of its FLOP budget to recover or learn some of this useful bias. With more training FLOPs, this becomes a smaller fraction of the budget, which leads to better scaling.
