entilzha | 1 year ago
Related thought: I think BPE is quite a good, cheap inductive bias to have in a model, which is part of what made it challenging to scale better than. I also suspect this is part of why BPE is better with fewer training FLOPs (left side of figure 1): BLT has to spend some of its FLOPs budget to recover/learn some of this useful bias. With more training FLOPs, that cost becomes a smaller fraction of the budget, which leads to better scaling.