top | item 40811310

alekandreev | 1 year ago

Your training input has shape (sequence length × batch size). If many of your samples are shorter than the sequence length, as is usually the case, you will have a lot of padding tokens in the input, which is wasted compute.

To compensate for that, you can pack multiple examples into the same sequence. This is where EOS and BOS come in, as they indicate to the model that the two parts of the sequence are not related.
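A minimal sketch of the packing idea described above. The token ids (BOS=1, EOS=2, PAD=0) and the function name are hypothetical, not from any particular library:

```python
# Hypothetical special-token ids for illustration.
BOS, EOS, PAD = 1, 2, 0

def pack_examples(examples, seq_len):
    """Greedily pack tokenized examples (lists of ids) into rows of
    length seq_len, framing each example with BOS/EOS so the model
    can tell where one example ends and the next begins."""
    rows, row = [], []
    for ex in examples:
        framed = [BOS] + ex + [EOS]
        if len(framed) > seq_len:
            continue  # skip examples too long to fit in one row
        if len(row) + len(framed) > seq_len:
            # Current row is full: pad it out and start a new one.
            rows.append(row + [PAD] * (seq_len - len(row)))
            row = []
        row += framed
    if row:
        rows.append(row + [PAD] * (seq_len - len(row)))
    return rows

# Two short examples share one row; only the tail of each row is padding.
rows = pack_examples([[5, 6], [7], [8, 9, 10]], seq_len=8)
# rows[0] == [1, 5, 6, 2, 1, 7, 2, 0]
# rows[1] == [1, 8, 9, 10, 2, 0, 0, 0]
```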


thomasahle | 1 year ago

You can just do that by shaping the attention mask, no? That also gives you an actual guarantee that no information is leaked between conversations.
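The mask this comment has in mind is block-diagonal combined with the usual causal mask. A sketch, assuming each packed token carries a hypothetical segment id saying which example it came from:

```python
import numpy as np

# Hypothetical packed row of length 8 holding two examples:
# tokens 0-3 are example 0, tokens 4-7 are example 1.
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Token i may attend to token j only if j <= i (causal) AND both
# tokens belong to the same packed example (block-diagonal).
causal = np.tril(np.ones((8, 8), dtype=bool))
same_segment = segment_ids[:, None] == segment_ids[None, :]
mask = causal & same_segment
```

With this mask, no position in the second example can see the first example at all, which is the "actual guarantee" of no leakage; BOS/EOS alone only give the model a hint it can learn to respect.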

suryabhupa | 1 year ago

In practice, and at scale, that's exactly what having <bos> and <eos> tokens allows you to do easily and programmatically.

danielmarkbruce | 1 year ago

You can't pack multiple examples into a single row of a matrix without knowing where one begins and one ends.