top | item 42986254

(no title)

jbay808 | 1 year ago

It might seem like you could sort with just pairwise correlations, but on closer analysis, you cannot. Generating the next correct token requires correctly weighing the entire context window.


dartos|1 year ago

Of course, that’s how attention works, after all.

But by specifically avoiding certain cases in training, we could check whether the model is generalizing or not.
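A minimal sketch of what such a held-out split could look like. The rule used here (never show 3 and 7 in the same training sequence) is purely an illustrative assumption, and `HELD_OUT`, `is_held_out`, and `make_splits` are hypothetical names, not anything from an actual experiment:

```python
import random

# Hypothetical rule: 3 and 7 never appear together in a training sequence.
HELD_OUT = {3, 7}

def is_held_out(seq):
    """True if the sequence contains every value in HELD_OUT."""
    return HELD_OUT.issubset(seq)

def make_splits(n, length=6, vocab=10, seed=0):
    """Generate n (input, sorted) pairs and route held-out cases to the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for _ in range(n):
        seq = [rng.randrange(vocab) for _ in range(length)]
        example = (seq, sorted(seq))
        (test if is_held_out(seq) else train).append(example)
    return train, test

train, test = make_splits(2000)
```

If a model trained only on `train` still sorts the `test` sequences correctly, that would be some evidence it learned an ordering rather than memorizing co-occurrences.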

jbay808|1 year ago

I mean that needing to scan the full context of tokens before the nth is inherent to the problem of sorting. Transformers do scan that input, which is good; it's not surprising that they're up to the task. But pairwise numeral correlations will not do the job.
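To make that concrete, here is a rough sketch in plain Python (not a transformer) of what the "next token" of a sorted output depends on. The function name `next_sorted_token` is just an illustrative choice:

```python
from collections import Counter

def next_sorted_token(inputs, emitted):
    """Next token of the sorted output, given the full input and what was already emitted."""
    # Multiset difference: which input tokens have not been output yet.
    remaining = Counter(inputs) - Counter(emitted)
    if not remaining:
        return None  # everything has been emitted
    # The answer is the minimum over *all* remaining tokens, so every
    # position in the context can affect it.
    return min(remaining)

inputs = [5, 3, 8, 3, 1]
emitted = []
while (tok := next_sorted_token(inputs, emitted)) is not None:
    emitted.append(tok)

# Changing one input token, wherever it sits, can change the next output.
a = next_sorted_token([5, 3, 8, 3, 1], [1])
b = next_sorted_token([5, 3, 8, 2, 1], [1])
```

Here `a` and `b` differ even though only one input position changed, which is why no fixed pairwise statistic between two positions is enough on its own.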

As for avoiding certain cases, that could be done to some extent. But remember that the untrained transformer has no preconception of numbers or ordering (it doesn't use the hardware ALU or an integer data type), so there has to be enough data in the training set for it to learn 0<1<2<3<4<5<6, etc.