tadala|1 year ago
Everyone wants to use less compute to fit more in, but (obviously?) the solution will be to use more compute and fit less. Attention isn't (topologically) attentive enough. All these RNN-lite approaches are doomed beyond saving costs; they're going to get cooked by some other architecture that's even more expensive than transformers.
falcor84|1 year ago
ithkuil|1 year ago