wohoef | 8 months ago

In my experience LLMs have a hard time working with text grids like this. They seem to find columns harder to “detect” than rows, probably because their input shows the grid as one giant row, if that makes sense.

It has the same problem with playing chess. But I’m not sure if there is a data format it could work with for this kind of game. Currently it seems more like LLMs can’t really work on spatial problems. But this should actually be something that can be fixed (pretty sure I saw an article about it on HN recently)
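To illustrate the point about columns: a grid reaches the model as one flat token sequence, so row neighbors stay adjacent while column neighbors end up a full row apart. A minimal sketch (the grid and cell labels are made up for illustration):

```python
# A 3x3 grid as an LLM sees it: flattened row-major into one sequence.
grid = [["a", "b", "c"],
        ["d", "e", "f"],
        ["g", "h", "i"]]

flat = [cell for row in grid for cell in row]

# Row neighbors ("a", "b") are adjacent in the sequence (distance 1),
# but column neighbors ("a", "d") are a full row apart (distance 3).
print(flat.index("b") - flat.index("a"))  # 1
print(flat.index("d") - flat.index("a"))  # 3
```

For an N-column grid that vertical distance grows to N, which is one plausible reason vertical structure is harder for the model to pick up.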

fi-le | 8 months ago

Good point. The architectural solution that would come to mind is 2D text embeddings, i.e. we add 2 sines and cosines to each token embedding instead of 1. Apparently people have done it before: https://arxiv.org/abs/2409.19700v2
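A rough sketch of what that could look like: encode row and column positions with separate 1D sinusoidal encodings and concatenate them, so each cell's embedding carries both coordinates. This is an assumption-laden illustration of the general idea, not the scheme from the linked paper:

```python
import numpy as np

def sincos_1d(positions, dim):
    # Standard 1D sinusoidal encoding: half sines, half cosines.
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, freqs)        # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def sincos_2d(rows, cols, dim):
    # 2D variant: encode row and column positions separately, then
    # concatenate, so tokens in the same column share half their
    # positional signal even when they are far apart in the sequence.
    assert dim % 4 == 0
    r = sincos_1d(np.arange(rows), dim // 2)   # (rows, dim/2)
    c = sincos_1d(np.arange(cols), dim // 2)   # (cols, dim/2)
    return np.concatenate(
        [np.repeat(r[:, None, :], cols, axis=1),
         np.repeat(c[None, :, :], rows, axis=0)],
        axis=-1,
    )                                          # (rows, cols, dim)

emb = sincos_2d(4, 5, 8)
print(emb.shape)  # (4, 5, 8)
```

The payoff is that two cells in the same column get identical column halves of their embedding, which gives attention a direct handle on vertical adjacency.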

ninjha | 8 months ago

I think I remember one of the original ViT papers saying something about 2D embeddings on image patches not actually increasing performance on image recognition or segmentation, so it’s kind of interesting that it helps with text!

E: I found the paper: https://arxiv.org/pdf/2010.11929

> We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4).

Although it looks like that was just ImageNet so maybe this isn't that surprising.

froobius | 8 months ago

Transformers can easily be trained or designed to handle grids; it's just that off-the-shelf LLMs haven't been trained for that in particular (although they would have seen some grids in their data)

nine_k | 8 months ago

Are there any well-known examples of this working in practice?

stavros | 8 months ago

If this were a limitation in the architecture, they wouldn't be able to work with images, no?

hnlmorg | 8 months ago

LLMs don’t work with images.