Engineering-wise it's actually quite trivial, but the underlying question is which modality best elicits spatial reasoning capabilities from current general models.
We tried (very anecdotally) a couple of months ago to get an agent to reason over a few ASCII representations of factories, and the results weren't very promising. The models seem to struggle to build an accurate internal spatial representation of the game state from textual tokens alone. The question is what the most efficient, high-quality representation would be to improve that.
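For concreteness, a minimal sketch (Python, with made-up object names, not our actual setup) of the two encodings in question: the same toy factory state rendered as an ASCII grid versus an explicit coordinate list.

    # Hypothetical factory state: object name -> (x, y) grid cell.
    state = {"furnace": (1, 1), "belt": (2, 1), "chest": (4, 3)}
    GLYPHS = {"furnace": "F", "belt": "b", "chest": "C"}

    def to_ascii(state, width=6, height=5):
        """Render the state as an ASCII grid (what our agent reasoned over)."""
        grid = [["." for _ in range(width)] for _ in range(height)]
        for name, (x, y) in state.items():
            grid[y][x] = GLYPHS[name]
        return "\n".join("".join(row) for row in grid)

    def to_coords(state):
        """Alternative encoding: one explicit coordinate line per object."""
        return "\n".join(f"{name}: x={x}, y={y}" for name, (x, y) in state.items())

    print(to_ascii(state))   # spatial layout is implicit in character positions
    print(to_coords(state))  # spatial layout is explicit but loses adjacency cues

Both carry the same information; the open question is which one the model can actually turn into a usable internal map.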
groby_b|11 months ago
That'd actually be interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which would be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world-model capabilities, which would be even more interesting.)
Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]
ajcp|11 months ago
In my experience the current generation of models is very poor at spatial reasoning, even when given accurate coordinate-based location assignments for each object. But I suspect that once a model can build the full web of relationships among objects, by being given those spatial relationships as vectors, it will do much better.
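To make that concrete, a rough sketch (Python, hypothetical object names) of what "given those spatial relationships as vectors" could look like: precompute pairwise offsets, distances, and bearings, and serialize those for the model instead of raw coordinates.

    import itertools
    import math

    # Hypothetical scene: object name -> (x, y) coordinates.
    objects = {"furnace": (1.0, 1.0), "belt": (2.0, 1.0), "chest": (4.0, 3.0)}

    def relation_vectors(objects):
        """Pairwise relative positions: offset, distance, and bearing per pair."""
        relations = []
        for (a, (ax, ay)), (b, (bx, by)) in itertools.combinations(objects.items(), 2):
            dx, dy = bx - ax, by - ay
            relations.append({
                "pair": (a, b),
                "offset": (dx, dy),
                "distance": math.hypot(dx, dy),
                "bearing_deg": math.degrees(math.atan2(dy, dx)),
            })
        return relations

    # Serialize relationships rather than raw points for the prompt.
    for r in relation_vectors(objects):
        a, b = r["pair"]
        print(f"{b} is {r['distance']:.1f} units from {a} "
              f"at {r['bearing_deg']:.0f} deg (offset {r['offset']})")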
martbakler|11 months ago