top | item 43332607


martbakler | 11 months ago

Just to jump in here as one of the authors

We designed the API to be as spatially descriptive as possible (game state descriptions include x-y coordinates and neighbors), and the agents have tools to help with actions that would benefit from vision (e.g. finding buildable areas of different sizes on the map, placing entities next to other entities, etc.).
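As a rough illustration of what "spatially descriptive" could look like, here is a minimal sketch (not the actual FLE API; the `Entity` class and `describe` function are hypothetical) of serializing game state as text with coordinates and nearby neighbors:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    x: int
    y: int

def describe(entities, radius=3):
    """Render each entity with its coordinates and nearby neighbors as text.

    Hypothetical sketch: neighbors are entities within a Manhattan distance
    of `radius`, so an agent can reason about adjacency without vision.
    """
    lines = []
    for e in entities:
        neighbors = [
            o.name for o in entities
            if o is not e and abs(o.x - e.x) + abs(o.y - e.y) <= radius
        ]
        lines.append(
            f"{e.name} at ({e.x}, {e.y}); "
            f"neighbors: {', '.join(neighbors) or 'none'}"
        )
    return "\n".join(lines)

state = [Entity("burner-mining-drill", 10, 5), Entity("stone-furnace", 12, 5)]
print(describe(state))
```

The point is that adjacency is computed server-side and handed to the model as text, rather than asking the model to infer it from pixels.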

As Jack said, we completed most of the lab tasks manually ourselves, and while it took us a lot longer than it would have with vision, the tasks were still doable, and human performance is significantly higher than that of current agents. We are thinking of supporting vision in future evals, but in the small number of tests we ran, current models got even more confused, since the number of entities on the map grows quite quickly. This is likely because VLMs are notoriously bad at visual reasoning over images with lots of detail, and in a game where one misplaced entity in a large factory breaks everything, the errors start to compound.


Hammershaft | 11 months ago

Someone below mentioned the ASCII interface for Dwarf Fortress as being ideal for this, and I wonder whether that kind of representation, with a legend, might produce better spatial results. The drawback I see is that elements in Factorio can be layered on a tile, or have properties that are not visually obvious in ASCII, so the LLM would need to be able to introspect on the map.

noddybear | 11 months ago

I think your intuition is correct about the amount of information that needs to be encoded into an ASCII char. You could potentially use Unicode to pack more into each char, e.g. direction, type, status, etc. Or make each representation available on demand, e.g. 'show me the direction of all inserters in a 10 tile radius'.
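Combining the two ideas above, a minimal sketch (all names hypothetical, not part of FLE) of an ASCII grid with a legend plus an on-demand query for properties that don't fit in one character:

```python
# Hypothetical sketch: render a patch of factory as ASCII via a legend,
# and expose hidden per-entity properties (like inserter direction)
# through a separate on-demand query instead of cramming them into chars.

LEGEND = {"inserter": "i", "belt": "=", "furnace": "F", None: "."}

def render(width, height, entities):
    """entities maps (x, y) -> (entity_type, properties_dict)."""
    rows = []
    for y in range(height):
        rows.append("".join(
            LEGEND[entities.get((x, y), (None, {}))[0]] for x in range(width)
        ))
    return "\n".join(rows)

def query_direction(entities, etype, cx, cy, radius):
    """On-demand detail: directions of all `etype` entities within a
    Manhattan radius of (cx, cy) -- e.g. 'all inserters in a 10 tile radius'."""
    return {
        pos: props["direction"]
        for pos, (t, props) in entities.items()
        if t == etype and abs(pos[0] - cx) + abs(pos[1] - cy) <= radius
    }

factory = {
    (1, 0): ("inserter", {"direction": "north"}),
    (2, 0): ("belt", {"direction": "east"}),
    (1, 1): ("furnace", {}),
}
print(render(4, 2, factory))
print(query_direction(factory, "inserter", 1, 1, 10))
```

This keeps the grid legible while letting the model pull extra detail only when it needs it, which sidesteps the layering problem Hammershaft raises.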