(no title)
martbakler | 11 months ago
We designed the API to be as spatially descriptive as possible (include x-y coordinates and neighbors in game state descriptions) and the agents have tools to aid them in carrying out actions which would benefit from vision (i.e find buildable areas on the map with different sizes, placing entities next to other entities etc).
As Jack said, we completed most of the lab tasks manually ourselves and while it took us a lot longer compared to having vision, the tasks were still doable and the human performance is significantly higher than current agents. We are thinking of supporting vision for future evals but from a small number of tests we ran, current models got even more confused as the number of entities on the map grows quite quickly. This is likely due to VLMs being notoriously bad at visual reasoning on images with lots of detail and in a game where one misplaced entity in a large factory breaks everything, the errors start to compound
Hammershaft|11 months ago
noddybear|11 months ago