I see, integrating image inputs can be very challenging in this case since the models work with text input. I wasn't even thinking about the full isometric image, just a simple 2D map where each pixel is color-coded by entity type. I guess the problem is that these maps would look like nothing the models were trained on, so as you say, it might not provide any value.

The reason I suggested this is that I used to work in robotics building RL policies, and supplying image data (maps, lidar scans, etc.) was common practice there. But our networks were custom-made to ingest that data and trained from scratch, which is quite different from this approach.
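For context, the kind of map I mean is cheap to build; a minimal sketch with made-up entity types and a palette of my choosing (not anything from the project being discussed):

```python
import numpy as np

# Hypothetical entity-type IDs, purely for illustration
EMPTY, WALL, AGENT, RESOURCE = 0, 1, 2, 3

# One RGB color per entity type; indexing the palette by ID
# turns an (H, W) grid of IDs into an (H, W, 3) image in one step.
PALETTE = np.array([
    [0, 0, 0],        # EMPTY    -> black
    [128, 128, 128],  # WALL     -> gray
    [255, 0, 0],      # AGENT    -> red
    [0, 255, 0],      # RESOURCE -> green
], dtype=np.uint8)

def encode_map(grid: np.ndarray) -> np.ndarray:
    """Color-code a 2D grid of entity IDs as an RGB image."""
    return PALETTE[grid]

grid = np.array([
    [WALL, WALL, WALL],
    [WALL, AGENT, RESOURCE],
    [WALL, EMPTY, WALL],
])
img = encode_map(grid)
print(img.shape)  # (3, 3, 3)
```

In robotics this kind of raster would then be fed to a small CNN trained from scratch alongside the policy, which is exactly the part that doesn't transfer to a text-only pretrained model.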
martbakler|11 months ago