top | item 45534837

(no title)

legucy | 4 months ago

Could we do RL in simulated environments and use a vision LLM (VLM) as the verifier? I.e., test a policy, then take a 2D image of the end state and have the VLM yield 0 or 1.
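A rough sketch of what that loop might look like, on a toy 1-D "simulator". Everything here is hypothetical: `vlm_verify` is a placeholder where a real vision LLM call would go, the one-row "image" stands in for a rendered frame, and plain hill climbing on a single policy parameter stands in for an actual RL algorithm.

```python
import random

GOAL, HORIZON, WIDTH = 3, 5, 8
TASK = "the robot has reached the right side of the track"  # illustrative prompt

def rollout(p_right, rng):
    """Run one episode in a toy 1-D simulator; return the final position."""
    pos = 0
    for _ in range(HORIZON):
        pos += 1 if rng.random() < p_right else -1
        pos = max(0, min(WIDTH - 1, pos))  # keep the robot on the track
    return pos

def render(pos):
    """Render the end state as a 1 x WIDTH 'image' (1 marks the robot)."""
    return [1 if i == pos else 0 for i in range(WIDTH)]

def vlm_verify(image, task):
    """Placeholder for a VLM call (assumption): 1 if the task looks done, else 0.
    A real system would send the image and the task prompt to a vision LLM."""
    return int(any(image[GOAL:]))

def success_rate(p_right, rng, episodes=200):
    """Average the binary verifier reward over several rollouts."""
    total = sum(vlm_verify(render(rollout(p_right, rng)), TASK)
                for _ in range(episodes))
    return total / episodes

def train(iterations=30, seed=0):
    """Hill-climb the policy parameter using only the 0/1 verifier signal."""
    rng = random.Random(seed)
    p = 0.5
    best = success_rate(p, rng)
    for _ in range(iterations):
        candidate = min(1.0, max(0.0, p + rng.uniform(-0.2, 0.2)))
        score = success_rate(candidate, rng)
        if score >= best:  # keep the candidate if the verifier likes it at least as much
            p, best = candidate, score
    return p, best

if __name__ == "__main__":
    p, best = train()
    print(f"p_right={p:.2f}, verified success rate={best:.2f}")
```

The point of the sketch is that the policy never sees a hand-written reward function, only the verifier's 0 or 1 on the final frame. A real VLM would be noisy and slow, so averaging over rollouts (as `success_rate` does) would probably matter even more.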

Another idea: use a video extension model as a world model. We fine-tune Sora on first-person robot videos (and train a second model to predict actuation states from FPV footage). Then we have Sora extend a video given a prompt like “a robot in first person view finishes moving laundry from washer to dryer”. Could we then predict actuation states from the extended video?
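To make the data flow concrete, here is a toy sketch of that pipeline. Both models are stubs and all names are made up for illustration: `extend_video` stands in for the fine-tuned video model (it is not a real Sora API), `inverse_dynamics` stands in for the FPV-to-actuation model, and a frame is just a one-row list with a 1 marking the robot.

```python
from typing import List

Frame = List[int]  # toy stand-in for one first-person video frame
WIDTH = 8

def make_frame(pos: int) -> Frame:
    """Build a 1 x WIDTH frame with the robot marker at `pos`."""
    return [1 if i == pos else 0 for i in range(WIDTH)]

def extend_video(frames: List[Frame], prompt: str, n_new: int) -> List[Frame]:
    """Placeholder for the fine-tuned video extension model (assumption).
    This stub ignores the prompt and slides the marker one cell right per frame."""
    out = list(frames)
    pos = out[-1].index(1)
    for _ in range(n_new):
        pos = min(WIDTH - 1, pos + 1)
        out.append(make_frame(pos))
    return out

def inverse_dynamics(prev: Frame, nxt: Frame) -> int:
    """Placeholder for the learned FPV-to-actuation model (assumption):
    infer the action that explains the change between two frames."""
    return nxt.index(1) - prev.index(1)

def plan_from_video(frames: List[Frame]) -> List[int]:
    """Recover one action per consecutive pair of frames."""
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

# Usage: start from one observed frame, imagine the rest, then read off actions.
context = [make_frame(0)]
video = extend_video(context, "a robot finishes moving to the right", n_new=4)
actions = plan_from_video(video[len(context) - 1:])
```

Here the recovered `actions` could be replayed on the robot or in simulation. With real models, the imagined video may contain physically impossible motion, so the recovered actions would likely need to be checked before execution.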
