legucy | 4 months ago
Another idea: use a video extension model as a world model. We fine-tune Sora on first-person robot videos, and we train a second model to predict actuation states from first-person view (FPV) footage. Then we have Sora extend the robot's current video with a prompt like “a robot in first person view finishes moving laundry from washer to dryer”. Could we then predict the actuation states from the extended video?
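To make the data flow concrete, here is a rough sketch of the loop in Python. Everything in it is hypothetical: `plan_with_video_model`, `toy_extend_video`, and `toy_inverse_dynamics` are made-up names, the two toy functions are trivial stand-ins for the fine-tuned video model and the FPV-to-actuation model, and it assumes the actuation model looks at pairs of consecutive frames, which the idea above does not specify.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # toy stand-in for an image: a flat list of pixel values
Action = List[float]  # toy stand-in for an actuation state (e.g. joint targets)


def plan_with_video_model(
    observed: Sequence[Frame],
    prompt: str,
    extend_video: Callable[[Sequence[Frame], str, int], List[Frame]],
    inverse_dynamics: Callable[[Frame, Frame], Action],
    horizon: int,
) -> List[Action]:
    """Imagine future frames, then read one action off each frame transition."""
    if not observed:
        raise ValueError("need at least one observed frame to extend from")
    # Step 1: the video model "imagines" how the task finishes.
    future = extend_video(observed, prompt, horizon)
    # Step 2: pair each frame with its successor, starting at the last real frame.
    frames = [observed[-1]] + list(future)
    # Step 3: the second model maps every transition to an actuation state.
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]


def toy_extend_video(observed: Sequence[Frame], prompt: str, horizon: int) -> List[Frame]:
    """Placeholder for the video model: brightens every pixel by 1.0 per step.

    The prompt is ignored here; a real model would condition on it.
    """
    frame = list(observed[-1])
    future = []
    for _ in range(horizon):
        frame = [p + 1.0 for p in frame]
        future.append(frame)
    return future


def toy_inverse_dynamics(before: Frame, after: Frame) -> Action:
    """Placeholder for the actuation model: per-pixel difference between frames."""
    return [b - a for a, b in zip(before, after)]


if __name__ == "__main__":
    observed = [[0.0, 0.0], [0.0, 0.5]]
    actions = plan_with_video_model(
        observed,
        "a robot in first person view finishes moving laundry from washer to dryer",
        toy_extend_video,
        toy_inverse_dynamics,
        horizon=3,
    )
    print(actions)
```

In a real system you would probably execute only the first few predicted actions, capture fresh frames, and re-plan, since the imagined video will drift away from what the robot actually sees.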