quanto | 5 months ago
What's interesting is that the action tokens are learned from video alone. In other words, the training dataset contains no action labels like "go left" or "go right"; the actions are inferred purely from pixel motion. As a result, the learned actions may not map exactly onto the game actions available to the user, so we (humans) cannot necessarily use this world model to play the game.
I suspect the inferred actions correspond directly to human-understandable ones, and after playing with the action tokens, a reasonable human could probably guess what, say, the third token in the dictionary corresponds to ("jump"). This is likely because game actions are sparse (in both time and action space) and often independent/orthogonal (in action space).
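To make the idea concrete, here is a deliberately tiny sketch (not the paper's method, just an illustration with a hypothetical one-sprite "game") of how discrete action tokens can emerge from pixels alone: we render frames, compute per-step pixel displacements, and let the set of distinct motions become the action codebook. The hidden action labels exist only to check the result and are never used for learning.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 8

# Hidden game actions (labels a latent-action model never sees):
# 0 = left, 1 = right, 2 = down.
ACTIONS = [(-1, 0), (1, 0), (0, 1)]

def render(x, y):
    """One-hot frame: a single bright pixel on a SIZE x SIZE torus grid."""
    frame = np.zeros((SIZE, SIZE))
    frame[y % SIZE, x % SIZE] = 1.0
    return frame

# Roll out a toy trajectory; keep pixels, stash hidden actions only for eval.
x, y = 4, 4
frames, hidden = [render(x, y)], []
for _ in range(300):
    a = int(rng.integers(len(ACTIONS)))
    dx, dy = ACTIONS[a]
    x, y = x + dx, y + dy
    hidden.append(a)
    frames.append(render(x, y))

def sprite_xy(frame):
    idx = int(np.argmax(frame))
    return idx % SIZE, idx // SIZE

def wrap(d):
    """Wrap-aware displacement on the torus: maps +7 to -1, etc."""
    return (d + SIZE // 2) % SIZE - SIZE // 2

# "Inverse dynamics" from pixels alone: per-step sprite displacement.
disps = []
for t in range(len(hidden)):
    x0, y0 = sprite_xy(frames[t])
    x1, y1 = sprite_xy(frames[t + 1])
    disps.append((wrap(x1 - x0), wrap(y1 - y0)))
disps = np.array(disps)

# The discrete action codebook emerges as the set of distinct motions.
codebook, tokens = np.unique(disps, axis=0, return_inverse=True)
print("learned codebook:", codebook.tolist())
print("codebook size:", len(codebook))  # 3, matching the hidden actions
```

Here a human could inspect the codebook entries and guess their meanings ((0, 1) is "down", etc.), which is the point above: when actions are sparse and orthogonal, the learned tokens tend to be interpretable. A real model (e.g. a VQ-style latent action model) does something far less trivial, but the labels-free principle is the same.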