(no title)
cs702 | 4 days ago
> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.
I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Obervation (BCO) to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Obervation" is a really good name, with an easy-to-remember acronym.
Thank you for sharing this on HN.
nee1r|4 days ago
cs702|3 days ago
Same here.
The notion of inducing these models to "hypothesize" distributions over possible actions given subsequent observed transitions makes me think of "contrastive divergence," the method Hinton and others came up with for unsupervised training of Restricted Boltzmann Machines (RBMs), in the prehistoric era of deep learning.
Given each training sample, an RBM would 1) execute a forward pass, 2) sample its output units, 3) "hypothesize" its input units, 4) execute another forward pass on the "hypothesized" input units to sample new output units, and (5) compute a type of contrastive error for local backpropagation. RMBs could be stacked, with output units from one becoming input units for the next one. Hinton called the input units "visible," and the output ones "hidden."
It's not the same, obviously, but the idea of modeling machine-generated inputs (or actions) given outputs (or transitions) has always been appealing. It has a long history.