cs702 | 4 days ago

At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:

> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
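To make the quoted figures concrete, here is a rough back-of-the-envelope sketch of the per-frame token cost they imply. It takes the quote's round numbers at face value (a one-million-token budget, 30 FPS, and a full 2 hours where the quote says "nearly"); the function and variable names are made up for illustration. The 50x and 100x ratios depend on baselines the quote does not spell out, so they are not recomputed here.

```python
FPS = 30                  # frame rate stated in the quote
TOKEN_BUDGET = 1_000_000  # "a million tokens"

def tokens_per_frame(seconds_of_video, tokens=TOKEN_BUDGET, fps=FPS):
    """Average number of tokens spent on each video frame."""
    frames = seconds_of_video * fps
    return tokens / frames

# Previous models: one million tokens for one minute of video.
baseline = tokens_per_frame(60)

# The new encoder: the same budget for (nearly) two hours of video.
# Assumption: exactly two hours, so this is a lower bound on the cost.
claimed = tokens_per_frame(2 * 60 * 60)

print(f"baseline: {baseline:.1f} tokens/frame")
print(f"claimed:  {claimed:.2f} tokens/frame")
```

Under these assumptions the per-frame cost drops from several hundred tokens to a handful.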

While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.

I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.

Thank you for sharing this on HN.

[a] https://arxiv.org/abs/1805.01954

nee1r | 4 days ago

yeah! i love the BCO paper, i think it's extremely intuitive, and these methods are really interesting at a time when data without labels is abundant. i especially like the idea of iteratively making the inverse dynamics better, might lean closer to that in the future
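For readers who have not seen the inverse-dynamics idea, here is a minimal toy sketch of the label-then-clone recipe discussed in this thread. It is not the OP's or the BCO paper's implementation: the 1-D world, `ACTIONS`, `GOAL`, `expert_action`, and the tabular `fit_majority` "model" are all hypothetical stand-ins chosen so the example runs with the standard library only.

```python
from collections import Counter, defaultdict

ACTIONS = (-1, 0, 1)   # hypothetical action set: step left, stay, step right
GOAL = 5               # hypothetical target state for the toy expert

def expert_action(state):
    """Toy expert that steps toward GOAL (stands in for a human user)."""
    return (GOAL > state) - (GOAL < state)

def fit_majority(pairs):
    """Tabular classifier: map each key to its most frequent label."""
    counts = defaultdict(Counter)
    for key, label in pairs:
        counts[key][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# 1) Small labeled set: (state, next_state, action) triples.
labeled = [(s, s + a, a) for s in range(11) for a in ACTIONS]

# 2) Inverse dynamics model: predict the action from the observed change.
idm = fit_majority((s2 - s, a) for s, s2, a in labeled)

# 3) Large unlabeled set: expert state sequences with no actions recorded.
videos = []
for start in range(11):
    state, frames = start, [start]
    for _ in range(12):
        state += expert_action(state)
        frames.append(state)
    videos.append(frames)

# 4) Pseudo-label every transition with the IDM, then clone the behavior.
pseudo = [(s, idm[s2 - s]) for frames in videos for s, s2 in zip(frames, frames[1:])]
policy = fit_majority(pseudo)

print(policy)
```

This is a single pass. An iterative variant would, presumably, collect new transitions with the learned policy and repeat steps 2 to 4 to refine the inverse dynamics model.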

cs702 | 3 days ago

> i especially like the idea of iteratively making the inverse dynamics better

Same here.

The notion of inducing these models to "hypothesize" distributions over possible actions given subsequent observed transitions makes me think of "contrastive divergence," the method Hinton and others came up with for unsupervised training of Restricted Boltzmann Machines (RBMs), in the prehistoric era of deep learning.

Given each training sample, an RBM would 1) execute a forward pass, 2) sample its output units, 3) "hypothesize" its input units, 4) execute another forward pass on the "hypothesized" input units to sample new output units, and 5) compute a type of contrastive error for a local weight update. RBMs could be stacked, with output units from one becoming input units for the next one. Hinton called the input units "visible," and the output ones "hidden."

It's not the same, obviously, but the idea of modeling machine-generated inputs (or actions) given outputs (or transitions) has always been appealing. It has a long history.
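The five RBM steps described above can be sketched as one-step contrastive divergence (CD-1) for a tiny binary RBM. This is a minimal illustration, not a tuned implementation: the layer sizes, learning rate, toy data, and all function names are assumptions, and the final hidden pass uses probabilities rather than samples, which is a common simplification.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sample(probs, rng):
    """Draw a binary unit for each activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, W, b_h):
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]

def visible_probs(h, W, b_v):
    return [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_v))]

def cd1_step(v0, W, b_v, b_h, rng, lr=0.1):
    """One CD-1 update in place; returns the squared reconstruction error."""
    ph0 = hidden_probs(v0, W, b_h)    # 1) forward pass on the data
    h0 = sample(ph0, rng)             # 2) sample the hidden units
    pv1 = visible_probs(h0, W, b_v)   # 3) "hypothesize" the visible units
    v1 = sample(pv1, rng)
    ph1 = hidden_probs(v1, W, b_h)    # 4) forward pass on the hypothesis
    # 5) local update: data statistics minus reconstruction statistics
    for i in range(len(v0)):
        for j in range(len(ph0)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_v[i] += lr * (v0[i] - v1[i])
    for j in range(len(ph0)):
        b_h[j] += lr * (ph0[j] - ph1[j])
    return sum((a - p) ** 2 for a, p in zip(v0, pv1))

rng = random.Random(0)
n_visible, n_hidden = 6, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b_v, b_h = [0.0] * n_visible, [0.0] * n_hidden
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]  # two toy patterns

for epoch in range(200):
    for v in data:
        err = cd1_step(v, W, b_v, b_h, rng)
print(f"last reconstruction error: {err:.3f}")
```

Note that the update in step 5 only uses quantities local to each weight (the activities of the two units it connects), with no error signal propagated through layers.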