danijar's comments

danijar | 11 months ago | on: DeepMind program finds diamonds in Minecraft without being taught

For a lot of things, VLMs are good enough already to provide rewards. Give them the recent images and a text description of the task and ask whether the task was accomplished or not.
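A minimal sketch of that idea, with `query_vlm` as a hypothetical stand-in for a real vision-language model call (here stubbed out so the surrounding logic runs):

```python
def query_vlm(images, prompt):
    # Placeholder: a real implementation would send the frames and
    # prompt to a VLM (e.g. via an API) and return its text answer.
    return "yes"

def vlm_reward(recent_images, task_description):
    """Ask the VLM whether the task was accomplished; map its
    yes/no answer to a scalar reward."""
    prompt = (
        f"Task: {task_description}\n"
        "Based on the recent frames, was the task accomplished? "
        "Answer yes or no."
    )
    answer = query_vlm(recent_images, prompt).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0

print(vlm_reward([], "chop down a tree"))  # -> 1.0 with the stub above
```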

For a more general system, you can annotate videos with text descriptions of all the tasks that have been accomplished and when, then train a reward model on those to later RL against.
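The annotation step could be sketched like this; the function name and the one-hot target format are illustrative assumptions, not from any particular paper:

```python
def make_reward_targets(num_frames, annotations):
    """Turn timestamped task annotations into per-task training
    targets for a reward model.

    annotations: list of (task, frame_index) pairs marking when each
    task was accomplished in the video. Returns, per task, a 0/1
    target sequence with a 1 at the completion frame.
    """
    targets = {}
    for task, frame in annotations:
        seq = [0.0] * num_frames
        seq[frame] = 1.0
        targets[task] = seq
    return targets

targets = make_reward_targets(5, [("mine iron ore", 3)])
# targets["mine iron ore"] == [0.0, 0.0, 0.0, 1.0, 0.0]
```

A reward model trained on such (frames, task, target) triples can then score new frames for any annotated task at RL time.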

danijar | 11 months ago | on: DeepMind program finds diamonds in Minecraft without being taught

I think learning to hold a button down in itself isn't too hard for a human or robot that's been interacting with the physical world for a while and has learned all kinds of skills in that environment.

But for an algorithm learning from scratch in Minecraft, it's more like having to guess the cheat code for a helicopter in GTA; it's not something you'd stumble upon unless you have prior knowledge/experience.

Obviously, pretraining world models for common-sense knowledge is another important research frontier, but that's for another paper.

danijar | 11 months ago | on: DeepMind program finds diamonds in Minecraft without being taught

When it dies it loses all items and the world resets to a new random seed. It learns to stay alive quite well but sometimes falls into lava or gets killed by monsters.

It only gets a +1 for the first iron pickaxe it makes in each world (same for all other items), so it can't hack rewards by repeating a milestone.
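That first-time-only milestone scheme could look roughly like this (a sketch, not the paper's actual implementation):

```python
class MilestoneRewards:
    """Give +1 only the first time each milestone item is obtained;
    repeats pay nothing, and the set resets with the world."""

    def __init__(self, milestones):
        self.milestones = set(milestones)
        self.reset()

    def reset(self):
        # Called when the agent dies and the world is regenerated.
        self.achieved = set()

    def reward(self, item):
        if item in self.milestones and item not in self.achieved:
            self.achieved.add(item)
            return 1.0
        return 0.0

r = MilestoneRewards(["iron_pickaxe", "diamond"])
print(r.reward("iron_pickaxe"))  # 1.0 the first time
print(r.reward("iron_pickaxe"))  # 0.0 on repeats
```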

Yeah, it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.
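A toy illustration of imagining many scenarios in parallel: roll a learned dynamics model forward from the current state under different action sequences and score them, without touching the real environment. The dynamics and reward functions below are made-up stand-ins, not Dreamer's actual model:

```python
import random

def dynamics(state, action):
    # Stand-in for a learned latent transition model.
    return state + action

def reward_model(state):
    # Stand-in for a learned reward predictor; sparse like the task.
    return 1.0 if state >= 3 else 0.0

def imagine(state, horizon=5, num_rollouts=8):
    """Imagine num_rollouts trajectories of length horizon from one
    real state and return each trajectory's predicted return."""
    returns = []
    for _ in range(num_rollouts):
        s, total = state, 0.0
        for _ in range(horizon):
            a = random.choice([-1, 0, 1])
            s = dynamics(s, a)
            total += reward_model(s)
        returns.append(total)
    return returns

# Most imagined trajectories see zero reward, but a few discover
# the sparse +1 region, which gives the policy something to learn from.
print(imagine(0))
```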

danijar | 11 months ago | on: DeepMind program finds diamonds in Minecraft without being taught

Hi, author here! Dreamer learns to find diamonds from scratch by interacting with the environment, without access to external data. So there are no explainer videos or internet text here.

It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2

danijar | 7 years ago | on: PlaNet: A Deep Planning Network for Reinforcement Learning

Author here. First of all, I'd like to clarify that the data efficiency gain over D4PG is 5000% or 50x.

Regarding computational efficiency, we match D4PG, a top model-free agent that uses experience replay among other techniques (actor critic, distributional loss, n-step returns, prioritized replay, distributed experience collection).

Your point about exposure bias is interesting, and applies equally to agents that do not learn a model. Personally, I think we need reliable uncertainty estimates in neural networks to make progress on this research question, so the agent can know what it doesn't know.

Hindsight experience replay doesn't apply to tasks where the inputs are images because it requires knowledge of a meaningful goal space with a distance function (e.g. 2D coordinates of goal positions).
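To see why, here's HER's relabeling step in miniature: recomputing the reward for a relabeled transition needs a distance function over goals, which is well-defined for 2D positions but unavailable for raw image observations. Names below are illustrative:

```python
import math

def goal_distance(g1, g2):
    # A metric over goals; easy for 2D coordinates like these,
    # but there is no obvious analogue for raw pixels.
    return math.dist(g1, g2)

def her_relabel(state, action, achieved_goal, eps=0.5):
    """Pretend the achieved goal was the desired one all along and
    recompute the reward, which requires goal_distance."""
    new_goal = achieved_goal
    reward = 1.0 if goal_distance(achieved_goal, new_goal) <= eps else 0.0
    return state, action, reward, new_goal

# Relabeling trivially turns the transition into a success.
print(her_relabel("s0", "a0", (1.0, 2.0)))  # reward is 1.0
```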
