naed90|1 year ago
Hey, developer of Oasis here! You are very correct. Here are a few points:
1. We trained the model with a context window as long as 30 seconds. The problem? It barely pays any attention to frames beyond the past few. This makes sense: it comes down to the loss function used during training. We're now running many different training runs to experiment with a better loss function (and datamix) to fix this. You'll see newer versions soon! (There's a toy sketch of one possible loss tweak after point 2.)
2. In the long term, we believe the "ultimate" solution is 2 models: one model that maintains game state + one model that turns that state into pixels. Think of the first model as something closer to an LLM that takes the current state + user action and produces the new state, and the second as a diffusion model that maps that state to pixels. This would give us the best of both worlds (rough interface sketch below).
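To make point 1 a bit more concrete, here is a minimal toy sketch of one way a per-frame loss could be reweighted so that far-horizon frames (the ones that actually require long-range memory) count for more. The shapes, the weighting scheme, and the plain MSE are all stand-ins, not our actual diffusion objective:

    import torch

    def long_horizon_weighted_loss(pred, target, alpha=0.05):
        # pred, target: (batch, frames, C, H, W); frame 0 is the first
        # frame after the conditioning window.
        # The weight grows with the horizon, so errors on far-future
        # frames (which require long-range memory) cost more than
        # errors on the very next frame.
        f = pred.shape[1]
        weights = 1.0 + alpha * torch.arange(f, dtype=pred.dtype, device=pred.device)
        per_frame = ((pred - target) ** 2).flatten(2).mean(dim=2)  # (batch, frames)
        return (per_frame * weights / weights.sum()).sum(dim=1).mean()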
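And for point 2, a rough sketch of the two-model interface. Module names, dimensions, and the toy networks are purely illustrative; the real decoder would be a diffusion model, not a single linear layer:

    import torch
    import torch.nn as nn

    class StateModel(nn.Module):
        # LLM-like world model: (current state, user action) -> next state.
        def __init__(self, state_dim=1024, action_dim=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, 2048),
                nn.GELU(),
                nn.Linear(2048, state_dim),
            )

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))

    class PixelDecoder(nn.Module):
        # Stand-in for the diffusion model: latent game state -> frame.
        def __init__(self, state_dim=1024, h=64, w=64):
            super().__init__()
            self.h, self.w = h, w
            self.net = nn.Linear(state_dim, 3 * h * w)

        def forward(self, state):
            return self.net(state).view(-1, 3, self.h, self.w)

    # One simulation step: the state model carries the game logic and
    # memory; the decoder only has to render whatever state it is handed.
    state_model, decoder = StateModel(), PixelDecoder()
    state, action = torch.zeros(1, 1024), torch.zeros(1, 32)
    state = state_model(state, action)   # update world state
    frame = decoder(state)               # render pixels for this step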
throwaway314155|1 year ago
naed90|1 year ago
The nice thing is that we can run tons of experiments at once. For Oasis v1, we ran over 1000 experiments (each an end-to-end training run of a 500M model) on the model arch, datamix, etc., before we created the final checkpoint that's deployed on the site. At Decart (we just came out of stealth yesterday: https://www.theinformation.com/articles/why-sequoias-shaun-m...) we have 2 teams: Decart Infrastructure and Decart Experiences. The first team builds insanely fast infra for training/inference (rewriting everything from scratch, from the CUDA level up to the Python garbage collector) -- with it we can get a 500M model to converge in ~20h of training instead of 1-2 weeks. Decart Experiences then uses this infra to create these new types of end-to-end "Generated Experiences".
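(For a flavor of the garbage-collector bit -- this is a generic trick, not our internal code: in a tight training loop you don't want the Python GC pausing you at arbitrary points, so you freeze long-lived objects and collect only at controlled moments.)

    import gc

    def train(num_steps, run_step):
        # run_step(i) is whatever executes one optimizer step.
        gc.collect()   # clean up before the hot loop
        gc.freeze()    # move surviving objects out of future collections
        gc.disable()   # no automatic collections mid-step
        try:
            for i in range(num_steps):
                run_step(i)
                if i % 1000 == 0:
                    gc.collect()   # explicit, predictable collection point
        finally:
            gc.enable()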