This might be a dumb question to ask, but what exactly is this useful for? B-Roll for YouTube videos? I'm not sure why so much effort is being put into something like this when the applications are so limited.
If you want to train a model to have a general understanding of the physical world, one way is to show it videos and ask it to predict what comes next, and then evaluate it on how close it was to what actually came next.
To really do well on this task, the model basically has to understand physics, and human anatomy, and all sorts of cultural things. So you're forcing the model to learn all these things about the world, but it's relatively easy to train because you can just collect a lot of videos and show the model parts of them -- you know what the next frame is, but the model doesn't.
Along the way, this also creates a video generation model - but you can think of this as more of a nice side effect rather than the ultimate goal.
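The training setup described above can be sketched in a few lines. This is a toy illustration of the next-frame-prediction objective, not any particular lab's pipeline; the helper names and the "video" data are made up for the example:

```python
import numpy as np

def make_training_pairs(video, context_len=4):
    """Slice a video of shape (T, H, W) into (context, next_frame) pairs.

    The trainer knows frame t; the model is only shown frames up to t-1,
    which is exactly the "you know the next frame, the model doesn't" setup.
    """
    pairs = []
    for t in range(context_len, len(video)):
        pairs.append((video[t - context_len:t], video[t]))
    return pairs

def next_frame_loss(predicted, actual):
    """Mean squared error: how close was the guess to what actually came next?"""
    return float(np.mean((predicted - actual) ** 2))

# Toy "video": 10 frames of 8x8 pixels, drifting brighter over time.
video = np.stack([np.full((8, 8), t / 10.0) for t in range(10)])
pairs = make_training_pairs(video)

# A naive baseline "model": predict that the next frame repeats the last one.
context, target = pairs[0]
loss = next_frame_loss(context[-1], target)
```

A real video model would replace the naive baseline with a network trained to drive this loss down, which is what forces it to pick up physics, anatomy, and so on.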
It doesn’t have to understand anything; none of these demonstrate reasoning or understanding.
All these models have just “seen” enough videos of all those things to build a probability distribution for predicting the next step.
That’s not bad, nor does it make them inherently dumb; a major component of human intelligence is built on similar strategies.
I couldn’t tell you which grammatical rules are broken in a text, or which physical rules in a photograph, but I can tell something is wrong using the same methods.
Inference can take you far with large enough data sets, but sooner or later, without reasoning, you will hit a ceiling.
This is true for humans as well: plenty of people go far in life with just memorization and replication, doing a lot of jobs fairly competently, but it doesn't work for everything.
Reasoning is essential for higher-order functions, and transformers are not the path to that.
Back when computers took up a whole room, you'd also have asked: "but what exactly is this useful for? Some simple calculations that anybody can do with a piece of paper and a pen?"
Think 5-10 years into the future; this is a stepping stone.
That's comparing apples to oranges though, isn't it? Generating videos is the output of the technology, not the tech itself. It would be like someone asking "this computer that takes up a whole room printed out ASCII art, what is this useful for?"
This is kind of an unfair comparison. What's the endpoint of generating AI videos? What can this do that is useful, contributes something to society, has artistic value, etc.? We can make educational videos with a script, but it's also pretty easy for motivated parties to do that already, and it's getting easier as cameras get better and smaller. I think asking "what's the point of this?" is at least fair.
We're preparing to use video generation (specifically image+text => video so we can also include an initial screenshot of the current game state for style control) for generating in-game cutscenes at our video game studio. Specifically, we're generating them at play-time in a sandbox-like game where the game plays differently each time, and therefore we don't want to prerecord any cutscenes.
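As a rough sketch of what that image+text => video call could look like at play-time: the request shape below is purely hypothetical (no real vendor's API is implied), just to show how a current-game-state screenshot and a prompt would travel together:

```python
import base64
import json

def build_cutscene_request(screenshot_png: bytes, prompt: str, seconds: int = 6):
    """Assemble an image+text => video request.

    Field names are illustrative only; a real service would define its own
    schema. The screenshot conditions the generator on the game's current
    visual style, and the prompt describes what should happen in the cutscene.
    """
    return {
        "prompt": prompt,
        "duration_seconds": seconds,
        # Binary image data is typically base64-encoded for a JSON payload.
        "init_image": base64.b64encode(screenshot_png).decode("ascii"),
    }

# A captured frame of the current game state would go here.
req = build_cutscene_request(b"\x89PNG...", "the hero walks into the ruined temple at dusk")
payload = json.dumps(req)
```

The latency and cost of a round trip like this at play-time is exactly the open question raised in the reply below.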
Okay, so is the aim to run this locally on a client's computer, or serve it from the cloud? How does the math work out so that it's not just easier at that point to render it in-game?
In its current state, it's already useful for B-roll, video backgrounds for websites, and any other sort of "generic" application where the point of the shot is just to establish mood and fill time.
But more than anything, it's useful as a stepping stone to more full-featured video generation that can maintain characters and story across multiple scenes. It seems clear that at some point tools like this will be able to generate full videos, not just shots.
This is a first step towards "the holodeck". You describe a scene and it exists. Imagine you could jump in and interact with it. That seems like something that could happen in 10-20 years.
You and your friends gather around the TV to watch a video about the time that you all traveled abroad and met a mysterious stranger. In the film, you witness each other take incredible risks, have intimate private conversations, and change in profound ways. Of course none of it actually happened; your voices and likenesses were fed into the movie generator. And did I mention in the film you’re driving expensive cars and wearing designer clothes?
Are they that limited? It's a machine that can make videos from user input: it can ostensibly be used wherever you need video, including for creative, technical and professional applications.
Now, it may not be the best fit for those yet due to its limitations, but you've gotta walk before you can run: compare Stable Diffusion 1.x to FLUX.1 with ControlNet to see where quality and controllability could head in the future.
Because it's pretty cool to be able to imagine any kind of scene in your head, put it into words, then see it be made into a video file that you can actually see and share and refine.
It's got a lot of potential as a way for Google to get paid for other people's skills and hard work, instead of the people who made all of that "data".
It’s kind of hilarious that anybody considers this “democratizing” media creation. How many people that need a video clip are going to be capable of running an open version of this themselves? The wonky “open” models aren’t even close. How much do you think these services are going to cost once the introductory period financed by race-to-the-bottom money stops? OpenAI already charges $200/mo if you want to be guaranteed more than 30-60 minutes of Advanced Voice.

The introductory period exists solely to get people engaged enough to push through blatantly stealing millions of artists’ creative output, so they can have a beautiful tool they sell to Hollywood for a whole lot of money that’s still less than traditional VFX. Meanwhile, everyone else gets to dink around in the useless free models or too-expensive-for-most prosumer tools, people with expensive video card arrays or the functional equivalent will still be niche tinkering hobbyists with inferior tooling and models, and the skilled commercial artists still employed are being paid shit because of market forces. Great job, SV. Making the world a better place.
aenvoker|1 year ago
https://www.reddit.com/r/aivideo/comments/1hbnyi2/comment/m1...
Another, more serious music video also made entirely by one person: https://www.youtube.com/watch?v=pdqcnRGzH5c. Don't know how long it took, though.
yieldcrv|1 year ago
My templates are all waiting for stock videos to be added, looping in the background.
You have no idea how cool I am with the lack of copyright protection afforded to these videos I will generate; I'm making my money other ways.