In the video towards the bottom of the page, there are two birds (blue jays), but in the background there are two identical buildings (which look a lot like the CN Tower). CN Tower is the main landmark of Toronto, whose baseball team happens to be the Blue Jays. It's located near the main sportsball stadium downtown.
I vaguely understand how text-to-image works, so it makes sense that the vector space for "blue jays" would be near "toronto" or "cn tower". The improvements in scale and speed (image -> now video) are impressive, but given how incredibly capable the image generation models are, they simultaneously feel crippled and limited by their lack of editing / iteration ability.
Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
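On the "blue jays" / Toronto point, that kind of association is easy to sanity-check by comparing text embeddings directly. A rough sketch using CLIP's text encoder (the checkpoint is just a common public one, not necessarily what this model uses):

    import torch
    from transformers import CLIPModel, CLIPTokenizer

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    texts = ["blue jays", "toronto", "cn tower", "a bicycle"]
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    print(emb @ emb.T)  # pairwise cosine similarities between the phrases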
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
I feel like we're close too, but for another reason.
Although I love SD and these video examples are great, it's a flawed method: the lighting is never quite right and there are incoherent details just about everywhere. Any 3D artist or photographer can spot that immediately.
However, I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, light sources set up, etc.
That scene will be sent into Blender, you'll click a button, and you'll get an actual rendering made by Blender, with correct lighting.
Wanna move that bicycle? Move it in the 3D scene exactly where you want.
That is coming.
And it's the same for audio: why generate a finished audio file when models will soon be able to generate the individual tracks, with all the instruments and whatnot, from which the final audio file can be created?
That is coming too.
I recently tried to generate clip art for a presentation using GPT-4/DALL-E 3. I found it could handle some updates but the output generally varied wildly as I tried to refine the image. For instance, I'd have a cartoon character checking its watch and also wearing a pocket watch. Trying to remove the pocket watch resulted in an entirely new cartoon with little stylistic continuity to the first.
Also, I originally tried to get the 3 characters in the image to be generated simultaneously, but eventually gave up as DALL-E had a hard time understanding how I wanted them positioned relative to each other. I just generated 3 separate characters and positioned them in the same image using Gimp.
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
Emu can do that.
The blue jay/Toronto thing may be addressable later (I suspect via more detailed annotations, a la DALL-E 3). These current video models are highly focused on figuring out temporal coherence.
Do the parameters think that Jazz musicians are mormon? Padres often surf? Wizards like the Lincoln Memorial?
Adobe is doing some great work here in my opinion in terms of building AI tools that make sense for artist workflows. This "sneak peek" demo from the recent Adobe Max conference is pretty much exactly what you described, actually better because you can just click on an object in the image and drag it.
See video: https://www.adobe.com/max/2023/sessions/project-stardust-gs6...
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
Nearly all of the available models have this, even the highly commercialized ones like Adobe Firefly and Canva; it's called inpainting in most tools.
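For anyone who hasn't tried it, a minimal inpainting sketch with the diffusers library (checkpoint and file names are placeholders): you mask the region you want changed and re-prompt just that area. "Move the bicycle" then becomes erase-it-here, paint-it-there rather than a single instruction.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("photo.png").convert("RGB")
    mask = Image.open("bicycle_mask.png").convert("RGB")  # white = region to repaint

    result = pipe(prompt="empty street, no bicycle",
                  image=image, mask_image=mask).images[0]
    result.save("edited.png")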
As for your last question, yes, that exists. There are two models from Meta that do exactly this, instruction-based iteration on photos, Emu Edit[0], and videos, Emu Video[1].
There's also LLaVa-interactive[2] for photos where you can even chat with the model about the current image.
[0]: https://emu-edit.metademolab.com/
[1]: https://emu-video.metademolab.com/
[2]: https://llava-vl.github.io/llava-interactive/
> they simultaneously feel crippled and limited by their lack of editing / iteration ability.
Yeah. They're not "videos" so much as images that move around a bit.
This doesn't really look any better than those Midjourney + RunwayML videos we had half a year ago.
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
Google has a model called Phenaki that supposedly allows for that kind of stuff. But the public can't use it so it's hard to say how good it actually is.
Have you seen fal.ai/dynamic, where you can perform image-to-image synthesis (basically editing an existing image with the help of the diffusion process) using LCMs to provide a real-time UI?
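A rough sketch of the loop such a UI wraps, assuming the usual diffusers setup with an LCM-LoRA (the checkpoint and LoRA names are common public ones, used here as an assumption about the setup): with only a few inference steps per edit, the round trip is fast enough to feel interactive.

    import torch
    from PIL import Image
    from diffusers import AutoPipelineForImage2Image, LCMScheduler

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # LCM-LoRA

    source = Image.open("rough_sketch.png").convert("RGB")
    out = pipe(
        prompt="a red bicycle leaning against a brick wall",
        image=source,
        num_inference_steps=4,  # LCMs need very few steps, hence the near-real-time UI
        strength=0.5,
        guidance_scale=1.0,
    ).images[0]
    out.save("edited.png")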
I don’t spend a lot of time keeping up with the space, but I could have sworn I’ve seen a demo that allowed you to iterate in the way you’re suggesting. Maybe someone else can link it.
I also wonder if the model takes capitalization into account. Capitalized "Blue Jays" seems more likely to reference the sports team; the birds would be lowercase.
Also, maybe you can't edit post facto, but when you give prompts, would you not be able to say: two blue jays but no CN Tower?
This is not the flex you think it is. You don't have to like sports, but snarking on people who do doesn't make you intellectual, it just makes you come across as a douchebag, no different than a sports fan making fun of "D&D nerds" or something.
The rate of progress in ML this past year has been breathtaking.
I can't wait to see what people do with this once ControlNet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of Stable Diffusion typically involves lots of manual post-processing to remove flicker.
Right now, AnimateDiff is leading the way in consistency, but I'm really excited to see what people will do with this new model.
The main utility will be misinformation.
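On the flicker point above: the crudest form of that post-processing is plain temporal blending of the decoded frames. A toy sketch for illustration only (real deflicker pipelines usually use optical-flow-guided smoothing instead):

    import numpy as np
    from PIL import Image

    def deflicker(frames, alpha=0.6):
        """Naive temporal smoothing: blend each frame with an exponential
        moving average of the preceding frames to damp frame-to-frame flicker."""
        smoothed, ema = [], None
        for f in frames:
            x = np.asarray(f, dtype=np.float32)
            ema = x if ema is None else alpha * ema + (1 - alpha) * x
            smoothed.append(Image.fromarray(ema.astype(np.uint8)))
        return smoothed

    frames = [Image.open(f"frame_{i:04d}.png") for i in range(120)]
    for i, f in enumerate(deflicker(frames)):
        f.save(f"smooth_{i:04d}.png")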
I understand the magnitude of the innovation going on here. But it still feels like we are generating these videos with both hands tied behind our backs. In other words, it's nearly impossible to edit the videos under these constraints. (Imagine trying to edit the blue jays to get the perfect view.)
Since videos are rarely consumed raw, what if this became a pipeline into Blender instead (Blender, the 3D software)? The video then becomes a complete scene with all the key elements of the text input animated. You have your textures, your animation, your camera, and all the objects in place. We could even have the render engine in the pipeline to increase the speed of video generation.
It may sound like I'm complaining, but I'm just making a feature request...
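If generation produced an actual scene, the "move the bicycle" edit really would be a one-liner in Blender's Python API. A minimal sketch (the object name "Bicycle" is hypothetical, whatever the generation step happened to call it):

    # Run inside Blender (bpy is Blender's built-in Python API).
    import bpy

    bike = bpy.data.objects["Bicycle"]  # hypothetical object from the generated scene

    # "Move the bicycle to the left": edit the scene directly, no re-prompting.
    bike.location.x -= 2.0
    bike.keyframe_insert(data_path="location", frame=1)

    # Render with Cycles so the lighting stays physically consistent.
    scene = bpy.context.scene
    scene.render.engine = "CYCLES"
    scene.render.filepath = "//render_0001.png"
    bpy.ops.render.render(write_still=True)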
What would solve all these issues is full generation of 3D models that we hopefully get a chance to see over the next decade. I’ve been advocating for a solid LiDAR camera on the iPhone so there is a lot of training data for these LLMs.
The LICENSE is a special non-commercial one: https://huggingface.co/stabilityai/stable-video-diffusion-im...
It's unclear how exactly to run it easily: diffusers has video generation support now but need to see if it plugs in seamlessly.
I'm still puzzled as to how these "non-commercial" model licenses are supposed to be enforceable. Software licenses govern the redistribution of the software, not products produced with it. An image isn't GPL'd because it was produced with GIMP.
The license is a contract that allows you to use the software provided you fulfill some conditions. If you do not fulfill the conditions, you have no right to a copy of the software and can be sued. This enforcement mechanism is the same whether the conditions are that you include source code with copies you redistribute, or that you may only use it for evil, or that you must pay a monthly fee. Of course this enforcement mechanism may turn out to be ineffective if it's hard to discover that you're violating the conditions.
It doesn't have to be enforceable. This licensing model works exactly the same as Microsoft Windows or WinRAR licensing. Lots of people have pirated Windows or just bought cheap keys off eBay, but no one in their right mind would use anything like that at their company.
In the same way, you can easily violate the "non-commercial" clause of a model like this one as a private person or a tiny startup, but a company that decides to use it for its business will more likely just go and pay.
So it's possible to ignore the license, but the legal and financial risks are not worth it for businesses.
Visual Studio Community (and many other products) only allows "non-commercial" usage. Sounds like it limits what you can do with what you produce with it.
At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.
As an example, see the Creative Commons license, ShareAlike clause:
> If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
So, there's a few different things interacting here that are a little confusing.
First off, you have copyright law, which grants monopolies on the act of copying to the creators of the original. In order to legally make use of that work you need to either have permission to do so (a license), or you need to own a copy of the work that was made by someone with permission to make and sell copies (a sale). For the purposes of computer software, you will almost always get rights to the software through a license and not a sale. In fact, there is an argument that usage of computer software requires a license and that a sale wouldn't be enough because you wouldn't have permission to load it into RAM[0].
Licenses are, at least under US law, contracts. These are Turing-complete priestly rites written in a special register of English that legally bind people to do or not do certain things. A license can grant rights, or, confusingly, take them away. For example, you could write a license that takes away your fair use rights[1], and courts will actually respect that. So you can also have a license that says you're only allowed to use software for specific listed purposes but not others.
In copyright you also have the notion of a derivative work. This was invented whole-cloth by the US Supreme Court, who needed a reason to prosecute someone for making a SSSniperWolf-tier abridgement[2] of someone else's George Washington biography. Normal copyright infringement is evidenced by substantial similarity and access: i.e. you saw the original, then you made something that's nearly identical, ergo infringement. The law regarding derivative works goes a step further and counts hypothetical works that an author might make - like sequels, translations, remakes, abridgements, and so on - as requiring permission in order to make. Without that permission, you don't own anything and your work has no right to exist.
The GPL is the anticopyright "judo move", invented by a really ornery computer programmer that was angry about not being able to fix their printer drivers. It disclaims almost the entire copyright monopoly, but it leaves behind one license restriction, called a "copyleft": any derivative work must be licensed under the GPL. So if you modify the software and distribute it, you have to distribute your changes under GPL terms, thus locking the software in the commons.
Images made with software are not derivative works of the software, nor do they contain a substantially similar copy of the software in them. Ergo, the GPL copyleft does not trip. In fact, even if it did trip, your image is still not a derivative work of the software, so you don't lose ownership over the image because you didn't get permission. This also applies to model licenses on AI software, inasmuch as the AI companies don't own their training data[3].
However, there's still something that licenses can take away: your right to use the software. If you use the model for "commercial" purposes - whatever those would be - you'd be in breach of the license. What happens next is also determined by the license. It could be written to take away your noncommercial rights if you breach the license, or it could preserve them. In either case, however, the primary enforcement mechanism would be a court of law, and courts usually award money damages. If particularly justified, they could demand you destroy all copies of the software.
If it went to SCOTUS (unlikely), they might even decide that images made by software are derivative works of the software after all, just to spite you. The Betamax case said that advertising a copying device with potentially infringing scenarios was fine as long as that device could be used in a non-infringing manner, but then the Grokster case said it was "inducement" and overturned it. Static, unchanging rules are ultimately a polite fiction, and the law can change behind your back if the people in power want or need it to. This is why you don't talk about the law in terms of something being legal or illegal, you talk about it in terms of risk.
[0] Yes, this is a real argument that courts have actually made. Or at least the Ninth Circuit.
The actual facts of the case are even more insane - basically a company trying to sue former employees for fixing its customers' computers. Imagine if Apple sued Louis Rossmann for pirating macOS every time he turned on a customer laptop. The only reason why they can't is because Congress actually created a special exemption for computer repair and made it part of the DMCA.
[1] For example, one of the things you agree to when you buy Oracle database software is to give up your right to benchmark the software. I'm serious! The tech industry is evil and needs to burn down to the ground!
[2] They took 300 pages worth of material from 12 books and copied it into a separate, 2 volume work.
[3] Whether or not copyright on the training data images flows through to make generated images a derivative work is a separate legal question in active litigation.
> An image isn't GPL'd because it was produced with GIMP.
That's because of how the GPL is written, not because of some limitation of software licences.
It makes me think of the difference between ancestral and non-ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler, the output is somewhat deterministic and doesn't vary with increasing sampling steps, but with Ancestral, noise is added to each step which creates more variety but is more random/stochastic.
I assume to create video, the sampler needs to lean heavily on the previous frame while injecting some kind of sub-prompt, like rotate <object> to the left by 5 degrees, etc. I like the phrase another commenter used, "temporal consistency".
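In diffusers terms that's just a choice of scheduler on the same pipeline; a minimal comparison sketch (checkpoint name is only an example):

    import torch
    from diffusers import (StableDiffusionPipeline, EulerDiscreteScheduler,
                           EulerAncestralDiscreteScheduler)

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    prompt = "two blue jays on a branch"

    # Euler: deterministic given the seed; more steps mostly refine the same image.
    pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
    g = torch.Generator("cuda").manual_seed(42)
    img_euler = pipe(prompt, num_inference_steps=30, generator=g).images[0]

    # Ancestral: fresh noise is injected at every step, so the result keeps drifting.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    g = torch.Generator("cuda").manual_seed(42)
    img_ancestral = pipe(prompt, num_inference_steps=30, generator=g).images[0]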
Edit: Indeed the special sauce is "temporal layers". [0]
> Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets
[0] https://stability.ai/research/stable-video-diffusion-scaling...
The hardest problem the Stable Diffusion community has dealt with in terms of quality has been in the video space, largely in relation to the consistency between frames. It's probably the most commonly discussed problem for example on r/stablediffusion. Temporal consistency is the popular term for that.
So this example was posted an hour ago, and it's jumping all over the place frame to frame (somewhat weak temporal consistency). The author appears to have used pretty straight-forward text2img + Animatediff:
https://www.reddit.com/r/StableDiffusion/comments/180no09/on...
Fixing that frame to frame jitter related to animation is probably the most in-demand thing around Stable Diffusion right now.
Animatediff motion painting made a splash the other day:
https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...
It's definitely an exciting time around SD + animation. You can see how close it is to reaching the next level of generation.
This field moves so fast. Blink an eye and there is another new paper. This is really cool, and the learning speed of us humans is insane! Really excited about using it for downstream tasks! I wonder how easy it is to integrate AnimateDiff with this model?
Also, can someone benchmark it on M3 devices? It would be cool to see whether they're worth getting for running these diffusion inferences and for development. If the M3 Pro allows fine-tuning, it would be amazing to use it on downstream tasks!
It makes sense that they had to take out all of the cuts and fades from the training data to improve results.
In the background section of the research paper they mention "temporal convolution layers"; can anyone explain what that is? What sort of training data is the input to represent temporal states between the images that make up a video? Or does it mean something else?
Temporal convolutions convolve along the time (frame) axis, either instead of or in addition to the spatial axes.
A good resource for the "instead" case: https://unit8.com/resources/temporal-convolutional-networks-...
The "also" case is an example of 3D convolution; an example of a paper that uses it: https://www.cv-foundation.org/openaccess/content_iccv_2015/p...
I would assume it's something similar to joining multiple frames/attention maps in the channel dimension and then moving values around so the convolution has access to some channels from other video frames.
I was working on a similar idea a few years ago using this paper as a reference, and it worked extremely well for consistency while also helping with flicker.
https://arxiv.org/abs/1811.08383
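A toy sketch of the basic idea (not the paper's actual architecture, which also inserts temporal attention blocks): keep the pretrained 2D spatial layers as they are and add layers that only mix information along the frame axis.

    import torch
    import torch.nn as nn

    class TemporalConvBlock(nn.Module):
        """Toy temporal layer: a 1D convolution over the frame axis only,
        leaving the spatial layout of each frame untouched."""
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)

        def forward(self, x):
            # x: (batch, frames, channels, height, width)
            b, t, c, h, w = x.shape
            x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)  # convolve over time
            x = self.conv(x)
            return x.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

    latents = torch.randn(1, 14, 320, 32, 32)  # 14 latent frames
    print(TemporalConvBlock(320)(latents).shape)  # same shape, now temporally mixed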
This is really, really cool. A few months ago I was playing with some of the "video" generation models on Replicate, and I got some really neat results[1], but it was very clear that the resulting videos were made from prompting each "frame" with the previous one. This looks like it can actually figure out how to make something that has a higher level context to it.
It's crazy to see this level of progress in just a bit over half a year.
[1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-diff...
Looks like I'm still good for my bet with some friends that before 2028 a team of 5-10 people will create, on a shoestring budget, a blockbuster-style movie that today costs 100+ million USD, and we won't be able to tell the difference.
Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic only to be improved upon with each subsequent blockbuster game.
I think we're in a similar phase with AI[0]: every new release in $category is better, gets hailed as super fantastic world changing, is improved upon in the subsequent Two Minute Papers video on $category, and the cycle repeats.
[0] all of them: LLMs, image generators, cars, robots, voice recognition and synthesis, scientific research, …
I'm imagining more of an AI that takes a standard movie screenplay and a sidecar file, similar to a CSS file for the web and generates the movie. This sidecar file would contain the "director" of the movie, with camera angles, shot length and speed, color grading, etc. Don't like how the new Dune movie looks? Edit the stylesheet and make it your own. Personalized remixed blockbusters.
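Purely as an illustration of the sidecar idea (every field here is made up; nothing like this exists today), it could be as simple as structured "direction" data riding alongside the screenplay:

    # Hypothetical "director stylesheet" for a generated film.
    dune_remix = {
        "grade": "teal and orange, desaturated highlights",
        "grain": 0.15,
        "scenes": [
            {"slug": "EXT. DESERT - DAY",
             "camera": {"lens_mm": 85, "movement": "static wide"},
             "shot_length_s": 6},
            {"slug": "INT. THRONE ROOM - NIGHT",
             "camera": {"lens_mm": 32, "movement": "slow dolly-in"},
             "shot_length_s": 9},
        ],
    }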
On a more serious note, I don't think Roger Deakins has anything to worry about right now. Or maybe ever. We've been here before. DAWs opened up an entire world of audio production to people that could afford a laptop and some basic gear. But we certainly do not have a thousand Beatles out there. It still requires talent and effort.
I'm pumped for this future, but I'm not sure that I buy your optimistic timeline. If the history of AI has taught us anything, it is that the last 1% of progress is the hardest half. And given the unforgiving nature of the uncanny valley, the video produced by such a system will be worthless until it is damn-near perfect. That's a tall order!
The first full-length AI generated movie will be an important milestone for sure, and will probably become a "required watch" for future AI history classes. I wonder what the Rotten Tomatoes page will look like.
Geordi: "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him"
VRAM requirements are big for this launch.
We're hosting this for free at https://app.decoherence.co/stablevideo.
Disclaimer: Google log-in required to help us reduce spam.
Like, at this point, what are the technical counters to the assertion that our world is a simulation?
(disclaimer: worked in the sim industry for 25 years, still active in terms of physics-based rendering).
First off, there are zero technical proofs that we are in a sim, just a number of philosophical arguments.
In practical terms, we cannot yet simulate a single human cell at the molecular level, given the massive number of interactions that occur every microsecond. Simulating our entire universe is not technically possible within the lifetime of our universe, according to our current understanding of computation and physics. You either have to assume that ‘the sim’ is very narrowly focussed in scope and fidelity, and / or that the outer universe that hosts ‘the sim’ has laws of physics that are essentially magic from our perspective. In which case the simulation hypothesis is essentially a religious argument, where the creator typed 'let there be light' into his computer. If there isn't such a creator, the sim hypothesis 'merely' suggests that our universe, at its lowest levels, looks somewhat computational, which is an entirely different argument.
The brain does simulate reality in the sense that what you experience isn't direct sensory input, but more like a dream being generated to predict what it thinks is happening based on conflicting and imperfect sensory input.
Why does it matter? Not trying to dismiss, but truly, what would it mean to you if you could somehow verify the "simulation"?
If it would mean something drastic to you, I would be very curious to hear your preexisting existential beliefs/commitments.
People say this sometimes, and it's kind of slowly been revealed to me that it's just a new kind of geocentrism: it's not just a simulation people have in mind, but one where earth/humans are centered, and the rest of the universe is just for the benefit of "our" part of the simulation.
Which is a fine theory I guess, but is also just essentially wanting God to exist with extra steps!
How about this theory is neither verifiable nor falsifiable.
There can be no technical counters to the assertion that our world is a simulation. If our world is a simulation, then the hardware/software that simulates it is outside of our world, and its technical constitution is inaccessible to us.
It's purely a religious question. When humanity invented the wheel, religion described the world as a giant wheel rotating in cycles. When humanity invented books, religion described the world as a book, and God as its writer. When humanity invented complex mechanisms, religion described the world as a giant mechanism, and God as a watchmaker. Then computers were invented, and you can guess what happened next.
A little too "freshman's first hit off a bong" for me. There are, of course, substantial differences between video and reality.
Let's steel-man it: you mean 3D VR. Let's stipulate there's a headset today that renders 3D visually indistinguishable from reality. We're still short the other four senses.
Much like faith, there's always a way to escape the traps here and say "can you PROVE this is base reality?"
The general technical argument against "brain in a vat being simulated" would be the computational expense of doing so, but you can also write that off with the equivalent of foveated rendering for all senses / entities.
To an extent... https://youtu.be/udPY5rQVoW0
PS: Video is 2 years old, but still really impressive.
I've been following this space very, very closely, and the killer feature would be the ability to generate these full-featured videos for longer than a few seconds with consistently shaped "characters" (e.g., flowers, grass, houses, cars, actors, etc.). Right now, it's not clear to me that this is achieving that objective. This feels like it could be great for creating short GIFs, but at what cost?
To be clear, this remains wicked, wicked, wicked exciting.
I admit I'm ignorant about these models' inner workings, but I don't understand why text is the chosen input format for these models.
It was the same for image generation, where one needed to produce text prompts to create the image, and stuff like img2img and Controlnet that allowed things like controlling poses and inpainting, or having multiple prompts with masks controlling which part of the image is influenced by which prompt.
The input eventually becomes meanings mapped to reality.
According to the GitHub repo this is an "image-to-video model". They tease of an upcoming "text to video" interface on the linked landing page, though. My guess is that interface will use a text-to-image model and then feed that into the image-to-video model.
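For what it's worth, diffusers added a dedicated pipeline for this model; a minimal image-to-video sketch (any still works as conditioning, including one produced by a text-to-image model first):

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16",
    ).to("cuda")

    image = load_image("generated_still.png").resize((1024, 576))
    frames = pipe(image, decode_chunk_size=8, motion_bucket_id=127).frames[0]
    export_to_video(frames, "clip.mp4", fps=7)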
Porn will be one of the main use cases for this technology. Porn sites pioneered video streaming technologies back in the day, and drove a lot of the innovation there.
Diffusion models for moving images are already used to a limited extent for this. And I'm sure it will be the use case, not just an edge case.
I do not think so as the chance of constructing a fleshy eldritch horror is quite high.
Has anyone managed to run the thing? I got the streamlit demo to start after fighting with pytorch, mamba, and pip for half an hour, but the demo runs out of GPU memory after a little while. I have 24GB on GPU on the machine I used, does it need more?
Yeah, I've got a 24GB 4090; try reducing the number of frames decoded to something like 4 or 8. Keep in mind it maxes out the 24GB and spills over into system RAM (with the latest NVIDIA drivers).
Have heard from others attempting it that it needs 40GB, so basically an A100/A6000/H100 or other large card. Or an Apple Silicon Mac with a bunch of unified memory, I guess.
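If going through diffusers rather than the bundled demo, the usual memory-saving knobs look roughly like this (a sketch; with CPU offload and a small decode chunk it should fit in much less memory, though I haven't measured the exact floor):

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16",
    )
    pipe.enable_model_cpu_offload()      # keep only the active sub-model on the GPU
    pipe.unet.enable_forward_chunking()  # run the UNet feed-forward layers in chunks

    image = load_image("still.png").resize((1024, 576))
    frames = pipe(image, num_frames=14, decode_chunk_size=2).frames[0]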
We're hosting this free (no credit card needed) at https://app.decoherence.co/stablevideo Disclaimer: Google log-in required to help us reduce spam.
Let me know what you think of it! It works best on landscape images from my tests.
These are basically animated postcards, like you often see now on loading screens in video games. A single picture has been animated. Still a long way from actual video.
It seems like the breakthrough is that the video generating method is now baked into the model and generator. I've seen several fairly impressive AI animations as well, but until now, I assumed they were tediously cobbled together by hacking on the still-image SD models.
Once text-to-video is good enough and once text generation is good enough, we could legit actually have endless TV shows produced by individuals! We're probably still far away from that, but it is exciting to think about!
I think this will really open new ways and new doors to creativity and creative expression.
Question for anyone more familiar with this space: are there any high-quality tools which take an image and make it into a short video? For example, an image of a tree becomes a video of a tree swaying in the wind.
I have googled for it but mostly just get low quality web tools.
Very soon we will be able to change the storyline of a web series dynamically: a little more thrill, a little more comedy, swapping a character's face to match ours or someone else's, all in 3D with a 360-degree view. How far are we from this? 5 years?
Instance One: Act as a top-tier Hollywood scenarist; use the publicly available data on emotional sentiment to generate a storyline, and apply the well-known archetypes from proven blockbusters for character development. Move to instance two.
Instance Two: Act as a top-tier producer. {insert generated prompt}. Move to instance three.
Instance Three: Generate Meta-humans and load personality traits. Move to instance four.
Instance Four: Act as a top-tier director. {insert generated prompt}. Move to instance five.
Instance Five: Act as a top-tier editor. {insert generated prompt}. Move to instance six.
Instance Six: Act as a top-tier marketing and advertisement agency. {insert generated prompt}. Move to instance seven.
Instance Seven: Act as a top tier accountant, generate an interface to real-time ROI data and give me the results on an optimized timeline into my AI induced dream.
Personal GPT: Buy some stocks, diversify my portfolio, stock up on synthetic meat, bug-coke and Soma. Call my mom and tell her I made it.
Much like in static images, the subtle unintended imperfections are quite interesting to observe.
For example, the man in the cowboy hat looks like he is almost gagging. In the train video, the tracks seem too wide while the train ice-skates across them.
How much longer will it be until we can play "video games" which consist of user-input streamed to an AI that generates video output and streams it to the player's screen?
If you're willing to accept text-based output, then text-adventure-style games and even simulating bash were possible using ChatGPT until OpenAI nerfed it.
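A minimal sketch of that kind of loop against the OpenAI chat API (the model name is just an example; the system prompt does all the game-mastering):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    history = [{"role": "system",
                "content": "You are a text adventure game. Describe scenes, track "
                           "inventory, and respond only as the game."}]

    while True:
        action = input("> ")
        if action in {"quit", "exit"}:
            break
        history.append({"role": "user", "content": action})
        reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
        text = reply.choices[0].message.content
        history.append({"role": "assistant", "content": text})
        print(text)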
Finally! Now that this is out, I can finally start adding proper video widgets to CushyStudio: https://github.com/rvion/CushyStudio#readme . I really hope I can get in touch with Stability AI people soon. Maybe Hacker News will help.
I cannot join the waiting list (nor opt in to the marketing newsletter) because the sign-up form checkboxes don't toggle on Android in mobile Chrome or Firefox.
It's definitely pretty impressive already. If there could be some kind of "final pass" to remove the slightly glitchy generative artifacts, these would look completely passable as simple .gif/.webm header images, especially if they could be made to loop smoothly a la Snapchat's bounce filter.
Don't get me wrong, this is insanely cool, but it's still a long way from good enough to be truly disruptive.