In the video towards the bottom of the page, there are two birds (blue jays), but in the background there are two identical buildings (which look a lot like the CN Tower). CN Tower is the main landmark of Toronto, whose baseball team happens to be the Blue Jays. It's located near the main sportsball stadium downtown.
I vaguely understand how text-to-image works, so it makes sense that the vector space for "blue jays" would be near "toronto" or "cn tower". The improvements in scale and speed (image -> now video) are impressive, but given how incredibly capable the image generation models are, they simultaneously feel crippled and limited by their lack of editing / iteration ability.
Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
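On the "blue jays" / Toronto point, that kind of association is easy to sanity-check by comparing text embeddings directly. A rough sketch using CLIP's text encoder (the checkpoint is just a common public one, not necessarily what this model uses):

    import torch
    from transformers import CLIPModel, CLIPTokenizer

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    texts = ["blue jays", "toronto", "cn tower", "a bicycle"]
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    print(emb @ emb.T)  # pairwise cosine similarities between the phrases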
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
I feel like we're close too, but for another reason.
Although I love SD and these video examples are great, it's a flawed method: the lighting is never quite right and there are incoherent details just about everywhere. Any 3D artist or photographer can spot that immediately.
However, I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, light sources set up, etc.
That scene will be sent into Blender, you'll click a button, and you'll get an actual rendering made by Blender, with correct lighting.
Wanna move that bicycle? Move it in the 3D scene exactly where you want.
That is coming.
And it's the same for audio: why generate a finished audio file when models will soon be able to generate the individual tracks, with all the instruments and whatnot, from which the final audio file can be created?
That is coming too.
I recently tried to generate clip art for a presentation using GPT-4/DALL-E 3. I found it could handle some updates but the output generally varied wildly as I tried to refine the image. For instance, I'd have a cartoon character checking its watch and also wearing a pocket watch. Trying to remove the pocket watch resulted in an entirely new cartoon with little stylistic continuity to the first.
Also, I originally tried to get the 3 characters in the image to be generated simultaneously, but eventually gave up as DALL-E had a hard time understanding how I wanted them positioned relative to each other. I just generated 3 separate characters and positioned them in the same image using Gimp.
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
Emu can do that.
The blue jay/Toronto thing may be addressable later (I suspect via more detailed annotations, a la DALL-E 3). These current video models are highly focused on figuring out temporal coherence.
Do the parameters think that Jazz musicians are mormon? Padres often surf? Wizards like the Lincoln Memorial?
Adobe is doing some great work here in my opinion in terms of building AI tools that make sense for artist workflows. This "sneak peek" demo from the recent Adobe Max conference is pretty much exactly what you described, actually better because you can just click on an object in the image and drag it.
See video: https://www.adobe.com/max/2023/sessions/project-stardust-gs6...
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
Nearly all of the available models have this, even the highly commercialized ones like Adobe Firefly and Canva; it's called inpainting in most tools.
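For anyone who hasn't tried it, a minimal inpainting sketch with the diffusers library (checkpoint and file names are placeholders): you mask the region you want changed and re-prompt just that area. "Move the bicycle" then becomes erase-it-here, paint-it-there rather than a single instruction.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("photo.png").convert("RGB")
    mask = Image.open("bicycle_mask.png").convert("RGB")  # white = region to repaint

    result = pipe(prompt="empty street, no bicycle",
                  image=image, mask_image=mask).images[0]
    result.save("edited.png")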
As for your last question, yes, that exists. There are two models from Meta that do exactly this, instruction-based iteration on photos, Emu Edit[0], and videos, Emu Video[1].
There's also LLaVa-interactive[2] for photos where you can even chat with the model about the current image.
[0]: https://emu-edit.metademolab.com/
[1]: https://emu-video.metademolab.com/
[2]: https://llava-vl.github.io/llava-interactive/
> they simultaneously feel crippled and limited by their lack of editing / iteration ability.
Yeah. They're not "videos" so much as images that move around a bit.
This doesn't really look any better than those Midjourney + RunwayML videos we had half a year ago.
> Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.
Google has a model called Phenaki that supposedly allows for that kind of stuff. But the public can't use it so it's hard to say how good it actually is.
Have you seen fal.ai/dynamic, where you can perform image-to-image synthesis (basically editing an existing image with the help of the diffusion process) using LCMs to provide a real-time UI?
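A rough sketch of the loop such a UI wraps, assuming the usual diffusers setup with an LCM-LoRA (the checkpoint and LoRA names are common public ones, used here as an assumption about the setup): with only a few inference steps per edit, the round trip is fast enough to feel interactive.

    import torch
    from PIL import Image
    from diffusers import AutoPipelineForImage2Image, LCMScheduler

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # LCM-LoRA

    source = Image.open("rough_sketch.png").convert("RGB")
    out = pipe(
        prompt="a red bicycle leaning against a brick wall",
        image=source,
        num_inference_steps=4,  # LCMs need very few steps, hence the near-real-time UI
        strength=0.5,
        guidance_scale=1.0,
    ).images[0]
    out.save("edited.png")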
I don’t spend a lot of time keeping up with the space, but I could have sworn I’ve seen a demo that allowed you to iterate in the way you’re suggesting. Maybe someone else can link it.
I also wonder if the model takes capitalization into account. Capitalized "Blue Jays" seems more likely to reference the sports team; the birds would be lowercase.
Also, maybe you can't edit post facto, but when you give prompts, would you not be able to say: two blue jays but no CN Tower?
This is not the flex you think it is. You don't have to like sports, but snarking on people who do doesn't make you intellectual, it just makes you come across as a douchebag, no different than a sports fan making fun of "D&D nerds" or something.
The rate of progress in ML this past year has been breathtaking.
I can't wait to see what people do with this once ControlNet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of Stable Diffusion typically involves lots of manual post-processing to remove flicker.
Right now, AnimateDiff is leading the way in consistency, but I'm really excited to see what people will do with this new model.
The main utility will be misinformation.
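On the flicker point above: the crudest form of that post-processing is plain temporal blending of the decoded frames. A toy sketch for illustration only (real deflicker pipelines usually use optical-flow-guided smoothing instead):

    import numpy as np
    from PIL import Image

    def deflicker(frames, alpha=0.6):
        """Naive temporal smoothing: blend each frame with an exponential
        moving average of the preceding frames to damp frame-to-frame flicker."""
        smoothed, ema = [], None
        for f in frames:
            x = np.asarray(f, dtype=np.float32)
            ema = x if ema is None else alpha * ema + (1 - alpha) * x
            smoothed.append(Image.fromarray(ema.astype(np.uint8)))
        return smoothed

    frames = [Image.open(f"frame_{i:04d}.png") for i in range(120)]
    for i, f in enumerate(deflicker(frames)):
        f.save(f"smooth_{i:04d}.png")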
I understand the magnitude of the innovation going on here. But it still feels like we are generating these videos with both hands tied behind our backs. In other words, it's nearly impossible to edit the videos under these constraints. (Imagine trying to edit the blue jays to get the perfect view.)
Since videos are rarely consumed raw, what if this became a pipeline into Blender instead (Blender, the 3D software)? The video then becomes a complete scene with all the key elements of the text input animated. You have your textures, your animation, your camera, and all the objects in place. We could even have the render engine in the pipeline to increase the speed of video generation.
It may sound like I'm complaining, but I'm just making a feature request...
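If generation produced an actual scene, the "move the bicycle" edit really would be a one-liner in Blender's Python API. A minimal sketch (the object name "Bicycle" is hypothetical, whatever the generation step happened to call it):

    # Run inside Blender (bpy is Blender's built-in Python API).
    import bpy

    bike = bpy.data.objects["Bicycle"]  # hypothetical object from the generated scene

    # "Move the bicycle to the left": edit the scene directly, no re-prompting.
    bike.location.x -= 2.0
    bike.keyframe_insert(data_path="location", frame=1)

    # Render with Cycles so the lighting stays physically consistent.
    scene = bpy.context.scene
    scene.render.engine = "CYCLES"
    scene.render.filepath = "//render_0001.png"
    bpy.ops.render.render(write_still=True)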
What would solve all these issues is full generation of 3D models that we hopefully get a chance to see over the next decade. I’ve been advocating for a solid LiDAR camera on the iPhone so there is a lot of training data for these LLMs.
The LICENSE is a special non-commercial one: https://huggingface.co/stabilityai/stable-video-diffusion-im...
It's unclear how exactly to run it easily: diffusers has video generation support now but need to see if it plugs in seamlessly.
I'm still puzzled as to how these "non-commercial" model licenses are supposed to be enforceable. Software licenses govern the redistribution of the software, not products produced with it. An image isn't GPL'd because it was produced with GIMP.
The license is a contract that allows you to use the software provided you fulfill some conditions. If you do not fulfill the conditions, you have no right to a copy of the software and can be sued. This enforcement mechanism is the same whether the conditions are that you include source code with copies you redistribute, or that you may only use it for evil, or that you must pay a monthly fee. Of course this enforcement mechanism may turn out to be ineffective if it's hard to discover that you're violating the conditions.
It doesn't have to be enforceable. This licensing model works exactly the same as Microsoft Windows or WinRAR licensing. Lots of people have pirated Windows or just bought cheap keys off eBay, but no one in their right mind would use anything like that at their company.
In the same way, you can easily violate the "non-commercial" clause of a model like this one as a private person or a tiny startup, but a company that decides to use it for its business will more likely just go and pay.
So it's possible to ignore the license, but the legal and financial risks are not worth it for businesses.
Visual Studio Community (and many other products) only allows "non-commercial" usage. Sounds like it limits what you can do with what you produce with it.
At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.
As an example, see the Creative Commons license, ShareAlike clause:
> If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
So, there's a few different things interacting here that are a little confusing.
First off, you have copyright law, which grants monopolies on the act of copying to the creators of the original. In order to legally make use of that work you need to either have permission to do so (a license), or you need to own a copy of the work that was made by someone with permission to make and sell copies (a sale). For the purposes of computer software, you will almost always get rights to the software through a license and not a sale. In fact, there is an argument that usage of computer software requires a license and that a sale wouldn't be enough because you wouldn't have permission to load it into RAM[0].
Licenses are, at least under US law, contracts. These are Turing-complete priestly rites written in a special register of English that legally bind people to do or not do certain things. A license can grant rights, or, confusingly, take them away. For example, you could write a license that takes away your fair use rights[1], and courts will actually respect that. So you can also have a license that says you're only allowed to use software for specific listed purposes but not others.
In copyright you also have the notion of a derivative work. This was invented whole-cloth by the US Supreme Court, who needed a reason to prosecute someone for making a SSSniperWolf-tier abridgement[2] of someone else's George Washington biography. Normal copyright infringement is evidenced by substantial similarity and access: i.e. you saw the original, then you made something that's nearly identical, ergo infringement. The law regarding derivative works goes a step further and counts hypothetical works that an author might make - like sequels, translations, remakes, abridgements, and so on - as requiring permission in order to make. Without that permission, you don't own anything and your work has no right to exist.
The GPL is the anticopyright "judo move", invented by a really ornery computer programmer that was angry about not being able to fix their printer drivers. It disclaims almost the entire copyright monopoly, but it leaves behind one license restriction, called a "copyleft": any derivative work must be licensed under the GPL. So if you modify the software and distribute it, you have to distribute your changes under GPL terms, thus locking the software in the commons.
Images made with software are not derivative works of the software, nor do they contain a substantially similar copy of the software in them. Ergo, the GPL copyleft does not trip. In fact, even if it did trip, your image is still not a derivative work of the software, so you don't lose ownership over the image because you didn't get permission. This also applies to model licenses on AI software, inasmuch as the AI companies don't own their training data[3].
However, there's still something that licenses can take away: your right to use the software. If you use the model for "commercial" purposes - whatever those would be - you'd be in breach of the license. What happens next is also determined by the license. It could be written to take away your noncommercial rights if you breach the license, or it could preserve them. In either case, however, the primary enforcement mechanism would be a court of law, and courts usually award money damages. If particularly justified, they could demand you destroy all copies of the software.
If it went to SCOTUS (unlikely), they might even decide that images made by software are derivative works of the software after all, just to spite you. The Betamax case said that advertising a copying device with potentially infringing scenarios was fine as long as that device could be used in a non-infringing manner, but then the Grokster case said it was "inducement" and overturned it. Static, unchanging rules are ultimately a polite fiction, and the law can change behind your back if the people in power want or need it to. This is why you don't talk about the law in terms of something being legal or illegal, you talk about it in terms of risk.
[0] Yes, this is a real argument that courts have actually made. Or at least the Ninth Circuit.
The actual facts of the case are even more insane - basically a company trying to sue former employees for fixing its customers' computers. Imagine if Apple sued Louis Rossmann for pirating macOS every time he turned on a customer laptop. The only reason why they can't is because Congress actually created a special exemption for computer repair and made it part of the DMCA.
[1] For example, one of the things you agree to when you buy Oracle database software is to give up your right to benchmark the software. I'm serious! The tech industry is evil and needs to burn down to the ground!
[2] They took 300 pages worth of material from 12 books and copied it into a separate, 2 volume work.
[3] Whether or not copyright on the training data images flows through to make generated images a derivative work is a separate legal question in active litigation.
> An image isn't GPL'd because it was produced with GIMP.
That's because of how the GPL is written, not because of some limitation of software licences.
It makes me think of the difference between ancestral and non-ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler, the output is somewhat deterministic and doesn't vary with increasing sampling steps, but with Ancestral, noise is added to each step which creates more variety but is more random/stochastic.
I assume to create video, the sampler needs to lean heavily on the previous frame while injecting some kind of sub-prompt, like rotate <object> to the left by 5 degrees, etc. I like the phrase another commenter used, "temporal consistency".
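In diffusers terms that's just a choice of scheduler on the same pipeline; a minimal comparison sketch (checkpoint name is only an example):

    import torch
    from diffusers import (StableDiffusionPipeline, EulerDiscreteScheduler,
                           EulerAncestralDiscreteScheduler)

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    prompt = "two blue jays on a branch"

    # Euler: deterministic given the seed; more steps mostly refine the same image.
    pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
    g = torch.Generator("cuda").manual_seed(42)
    img_euler = pipe(prompt, num_inference_steps=30, generator=g).images[0]

    # Ancestral: fresh noise is injected at every step, so the result keeps drifting.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    g = torch.Generator("cuda").manual_seed(42)
    img_ancestral = pipe(prompt, num_inference_steps=30, generator=g).images[0]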
Edit: Indeed the special sauce is "temporal layers". [0]
> Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets
[0] https://stability.ai/research/stable-video-diffusion-scaling...
The hardest problem the Stable Diffusion community has dealt with in terms of quality has been in the video space, largely in relation to the consistency between frames. It's probably the most commonly discussed problem for example on r/stablediffusion. Temporal consistency is the popular term for that.
So this example was posted an hour ago, and it's jumping all over the place frame to frame (somewhat weak temporal consistency). The author appears to have used pretty straight-forward text2img + Animatediff:
https://www.reddit.com/r/StableDiffusion/comments/180no09/on...
Fixing that frame to frame jitter related to animation is probably the most in-demand thing around Stable Diffusion right now.
Animatediff motion painting made a splash the other day:
https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...
It's definitely an exciting time around SD + animation. You can see how close it is to reaching the next level of generation.
This field moves so fast. Blink an eye and there is another new paper. This is really cool, and the learning speed of us humans is insane! Really excited about using it for downstream tasks! I wonder how easy it is to integrate AnimateDiff with this model?
Also, can someone benchmark it on M3 devices? It would be cool to see whether they're worth getting for running these diffusion inferences and for development. If the M3 Pro allows fine-tuning, it would be amazing to use it on downstream tasks!
It makes sense that they had to take out all of the cuts and fades from the training data to improve results.
In the background section of the research paper they mention "temporal convolution layers"; can anyone explain what that is? What sort of training data is the input to represent temporal states between the images that make up a video? Or does it mean something else?
Temporal convolutions convolve along the time (frame) axis, either instead of or in addition to the spatial axes.
A good resource for the "instead" case: https://unit8.com/resources/temporal-convolutional-networks-...
The "also" case is an example of 3D convolution; an example of a paper that uses it: https://www.cv-foundation.org/openaccess/content_iccv_2015/p...
I would assume it's something similar to joining multiple frames/attention maps in the channel dimension and then moving values around so the convolution has access to some channels from other video frames.
I was working on a similar idea a few years ago using this paper as a reference, and it worked extremely well for consistency while also helping with flicker.
https://arxiv.org/abs/1811.08383
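A toy sketch of the basic idea (not the paper's actual architecture, which also inserts temporal attention blocks): keep the pretrained 2D spatial layers as they are and add layers that only mix information along the frame axis.

    import torch
    import torch.nn as nn

    class TemporalConvBlock(nn.Module):
        """Toy temporal layer: a 1D convolution over the frame axis only,
        leaving the spatial layout of each frame untouched."""
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)

        def forward(self, x):
            # x: (batch, frames, channels, height, width)
            b, t, c, h, w = x.shape
            x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)  # convolve over time
            x = self.conv(x)
            return x.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

    latents = torch.randn(1, 14, 320, 32, 32)  # 14 latent frames
    print(TemporalConvBlock(320)(latents).shape)  # same shape, now temporally mixed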
This is really, really cool. A few months ago I was playing with some of the "video" generation models on Replicate, and I got some really neat results[1], but it was very clear that the resulting videos were made from prompting each "frame" with the previous one. This looks like it can actually figure out how to make something that has a higher level context to it.
It's crazy to see this level of progress in just a bit over half a year.
[1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-diff...
Looks like I'm still good for my bet with some friends that before 2028 a team of 5-10 people will create, on a shoestring budget, a blockbuster-style movie that today costs 100+ million USD, and we won't be able to tell the difference.
Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic only to be improved upon with each subsequent blockbuster game.
I think we're in a similar phase with AI[0]: every new release in $category is better, gets hailed as super fantastic world changing, is improved upon in the subsequent Two Minute Papers video on $category, and the cycle repeats.
[0] all of them: LLMs, image generators, cars, robots, voice recognition and synthesis, scientific research, …
I'm imagining more of an AI that takes a standard movie screenplay and a sidecar file, similar to a CSS file for the web and generates the movie. This sidecar file would contain the "director" of the movie, with camera angles, shot length and speed, color grading, etc. Don't like how the new Dune movie looks? Edit the stylesheet and make it your own. Personalized remixed blockbusters.
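Purely as an illustration of the sidecar idea (every field here is made up; nothing like this exists today), it could be as simple as structured "direction" data riding alongside the screenplay:

    # Hypothetical "director stylesheet" for a generated film.
    dune_remix = {
        "grade": "teal and orange, desaturated highlights",
        "grain": 0.15,
        "scenes": [
            {"slug": "EXT. DESERT - DAY",
             "camera": {"lens_mm": 85, "movement": "static wide"},
             "shot_length_s": 6},
            {"slug": "INT. THRONE ROOM - NIGHT",
             "camera": {"lens_mm": 32, "movement": "slow dolly-in"},
             "shot_length_s": 9},
        ],
    }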
On a more serious note, I don't think Roger Deakins has anything to worry about right now. Or maybe ever. We've been here before. DAWs opened up an entire world of audio production to people that could afford a laptop and some basic gear. But we certainly do not have a thousand Beatles out there. It still requires talent and effort.
I'm pumped for this future, but I'm not sure that I buy your optimistic timeline. If the history of AI has taught us anything, it is that the last 1% of progress is the hardest half. And given the unforgiving nature of the uncanny valley, the video produced by such a system will be worthless until it is damn-near perfect. That's a tall order!
The first full-length AI generated movie will be an important milestone for sure, and will probably become a "required watch" for future AI history classes. I wonder what the Rotten Tomatoes page will look like.
Geordi: "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him"
VRAM requirements are big for this launch.
We're hosting this for free at https://app.decoherence.co/stablevideo.
Disclaimer: Google log-in required to help us reduce spam.
Like, at this point, what are the technical counters to the assertion that our world is a simulation?
(disclaimer: worked in the sim industry for 25 years, still active in terms of physics-based rendering).
First off, there are zero technical proofs that we are in a sim, just a number of philosophical arguments.
In practical terms, we cannot yet simulate a single human cell at the molecular level, given the massive number of interactions that occur every microsecond. Simulating our entire universe is not technically possible within the lifetime of our universe, according to our current understanding of computation and physics. You either have to assume that ‘the sim’ is very narrowly focussed in scope and fidelity, and / or that the outer universe that hosts ‘the sim’ has laws of physics that are essentially magic from our perspective. In which case the simulation hypothesis is essentially a religious argument, where the creator typed 'let there be light' into his computer. If there isn't such a creator, the sim hypothesis 'merely' suggests that our universe, at its lowest levels, looks somewhat computational, which is an entirely different argument.
The brain does simulate reality in the sense that what you experience isn't direct sensory input, but more like a dream being generated to predict what it thinks is happening based on conflicting and imperfect sensory input.
Why does it matter? Not trying to dismiss, but truly, what would it mean to you if you could somehow verify the "simulation"?
If it would mean something drastic to you, I would be very curious to hear your preexisting existential beliefs/commitments.
People say this sometimes, and it's kind of slowly been revealed to me that it's just a new kind of geocentrism: it's not just a simulation people have in mind, but one where earth/humans are centered, and the rest of the universe is just for the benefit of "our" part of the simulation.
Which is a fine theory I guess, but is also just essentially wanting God to exist with extra steps!
How about this theory is neither verifiable nor falsifiable.
There can be no technical counters to the assertion that our world is a simulation. If our world is a simulation, then the hardware/software that simulates it is outside of our world, and its technical constitution is inaccessible to us.
It's purely a religious question. When humanity invented the wheel, religion described the world as a giant wheel rotating in cycles. When humanity invented books, religion described the world as a book, and God as its writer. When humanity invented complex mechanisms, religion described the world as a giant mechanism, and God as a watchmaker. Then computers were invented, and you can guess what happened next.
A little too "freshman's first hit off a bong" for me. There are, of course, substantial differences between video and reality.
Let's steel-man it: you mean 3D VR. Let's stipulate there's a headset today that renders 3D visually indistinguishable from reality. We're still short the other four senses.
Much like faith, there's always a way to escape the traps here and say "can you PROVE this is base reality?"
The general technical argument against "brain in a vat being simulated" would be the computational expense of doing so, but you can also write that off with the equivalent of foveated rendering for all senses / entities.
To an extent... https://youtu.be/udPY5rQVoW0
PS: Video is 2 years old, but still really impressive.
I've been following this space very, very closely, and the killer feature would be the ability to generate these full-featured videos for longer than a few seconds with consistently shaped "characters" (e.g., flowers, grass, houses, cars, actors, etc.). Right now, it's not clear to me that this is achieving that objective. This feels like it could be great for creating short GIFs, but at what cost?
To be clear, this remains wicked, wicked, wicked exciting.
I admit I'm ignorant about these models' inner workings, but I don't understand why text is the chosen input format for these models.
It was the same for image generation, where one needed to produce text prompts to create the image, and stuff like img2img and Controlnet that allowed things like controlling poses and inpainting, or having multiple prompts with masks controlling which part of the image is influenced by which prompt.
The input eventually becomes meanings mapped to reality.
According to the GitHub repo this is an "image-to-video model". They tease of an upcoming "text to video" interface on the linked landing page, though. My guess is that interface will use a text-to-image model and then feed that into the image-to-video model.
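For what it's worth, diffusers added a dedicated pipeline for this model; a minimal image-to-video sketch (any still works as conditioning, including one produced by a text-to-image model first):

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16",
    ).to("cuda")

    image = load_image("generated_still.png").resize((1024, 576))
    frames = pipe(image, decode_chunk_size=8, motion_bucket_id=127).frames[0]
    export_to_video(frames, "clip.mp4", fps=7)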
Porn will be one of the main use cases for this technology. Porn sites pioneered video streaming technologies back in the day, and drove a lot of the innovation there.
Diffusion models for moving images are already used to a limited extent for this. And I'm sure it will be the use case, not just an edge case.
I do not think so as the chance of constructing a fleshy eldritch horror is quite high.
Has anyone managed to run the thing? I got the streamlit demo to start after fighting with pytorch, mamba, and pip for half an hour, but the demo runs out of GPU memory after a little while. I have 24GB on GPU on the machine I used, does it need more?
Yeah, I've got a 24GB 4090; try reducing the number of frames decoded to something like 4 or 8. Keep in mind it maxes out the 24GB and spills over into system RAM (with the latest NVIDIA drivers).
Have heard from others attempting it that it needs 40GB, so basically an A100/A6000/H100 or other large card. Or an Apple Silicon Mac with a bunch of unified memory, I guess.
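If going through diffusers rather than the bundled demo, the usual memory-saving knobs look roughly like this (a sketch; with CPU offload and a small decode chunk it should fit in much less memory, though I haven't measured the exact floor):

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16",
    )
    pipe.enable_model_cpu_offload()      # keep only the active sub-model on the GPU
    pipe.unet.enable_forward_chunking()  # run the UNet feed-forward layers in chunks

    image = load_image("still.png").resize((1024, 576))
    frames = pipe(image, num_frames=14, decode_chunk_size=2).frames[0]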
We're hosting this free (no credit card needed) at https://app.decoherence.co/stablevideo Disclaimer: Google log-in required to help us reduce spam.
Let me know what you think of it! It works best on landscape images from my tests.
These are basically animated postcards, like you often see now on loading screens in video games. A single picture has been animated. Still a long way from actual video.
It seems like the breakthrough is that the video generating method is now baked into the model and generator. I've seen several fairly impressive AI animations as well, but until now, I assumed they were tediously cobbled together by hacking on the still-image SD models.
Once text-to-video is good enough and once text generation is good enough, we could legit actually have endless TV shows produced by individuals! We're probably still far away from that, but it is exciting to think about!
I think this will really open new ways and new doors to creativity and creative expression.
Question for anyone more familiar with this space: are there any high-quality tools which take an image and make it into a short video? For example, an image of a tree becomes a video of a tree swaying in the wind.
I have googled for it but mostly just get low quality web tools.
Very soon we will be able to change the storyline of a web series dynamically: a little more thrill, a little more comedy, swapping a character's face to match ours or someone else's, all in 3D with a 360-degree view. How far are we from this? 5 years?
Instance One: Act as a top-tier Hollywood scenarist; use the publicly available data on emotional sentiment to generate a storyline, and apply the well-known archetypes from proven blockbusters for character development. Move to instance two.
Instance Two: Act as a top-tier producer. {insert generated prompt}. Move to instance three.
Instance Three: Generate Meta-humans and load personality traits. Move to instance four.
Instance Four: Act as a top-tier director. {insert generated prompt}. Move to instance five.
Instance Five: Act as a top-tier editor. {insert generated prompt}. Move to instance six.
Instance Six: Act as a top-tier marketing and advertisement agency. {insert generated prompt}. Move to instance seven.
Instance Seven: Act as a top tier accountant, generate an interface to real-time ROI data and give me the results on an optimized timeline into my AI induced dream.
Personal GPT: Buy some stocks, diversify my portfolio, stock up on synthetic meat, bug-coke and Soma. Call my mom and tell her I made it.
Much like in static images, the subtle unintended imperfections are quite interesting to observe.
For example, the man in the cowboy hat looks like he is almost gagging. In the train video, the tracks seem too wide while the train ice-skates across them.
How much longer will it be until we can play "video games" which consist of user-input streamed to an AI that generates video output and streams it to the player's screen?
If you're willing to accept text-based output, then text-adventure-style games and even simulating bash were possible using ChatGPT until OpenAI nerfed it.
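A minimal sketch of that kind of loop against the OpenAI chat API (the model name is just an example; the system prompt does all the game-mastering):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    history = [{"role": "system",
                "content": "You are a text adventure game. Describe scenes, track "
                           "inventory, and respond only as the game."}]

    while True:
        action = input("> ")
        if action in {"quit", "exit"}:
            break
        history.append({"role": "user", "content": action})
        reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
        text = reply.choices[0].message.content
        history.append({"role": "assistant", "content": text})
        print(text)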
Finally! Now that this is out, I can finally start adding proper video widgets to CushyStudio: https://github.com/rvion/CushyStudio#readme . I really hope I can get in touch with Stability AI people soon. Maybe Hacker News will help.
I cannot join the waiting list (nor opt in to the marketing newsletter) because the sign-up form checkboxes don't toggle on Android in mobile Chrome or Firefox.
It's definitely pretty impressive already. If there could be some kind of "final pass" to remove the slightly glitchy generative artifacts, these would look completely passable as simple .gif/.webm header images, especially if they could be made to loop smoothly a la Snapchat's bounce filter.
Don't get me wrong, this is insanely cool, but it's still a long way from good enough to be truly disruptive.