top | item 42368604

Sora is here

1152 points| toomuchtodo | 1 year ago |openai.com

983 comments

[+] yeknoda|1 year ago|reply

I've found using these and similar tools that the amount of prompts and iteration required to create my vision (image or video in my mind) is very large and often is not able to create what I had originally wanted. A way to test this is to take a piece of footage or an image which is the ground truth, and test how much prompting and editing it takes to get the same or similar ground truth starting from scratch. It is basically not possible with the current tech and finite amounts of time and iterations.

[+] jerf|1 year ago|reply

It just plain isn't possible if you mean a prompt the size of what most people have been using lately, in the couple hundred character range. By sheer information theory, the number of possible interpretations of "a zoom in on a happy dog catching a frisbee" means that you can not match a particular clip out of the set with just that much text. You will need vastly more content; information about the breed, information about the frisbee, information about the background, information about timing, information about framing, information about lighting, and so on and so forth. Right now the AIs can't do that, which is to say, even if you sit there and type a prompt containing all that information, it is going to be forced to ignore most of the result. Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.

This isn't a matter of human-level AI or superhuman-level AI; it's just straight up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that will make the scene work, but expecting it to fill in the gaps the way you "want" even though you gave it no indication of what that is is expecting literal magic.

Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie staring a precocious 22-year old elf in a far-future Ganymede colony with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to be doing a lot of some sort of internal communication in an internal language to hold the result together between scenes, because what it takes to hold stuff coherent between scenes is an amount of English text not entirely dissimilar in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.

[+] miltonlost|1 year ago|reply

The adage "a picture is worth a thousand words" has the nice corollary "A thousand words isn't enough to be precise about an image".

Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.

[+] isoprophlex|1 year ago|reply

And another thing that irks me: none of these video generators get motion right...

Especially anything involving fluid/smoke dynamics, or fast dynamic momements of humans and animals all suffer from the same weird motion artifacts. I can't describe it other than that the fluidity of the movements are completely off.

And as all genai video tools I've used are suffering from the same problem, I wonder if this is somehow inherent to the approach & somehow unsolvable with the current model architectures.

[+] beefnugs|1 year ago|reply

AI isn't trying to sell to you: a precise artist with real vision in your brain. It is selling to managers who want to shit out something in an evening that approximates anything, that writes ads that no one wants to see anyway, that produces surface level examples of how you can pay employees less because "their job is so easy"

[+] minimaxir|1 year ago|reply

Way back in the days of GPT-2, there was an expectation that you'd need to cherry-pick atleast 10% of your output to get something usable/coherent. GPT-3 and ChatGPT greatly reduced the need to cherry-pick, for better or for worse.

All the generated video startups seem to generate videos with much lower than 10% usable output, without significant human-guided edits. Given the massive amount of compute needed to generate a video relative to hyperoptimized LLMs, the quality issue will handicap gen video for the foreseeable future.

[+] didibus|1 year ago|reply

Right, but you're thinking as someone who has a vision for the image/video. Think from someone who is needing an image/video and would normally hire a creative person for it, they might be able to get away with AI instead.

The same "prompt" they'd give the creative person they hired... Say, "I want an ad for my burgers that make it look really good, I'm thinking Christmas vibes, it should emphasize our high quality meat, make it cheerful, and remember to hint at our brand where we always have smiling cows."

Now that creative person would go make you that advert. You might check it, give a little feedback for some minor tweaks, and at some point, take what you got.

You can do the same here. The difference right now is that it'll output a lot of junk that a creative person would have never dared show you, so that initial quality filtering is missing. But on the flip side, it costs you a lot less, can generate like 100 of them quickly, and you just pick one that seems good enough.

[+] hipadev23|1 year ago|reply

Real artists struggle matching vague descriptions of what is in your head too. This is at least quicker?

[+] hmottestad|1 year ago|reply

When I first started learning Photoshop as a teenager I often knew what I wanted my final image to look like, but no matter how hard I tried I could never get the there. It wasn't that it was impossible, it was just that my skills just weren't there yet. I needed a lot more practice before I got good enough to create what I could see in my imagination.

Sora is obviously not Photoshop, but given that you can write basically anything you can think of I reckon it's going to take a long time to get good at expressing your vision in words that a model like Sora will understand.

[+] corytheboyd|1 year ago|reply

Free text is just the fundamentally wrong input for precision work like this. Because it is wrong for this doesn’t mean it has NO purpose, it’s still useful and impressive for what it is.

FWIW I too have been quite frustrated iterating with AI to produce a vision that is clear in my head. Past changing the broad strokes, once you start “asking” for specifics, it all goes to shit.

Still, it’s good enough at those broad strokes. If you want your vision to become reality, you either need to learn how to paint (or whatever the medium), or hire a professional, both being tough-but-fair IMO.

[+] ohthehugemanate|1 year ago|reply

If you have a specific vision, you will have to express the detailed information of that vision into the digital realm somehow. You can use (more) direct tools like premiere if you are fluent enough in their "language". Or you can use natural language to express the vision using AI. Either way you have to get the same amount of information into a digital format.

Also, AI sucks at understanding detail expressed in symbolic communication, because it doesn't understand symbols the way linguistic communication expects the receiver to understand them.

My own experience is that all the AI tools are great for shortcutting the first 70-80% or so. But the last 20% goes up an exponential curve of required detail which is easier and easier to express directly using tooling and my human brain.

Consider the analogy to a contract worker building or painting something for you. If all you have is a vague description, they'll make a good guess and you'll just have to live with that. But the more time you spend with them communicating (through description, mood boards rough sketches etc) the more accurate to your detailed version it will get. But you only REALLY get exactly what you want if you do it yourself, or sit beside them as they work and direct almost every step. And that last option is almost impossible if they can't understand symbolic meaning in language.

[+] cube2222|1 year ago|reply

Agreed. It’s still much better than what I could do myself without it, though.

(Talking about visual generative AI in general)

[+] joe_the_user|1 year ago|reply

The thing about Hollywood is that movies aren't made by a producer or director creating a description and an army of actors, tech and etc doing exactly that.

What happens is a description becomes a longer specification or script that's still good and hangs together in itself and then further iterations involving professionals who can't do "exactly what the director wants" but rather do something further that's good and close enough to what the director wants.

[+] diob|1 year ago|reply

I believe it. I was just using AI to help out with some mandatory end of year writing exercises at work.

Eventually, it starts to muck with the earlier work that it did good on, when I'm just asking it to add onto it.

I was still happy with what I got in the end, but it took trial and error and then a lot of piecemeal coaxing with verification that it didn't do more than I asked along the way.

I can imagine the same for video or images. You have to examine each step post prompt to verify it didn't go back and muck with the already good parts.

[+] planb|1 year ago|reply

Iterations are the missing link.

With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.

[+] goldfeld|1 year ago|reply

If you use it in a utilitarian way it'll give you a run for your money, if you use for expression, such as art, learning to embrace some serendipity, it makes good stuff.

[+] titzer|1 year ago|reply

As only a cursory user of said tools (but strong opinions) I felt the immediate desire to get an editable (2D) scene that I could rearrange. For example I often have a specific vantage point or composition in mind, which is fine to start from, but to tweak it and the elements, I'd like to edit it afterwards. To foray into 3D, I'd be wanting to rearrange the characters and direct them, as well as change the vantage point. Can it do that yet?

[+] javier123454321|1 year ago|reply

This is the conundrum of AI generated art. It will lower the barrier to entry for new artists to produce audiovisual content, but it will not lower the amount of effort required to make good art. If anything it will increase the effort, as it has to be excellent in order to get past the slop of base level drudge that is bound to fill up every single distribution channel.

[+] moralestapia|1 year ago|reply

Still three or four order of magnitudes cheaper and easier than to produce said video through traditional methods.

[+] nomel|1 year ago|reply

I think inpainting and "draw the label scene" type interfaces are the obvious future. Never thought I'd miss GauGAN [1].

https://www.youtube.com/watch?v=uNv7XBngmLY&t=25

[+] jstummbillig|1 year ago|reply

> A way to test this is to take a piece of footage or an image which is the ground truth, and test how much prompting and editing it takes to get the same or similar ground truth starting from scratch.

Sure, if you then do the same in reverse.

[+] unknown|1 year ago|reply

[deleted]

[+] mattigames|1 year ago|reply

Not too far in the future you will be able to drag and drop the position of the characters as well as the position of the camera, among other refiment tools.

[+] estebarb|1 year ago|reply

For those scenarios would be helpful a draft generation mode: 16 colors, 320x200...

[+] torginus|1 year ago|reply

Yeah, it almost feels like gambling - 'you're very close, just spend 20 more credits and you might get it right this time!'

[+] bilsbie|1 year ago|reply

Sounds like another way of saying a picture is worth a thousand words.

[+] droidrat|1 year ago|reply

[deleted]

[+] unknown|1 year ago|reply

[deleted]

[+] telenardo|1 year ago|reply

For those curious (and still locked out) here’s direct a comparison of Sora vs. the open-source leaders (HunyuanVideo, Mochi and LTX):

https://app.checkbin.dev/snapshots/1f0f3ce3-6a30-4c1a-870e-2...

Pros:

- Some of the Sora results are absolutely stunning. Check out the detail on the lion, for example! - The landscapes and aerial shots are absolutely incredible. - Quality is much better than Mochi & LTX out of the box. Mochi/LTX seem to require specifically optimized workflows (I've seen great img2vid LTX results on Reddit that start with Flux image generations, for example). Hunyuan seems comparable to Sora!

Cons:

- Still nearly impossible to access Sora despite the “launch”. My generations today were in the 2000s, implying that it’s only open to a very small number of people. There’s no api yet, so it’s not an option for developers. - Sora struggles with physical interactions. Watch the dancers moonwalk, or the ball goes through the dog. HunyuanVideo seems to be a bit better in this regard. - Can't run it locally mode (obviously) - I haven't tested this, but I think it's safe to assume Sora will be censored extensively. HunyuanVideo is surprisingly open (I've seen NSFW generations!) - I’m getting weird camera angles from Sora, but that could likely be solved with better prompting.

Overall, I’d say it’s the best model I've played with, though I haven’t spent much time on other non-open-source ones. Hunyuan gives it a run for its money, though!

[+] pen2l|1 year ago|reply

Every day that passes I grow fonder of Google's decision to delay or otherwise keep a lot of this under the wraps.

The other day I was scrolling down on YouTube shorts and a couple videos invoked an uncanny valley response from me (I think it was a clip of an unrealistically large snake covering some hut) which was somehow fascinating and strange and captivating, and then scrolling down a few more, again I saw something kind of "unbelievable"... I saw a comment or two saying it's fake, and upon closer inspection: yeah, there were enough AI'esque artifacts that one could confidently conclude it's fake.

We'd known about AI slop permeating Facebook -- usually a Jesus figure made out of unlikely set of things (like shrimp!) and we'd known that it grips eyeballs. And I don't even know in which box to categorize this, in my mind it conjures the image of those people on slot machines, mechanically and soullessly pulling levers because they are addicted. It's just so strange.

I can imagine now some of the conversations that might have happened at Google when they choose to keep a lot of innovations related to genAI under the wraps (I'm being charitable here of their motives), and I can't help but agree.

And I can't help but be saddened about OpenAI's decisions to unload a lot of this before recognizing the results of unleashing this to humanity, because I'm almost certain it'll be used more for bad things than good things, I'm certain its application on bad things will secure more eyeballs than on good things.

[+] submeta|1 year ago|reply

What I desperately need is a model that generates perfectly made PowerPoint slides. I have to create many presentations for management, and it’s a very time consuming task. It’s easy to outline my train of thoughts and let an LLM write the full text, but then to create a convincing presentation slide by slide takes days.

I know there is Beautiful.ai or Copilot for PowerPoint, but none of the existing tools really work for me because the results and the user flow aren’t convincing.

[+] MyFirstSass|1 year ago|reply

Wow this is bad. And by bad i mean worse than leading open source and existing alternatives.

Is it me or does it seem like OpenAI revolutionized with both chatGPT and Sora, but they've completely hit the ceiling?

Honestly a bit surprised it happened so fast!

[+] NoNotTheDuo|1 year ago|reply

Their example videos: https://openai.com/sora/, of the doors opening, are hilarious.

1. The first set of doors doesn't have any doorknobs or handles. https://ibb.co/PwqfzBq

2. The second set of doors has handles, and some very large/random hinges on the left door. https://ibb.co/JkDtc6r

3. The third set doesn't have any handles, but I can forgive that, because we're in a spaceship now. The problem is that the inside of the doors seem to have windows, but the outside of the doors, doesn't have any windows. https://ibb.co/nwpXmtq & https://ibb.co/wr6v2g1

4. The best/most hilarious part for me. The doors have handles, but they are on the hinge side of the door. No idea how this would work. https://ibb.co/gWXDcfr

[+] neop1x|1 year ago|reply

There are more examples of its limitations.

The video with dogs shows three taxis transforming into one, the number of people under the tree changing https://player.vimeo.com/video/1037090356?h=07432076b5&loop=...

An example from the HunyuanVideo is terrible as well. Look at that awful tongue: https://hunyuanvideoai.com/part-1-3.mp4

And what we see in that marketing is probably the best they could generate. And I suppose it took a lot of prompt tweaking and regenerations.

The internet is already full of junk shorts and useless videos and soon there will be even more junk content everywhere. :(

[+] ckcheng|1 year ago|reply

I think they trained on one too many closet bifold doors [1].

If you look at the edge of the doors as they swing open, it seems their movement resembles bifold door movement (there's a wiggle to it common to bifold doors that normal doors never have). Plus they seem to magically reveal an inner fold that wasn't there before.

[1]: https://duckduckgo.com/?t=h_&q=interior+bifold+closet+doors&...

[+] Imnimo|1 year ago|reply

I feel like there is a sweet spot for AI generation of images and videos that I would describe as "charmingly bad", like the stuff we got from the old CLIP+VQGAN models. I feel like Sora has jumped past that into the valley of "unappealingly bad".

[+] azinman2|1 year ago|reply

Technically it's amazing that this is possible at all. Yet I don't see how the world is better off for it on net. Aside from eliminating jobs in FX/filming/acting/set design/etc, what do we really gain? Amateur filmmakers can be more powerful? How about we put the same money into a fund for filmmakers to access. The negatives are plentiful, from the mundane reduction of our media to monolithic simulacra to putting the nail in the coffin for truth to exist unchallenged, let alone the 'fine tunes' that will continue to come for deepfakes that are literal (sexual) harassment.

Humans are not built for this power to be in the hands of everyone with low friction.

[+] phtrivier|1 year ago|reply

Not available in France yet, I'd we interested to know if it's a matter of progressive rollout, or some form of legislation (EU or otherwise ?) that's making OpenAI cautious ? Something like the EU AI Act [1] ?

In a sane world, any video produced by Sora would be required to have a form of watermarking that's on par with what intellectual property owners require.

We've put people in jail for sharing copyrighted movies, and don't see why we would refrain from mandating that AI generated videos have some caption that says, I don't know, "This video was generated with AI" ?

People would not respect the mandate, and we would consider that illegal, and use the monopoly on force to take money out of their bank account.

I know, it sounds mad and soooo 20th century - maybe that's why OpenAI overlords are not deeming peasants in France worthy of "a cat in a suit drinking coffee in an office" and "you'll never believe what the other candidate is doing to your kids".

[1] https://www.imatag.com/blog/ai-act-legal-requirement-to-labe...

EDIT: apparently some form of watermarking is built in (but it's not obvious in the examples, for some reason.)

> While imperfect, we’ve added safeguards like visible watermarks by default, and built an internal search tool that uses technical attributes of generations to help verify if content came from Sora.

[2] https://openai.com/index/sora-is-here/

[+] owenpalmer|1 year ago|reply

MKBHD's review of the new Sora release:

https://www.youtube.com/watch?v=OY2x0TyKzIQ

[+] laweijfmvo|1 year ago|reply

Love the callout of them definitely training on his own videos

[+] ChrisArchitect|1 year ago|reply

Link should be annoucement post: https://openai.com/index/sora-is-here/

[+] jedberg|1 year ago|reply

> We’re introducing our video generation technology now to give society time to explore its possibilities and co-develop norms and safeguards that ensure it’s used responsibly as the field advances.

That's an interesting way of saying "we're probably gonna miss some stuff in our safety tools, so hopefully society picks up the slack for us". :)

[+] goykasi|1 year ago|reply

A bit off-topic, but how much does a 4-letter (or less) .com go for these days? I wonder if they bought this via an intermediary so that the seller wouldnt see "OpenAI" and tack on a few zeros.

edit: previously, this thread pointed to sora.com

[+] gzer0|1 year ago|reply

For the $20/month subscription: you get 50 generations a month. So it is included in your subscription already! Nice.

For the Pro $200/month subscription: you get unlimited generations a month (on a slower que).

[+] kylehotchkiss|1 year ago|reply

Who is the audience for this product? A lot of people like video because it's a way of experience something they currently cannot for one reason or another. People don't want to see arbitrary fake worlds or places on earth that aren't real. Unless it's video game or something. But I see this product being used primarily to trick Facebook users

I guess the CGI industry implications are interesting, but look at the waves behind the AI generated man. They don't break so much as dissolving into each other. There's always a tell. These aren't GPU generated versions of reality with thought behind the effects.

[+] cryptozeus|1 year ago|reply

Raises billion dollar, claims of agi by 2025, cannot handle new user sign up traffic.

[+] nycdatasci|1 year ago|reply

“Sora is here”

No it’s not. I’ve been trying to access all day: “Sora account creation is temporarily unavailable We're currently experiencing heavy traffic and have temporarily disabled Sora account creation. If you've never logged into Sora before, please check back again soon.”

[+] lacoolj|1 year ago|reply

A little worried how young children watching these videos may develop inaccurate impressions of physics in nature.

For instance, that ladybug looks pretty natural, but there's a little glitch in there that an unwitting observer, who's never seen a ladybug move before, may mistake as being normal. And maybe it is! And maybe it isn't?

The sailing ship - are those water movements correct?

The sinking of the elephant into snow - how deep is too deep? Should there be snow on the elephant or would it have melted from body heat? Should some of the snow fall off during movement or is it maybe packed down too tightly already?

There's no way to know because they aren't actual recordings, and if you don't know that, and this tech improves leaps and bounds (as we know it will), it will eventually become published and will be taken at face value by many.

Hopefully I'm just overthinking it.

[+] EternalFury|1 year ago|reply

Many people say:

> these things will get bigger and better much faster than we can learn to discern

I would like to ask “Why?”

Clearly, these models are just one case of “NN can learn to map anything from one domain to another” and with enough training/overfitting they can approximate reality to a high degree.

But, why would it get better to any significant extent?

Because we can collect an infinite amount of video? Because we can train models to the point where they become generative video compression algorithms that have seen it all?

[+] rushingcreek|1 year ago|reply

As there was no mention of an API for either Sora or o1 Pro, I think this launch further marks OpenAI’s transition from an infrastructure company to a product company.