> they have managed to reduce the required bandwidth for a video call by an order of magnitude. In one example, the required data rate fell from 97.28 KB/frame to a measly 0.1165 KB/frame – a reduction to 0.1% of required bandwidth.
A nitpick, perhaps, but isn't that three orders of magnitude?
We've already seen people use outlandish backgrounds in calls; now it's going to be possible to design similarly outlandish selves and actually be them in real time with this new invention. There's been a lot of discussion centered around deep fakes and their problems, and this is essentially deep-faking yourself into whatever you want.
Video calls are a very important form of communication at the moment. If this becomes as accepted as background modification, it would open the societal door to a whole range of self-presentation that until now was restricted to in-game virtual characters.
I wonder what kind of implications that could have. Would people come to identify strongly with a virtual avatar, perhaps more strongly than with their real-life "avatar"? It is an awesome freedom to have, to remake yourself.
> A nitpick, perhaps, but isn't that three orders of magnitude?
Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)
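For what it's worth, the quoted figures do work out to roughly three orders of magnitude; a quick back-of-the-envelope check:

```python
import math

# Figures quoted from the article: KB per frame before and after
before_kb = 97.28
after_kb = 0.1165

ratio = before_kb / after_kb          # ~835x reduction
orders = math.log10(ratio)            # ~2.92, i.e. roughly three orders of magnitude
percent = after_kb / before_kb * 100  # ~0.12% of the original bandwidth

print(f"{ratio:.0f}x, {orders:.2f} orders of magnitude, {percent:.2f}%")
```

So "a reduction to 0.1%" and "three orders of magnitude" describe the same example; only "an order of magnitude" undersells it.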
As these trends become more and more prevalent, I am shocked at how David Foster Wallace nailed this prediction in his book Infinite Jest.
Humans becoming more and more dependent on virtual face-to-face meetings and also relying on embellishment of their supposed appearance through the screen. It reminds me of how sci-fi authors predicted technology, but with complementary commentary on human psychology.
Sorry if it isn't directly related to the post, but it is so striking to me.
I imagine people in home-office situations would like to use this not only for the background but for themselves. I mean, if you are indoors all day, you might not be perfectly groomed for the day, so faking that would probably appeal to many people.
I think the ability, as someone mentioned, to have yourself look a bit tidier than you actually are (working from home) could be a huge benefit.
I mean taking away focus on things that don't matter in a virtual meeting, such as:
- Where you are sitting: via virtual backgrounds
- Your daily hairstyle status, or whether you have a nose pimple: via NVIDIA's AI showcased here
would be great.
Though I think replacing yourself with a "digital" avatar takes away many of the benefits an actual live meeting provides.
Sure is... if you stream your face at >30-50 Mbit/s. For contrast, the highest bitrate available on Twitch, used for streaming high-motion, full-screen-updating, twitchy 1080p@60 gaming, is ~6-8 Mbit/s.
David Foster Wallace predicted this in his novel Infinite Jest. Except they were static images inserted over a video phone, and the user had to keep their head positioned just right to make them work.
It's possible that the "order of magnitude" statement was the majority case, and the 0.1% statement was a best-case scenario. So one order of magnitude is to be expected, but three is possible.
My prediction is that people will just change their avatars as often as they change their personal fashion. For some that’s never and for others it’s every season or even more often.
This reminds me of a sci-fi novel I read in the nineties. The premise had something to do with actors who took on roles in virtual reality where their bodies are fit with sense-points. They're cast in live-action role-plays with wealthy remote clients. They're basically deep-fakes in VR.
A technology very similar to this plays a plot point in Vernor Vinge's 1992 novel A Fire Upon the Deep.
In his universe, both the interstellar net and combat links between ships are low bandwidth. Hence, video is interpolated between sync frames or recreated from old footage. Vinge calls the resulting video "evocations".
I was thinking of exactly this when I read the article.
The plot point being that when the bandwidth gets too low, the interpolation AI has to make lots of stuff up, you are not quite sure exactly what was said.
I seem to remember the bandwidth in the book was very tiny, a small number of bits per second(?), so the AI was taking the speech, compressing it into something more compact than text, then decompressing it at the other end into something that was more or less the same.
I highly recommend A Fire Upon the Deep. It's a rare mix of really interesting hard sci-fi with an actually good story and characters. Hard sci-fi often has very flat characters, but this is not a book that suffers from that.
It has a very very cool twist to explain the Fermi Paradox and is a really good example of a universe with one modified rule.
There is also a similar technology in Rob Reid's book After On. In that book the AI has the ability to "refocus" the person so that they are looking into the camera.
I believe this is huge and would create higher engagement if everybody was actually looking into the camera instead of to the side or up all the time, creating a more human and emotional relation with the people you are talking to.
Fundamentally, I don't know if people realise what we're on the verge of here.
It's effectively motion-mapped keypoints of the person projected onto a simulated model. I'm assuming the cartoonish avatar was used as an example partly to avoid drawing direct lines to the full implications.
- There's no reason this couldn't extend to voice modelling as well. (much clearer speaking at much lower bandwidth)
- There's no reason this couldn't extend to replacing your sent projection with another image (or person)
- Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)
- There's no reason you couldn't replace other people's avatars with ones of your own choosing as well.
- Why couldn't we model the rest of the environment?
Not there today, but this future is closer than many realise.
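As a sketch of what such a pipeline might look like (purely illustrative: `extract_keypoints` and `render_face` are hypothetical stand-ins for the neural models, which are not public), the wire format is just one reference image up front and a handful of keypoint floats per frame:

```python
# Hypothetical sender/receiver loop for keypoint-based video compression.
# The sender transmits a single full reference frame, then only keypoints;
# the receiver re-renders each frame from the reference plus the keypoints.

def sender(frames, extract_keypoints):
    yield ("reference", frames[0])                  # full image, sent once
    for frame in frames[1:]:
        yield ("keypoints", extract_keypoints(frame))  # a few dozen floats per frame

def receiver(stream, render_face):
    reference = None
    for kind, payload in stream:
        if kind == "reference":
            reference = payload
            yield reference
        else:
            yield render_face(reference, payload)
```

Swapping in a different `reference` image (another outfit, another face, another person) is exactly the avatar-replacement scenario described above; the receiver has no way to know.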
> - Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)
This would be interesting as an upgrade to the “name on resume” test.
Could also see a future company policy that runs people's data through a "sameness" filter before letting them into the company, to scrub bias.
This is a lot like Framefree.[1] That was developed around 2005 at Kerner Optical, which was a spinoff from Lucasfilm. The system finds a set of morph points in successive keyframes and morphs between them. This can do slow motion without jerkyness, and increase frame rate. Any modern GPU can do morphing in real time, so playback is cheap. There used to be a browser plug-in for playing Framefree-compressed video.
Compression was expensive, because finding good morph points is hard. But now cheap hardware has caught up to doing it in real time.
As a compression method, it's great for talking heads with a fixed camera. You're just sending morph point moves, and rarely need a new keyframe.
You can be too early. Kerner Optical went bust a decade ago.
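As a toy illustration of that idea (an assumed wire format, not the actual Framefree codec), the encoder could ship one keyframe's point set plus per-frame point moves, and the decoder would reapply the moves before warping the keyframe (the warp itself is omitted here):

```python
# Hypothetical morph-point compression: send a keyframe's point set once,
# then only per-frame point deltas. The decoder reconstructs each frame's
# point positions; a GPU warp of the keyframe image would follow.

def encode(frames_points):
    """Encode a sequence of per-frame point lists as (keyframe points, deltas)."""
    key = frames_points[0]
    deltas = []
    for pts in frames_points[1:]:
        deltas.append([(x - kx, y - ky) for (x, y), (kx, ky) in zip(pts, key)])
    return key, deltas

def decode(key, deltas):
    """Rebuild per-frame point lists by applying each delta to the keyframe."""
    frames = [key]
    for d in deltas:
        frames.append([(kx + dx, ky + dy) for (kx, ky), (dx, dy) in zip(key, d)])
    return frames
```

For a talking head, most deltas are near zero, which is why "morph point moves plus a rare new keyframe" compresses so well.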
This may be projecting expectations, but the example compressed video looks very slightly fake, in a way that is just a little uncanny-valley unsettling.
Perhaps the nets they're using are compressing out facial microexpressions, and when we see it, it seems just a little unnatural. Compression artifacts might be preferable because the information they're missing is more obvious and less artificial. In other words, I'd rather be presented with something obviously flawed than something I can't quite tell what is wrong with.
What I don't like about AI-processed images is that they are not real. I can't get past the fact that I am not looking at the picture as it is in reality, but at some smart approximation of the world that is not necessarily true.
Wonderful technical achievement but I think I’d rather squint through garbled video to see a real human.
Now if I can use it to add a Klingon skull ridge and hollow eyes to my boss or scribble notes on my scrum master’s generous forehead we might be on to something.
I see a lot of people being alienated by the fact that people could take on different avatars during their meeting. I would honestly accept that with no question.
In a work environment, I would expect the person I'm talking to to be presentable, ie their avatar would be presentable, so no goofy backgrounds or annoying accessories.
But the key for me is, I'd actually have something to see. So often at work I'm in meetings where three people have cameras on and the rest don't. I don't really care what they look like; I care whether they're engaged, nodding their heads, their facial reactions.
I don't always have my video on either; I don't have great upload speeds, so I usually appear as a big blob anyway. I'd happily have whatever representation of me be in my place if it meant people could see my reactions.
But after that, I was reminded of the paranoia (or not?) around Zoom and that, for an extreme example, the CCP was mining and generating facial fingerprints and social networks using video calls. It seems like this technology is the same concept except put to a useful purpose.
If the "Free View" really works well, that sounds like possibly the most important part. The missing feeling of eye contact is a significant unsolved problem in video calls.
I would imagine Apple doing this with FaceTime soon.
Using their own NPU (Neural Processing Unit), you could make FaceTime calls with ridiculously low bandwidth. From the Nvidia example, 0.1165 KB/frame even at buttery-smooth 60fps (I can literally hear Apple marketing the crap out of this) is 7 KB/s, or 56 kbps! Remember when the industry was trying to compress CD-quality audio (aka 128 kbps MP3) down to 64 kbps? This FaceTime video call would use even less!
And since the NPU and FaceTime are all part of Apple's platform and not available anywhere else, they now have an even better excuse not to open it up and to further lock customers into their ecosystem. (Not such a good thing with how Apple is acting right now.)
Not so sure where Nvidia is heading with this, since not everyone will have a CUDA GPU.
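For what it's worth, the arithmetic here checks out, using the article's per-frame figure:

```python
kb_per_frame = 0.1165                # KB/frame, from the Nvidia example
fps = 60
kbytes_per_sec = kb_per_frame * fps  # ~7 KB/s
kbits_per_sec = kbytes_per_sec * 8   # ~56 kbit/s, less than an old dial-up modem

print(f"{kbytes_per_sec:.1f} KB/s = {kbits_per_sec:.0f} kbit/s")
```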
> I would imagine Apple doing this with FaceTime soon.
What's this claim based on? Last I looked into FaceTime tech, they didn't do anything special - their quality comes from use of H.265 and the fact that iOS devices have good quality HW encoding blocks which provide good compression at low bandwidths.
FaceTime stream is usually also low motion / change so it's possible to achieve very good compression even with basic quality. Although they still don't quite match the quality of AV1 powered Google Duo on very poor connections.
The latest NPU can't deliver this kind of "deep compression" at 25fps, not even at 10fps. But in the future they could just send the Facemesh vertices and streaming text of the speech (and classically compress non-speech audio, if it's even desired, as most people are happy just using it to talk), so it would take less than 1 kbps of data.
I think Face ID could be used to create the point map, instead of generating it from an image. They could also use Face ID to prevent malicious deep fakes, e.g. only allow people to use this feature when Face ID confirms the user and the manipulated photo are the same person.
> Remember when the industry were trying to compress CD Audio quality (aka 128Kbps MP3) down to 64Kbps?
128 kbps MP3 is good enough for most people most of the time, but it isn't CD quality. Having said that, 64 kbps Opus is almost or about as good as 128 kbps MP3.
I wonder how well these techniques can be applied to audio.
> I would imagine Apple doing this with FaceTime soon.
And we'll notice it because somehow our own and other people's faces in FaceTime will start to subtly convey strangely aggravating emotions, as current Memoji do.
Isn't this just like Apple's animated emoji (Animoji), where your face is mapped to an emoji character? Except instead of a cartoon it's mapped to your actual face.
I wonder how weird it gets when you turn your head too much. This is very cool though - I was expecting to be able to tell a difference and maybe slip into uncanny valley territory but it looks good.
Big question though: is this just substituting the problem of not having good internet with the problem of not having a really fast Nvidia graphics card?
Now the person you are speaking to is going to be n% (partially) emulated, and n is going to increase in the future. One day there will be a paid feature letting you emulate 100% of yourself to respond to video calls when you are not available. And finally, they will replace you even without your knowing, and even after you die.
At what resolution? And does the output actually resemble the original image? Examples with a background other than a uniform one? It would be nice if they provided more than just screenshots.
It's not uncommon to see video calls at 100-150 kbps, which is ~10 KB/s, and this is for 7fps or so, including audio. So "per frame" that would be 1 KB or so (more for keyframes, less for delta frames).
So they say it can be 0.1KB, so better than that... Exciting, if realistic.
Also, add on top audio, and packet overhead :-) there is at least 0.1KB overhead for sending the packet (bundle it with audio if possible!)
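Working that estimate through, taking the ~10 KB/s video-only figure above and comparing against the article's claimed per-frame size:

```python
video_kbytes_per_sec = 10.0                  # rough video-only figure for a typical call
fps = 7
kb_per_frame = video_kbytes_per_sec / fps    # ~1.4 KB/frame on average
proposed = 0.1165                            # the article's claimed KB/frame
savings = kb_per_frame / proposed            # ~12x below even a low-bitrate call

print(f"{kb_per_frame:.2f} KB/frame today vs {proposed} claimed ({savings:.0f}x)")
```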
Can't wait to see the bugs! GANs are famous for some... interesting reconstructions. And better still, Nvidia will have no way to debug it since the model is essentially a black box.
People are extremely sensitive to subtleties in mouth articulation which facial landmark tracking tends to have trouble capturing. I question whether just a keyframe and facial landmarks are enough to generate convincing lip sync or gaze. I suspect that this is why the majority of the samples in the video are muted, which is a trick commonly used by facial performance capture researchers to hide bad lip sync results.
polytely | 5 years ago:
1. the phenomenon of VTubers https://en.m.wikipedia.org/wiki/Virtual_YouTuber
2. in the virtual animal crossing late night show, Animal Talking, the presenter's (Gary Whitta) avatar doesn't really resemble how the presenter looks in real life https://en.m.wikipedia.org/wiki/Animal_Talking_with_Gary_Whi...
3. I watch a lot of interviews with people in VR Chat, and it's very interesting how people seem to find it easier(?) to open up while they are embodying a character. https://youtu.be/KZWOXgc7PA4
Being able to experiment with identity in this way is really interesting to me, and I hope it becomes more mainstream with the proliferation of this technology
TuringNYC | 5 years ago:
Voila! My deep fake can stand in at meetings now while I code.
rhn_mk1 | 5 years ago:
Real-time silly hats for people I talk to and I'm sold.
zimpenfish | 5 years ago:
I dunno that I'd call it three orders - it's close at about 830x - but it's definitely not even close to being one order either.
[1] https://youtu.be/VBfss0AaNaU
Steltek | 5 years ago:
https://www.youtube.com/watch?v=t4DT3tQqgRM
pier25 | 5 years ago:
Maybe it will be exclusive to Android devices.
Or maybe it will work on any device (consuming CPU or GPU depending on the hardware) but only on Nvidia's communication app.
acomjean | 5 years ago:
https://blog.emojipedia.org/apples-new-animoji/
And how well does that work when you switch to screen sharing?
unicornporn | 5 years ago:
Petapixel is a blog spam site btw. Why not go to the source that is linked in the post?