> they have managed to reduce the required bandwidth for a video call by an order of magnitude. In one example, the required data rate fell from 97.28 KB/frame to a measly 0.1165 KB/frame – a reduction to 0.1% of required bandwidth.
A nitpick, perhaps, but isn't that three orders of magnitude?
We've already seen people use outlandish backgrounds in calls; now it's going to be possible to design similarly outlandish selves and actually be them in real time with this new invention. There's been a lot of discussion centered around deep fakes and their problems, and this is essentially deep-faking yourself into whatever you want.
Video calls are a very important form of communication at the moment. If this becomes as accepted as background modification, it would open the societal door to a whole range of self-presentation that until now was restricted to in-game virtual characters.
I wonder what kind of implications that could have. Would people come to identify strongly with a virtual avatar, perhaps more strongly than with their real-life "avatar"? It is an awesome freedom to have, to remake yourself.
> A nitpick, perhaps, but isn't that three orders of magnitude?
Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)
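For what it's worth, the quoted figures do work out to roughly three orders of magnitude; a quick back-of-the-envelope check:

```python
import math

# Figures quoted from the article: KB per frame before and after
before_kb = 97.28
after_kb = 0.1165

ratio = before_kb / after_kb          # ~835x reduction
orders = math.log10(ratio)            # ~2.92, i.e. roughly three orders of magnitude
percent = after_kb / before_kb * 100  # ~0.12% of the original bandwidth

print(f"{ratio:.0f}x, {orders:.2f} orders of magnitude, {percent:.2f}%")
```

So "a reduction to 0.1%" and "three orders of magnitude" describe the same example; only "an order of magnitude" undersells it.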
As these trends become more and more prevalent, I am shocked at how David Foster Wallace nailed this prediction in his book Infinite Jest.
Humans becoming more and more dependent on virtual face-to-face meetings and also relying on embellishment of their supposed appearance through the screen. It reminds me of how sci-fi authors predicted technology, but with complementary commentary on human psychology.
Sorry if it isn't directly related to the post, but it is so striking to me.
I imagine people in home-office situations would like to use this not only for the background but for themselves. I mean, if you are indoors all day, you might not be perfectly groomed for the day, so faking that would probably appeal to many people.
I think the ability, as someone mentioned, to have yourself look a bit tidier than you actually are (working from home) could be a huge benefit.
I mean taking away focus on things that don't matter in a virtual meeting, such as:
- Where you are sitting: via virtual backgrounds
- Your daily hairstyle status, or whether you have a nose pimple: via NVIDIA's AI showcased here
would be great.
Though I think replacing yourself with a "digital" avatar takes away many of the benefits an actual live meeting provides.
Sure is... if you stream your face at >30-50 Mbit/s. For contrast, the highest bitrate available on Twitch, used for streaming high-motion, full-screen-updating, twitchy 1080p@60 gaming, is ~6-8 Mbit/s.
David Foster Wallace predicted this in his novel Infinite Jest. Except they were static images inserted over a video phone, and the user had to keep their head positioned just right to make them work.
It's possible that the "order of magnitude" statement was the majority case, and the 0.1% statement was a best-case scenario. So one order of magnitude is to be expected, but three is possible.
My prediction is that people will just change their avatars as often as they change their personal fashion. For some that’s never and for others it’s every season or even more often.
This reminds me of a sci-fi novel I read in the nineties. The premise had something to do with actors who took on roles in virtual reality where their bodies are fit with sense-points. They're cast in live-action role-plays with wealthy remote clients. They're basically deep-fakes in VR.
A technology very similar to this plays a plot point in Vernor Vinge's 1992 novel A Fire Upon the Deep.
In his universe, both the interstellar net and combat links between ships are low bandwidth. Hence, video is interpolated between sync frames or recreated from old footage. Vinge calls the resulting video "evocations".
I was thinking of exactly this when I read the article.
The plot point being that when the bandwidth gets too low, the interpolation AI has to make lots of stuff up, you are not quite sure exactly what was said.
I seem to remember the bandwidth in the book was very tiny, a small number of bits per second(?), so the AI was taking the speech, compressing it into something more compact than text, then decompressing it at the other end into something that was more or less the same.
I highly recommend A Fire Upon the Deep. It's a rare mix of really interesting hard sci-fi with an actually good story and characters. Hard sci-fi often has very flat characters, but this is not a book that suffers from that.
It has a very very cool twist to explain the Fermi Paradox and is a really good example of a universe with one modified rule.
There is also a similar technology in Rob Reid's book After On. In that book the AI has the ability to "refocus" the person so that they are looking into the camera.
I believe this is huge and would create higher engagement if everybody was actually looking into the camera instead of to the side or up all the time, creating a more human and emotional relation with the people you are talking to.
Fundamentally, I don't know if people realise what we're on the verge of here.
It's effectively motion-mapped keypoints of the person projected onto a simulated model. I'm assuming the cartoonish avatar was used as an example partly to avoid drawing direct lines to the full implications.
- There's no reason this couldn't extend to voice modelling as well. (much clearer speaking at much lower bandwidth)
- There's no reason this couldn't extend to replacing your sent projection with another image (or person)
- Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)
- There's no reason you couldn't replace other people's avatars with ones of your own choosing as well.
- Why couldn't we model the rest of the environment?
Not there today, but this future is closer than many realise.
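As a sketch of what such a pipeline might look like (purely illustrative: `extract_keypoints` and `render_face` are hypothetical stand-ins for the neural models, which are not public), the wire format is just one reference image up front and a handful of keypoint floats per frame:

```python
# Hypothetical sender/receiver loop for keypoint-based video compression.
# The sender transmits a single full reference frame, then only keypoints;
# the receiver re-renders each frame from the reference plus the keypoints.

def sender(frames, extract_keypoints):
    yield ("reference", frames[0])                  # full image, sent once
    for frame in frames[1:]:
        yield ("keypoints", extract_keypoints(frame))  # a few dozen floats per frame

def receiver(stream, render_face):
    reference = None
    for kind, payload in stream:
        if kind == "reference":
            reference = payload
            yield reference
        else:
            yield render_face(reference, payload)
```

Swapping in a different `reference` image (another outfit, another face, another person) is exactly the avatar-replacement scenario described above; the receiver has no way to know.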
> - Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)
This would be interesting as an upgrade to the “name on resume” test.
Could also see a future company policy that runs people's data through a "sameness" filter before letting them into the company, to scrub bias.
This is a lot like Framefree.[1] That was developed around 2005 at Kerner Optical, which was a spinoff from Lucasfilm. The system finds a set of morph points in successive keyframes and morphs between them. This can do slow motion without jerkyness, and increase frame rate. Any modern GPU can do morphing in real time, so playback is cheap. There used to be a browser plug-in for playing Framefree-compressed video.
Compression was expensive, because finding good morph points is hard. But now cheap hardware has caught up to doing it in real time.
As a compression method, it's great for talking heads with a fixed camera. You're just sending morph point moves, and rarely need a new keyframe.
You can be too early. Kerner Optical went bust a decade ago.
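As a toy illustration of that idea (an assumed wire format, not the actual Framefree codec), the encoder could ship one keyframe's point set plus per-frame point moves, and the decoder would reapply the moves before warping the keyframe (the warp itself is omitted here):

```python
# Hypothetical morph-point compression: send a keyframe's point set once,
# then only per-frame point deltas. The decoder reconstructs each frame's
# point positions; a GPU warp of the keyframe image would follow.

def encode(frames_points):
    """Encode a sequence of per-frame point lists as (keyframe points, deltas)."""
    key = frames_points[0]
    deltas = []
    for pts in frames_points[1:]:
        deltas.append([(x - kx, y - ky) for (x, y), (kx, ky) in zip(pts, key)])
    return key, deltas

def decode(key, deltas):
    """Rebuild per-frame point lists by applying each delta to the keyframe."""
    frames = [key]
    for d in deltas:
        frames.append([(kx + dx, ky + dy) for (kx, ky), (dx, dy) in zip(key, d)])
    return frames
```

For a talking head, most deltas are near zero, which is why "morph point moves plus a rare new keyframe" compresses so well.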
This may be projecting expectations, but the example compressed video looks very slightly fake, in a way that is just a little uncanny-valley unsettling.
Perhaps the nets they're using are compressing out facial microexpressions, and when we see it, it seems just a little unnatural. Compression artifacts might be preferable because the information they're missing is more obvious and less artificial. In other words, I'd rather be presented with something obviously flawed than something I can't quite tell what is wrong with.
What I don't like about AI-processed images is that they are not real. I can't get past the fact that I am not looking at the picture as it is in reality, but at some smart approximation of the world that is not necessarily true.
Wonderful technical achievement but I think I’d rather squint through garbled video to see a real human.
Now if I can use it to add a Klingon skull ridge and hollow eyes to my boss or scribble notes on my scrum master’s generous forehead we might be on to something.
I see a lot of people being alienated by the fact that people could take on different avatars during their meeting. I would honestly accept that with no question.
In a work environment, I would expect the person I'm talking to to be presentable, ie their avatar would be presentable, so no goofy backgrounds or annoying accessories.
But the key for me is, I'd actually have something to see. So often at work I'm in meetings where three people have cameras on and the rest don't. I don't really care what they look like; I care whether they're engaged, nodding their heads, their facial reactions.
I don't always have my video on either; I don't have great upload speeds, so I usually appear as a big blob anyway. I'd happily have whatever representation of me be in my place if it meant people could see my reactions.
But after that, I was reminded of the paranoia (or not?) around Zoom and that, for an extreme example, the CCP was mining and generating facial fingerprints and social networks using video calls. It seems like this technology is the same concept except put to a useful purpose.
If the "Free View" really works well, that sounds like possibly the most important part. The missing feeling of eye contact is a significant unsolved problem in video calls.
I would imagine Apple doing this with FaceTime soon.
Using their own NPU (Neural Processing Unit), you could make FaceTime calls with ridiculously low bandwidth. From the Nvidia example, 0.1165 KB/frame even at buttery-smooth 60fps (I can literally hear Apple marketing the crap out of this) is 7 KB/s, or 56 kbps! Remember when the industry was trying to compress CD-quality audio (aka 128 kbps MP3) down to 64 kbps? This FaceTime video call would use even less!
And since the NPU and FaceTime are all part of Apple's platform and not available anywhere else, they now have an even better excuse not to open it up and to further lock customers into their ecosystem. (Not such a good thing with how Apple is acting right now.)
Not so sure where Nvidia is heading with this, since not everyone will have a CUDA GPU.
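For what it's worth, the arithmetic here checks out, using the article's per-frame figure:

```python
kb_per_frame = 0.1165                # KB/frame, from the Nvidia example
fps = 60
kbytes_per_sec = kb_per_frame * fps  # ~7 KB/s
kbits_per_sec = kbytes_per_sec * 8   # ~56 kbit/s, less than an old dial-up modem

print(f"{kbytes_per_sec:.1f} KB/s = {kbits_per_sec:.0f} kbit/s")
```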
> I would imagine Apple doing this with FaceTime soon.
What's this claim based on? Last I looked into FaceTime tech, they didn't do anything special - their quality comes from use of H.265 and the fact that iOS devices have good quality HW encoding blocks which provide good compression at low bandwidths.
FaceTime stream is usually also low motion / change so it's possible to achieve very good compression even with basic quality. Although they still don't quite match the quality of AV1 powered Google Duo on very poor connections.
The latest NPU can't deliver this kind of "deep compression" at 25fps, not even at 10fps. But in the future they could just send the Facemesh vertices and streaming text of the speech (and classically compress non-speech audio, if it's even desired, as most people are happy just using it to talk), so it would take less than 1 kbps of data.
I think Face ID could be used to create the point map, instead of generating it from an image. They could also use Face ID to prevent malicious deep fakes, e.g. only allow people to use this feature when Face ID confirms the user and the manipulated photo are the same person.
> Remember when the industry were trying to compress CD Audio quality (aka 128Kbps MP3) down to 64Kbps?
128 kbps MP3 is good enough for most people most of the time, but it isn't CD quality. Having said that, 64 kbps Opus is almost or about as good as 128 kbps MP3.
I wonder how well these techniques can be applied to audio.
> I would imagine Apple doing this with FaceTime soon.
And we'll notice it because somehow our own and other people's faces in FaceTime will start to subtly convey strangely aggravating emotions, as current Memoji do.
Isn't this just like Apple's animated emoji (Animoji), where your face is mapped to an emoji character? Except instead of a cartoon it's mapped to your actual face.
I wonder how weird it gets when you turn your head too much. This is very cool though - I was expecting to be able to tell a difference and maybe slip into uncanny valley territory but it looks good.
Big question though: is this just substituting the problem of not having good internet with the problem of not having a really fast Nvidia graphics card?
Now the person you are speaking to is going to be n% (partially) emulated, and n is going to increase in the future. One day there will be a paid feature letting you emulate 100% of yourself to respond to video calls when you are not available. And finally, they will replace you even without your knowing, and even after you die.
At what resolution? And does the output actually resemble the original image? Examples with a background other than a uniform one? It would be nice if they provided more than just screenshots.
It's not uncommon to see video calls at 100-150 kbps, which is ~10 KB/s, and this is for 7fps or so, including audio. So "per frame" that would be 1 KB or so (more for keyframes, less for delta frames).
So they say it can be 0.1KB, so better than that... Exciting, if realistic.
Also, add on top audio, and packet overhead :-) there is at least 0.1KB overhead for sending the packet (bundle it with audio if possible!)
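Working that estimate through, taking the ~10 KB/s video-only figure above and comparing against the article's claimed per-frame size:

```python
video_kbytes_per_sec = 10.0                  # rough video-only figure for a typical call
fps = 7
kb_per_frame = video_kbytes_per_sec / fps    # ~1.4 KB/frame on average
proposed = 0.1165                            # the article's claimed KB/frame
savings = kb_per_frame / proposed            # ~12x below even a low-bitrate call

print(f"{kb_per_frame:.2f} KB/frame today vs {proposed} claimed ({savings:.0f}x)")
```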
Can't wait to see the bugs! GANs are famous for some... interesting reconstructions. And better still, Nvidia will have no way to debug it since the model is essentially a black box.
People are extremely sensitive to subtleties in mouth articulation which facial landmark tracking tends to have trouble capturing. I question whether just a keyframe and facial landmarks are enough to generate convincing lip sync or gaze. I suspect that this is why the majority of the samples in the video are muted, which is a trick commonly used by facial performance capture researchers to hide bad lip sync results.
polytely | 5 years ago:
1. the phenomenon of VTubers https://en.m.wikipedia.org/wiki/Virtual_YouTuber
2. in the virtual animal crossing late night show, Animal Talking, the presenter's (Gary Whitta) avatar doesn't really resemble how the presenter looks in real life https://en.m.wikipedia.org/wiki/Animal_Talking_with_Gary_Whi...
3. I watch a lot of interviews with people in VR Chat, and it's very interesting how people seem to find it easier(?) to open up while they are embodying a character. https://youtu.be/KZWOXgc7PA4
Being able to experiment with identity in this way is really interesting to me, and I hope it becomes more mainstream with the proliferation of this technology
TuringNYC | 5 years ago:
Voila! My deep fake can stand in at meetings now while I code.
rhn_mk1 | 5 years ago:
Real-time silly hats for people I talk to and I'm sold.
zimpenfish | 5 years ago:
I dunno that I'd call it three orders - it's close at about 830x - but it's definitely not even close to being one order either.
[1] https://youtu.be/VBfss0AaNaU
Steltek | 5 years ago:
https://www.youtube.com/watch?v=t4DT3tQqgRM
pier25 | 5 years ago:
Maybe it will be exclusive to Android devices.
Or maybe it will work on any device (consuming CPU or GPU depending on the hardware) but only on Nvidia's communication app.
acomjean | 5 years ago:
https://blog.emojipedia.org/apples-new-animoji/
And how well does that work when you switch to screen sharing?
unicornporn | 5 years ago:
Petapixel is a blog spam site btw. Why not go to the source that is linked in the post?