1. The scanning is fast; it takes longer to set up a fingerprint on a MacBook Air. It's just turning the head from side to side, then up and down, smiling, and raising one's eyebrows.
2. I used the M5, and the processing time to generate the persona was quick. I didn't time it, but it felt like less than 10 seconds.
3. My cheeks tend to restrict smiling while wearing the headset. It works, but people who know me understood what I meant when I said my smile was hindered.
4. Despite the limited actions used for setup, it reproduces a far greater range of facial movements. For example, if I do the invisible string trick, it captures my lips correctly (that's when you move the top lip in one direction and the lower lip in the opposite direction, as if pulled by a string).
5. I wasn't expecting this big of a jump in quality from the v1.
CorridorDigital recently used the tech to assist in remaking the rooftop bullet-time scene from The Matrix. It's used for making the environment instead of modeling it from scratch.
They also had an earlier video that more heavily featured Gaussian splats, using them to recreate the inside of the Universal Studios theme park without permission. I was very impressed with how it handles reflections on glass.
Oh man, that was weird: I opened the video in a private browsing window to avoid polluting my watch history, and the version I got was automatically translated into Dutch, including a voiceover which I presume is AI-driven to try and match the tone of the original video. Still a bit robotic though.
While I have my browser configured to prefer Dutch, my second preference is English; I wish I could tell it/them not to translate anything that's already in one of those languages.
It’s amazing tech, it’s just a solution looking for a problem.
It feels a bit like the original Segway’s over-engineered solution versus cheap Chinese hoverboards, then the scooters and e-bikes that took over afterwards.
Why would I be paying all this money for this realistic telepresence when my shitbox HP laptop from Walmart has a perfectly serviceable webcam?
I used my VP extensively recently while working remotely. It's not glamorous, but I used Screen Sharing with a MacBook, which grants you a virtual ultrawide monitor.
Once you're already in VR, it's nice to not have to break out for a meeting, and that's where Personas fit in.
It's not a killer app carrying the product, it's a necessary feature making sure there's not a gap in workflow.
Many use cases come to mind. If (retinal?) identities were private, encrypted, and “anonymized” in the handshake:
web browsing without captchas, Anubis, bot tests, etc. (a “human only” internet, maybe like Berners-Lee’s “semantic web” idea [1][2])
Non “anonymized”:
non-jury court and arbitration appearances (with expansion of judges to clear backlogs [3])
medical checkups and social care (e.g. neurocognitive checkups for the elderly, social services check-ins especially for children, check-ins for depressed or isolated people needing off-work social interaction, etc.)
bureaucratic appointments (customer service by humans, DMV, building permits, licenses, etc.)
web browsing for routine tasks without logins (banks, email, etc)
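As a toy illustration of what an "anonymized handshake" could mean here (entirely hypothetical, not any real protocol or Apple API): the device could derive a per-site pseudonymous token from a biometric template, so a site can verify "same human as last time" without sites being able to link a user to each other. All names and values below are invented for the sketch.

```python
import hashlib
import hmac

def site_scoped_token(identity_template: bytes, site_id: str, device_salt: bytes) -> str:
    """Hypothetical sketch: derive a per-site pseudonymous token from a
    biometric template. The raw template and salt never leave the device;
    each site sees a token that is stable for (user, site) but unlinkable
    across sites without the device-held salt."""
    return hmac.new(device_salt,
                    identity_template + site_id.encode(),
                    hashlib.sha256).hexdigest()

template = b"retinal-feature-vector-bytes"  # placeholder, not a real template
salt = b"device-held-secret"
a = site_scoped_token(template, "forum.example", salt)
b = site_scoped_token(template, "bank.example", salt)
# Same user yields different tokens per site, but a stable token per site.
```

A real scheme would need far more (revocation, replay protection, fuzzy matching of biometric templates), but the per-site unlinkability property is the core of the idea.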
I live halfway across the world from my folks, so I don’t see them often. I’d love something that gives me a greater sense of presence than a video call can.
I always viewed the current generation of 'cheap Chinese hoverboards' etc. as direct descendants of the Segway, and thought that Kamen and his believers weren't quite as ridiculous as we considered them at the time. They were just ahead of their time, expecting too much from too low a point on the technology curve.
I'm curious about the practical application of these avatars in everyday life, in real life rather than in the examples provided by the marketing department. At that price the Vision Pro still feels like a toy for wealthy people, or perhaps for CEOs of companies who can afford conferences in a virtual environment. But then, why exactly? The majority of the world tested video calls, conferences, and all sorts of other activities during the pandemic, like virtual crowds for TV programs (I'm pretty sure British panel shows showed grids of people as a substitute for the studio audience). News services were inviting their guests via video calls when Skype was still around.
yes, and to a degree which i find particularly interesting. it's never going to happen, because of your example
i prefer working in my vp and see a possible world where vp makes my remote team collaborate as if we were in the office, from the comfort of the most ergonomic location in my house
it solves this problem and 0.0001% of people are dorks like me who try and say, "they did it" while the rest of the world keeps going to work as before
all of the tech problems were solvable. people simply dont want to put a thing on their face and i think thats unsolvable
I would not describe creating an experience that feels like you are in the room with a group of people, even allowing cross talk, as a solution looking for a problem. I think it's the thing everyone slowly dying on Zoom calls wishes they could have.
I'm usually a fan of Norm's videos, but this might be the first time I've seen a Tested video that felt more like paid-promotion than an actual unbiased review. I don't keep up with it though.
Came this close to buying an AVP, before learning that it only mirrors a single screen with no virtual monitors.
Like, guize, c'mon. Virtual desktop can do three. For 3.5k you gotta do better. I don't particularly need a virtual me in space as much as I need more screens that can do, like, actual work.
I've always used 2-3 monitors pretty comfortably but with high latency AI agents adding more concurrency to my workflows I'm feeling very crowded. I would love a VR experience with an arbitrary number of screens/windows as well as more clearly separated environments (like having a visually different virtual office per project) that I can quickly switch between.
What is missing from the article is that creating a model from a few pictures is not that hard (well, it is to do well, but hear me out).
The difficult part is animating it realistically with the sensors you have, in real time.
Extracting signal from eye-gaze cameras with a slightly wider field of view, in a way that allows realistic rather than uncanny-valley animation, is quite hard to do on the general public. People's faces are all different sizes and shapes, to the point that even getting accurate gaze vectors is hard, let alone smile and cheek position (those are done with different cameras, not just eye gaze).
This is what fascinates me as well. I have to assume there's a neural net that effectively learns all of the possible muscles in the face. The limited sensor data gets fed in, and it's able to infer the full face shape. It seems perfectly plausible in theory, but I'm still impressed it seems to work so well in practice.
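As a toy sketch of that inference idea (Apple hasn't published how Personas work, so every shape and weight here is invented): a single linear layer standing in for whatever trained network maps a small sensor feature vector to a larger vector of face blendshape weights. The 52 outputs echo ARKit's publicly documented blendshape count.

```python
import random

def infer_blendshapes(features, weights, biases):
    """Toy stand-in for a trained network: one linear layer mapping a
    small sensor feature vector (gaze, jaw, cheek proxies) to a larger
    vector of face blendshape weights, clamped to [0, 1]."""
    out = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, features)) + b
        out.append(max(0.0, min(1.0, z)))  # blendshape weights live in [0, 1]
    return out

random.seed(0)
n_sensors, n_blendshapes = 8, 52  # 52 echoes ARKit's blendshape count
W = [[random.uniform(-0.2, 0.2) for _ in range(n_sensors)]
     for _ in range(n_blendshapes)]
b = [0.1] * n_blendshapes
features = [0.5] * n_sensors  # stand-in for one frame of sensor readings
weights_out = infer_blendshapes(features, W, b)
```

A real system would be nonlinear, temporal, and trained on the enrollment scan, but the shape of the problem is the same: few inputs, many correlated outputs, with the training data supplying the facial-anatomy prior.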
How's the latency? Latency is what makes Zoom et al. painful for me now: it ruins the ability to politely interject, give confirmation, etc. Does Apple do a better job of this than Google/Zoom? In theory you could get 20-30ms (just spitballing numbers I used to get playing shooters!) but I've never got anywhere near that with video conferencing.
Even so, latency-in-zoom kind of becomes an attribute of the medium and you learn to adapt. How does it feel with the Vision Pro though? The article talks about a really convincing sense of being in the same place with someone - how does latency affect that? (And does it differ based on if you're all physically in Silicon Valley or not?)
I would assume any added latency is negligible -- the sensors + interpretation + rendering should be very fast.
But you've still got all the network latency including Wi-Fi latency on both ends. And you always need a small audio buffer so discrete network packets can be assembled into continuous audio without gaps.
So I wouldn't expect this latency to be any different from regular videoconferencing.
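That audio buffer is usually a jitter buffer. A minimal sketch of the fixed-playout-delay variant (all numbers invented): frames arrive with jitter, and each is scheduled for playout a fixed delay after the first arrival, so the buffer delay is paid on every frame in exchange for smooth audio.

```python
def playout_times(arrivals, frame_ms=20, buffer_ms=60):
    """Fixed-playout-delay jitter buffer (a hypothetical sketch):
    frame i is played at first_arrival + buffer_ms + i * frame_ms.
    A frame arriving after its slot would be dropped or concealed."""
    base = arrivals[0] + buffer_ms
    return [(base + i * frame_ms, t <= base + i * frame_ms)
            for i, t in enumerate(arrivals)]

# Frames sent every 20 ms but arriving with network jitter (times in ms);
# the last frame arrives too late for its playout slot.
arrivals = [100, 121, 138, 165, 245]
sched = playout_times(arrivals)  # list of (playout_ms, arrived_in_time)
```

Real stacks (e.g. WebRTC) adapt the buffer depth to measured jitter, which is why a flaky Wi-Fi link raises perceived latency even when average ping looks fine.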
The laws of physics mean that the longer the path for your network packet, the higher the latency.
One way latency on the Internet across fiber is about 4μs to 5μs per kilometer in my experience.
For example, SF to Paris is ~40ms one way (it used to be 60ms 15y ago, latency and jitter have really improved).
Double those values for the round trip allowing you to interject in a conversation.
Add Wi-Fi, which has terrible latency with a lot of jitter (1ms to 400ms of jitter is not uncommon). Wi-Fi 7 should reduce the jitter and latency in theory; we shall see improvements in the coming decade. 5G did improve latency for me, so I don't doubt Wi-Fi will eventually deliver.
In other words, you need to be within 3 Mm (3,000 km) to get a chance at a 30ms round trip. And that's assuming peer-to-peer, with no Wi-Fi and no slow devices.
For a conference call, everybody connects to a central server acting as the relay. So now the latency budget is halved already.
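The arithmetic above checks out directly (the ~5μs/km figure is from the comment itself; the ~9,000 km SF-Paris distance is a rough great-circle estimate):

```python
FIBER_US_PER_KM = 5  # ~5 microseconds per km one way in fiber, per the parent comment

def rtt_ms(distance_km, us_per_km=FIBER_US_PER_KM):
    """Round-trip propagation time in ms over fiber, ignoring routing
    detours, queuing, Wi-Fi, and device processing."""
    return 2 * distance_km * us_per_km / 1000

sf_paris_km = 9000         # rough great-circle distance
rtt = rtt_ms(sf_paris_km)  # about 90 ms round trip, i.e. 2 x the ~40 ms one way
budget_km = 30 * 1000 / (2 * FIBER_US_PER_KM)  # distance budget for a 30 ms RTT
```

This is a floor, not an estimate: real paths add routing detours, last-mile links, and the jitter buffer, which is why measured round trips run well above the propagation minimum.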
> latency-in-zoom kind of becomes an attribute of the medium and you learn to adapt.
To some degree but not fully. When you adapt your brain is still doing extra work to compensate, similarly to how you don’t «hear» jet engine noise after acclimating to an airplane but it will still tire you to some degree.
I had Zoom and Teams meetings daily during Covid, and personal FaceTime calls almost daily for a while. I still get «Zoom fatigue» if a call goes on for over an hour, if I need to talk face to face during the call (i.e. no screen sharing, can’t disable video and look at something else, etc.) I’m fine if I don’t look at people’s faces but rather people’s screen sharing.
Norris's dodge on iPhone scanning is telling: processing on-device keeps it secure and magical, but imagine Personas popping up in FaceTime cameos or ARKit bridges to iPads. How soon until we see cross-device ecosystems like Microsoft's Mesh, but with Apple's polish? Eager for that affordability leap; until then, thanks for the vivid demo, Scott. Now I need a Vision Pro buddy just to test this out.
This video might help explain 3D Gaussian splatting.
https://www.youtube.com/watch?v=wKgMxrWcW1s
Essentially, an entirely new graphics pipeline with different fundamental techniques which allow for high performance and fidelity compared to... what we did before(?)
Cool.
Not quite; it’s just a way to assign a color value to a point in space (think point clouds) based on photogrammetry. It’s voxels on steroids, but it’s still drawn using the same techniques. It’s the magic of creating the splats that’s interesting.
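For the curious, the drawing step can be sketched in a few lines. This is a deliberately simplified isotropic 2D version (real 3D Gaussian splatting projects anisotropic 3D Gaussians into screen space and sorts them by depth); only the per-pixel front-to-back alpha compositing is shown here:

```python
import math

def splat_alpha(px, py, cx, cy, sigma, opacity):
    """Opacity of one isotropic 2D Gaussian splat at pixel (px, py):
    falls off with squared distance from the splat center (cx, cy)."""
    d2 = (px - cx) ** 2 + (py - cy) ** 2
    return opacity * math.exp(-d2 / (2 * sigma ** 2))

def composite(pixel, splats):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel.
    Each splat is (cx, cy, sigma, rgb_color, opacity)."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for cx, cy, sigma, color, opacity in splats:
        alpha = splat_alpha(pixel[0], pixel[1], cx, cy, sigma, opacity)
        for i in range(3):
            out[i] += transmittance * alpha * color[i]
        transmittance *= 1.0 - alpha  # light remaining for splats behind
    return out

# Two overlapping splats: red in front, blue behind
splats = [(0.0, 0.0, 1.0, (1, 0, 0), 0.8),
          (0.5, 0.0, 1.0, (0, 0, 1), 0.8)]
result = composite((0.0, 0.0), splats)  # mostly red, a little blue leaks through
```

The "magic" the parent mentions is the training side: fitting millions of Gaussian centers, covariances, colors, and opacities to the input photos by gradient descent. The rendering itself is the cheap part, which is why playback is fast.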
Sorry but this is a horrible video. The guy just spews superlatives in an annoying voice until 4:30 (of a 6 minute video mind you), when he finally gives a 10 second "explanation" of Gaussian splatting, which doesn't really explain anything, then jumps to a sponsored ad.
AceJohnny2|3 months ago
The previous beta ones were terrifying Frankenstein monsters. The new ones fooled my boss for 30 minutes.
There's a bit of uncanny valley left, nevertheless. My persona's smile reminds me of the horrible expressions people like to make in Source Filmmaker.
pndy|3 months ago
Perhaps how their heads and eyes move with this weird "fluid" effect, and the way-too-blurred faces?
Stalker_Aloy|3 months ago
https://www.youtube.com/watch?v=iq5JaG53dho&t=2s
extraduder_ire|3 months ago
https://www.youtube.com/watch?v=cetf0qTZ04Y
[1] <https://www.newyorker.com/magazine/2025/10/06/tim-berners-le...> [2] <https://newtfire.org/courses/introDH/BrnrsLeeIntrnt-Lucas-Nw...> [3] <https://nysfocus.com/2025/05/30/uncap-justice-act-new-york-c...>
quitit|3 months ago
To me it would be a shortcoming of the device if I couldn't show myself and the thing I'm working on at the same time.
raincole|3 months ago
Why do we have 4K monitors when 1920x1080 is perfectly fine for 99.999% of use cases?
If you look at the world through this lens called "serviceability" you'll think everything is a solution looking for a problem.
cubefox|3 months ago
https://www.youtube.com/live/ucRukZM0d1s?t=1h1m50s
https://zju3dv.github.io/freetimegs/
https://www.4dv.ai/
The videos can be played back in real-time, though they require multiple cameras to capture.
crazygringo|3 months ago
There's regular latency due to distance, just like on a phone call if you're chatting with someone halfway across the world.
But on a normal connection, audio and the persona should always be in sync, the same way audio and video are over Zoom or FaceTime.
There shouldn't be any extra latency for the audio only.
tantalor|3 months ago
Just in time for Vision Pro to go big. Right?