The great thing about a product like this is that it's so easy to fake in a video.
I don't really buy that typing speed is a bottleneck for most people. We can't actually think all that fast. And I suspect AI is doing a lot of filling in the gaps here.
It might have some niche use cases, like being able to use your phone while cycling.
I fear that if the AI's gap-filling is competing with your own thoughts while you use it, your lazy brain goes the easy way and accepts the insinuations of the machine as its own, defeating the entire purpose of having a gizmo that writes down "your very own" thoughts.
That is, if this exists. But if it does, it doesn't do what you think it does for you.
It's possible the demo is faked, and I'm skeptical.
But I also don't think speed is really the point of a device like this. Getting out a device, pressing keys or tapping on it, and putting it away again: those attentional costs of using a device add up. I know something like basic note-taking would feel really different to me if I could just do the thing in the demo at high accuracy instead.
That's a big if, though - the accuracy would have to be high for it to really be useful, and the video is probably best-case clips.
I think the idea here is a very different mode of programming: less prompting, waiting, seeing results, and prompting again (where typing is not the bottleneck),
but more having a conversation with a really fast coding agent. That should feel like micro-managing an intern as they code really fast: you could start describing the problem, it could start coding, and you'd interject and tell it to do things differently. There the bottleneck would be typing, especially if you have fast inference. But with voice, you're coding at the speed of your thoughts.
I think doing that would be super cool but awkward if you're talking out loud in an office; that's where this device would come in.
Pulling out my phone, unlocking it, opening my notes app, creating a new note: that is a bottleneck.
Pulling out my phone, unlocking it, and remembering what the hotkey is today for starting Google/Gemini is a bottleneck. Damned if I can remember what random gesture lets me ask Gemini to take a note today (presumably Gemini has notes support now; IIRC the original release didn't).
Finding where Google stashes todo items is also a bottleneck. Of course that entails getting my phone out and navigating to whatever notes app they are shoved into (for a while, todos/notes lived inside a separate Google search app!).
My Palm Pilot from 2000 had more usability than a modern smartphone. This device can solve all of those issues.
Most people, I think, type very slowly on computers, and even more slowly on phones. I've had many, many people remark on how fast I type on both platforms, and it still confuses me; I think it's easy for me to overlook how slowly other people type.
I agree that it's an easy demo to fake. At the same time, if they're going to fake it, why make it seem so slow?
As to whether typing speed is a bottleneck for most people, maybe not most people, but definitely some people, and it's a massive bottleneck for me personally.
I think better when I'm talking, and since I started using speech-to-text it has increased my writing and coding speed by at least one, maybe two, orders of magnitude.
But you are right, the AI filling in gaps can really cause trouble with speech; goodness knows what it's doing with sub-speech.
One of the major ways you can speed up reading is to stop 'vocalizing' each word in your head. It does seem that thinking is much faster than 'thinking aloud' (in your head).
I am surprised no one here has noted that a device like this almost completely negates the need for literacy. That is huge. Right now people still need to interact with written words, both typing and reading. Realistically, a quiet vocal-input device like this could have a UX built around it that does not require users to be literate at all.
How convenient! Literacy has always been a thorn to efficient society, as books too easily spread dangerous heretical propaganda. Now we can directly filter the quality of information and increase cultural unification. /j
Having a voice narrate one thing at a time to you does not have the same informational bandwidth as written content, labeled buttons, etc on a page/screen. Not even close.
> We currently have a working prototype that, after training with user-specific example data, demonstrates over 90% accuracy on an application-specific vocabulary. The system is currently user-dependent and requires individual training. We are currently working on iterations that would not require any personalization.
I spent all of last year writing a techno-thriller about mind-reading. I'm sure this is about as factual, and, of course, nothing nefarious could possibly happen if this ever became real.
This is the stuff nightmares are made of. We already live in a "you have nothing to hide" society. Now imagine one where megacorps and the government have access to every thought you have. No worries, you've got nothing to hide, right? What would that do to our thought process and how we articulate our inner selves? What do we allow ourselves to even think? At some point it won't even matter, because we will have trained ourselves to suppress any deviant thought. I'd rather not keep going, because the ramifications of this technology make me truly sick to my stomach.
In the Ghost in the Shell universe, I always thought the telepathic conversations from cyberbrain to cyberbrain were one of the most fantastic and least realistic predictions for the future. But I was clearly wrong. We already have rudimentary telepathy 10 years ahead of schedule!
The accuracy is going to be the real make-or-break for this. In a paper from 2018 they reported 92% word accuracy [1]. That's a lifetime ago for ML, but they were also using five facial electrodes, where now it looks confined to around the ears. If the accuracy were great today, they would report it. In actual use I can see even 99% being pretty annoying, and 95% (one wrong word in every twenty) being almost unusable, for people who can speak normally.
[1] https://www.media.mit.edu/publications/alterego-IUI/
If you look at his facial movements in the video it looks as if he is pretty actively using his facial muscles, 'trying' to speak while moving as little as possible (which would cause the clearest signals to be emitted).
If that is what is happening, to me it feels like harder work than just speaking (similar to how singing softly but accurately can be very hard work). It would still be pretty cool, but only practical in use cases where you have to be silent and only for short periods of usage.
The presentation of this product reminds me of peak crypto when a 'white paper' and a two-page website was all anyone needed to get bamboozled into handing their money over.
What I picked up from this vision of the future: we will have mind-reading devices to capture our thoughts, but we will still be on a train commuting to work... dang...
So they came up with this groundbreaking idea but couldn't come up with a better use case than typing on a train.
Look, I can't help but appreciate that at least they are doing something interesting, as opposed to the vibe-coded one-shot VS Code forks we keep seeing.
I just imagine this going really wrong. My chain of thought would be something like: "Let's see, I need to rotate this image so I need to loop over rows then columns, ... gawd fuck this code base is shit-designed, there are no units on these fields, this could be so much cleaner, ... for each row ... I wonder what's for lunch today? I hope it's good ... for each column ... Dang, that response on HN really pissed me off, I'd better go check it ... read pixel from source ... tonight I'm meeting up with a friend, I'd better remember to confirm, ... write pixel to dest ...."
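For reference, the loop being fought over there is trivial; a minimal sketch, assuming a plain row-major list-of-lists pixel buffer (no real imaging library involved):

    def rotate_90_cw(src):
        # Rotate a row-major pixel buffer 90 degrees clockwise:
        # source pixel (r, c) lands at dest (c, rows - 1 - r).
        rows, cols = len(src), len(src[0])
        dst = [[None] * rows for _ in range(cols)]
        for r in range(rows):                 # for each row ...
            for c in range(cols):             # for each column ...
                pixel = src[r][c]             # read pixel from source ...
                dst[c][rows - 1 - r] = pixel  # write pixel to dest ...
        return dst

    print(rotate_90_cw([[1, 2, 3],
                        [4, 5, 6]]))  # -> [[4, 1], [5, 2], [6, 3]]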
For those thinking about speed: an average human talks anywhere from 120-240 words per minute. An average human who touch types is probably 1/3 to 1/2 as fast as that, while an average human on a phone probably types 1/5 as fast as that.
But for me speed isn't even the issue. I can dictate to Siri at near-regular-speech speeds -- and then spend another 200% of the time that took to fix what it got wrong. I have reasonable diction and enunciation, and speech to text is just that bad while walking down the street. If this is as accurate as they're showing, it would be worth it just for the accuracy.
I found it interesting that in the segment where two people were communicating "telepathically", they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch).
I have to wonder, if they have enough signal to produce what essentially looks like speech-to-text (without the speech), wouldn't it be possible to use the exact same signal to directly produce the synthesized speech? It could also lower latency further by not needing extra surrounding context for the text to be pronounced correctly.
> they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch).
This is an LLM thing. Plenty of open-source (or at least MIT-licensed) LLMs and TTS models exist that translate and can be zero-shot adapted to a user's voice. Direct audio-to-audio models tend to be less researched and less advanced than the corresponding (but higher-latency) audio-to-text-to-audio pipelines.
That said, you can get audio->text->audio down to around 400 ms of latency if you are really damn good at it.
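To make the pipeline shape concrete, here's a minimal sketch; every function in it is a stand-in, not a real API, and an LLM stage would slot between the two the same way. The latency win comes from streaming each stage per chunk instead of buffering whole utterances:

    def stt_stream(audio_chunks):
        # Stand-in speech-to-text: pretend each audio chunk decodes to one word.
        for chunk in audio_chunks:
            yield f"word<{chunk}>"

    def tts_stream(words):
        # Stand-in text-to-speech: pretend synthesis of each word is instant.
        for word in words:
            yield f"audio[{word}]"

    def respond(audio_chunks):
        # No full-utterance buffering anywhere: audio flows out while audio
        # is still flowing in, which is how end-to-end latency gets down to
        # roughly one chunk's worth of processing rather than the whole utterance.
        yield from tts_stream(stt_stream(audio_chunks))

    for out in respond(["chunk1", "chunk2", "chunk3"]):
        print(out)  # emitted incrementally, one chunk at a time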
From memory, I think other recent research is along this approach, but not yet good enough. Can't remember where I read this, but it was likely HN. I think the posted paper got 95% accuracy when picking from a known set of target sentences/words, but far less (60%?) when used for freeform input.
This is literally only as fast as speech-to-text; the only difference is that you don't have to speak aloud. Which is cool.
But for using a computer it's still annoying and worse than a mouse, because with a mouse you can click or drag and place in a second, while in this format you have to think "move the box from point A to point B (with coordinates or a description)" etc.
I think it's cool; I've been brainstorming how a good MCI would work for a while and didn't think of this. I think it's a great, novel approach that will probably be expanded on soon.
> But for using a computer it's still annoying and worse than a mouse, because with a mouse you can click or drag and place in a second, while in this format you have to think "move the box from point A to point B (with coordinates or a description)" etc.
You wouldn't use a regular WIMP [1] paradigm with this; that completely defeats the advantages you have. You don't need a giant window full of icons and other clickable/tappable UI elements; that becomes pointless now.
[1] https://en.wikipedia.org/wiki/WIMP_(computing)
What it could be really cool for is stuff like "open my house door", "turn off the lights", "text so-and-so", "start my car".
Stuff we want to do without pulling out our phone that doesn't require a lot of detailed instruction.
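As a toy sketch of the kind of command layer that implies (every phrase and action below is made up for illustration):

    # Toy intent table: decoded silent speech -> action. All names and
    # actions here are hypothetical, purely for illustration.
    INTENTS = {
        "open my house door": lambda: print("unlocking front door"),
        "turn off the lights": lambda: print("lights: off"),
        "start my car": lambda: print("remote start: engaged"),
    }

    def dispatch(decoded_text):
        # Normalize the decoded phrase and look it up; a real system would
        # do fuzzy/semantic matching rather than exact string equality.
        action = INTENTS.get(decoded_text.strip().lower())
        if action is None:
            print(f"no intent matched: {decoded_text!r}")
        else:
            action()

    dispatch("Turn off the lights")  # -> lights: off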
Integrating AlterEgo with the next generation of AR glasses could be the successor to these electronic bricks we carry in our pockets. My biggest frustration with wake-word assistants is that voice is inherently a broadcast channel.
There’s endless comedy about the confusion on a bus when someone's talking into Bluetooth and their neighbor thinks they’re being addressed. Silent Sense + AR gets your eyes up and around you, fixes posture, frees your hands and keeps the guy next to you out of the conversation.
I've seen tech like this on display at IFA 15 years ago already. I forget the name, but it was pretty hyped at the time. You could even demo it live. Sure, it was only used to steer a side-scrolling video game character, but it worked great with as little as half a minute or so of training.
Anyhow, AlterEgo just seems like another vaporware product that will never enter, or even begin to penetrate, the overall market. But let's see!
I can break 100wpm, especially if I accept typos. It's still much, much slower to type than I can think.
depends on what they are connected to in the back there.
https://www.media.mit.edu/projects/alterego/frequently-asked...
> Alterego only responds to intentional, silent speech.
What exactly do they mean by this? Some kind of equivalent to subvocalization [1]?
[1] https://en.wikipedia.org/wiki/Subvocalization
I suspect it's EMG through muscles in the ear and jaw bone, but that seems too rudimentary.
The TED talk describes a system which includes sensors on the chin across the jaw bone, but the demo has obviously removed that sensor.
https://www.media.mit.edu/projects/alterego/overview/
Also adding their press release here:
https://docsend.com/view/dmda8mqzhcvqrkrk/d/fjr4nnmzf9jnjzgw
Seems like vaporware.
(I think it was https://en.wikipedia.org/wiki/Oath_of_Fealty_%28novel%29 but can't find enough details to confirm.)
Going from voice input to silent voice input is a huge step forward for UX.