Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined. A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.
Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration. There are many failure points: it can "hear" you wrong, it can miss the wake word, it can hear correctly but interpret wrong, miss context clues, or simply be unable to process whatever the request is. In my experience, most normal people relegate voice commands to ultra-specific tasks, like timers, weather, and music, and that's that. Google and Alexa are relatively good at "trivia" questions, but Siri is a complete failure. All systems have edge cases that make them brittle.
I think there's potential here. Cortana was the most promising: an assistant that's integrated into the OS and can change any setting or perform anything on-screen would, again, be really awesome. We just don't have that. I think maybe OS-wide + GPT-4 (or later) might get closer to what we expect, but it's just not great right now. I really want to be able to say something as unstructured as "hey siri, create alarms every 5 minutes starting at 6am tomorrow" or "hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news". There /is/ power to be had, but nobody has really tapped it.
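For what it's worth, the alarm request is trivially structured once parsed; a quick Python sketch of what "create alarms every 5 minutes starting at 6am tomorrow" would have to expand to (the example date and the one-hour cutoff are my own assumptions, since the request never says when to stop):

```python
from datetime import datetime, timedelta

def alarm_times(start, end, step_minutes=5):
    """Expand "every 5 minutes starting at ..." into concrete alarm times."""
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += timedelta(minutes=step_minutes)
    return times

tomorrow_6am = datetime(2022, 11, 18, 6, 0)  # "6am tomorrow" (example date)
alarms = alarm_times(tomorrow_6am, tomorrow_6am + timedelta(hours=1))
print(len(alarms))  # 13 alarms: 6:00, 6:05, ..., 7:00
```

The hard part isn't generating the alarms, it's that no assistant exposes a way to create them in bulk from one utterance.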
Natural language is a fundamentally wrong vehicle for conveying information to a computer. It can be useful for some specific tasks: automated Q/A, simple interfaces to databases, stuff where I can't be properly f_ed to remember the syntax or the shortcut, like IDE commands.
But the idea it can replace formal language is fundamentally and dangerously incorrect. I agree with Dijkstra's quip, we shouldn't regard formal language as a burden, but rather as a privilege.
But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.
And even if you're all alone in a silent place, giving instructions out loud takes more time than configuring a screen, and will always be error prone, because the feedback will always be ambiguous and imprecise.
Except maybe if the feedback is on a screen, but then if there's already a screen, why not use it.
"hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news"
I think the problem with that is that even I, as a human, struggle to know for sure what you want.
You want to turn all the lights on in the house? Does that include the lamps in the bedroom? How about new lights that you add later? Or the ones in the garden? It's full of ambiguity. What device do you want to watch the news on? Or did you mean the radio? Do you want this to apply when you get back at 2am one night, meaning your family gets woken up when you turn on all the lights and start playing the news in their bedrooms?
I think that's probably why voice interfaces aren't likely to work well for anything beyond direct, specific, well-scoped requests: turn on the lights in the bedroom; turn off the heating at home; roll up the blinds; what's the weather like today; what's the remaining range on my car. They really struggle to deal with anything more complex – not so bad in theory, but really incredibly irritating when they make the wrong decision.
If you had some kind of 24-hour live-in assistant (a butler, maybe?), then they probably have the knowledge and intuition to make sensible decisions in response to fairly unstructured requests. But I think we're miles off getting a voice assistant to do it – not because they can't, necessarily, but because if they mess it up at all it's infuriating.
I might be in the minority, but I also don't want to add things to my life that make my environment noisier or that require me or others living with me to speak more. As much of a Star Trek fan as I am, I never found "The Computer" to be appealing, and always thought of it more as an artistic device. It's a lot easier to communicate a character's intent / action if they are vocalizing it for performance. Even in scenes where they are "typing" something into the computer, they will inevitably be communicating to the captain or another character what they are doing.
In practical reality these interfaces feel, to me, extremely inefficient. As someone who doesn't particularly like to speak, and prefers silent environments, I find these interfaces require more energy from me to use. Unless they are serving someone who has a physical impairment, I don't see what problems these solve, but I can identify lots of problems that they introduce (not only noise but privacy/security vulnerabilities, etc.). Personal preference.
Timers and reminders alone are enough to make them a pretty nice thing to have though.
I don't really want them to be all that much more powerful, because natural language can be imprecise, and... there's just not much that I want to automate in a home setting beyond some real simple timers for lights and stuff.
What if I had a bad day and didn't want to see depressing news? Or what if I came home and was talking on the phone when it turned the news on?
True automation as opposed to just telemetry and remote control can easily be annoying more than helpful.
I like the idea of automation... but I don't actually... automate anything aside from timers and reminders.
If I were in this space, I would build voice assistants for very specific situations where you cannot type, like driving, cooking, or doing sports. There is lots of potential, but the big players are trying to build a generic tool for every situation, which is a super hard problem.
Voice assistants have reached the Unhelpful Valley stage.
When they were a novelty I recall the excitement of trying new commands and layering in context; after many failures I've been conditioned to attempt, and expect success with, only generic queries.
To me what’s interesting is that MS smelled that it was a problem a while ago and pulled the plug before it ate a hole in their wallet, but Amazon and Google keep plugging along, ploughing money into a bottomless pit. Apple has a different play; it looks like they are controlling their losses quite well there, and Siri may act as a slight loss leader for other products.
> Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration
My biggest frustration with Alexa is getting it to play the podcasts I want to listen to. Even popular podcasts with English names are hard to get just right for Alexa. The same goes for song titles and bands that are not popular, or that are in other languages.
Usually when I want to take a shower, I try to get the podcasts/music to play for 2 minutes, then sigh, give up and just say "Alexa play Britney Spears".
>> A voice system that can do literally everything one can do with a keyboard and a mouse would be magical, but no system offers that.
And even then, a voice assistant is essentially a user interface, not a product or service.
It could be a service if you could reliably say "Alexa, plan my trip to customer X the week of the 30th and send me my itinerary". But for now they are an alternative to a phone UI.
The potential would be there if they focused on the assistant part and took voice as just one means of interacting with the assistant, alongside other means like clicking, typing, and showing complex information on a screen.
Voice alone sucks; it's just too limited to be useful on a grand scale. Similarly, command lines suck too. The shell in general has the same problems that voice assistants have, except that it has more value and has had decades to mature into something actually useful. And today we have Unix shells which reduce the problematic parts considerably, and still receive constant improvements. This is missing for voice assistants, because Unix shells grow and improve in an open space, where everyone can add their own things. That is not happening in big tech.
I don't think this is actually reliably possible, due to the fact that while grammar does tend to follow patterns sometimes, we're fundamentally dealing with an exponential number of ways to say things to a voice assistant.
In the spirit of the title of this post, someone else also has to say something.
If your argument is that this is a "non-visual command line" there's slim hope of the layperson learning a whole secret grammar without even a goddamn man page just to do their menial tasks.
>Voice assistants are basically just mainstream non-visual command-lines, and it's unsurprising to me that something that relies heavily on memorization and extremely specialized "skills" isn't quite taking off in the way it was imagined.
This got me thinking. Voice recognition is basically a commodity now... there are open source AI engines that can do it offline really well. So the recognition part is solved; you can just grab it from your distro's package manager. Now there's just the language part.
Thing is, I don't want to speak to my computer using English. Aside from the enormous practical problems in natural language processing you've outlined, I just find the idea creepy[1].
What I want is to unambiguously tell it to do arbitrary things. I.e. use it as an actual computer, not a toy that can do a few tricks. I.e. actually program it. In some kind of Turing complete shell language that is optimized for being spoken aloud. You would speak words into the open source voice recognizer, it writes those to stdout, then an interpreter reads from stdin and executes the instructions.
Is there any language like this? What should it look like?
And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.
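I don't know of an existing language built for this, but the stdin/stdout loop described above is easy to prototype. A toy sketch in Python, with an invented vocabulary (number words plus "plus"/"times") standing in for a real spoken syntax:

```python
NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def run(words):
    """Interpret a stream of spoken words as a tiny stack language:
    number words push values; "plus"/"times" pop the top two values
    and push the result."""
    stack = []
    for word in words:
        if word in NUMBERS:
            stack.append(NUMBERS[word])
        elif word == "plus":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif word == "times":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        # unknown words are silently ignored, since a recognizer emits noise
    return stack

# Wired to a recognizer it would be: run(sys.stdin.read().split())
print(run("two three plus four times".split()))  # [20]
```

A stack language fits spoken input well because it needs no nesting or punctuation, only a linear stream of words.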
To me the hardest problem is simply remembering what every light on my network is named. Did I call the light next to my desk “desk light” or did I call it “office light”? If I don’t get the name exactly right, I cannot control the light. Multiply that by every other light in the house and it becomes a lot to remember. I have probably 15 lights controlled by Alexa and I can only remember the name of like three of them. Thus most of the time it is just “Alexa turn on the lights” so it can turn everything on in a room.
If these voice assistants were smarter about “alternative” names for every device it might be easier to use. But as it stands, it’s kind of a pain because the way you phrase each request is so unforgiving…
Oh yeah, and god help you if your device name is similar to your room name. If your room is “office” (or did I name it “the office”?) and your light is “office light”, Alexa is gonna have a bad time telling the two apart.
I have no clue how to fix this…
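One obvious mitigation, purely a sketch and not anything Alexa actually exposes, is fuzzy matching of the spoken name against the registered device list, so near-misses still resolve (the device names here are made up):

```python
from difflib import get_close_matches

DEVICES = ["office light", "kitchen light", "bedroom lamp", "garden lights"]

def resolve_device(spoken, devices=DEVICES):
    """Return the closest registered device name, or None if nothing is close."""
    matches = get_close_matches(spoken.lower(), devices, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(resolve_device("office lights"))  # office light
print(resolve_device("bedrom lamp"))    # bedroom lamp
```

The cutoff is the tricky part: too low and "office light" swallows requests meant for the office fan; too high and you're back to exact-name memorization.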
PS: this is why I question steering wheel free self driving cars. How will we tell these things exactly where to go when we cannot even reliably tell our voice assistants exactly what light to turn on?
I think the biggest potential is with Microsoft Teams in business. It is ubiquitous in people's work lives, has access to data, and is integrated with everything. And adding Cortana to calls would be an easy step for people to understand and learn. People would say "cortana share my screen". People would learn phrases from each other.
> There /is/ power to be had, but nobody has really tapped it.
This kind of thing can't be built for modern mainstream operating systems because they generally prevent subjugation of the OS components and other programs, even if the user wants that, ostensibly for security reasons.
Unlike a human operator, an assistant "app" can only operate within the bounds of APIs defined by the OS vendors and third-party developers. Gone are the days of third-party software that extends the operating system in ways that the overlords couldn't (or wouldn't) dream of.
I think you're identifying some of the right problems here. All voice assistants are based on turn-taking, and when the VoiceAI hits one of those failure points and just comes back with "I didn't get that" it leaves the user in a frustrating state trying to debug what's wrong.
I work at SoundHound where we've been worried about these issues. (I'm going to plug our recent work...) Our new approach is to do natural language understanding in real time instead of at the utterance (turn-taking) level. That way we can give the user constant feedback in real time. In the case of a screen, that means the user sees right away that they are understood, and if not, gets a better hint of what went wrong. For example, a likely mistake is an ASR mistranscription of a word or two.
We still need to prove this is a better paradigm for VoiceAI in products that people can try for themselves, and are working towards that goal. I hope that voice interfaces that were clunky with turn-taking will finally be more naturally usable with real-time NLU.
I tried Amazon's Alexa, the top-end model with a display. Often it would taunt you about new/interesting things on the screen, but I could never get them to work. I had to memorize things to get even the basics working. I ended up unplugging it.
However, Google's Assistant in comparison worked great: no memorization, and very useful. Sure, time, weather, timers, and alarms worked great with a very flexible set of natural language queries. So did more complex things, like what the temperature will be tomorrow at 10pm, simple calculations, and unit conversions. But IMDB-like queries about directors, actors, which movies someone was in, etc. also generally worked well. It seemed to really understand things, not just "A web search returned ...". Even something like the wheelbase of a 2004 WRX would return an answer, not a search result.
With all that said I'm looking for a non-cloud/on site solution, even if it requires more work, most recently noticed https://github.com/rhasspy/rhasspy
The big issue is that there's no clearly defined interface for users. What commands are possible? Nobody knows. So people default to the most obvious things, like setting a timer. Is it possible to set up your own commands and build your own workflows? AFAIK, no. So the tech is essentially dead in the water until companies fundamentally rethink what they're trying to do with voice assistants.
Talon voice can do everything a keyboard and mouse offers, plus more (contextual awareness, higher level abstraction). Very powerful in combination with modal editing. I'm not affiliated, just a user.
Granted, this is for a specific user base and yes, not in coffee shops.
This timeline is such a mishmash of mediocrity. Voice assistants could have been a vibrant ecosystem of different personalities, like, say, buying a Darth Vader voice pack or having your computer sound like a snooty English butler.
There's a great little game series called Megaman Battle Network (Rockman.exe in Japan) which diverges from the mainline by showing an alternate universe where scientists focused on AI instead of robotics, resulting in a world where "Navis" are ubiquitous.
I wonder, what if our early software engineers focused on bringing natural voice control to CLIs, before perfecting GUIs first?
I think these assistants just need to give the user a way to edit interpretations.
A 'debug' area that lets you ask a command, see what was interpreted - and immediately edit or click "that's not what I wanted". But not an afterthought and not a cumbersome process like setting up an automation that is triggered by specific commands.
Imagine telling your voice assistant "You're wrong, as usual" and instead of it giving you the boilerplate "I'm sorry", it actually offered a way to improve itself.
I would think that a good command-line is one that responds to me within milliseconds on a crapbox i386 machine, and I can COMMAND it what to do.
A good command-line is not a binary blob that cannot parse simple instructions correctly.
At the same time, siri seems to be getting slower and fatter every iteration so perhaps it is becoming more human ;)
Another pitfall of most voice assistants is that they are really designed first with the corporation in mind rather than the user. Most are proxies for surveillance, advertising, or are just steering consumers back to a preferred set of walled-garden services.
Yeah, the whole idea has a lot of potential that seems like it should be within reach, but somehow it's 2022 and my phone still can't handle "hey Google, play my driving playlist on Spotify."
Your queries continue to be money-sinks -- even in your ideal case, you aren't buying anything! This query costs them money but earns them nothing. This is useless.
Me and voice assistants are like me on the ballroom dance floor. I loved to take the lessons and learn all sorts of moves and chain them all together and look impressive, but when I got onto the floor with a partner, I just wouldn't know what to do or where to start. I kept to the "basic" steps and maybe a timid little turn once in a while.
Maybe it's possible to learn a working vocabulary and know how to command a voice assistant. I know my way around several command lines, but I have no idea what to say to Hey Google.
I think it's fairly clear now that the only time a voice-based UI is better is when the user is unable to use their hands. Driving, or cooking in the kitchen, seem to be the most successful cases. There are barely any other strong use cases.
On top of that, the general distrust of the privacy of these systems has stopped a significant number of people (myself included) from wanting to use them at all. I don't have an in-home device, and I have turned off Siri on my Apple devices.
The main issue for me is that they are not stateful. Perhaps the main thing in the role of an assistant is to keep state. You want someone or something that understands you and what you want, so that you don't have to put too much thought into it.
If you tell it you want more coffee, it should know what you like and suggest a mixture of brands you bought before and new ones you may enjoy. If you tell it you're hungry, depending on the time of day it could suggest you some takeaway you've ordered previously or something else you may like. If you say the same some other time it may suggest recipes based on what you have at home or it may suggest nearby restaurants. It should keep track of your friends and otherwise and tell you when their birthdays are coming and it would be nice if it could even suggest some presents based on things you've told the assistant before, or their wishlist on amazon or something else.
There are a lot of things assistants could do, but it needs to know you. The model where everyone has the same assistant doesn't quite work out.
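A toy illustration of the point above: even a trivial per-user state store changes what a request can mean (the purchase history and the rules here are invented):

```python
from datetime import time

class Assistant:
    """Toy stateful assistant: remembers past purchases and answers from them."""

    def __init__(self):
        self.history = []  # past purchases, most recent last

    def record_purchase(self, item):
        self.history.append(item)

    def suggest(self, request, now):
        if request == "more coffee":
            # suggest brands bought before, most recent first, deduplicated
            coffees = [i for i in self.history if "coffee" in i]
            return list(dict.fromkeys(reversed(coffees)))
        if request == "hungry":
            # time-of-day context: takeaway in the evening, a recipe otherwise
            return "takeaway" if now >= time(18, 0) else "recipe"
        return None

a = Assistant()
a.record_purchase("lavazza coffee")
a.record_purchase("illy coffee")
print(a.suggest("more coffee", time(9, 0)))  # ['illy coffee', 'lavazza coffee']
print(a.suggest("hungry", time(19, 30)))     # takeaway
```

The same two words ("more coffee", "hungry") produce different, personal answers only because the assistant carries state between interactions.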
It's useful for trivial unambiguous tasks where you have your hands full or don't want to touch your device or it's dangerous to. That's all I can muster mine for.
"Hey Siri, add more toilet paper to the shopping list" (while pooping)
"Hey Siri, shuffle my music" (while driving)
"Hey Siri, countdown 10 minutes" (while shoving a pizza in the oven)
Anything else is a shit show. Anything where trust or accuracy is involved i.e. mutating data, spending money, absolutely no way can I trust it at all and never will.
"Hey Google, turn off the den light switch in 30 minutes"
"Sorry, for safety reasons we cannot...."
It's a light. Because it heard "switch" it thinks there might be some power tool connected to it and won't let me set delayed actions. I want to be intelligent with it, like "Hey Google, turn the lights on when the sun comes up every day", but no one has gone to that next step.
Or how about "Hey Google, turn off the tone played when you say Hey Google". These settings aren't accessible from the voice interface itself.
Can't wait for Alexa to fail so my SmartTV will stop nagging me to use integrations I will never use. Anti-competitive but whatevs.
Voice assistants are shit. The number of times my friends have got alexa to turn the light on first time is functionally zero.
And, they don't really explain the syntax constraints. Which are massive.
Try, ab initio and without knowing how to do it, to get OK Google to open an arbitrary Google-authored app and direct it to do something. Compare that to learning the OS UI keyboard shortcuts or AppleScript (which, btw, like Windows, is basically fully documented because all the libraries are self-documenting for their call structures).
The voice interfaces are universally badly designed because spoken command sentences are not well understood as a modality of command, distinct from mouse, gesture, touch or keyboard.
Until voice is baked in with a documented syntax in "man" format, i won't believe its first class.
How do I even know for any arbitrary app what voice directives it uses? How do they correlate to any other command input? How consistent is this with other commands in other apps? Does "stop now" always mean the same thing between a mapping routing app, and a tape backup app? Isn't "stop" contextually defined in a way ^C isn't?
A lot of the more useful information retrieval tasks involve a feedback loop. If I’m shopping for a product, I may enter a generic term. Then the system sends me images of products matching that term. Then I tell the system which product image is closest to what I want. Then the system sends me reviews of that product. Then I read the reviews and realize this is not what I actually want…repeat until I’m satisfied.
I can’t do this loop with modern voice assistants.
A voice assistant with contextual conversation skills, and access to an “always on” visual monitor (home projector or AR glasses) would definitely increase utility by 10x or more
The oven is in another area of the house, so when I come back to squeeze in some work, I often just say "Voice Assistant, set a timer for 10 minutes!"
And that's about it.
Apart from that I worked on some chat-bots in the past and it's the very same thing to me, just even a bit worse because of audio.
Natural language processing simply isn't there, AND there are just a few very niche use cases in my eyes.
So if they go away again, I won't cry or rather my tears will dry fast.
Somebody develops something, gets a lot of buzz, doesn't deliver, buzz dies slowly, and then several years in the future somebody else actually builds the tech stack needed for it, and it takes off again.
Voice assistance, chat robots [1]... the metaverse.
[1] Several years back I attended a chat in my city where some people from IBM showed us how to implement a chatbot I think with Watson.
Beyond the trivial examples, it was nearly impossible to implement anything (Or the people giving the chat didn't know how to implement those) but the documentation was nowhere to be found beyond those trivial examples.
While I get it why Amazon, Microsoft, and Google (?) might want to reduce development costs for voice interfaces, it seems like Apple is locked in to supporting Siri since one of its products, the Apple Watch, really needs Siri to get full value from it. I like to go about my day without carrying a phone (if I don’t need to take pictures) and the Apple Watch is a great compromise that allows getting calls and text messages, but at the same time is not intrusive as an iPhone. No one sits in a restaurant staring lovingly at content on their Apple Watch, ignoring nearby people and their environment.
Pardon my getting a bit off topic, but it is interesting to see the “belt tightening” by FANGs: it seems like everyone is cutting excess staff and looking hard at which products make money and which are money losers. This may seem like a good thing except for newer product categories like AR/VR that will need a lot of experimentation to get right.
It's not the technology. Voice transcription works great, and from the point of view of extracting meaning, even pattern matching would do better than the embarrassing failure of today's voice assistants. It's a matter of product. We are living in an IT world where there are no great people able to turn the technological potential into useful things.
Conversational interfaces just aren't any good, and until they're perfect they won't be useful. The primary issue I've seen (having worked with Slack bots) is discoverability: how do you find out what a bot can do without asking lots of open questions? The most useful bots I saw were those that didn't try to be conversational but had a fixed grammar for commands and questions (along with a good help response). And at least with chatbots you've got the textual context you can scroll back on and read, as opposed to trying to keep track of what's happening in a conversation with a non-human you can't see.
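The fixed-grammar-plus-help pattern described above is simple to implement; a minimal sketch (the command names and responses here are made up):

```python
COMMANDS = {}

def command(name, help_text):
    """Register a handler under a fixed, discoverable command name."""
    def register(fn):
        COMMANDS[name] = (fn, help_text)
        return fn
    return register

@command("timer", "timer <minutes> - start a countdown")
def timer(args):
    return f"timer set for {args[0]} minutes"

@command("weather", "weather - today's forecast")
def weather(args):
    return "sunny, 21C"

def handle(message):
    """Dispatch on the first word; "help" or anything unknown gets the full command list."""
    word, *args = message.split()
    if word == "help" or word not in COMMANDS:
        return "\n".join(help_text for _, help_text in COMMANDS.values())
    fn, _ = COMMANDS[word]
    return fn(args)

print(handle("timer 10"))  # timer set for 10 minutes
print(handle("help"))      # lists every command with its help text
```

Making the failure mode "show the whole grammar" instead of "I didn't get that" is exactly the discoverability a good help response provides.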
Despite the tremendous amount of effort that's gone into creating large language models, there is still no way to hold a goal-directed conversation with voice assistants. There is a lot of implicit context in normal human speech that needs to be inferred or clarified. None of these speech assistants can handle anything more than the most rudimentary clarification dialog.
I'm really not surprised. Some sample items I bought in the last months in order of difficulty both for me and (I think) for a voice assistant:
1. Hey Alexa/Google/Whatever, buy me a saddle, Selle Italia SMP Extra, white. I expect to get a proposal for the best price + shipment and that's it. Easy.
2. Buy me a replacement head for my Parktool pump. Oops, it's a PFP-3 pump. Ah, the shop selling the saddle doesn't have one... I eventually discovered that that shop has an identical part for a pump of a different brand, probably the same Chinese manufacturer. I think this would be hard for an assistant, borderline to impossible.
3. Buy me a magnetic mosquito screen of at least X x Y cm, not adhesive or velcro. Oops, I need a frame to mount it on? I'll spare you the discoveries I made along the way. Either the voice assistant is equivalent to a professional installer and can see my door, or I'll always do a search with a browser, and watch many videos too.
Voice assistants are the most useful new UI to me since smartphones.
I use them to get travel time estimates, reference facts, music and podcasts, rewind / forward / pause / resume / next / previous, timers, lights, and occasionally broadcast messages on home speaker devices.
I use voice UI while cooking, running, mowing, raking, driving, changing diapers, putting on my kids' clothes, and while sitting with my wife.
My two young children (5 and almost 2) love voice UI. We ask about the animal of the day, what does this animal sound like, play that song. My older child is beginning to set timers with it. My wife recently said having a nearby voice assistant is important in a home configuration discussion.
My family and I sometimes have frustrating experiences with voice UI, like failed hotwords, shifting syntax, answering on an unexpected device, and slow responses. But we still use it frequently, and our overall sentiment with voice UI is positive.
Conversational Computing is just not ready. Even a better AI, alone, will not solve it 100%.
But because of the 2010s meteoric rise of FAANG (in the public imagination and the stock market), they think they can quickly push all these immature paradigms (IVA, AR, and now VR) into the mainstream.
All these technologies have existed for decades, but they are not ready! Billions of dollars of investment and marketing are not enough to make them so.
Even as small a leap as the tactile portable device took us almost 20 years to reach a conclusive mainstream form (the iPhone).
We need to accept that IVA/AR/VR are exponentially larger leaps and should remain side-shows for a very long time to come.
For example, Microsoft is finally acknowledging this, with HoloLens being now just "to help you solve real business problems".
At least for my family, the voice recognition seems to be getting worse and worse with each update. Nowadays Google doesn't seem to turn off the alarms or the music even after several tries, so my son just comes and unplugs it (the touch controls are also a fucking mystery), and my wife's attempts to make her phone ring fail so many times that she finds it without help in the meantime. Alexa doesn't understand what I try to order, and it's easier to just find my phone and type it. I thought that somehow the devices were just developing physical issues over time, but even a new one has the same problems. So far it seems like I would get a better experience by rolling my own stuff and training it with our specific voices.
[+] [-] Shank|3 years ago|reply
Instead, it's a guessing game about syntax and semantics, and frequently a source of frustration. There are many failure points: it can "hear" you wrong, it can miss the wake word, it can hear correctly but interpret wrong, miss context clues, or simply be unable to process whatever the request is. In my experience, most normal people either relegate voice commands to ultra-specific tasks, like timers, weather, and music, and that's that. Google and Alexa are relatively good at "trivia" questions, but Siri is a complete failure. All systems have edge cases that make them brittle.
I think there's potential here. Cortana was the most promising: an assistant that's integrated into the OS and can change any setting or perform anything on-screen would, again, be really awesome. We just don't have that. I think maybe OS-wide + GPT 4 (or later) might get closer to what we expect, but it's just not great right now. I really want to be able to say something as unstructured as "hey siri, create alarms every 5 minutes starting at 6am tomorrow" or "hey siri, when I get home every day, turn on all of the lights, change my focus to personal, and turn on the news". There /is/ power to-be-had, but nobody has really tapped it.
[+] [-] qsort|3 years ago|reply
Natural language is a fundamentally wrong vehicle to convey information to a computer. It can be useful for some specific tasks, automated Q/A, simple interfaces to databases, stuff where I can't be properly f_ed to remember the syntax or the shortcut like IDE commands.
But the idea it can replace formal language is fundamentally and dangerously incorrect. I agree with Dijkstra's quip, we shouldn't regard formal language as a burden, but rather as a privilege.
[+] [-] bambax|3 years ago|reply
But how? Even if those interfaces were actually working, it's still extremely inconvenient to talk when you can click. You have to be somewhere where talking out loud doesn't disturb the people around you. That excludes most situations: open space offices, restaurants, coffee shops, public transport, cars with passengers, and most places in the home except maybe the bathroom.
And even if you're all alone in a silent place, giving instructions out loud takes more time than configuring a screen, and will always be error prone, because the feedback will always be ambiguous and imprecise.
Except maybe if the feedback is on a screen, but then if there's already a screen, why not use it.
[+] [-] matthewmacleod|3 years ago|reply
I think the problem with that is that even I, as a human, struggle to know for sure what you want.
You want to turn all the lights on in the house? Does that include the lamps in the bedroom? How about new lights that you add later? Or the ones in the garden? It's full of ambiguity. What device do you want to watch the news on? Or did you mean the radio? Do you want this to apply when you get back at 2am one night, meaning your family gets woken up when you turn on all the lights and start playing the news in their bedrooms?
I think that's probably why voice interfaces aren't likely to work well for anything beyond direct, specific, well-scoped requests: turn on the lights in the bedroom; turn off the heating at home; roll up the blinds; what's the weather like today; what's the remaining range on my car. They really struggle to deal with anything more complex – not so bad in theory, but really incredibly irritating when they make the wrong decision.
If you had some kind of 24-hour live-in assistant (a butler, maybe?), then they probably have the knowledge and intuition to make sensible decisions in response to fairly unstructured requests. But I think we're miles off getting a voice assistant to do it – not because they can't, necessarily, but because if they mess it up at all it's infuriating.
[+] [-] gspencley|3 years ago|reply
In practical reality these interfaces feel, to me, extremely inefficient. As someone who doesn't particularly like to speak, and prefers silent environments, I find these interfaces require more energy from me to use. Unless they are serving someone with a physical impairment, I don't see what problems they solve, but I can identify lots of problems that they introduce (not only noise but privacy/security vulnerabilities, etc.)
Personal preference.
[+] [-] eternityforest|3 years ago|reply
I don't really want them to be all that much more powerful, because natural language can be imprecise, and... there's just not much that I want to automate in a home setting beyond some real simple timers for lights and stuff.
What if I had a bad day and didn't want to see depressing news? Or what if I came home and was talking on the phone when it turned the news on?
True automation as opposed to just telemetry and remote control can easily be annoying more than helpful.
I like the idea of automation... but I don't actually... automate anything aside from timers and reminders.
[+] [-] antupis|3 years ago|reply
[+] [-] brycehalley|3 years ago|reply
When they were a novelty, I recall the excitement of trying new commands and layering in context. After many failures, I've been conditioned to attempt, and expect success with, only generic queries.
[+] [-] mc32|3 years ago|reply
[+] [-] serial_dev|3 years ago|reply
My biggest frustration with Alexa is getting it to play the podcasts I want to listen to. Even popular podcasts with English names are hard to get just right for Alexa. The same goes for song titles and bands that aren't popular, or that are in other languages.
Usually when I want to take a shower, I try to get the podcasts/music to play for 2 minutes, then sigh, give up and just say "Alexa play Britney Spears".
[+] [-] phkahler|3 years ago|reply
And even then, a voice assistant is essentially a user interface, not a product or service.
It could be a service if you could reliably say "Alexa, plan my trip to customer X the week of the 30th and send me my itinerary". But for now they are an alternative to a phone UI.
[+] [-] PurpleRamen|3 years ago|reply
Voice alone sucks; it's just too limited to be useful on a grand scale. Similarly, command lines suck too. The shell in general has the same problems that voice assistants have, just with more value and decades to mature into something actually useful. And today we have unix shells which reduce the problematic parts by many levels, and still receive constant improvements. This is missing for voice assistants, because unix shells grow and improve in an open space, where everyone can add their own things. That is not happening in big tech.
[+] [-] sublinear|3 years ago|reply
In the spirit of the title of this post, someone else also has to say something.
If your argument is that this is a "non-visual command line" there's slim hope of the layperson learning a whole secret grammar without even a goddamn man page just to do their menial tasks.
[+] [-] _dain_|3 years ago|reply
This got me thinking. Voice recognition is basically a commodity now... there are open source AI engines that can do it offline really well. So the recognition part is solved; you can just grab it from your distro's package manager. Now there's just the language part.
Thing is, I don't want to speak to my computer using English. Aside from the enormous practical problems in natural language processing you've outlined, I just find the idea creepy[1].
What I want is to unambiguously tell it to do arbitrary things. I.e., use it as an actual computer, not a toy that can do a few tricks; i.e., actually program it, in some kind of Turing-complete shell language that is optimized for being spoken aloud. You would speak words into the open source voice recognizer, it would write those to stdout, then an interpreter would read from stdin and execute the instructions.
Is there any language like this? What should it look like?
And yeah that would take effort to learn to use it right, just like any other programming language; so be it. This would be a hobbyist thing.
[1] https://i.kym-cdn.com/photos/images/original/002/054/961/748...
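One hedged sketch of what such a spoken shell could look like: every token is a pronounceable word, and a reserved word ("do") marks the end of a command, playing the role of Enter. All the keywords and word-to-command mappings below are invented purely for illustration; nothing like this is a real project.

```python
# Hypothetical sketch of a spoken shell: every token is a pronounceable
# word, and "do" marks the end of a command (like pressing Enter).
# All keywords and mappings here are made up for illustration.

WORD_CMDS = {"list": "ls", "copy": "cp", "remove": "rm"}
WORD_FLAGS = {"verbose": "-v", "recursive": "-r"}

def interpret(words):
    """Turn a stream of spoken words into shell-style argv lists."""
    argv = []
    for word in words:
        if word == "do":            # end of command: emit it
            if argv:
                yield argv
            argv = []
        elif word == "clear":       # abandon the half-built command
            argv = []
        elif word in WORD_CMDS:
            argv.append(WORD_CMDS[word])
        elif word in WORD_FLAGS:
            argv.append(WORD_FLAGS[word])
        else:
            argv.append(word)       # bare argument, e.g. a filename

# A recognizer would feed words in one at a time; here we fake the stream:
for argv in interpret("copy alpha beta do list verbose do".split()):
    print("would run:", argv)
# prints: would run: ['cp', 'alpha', 'beta']
#         would run: ['ls', '-v']
```

The "clear" escape word matters: unlike a keyboard, speech has no backspace, so any spoken language needs a cheap way to abandon a mis-recognized command.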
[+] [-] spookthesunset|3 years ago|reply
If these voice assistants were smarter about “alternative” names for every device it might be easier to use. But as it stands, it’s kind of a pain because the way you phrase each request is so unforgiving…
Oh yeah, and god help you if your device name is similar to your room name. If your room is "office" (or did I name it "the office"?) and your light is "office light", Alexa is gonna have a bad time telling the two apart.
I have no clue how to fix this…
PS: this is why I question steering wheel free self driving cars. How will we tell these things exactly where to go when we cannot even reliably tell our voice assistants exactly what light to turn on?
[+] [-] 7952|3 years ago|reply
[+] [-] SheinhardtWigCo|3 years ago|reply
This kind of thing can't be built for modern mainstream operating systems because they generally prevent subjugation of the OS components and other programs, even if the user wants that, ostensibly for security reasons.
Unlike a human operator, an assistant "app" can only operate within the bounds of APIs defined by the OS vendors and third-party developers. Gone are the days of third-party software that extends the operating system in ways that the overlords couldn't (or wouldn't) dream of.
[+] [-] bistable|3 years ago|reply
I work at SoundHound where we've been worried about these issues. (I'm going to plug our recent work...) Our new approach is to do natural language understanding in real time instead of at the utterance (turn-taking) level. That way we can give the user constant feedback in real time. In the case of a screen, that means the user sees right away that they are understood, and if not, gets a better hint of what went wrong. For example, a likely mistake is an ASR mistranscription of a word or two.
We still need to prove this is a better paradigm for VoiceAI in products that people can try for themselves, and are working towards that goal. I hope that voice interfaces that were clunky with turn-taking will finally be more naturally usable with real-time NLU.
https://www.youtube.com/watch?v=5WLYH1qHfq8
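A toy way to see the turn-taking vs. real-time distinction (this is not SoundHound's actual system, just an illustration): re-match intent patterns against the partial transcript after every recognized word, so a UI could surface candidate interpretations before the utterance ends. The intent names and keywords below are invented.

```python
# Toy illustration of real-time vs. turn-level NLU (not SoundHound's
# actual system): re-match intents against the partial transcript after
# every word, so a UI can show live feedback instead of waiting for the
# full utterance. Intent names and keywords are made up.

INTENTS = {
    "set_timer": ("set", "timer"),
    "play_music": ("play",),
    "weather": ("weather",),
}

def partial_matches(words):
    """Intents consistent with the words heard so far."""
    return [name for name, keywords in INTENTS.items()
            if any(k in words for k in keywords)]

transcript = []
for word in "set a timer for ten minutes".split():
    transcript.append(word)
    # after each word, the UI can already display candidate intents
    print(word, "->", partial_matches(transcript))
```

Even this crude version shows the payoff: after the very first word the user could see that "set_timer" is in play, and a mistranscription would show up as an empty or wrong candidate list immediately rather than at the end of the turn.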
[+] [-] sliken|3 years ago|reply
However, Google's Assistant in comparison worked great: no memorization needed, and very useful. Time, weather, setting timers and alarms all worked great with a very flexible set of natural language queries. So did more complex things, like what the temperature will be tomorrow at 10pm, simple calculations, and unit conversions. IMDb-like queries about directors, actors, which movies someone was in, etc. also generally worked well. It seemed to really understand things, not just "A web search returned...". Even something like the wheelbase of a 2004 WRX would return an answer, not a search result.
With all that said I'm looking for a non-cloud/on site solution, even if it requires more work, most recently noticed https://github.com/rhasspy/rhasspy
[+] [-] Sakos|3 years ago|reply
[+] [-] 4b11b4|3 years ago|reply
Granted, this is for a specific user base and yes, not in coffee shops.
[+] [-] Razengan|3 years ago|reply
There's a great little game series called Megaman Battle Network (Rockman.exe in Japan) which diverges from the mainline by showing an alternate universe where scientists focused on AI instead of robotics, resulting in a world where "Navis" are ubiquitous.
I wonder: what if our early software engineers had focused on bringing natural voice control to CLIs, before perfecting GUIs?
[+] [-] amelius|3 years ago|reply
This is not power. This is just first-world problems.
[+] [-] bogdanstanciu|3 years ago|reply
A 'debug' area that lets you speak a command, see what was interpreted, and immediately edit it or click "that's not what I wanted". But not as an afterthought, and not as a cumbersome process like setting up an automation triggered by specific commands.
Imagine telling your voice assistant "You're wrong, as usual" and, instead of giving you the boilerplate "I'm sorry...", it actually offered a way to improve itself.
[+] [-] iquerno|3 years ago|reply
At the same time, siri seems to be getting slower and fatter every iteration so perhaps it is becoming more human ;)
[+] [-] sokoloff|3 years ago|reply
“OK, I’ve created an infinite number of alarms, every five minutes, starting at 6 AM tomorrow!”
(As a native English speaker, I'm not sure what specific outcome you want to happen from that request. That's the one that makes the most sense.)
[+] [-] 1MachineElf|3 years ago|reply
[+] [-] PhasmaFelis|3 years ago|reply
[+] [-] freeone3000|3 years ago|reply
[+] [-] gernb|3 years ago|reply
That sounds like a security nightmare. Someone walks by and starts changing your system settings? No thank you
[+] [-] Eleison23|3 years ago|reply
Maybe it's possible to learn a working vocabulary and know how to command a voice assistant. I know my way around several command lines, but I have no idea what to say to Hey Google.
[+] [-] samwillis|3 years ago|reply
On top of that, the general distrust of the privacy of these systems has stopped a significant number of people (myself included) from wanting to use them at all. I don't have an in-home device, and have turned off Siri on my Apple devices.
[+] [-] aflag|3 years ago|reply
If you tell it you want more coffee, it should know what you like and suggest a mixture of brands you bought before and new ones you may enjoy. If you tell it you're hungry, depending on the time of day it could suggest some takeaway you've ordered previously or something else you may like. If you say the same some other time, it may suggest recipes based on what you have at home, or it may suggest nearby restaurants. It should keep track of your friends and others and tell you when their birthdays are coming, and it would be nice if it could even suggest some presents based on things you've told the assistant before, or their wishlist on Amazon or something else.
There are a lot of things assistants could do, but it needs to know you. The model where everyone has the same assistant doesn't quite work out.
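The time-of-day branching described here is easy to make concrete. The sketch below is entirely hypothetical: the utterance matching, the hours, and the data are all invented, just to show that the same request should route to different suggestions depending on stored personal context.

```python
# Hypothetical sketch of context-dependent suggestions: the same
# utterance ("I'm hungry") maps to different responses depending on
# time of day and the user's order history. All data is invented.

def suggest(utterance, hour, order_history):
    if "hungry" not in utterance:
        return "no suggestion"
    if 18 <= hour <= 22 and order_history:
        # evening + past orders: offer a familiar takeaway
        return f"order {order_history[-1]} again?"
    # otherwise fall back to browsing options
    return "browse nearby restaurants or recipes"

print(suggest("I'm hungry", 19, ["thai curry"]))  # order thai curry again?
print(suggest("I'm hungry", 9, ["thai curry"]))   # browse nearby restaurants or recipes
```

The point of the toy is the signature: `suggest` only works because it takes per-user context (`order_history`) as input, which is exactly what the "everyone gets the same assistant" model withholds.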
[+] [-] gw98|3 years ago|reply
"Hey Siri, add more toilet paper to the shopping list" (while pooping)
"Hey Siri, shuffle my music" (while driving)
"Hey Siri, countdown 10 minutes" (while shoving a pizza in the oven)
Anything else is a shit show. Anything where trust or accuracy is involved i.e. mutating data, spending money, absolutely no way can I trust it at all and never will.
[+] [-] ageitgey|3 years ago|reply
The other day, we had to remember to book a school thing for the kid. I said "Hey Google, set a reminder for 9pm to [book the thing]".
Google replied "Here are web search results for set a reminder...."
When they fail constantly at the most basic tasks, usage is going to drop way off.
[+] [-] thrwawy74|3 years ago|reply
"Hey Google, turn off the den light switch in 30 minutes"
"Sorry, for safety reasons we cannot...."
It's a light. Because it heard "switch" it thinks there might be some power tool connected to it and won't let me set delayed actions. I want to be intelligent with it, like "Hey Google, turn the lights on when the sun comes up every day", but no one has gone to that next step.
Or how about "Hey Google, turn off the tone played when you say Hey Google". These settings aren't accessible from the voice interface itself.
Can't wait for Alexa to fail so my SmartTV will stop nagging me to use integrations I will never use. Anti-competitive but whatevs.
[+] [-] ggm|3 years ago|reply
And, they don't really explain the syntax constraints. Which are massive.
Try, ab initio, without knowing how to do it, to get OK Google to open an arbitrary Google-authored app and direct it to do something. Compare that to learning the OS UI keyboard shortcuts or AppleScript. (Which, btw, like Windows, is basically fully documented, because all the libraries are self-documenting for their call structures.)
The voice interfaces are universally badly designed because spoken command sentences are not well understood as a modality of command, distinct from mouse, gesture, touch or keyboard.
Until voice is baked in with a documented syntax in "man" format, I won't believe it's first class.
How do I even know for any arbitrary app what voice directives it uses? How do they correlate to any other command input? How consistent is this with other commands in other apps? Does "stop now" always mean the same thing between a mapping routing app, and a tape backup app? Isn't "stop" contextually defined in a way ^C isn't?
[+] [-] andrewstuart|3 years ago|reply
Truly opened to developers, they could have been really interesting and fun.
This is what happens when big companies develop technologies and think they are too valuable to share.
[+] [-] DevX101|3 years ago|reply
I can’t do this loop with modern voice assistants.
A voice assistant with contextual conversation skills, and access to an "always on" visual monitor (home projector or AR glasses), would definitely increase utility by 10x or more.
[+] [-] OtomotO|3 years ago|reply
Setting a timer.
The oven is in another area of the house, so when I come back to squeeze in some work, I often just say "Voice Assistant, set a timer for 10 minutes!"
And that's about it.
Apart from that, I worked on some chat-bots in the past, and it's the very same thing to me, just a bit worse because of audio.
Natural language processing simply isn't there, AND there are just a few very niche use cases in my eyes.
So if they go away again, I won't cry, or rather my tears will dry fast.
[+] [-] ericol|3 years ago|reply
Somebody develops something, gets a lot of buzz, doesn't deliver, buzz dies slowly, and then several years in the future somebody else actually builds the tech stack needed for it, and it takes off again.
Voice assistants, chatbots [1]... the metaverse.
[1] Several years back I attended a talk in my city where some people from IBM showed us how to implement a chatbot, I think with Watson.
Beyond the trivial examples, it was nearly impossible to implement anything (or the people giving the talk didn't know how), and the documentation was nowhere to be found beyond those trivial examples.
[+] [-] mark_l_watson|3 years ago|reply
Pardon my getting a bit off topic, but it is interesting to see the “belt tightening” by FANGs: it seems like everyone is cutting excess staff and looking hard at which products make money and which are money losers. This may seem like a good thing except for newer product categories like AR/VR that will need a lot of experimentation to get right.
[+] [-] antirez|3 years ago|reply
[+] [-] rndmio|3 years ago|reply
[+] [-] 2sk21|3 years ago|reply
[+] [-] pmontra|3 years ago|reply
1. Hey Alexa/Google/Whatever, buy me a saddle, Selle Italia SMP Extra, white. I expect to get a proposal for the best price + shipment and that's it. Easy.
2. Buy me a replacement head for my Parktool pump. Oops, it's a PFP-3 pump. Ah, the shop selling the saddle doesn't have one... I eventually discovered that the shop has an identical part for a pump of a different brand, probably from the same Chinese manufacturer. I think this would be hard for an assistant, borderline impossible.
3. Buy me a magnetic mosquito screen of at least X x Y cm, not adhesive or velcro. Oops, I need a frame to mount it on? I'll spare you the discoveries I made along the way. Either the voice assistant is equivalent to a professional installer and can see my door, or I'll always do a search with a browser, and watch many videos too.
[+] [-] anon20221123|3 years ago|reply
I use them to get travel time estimates, reference facts, music and podcasts, rewind / forward / pause / resume / next / previous, timers, lights, and occasionally broadcast messages on home speaker devices.
I use voice UI while cooking, running, mowing, raking, driving, changing diapers, putting on my kids' clothes, and while sitting with my wife.
My two young children (5 and almost 2) love voice UI. We ask about the animal of the day, what does this animal sound like, play that song. My older child is beginning to set timers with it. My wife recently said having a nearby voice assistant is important in a home configuration discussion.
My family and I sometimes have frustrating experiences with voice UI, like failed hotwords, shifting syntax, answering on an unexpected device, and slow responses. But we still use it frequently, and our overall sentiment with voice UI is positive.
[+] [-] backtoyoujim|3 years ago|reply
Can you imagine hiring a production assistant who handed everything you asked them to do to a corporate competitor?
Why would I want to do that with my life and Amazon and Apple?
[+] [-] CrypticShift|3 years ago|reply
But because of the 2010s meteoric rise of FAANG (in the public imagination and the stock market), they think they can quickly push into the mainstream all these immature paradigms like IVA, AR and now VR.
All these technologies have existed for decades, but they are not ready! Billions of dollars of investment and marketing are not enough to make them so.
As small a leap as a tactile portable device took almost 20 years to reach a conclusive mainstream form (the iPhone).
We need to accept that IVA/AR/VR are exponentially larger leaps and should remain side-shows for a very long time to come.
For example, Microsoft is finally acknowledging this, with HoloLens now being just "to help you solve real business problems".
[+] [-] licebmi__at__|3 years ago|reply