Show HN: Aqua Voice 2 – Fast Voice Input for Mac and Windows

140 points | the_king | 10 months ago | withaqua.com

Hey HN - It’s Finn and Jack from Aqua Voice (https://withaqua.com). Aqua is fast AI dictation for your desktop and our attempt to make voice a first-class input method.

Video: https://withaqua.com/watch

Try it here: https://withaqua.com/sandbox

Finn is uber dyslexic and has been using dictation software since sixth grade. For over a decade, he’s been chasing a dream that never quite worked — using your voice instead of a keyboard.

Our last post (https://news.ycombinator.com/item?id=39828686) about this seemed to resonate with the community - though it turned out that version of Aqua was a better demo than product. But it gave us (and others) a lot of good ideas about what should come next.

Since then, we’ve remade Aqua from scratch for speed and usability. It now lives on your desktop, and it lets you talk into any text field -- Cursor, Gmail, Slack, even your terminal.

It starts up in under 50ms, inserts text in about a second (sometimes as fast as 450ms), and has state-of-the-art accuracy. It does a lot more, but that’s the core. We’d love your feedback — and if you’ve got ideas for what voice should do next, let’s hear them!

83 comments

idk1|10 months ago

I’ve been using this for some time and I have to say it is fantastic. I’m intentionally not writing this with Aqua but by hand, and it is taking so much longer. This to me feels like what Apple Intelligence could be; it is so much better than what any of the big tech companies are doing. For example, if you tell Siri voice dictation to go back and delete something, Siri will just write out “go back and delete something”. Likewise, if you tell Siri to go back and spell a name differently, all it will do is write out the letters you said. Honestly, for voice dictation software it feels like travelling to another planet in terms of improvement.

niel|10 months ago

Real-time text output à la Apple Dictation with the accuracy of Whisper is something I've been looking for recently - I'll definitely give Aqua a spin.

MacWhisper [0] (the app I settled on) is conspicuously missing from your benchmarks [1]. How does it compare?

[0]: https://goodsnooze.gumroad.com/l/macwhisper

[1]: https://withaqua.com/blog/benchmark-nov-2024

the_king|10 months ago

We're more accurate and much faster than Mac Whisper, even their strongest model (Whisper Cpp Large V3).

For that benchmarking table, you can use Whisper Large V3 as a stand-in for Mac Whisper and Super Whisper accuracy.

aylmao|10 months ago

This is super impressive, great job!

Side-comment of something this made me think of (again): tech builds too much for tech. I've lived in the Bay before, so I know why this happens. When you're there, everyone around you is in tech, your girlfriend is in tech, you go to parties and everyone invariably ends up talking about work, which is tech. Your frustrations are with tech tools and so are your peers', so you're constantly thinking about tech solutions applicable to tech's problems.

This seems very much marketed to SF people doing SF things ("Cursor, Gmail, Slack, even your terminal"). I wonder how much effort has gone into making this work with code editors or the terminal, even though I doubt this would be a big use-case for this software if it ever became generally popular. I'd imagine the market here is much larger in education, journalism, film, accessibility, even government. Those are much more exciting demos.

the_king|10 months ago

thanks!

I share the same sentiment. I remember thinking in college how annoying it was that I was reading low-resolution, marked-up, skewed, b&w scans of a book using Adobe Acrobat while CS concentrators were doing everything in VS Code (then brand new).

but we do think voice is actually great with Cursor. It’s also really useful in the terminal for certain things. Checking out or creating branches, for example.

fxtentacle|10 months ago

This looks like it'll slurp up all your data and upload it into a cloud. Thanks, no. I want privacy, offline mode and source code for something as crucial to system security as an input method.

"we also collect and process your voice inputs [..] We leverage this data for improvements and development [..] Sharing of your information [..] service providers [..] OpenAI" https://withaqua.com/privacy

FloatArtifact|10 months ago

Local inference only is an absolute requirement. It's not even really all that accessible if it's online only. I can say this as someone that's used over 20000 hours worth of voice dictation and computer control.

canada_dry|10 months ago

First thing I looked for and read: the FAQ.

No mention of privacy (or on prem) - so assumed it's 100% cloud.

Non-starter for me. Accuracy is important, but privacy is more so.

Hopefully a service with these capabilities will be available where the first step has the user complete a brief training session, sends that to the cloud to tailor the recognition parameters for their voice and mannerisms... then loads that locally.

pokstad|10 months ago

This should be on the FAQ. I was trying to find out if it was 100% processed locally.

jmcintire1|10 months ago

fair point. offline+local would be ideal, but as it stands we can't run asr and an llm locally at the speed that is required to provide the level of service we want to.

given that we need the cloud, we offer zero data retention -- you can see this in the app. your concern is as much about ux and communications as it is privacy

toddmorey|10 months ago

And man it's another monthly subscription. I'm not mad at them for finding a gap in the market and putting a business around it. I'm mad at Apple for leaving that gap... hopefully built in voice dictation improves quickly.

jackthetab|10 months ago

Agreed.

This is where I bounce (out of this discussion).

thmsmlr|10 months ago

I totally agree, I created BetterDictation (.com) exactly because of that. Offline was a super important requirement for me.

jrvarela56|10 months ago

Feedback: I use MacWhisper, and the Tiny WhisperKit model (English only) is way faster than any cloud service on my M1 MacBook Pro.

I’d say local is necessary for a delightful product experience, and the added bonus is that it ticks the privacy box.

brianjking|10 months ago

How much RAM is in your M1?

alxlu|10 months ago

I’ve been using this for a while now and I really enjoy it. I ran into a semi-obscure bug and emailed them and they basically fixed it the same day.

I do wish there was a mobile app though (or maybe an iOS keyboard). It would also be nice to be able to have a separate hotkey you can set up to send the output to a specific app (instead of just the active one).

the_king|10 months ago

thanks! We're working on iOS, but it's tough to get the ergos right given all of Apple's restrictions and neglected APIs.

rkagerer|10 months ago

You mentioned it "lives on your desktop". How does licensing work, and can you install and use it on a machine without internet access?

rickydroll|10 months ago

I've been using Aqua since it was announced on HN. I've survived the teething pains by using a mixture of Aqua and Dragon, depending on what I was doing. With this new Windows app, I've given up using Dragon for anything.

Things I've learned are:

1. It works better if you're connected by Ethernet than by Wi-Fi.

2. It needs to have a longer recognition history because sometimes you hit the wrong key to end a recognition session, and it loses everything.

3. Besides the longer history, a debugging mode that records all the characters sent to the dictation box would be useful. Sometimes, I see one set of words, blink, and then it's replaced with a new recognition result. Capturing would be useful in describing what went wrong.

4. There should be a way to tell us when a new version is running. Occasionally, I've run into problems where I'm getting errors, and I can't tell if it's my speaking, my audio chain, my computer, the network, or the app.

5. Grammarly is a great add-on because it helps me correct mis-speakings and odd little errors, like too many spaces caused by starting and stopping recognition.

When Dragon Systems went through bankruptcy court, a public benefits corporation bid for the core technology because it recognized that Dragon was a critical tool for people with disabilities to function in a digital world.

In my opinion, Aqua has reached a similar status as an essential tool. While it doesn't fully replace Dragon for those who need command and control (yet), the recognition accuracy and smoothness are so amazing that I can't envision returning to Dragon Systems without much pain. The only thing worse would be going back to a keyboard.

Aqua Guys, don't fuck it up.

replete|10 months ago

Product/UI looks good. Nice job. I would pay for a completely offline version of this; cloud voice data is a non-starter for me though, unfortunately.

voltaireodactyl|10 months ago

Check out MacWhisper which is one time payment and does this among many other things.

willwade|10 months ago

The real market you need to go hard on is the assistive tech market. You know the biggest companies in this space are those solving problems for dyslexia, where govt grants in e.g. the UK fund pretty much all their work? I had an Access to Work assessment and they recommend stuff from Texthelp like sweets. It’s then paid for by the government following these assessments. But it’s crap. It literally is a crap tool for ADHD or dyslexia because these users literally CAN’T remember or deal with barriers like learning how to dictate correctly. Aqua Voice solves this. I’m your biggest fan. I recommend it in my AT assessments all the time :)

waveringana|10 months ago

yes, I really hope a lot of these ML startups check out the history of ML tech a bit more, because so many accessibility tools are built via ML but they've been abandoned

adamesque|10 months ago

I was very delighted by Aqua v1, which felt like magic at first.

But I’ve noticed/learned that I can’t dictate written content. My brain just does not work that way at all — as I write I am constantly pausing to think, to revise, etc and it feels like a completely different part of my brain is engaged. Everything I dictated with Aqua I had to throw away and rewrite.

Has anyone had similar problems, and if so, had any success retraining themselves toward dictation? There are fleeting moments where it truly feels like it would be much faster.

SCdF|10 months ago

I use my (work) computer entirely with my voice, and it takes a lot of effort to work out what to actually write and to not ramble. Like you I've found that it's better to throw out words in sort of half sentence chunks, to give your brain time to work out what the next chunk is.

It's very hard, and I wouldn't do it if I didn't have to.

(which is why I'm always perplexed by these apps which allow voice dictation or voice control, but not as a complete accessibility package. I wouldn't be using my voice if my hands worked!)

It's also critically important (and after 3-4 years of this I still regularly fail at this) to actually read what you've written, and edit it before sending, because those chunks don't always line up into something that I'd consider acceptably coherent. Even for a one-sentence Slack message.

(also, I have a kiwi accent, and the dictation software I use is not always perfect at getting what I wanted to say on the page)

noahjk|10 months ago

Same here. My two biggest hurdles are:

1. like you mentioned, the second I start talking about something, I totally forget where I'm going, have to pause, it's like my thoughts aren't coming to me. Probably some sort of mental feedback loop plus, like you mentioned, different method of thinking.

2. in the back of my mind, I'm always self-conscious that someone is listening, so it's a privacy / being judged / being overheard feeling which adds a layer of mental feedback.

There also aren't great audio cues for handling on-the-fly editing. I've tried to say "parentheses word parentheses" and it just gets written out. I've tried to say "strike that" and it gets written out. These interfaces are very 'happy path' and don't do a lot of processing (on iOS, I can say "period" and get a '.' (or ?,!) but that's about the extent).

I have had some success with long-form recording sessions which are transcribed afterwards. After getting over the short initial hump, I can brain-dump to the recording, and then trust an app like Voice Notes or Superwhisper to transcribe, and then clean up after.

The main issue I run into there, though, is that I either forget to record something (ex. a conversation that I want to review later) or there is too much friction / I don't record often enough to launch it quickly or even remember to use that workflow.

I get the same feeling with smart home stuff - it was awesome for a while to turn lights on and off with voice, but lately there's the added overhead of "did it hear me? do I need to repeat myself? What's the least amount of words I can say? Why can't I just think something into existence instead? Or have a perfect contextual interface on a physical device?"

the_king|10 months ago

I think Aqua v1 had two problems:

1. The models weren't ready.

2. The interactions were often strained. Not every edit/change is easy to articulate with your voice.

If 1 had been our only problem, we might have had a hit. In reality, I think optimizing model errors allowed us to ignore some fundamental awkwardness in the experience. We've tried to rectify this with v2 by putting less emphasis on streaming for every interaction and less emphasis on commands, replacing it with context.

Hopefully it can become a tool in the toolbox.

jmcintire1|10 months ago

Imo it is a question of right tool for the right job, adjusted for differences between people. For me, the use case that made our product click was prompting Cursor while coding. Then I wanted to use it whenever I talked to chatgpt -- it's much faster to talk and then read, and repeat.

Voice is great for whenever the limiting factor to thought is speed of typing.

cloogshicer|10 months ago

I'm exactly the same. Aqua is so incredible and I really tried to like it, but I just can't get my brain to think of what I want to say first, I have to pause to think constantly.

SCdF|10 months ago

I currently use Talon, which I note is not in your benchmarks.

I can't find any documentation on how Aqua works, or how it compares, so I'm not sure if it's meant to be a replacement for / competitor to Talon. What are you configuring? How are you telling it that you like "genz" style in Slack? Can I create custom configurations / macros?

One thing I like about Talon is it's not magic. Which maybe is not what you're going for. But I am giving it explicit commands that I know it will understand (if it understands my accent obvs), as opposed to guessing, constructing a vague human-language sentence, and hoping that an LLM will work it out. Which means it feels like something I can actually become fast with, and build up muscle memory for.

Also that it's completely offline, so I can actually run it on a work computer without my security folks freaking out.

the_king|10 months ago

We're building something different, but there is some overlap. Aqua is built for max speed, while keeping accuracy high. To achieve that, inference runs in a datacenter (for now).

You can customize Aqua using custom instructions, similar to ChatGPT custom instructions, and get some Talon functionality from it:

In my own, I have:

1. Break paragraphs every three or four sentences.

2. Don't start a sentence with "and".

3. Use lowercase in Slack and iMessage.

4. Here are some common terminal commands...

willwade|10 months ago

Aqua Voice is nothing like Talon. I wouldn’t bother trying to compare. It’s a dictation tool. Just entry. Not commands. But it’s bloody impressive. You don’t need to learn anything - you just talk like you would talk to someone across the way from you.

TylerE|10 months ago

I will have to look into this. I am currently in the process of going on disability as I cannot work due to (amongst other things) carpal and cubital tunnel in both arms.

oulipo|10 months ago

Interesting!

A nice open-source alternative is VoiceInk, check it out: https://github.com/Beingpax/VoiceInk

do you also plan to open-source part of your platform?

razemio|10 months ago

I just tried it on an M4 Max MacBook Pro. When you have such a processor, it seems to be even faster than Aqua Voice 2, does more, optionally supports OpenRouter AND is open source? Thank you so much for the recommendation!

pablopeniche|10 months ago

Just tried it and it crashed

roland_kovacs|10 months ago

Hey guys, great idea! Most apps already have voice recognition. Have you thought about serving a niche where this feature doesn't exist yet? Also, the data protection part is unclear to me. I don't want everything uploaded to a cloud where I don't know what is happening with it.

somberi|10 months ago

I use MacWhisper and it works well enough for me to stop looking for options. (MBA M2, 24GB RAM, Large V3 English)

I wouldn't feel comfortable if someone were looking over my shoulder while I'm typing at a coffee shop.

I am not your customer.

bemmu|10 months ago

I would like to try this, but I use Synergy to use two computers with the same keyboard. I have Aqua Voice now on the server computer, would be great if I could input text to the client computer using it as well.

qntmfred|10 months ago

I use the built-in voice typing in Windows and am pretty happy with it. How would you say this compares (presuming most of your comparisons are Mac-centric)?

the_king|10 months ago

Aqua is in another league when it comes to accuracy. I just ran them side by side on a simple q to ChatGPT and here were the results...

Aqua Voice

  What is the first recorded eclipse in human history? I'm not asking when the first one occurred, but the first written record we have of an eclipse.

Windows Voice Typing (v11 24H2, Dell XPS 13 9340)

  What is the first recorded eclipse in human history i'm not asking 1 like the first occurred but the first ridden record we have of an eclipse

Windows' mistakes were:

- "1" should be "when"

- "ridden" should be "written"

- No punctuation
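
For anyone who wants to put a rough number on a side-by-side like this, one common approach is word error rate (WER): the word-level edit distance between a reference and a transcript, divided by the reference length. The sketch below is only an illustration and not the method behind any Aqua benchmark. It assumes the Aqua output is the ground truth (nothing here verifies that), and the helper names `words` and `word_errors` are made up for this example.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split on whitespace."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the reference into the hypothesis (Levenshtein distance)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Assumption: the Aqua transcript is treated as the reference.
aqua = ("What is the first recorded eclipse in human history? I'm not asking "
        "when the first one occurred, but the first written record we have "
        "of an eclipse.")
windows = ("What is the first recorded eclipse in human history i'm not asking "
           "1 like the first occurred but the first ridden record we have of "
           "an eclipse")

ref, hyp = words(aqua), words(windows)
errors = word_errors(ref, hyp)
print(f"{errors} word errors over {len(ref)} reference words "
      f"(WER {errors / len(ref):.1%})")
```

Because the normalization strips punctuation and case, this only measures word-level mistakes; the missing punctuation in the Windows output is not counted.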

aminsadeghi|10 months ago

Is there going to be Linux support at some point?

the_king|10 months ago

We can do that, with some help from the community.

tomblomfield|10 months ago

I recently started using Aqua and it's great. The team really improved the latency in the last few weeks.

hu3|10 months ago

How does it compare to https://wisprflow.ai ?

btw, grats!

the_king|10 months ago

Thanks!

We're faster, more accurate, and have a streaming option. Aqua can go from key-up to paste in as little as 450ms. Flow was closer to 1000ms in our tests.

Overall, you'll notice we make a few more tweaks to the output than Wisprflow.

For example, Aqua + Cursor is very powerful - we syntax highlight your transcript. The easiest way to see this is to use streaming mode (double press Fn) + deep context + Cursor and try asking it to change something.

This also works in other "context rich" environments.

bklyn11201|10 months ago

With music playing on YouTube in Chrome and AirPods in, the desktop app and the sandbox/demo just don't work.

the_king|10 months ago

Shoot, might be due to AirPods mic init latency. AirPods work well on Desktop (though their mic quality isn't the best).

iAMkenough|10 months ago

Tried viewing the Pricing link in the footer, but it requires a Google account to view.

waveringana|10 months ago

will we ever see local, open source models? they are very important for accessibility, which this product could serve, but won't because it is cloud based (and proprietary).

hasperdi|10 months ago

Can anyone recommend a good dictation app on Linux?