So they generated training data from one laptop and microphone, then generated test data with the exact same laptop and microphone in the same setup, possibly with the same person pressing the keys too. For the Zoom model they trained a new model on data gathered from Zoom. They call it a practical side-channel attack, but they didn't do anything to see whether this approach could generalize at all.
I believe that is the generalisable version of the attack. You're not looking to learn the sound of arbitrary keyboards with this attack; rather, you're looking to learn the sound of specific targets.
For example, a Twitch streamer enters responses into their stream chat with a live mic. Later, the streamer enters their Twitch password. Someone employing this technique could reasonably learn from the audio in the first scenario and apply the findings in the second.
I think this limited attack surface can work without having to generalize one model to multiple people or keyboards. One advantage of a Zoom attack is that you get "plaintext" shortly after hearing the "ciphertext" if you can get the target to type into the chat window. And when you hear typing in other contexts, it's likely to be something that matches a handful of grammars an LLM can already recognize (written languages, programming languages, commands, calculation inputs); when it doesn't, that's probably a password.
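The chat-window trick amounts to self-labeling training data: once the message appears, you can pair it with the keystroke sounds you just captured. A minimal sketch of that labeling step (names are invented; strings stand in for extracted per-keystroke audio clips):

```python
# Hypothetical sketch of the "plaintext after ciphertext" labeling step.
# Strings stand in for per-keystroke audio clips detected in the call audio.

def label_keystrokes(chat_text, keystroke_clips):
    """Pair each detected keystroke clip with the character the target typed."""
    # Only clean when counts match (no backspaces, autocorrect, or pastes).
    if len(chat_text) != len(keystroke_clips):
        raise ValueError("keystroke count does not match message length")
    return list(zip(keystroke_clips, chat_text))

clips = ["clip0", "clip1", "clip2", "clip3", "clip4"]
print(label_keystrokes("hello", clips)[0])  # ('clip0', 'h')
```

Each such message yields a handful of free labeled examples for exactly the target's keyboard, typist, and microphone.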
Do keystrokes still come through Zoom? The noise filtering has become extremely aggressive lately; I often hear people say "Sorry about that engine / ambulance / city noise" when nobody else on the call heard anything.
How come keyboard-sound suppression is not a standard option in all online communication apps? It's not that hard; keyboard sounds are pretty distinct.
Yeah, and in fact I've heard of this attack being done in the past, but it depends heavily on the typist, the keyboard, etc. Cadence, sound, and so on change with the typist and the hardware. This isn't new, and it has few, if any, practical applications for widespread replication.
Rather than asking "what signal is it detecting", it might be better to ask "which signal carries the most information"; knowing that would also help in averting these attacks.
This kind of stuff could be really menacing in all sorts of public places: airports, coffee shops, and so on.
I did a similar acoustic side-channel attack as my final-year project at uni. There's a treasure trove of findings in this area; I'm just waiting for someone to combine methodologies. There are pretty good results using geometric models, trained and untrained statistical models like this one, and combinations of these features with assorted language models.
Here's a few random papers I read along the way:
https://doi.org/10.1007/s10207-019-00449-8 - SonarSnoop, which uses a phone's speaker to produce ultrasonic audio that can be used to profile the user's interaction (e.g. entering swipe-based passcodes).
https://people.eecs.berkeley.edu/~daw/papers/ssh-use01.pdf - "Timing Analysis of Keystrokes and Timing Attacks on SSH", a paper from 2001 that uses statistical models of keystroke timings to retrieve passwords from encrypted SSH traffic.
https://doi.org/10.1145/1609956.1609959 - "Keyboard acoustic emanations revisited", which uses hidden Markov models and some other English language features to recover text based on classification via cepstrum features.
https://doi.org/10.1145/2660267.2660296 - "Context-free Attacks Using Keyboard Acoustic Emanations" which uses a geometric approach, using time-difference-of-arrival to estimate physical locations probabilistically.
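For a flavor of the SSH timing paper's approach, the raw observable is simply the latency between consecutive keystrokes, which an HMM then maps to likely key pairs. A toy sketch with invented timestamps:

```python
# Toy sketch of the raw feature behind the SSH timing attack: latencies
# between consecutive keystrokes. Timestamps (in seconds) are invented.

def inter_key_intervals(timestamps):
    """Inter-keystroke latencies, the observable an HMM can model."""
    return [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]

# e.g. a fast two-hand digraph followed by slower same-hand presses
print(inter_key_intervals([0.00, 0.08, 0.35, 0.55]))  # [0.08, 0.27, 0.2]
```

The paper's insight is that these latencies leak through SSH because each keystroke is sent in its own packet, so packet timing exposes typing timing.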
I'm not sure why people are pooh-poohing this as if it's not a big deal. From a security and espionage point of view this is pretty significant: audio learning has got to the point that a sensitive audio bug can basically be a keylogger. There are a ton of contexts where an audio tap would be much easier to get in place than a traditional network attack (and with modern shotgun mics, it might not even require being in the building). That is applicable to much more than just password stealing.
I've always been a bit fascinated by this attack vector and wondered if it would get to this point.
I wonder if constantly playing typing sounds could help. Not an abstract sound, but a recording of your actual typing on this particular keyboard, mixed to play realistic-sounding phrases and sequences. It should pause for a split second here and there to let your actual keystrokes mix in. That would make it really hard to decipher your typing, or to correlate it with other events (such as the time you enter a password).
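A rough sketch of the decoy idea, assuming a bank of pre-recorded samples of your own keystrokes (all parameters and counts are invented):

```python
import random

# Hypothetical masking sketch: schedule playback of pre-recorded samples of
# your own keystrokes, with occasional split-second pauses so your real
# keystrokes blend into the decoy stream.

def decoy_schedule(n_sounds, mean_gap=0.18, pause_every=8, pause_len=0.25):
    """Return (start_time_s, sample_id) pairs for decoy keystroke playback."""
    t, schedule = 0.0, []
    for i in range(n_sounds):
        schedule.append((round(t, 3), random.randrange(40)))  # 40-sample bank
        t += random.uniform(0.5 * mean_gap, 1.5 * mean_gap)   # humanlike jitter
        if (i + 1) % pause_every == 0:
            t += pause_len  # gap where real keystrokes can mix in
    return schedule

print(decoy_schedule(3))
```

Randomizing both which sample plays and when it plays matters; a fixed loop would be trivial to subtract from the recording.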
Better yet, play some white noise around you. I heard that it's actually done sometimes at really important meetings.
If you're not such a VIP, just type important things only on your phone; touch screens don't produce enough sound, hopefully.
Fascinating. I'm really curious what the acoustic properties are that it's recognizing.
Is it more of a physical fingerprint of each key, such that if you swapped keys/springs the model would need to be updated? So it's produced by manufacturing inconsistencies, the way individual typewriters used to be forensically identified?
Or is it more that each key is identical, but produces a different resonance pattern within the keyboard/laptop due to the shape of all the matter surrounding it? If you move the keyboard within the room, do you have to re-train the model?
I also wonder how much it varies depending on how hard you press each key -- not at all or a great deal? And what about by keyboard -- when you compare thin MacBook keys with an external full-height keyboard, is one easier/harder to recognize each key on than the other?
Building on what you said: (1) just the key's properties; (2) the key's properties relative to other keys; (3) sound transmission and the environment between key and microphone; (4) the relationship between key and finger; (5) the relationship between key and the associated dendrites.
By the way, some (most?) videoconferencing software removes keyboard sounds from the audio, because it's particularly a distracting problem with laptops where the microphone is right next to the keys.
I'm pretty sure Zoom does this by default as part of its noise cancellation (it's potentially even easier since you can use keydown events to help identify, not just the audio stream).
So as long as basic default noise cancellation is on, that would at least prevent this over regular videoconferencing. And because of this, I'm having a hard time thinking of when else this would be a realistic threat, where the attacker wouldn't already have enough physical access to either install a regular keylogger or else a hidden camera.
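As a sketch of why keydown events make suppression easy (this is an assumption about how a client could combine OS key events with the audio stream, not Zoom's documented behavior):

```python
# Hypothetical keydown-assisted suppression: mute any audio frame that falls
# within a small window around an OS keydown timestamp, since the client
# knows exactly when keys were pressed without analyzing the audio at all.

def suppress_keystrokes(frames, frame_dur, keydown_times, window=0.03):
    """Zero out frames within +/- window seconds of a keydown event."""
    out = []
    for i, frame in enumerate(frames):
        t = i * frame_dur  # start time of this frame
        near_key = any(abs(t - k) <= window for k in keydown_times)
        out.append(0.0 if near_key else frame)
    return out

frames = [0.2, 0.9, 0.3, 0.8, 0.1]  # fake per-frame amplitudes, 20 ms frames
print(suppress_keystrokes(frames, 0.02, [0.0]))  # first two frames muted
```

A real implementation would duck rather than zero the frames, but the point stands: the keydown stream turns a hard audio-classification problem into simple gating.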
Teams definitely doesn't have this, at least not by default, or not by default at our corp. Any time somebody on a call starts typing, you hear it very clearly.
The example figure shows a key hit every half second, which suggests a pecking style of typing at around 24 wpm. This way the model gets very clean waveforms. I wonder how their approach would work with average or fast typists. The sound profiles might be much harder to link to characters.
Even if there were ambiguity, some data is better than none. Given enough training data, I suspect you could find repeatable pairwise tempo patterns in standard typists: on a QWERTY layout, after typing an "A", a "Q" takes 1.2-2.3x as long to type as a "J", that kind of thing. Anything to reduce the search space versus brute-forcing every candidate character.
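A toy version of mining those pairwise tempo patterns, with invented timings:

```python
# Sketch of the pairwise-tempo idea: compare how long different keys take to
# press after a common predecessor. Keys and timings are invented.

def digraph_times(keystrokes):
    """Map (prev_key, key) -> list of observed latencies in seconds."""
    table = {}
    for (k1, t1), (k2, t2) in zip(keystrokes, keystrokes[1:]):
        table.setdefault((k1, k2), []).append(round(t2 - t1, 3))
    return table

stream = [("a", 0.00), ("q", 0.30), ("a", 1.00), ("j", 1.12)]
times = digraph_times(stream)
aq = sum(times[("a", "q")]) / len(times[("a", "q")])
aj = sum(times[("a", "j")]) / len(times[("a", "j")])
print(round(aq / aj, 2))  # "q" after "a" is ~2.5x slower than "j" here
```

Averaged over enough text, ratios like this could narrow candidates even when the individual key sounds are ambiguous.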
Even better if the target uses a passphrase: "hXXXse battXXX stXXXXX cXXXXXX" becomes interpretable given a few landmark letters identified with high probability.
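The landmark-letter idea is easy to see with a wordlist filter; here is a minimal sketch (the wordlist is an invented stand-in for a real passphrase dictionary, and 'X' marks an unknown character):

```python
import re

# Minimal sketch: a partially recovered passphrase word like "hXXXse"
# prunes a wordlist dramatically.

def matches(pattern, words):
    """Match a pattern where 'X' is a wildcard for one unknown character."""
    rx = re.compile("^" + pattern.replace("X", ".") + "$")
    return [w for w in words if rx.match(w)]

wordlist = ["horse", "house", "hoarse", "hearse", "battle", "battery"]
print(matches("hXXXse", wordlist))  # ['hoarse', 'hearse']
print(matches("battXX", wordlist))  # ['battle']
```

With a few high-confidence letters per word, the candidate set per passphrase word collapses from thousands to a handful.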
In response to this post, I just open-sourced a starter project for a variation of this idea: https://github.com/secretlessai/audio-mnist. I've been interested in applying image-classification techniques like CNNs to audio data for a while.
A couple years ago for a weekend project I made a simple "audio-mnist" dataset from handwritten digit audio recordings. I never got past a few days worth of work, but open-sourcing it has been on my mind for a minute. This post kicked me into action. Getting some more data, basic CNN examples, etc. could provide a nice starting point for a lot of research and tools.
There is still separate code I'd have to find and make intelligible to create the recordings and split the audio.
Anyway, in case anyone finds part of this process interesting or useful.
Some old TV remotes used to work this way. They were made by Zenith and are called Space Command remotes. Apparently they are the reason TV remotes are sometimes called clickers.
Imagine the UX of 1 in 20 typed characters being incorrectly inferred, though. The P_failure * cost impact would strike me as insufferable even if the error rate were to improve by an order of magnitude.
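A quick back-of-the-envelope on that 1-in-20 rate, treating per-character errors as independent:

```python
# Chance that a passage of n characters comes through with zero errors,
# at a 1-in-20 per-character error rate.

p_char_ok = 1 - 1 / 20
for n in (12, 50, 200):
    print(n, round(p_char_ok ** n, 3))
```

Even a 12-character password is clean only about half the time, and at a 1-in-200 rate a 200-character stretch is clean only about 37% of the time, which is why an order-of-magnitude improvement still hurts.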
Text-to-keystroke-audio where the text comes from the LLM Prompt "fanfiction based on HGTV's Love It or List It starring an Ewok realtor and Klingon interior designer in iambic pentameter".
The goal is to cause the eavesdropper to totally reevaluate their life choices, and maybe even get caught up in the story.
At ACM CCS 2005, Zhuang, Zhou and Tygar presented "Keyboard Acoustic Emanations Revisited" [1]:
We examine the problem of keyboard acoustic emanations. We present a novel attack taking as input a 10-minute sound recording of a user typing English text using a keyboard, and then recovering up to 96% of typed characters. There is no need for a labeled training recording. Moreover, the recognizer bootstrapped this way can even recognize random text such as passwords: in our experiments, 90% of 5-character random passwords using only letters can be generated in fewer than 20 attempts by an adversary; 80% of 10-character passwords can be generated in fewer than 75 attempts. Our attack uses the statistical constraints of the underlying content, English language, to reconstruct text from sound recordings without any labeled training data. The attack uses a combination of standard machine learning and speech recognition techniques, including cepstrum features, Hidden Markov Models, linear classification, and feedback-based incremental learning.
This builds on Asonov and Agrawal's work [2], who came up with the idea the previous year (2004):
We show that PC keyboards, notebook keyboards, telephone and ATM pads are vulnerable to attacks based on differentiating the sound emanated by different keys. Our attack employs a neural network to recognize the key being pressed. We also investigate why different keys produce different sounds and provide hints for the design of homophonic keyboards that would be resistant to this type of attack.
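A toy illustration of the unsupervised first step in the Zhuang et al. paper: cluster keystrokes by acoustic similarity, then map clusters to letters by frequency rank. The cluster IDs here are invented stand-ins; the real attack uses cepstrum features for clustering and an HMM plus a spell-checker for the later correction steps.

```python
from collections import Counter

# Toy version of the unsupervised bootstrap: the most common cluster is
# assigned the most common letter, and so on down the frequency ranking.

def assign_by_frequency(cluster_ids, letters_by_freq):
    """Map each keystroke cluster to a letter by matching frequency ranks."""
    ranked = [c for c, _ in Counter(cluster_ids).most_common()]
    mapping = dict(zip(ranked, letters_by_freq))
    return [mapping[c] for c in cluster_ids]

# Clusters observed for unknown typed text (actually "teetea"):
clusters = [2, 0, 0, 2, 0, 1]
print(assign_by_frequency(clusters, "eta"))  # ['t', 'e', 'e', 't', 'e', 'a']
```

This crude initial guess is what the language-model feedback loop then iteratively corrects and retrains on.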
Zoom is good at filtering out rather loud background noises. I can't imagine that the sound of background typing during a conversation could be detected by the other party.
What? Zoom (by default, with auto mic adjustment) catches everything. Typing on a laptop is especially bad since the keyboard is closer to the mic than the person speaking (unless there is an external mic), so it's like a stampede of rhinos.
I think an attacker would find that many streamers with high-quality audio have properly set up their mics with noise-gate filters that remove their relatively quiet keystrokes.
I wonder how hard this problem is. I bet it's actually not that bad. If I were to guess, a huge part of the problem is the position of the microphone.
Note that the testing data in the confusion matrix appears to have a uniformish distribution of each key being pressed. I suspect this data was not generated by someone actually typing because you would rarely see numbers and rare letters. It is possible these were simply pressed one at a time rather than in a series of rapid presses.
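One way to make that uniformity suspicion concrete: natural English concentrates presses on a few keys, so the share taken by the top keys separates typed text from deliberate one-at-a-time presses. The sample texts below are invented:

```python
from collections import Counter

# Natural text has a very skewed key distribution, so a near-flat test-set
# distribution suggests keys were pressed deliberately, not typed naturally.

def top_share(text, k=5):
    """Fraction of presses taken by the k most common keys."""
    counts = Counter(text)
    top = sum(n for _, n in counts.most_common(k))
    return round(top / len(text), 2)

natural = "the quick brown fox jumps over the lazy dog and then some more text"
flat = "abcdefghijklmnopqrstuvwxyz0123456789"
print(top_share(natural), top_share(flat))
```

In the natural sample, the top five keys account for roughly half of all presses; in the flat sample, a fraction proportional to k over the alphabet size.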
My guess is this approach uses the mic to identify where the sound of the key press was coming from rather than what each key press sounds like. Which does not invalidate the results but may make it seem less magical. Tbh it’s probably much worse this way because such a model could probably generalize very well across all keyboards and typing styles.
This idea could also be used for good at some point. Imagine “connecting” any keyboard to a device just by enabling the microphone.
It would have its own set of problems: two people couldn't use it at once, and eavesdropping would be really easy… but it'd have its own set of interesting applications.
My sense is that they profile the person more than the keyboard.
https://github.com/ggerganov/kbd-audio
https://www.theverge.com/23810061/zenith-space-command-remot...
[1] https://dl.acm.org/doi/10.1145/1609956.1609959
[2] https://ieeexplore.ieee.org/document/1301311
https://news.mit.edu/2014/algorithm-recovers-speech-from-vib...