So they generated training data from one laptop and microphone, then generated test data with the exact same laptop and microphone in the same setup, possibly with the same person pressing the keys too. For the Zoom model they trained a new model on data gathered from Zoom. They call it a practical side-channel attack, but they didn't do anything to see whether this approach could generalize at all.
I believe that is the generalisable version of the attack. You're not looking to learn the sound of arbitrary keyboards with this attack; rather, you're looking to learn the sound of specific targets.
For example, a Twitch streamer enters responses into their stream chat with a live mic. Later, the streamer enters their Twitch password. Someone employing this technique could reasonably learn from the audio in the first scenario and apply the findings in the second.
I think this limited attack surface can work without having to generalize one model to multiple people or keyboards. One advantage of a Zoom attack is that you get "plaintext" shortly after hearing the "ciphertext" if you can get the target to type into the chat window. And when you hear typing in other contexts, it's likely to be something that matches a handful of grammars an LLM can already recognize (written languages, programming languages, commands, calculation inputs); when it doesn't, that's probably a password.
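The chat-window trick amounts to self-labeling training data: once the message appears, you can pair it with the keystroke sounds you just captured. A minimal sketch of that labeling step (names are invented; strings stand in for extracted per-keystroke audio clips):

```python
# Hypothetical sketch of the "plaintext after ciphertext" labeling step.
# Strings stand in for per-keystroke audio clips detected in the call audio.

def label_keystrokes(chat_text, keystroke_clips):
    """Pair each detected keystroke clip with the character the target typed."""
    # Only clean when counts match (no backspaces, autocorrect, or pastes).
    if len(chat_text) != len(keystroke_clips):
        raise ValueError("keystroke count does not match message length")
    return list(zip(keystroke_clips, chat_text))

clips = ["clip0", "clip1", "clip2", "clip3", "clip4"]
print(label_keystrokes("hello", clips)[0])  # ('clip0', 'h')
```

Each such message yields a handful of free labeled examples for exactly the target's keyboard, typist, and microphone.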
Do keystrokes still come through Zoom? The noise filtering has become extremely aggressive lately; I often hear people say "Sorry about that engine / ambulance / city noise" when nobody else on the call heard anything.
How come keyboard-sound suppression is not a standard option in all online communication apps? It's not that hard; keyboard sounds are pretty distinct.
Yeah, and in fact I've heard of this attack being done in the past, but it depends heavily on the typist, the keyboard, etc. Cadence, sound, and so on change with the typist and the hardware. This isn't new, and it has few, if any, practical applications for widespread replication.
Rather than asking "what signal is it detecting", it might be better to ask "which signal carries the most information"; knowing that would also help in averting these attacks.
This kind of stuff could be really menacing in all sorts of public places: airports, coffee shops, and so on.
I did a similar acoustic side-channel attack as my final-year project at uni. There's a treasure trove of findings in this area; I'm just waiting for someone to combine methodologies. There are pretty good results using geometric models, trained and untrained statistical models like this one, and combinations of these features with assorted language models.
Here's a few random papers I read along the way:
https://doi.org/10.1007/s10207-019-00449-8 - SonarSnoop, which uses a phone's speaker to produce ultrasonic audio that can be used to profile the user's interaction (e.g. entering swipe-based passcodes).
https://people.eecs.berkeley.edu/~daw/papers/ssh-use01.pdf - "Timing Analysis of Keystrokes and Timing Attacks on SSH", a paper from 2001 that uses statistical models of keystroke timings to retrieve passwords from encrypted SSH traffic.
https://doi.org/10.1145/1609956.1609959 - "Keyboard acoustic emanations revisited", which uses hidden Markov models and some other English language features to recover text based on classification via cepstrum features.
https://doi.org/10.1145/2660267.2660296 - "Context-free Attacks Using Keyboard Acoustic Emanations" which uses a geometric approach, using time-difference-of-arrival to estimate physical locations probabilistically.
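For a flavor of the SSH timing paper's approach, the raw observable is simply the latency between consecutive keystrokes, which an HMM then maps to likely key pairs. A toy sketch with invented timestamps:

```python
# Toy sketch of the raw feature behind the SSH timing attack: latencies
# between consecutive keystrokes. Timestamps (in seconds) are invented.

def inter_key_intervals(timestamps):
    """Inter-keystroke latencies, the observable an HMM can model."""
    return [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]

# e.g. a fast two-hand digraph followed by slower same-hand presses
print(inter_key_intervals([0.00, 0.08, 0.35, 0.55]))  # [0.08, 0.27, 0.2]
```

The paper's insight is that these latencies leak through SSH because each keystroke is sent in its own packet, so packet timing exposes typing timing.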
I'm not sure why people are pooh-poohing this as if it's not a big deal. From a security and espionage point of view this is pretty significant: audio learning has got to the point that a sensitive audio bug can basically be a keylogger. There are a ton of contexts where an audio tap would be much easier to get in place than a traditional network attack (and with modern shotgun mics, it might not even require being in the building). That is applicable to much more than just password stealing.
I've always been a bit fascinated by this attack vector and wondered if it would get to this point.
I wonder if constantly playing typing sounds could help. Not an abstract sound, but a recording of your actual typing on this particular keyboard, mixed to play realistic-sounding phrases and sequences. It should pause for a split second here and there to let your actual keystrokes mix in. That would make it really hard to decipher your typing, or to correlate it with other events (such as the time you enter a password).
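A rough sketch of the decoy idea, assuming a bank of pre-recorded samples of your own keystrokes (all parameters and counts are invented):

```python
import random

# Hypothetical masking sketch: schedule playback of pre-recorded samples of
# your own keystrokes, with occasional split-second pauses so your real
# keystrokes blend into the decoy stream.

def decoy_schedule(n_sounds, mean_gap=0.18, pause_every=8, pause_len=0.25):
    """Return (start_time_s, sample_id) pairs for decoy keystroke playback."""
    t, schedule = 0.0, []
    for i in range(n_sounds):
        schedule.append((round(t, 3), random.randrange(40)))  # 40-sample bank
        t += random.uniform(0.5 * mean_gap, 1.5 * mean_gap)   # humanlike jitter
        if (i + 1) % pause_every == 0:
            t += pause_len  # gap where real keystrokes can mix in
    return schedule

print(decoy_schedule(3))
```

Randomizing both which sample plays and when it plays matters; a fixed loop would be trivial to subtract from the recording.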
Better yet, play some white noise around you. I heard that it's actually done sometimes at really important meetings.
If you're not such a VIP, just type important things only on your phone; touch screens don't produce enough sound, hopefully.
Fascinating. I'm really curious what the acoustic properties are that it's recognizing.
Is it more of a physical fingerprint of each key, such that if you swapped keys/springs the model would need to be updated? So it's produced by manufacturing inconsistencies, the way individual typewriters used to be forensically identified?
Or is it more that each key is identical, but produces a different resonance pattern within the keyboard/laptop due to the shape of all the matter surrounding it? If you move the keyboard within the room, do you have to re-train the model?
I also wonder how much it varies depending on how hard you press each key -- not at all or a great deal? And what about by keyboard -- when you compare thin MacBook keys with an external full-height keyboard, is one easier/harder to recognize each key on than the other?
Building on what you said: (1) just the key's properties; (2) the key's properties relative to other keys; (3) sound transmission and the environment between key and microphone; (4) the relationship between key and finger; (5) the relationship between key and the associated dendrites.
By the way, some (most?) videoconferencing software removes keyboard sounds from the audio, because it's particularly a distracting problem with laptops where the microphone is right next to the keys.
I'm pretty sure Zoom does this by default as part of its noise cancellation (it's potentially even easier since you can use keydown events to help identify, not just the audio stream).
So as long as basic default noise cancellation is on, that would at least prevent this over regular videoconferencing. And because of this, I'm having a hard time thinking of when else this would be a realistic threat, where the attacker wouldn't already have enough physical access to either install a regular keylogger or else a hidden camera.
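As a sketch of why keydown events make suppression easy (this is an assumption about how a client could combine OS key events with the audio stream, not Zoom's documented behavior):

```python
# Hypothetical keydown-assisted suppression: mute any audio frame that falls
# within a small window around an OS keydown timestamp, since the client
# knows exactly when keys were pressed without analyzing the audio at all.

def suppress_keystrokes(frames, frame_dur, keydown_times, window=0.03):
    """Zero out frames within +/- window seconds of a keydown event."""
    out = []
    for i, frame in enumerate(frames):
        t = i * frame_dur  # start time of this frame
        near_key = any(abs(t - k) <= window for k in keydown_times)
        out.append(0.0 if near_key else frame)
    return out

frames = [0.2, 0.9, 0.3, 0.8, 0.1]  # fake per-frame amplitudes, 20 ms frames
print(suppress_keystrokes(frames, 0.02, [0.0]))  # first two frames muted
```

A real implementation would duck rather than zero the frames, but the point stands: the keydown stream turns a hard audio-classification problem into simple gating.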
Teams definitely doesn't have this, at least not by default, or not by default at our corp. Any time somebody on a call starts typing, you hear it very clearly.
The example figure shows a key hit every half second, which suggests a pecking style of typing at around 24 wpm. This way the model gets very clean waveforms. I wonder how their approach would work with average or fast typists. The sound profiles might be much harder to link to characters.
Even if there were ambiguity, some data is better than none. Given enough training data, I suspect you could find repeatable pairwise tempo patterns in standard typists: on a QWERTY layout, after typing an "A", a "Q" takes 1.2-2.3x as long to type as a "J", that kind of thing. Anything to reduce the search space versus brute-forcing every candidate character.
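A toy version of mining those pairwise tempo patterns, with invented timings:

```python
# Sketch of the pairwise-tempo idea: compare how long different keys take to
# press after a common predecessor. Keys and timings are invented.

def digraph_times(keystrokes):
    """Map (prev_key, key) -> list of observed latencies in seconds."""
    table = {}
    for (k1, t1), (k2, t2) in zip(keystrokes, keystrokes[1:]):
        table.setdefault((k1, k2), []).append(round(t2 - t1, 3))
    return table

stream = [("a", 0.00), ("q", 0.30), ("a", 1.00), ("j", 1.12)]
times = digraph_times(stream)
aq = sum(times[("a", "q")]) / len(times[("a", "q")])
aj = sum(times[("a", "j")]) / len(times[("a", "j")])
print(round(aq / aj, 2))  # "q" after "a" is ~2.5x slower than "j" here
```

Averaged over enough text, ratios like this could narrow candidates even when the individual key sounds are ambiguous.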
Even better if the target uses a passphrase: "hXXXse battXXX stXXXXX cXXXXXX" becomes interpretable given a few landmark letters identified with high probability.
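The landmark-letter idea is easy to see with a wordlist filter; here is a minimal sketch (the wordlist is an invented stand-in for a real passphrase dictionary, and 'X' marks an unknown character):

```python
import re

# Minimal sketch: a partially recovered passphrase word like "hXXXse"
# prunes a wordlist dramatically.

def matches(pattern, words):
    """Match a pattern where 'X' is a wildcard for one unknown character."""
    rx = re.compile("^" + pattern.replace("X", ".") + "$")
    return [w for w in words if rx.match(w)]

wordlist = ["horse", "house", "hoarse", "hearse", "battle", "battery"]
print(matches("hXXXse", wordlist))  # ['hoarse', 'hearse']
print(matches("battXX", wordlist))  # ['battle']
```

With a few high-confidence letters per word, the candidate set per passphrase word collapses from thousands to a handful.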
In response to this post, I just open-sourced a starter project for a variation of this idea: https://github.com/secretlessai/audio-mnist. I've been interested in applying image-classification techniques like CNNs to audio data for a while.
A couple years ago for a weekend project I made a simple "audio-mnist" dataset from handwritten digit audio recordings. I never got past a few days worth of work, but open-sourcing it has been on my mind for a minute. This post kicked me into action. Getting some more data, basic CNN examples, etc. could provide a nice starting point for a lot of research and tools.
There is still separate code I'd have to find and make intelligible to create the recordings and split the audio.
Anyway, in case anyone finds part of this process interesting or useful.
Some old TV remotes used to work this way. They were made by Zenith and are called Space Command remotes. Apparently they are the reason TV remotes are sometimes called clickers.
Imagine the UX of 1 in 20 typed characters being incorrectly inferred, though. The P_failure * cost impact would strike me as insufferable even if the error rate were to improve by an order of magnitude.
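A quick back-of-the-envelope on that 1-in-20 rate, treating per-character errors as independent:

```python
# Chance that a passage of n characters comes through with zero errors,
# at a 1-in-20 per-character error rate.

p_char_ok = 1 - 1 / 20
for n in (12, 50, 200):
    print(n, round(p_char_ok ** n, 3))
```

Even a 12-character password is clean only about half the time, and at a 1-in-200 rate a 200-character stretch is clean only about 37% of the time, which is why an order-of-magnitude improvement still hurts.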
Text-to-keystroke-audio where the text comes from the LLM Prompt "fanfiction based on HGTV's Love It or List It starring an Ewok realtor and Klingon interior designer in iambic pentameter".
The goal is to cause the eavesdropper to totally reevaluate their life choices, and maybe even get caught up in the story.
At ACM CCS 2005, Zhuang, Zhou and Tygar presented "Keyboard Acoustic Emanations Revisited" [1]:
We examine the problem of keyboard acoustic emanations. We present a novel attack taking as input a 10-minute sound recording of a user typing English text using a keyboard, and then recovering up to 96% of typed characters. There is no need for a labeled training recording. Moreover, the recognizer bootstrapped this way can even recognize random text such as passwords: in our experiments, 90% of 5-character random passwords using only letters can be generated in fewer than 20 attempts by an adversary; 80% of 10-character passwords can be generated in fewer than 75 attempts. Our attack uses the statistical constraints of the underlying content, English language, to reconstruct text from sound recordings without any labeled training data. The attack uses a combination of standard machine learning and speech recognition techniques, including cepstrum features, Hidden Markov Models, linear classification, and feedback-based incremental learning.
This builds on Asonov and Agrawal's work [2], who came up with the idea the previous year (2004):
We show that PC keyboards, notebook keyboards, telephone and ATM pads are vulnerable to attacks based on differentiating the sound emanated by different keys. Our attack employs a neural network to recognize the key being pressed. We also investigate why different keys produce different sounds and provide hints for the design of homophonic keyboards that would be resistant to this type of attack.
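A toy illustration of the unsupervised first step in the Zhuang et al. paper: cluster keystrokes by acoustic similarity, then map clusters to letters by frequency rank. The cluster IDs here are invented stand-ins; the real attack uses cepstrum features for clustering and an HMM plus a spell-checker for the later correction steps.

```python
from collections import Counter

# Toy version of the unsupervised bootstrap: the most common cluster is
# assigned the most common letter, and so on down the frequency ranking.

def assign_by_frequency(cluster_ids, letters_by_freq):
    """Map each keystroke cluster to a letter by matching frequency ranks."""
    ranked = [c for c, _ in Counter(cluster_ids).most_common()]
    mapping = dict(zip(ranked, letters_by_freq))
    return [mapping[c] for c in cluster_ids]

# Clusters observed for unknown typed text (actually "teetea"):
clusters = [2, 0, 0, 2, 0, 1]
print(assign_by_frequency(clusters, "eta"))  # ['t', 'e', 'e', 't', 'e', 'a']
```

This crude initial guess is what the language-model feedback loop then iteratively corrects and retrains on.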
Zoom is good at filtering out rather loud background noises. I can't imagine that the sound of background typing during a conversation could be detected by the other party.
What? Zoom (by default, with auto mic adjustment) catches everything. Typing on a laptop is especially bad since the keyboard is closer to the mic than the person speaking (unless there is an external mic), so it's like a stampede of rhinos.
I think an attacker would find that many streamers with high-quality audio have properly set up their mics with noise-gate filters that remove their relatively quiet keystrokes.
I wonder how hard this problem is. I bet it's actually not that bad. If I were to guess, a huge part of the problem is the position of the microphone.
Note that the testing data in the confusion matrix appears to have a uniformish distribution of each key being pressed. I suspect this data was not generated by someone actually typing because you would rarely see numbers and rare letters. It is possible these were simply pressed one at a time rather than in a series of rapid presses.
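One way to make that uniformity suspicion concrete: natural English concentrates presses on a few keys, so the share taken by the top keys separates typed text from deliberate one-at-a-time presses. The sample texts below are invented:

```python
from collections import Counter

# Natural text has a very skewed key distribution, so a near-flat test-set
# distribution suggests keys were pressed deliberately, not typed naturally.

def top_share(text, k=5):
    """Fraction of presses taken by the k most common keys."""
    counts = Counter(text)
    top = sum(n for _, n in counts.most_common(k))
    return round(top / len(text), 2)

natural = "the quick brown fox jumps over the lazy dog and then some more text"
flat = "abcdefghijklmnopqrstuvwxyz0123456789"
print(top_share(natural), top_share(flat))
```

In the natural sample, the top five keys account for roughly half of all presses; in the flat sample, a fraction proportional to k over the alphabet size.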
My guess is this approach uses the mic to identify where the sound of the key press was coming from rather than what each key press sounds like. Which does not invalidate the results but may make it seem less magical. Tbh it’s probably much worse this way because such a model could probably generalize very well across all keyboards and typing styles.
This idea could also be used for good at some point. Imagine “connecting” any keyboard to a device just by enabling the microphone.
It would have its own set of problems: two people couldn't use it at once, and eavesdropping would be really easy… but it'd have its own set of interesting applications.
My sense is that they profile the person more than the keyboard.
https://github.com/ggerganov/kbd-audio
https://www.theverge.com/23810061/zenith-space-command-remot...
[1] https://dl.acm.org/doi/10.1145/1609956.1609959
[2] https://ieeexplore.ieee.org/document/1301311
https://news.mit.edu/2014/algorithm-recovers-speech-from-vib...