
“Should you encrypt or compress first?”

463 points | phillmv | 9 years ago | blog.appcanary.com | reply

235 comments

[+] orlp|9 years ago|reply
There's no compress or encrypt _first_.

It's just compress or not, before encrypting. If security is important, the answer to that is no, unless you're an expert and familiar with CRIME and related attacks.

Compression after encryption is useless, as there should be NO recognizable patterns to exploit after the encryption.

[+] lisper|9 years ago|reply
> If security is important, the answer to that is no

It's a little more nuanced than that. Compression may cause information leaks or it can prevent them depending on the circumstances. If you're encrypting an audio stream, then compressing it first can cause leaks. If you're encrypting a document, then compressing it first may prevent leaks.

[+] api|9 years ago|reply
The rule is: never compress something secret together with something potentially attacker-influenced.

If the attacker can influence the traffic, they can potentially gather information about the secret by examining the effect of differing traffic patterns on the size of the encrypted result.
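
That effect fits in a few lines. A minimal sketch (zlib stands in for TLS-level DEFLATE; the secret and the attacker's guesses are invented):

```python
import zlib

# Hypothetical secret that gets compressed together with attacker input.
SECRET = b"session_token=hunter2"

def observed_size(attacker_data: bytes) -> int:
    # The attacker sees only ciphertext, but a typical stream cipher
    # preserves length, so the compressed size is fully observable.
    return len(zlib.compress(SECRET + attacker_data))

# A guess that matches the secret compresses better (DEFLATE emits a
# back-reference for the repeat), so the observed blob is smaller.
matching = observed_size(b"session_token=hunter2")
wrong = observed_size(b"session_token=ABCDEFG")
print(matching < wrong)
```

The cipher itself leaks nothing here; the size difference alone betrays the match.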

[+] vog|9 years ago|reply
This is exactly what the article says.
[+] coroutines|9 years ago|reply
From my very not-a-crypto-person perspective, I had thought compression was fine as long as you don't pad it to the expected block length before encrypting?

something something padding oracle

[+] DanBlake|9 years ago|reply
>Compression after encryption is useless, as there should be NO recognizable patterns to exploit after the encryption.

Not to nitpick, but this is incorrect. An encrypted file can very well have something like 100 X's in a row, which the compression system could turn from XXXXXXXXXXXXXXXX.... into (100 X's go here). Lousy example, I know, but it gets the point across.

It's also easy enough to test: just encrypt a file, then compress it. Sometimes it will be smaller. (Your reply implied that compression is impossible in 100% of attempts.)
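
This is straightforward to check empirically. A quick sketch, using random bytes as a stand-in for ciphertext (a good cipher's output is computationally indistinguishable from them):

```python
import os
import zlib

# Stand-in for ciphertext: uniformly random bytes.
ciphertext = os.urandom(100_000)

compressed = zlib.compress(ciphertext, level=9)
# DEFLATE finds no structure to exploit; the "compressed" output is
# slightly larger than the input due to framing overhead.
print(len(compressed) >= len(ciphertext))
```

At any nontrivial size the output grows by a few bytes of framing; shaving a byte off a very short ciphertext is luck, not recoverable structure.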

[+] vog|9 years ago|reply
A more interesting question is whether to compress or sign first.

There's an interesting article on that topic by Ted Unangst:

"preauthenticated decryption considered harmful"

http://www.tedunangst.com/flak/post/preauthenticated-decrypt...

EDIT: Although the article talks about encrypt+sign versus sign+encrypt, the same argument goes for compress+sign versus sign+compress. You shouldn't do anything with untrusted data before having checked the signature - neither uncompress nor decrypt nor anything else.
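
That rule is the encrypt-then-MAC pattern: authenticate the ciphertext, verify before touching anything. A minimal sketch (the constant key is a placeholder; in practice it must be random and independent of the encryption key):

```python
import hashlib
import hmac

MAC_KEY = b"\x01" * 32  # placeholder; use an independent random key

def seal(ciphertext: bytes) -> bytes:
    # Encrypt-then-MAC: the tag covers the ciphertext, so a receiver
    # can reject forgeries before decrypting or decompressing anything.
    tag = hmac.new(MAC_KEY, ciphertext, hashlib.sha256).digest()
    return ciphertext + tag

def open_sealed(blob: bytes) -> bytes:
    ciphertext, tag = blob[:-32], blob[-32:]
    expected = hmac.new(MAC_KEY, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad MAC; refusing to touch the payload")
    return ciphertext  # only now is it safe to decrypt / decompress
```

Note `compare_digest` for constant-time comparison; a naive `==` on tags invites its own timing side channel.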

[+] sirk390|9 years ago|reply
I think you mean whether to encrypt or sign first.
[+] meta_AU|9 years ago|reply
From what I've read it is "encrypt then sign, unless signatures are optional and could be stripped from the message in which case things are complicated"
[+] mjevans|9 years ago|reply
Where everyone seems to be getting confused is handling a live flow versus handling a finalized flow (a file).

* Always pad to combat plain-text attacks, padding in theory shouldn't compress well so there's no point making the compression less effective by processing it.

* Always compress a 'file' first to reduce entropy.

* Always pad-up a live stream, maybe this data is useful in some other way, but you want interactive messages to be of similar size.

* At some place in the above also include a recipient identifier; this should be counted as part of the overhead not part of the padding.

* The signature should be on everything above here (recipients, pad, compressed message, extra pad).

* It might be useful to include the recipients in the unencrypted portion of the message, but there are also contexts where someone might choose otherwise; an interactive flow, where both parties already know the key they communicate on, is one such case.

* The pad, message, extra-pad, and signature /must/ be encrypted. The recipients /may/ be encrypted.

I did have to look up the sign / encrypt first question as I didn't have reason to think about it before. In general I've looked to experts in this field for existing solutions, such as OpenPGP (GnuPG being the main implementation). Getting this stuff right is DIFFICULT.

[+] Animats|9 years ago|reply
This is why military voice encryption sends at a constant bitrate even when you're not talking. For serious security applications where fixed links are used, data is transmitted at a constant rate 24/7, even if the link is mostly idle.
[+] dietrichepp|9 years ago|reply
Wow, what a trainwreck. So many comments in here talking about whether it would be possible to compress data that looks uniformly random under every test you'd throw at it. Spoiler alert: you can't compress encrypted data. This isn't an open question of whether it might be possible; it's a fact that we know it's impossible.

In fact, if you successfully compress data after encryption, then the only logical conclusion is that you've found a flaw in the encryption algorithm.

[+] tomp|9 years ago|reply
I don't understand... Why couldn't you do CRIME with no compression as well? Assuming you can control (parts of) the plaintext, surely plaintext+encrypt gives you more information than plaintext+compress+encrypt?
[+] ontoillogical|9 years ago|reply
CRIME relies on compression --- the "CR" stands for "Compression Ratio".

The idea is that DEFLATE, the compression algorithm used in the TLS compression mode that CRIME attacks, builds an index of repeated strings and compresses by emitting references into that index.

Here's a beautiful demonstration of another similar compression algorithm: http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equal...

So, if you control some subset of the plaintext, you can make guesses as to what the secret you're trying to get at is, and if the size changes after compression you know that you got two hits to that bucket in the index, so your guess is right. You can use this technique to guess some string character by character -- reducing your search space to n*m instead of n^m for a string of length m with a character set of length n.

[+] JoshTriplett|9 years ago|reply
If you have a blob of encrypted data, that should normally tell you nothing about the plaintext, except its approximate length (assuming it wasn't padded). Normally, the length of the plaintext should tell you very little about its content; it doesn't do you much good to know "the message is about 4096 bytes". However, if you compress first, then you know the approximate length of the compressed data, which means you know something about the content of the plaintext.

As one of several possible attacks: imagine the attacker could supply data that will get fed back to them in the encrypted blob, along with your own data, and all compressed together. The blob will compress better if your data matches their data. So, repeatedly feed in your data and get back blobs, watching the total size of the compressed-then-encrypted blob, and you can predict enough about their data to guess it.

[+] geofft|9 years ago|reply
If you control parts of the plaintext, a well-designed cipher doesn't tell you anything about the rest of the plaintext.

But if you control parts of the plaintext, a compression algorithm will absolutely tell you information about the rest of the plaintext. That's its job—identifying similarities between different parts of the plaintext.

And it will leak that information in the form of the size of its output, and the standard definition of "well-designed cipher" doesn't do anything to mask sizes; the size of the encrypted data is the same as the size of the data passed to the cipher. So now both the data and its size are sensitive, and nobody's encrypting the size.

(You could sort of work around this by padding, but then you've basically removed the point of compression.)

[+] nightcracker|9 years ago|reply
Nope, secure ciphers are resistant to attacker-controlled-plaintext (chosen-plaintext) attacks.

The reason that plaintext -> compress -> encrypt has issues is that the length of the ciphertext leaks information about the plaintext if the attacker can control parts of the plaintext.

[+] laughinghan|9 years ago|reply
No, because the length of the plaintext+encrypt depends only on the length of the plaintext and not on the contents of the plaintext; whereas the length of the plaintext+compress+encrypt depends on the length of the compressed plaintext, and the length of the compressed plaintext depends on the contents of the plaintext.

So, different plaintexts of the same length, which would've resulted in plaintext+encrypt of the same length, instead result in compressed plaintexts of different lengths, which results in plaintext+compress+encrypt of different lengths, leaking additional information!
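
The whole argument in two measurements (zlib standing in for any DEFLATE-style compressor; the inputs are made up):

```python
import os
import zlib

# Two plaintexts of identical length but very different content.
redundant = b"A" * 1000
random_ish = os.urandom(1000)

# A length-preserving cipher would reveal only "1000 bytes" either way.
# Compression converts content into length, and encryption then
# faithfully preserves that length for the attacker to read:
print(len(zlib.compress(redundant)))   # tens of bytes at most
print(len(zlib.compress(random_ish)))  # roughly 1000 bytes
```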

[+] Asooka|9 years ago|reply
Not at all - with compression thrown in, you get a measure of the entropy of the original plaintext from the length of the compressed message.
[+] jakozaur|9 years ago|reply
Would adding some tiny random size help? Based on my poor understanding, if, after compressing but before encrypting, we added a random 0 to 16 bytes, or 1% of the size, that could defeat quite a lot of attacks (like CRIME).
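
Probably not, unfortunately: uniform noise averages out, so it only raises the number of observations needed. A sketch of why (the secret, cookie format, and 0-15-byte pad are invented to mirror the proposal):

```python
import random
import statistics
import zlib

random.seed(0)  # deterministic demo
SECRET = b"token=hunter2;"

def observed(guess: bytes) -> int:
    # Compressed size plus the proposed small random pad.
    return len(zlib.compress(guess + SECRET)) + random.randint(0, 15)

def average(guess: bytes, trials: int = 2000) -> float:
    return statistics.fmean(observed(guess) for _ in range(trials))

# The one-byte signal from a correct guess survives the noise once the
# attacker averages enough observations; the pad slows, not stops.
print(average(b"token=h") < average(b"token=x"))
```
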
[+] IncRnd|9 years ago|reply
The question as posed is flawed. The correct answer is a series of questions: Who is the attacker? What are you guarding? What assumptions are there about the operating environment? What invariants (regulations, compliance, etc.) exist?

There may be compensating controls that invalidate the perceived need for encryption or compression, for example. In other words: don't design in the dark.

Of course, the interviewer may just want a canned scripted answer - but the interview is your chance to shine, showing how you can discuss all the angles.

[+] biokoda|9 years ago|reply
If you're compressing audio, the simple solution is to compress using constant bitrate.
[+] VLM|9 years ago|reply
Unfortunately, one way to define variable bit rate is that it's compressed CBR.

Rather than the protocol level defining "insert comfort noise here", at the compression level you get the bitstream-level "I dunno what this stream is at a higher level, but replace the next 1000 bits with zeros".

That's if you do simple sampling. I dunno about weird higher-level vocoders. I think you could create a constant bit rate vocoder that really is constant. But it'll likely be incompressible if it's a good one, because a vocoder is basically a compressor that's specifically very smart about human speech input. If your vocoder output is compressible, it's not a good one.

I think if you replaced your compression step with "run it through a constant-rate vocoder" you'd get what you're looking for. Probably.

[+] jayd16|9 years ago|reply
Would be great if Apple understood this and compressed IPA contents before encrypting.

Instead, when you submit something to the AppStore, you end up with a much bigger app than the one you uploaded.

To add insult to injury, if you ask Apple about this fuck up you get an esoteric support email about removing "contiguous zeros." As in, "make your app less compressible so it won't be obvious we're doing this wrong."

[+] poelzi|9 years ago|reply
if your compression can compress your encrypted data, you should change your encryption mechanism to something that actually works...
[+] em3rgent0rdr|9 years ago|reply
What if you compress and then only send data at regular periods and regular packet sizes? That way no information can be gleaned. E.g. after compressing you pad the data if it is unusually short, or you include other compressed data too, or you only use constant bit-rate compression algorithm.
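
That padding scheme can be sketched concretely. Assuming a fixed bucket size and a length prefix so the receiver can strip the pad after decryption (both invented details):

```python
import struct
import zlib

BUCKET = 256  # every message on the wire is a multiple of this size

def frame(plaintext: bytes) -> bytes:
    body = zlib.compress(plaintext)
    # Prefix the true length, then pad up to the next bucket boundary,
    # so all messages in the same bucket are indistinguishable by size.
    framed = struct.pack(">I", len(body)) + body
    return framed + b"\x00" * (-len(framed) % BUCKET)

def unframe(framed: bytes) -> bytes:
    (n,) = struct.unpack(">I", framed[:4])
    return zlib.decompress(framed[4 : 4 + n])

msg = b"attack at dawn" * 10
assert unframe(frame(msg)) == msg
assert len(frame(msg)) % BUCKET == 0
```

This caps the leak at the bucket index (log of the size) rather than the exact compressed length, at the cost of bandwidth; the bucket size is a security/overhead trade-off.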
[+] hueving|9 years ago|reply
That quoted VoIP paper isn't actually as damaging as it sounds. IIRC that 0.6 rating was for less than half of the words, so if you're trying to listen to a conversation to get something meaningful, it's probably not going to happen.
[+] ontoillogical|9 years ago|reply
That's a good point, and the .6 is an optimistic score (I think it was from running a subset of their model)

I got the sense that the authors felt that this proves an attack of this sort was possible and viable, but that their model wasn't quite there yet.

OTOH this should be enough to acknowledge that the voice encryption/compression scheme they are attacking is not secure.

[+] cvwright|9 years ago|reply
Fair enough. The larger point, however, is that this stuff was supposed to be encrypted. The adversary shouldn't be able to learn even one bit about the plaintext.

After all, nobody would buy a new encryption module that advertised it could protect you 40% of the time.

[+] panic|9 years ago|reply
Has there been any research into compression that's generally safe to use before encryption? E.g., matching only common substrings longer than the key length would (I think?) defeat CRIME at the cost of compression ratio.
[+] Qantourisc|9 years ago|reply
Maybe we need encryption that also plays with the length of the message, or randomly pads our data before encryption? I am however no expert, so I have no clue how feasible, or how full of holes, this method would be.
[+] itsnotvalid|9 years ago|reply
I am always thinking that if the compression scheme is known, you would need a good nonce to avoid known-plaintext attacks (for example, the compression format's header is always the same), and you'd also have to defend against CRIME, which works by recovering the compression dictionary.

I think it is best to use the encryption built into the compression program, as those often take these issues into account (and the header is not leaked, since only the content is encrypted).