The visual description of the colliding files, at http://shattered.io/static/pdf_format.png, is not very helpful in understanding how they produced the PDFs, so I took apart the PDFs and worked it out.
Basically, each PDF contains a single large (421,385-byte) JPG image, followed by a few PDF commands to display the JPG. The collision lives entirely in the JPG data; the PDF format is merely incidental here. Extracting the two images yields two JPG files with different contents (and different SHA-1 hashes, since the necessary common prefix is missing). Each PDF consists of a common prefix (which contains the PDF header, JPG stream descriptor and some JPG headers) and a common suffix (containing image data and PDF display commands).
The header of each JPG contains a comment field, aligned such that the 16-bit length value of the field lies in the collision zone. Thus, when the collision is generated, one of the PDFs will have a longer comment field than the other. After that, they concatenate two complete JPG image streams with different image content - File 1 sees the first image stream and File 2 sees the second image stream. This is achieved by using the misalignment of the comment fields to make the first image stream appear as a comment in File 2 (more specifically, as a sequence of comments, to avoid overflowing the 16-bit comment length field). Since JPG decoding terminates at the end-of-image (FFD9) marker, the second image stream isn't even examined in File 1 (whereas in File 2 that marker sits inside a comment).
tl;dr: the two "PDFs" are just wrappers around JPGs, which each contain two independent image streams, switched by way of a variable-length comment field.
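The comment-field switch can be sketched in miniature. This is a toy byte layout with a hypothetical parser, not the actual SHAttered file structure or a real JPEG decoder, but the mechanism is the same idea: the two files are identical everywhere except the 16-bit comment length, and that single difference decides which stream the decoder sees.

```python
import struct

EOI = b"\xff\xd9"  # JPEG end-of-image marker

def comment(payload: bytes) -> bytes:
    # JPEG-style comment segment: FFFE marker, then a 16-bit big-endian
    # length that counts the length field itself plus the payload.
    return b"\xff\xfe" + struct.pack(">H", 2 + len(payload)) + payload

def visible_stream(data: bytes) -> bytes:
    """Return the bytes a naive decoder would treat as image data."""
    out, i = b"", 0
    while i < len(data):
        if data[i:i + 2] == b"\xff\xfe":              # comment: skip payload
            (length,) = struct.unpack(">H", data[i + 2:i + 4])
            i += 2 + length
        elif data[i:i + 2] == EOI:                    # end of image: stop
            break
        else:
            out += data[i:i + 1]
            i += 1
    return out

stream_a = b"IMAGE-A" + EOI   # stand-ins for the two real JPEG scan streams
stream_b = b"IMAGE-B" + EOI

# File 1: an empty comment, so the decoder falls straight into stream A and
# stops at its EOI marker, never examining stream B.
file1 = comment(b"") + stream_a + stream_b
# File 2: the comment length swallows all of stream A (EOI included), so
# decoding resumes at stream B.
file2 = b"\xff\xfe" + struct.pack(">H", 2 + len(stream_a)) + stream_a + stream_b

# Only the two length bytes differ - in the real attack, those differing
# bytes sit inside the SHA-1 collision blocks.
assert file1[:2] == file2[:2] and file1[4:] == file2[4:]
print(visible_stream(file1), visible_stream(file2))
```

In the real files the length field lands inside the near-collision blocks, so the two PDFs can differ in exactly that spot while still hashing identically.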
Wow, nice. Thanks. It seems everyone is thinking the secret sauce had something to do with embedding appropriate junk into the PDF format itself.
I was cynically thinking that Google might, as an aside, be recruiting with this disclosure; looking for people (like you) who actually solved the "how they did it" puzzle.
So if one were hashing data or a document format that also contained a bit of self-referential integrity data - e.g. the end of the data has a CRC of the rest of the block, or maybe a cryptographic signature using some other hash - wouldn't that further constrain the search space? Then not only would one need to back out the random bits needed to complete the SHA-1 collision, one would also have to satisfy some other constraint within the other data in the block.
The best thing would be if one could prove a mathematically incompatible set of counter functions, where data colliding in the hash of one function would prevent the other function from validating correctly.
To put things into perspective, let the Bitcoin network hashrate (double-SHA256 hashes per second) = B and the number of SHA1 hashes calculated in shattered = G.
B = 3,116,899,000,000,000,000
G = 9,223,372,036,854,775,808
Every three seconds the Bitcoin mining network brute-forces the same amount of hashes as Google did to perform this attack. Of course, the brute-force approach will always take longer than a strategic approach; this comment is only meant to put into perspective the sheer number of hashes calculated.
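A quick sanity check on that arithmetic, using the two figures from the comment (note the SHAttered count is exactly 2^63):

```python
B = 3_116_899_000_000_000_000   # Bitcoin network hashrate, double-SHA256/s
G = 9_223_372_036_854_775_808   # total SHA-1 computations in SHAttered

assert G == 2 ** 63             # the attack's hash count is exactly 2^63
print(G / B)                    # ~2.96: about three seconds of network hashing
```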
- Google used GPUs; much of the Bitcoin network now relies on fully custom ASICs, and mining without them isn't really profitable anymore.
- SHA1 hashes can also be computed over twice as fast as SHA256, even on GPUs, so if someone were to go out and build SHA1 ASICs, you could probably do this very, very fast. It's almost certain that intelligence agencies could invest this effort to, say, break SHA1 SSL certs.
Well, certainly there's a lot of work that can be done by a computer network that spans what? 500+k computers all pegging their many cpu/gpu/asic processors at 100% 24/7. The bitcoin network, if seen as one processing unit, is history's most powerful computer. Of course it can punch out difficult work quickly. That doesn't mean your average Russian hacker can.
Of course this cuts both ways; we'll probably be seeing ASICs designed just for this work that'll vastly cut down how much time is required to make these collisions.
Also, I find it impressive SHA-1 lasted 10 years. Think about all our advances in that time, from both a hardware and a cryptographic perspective. A decade of life in the cryptospace should be seen as a win, not a loss. I'm not sure why we thought these things would last forever. If anything, the cryptospace is moving faster now than ever. The concept of a decades-long standard like we had with DES or AES is probably never going to happen again.
I wonder, if SHA1 were used instead of SHA256 for Bitcoin mining, how much optimization these SHA1 discoveries would bring to the mining process (since mining is more like partial brute-forcing than finding collisions) - i.e. what the percentage difference in difficulty would be because of them, in the long run, since it takes some time to implement those things as ASICs.
One practical attack using this: create a torrent of some highly desirable content - the latest hot TV show in high def or whatever. Make two copies, one that is malware-free, another that isn't.
Release the clean one and let it spread for a day or two. Then join the torrent, but spread the malware-hosting version. Checksums would all check out, other users would be reporting that it's the real thing, but now you've got 1000 people purposely downloading ransomware from you - and sharing it with others.
Apparently it costs around $100,000 to compute the collisions, but so what? If I've got 10,000 installing my 1BTC-to-unlock ransomware, I'll get a return on investment.
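The back-of-the-envelope ROI works out as follows. The BTC exchange rate here is an assumption (roughly its early-2017 level), not a figure from the article:

```python
collision_cost_usd = 100_000          # quoted cost of computing the collision
paying_victims = 10_000               # victim count from the comment
btc_usd = 1_000                       # assumed BTC price, not from the article

revenue_usd = paying_victims * 1 * btc_usd   # 1 BTC ransom per victim
print(revenue_usd, revenue_usd / collision_cost_usd)
```

Even with a generous margin of error on the assumed price, the revenue dwarfs the quoted attack cost.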
This will mess up torrent sharing websites in a hurry.
Edit: some people have pointed out some totally legitimate potential flaws in this idea. And they're probably right, those may sink the entire scheme. But keep in mind that this is one idea off the top of my head, and I'm not any security expert. There's plenty of actors out there who have more reasons and time to think up scarier ideas.
The reality is, we need to very quickly stop trusting SHA1 for anything. And a lot of software is not ready to make that change overnight.
We're at the "First collision found" stage, where the programmer reaction is "Gather around a co-worker's computer, comparing the colliding inputs and running the hash function on them", and the non-expert reaction is "Explain why a simple collision attack is still useless, it's really the second pre-image attack that counts".
This point seems to be getting re-hashed (no pun intended) a lot, so here's a quick summary: there are three kinds of attacks on cryptographic hashes: collision attacks, second-preimage attacks, and first-preimage attacks.
Collision attack: find two documents with the same hash. That's what was done here.
Second-preimage attack: given a document, find a second document with the same hash.
First-preimage attack: given an arbitrary hash, find a document with that hash.
These are in order of increasing severity. A collision attack is the least severe, but it's still very serious. You can't use a collision to compromise existing certificates, but you can use them to compromise future certificates because you can get a signature on one document that is also valid for a different document. Collision attacks are also stepping stones to pre-image attacks.
UPDATE: some people are raising the possibility of hashes where some values have 1 or 0 preimages, which makes second and first preimage attacks formally impossible. Yes, such hashes are possible (in fact trivial) to construct, but they are not cryptographically secure. One of the requirements for a cryptographically secure hash is that all possible hash values are (more or less) equally likely.
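The difficulty gap between these attacks can be felt with a toy experiment: truncate SHA-1 to a deliberately weak n = 24 bits. A collision then shows up after roughly 2^(n/2) ~ 4,000 tries (the birthday bound), while inverting a given digest (a preimage) would take around 2^24 ~ 16.7 million tries on average. For real 160-bit SHA-1 the corresponding ideal figures are 2^80 and 2^160.

```python
import hashlib

def h24(data: bytes) -> bytes:
    # Deliberately weakened hash: keep only the first 24 bits of SHA-1.
    return hashlib.sha1(data).digest()[:3]

seen = {}
collision_tries = 0
m1 = m2 = None
for i in range(1 << 24):
    msg = i.to_bytes(4, "big")
    collision_tries += 1
    d = h24(msg)
    if d in seen:               # birthday collision: two inputs, one digest
        m1, m2 = seen[d], msg
        break
    seen[d] = msg

# Distinct messages with the same truncated digest, found after only a few
# thousand attempts - far fewer than the ~16.7M a preimage would need.
assert m1 != m2 and h24(m1) == h24(m2)
print(collision_tries)
```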
The relative severity of the two preimage attacks depends on context.
For Git, for example, a first-preimage attack won't buy you anything, but a second-preimage attack could be devastating, depending on how lucky you get.
If Mallory wanted to make it look like you signed a document you didn't, a second-preimage attack would be devastating. And with the demonstration of two PDFs sharing the same hash, this is a pretty severe one: many PGP signatures, for example, still use SHA-1. And even then, an attacker could reuse a signature from any point in the past.
> UPDATE: some people are raising the possibility of hashes where some values have 1 or 0 preimages, which makes second and first preimage attacks formally impossible. Yes, such hashes are possible (in fact trivial) to construct, but they are not cryptographically secure. One of the requirements for a cryptographically secure hash is that all possible hash values are (more or less) equally likely.
True, but the main issue with these proposals is that we want hash values that are much shorter than the thing being hashed. For instance, we want every string of length one million bits to get hashed down to some string of length 4096. So those 2^4096 short hash strings can't mostly have 1 or 0 pre-images; in fact, on average they have to have gazillions of pre-images each.
* DHT/torrent hashes - A group of malicious peers could serve malware for a given hash.
* Git - A commit may be replaced by another without affecting the following commits.
* PGP/GPG -- Any old keys still in use. (New keys do not use SHA1.)
* Software distribution checksums - SHA1 is the most common digest provided (many even still provide MD5).
Edit: Yes, I understand this is a collision attack. But yes, it's still an attack vector, as two colliding blocks can be generated now, with one published and widely deployed (torrent/git), and then replaced at a later date.
And it's super effective: the possibility of false positives can be neglected, as the probability is smaller than 2^-90.
It's also interesting that this attack is from the same author who detected that Flame (the nation-state virus) was signed using a previously unknown MD5 collision attack (cited in the shattered paper's introduction).
Schneier predicted "between 2018--2021 depending on resources". He explicitly says "Since this argument only takes into account commodity hardware and not [...] GPUs". Since Google used GPUs that very well explains the speedup to 2017.
I'm trying to play with this in git. Added the first file, committed, and then overwrote the file with the second file and committed again. But even when cloning this repository into another directory, I'm still getting different files between commit 1 and 2. What does it take to trick git into thinking the files are the same? I half expected "git status" to say "no changes" after overwriting the first (committed) pdf with the second pdf?
This is because git adds a header and zlib compresses the PDFs such that they no longer collide when stored in git. But of course, they still collide when extracted from git:
$ ls -l; for i in 1 2; do sha1sum < shattered-$i.pdf; \
git cat-file -p $(git hash-object -w shattered-$i.pdf) |
sha1sum; done; find .git/objects -type f -print0 | xargs -0 ls -l
total 1664
-rw-r--r--@ 1 jay staff 422435 Feb 23 10:32 shattered-1.pdf
-rw-r--r--@ 1 jay staff 422435 Feb 23 10:32 shattered-2.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
-r--r--r-- 1 jay staff 381104 Feb 23 10:41 .git/objects/b6/21eeccd5c7edac9b7dcba35a8d5afd075e24f2
-r--r--r-- 1 jay staff 381102 Feb 23 10:41 .git/objects/ba/9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0
It's worth noting that either of the changes, adding a header or deflating the content, would remove the collision. The former because this is an identical-prefix collision attack (the colliding blocks only work following the exact original prefix), the latter because the compression alters the content entirely.
I'm not a cryptographer, so I wonder: do the git header and zlib compression add significant complexity to manufacturing two files that collide inside git?
Git stores the files as so-called blob objects, which are prefixed with a simple header. So even if sha1(file1) == sha1(file2), git will still compute different hashes, because it's doing sha1(prefix+file1) and sha1(prefix+file2). You'd need files that collide with this prefix included. That's certainly possible, but it would presumably mean paying again for the kind of several-thousand-CPU/GPU cluster Google used to make it feasible. Maybe there are precomputations that can be reused somehow.
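That header scheme is easy to reproduce. A minimal sketch (real git also zlib-compresses objects for storage, but the object ID is computed over the uncompressed header plus content):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # git hashes "blob <size>\0" + content, not the raw file bytes,
    # which is why the shattered PDFs get distinct git object IDs even
    # though their raw SHA-1s collide.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Sanity checks against well-known `git hash-object` results:
print(git_blob_sha1(b""))        # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_sha1(b"hello\n")) # ce013625030ba8dba906f756967f9e9ca394464a
```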
That's not the attack on git -- the attack is on the commit ID. You can have two patches with the same commit ID: apply one now, then in the future replace it with the second one, without affecting any commit IDs afterwards.
It's semi-relevant to your question, but if you want to learn more about Git internals, I recommend the last chapter of the "Pro Git" book, which is free: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Po...
This tool simply tests whether the file contains a disturbance vector that gives it a higher-than-usual probability of being part of a collision.
For those of us who are totally clueless about the construction of these hash functions, what is the fundamental flaw in SHA-1 that allows this attack? How do newer hash functions avoid it?
Computing a collision today costs about $100K from my reading of the paper. So most uses of SHA1 are protecting documents of far lower value, and would not be likely attack targets (today).
There are many areas of security where you can genuinely get by with obfuscation, hoping the attacker looks elsewhere, or general security-through-obscurity.
You can't in crypto. When the entire system relies on an axiom being true, you need to make sure it's true. The attacks are only going to get better. The attacks are going to come from the future. The embedded systems will not be replaced in time.
SHA1's use cases are not limited to integrity verification of documents; it is also used a lot for traffic integrity and generation of authentication codes:
- Torrents of all kinds.
- Version control systems (where attacks like displacing release pointers become easier).
- IPsec, SSH, PGP and a number of other protected data exchange systems.
Being able to subvert integrity guarantees is a nice building block for complicated man-in-the-middle attacks.
Say you want to replicate this in one month: you need 1,320 GPUs for that month. They didn't specify which ones, but at say 1,000 bucks per GPU, that's a 1.3M USD investment, plus some pocket change for power etc.
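The 1,320 figure lines up with the paper's roughly 110 GPU-years for the GPU phase of the attack:

```python
# Compressing ~110 GPU-years of work into a single month needs 110 * 12 GPUs.
gpu_years = 110
gpus_for_one_month = gpu_years * 12
cost_usd = gpus_for_one_month * 1_000   # assumed US$1,000 per GPU, as above

print(gpus_for_one_month, cost_usd)     # 1320 1320000
```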
There isn't anything new about this result actually, Google just set aside the necessary resources to demonstrate it.
> We then leveraged Google’s technical expertise and cloud infrastructure to compute the collision which is one of the largest computations ever completed.
And this, my friends, is why the big players (Google, Amazon, etc.) will win at the cloud offering game. When the instances are not purchased, they can be used extensively internally.
What's the impact to something like git that makes extensive use of SHA-1?
In their example they've created two PDFs with the same SHA-1. Could I replace the blob in a git repo with the "bad" version of a file if it matches the SHA-1?
dwaltrip | 9 years ago:
From the article:
Nine quintillion (9,223,372,036,854,775,808) SHA1 computations in total.
6,500 years of CPU computation for 1st phase of attack.
110 years of GPU computation for 2nd phase of attack.
tripzilch | 9 years ago:
why not??
mate_soos | 9 years ago:
https://github.com/vegard/sha1-sat
and his Master's thesis, whose quality approaches that of a PhD thesis, is here:
https://www.duo.uio.no/bitstream/handle/10852/34912/thesis-o...
Note that they also only mention MiniSat in a footnote, which is pretty bad. The relevant paper is at
http://minisat.se/downloads/MiniSat.pdf
All of these are great reads. Highly recommended.
ikeboy | 9 years ago:
See https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2013... and https://bitcoinchain.com/block_explorer/address/37k7toV1Nv4D...
korm | 9 years ago:
https://www.schneier.com/blog/archives/2012/10/when_will_we_...
Pretty close in his estimation.
SamBam | 9 years ago:
It says "Upload any file to test if they are part of a collision attack."
When I upload either of their two sample collision documents, it says they are "Safe."
jasode | 9 years ago:
Is there a rough calculation in terms of today's $$$ cost to implement the attack?
raphaelj | 9 years ago:
> The monetary cost of computing the second block of the attack by renting Amazon instances can be estimated from these various data. Using a p2.16xlarge instance, featuring 16 K80 GPUs and nominally costing US$ 14.4 per hour would cost US$ 560 K for the necessary 71 device years. It would be more economical for a patient attacker to wait for low "spot prices" of the smaller g2.8xlarge instances, which feature four K520 GPUs, roughly equivalent to a K40 or a GTX 970. Assuming thusly an effort of 100 device years, and a typical spot price of US$ 0.5 per hour, the overall cost would be of US$ 110 K.
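The quoted totals can be re-derived from the per-hour prices. This sketch assumes a "device-year" means one GPU running for a full year (24 * 365 hours):

```python
HOURS_PER_YEAR = 24 * 365

# On-demand: 71 device-years on p2.16xlarge (16 GPUs each, US$14.4/hour).
on_demand_usd = 71 * HOURS_PER_YEAR / 16 * 14.4
# Spot: 100 device-years on g2.8xlarge (4 GPUs each, ~US$0.5/hour).
spot_usd = 100 * HOURS_PER_YEAR / 4 * 0.5

print(round(on_demand_usd), round(spot_usd))   # ~559764 and 109500
```

Both match the paper's rounded US$ 560 K and US$ 110 K figures.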
jeffdavis | 9 years ago:
I know the attack isn't practical today, but the writing is on the wall.
jgrahamc | 9 years ago:
Actually a serious question. How do we communicate something like this to the general public?