The visual description of the colliding files, at http://shattered.io/static/pdf_format.png, is not very helpful in understanding how they produced the PDFs, so I took apart the PDFs and worked it out.
Basically, each PDF contains a single large (421,385-byte) JPG image, followed by a few PDF commands to display the JPG. The collision lives entirely in the JPG data; the PDF format is merely incidental here. Extracting the two images yields two JPG files with different contents (and different SHA-1 hashes, since the necessary common prefix is missing). Each PDF consists of a common prefix (which contains the PDF header, JPG stream descriptor and some JPG headers) and a common suffix (containing image data and PDF display commands).
The header of each JPG contains a comment field, aligned such that the 16-bit length value of the field lies in the collision zone. Thus, when the collision is generated, one of the PDFs will have a longer comment field than the other. After that, they concatenate two complete JPG image streams with different image content - File 1 sees the first image stream and File 2 sees the second image stream. This is achieved by using the misalignment of the comment fields to make the first image stream appear as a comment in File 2 (more specifically, as a sequence of comments, to avoid overflowing the 16-bit comment length field). Since JPG decoding terminates at the end-of-image (FFD9) marker, the second image stream isn't even examined in File 1 (whereas in File 2 that marker sits inside a comment).
tl;dr: the two "PDFs" are just wrappers around JPGs, which each contain two independent image streams, switched by way of a variable-length comment field.
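The comment-field switch can be sketched in miniature. This is a toy byte layout with a hypothetical parser, not the actual SHAttered file structure or a real JPEG decoder, but the mechanism is the same idea: the two files are identical everywhere except the 16-bit comment length, and that single difference decides which stream the decoder sees.

```python
import struct

EOI = b"\xff\xd9"  # JPEG end-of-image marker

def comment(payload: bytes) -> bytes:
    # JPEG-style comment segment: FFFE marker, then a 16-bit big-endian
    # length that counts the length field itself plus the payload.
    return b"\xff\xfe" + struct.pack(">H", 2 + len(payload)) + payload

def visible_stream(data: bytes) -> bytes:
    """Return the bytes a naive decoder would treat as image data."""
    out, i = b"", 0
    while i < len(data):
        if data[i:i + 2] == b"\xff\xfe":              # comment: skip payload
            (length,) = struct.unpack(">H", data[i + 2:i + 4])
            i += 2 + length
        elif data[i:i + 2] == EOI:                    # end of image: stop
            break
        else:
            out += data[i:i + 1]
            i += 1
    return out

stream_a = b"IMAGE-A" + EOI   # stand-ins for the two real JPEG scan streams
stream_b = b"IMAGE-B" + EOI

# File 1: an empty comment, so the decoder falls straight into stream A and
# stops at its EOI marker, never examining stream B.
file1 = comment(b"") + stream_a + stream_b
# File 2: the comment length swallows all of stream A (EOI included), so
# decoding resumes at stream B.
file2 = b"\xff\xfe" + struct.pack(">H", 2 + len(stream_a)) + stream_a + stream_b

# Only the two length bytes differ - in the real attack, those differing
# bytes sit inside the SHA-1 collision blocks.
assert file1[:2] == file2[:2] and file1[4:] == file2[4:]
print(visible_stream(file1), visible_stream(file2))
```

In the real files the length field lands inside the near-collision blocks, so the two PDFs can differ in exactly that spot while still hashing identically.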
Wow, nice. Thanks. It seems everyone is thinking the secret sauce had something to do with embedding appropriate junk into the PDF format itself.
I was cynically thinking that Google might, as an aside, be recruiting with this disclosure; looking for people (like you) who actually solved the "how they did it" puzzle.
So if one were hashing data or a document format that also contained a bit of self-referential integrity data - e.g. the end of the data has a CRC of the rest of the block, or maybe a cryptographic signature using some other hash - wouldn't that further constrain the search space? Then not only would one need to back out the random bits needed to complete the SHA-1 collision, one would also have to satisfy some other constraint within the other data in the block.
The best thing would be if one could prove a mathematically incompatible set of counter functions, where data colliding in the hash of one function would prevent the other function from validating correctly.
To put things into perspective, let the Bitcoin network hashrate (double-SHA256 hashes per second) = B and the number of SHA1 hashes calculated in shattered = G.
B = 3,116,899,000,000,000,000
G = 9,223,372,036,854,775,808
Every three seconds the Bitcoin mining network brute-forces the same amount of hashes as Google did to perform this attack. Of course, the brute-force approach will always take longer than a strategic approach; this comment is only meant to put into perspective the sheer number of hashes calculated.
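A quick sanity check on that arithmetic, using the two figures from the comment (note the SHAttered count is exactly 2^63):

```python
B = 3_116_899_000_000_000_000   # Bitcoin network hashrate, double-SHA256/s
G = 9_223_372_036_854_775_808   # total SHA-1 computations in SHAttered

assert G == 2 ** 63             # the attack's hash count is exactly 2^63
print(G / B)                    # ~2.96: about three seconds of network hashing
```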
- Google used GPUs; much of the Bitcoin network now relies on fully custom ASICs, and mining without them isn't really profitable anymore.
- SHA1 hashes can also be computed over twice as fast as SHA256, even on GPUs, so if someone were to go out and build SHA1 ASICs, you could probably do this very, very fast. It's almost certain that intelligence agencies could invest this effort to, say, break SHA1 SSL certs.
Well, certainly there's a lot of work that can be done by a computer network that spans what? 500+k computers all pegging their many cpu/gpu/asic processors at 100% 24/7. The bitcoin network, if seen as one processing unit, is history's most powerful computer. Of course it can punch out difficult work quickly. That doesn't mean your average Russian hacker can.
Of course this cuts both ways; we'll probably be seeing ASICs designed just for this work that'll vastly cut down how much time is required to make these collisions.
Also, I find it impressive SHA-1 lasted 10 years. Think about all our advances in that time, from both a hardware and a cryptographic perspective. A decade of life in the cryptospace should be seen as a win, not a loss. I'm not sure why we thought these things would last forever. If anything, the cryptospace is moving faster now than ever. The concept of a decades-long standard like we had with DES or AES is probably never going to happen again.
I wonder, if SHA1 were used instead of SHA256 for Bitcoin mining, how much optimization these SHA1 discoveries would bring to the mining process (since mining is more like partial brute-forcing than finding collisions) - i.e. what the percentage difference in difficulty would be because of them, in the long run, since it takes some time to implement those things as ASICs.
One practical attack using this: create a torrent of some highly desirable content - the latest hot TV show in high def or whatever. Make two copies, one that is malware-free, another that isn't.
Release the clean one and let it spread for a day or two. Then join the torrent, but spread the malware-hosting version. Checksums would all check out, other users would be reporting that it's the real thing, but now you've got 1000 people purposely downloading ransomware from you - and sharing it with others.
Apparently it costs around $100,000 to compute the collisions, but so what? If I've got 10,000 installing my 1BTC-to-unlock ransomware, I'll get a return on investment.
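The back-of-the-envelope ROI works out as follows. The BTC exchange rate here is an assumption (roughly its early-2017 level), not a figure from the article:

```python
collision_cost_usd = 100_000          # quoted cost of computing the collision
paying_victims = 10_000               # victim count from the comment
btc_usd = 1_000                       # assumed BTC price, not from the article

revenue_usd = paying_victims * 1 * btc_usd   # 1 BTC ransom per victim
print(revenue_usd, revenue_usd / collision_cost_usd)
```

Even with a generous margin of error on the assumed price, the revenue dwarfs the quoted attack cost.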
This will mess up torrent sharing websites in a hurry.
Edit: some people have pointed out some totally legitimate potential flaws in this idea. And they're probably right, those may sink the entire scheme. But keep in mind that this is one idea off the top of my head, and I'm not any security expert. There's plenty of actors out there who have more reasons and time to think up scarier ideas.
The reality is, we need to very quickly stop trusting SHA1 for anything. And a lot of software is not ready to make that change overnight.
We're at the "First collision found" stage, where the programmer reaction is "Gather around a co-worker's computer, comparing the colliding inputs and running the hash function on them", and the non-expert reaction is "Explain why a simple collision attack is still useless, it's really the second pre-image attack that counts".
This point seems to be getting re-hashed (no pun intended) a lot, so here's a quick summary: there are three kinds of attacks on cryptographic hashes: collision attacks, second-preimage attacks, and first-preimage attacks.
Collision attack: find two documents with the same hash. That's what was done here.
Second-preimage attack: given a document, find a second document with the same hash.
First-preimage attack: given an arbitrary hash, find a document with that hash.
These are in order of increasing severity. A collision attack is the least severe, but it's still very serious. You can't use a collision to compromise existing certificates, but you can use them to compromise future certificates because you can get a signature on one document that is also valid for a different document. Collision attacks are also stepping stones to pre-image attacks.
UPDATE: some people are raising the possibility of hashes where some values have 1 or 0 preimages, which makes second and first preimage attacks formally impossible. Yes, such hashes are possible (in fact trivial) to construct, but they are not cryptographically secure. One of the requirements for a cryptographically secure hash is that all possible hash values are (more or less) equally likely.
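The difficulty gap between these attacks can be felt with a toy experiment: truncate SHA-1 to a deliberately weak n = 24 bits. A collision then shows up after roughly 2^(n/2) ~ 4,000 tries (the birthday bound), while inverting a given digest (a preimage) would take around 2^24 ~ 16.7 million tries on average. For real 160-bit SHA-1 the corresponding ideal figures are 2^80 and 2^160.

```python
import hashlib

def h24(data: bytes) -> bytes:
    # Deliberately weakened hash: keep only the first 24 bits of SHA-1.
    return hashlib.sha1(data).digest()[:3]

seen = {}
collision_tries = 0
m1 = m2 = None
for i in range(1 << 24):
    msg = i.to_bytes(4, "big")
    collision_tries += 1
    d = h24(msg)
    if d in seen:               # birthday collision: two inputs, one digest
        m1, m2 = seen[d], msg
        break
    seen[d] = msg

# Distinct messages with the same truncated digest, found after only a few
# thousand attempts - far fewer than the ~16.7M a preimage would need.
assert m1 != m2 and h24(m1) == h24(m2)
print(collision_tries)
```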
The relative severity of the two preimage attacks depends on context.
For Git, for example, a first-preimage attack won't buy you anything, but a second-preimage attack could be devastating, depending on how lucky you get.
If Mallory wanted to make it look like you signed a document you didn't, a second-preimage attack would be devastating. And with the demonstration of two PDFs sharing the same hash, this is a pretty severe one: many PGP signatures, for example, still use SHA-1. And even then, an attacker could reuse a signature from any point in the past.
> UPDATE: some people are raising the possibility of hashes where some values have 1 or 0 preimages, which makes second and first preimage attacks formally impossible. Yes, such hashes are possible (in fact trivial) to construct, but they are not cryptographically secure. One of the requirements for a cryptographically secure hash is that all possible hash values are (more or less) equally likely.
True, but the main issue with these proposals is that we want hash values that are much shorter than the thing being hashed. For instance, we want every string of length one million bits to get hashed down to some string of length 4096. So those 2^4096 short hash strings can't mostly have 1 or 0 pre-images; in fact, on average they have to have gazillions of pre-images each.
* DHT/torrent hashes - A group of malicious peers could serve malware for a given hash.
* Git - A commit may be replaced by another without affecting the following commits.
* PGP/GPG -- Any old keys still in use. (New keys do not use SHA1.)
* Software distribution checksums - SHA1 is the most common digest provided (many even still provide MD5).
Edit: Yes, I understand this is a collision attack. But yes, it's still an attack vector, as two colliding blocks can be generated now, with one published and widely deployed (torrent/git), and then replaced at a later date.
And it's super effective: the possibility of false positives can be neglected, as the probability is smaller than 2^-90.
It's also interesting that this attack is from the same author who detected that Flame (the nation-state virus) was signed using a previously unknown MD5 collision attack (cited in the shattered paper's introduction).
Schneier predicted "between 2018--2021 depending on resources". He explicitly says "Since this argument only takes into account commodity hardware and not [...] GPUs". Since Google used GPUs that very well explains the speedup to 2017.
I'm trying to play with this in git. Added the first file, committed, and then overwrote the file with the second file and committed again. But even when cloning this repository into another directory, I'm still getting different files between commit 1 and 2. What does it take to trick git into thinking the files are the same? I half expected "git status" to say "no changes" after overwriting the first (committed) pdf with the second pdf?
This is because git adds a header and zlib compresses the PDFs such that they no longer collide when stored in git. But of course, they still collide when extracted from git:
$ ls -l; for i in 1 2; do sha1sum < shattered-$i.pdf; \
git cat-file -p $(git hash-object -w shattered-$i.pdf) |
sha1sum; done; find .git/objects -type f -print0 | xargs -0 ls -l
total 1664
-rw-r--r--@ 1 jay staff 422435 Feb 23 10:32 shattered-1.pdf
-rw-r--r--@ 1 jay staff 422435 Feb 23 10:32 shattered-2.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
38762cf7f55934b34d179ae6a4c80cadccbb7f0a
-r--r--r-- 1 jay staff 381104 Feb 23 10:41 .git/objects/b6/21eeccd5c7edac9b7dcba35a8d5afd075e24f2
-r--r--r-- 1 jay staff 381102 Feb 23 10:41 .git/objects/ba/9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0
It's worth noting that either of the changes, adding a header or deflating the content, would remove the collision. The former because this is an identical-prefix collision attack (the colliding blocks only work following the exact original prefix), the latter because the compression alters the content entirely.
I'm not a cryptographer, so I wonder: do the git header and zlib compression add significant complexity to manufacturing two files that collide inside git?
Git stores the files as so-called blob objects, which are prefixed with a simple header. So even if sha1(file1) == sha1(file2), git will still compute different hashes, because it's doing sha1(prefix+file1) and sha1(prefix+file2). You'd need files that collide with this prefix included. That's certainly possible, but it would presumably mean paying again for the kind of several-thousand-CPU/GPU cluster Google used to make it feasible. Maybe there are precomputations that can be reused somehow.
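That header scheme is easy to reproduce. A minimal sketch (real git also zlib-compresses objects for storage, but the object ID is computed over the uncompressed header plus content):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # git hashes "blob <size>\0" + content, not the raw file bytes,
    # which is why the shattered PDFs get distinct git object IDs even
    # though their raw SHA-1s collide.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Sanity checks against well-known `git hash-object` results:
print(git_blob_sha1(b""))        # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_sha1(b"hello\n")) # ce013625030ba8dba906f756967f9e9ca394464a
```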
That's not the attack on git -- the attack is on the commit ID. You can have two patches with the same commit ID: apply one now, then in the future replace it with the second one, without affecting any commit IDs afterwards.
It's semi-relevant to your question, but if you want to learn more about Git internals, I recommend the last chapter of the "Pro Git" book, which is free: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Po...
This tool simply tests whether the file contains a disturbance vector that gives it a higher-than-usual probability of being part of a collision.
For those of us who are totally clueless about the construction of these hash functions, what is the fundamental flaw in SHA-1 that allows this attack? How do newer hash functions avoid it?
Computing a collision today costs about $100K from my reading of the paper. So most uses of SHA1 are protecting documents of far lower value, and would not be likely attack targets (today).
There are many areas of security where you can genuinely get by with obfuscation, hoping the attacker looks elsewhere, or general security-through-obscurity.
You can't in crypto. When the entire system relies on an axiom being true, you need to make sure it's true. The attacks are only going to get better. The attacks are going to come from the future. The embedded systems will not be replaced in time.
SHA1's use cases are not limited to integrity verification of documents; it is also used a lot for traffic integrity and generation of authentication codes:
- Torrents of all kinds.
- Version control systems (where attacks like displacing release pointers become easier).
- IPsec, SSH, PGP and a number of other protected data exchange systems.
Being able to subvert integrity guarantees is a nice building block for complicated man-in-the-middle attacks.
Say you want to replicate this in one month: you need 1,320 GPUs for that month. They didn't specify which ones, but at say 1,000 bucks per GPU, that's a 1.3M USD investment, plus some pocket change for power etc.
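The 1,320 figure lines up with the paper's roughly 110 GPU-years for the GPU phase of the attack:

```python
# Compressing ~110 GPU-years of work into a single month needs 110 * 12 GPUs.
gpu_years = 110
gpus_for_one_month = gpu_years * 12
cost_usd = gpus_for_one_month * 1_000   # assumed US$1,000 per GPU, as above

print(gpus_for_one_month, cost_usd)     # 1320 1320000
```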
There isn't anything new about this result actually, Google just set aside the necessary resources to demonstrate it.
> We then leveraged Google’s technical expertise and cloud infrastructure to compute the collision which is one of the largest computations ever completed.
And this, my friends, is why the big players (Google, Amazon, etc.) will win at the cloud offering game. When the instances are not purchased, they can be used extensively internally.
What's the impact to something like git that makes extensive use of SHA-1?
In their example they've created two PDFs with the same SHA-1. Could I replace the blob in a git repo with the "bad" version of a file if it matches the SHA-1?
dwaltrip | 9 years ago:
From the article:
Nine quintillion (9,223,372,036,854,775,808) SHA1 computations in total.
6,500 years of CPU computation for 1st phase of attack.
110 years of GPU computation for 2nd phase of attack.
tripzilch | 9 years ago:
why not??
mate_soos | 9 years ago:
https://github.com/vegard/sha1-sat
and his Master's thesis, whose quality approaches that of a PhD thesis, is here:
https://www.duo.uio.no/bitstream/handle/10852/34912/thesis-o...
Note that they also only mention MiniSat in a footnote, which is pretty bad. The relevant paper is at
http://minisat.se/downloads/MiniSat.pdf
All of these are great reads. Highly recommended.
ikeboy | 9 years ago:
See https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2013... and https://bitcoinchain.com/block_explorer/address/37k7toV1Nv4D...
korm | 9 years ago:
https://www.schneier.com/blog/archives/2012/10/when_will_we_...
Pretty close in his estimation.
SamBam | 9 years ago:
It says "Upload any file to test if they are part of a collision attack."
When I upload either of their two sample collision documents, it says they are "Safe."
jasode | 9 years ago:
Is there a rough calculation in terms of today's $$$ cost to implement the attack?
raphaelj | 9 years ago:
> The monetary cost of computing the second block of the attack by renting Amazon instances can be estimated from these various data. Using a p2.16xlarge instance, featuring 16 K80 GPUs and nominally costing US$ 14.4 per hour would cost US$ 560 K for the necessary 71 device years. It would be more economical for a patient attacker to wait for low "spot prices" of the smaller g2.8xlarge instances, which feature four K520 GPUs, roughly equivalent to a K40 or a GTX 970. Assuming thusly an effort of 100 device years, and a typical spot price of US$ 0.5 per hour, the overall cost would be of US$ 110 K.
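The quoted totals can be re-derived from the per-hour prices. This sketch assumes a "device-year" means one GPU running for a full year (24 * 365 hours):

```python
HOURS_PER_YEAR = 24 * 365

# On-demand: 71 device-years on p2.16xlarge (16 GPUs each, US$14.4/hour).
on_demand_usd = 71 * HOURS_PER_YEAR / 16 * 14.4
# Spot: 100 device-years on g2.8xlarge (4 GPUs each, ~US$0.5/hour).
spot_usd = 100 * HOURS_PER_YEAR / 4 * 0.5

print(round(on_demand_usd), round(spot_usd))   # ~559764 and 109500
```

Both match the paper's rounded US$ 560 K and US$ 110 K figures.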
jeffdavis | 9 years ago:
I know the attack isn't practical today, but the writing is on the wall.
jgrahamc | 9 years ago:
Actually a serious question. How do we communicate something like this to the general public?