
SHA-1 'fully and practically broken' by new collision (2020)

279 points | ofou | 4 years ago | duo.com

197 comments

[+] ImJasonH|4 years ago|reply
Reminder that GitHub has blocked Git commit collisions since 2017, and as far as anybody is aware hasn't seen one in the wild.

https://github.blog/2017-03-20-sha-1-collision-detection-on-...

[+] api|4 years ago|reply
Random collisions in a 160-bit space are incredibly unlikely. This is about intentional collisions: it is now entirely feasible for someone with significant compute power to create two different git objects with the exact same hash. That could allow someone to silently modify a git commit history to e.g. inject malware or a known "bug" into a piece of software. The modified repository would be indistinguishable if you're only checking git hashes.

Git uses SHA-1 for unique identifiers, which is technically okay as long as those identifiers are not treated as a security boundary. If git were designed today it would probably use SHA-2 or SHA-3, but it's probably not going to change due to the massive install base.

Edit: anyone know if git's PGP signing feature creates a larger hash of the data in the repo? If not maybe git should add a feature where signing is done after the computation of a larger hash such as SHA-512 over all commits since the previous signature.
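To make concrete how git derives those identifiers: an object ID is just the hash of a short type/size header plus the content. A minimal sketch in Python (the helper name is mine; SHA-1 is git's default algorithm):

```python
import hashlib

def git_object_id(content: bytes, obj_type: str = "blob", algo: str = "sha1") -> str:
    # Git hashes "<type> <size>\0<content>"; the digest is the object ID.
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.new(algo, header + content).hexdigest()

# Matches `git hash-object` for the same bytes:
print(git_object_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```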

[+] tantalor|4 years ago|reply
> A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.

- Scott Chacon

[+] ofou|4 years ago|reply
For those interested, you can use SHA-256 today (git 2.29+). Warning: still experimental.

    # choose the object format when creating the repository:
    git init --object-format=sha256

    # which records, in that repo's .git/config:
    [core]
        repositoryFormatVersion = 1
    [extensions]
        objectFormat = sha256
[+] garmaine|4 years ago|reply
Does GitHub support this?
[+] bjarneh|4 years ago|reply
> The technique that the researchers developed is quite complex and required two months of computations on 900 individual GPUs.
[+] asdfman123|4 years ago|reply
I wonder how much money in Bitcoin they would have made if they had used those hashes to mine instead.

Speaking of which, it would be funny if they were pretending to do security research but added a secret backdoor to mine bitcoin instead, somehow exporting those hashes or using the partial results of the SHA-1 calculations (BTC uses SHA-256, not SHA-1).

I'm just joking, but I wonder if that's possible. If anyone is Machiavellian enough for that, it's security researchers.

[+] lallysingh|4 years ago|reply
Or roughly 1.3M GPU hours. Less as GPUs get upgraded. Not sure how many AWS F1 (FPGA) hours it would take.
[+] motohagiography|4 years ago|reply
I only half joke when I ask: when you break a cryptographic hash, does that mean it's now just a really amazing compression algorithm, but with a very heavy compute requirement? The non-joking half is speculating about what data compression in a post-quantum compute world looks like.
[+] zzo38computer|4 years ago|reply
Fortunately, Fossil can now use SHA3-256 as well as SHA-1 (even in the same repository if necessary, perhaps due to upgrading an older repository), and I think it also has a hard mode to detect if SHA-1 collisions seem likely.

(I think git doesn't allow a repository to have multiple kinds of hashes; from what I understand, a single algorithm must be used throughout.)

Since SHA-1 and SHA3-256 have different hash lengths, you can tell which one a given hash is. That doesn't work so well if the lengths are the same; one way to fix it would be to use a multicodec prefix. (My own implementation (currently incomplete) internally uses multicodec numbers to identify the hash algorithms, but these numbers are never stored in the repository; they only appear as the argument to the function that computes the hash, which can also be used in other programs.)
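The prefix idea fits in a few lines. The codec numbers below follow the public multicodec table (sha1 = 0x11, sha3-256 = 0x16), but treat them, and the function names, as illustrative:

```python
import hashlib

# Illustrative multicodec-style tagging: one prefix byte names the
# algorithm, so two digests of equal length stay distinguishable.
CODECS = {0x11: "sha1", 0x16: "sha3_256"}

def tagged_digest(codec: int, data: bytes) -> bytes:
    digest = hashlib.new(CODECS[codec], data).digest()
    return bytes([codec]) + digest

def identify(tagged: bytes) -> str:
    return CODECS[tagged[0]]
```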

[+] b3n|4 years ago|reply
Unfortunately Fossil uses a fast SHA hash for the password hashes of user accounts in the database, rather than a key derivation function, which is disappointing.
[+] zzo38computer|4 years ago|reply
My idea is a different hash construction: a 2D construction with an unbounded internal state and unbounded output length. Each row and each column also has a sequence-number input, and there are two mixing functions (each with a finite input and an output equal in size to the input); part of the result is propagated horizontally and part vertically, so each cell has two inputs. The input includes both the message block and the sequence number, which overwrites part of the internal state (the rest of which is mixed with the other state).

Each output block (output blocks are vertical and message blocks are horizontal, which is why it is slow) is then truncated and concatenated to produce the final output, after padding is added at the end of the message (including the message length) to mix it up even more. If the message length is m and the hash length is h, it requires O(h) space and O(mh) time.

If ChaCha20 is in use, it is easy to see (by the kind of calculations made by ChaCha20) that a zero input produces a zero output; therefore the initial sequence number should be 1 and not 0.

You can also compute a secondary hash which must be misaligned with the first one, so that if a collision is found in one block it will not propagate through the other blocks too (unless a collision can be found within two consecutive blocks, in which case you would hope the extra dimension of the 2D construction avoids it).
[+] brirec|4 years ago|reply
Reading your post I’m sure you know far more about information theory than I do, so forgive me if this is a stupid question, but…

1. How is infinite internal state and output size possible? There has to be some actual limit for internal state, at least, right? Or else you’ll just run out of memory?

2. Wouldn’t a larger output size risk leaking data? The chance of collision becomes lower, but it also seems at odds with what I understand about why we use cryptographic hashes without ridiculous output sizes.

3A. What happens if you want a 1 MB hash of a 1 KB file?

3B. How predictable does a super long hash become?

At what point is it no longer a one-way hash?

[+] cletus|4 years ago|reply
Git was created 16 years ago. The impending breakage of SHA-1 was known even at that time, just like how MD5 had been broken before it.

I'm honestly still shocked that updating the hashing algorithm wasn't built into Git from day one. I really wonder why. Did people think this wouldn't happen? Were they so in love with the performance of C/C++ being able to pass around 20 byte hashes on the stack without worrying about a more complicated structure (eg a collection of variable length hashes)?

It just seems like such a massive and foreseeable oversight that's going to cause pain for some time to come for really no reason at all.

[+] alerighi|4 years ago|reply
SHA-1 still works fine for the purposes of git. It is just no longer considered secure for cryptographic operations, such as digital signatures; that doesn't mean you can't use it for other purposes, like git does. Using it that way is still fine and will always be fine.

Making the hashing algorithm exchangeable would have introduced complexity into a piece of software that is already complex, and made it less efficient (one of the reasons git was created was speed for large projects like the Linux kernel), for no real purpose. If you want to change the algorithm, given that you will break compatibility with all existing repositories, tools, and clients, you might as well fork git, because you are changing too much.

I don't see the point of migrating to SHA-256. Collisions are still very unlikely to occur accidentally, and sure, if you want to damage a repository you can create one on purpose, but you could just as well commit hooks that contain malware, or alter the git history any way you want, so what's the point?

[+] ori_b|4 years ago|reply
For someone to be able to break your repo using sha1 collisions, they need to be able to commit to it.

If you don't trust someone, don't let them commit to your repo.

> Were they so in love with the performance of C/C++ being able to pass around 20 byte hashes on the stack without worrying about a more complicated structure (eg a collection of variable length hashes)?

The hashes show up everywhere. They're how every object in git is identified. They make it onto disk. They're not just commit ids.

Changing hashes changes the disk format.

[+] formerly_proven|4 years ago|reply
Git is designed around content-addressed storage, and while you can cater for hash migrations, it gets pretty messy design-wise. Git's core data structures were also designed really quickly to meet a very urgent need of the kernel hackers. I doubt it has anything to do with being in love with passing 20-byte strings on the stack. The fine git folks have produced a rather detailed document about the hash migration (https://git-scm.com/docs/hash-function-transition/) and it is not particularly simple to do this.
[+] elfchief|4 years ago|reply
Git was not intended (AFAIK) to be cryptographically secure. Being unsuitable for crypto != being unsuitable for other uses.
[+] dec0dedab0de|4 years ago|reply
Linus posted about it on Google Plus in 2017. I haven't re-read it yet, but I remember one of the ideas floating around HN at the time was to just have two hashes per commit. That is, two insecure hashes may be secure together for git's purposes: even though we can generate collisions for MD5 and SHA-1 individually, it would be much more difficult to have a file generate an arbitrary collision for both at the same time.

Here is a link to Linus's post on the internet archive

https://web.archive.org/web/20170717192607/https://plus.goog...

And here are some hn posts around the same time

https://news.ycombinator.com/item?id=13733481

https://news.ycombinator.com/item?id=13719368
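The two-hash idea can be sketched in a few lines (purely illustrative; this is not anything git implements, and the function name is mine):

```python
import hashlib

def dual_digest(data: bytes) -> str:
    # Concatenate two independent digests: a forgery would have to
    # collide under MD5 and SHA-1 simultaneously.
    return hashlib.md5(data).hexdigest() + hashlib.sha1(data).hexdigest()
```

One caveat worth knowing: Joux's multicollision result shows that concatenating two iterated hashes is not as strong as the combined output length suggests, though it is still harder to break than either hash alone.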

[+] ploxiln|4 years ago|reply
> I'm honestly still shocked that updating the hashing algorithm wasn't built into Git from day one. I really wonder why.

The first basically functional version of Git was created in less than a month by Linus Torvalds, immediately after BitKeeper revoked the license/permission for Linux kernel developers to use that proprietary source-control tool for free. Linus took a look at Monotone and Mercurial IIRC, but they were not nearly fast enough at the time, so he created Git to just do what the Linux kernel development community needed.

This isn't the first time Linus did something that academics said was a terrible mistake, and then his creation took over the freaking world, because it works so freaking well. (And it still does today, 16 years later. Maybe tomorrow it won't, I dunno. But it probably will.)

[+] umvi|4 years ago|reply
"We note that classical collisions and chosen-prefix collisions do not threaten all usages of SHA-1."
[+] tantalor|4 years ago|reply
> going to cause pain

What kind of attack on a git repo are you worried about?

[+] renewiltord|4 years ago|reply
It is a law of nature that all successful products find themselves not accounting for some predictable problems.
[+] oshiar53-0|4 years ago|reply
FYI signed git commits and tags weren't a thing back then.
[+] throwaway984393|4 years ago|reply
> I'm honestly still shocked that updating the hashing algorithm wasn't built into Git from day one.

The name literally means stupid person.

[+] simpleguitar|4 years ago|reply
I have a question.

Is it not possible to create a good cryptographic hashing algorithm that results in a 128-bit or 160-bit hash?

It seems all the modern, secure ones are 256 bits or larger. Is that because 160 bits is too easy to compute collisions for any algorithm, or are we just not able to fully realize the entropy that can be recorded into 160 (or even 128) bits?

[+] garmaine|4 years ago|reply
> Is it not possible to create a good cryptographic hashing algorithm that results in 128bit or 160bit hash?

No, it is provably not possible. An N-bit hash function has at most N/2 bits of collision resistance (the birthday bound). So a 128-bit hash function (like MD5) tops out at 64 bits of security (2^64 work) against a classical collision attacker, even if it weren't broken in other ways (as MD5 is). A 160-bit hash function like SHA-1 would have at most 80 bits of security (2^80 work) against collision attacks on a classical computer.

Both MD5 and SHA-1 are broken in other ways, but you could for example use SHA-2 or SHA-3 in truncated mode if you wanted a "secure" 128-bit or 160-bit hash output. Indeed there are standard operating modes for doing so, meant for when you need a drop-in replacement for MD5 or SHA-1.

But fundamentally, 64-bit or 80-bit security is too low for all but some highly specialized use cases. And even then, the extra bits of a stronger hash will rarely kill you, so why bother? Just use SHA-2 or SHA-3 and don't worry about it.

In cases where you can demonstrate that you only care about preimage resistance and not collision resistance, a 128-bit hash would be sufficient. However, collision attacks often crop up in unexpected places, or when your protocol is used in ways you didn't design for. Better to just double the hash size and not worry about it.
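Truncation as described is straightforward; a sketch (the function name is mine; NIST also standardizes dedicated truncated variants such as SHA-512/256):

```python
import hashlib

def truncated_sha256(data: bytes, out_bytes: int = 20) -> bytes:
    # Keeping the first out_bytes of a SHA-256 digest leaves roughly
    # out_bytes * 8 / 2 bits of collision resistance (~80 for 20 bytes),
    # giving a drop-in-length replacement for SHA-1 output.
    return hashlib.sha256(data).digest()[:out_bytes]
```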

[+] lifthrasiir|4 years ago|reply
The chosen-prefix collision has a complexity of 2^63.4 (2020).
[+] akomtu|4 years ago|reply
Looks like they needed 1.5 million GPU hours to find a collision with a random hash.
[+] 1ncorrect|4 years ago|reply
I’ve wondered for a while if it is viable to use multiple hashes, or nested hashes, to mitigate these vulnerabilities, for example:

    message + hash1(message) + hash2(message)
    message + hash1(hash1(message))

To my lay understanding, it would provide multiple chained validation steps, but I’m curious if there are any obvious flaws with this model.

[+] jiggawatts|4 years ago|reply
This is a common lay-person approach to "strengthening" cryptography, but it is not as effective as using a better algorithm instead.

For example, SHA256 and SHA512 require approximately the same number of "cycles per byte" when using modern CPU instruction sets. So 2× SHA256 is 2× slower than 1× SHA512. On some processor models, 1× SHA512 is faster than 1× SHA256!

Of course, things aren't always this simple, but you get the idea.

Similarly, for some classes of attacks, it's only about 2× as much work to crack 2× SHA256 as it would take to crack 1× SHA256. However, for those same attacks it is roughly (2^128)× more work to crack SHA512, which takes the difficulty increase from "slightly more" to "absolutely impossible in this physical universe".

[+] garmaine|4 years ago|reply
If the hash function does its job at mixing input, this is no different than applying just the final hash.

(With the technical caveat that doing at least two serial hashes does improve the properties of the hash function by defending against length-extension attacks. That's why constructions like HMAC do multiple chained hashes. But it does zilch for collision or preimage resistance.)
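The chaining HMAC does can be written out directly. A sketch for SHA-256 (assumes the key fits in one 64-byte block; the real HMAC first hashes longer keys):

```python
import hashlib

def hmac_sha256(key: bytes, msg: bytes) -> str:
    # HMAC = H((key ^ opad) || H((key ^ ipad) || msg)): two chained
    # hash calls, which blocks length extension on SHA-256.
    block = 64  # SHA-256 block size in bytes; key assumed <= 64 bytes
    padded = key.ljust(block, b"\x00")
    ipad = bytes(b ^ 0x36 for b in padded)
    opad = bytes(b ^ 0x5C for b in padded)
    inner = hashlib.sha256(ipad + msg).digest()
    return hashlib.sha256(opad + inner).hexdigest()
```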

[+] dang|4 years ago|reply
Some past threads:

SHA-1 collisions now cost $45k [pdf] - https://news.ycombinator.com/item?id=23350223 - May 2020 (62 comments)

The first chosen-prefix collision for SHA-1 - https://news.ycombinator.com/item?id=21979333 - Jan 2020 (352 comments)

Abusing SHA-1 collisions for Chromium updates - https://news.ycombinator.com/item?id=20114809 - June 2019 (36 comments)

A SHA-1 chosen-prefix collision attack - https://news.ycombinator.com/item?id=19907127 - May 2019 (71 comments)

From Collisions to Chosen-Prefix Collisions Application to Full SHA-1 [pdf] - https://news.ycombinator.com/item?id=19878917 - May 2019 (18 comments)

SHA-1 Collision Detection on GitHub.com - https://news.ycombinator.com/item?id=13917990 - March 2017 (90 comments)

Linus' reply on Git and SHA-1 collision - https://news.ycombinator.com/item?id=13719368 - Feb 2017 (262 comments)

When Will We See Collisions for SHA-1? (2012) - https://news.ycombinator.com/item?id=13719079 - Feb 2017 (1 comment)

Announcing the first SHA-1 collision - https://news.ycombinator.com/item?id=13713480 - Feb 2017 (485 comments)

How would Git handle a SHA-1 collision on a blob? - https://news.ycombinator.com/item?id=13547348 - Feb 2017 (5 comments)

Why it’s harder to forge a SHA-1 certificate than to find a SHA-1 collision - https://news.ycombinator.com/item?id=10778773 - Dec 2015 (43 comments)

The Cost of Creating Collisions Using SHA-1 - https://news.ycombinator.com/item?id=8629906 - Nov 2014 (9 comments)

When Will We See Collisions for SHA-1? - https://news.ycombinator.com/item?id=4618069 - Oct 2012 (51 comments)

[+] john_alan|4 years ago|reply
> Our work show that SHA-1 is now fully and practically broken for use in digital signatures

Yet 2nd preimage resistance is intact. Zzz.

[+] smegcicle|4 years ago|reply
thats it, it's over, git is finished