top | item 39837363


hn_p4ttern | 1 year ago

IMHO "can be padded into a comment" is included in "is valid code"; still, 1 in <number_of_particles_in_universe_here^1E100> is a good approximation of that probability.

Please, correct me if I'm wrong.


maxcoder4 | 1 year ago

Do you mean with current public knowledge, or hypothetically? For MD5 all of these are doable right now (except maybe code that "makes sense" for a human reader). In practice it's also much easier to do this with a data file, as demonstrated for SHA-1 with a "backdoored" certificate.

hn_p4ttern | 1 year ago

1) We are talking about SHA-1; MD5 is off topic.

2) This is the main topic! Being able to generate >>valid code<< with a >>specific purpose<<, so that Git has to change its hashing algorithm;

3) A.k.a. your answer is total nonsense.

Everyone else, OK, I'm listening: give proof that you can stealthily change code on GitHub by messing with the hashing, and moreover insert a "payload" that creates a SHA-1 collision in a reasonable computational time. Everything else is BS.

TrueDuality | 1 year ago

Even without comments your additional requirements aren't relevant, but not in the way I think you're assuming.

When you're searching for a practical collision you only need a way to generate systematic output that will be interpreted the way you intend. The easiest way to do this is to append semantically irrelevant data to something manually produced that is semantically relevant.

In the programming domain, source code specifically, comments are the easiest way to include semantically irrelevant information, but you could also use unused functions, variable names, etc. You are limited only by your imagination and your ability to dodge CI failure checks.
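A minimal sketch of what "semantically irrelevant data" means in practice (Python, with hypothetical names like `BODY` and `candidate` chosen just for illustration): the searchable payload lives in a trailing comment, so every seed yields a distinct digest while the program's behaviour never changes.

```python
import hashlib

# Toy illustration (not an attack): the functional part of the file stays
# fixed, while a trailing comment -- invisible to the interpreter -- is
# varied over seeds, yielding a fresh digest for every candidate.
BODY = "def add(a, b):\n    return a + b\n"

def candidate(seed: int) -> str:
    """Same program, different semantically irrelevant payload."""
    return BODY + f"# build-id: {seed:016x}\n"

def sha1_hex(source: str) -> str:
    return hashlib.sha1(source.encode()).hexdigest()

# Every candidate behaves identically but hashes differently,
# so a collision search can walk the seed space freely.
hashes = {sha1_hex(candidate(seed)) for seed in range(1000)}

namespace = {}
exec(candidate(0), namespace)  # the payload never changes behaviour
```

The same trick works with any carrier for the varying bytes: trailing whitespace, unused identifiers, or commit-message padding.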

Aha, you might say: any human who saw that change or PR would immediately notice the garbage and catch the collision attempt! (This is your argument.) Unfortunately no; that assumes the search space I talked about is over semantic garbage. It's a bit more work, but your search space for a collision could be "Shakespearean sonnets that would make a literary buff cry," as long as you had a generator that could produce them and that produced different outputs from different seeds.

We now have access to a generator that can take an incrementing seed and produce content that is meaningful to a human reader yet semantically irrelevant to the program: language models. Interestingly, this moves the compute cost to the generator (usually the compute restriction is on the hash being attacked).

With current compute it's definitely not practical to brute-force a 2^256 search space for a simple hash, much less while waiting for a language model to produce an output for a different input seed on each check. But that's not what this article is about either...

What these collision attacks (such as the linked article) do is _decrease the search space_. Without any algorithmic tricks a 256-bit digest gives a 2^256 space, and even a generic birthday attack on SHA-256 still costs about 2^128. These tricks eat away at that exponent: this work reduces a collision to 2^49.8. That is a massive drop in the search space. Is it still feasible to attack today? Absolutely not. But a few more of these tricks and I can see those "garbage comments" collisions happening; wait a tiny fraction of time beyond that, and why not include language models in your search space?
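To make the exponent arithmetic concrete, here is a toy birthday search (Python; helper names are made up) against a deliberately truncated SHA-256. With a 32-bit digest the generic collision cost lands around 2^16 tries, not 2^32 -- the same square-root effect that cryptanalytic tricks then push even lower.

```python
import hashlib

def truncated_sha256(data: bytes, bits: int) -> int:
    # Keep only the top `bits` bits of the digest, standing in for a hash
    # whose effective search space has been "eaten away".
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - bits)

def birthday_collision(bits: int):
    """Generic birthday search: expected cost ~ 2**(bits/2), not 2**bits."""
    seen = {}
    counter = 0
    while True:
        digest = truncated_sha256(counter.to_bytes(8, "big"), bits)
        if digest in seen:
            # Return the two colliding inputs and the total number of tries.
            return seen[digest], counter, counter + 1
        seen[digest] = counter
        counter += 1

first, second, tries = birthday_collision(32)
# `tries` typically lands in the neighbourhood of 2**16 -- far below 2**32.
```

Every bit shaved off the effective digest halves the exponent's contribution, which is why a drop from 2^128-ish generic cost toward 2^49.8 matters so much.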

Hell, your changes could be _productive_ and produced incrementally through a series of commits, if you really wanted to limit your search space and get creative about it.

With SHA-1, collision attacks using semantic garbage are already considered practical. We're probably still computationally constrained in using language models to produce semantically viable collisions, but we're not that far off either. Those comments won't be garbage: you won't be able to distinguish them from any other AI-generated code being committed, which is rapidly improving in quality and generation efficiency.

Even without language models you could use something like a language's EBNF grammar as a token generator for source code, which would probably pass any glance check, though definitely not dedicated inspection like a code review. That is probably something that IS PRACTICAL TODAY for SHA-1.
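A sketch of that generator idea (Python; the tiny "grammar" of names and operators here is invented for illustration): a seeded RNG walks a fixed set of production choices, so every seed deterministically maps to a different-looking yet always-compilable function.

```python
import random

# A miniature stand-in for a grammar-driven source generator.
# The identifier pool and operator set are arbitrary example choices.
NAMES = ["total", "count", "value", "result", "acc"]
OPS = ["+", "-", "*"]

def gen_function(seed: int) -> str:
    rng = random.Random(seed)  # deterministic: same seed, same snippet
    suffix = rng.choice(NAMES)
    a, b = rng.sample(NAMES, 2)  # two distinct parameter names
    op = rng.choice(OPS)
    return f"def compute_{suffix}({a}, {b}):\n    return {a} {op} {b}\n"

snippets = {gen_function(seed) for seed in range(200)}
for src in snippets:
    compile(src, "<generated>", "exec")  # syntactically valid every time
```

A real attack generator would expand far more productions (statements, expressions, nesting), but the shape is the same: seed in, plausible source text out.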

hn_p4ttern | 1 year ago

My point is: why should you change the hashing algorithm in Git??? Let's elaborate:

1. Does SHA-1 pose a security risk in Git?

2. Is that practically exploitable in any way?

In some applications, for example password hashing, SSH MACs, etc., you have good reasons to change the hashing algorithm when it becomes obsolete: an attacker gains a computational advantage in cracking a password, compromising the integrity of transmitted packets, and so on.

But just because a hashing algorithm has become obsolete for some applications doesn't mean it's obsolete for ALL possible applications. Moreover, in some specific applications a FASTER hashing algorithm could be DESIRABLE.
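The per-application point can be made concrete with Python's stdlib (the iteration count below is an arbitrary example value): password storage deliberately wants an expensive hash, while content addressing wants a cheap one.

```python
import hashlib

password = b"correct horse battery staple"
salt = b"0123456789abcdef"

# Password storage: deliberately slow. PBKDF2 iterates the underlying
# hash many times to raise an attacker's per-guess cost.
slow = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)

# Content addressing (Git's use case): a single fast pass is the point.
fast = hashlib.sha1(password).hexdigest()
```

Raising the iteration count makes `slow` arbitrarily more expensive without changing the output size; no such knob exists, or is wanted, for plain content hashing.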

So why should you change SHA-1 in Git?
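For what it's worth, Git itself has already partly answered this question: since v2.29 it ships an (experimental) SHA-256 object format, selectable at repository creation:

```shell
# Create a repository whose object IDs are SHA-256 instead of SHA-1
# (supported since Git 2.29; still marked experimental).
git init --object-format=sha256 demo-repo

git -C demo-repo rev-parse --show-object-format
# prints "sha256"
```

SHA-256 and SHA-1 repositories don't interoperate transparently, which is a large part of why migration has been slow.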

>> "But a few more of these tricks and I can see those "garbage comments" collision happening"

I don't think so; it's astronomically difficult computationally, whatever tricks you invent. The point here IS NOT to generate a collision by adding "garbage comments"; again, it is to alter the behaviour of committed code in a functional way.

>> "Even without language models you could use something like a language's EBNF grammar as a token generator for source code which would probably pass any glance checks, but definitely not dedicated inspection like a code review. That is probably something that IS PRACTICAL TODAY for SHA1"

Yeah, prove it!

Rygian | 1 year ago

That's what I meant.