top | item 41999218

curiousllama | 1 year ago

Out of curiosity, what is the false positive rate of a hash match?

If the FPR is comparable to asking a human "are these the same image?", then it would seem to be equivalent to a visual search. I wonder if (or why) human verification is actually necessary here.

EVa5I7bHFq9mnYK|1 year ago

I doubt SHA-1 hashes are used for this. Image hashes for this purpose should match files regardless of orientation, cropping, resizing, re-compression, color correction, etc. Collisions could be far more frequent with such hashes.

bluGill|1 year ago

The hash should ideally match even if you use Photoshop to cut one person out of the picture and put that person into a different photo. I'm not sure if that is possible, but that is what we want.

cool_dude85|1 year ago

The reason human verification is necessary is that the government is relying on something called the "private search" doctrine to conduct the search without a warrant. This doctrine allows them to repeat a search already conducted by a private party (i.e., Google) without getting a warrant. Since Google didn't actually look at the file, the government is not able to look at the file without a warrant, as that search exceeds the scope of the initial search Google performed.

ARandumGuy|1 year ago

Naively, 1/(2^{hash_size_in_bits}), which is about 1 in 4 billion odds for a 32-bit hash and gets astronomically low at higher bit counts.

Of course, that's assuming a perfect, evenly distributed hash algorithm. And that's just the odds that any given pair of images has the same hash, not the odds that a hash conflict exists somewhere on the internet.
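To make the difference between "this pair collides" and "some pair collides somewhere" concrete, here is a minimal sketch using the standard birthday approximation. It assumes an ideal, uniformly distributed hash; the function names and the 100,000-item example are made up for illustration and are not taken from any real system.

```python
import math


def pair_collision_probability(bits: int) -> float:
    """Chance that two specific, different inputs share a hash (ideal hash assumed)."""
    return 1.0 / (2 ** bits)


def any_collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: chance that at least one pair among n_items collides."""
    exponent = -n_items * (n_items - 1) / (2.0 * 2 ** bits)
    # -expm1(x) computes 1 - exp(x) without losing precision for tiny x
    return -math.expm1(exponent)


if __name__ == "__main__":
    print(pair_collision_probability(32))           # one given pair, 32-bit hash
    print(any_collision_probability(100_000, 32))   # any pair among 100k items, 32-bit hash
    print(any_collision_probability(100_000, 256))  # same, 256-bit hash
```

With a 32-bit hash, a collection of only 100,000 items is already more likely than not to contain a colliding pair, while at 256 bits the same collection is nowhere near that.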

henryfjordan|1 year ago

You need to know the input space as well as the output space (hash size).

If you have a 32-bit hash but your input is only 16 bits, the hash can map every input to a distinct output, so collisions are avoidable (and you'll be wasting a ton of space on your hashes!).

Image files can get into the megabytes though, so unless the output hash is large the potential for collisions is probably not all that low.
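A quick sketch of that input-space versus output-space point, assuming CRC-32 as the stand-in 32-bit hash and the first byte of SHA-256 as a stand-in 8-bit hash. Both choices are just for illustration, and `count_distinct` is a made-up helper.

```python
import hashlib
import zlib


def count_distinct(hash_fn, inputs):
    """Number of distinct hash values produced over the given inputs."""
    return len({hash_fn(x) for x in inputs})


# Every possible 16-bit input, encoded as 2 bytes
inputs_16bit = [i.to_bytes(2, "big") for i in range(2 ** 16)]

# 16-bit inputs into a 32-bit output: a collision-free mapping is possible
distinct_crc32 = count_distinct(zlib.crc32, inputs_16bit)

# 16-bit inputs into an 8-bit output: at most 256 values, so collisions are forced
distinct_8bit = count_distinct(lambda b: hashlib.sha256(b).digest()[0], inputs_16bit)

if __name__ == "__main__":
    print(distinct_crc32, "distinct 32-bit values for", len(inputs_16bit), "inputs")
    print(distinct_8bit, "distinct 8-bit values for", len(inputs_16bit), "inputs")
```

Once the inputs outnumber the possible outputs, collisions have to exist by the pigeonhole principle. How often you actually run into one still depends on the output size and on how many items you compare.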

gorjusborg|1 year ago

> Out of curiosity, what is the false positive rate of a hash match?

No way to know without knowledge of the 'proprietary hashing technology'. Theoretically, though, any hash with a fixed-size output has infinitely many inputs that produce the same output.

Mismatching hash values from the same hashing algorithm can prove mismatching inputs, but matching hash values don't ensure matching inputs.

> I wonder if (or why) human verification is actually necessary here

It's not about frequency, it's about criticality of getting it right. If you are going to make a negatively life-altering report on someone, you'd better make sure the accusation is legitimate.

cool_dude85|1 year ago

I'd say the focus on hashing is a bit of a red herring.

Most anyone would agree that the hash matching should probably form probable cause for a warrant, allowing a judge to sign off on the police searching (i.e., viewing) the image. So, if it's a collision, the cops get a warrant and open up your Linux ISO or cat meme, and it's all good. Probably the ideal case is that they get a warrant to search the specific image, and are only able to obtain a warrant to search your home and effects, etc. if the image does appear to be CSAM.

At issue here is the fact that no such warrant was obtained.

nokcha|1 year ago

For non-broken cryptographic hashes (e.g., SHA-256), the false-positive rate is negligible. Indeed, cryptographic hashes were designed so that even nation-state adversaries do not have the resources to generate two inputs that hash to the same value.

See also:

https://en.wikipedia.org/wiki/Collision_resistance

https://en.wikipedia.org/wiki/Preimage_attack

int_19h|1 year ago

These are not the kinds of hashes used for CSAM detection, though, because that would only work for an exact byte-for-byte copy of the file: any resizing, compression, etc. would drastically change the hash.

Instead, systems like these use perceptual hashing, in which similar inputs produce similar hashes, so that one can test for likeness. Those have much higher collision rates, and are also much easier to deliberately generate collisions for.
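As a rough illustration of the difference, here is a toy "average hash", a simple textbook perceptual hash. This is an assumption chosen for clarity, not the proprietary algorithm any real detection system uses, and the 4x4 grayscale grids are invented test data.

```python
import hashlib


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def sha256_hex(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


original = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [12, 205, 35, 225],
    [11, 215, 20, 240],
]
# Same picture, slightly brighter (a stand-in for color correction)
brightened = [[min(255, p + 10) for p in row] for row in original]
# A visually different picture (inverted light and dark regions)
inverted = [[255 - p for p in row] for row in original]

if __name__ == "__main__":
    print(hamming_distance(average_hash(original), average_hash(brightened)))
    print(hamming_distance(average_hash(original), average_hash(inverted)))
    print(sha256_hex(original) == sha256_hex(brightened))
```

The brightened copy keeps the same perceptual hash while its SHA-256 changes completely, which is the property that makes these hashes useful for likeness testing. It is also why unrelated images that happen to share a light/dark layout can collide.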