Heh. I once had to make an argument that 256-bit randomly assigned identifiers are good enough without explicit collision checking. People wanted me to add complex and expensive collision checks.

My argument was that 2^256 actually approaches the number of atoms in the observable universe (within 1 to 3 orders of magnitude), and that collisions are so unlikely that we'll have millions of datacenter meltdowns first (all assuming we have a good source of random numbers, of course). In the end I convinced everybody that even 128 bits are good enough, without any collision checking required.

I thought my argument was clever, but this is so much better. :)

da_chicken|12 days ago

Nah, it's much easier than that.

The total amount of computer data across all of humanity is less than 1 yottabyte. We're expected to reach 1 yottabyte within the next decade, and will probably do so before 2030. That's all data, everywhere, including nation-states.

The birthday paradox says that you'll reach a 50% chance of at least one collision (as a conservative first order approximation) at the square root of the domain size. sqrt(2^256) is 2^128.

Now, a 256-bit identifier takes up 32 bytes of storage. 2^128 * 32 bytes is about 10^16 yottabytes. That's 10 quadrillion yottabytes just to store the keys. And it's even odds whether you'll have a collision or not.

And if the 50% number scares them, well, you'll have a 1% chance of a collision at around... 2^128 * 0.1. Yeah, so you don't reach a 1% chance over the whole life of the system until you get to a quadrillion yottabytes of keys.

Because you're never getting anywhere near the square root of the domain size, the odds against any collision occurring are flatly astronomical.
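For anyone who wants to check the arithmetic above, here is a rough sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2N). The function names are made up for illustration, and a yottabyte is assumed to be 10^24 bytes. Under this approximation, exactly 2^128 IDs gives roughly a 39% chance, 50% lands near 1.18 * 2^128, and 1% lands near 0.14 * 2^128, so the round numbers above hold up to within a small constant factor.

```python
import math

YOTTABYTE = 10 ** 24  # assumption: decimal yottabyte, in bytes


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_ids uniform random IDs."""
    domain = 2 ** bits
    # Birthday approximation; expm1 keeps precision when the probability is tiny.
    exponent = -(n_ids * (n_ids - 1)) / (2 * domain)
    return -math.expm1(exponent)


def ids_for_probability(p: float, bits: int) -> float:
    """Approximate number of IDs at which the collision chance reaches p."""
    return math.sqrt(2 * (2.0 ** bits) * -math.log1p(-p))


if __name__ == "__main__":
    sqrt_domain = 2 ** 128
    print("p at 2^128 IDs:", collision_probability(sqrt_domain, 256))
    print("IDs for 50%, in units of 2^128:", ids_for_probability(0.5, 256) / sqrt_domain)
    print("IDs for 1%, in units of 2^128:", ids_for_probability(0.01, 256) / sqrt_domain)
    print("yottabytes to store 2^128 keys:", sqrt_domain * 32 / YOTTABYTE)
    # The 128-bit case, with a (hypothetical) trillion IDs in the system:
    print("p for 10^12 IDs at 128 bits:", collision_probability(10 ** 12, 128))
```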

nextaccountic|12 days ago

If the mechanism for generating those 256 random bits is distributed and untrusted parties generate IDs, then you need collision detection, because they may be malicious.

If it's not distributed you can just have a counter

If it's distributed but coordinated by a single party (say, it's your servers), you can shard on incrementing counters. That is, every server is assigned its own range of IDs.
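A minimal sketch of the sharded-counter idea, assuming each server is handed a unique shard index by whoever coordinates the fleet. The name `shard_id_generator` and the 16/48 bit split are arbitrary choices for illustration, not anything from a real system. The shard index goes in the high bits and a local counter in the low bits, so ranges never overlap and no cross-server coordination is needed after assignment.

```python
from itertools import islice
from typing import Iterator


def shard_id_generator(shard_index: int, shard_bits: int = 16,
                       counter_bits: int = 48) -> Iterator[int]:
    """Yield IDs from the range owned by one shard (hypothetical layout)."""
    if not 0 <= shard_index < 2 ** shard_bits:
        raise ValueError("shard_index does not fit in shard_bits")
    # High bits identify the shard, low bits are a local counter.
    base = shard_index << counter_bits
    for counter in range(2 ** counter_bits):
        yield base | counter


if __name__ == "__main__":
    server_a = shard_id_generator(0)
    server_b = shard_id_generator(1)
    print(list(islice(server_a, 3)))
    print(list(islice(server_b, 3)))
```

The catch is that the counter is state: each server has to persist it across restarts, which is exactly the bookkeeping a random-ID scheme avoids.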

linuxhansl|11 days ago

In this case it was distributed within our data centers (tens of thousands of machines or so at that time, spread around the planet), but the code to generate IDs was 100% under our control. Rather than inventing some distributed generation scheme (or collision detection), a stateless approach with random numbers just seemed the right choice.
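For illustration only, the stateless approach can be as small as the sketch below. This is not the code described above; `new_id` is a hypothetical helper, and it assumes the operating system's CSPRNG (which Python's `secrets` module draws from) counts as the "good source of random numbers".

```python
import secrets

ID_BYTES = 16  # 128 bits, the size settled on in the comment at the top


def new_id() -> str:
    """Return a random 128-bit identifier as 32 hex characters."""
    # No counter, no coordination, no collision check: every machine
    # just draws from its own OS-provided random source.
    return secrets.token_hex(ID_BYTES)


if __name__ == "__main__":
    print(new_id())
```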