This is a perpetual problem in computer science: people want a hash function, decide that the random function very nearly does what they want, use the random function as a hash function, and then are dumbfounded to find there is no hard specification for how it works internally. The random function carries no guarantee of stable results across versions, systems, or time. Hell, I would even be harsh enough to say that any reproducibility in random is an accident of implementation. random should always be considered the non-deterministic function; hash is the deterministic one.
In most languages and libraries, hashing is still only deterministic within a given run of the program. Authors generally have no qualms "fixing" implementations over time, and some systems introduce a salt to the hash intentionally to help protect web programmers from DOS "CVEs".
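To make the distinction concrete, a Python sketch: the built-in `hash` is salted per process (unless `PYTHONHASHSEED` is pinned), while `hashlib` digests are specified byte-for-byte and will never change out from under you.

```python
import hashlib

# hash() on str/bytes is randomized per process (SipHash with a startup
# salt), so this value generally differs from one run to the next:
run_local = hash("hello")

# A specified cryptographic hash is deterministic across runs, machines,
# and versions:
stable = hashlib.sha256(b"hello").hexdigest()
```

If you need a stable identifier across runs and machines, reach for `hashlib` (or any non-crypto hash with a published spec), never the language's built-in `hash`.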
If you want reproducible RNG, you need to write it yourself.
Systems based on JAX's notion of splitting a PRNG are often nice. If your splitting function takes inputs (salts, seeds, whatever you want to call them), you gain the property that sub-branches are reproducible even if you change the overall input to have more or fewer components.
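A minimal sketch of that idea in plain Python, with a hash standing in for JAX's key-splitting (the names `split`/`uniform` and the labels are illustrative, not any real API):

```python
import hashlib
import struct

def split(key: bytes, label: str) -> bytes:
    # Derive an independent child key from a parent key plus a label.
    return hashlib.sha256(key + b"/" + label.encode()).digest()

def uniform(key: bytes) -> float:
    # Map a key deterministically to a float in [0, 1).
    (n,) = struct.unpack(">Q", hashlib.sha256(key).digest()[:8])
    return n / 2.0 ** 64

root = b"experiment-seed"
# Each branch depends only on its own path from the root, so adding or
# removing sibling branches never perturbs these draws.
init = uniform(split(root, "init-weights"))
mask = uniform(split(root, "dropout-mask"))
```

Because each draw is keyed by its path rather than by a position in a shared stream, reordering or deleting other branches leaves it untouched.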
The article has absolutely nothing to do with "hashes vs. PRNGs". It's all about the linear algebra and floating-point operations that happen (in this case) to be used to transform those numbers.
Also, PRNG and hash functions are extremely similar, and there's no reason why one should be deterministic and the other not. They're both built out of simple integer and bit operations. If a certain PRNG is implementation-defined, that's a detail of the specific (possibly standard) library you've chosen; it's nothing fundamental.
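As a concrete illustration, here is a splitmix64-style generator in Python (constants as in the public-domain reference implementation; treat the exact values as illustrative). It is nothing but masked integer adds, multiplies, xors, and shifts, so it is bit-for-bit deterministic on any platform:

```python
MASK64 = (1 << 64) - 1

def splitmix64(state: int) -> tuple[int, int]:
    # One generator step: returns (next_state, 64-bit output). Every
    # operation is an exactly-specified integer/bit operation, so nothing
    # here can vary across machines or compilers.
    state = (state + 0x9E3779B97F4A7C15) & MASK64
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return state, z ^ (z >> 31)

state = 42
outputs = []
for _ in range(3):
    state, out = splitmix64(state)
    outputs.append(out)
```

Swap the state update for a counter and you have a hash-like function; keep it and you have a PRNG. The building blocks are identical, which is the point.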
Most people don't really care about numerical stability or correctness. What they usually want is reproducibility, but they go down a rabbit hole with those other topics as a way to get it, at least in part because everyone thinks reproducibility is too slow.
It was 20 years ago, but that's not the case today. The vast majority of hardware today implements 754 reproducibly if you're willing to stick to a few basic principles:
1. same inputs
2. same operations, in the same order
3. no "special functions", denormals, or NaNs.
If you accept these restrictions (and realistically you weren't handling NaNs or denormals properly anyway), you can get practical reproducibility on modern hardware for minimal (or no) performance cost if your toolchain cooperates. Sadly, toolchains don't prioritize this because it's easy to get wrong across the scope of a modern language and users don't know that it's possible.
The same operations in the same order is a tough constraint in an environment where core count is increasing and clock speeds/IPC are not. It's hard to rewrite some of these algorithms to use a parallel decomposition that's the same as the serial one.
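The root cause is that floating-point addition is not associative, so any change in reduction shape changes bits. One standard fix is to pin the decomposition to the data layout (a fixed chunk size) rather than to the worker count; a pure-Python sketch:

```python
def chunked_sum(xs, chunk=1024):
    # Partial sums over fixed-size chunks, combined in index order. Workers
    # may compute the partials in any order or in parallel; the reduction
    # tree is pinned by `chunk`, not by the thread count, so the result is
    # bit-identical no matter how many cores ran it.
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

# Non-associativity in one line: the same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
```

The price is that the chunked result differs from the plain serial sum, so the decomposition itself becomes part of your reproducibility contract.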
I've done a lot of work on reproducibility in machine learning systems, and it's really, really hard. Even the JVM got me by changing some functions in `java.lang.Math` between versions & platforms (while keeping to their documented 2ulp error bounds).
I don't actually understand why you'd want reproducibility in a statistical simulation. If you fix the output, what are you learning? The point of the simulation is to produce different random numbers so you can see what the outcomes are like... right?
Let's say I write a paper that says "in this statistical model, with random seed 1495268404, I get Important Result Y", and you criticize me on the grounds that when you run the model with random seed 282086400, Important Result Y does not hold. Doesn't this entire argument fail to be conceptually coherent?
4. No math libraries, even if their results aren't used in the deterministic path you care about (e.g., floating-point rounding mode bugs)
5. All floating-point optimizations must be hand-rolled. That's implied by (2) and "toolchain cooperates", but it's worth calling out explicitly. Consider, e.g., summing a few thousand floats. On many modern computers, a near-optimal solution is having four vector accumulators, skipping in chunks four vectors wide, adding to each accumulator, adding the accumulators at the end, and then horizontally adding the result (handling any non-length-aligned straggling elements left as an exercise for the reader, and optimal behavior for those varies wildly). However, this has different results depending on whether you use SSE, AVX, or AVX512 if you want to use the hardware to its full potential. You need to make a choice (usually a bias toward wider vector types is better than smaller types, especially on AMD chips, but this is problem-specific), and whichever choice you make you can't let the compiler reorder it.
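A scalar Python emulation of that four-accumulator pattern (illustrative only; a real implementation would use vector intrinsics):

```python
def sum4(xs):
    # Accumulator j receives elements j, j+4, j+8, ... The reduction order
    # is fixed by this code rather than left to a compiler, so results are
    # reproducible -- but they differ in general from a plain left-to-right
    # sum, which is exactly why the choice must be made by hand and frozen.
    acc = [0.0, 0.0, 0.0, 0.0]
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        for j in range(4):
            acc[j] += xs[i + j]
    total = (acc[0] + acc[1]) + (acc[2] + acc[3])  # "horizontal" add
    for x in xs[n:]:  # straggler elements, handled serially
        total += x
    return total
```

Widening from 4 scalar accumulators to 4 SSE/AVX/AVX-512 vector accumulators changes the grouping again, which is the incompatibility described above.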
Even denormals and NaNs should be perfectly consistent, at least on CPUs. (as long as you're not inspecting the bit patterns of the NaNs, at least)
Irrational stdlib functions (trig/log/exp; not sqrt, which IEEE 754 requires to be correctly rounded) should really be basically the only source of non-reproducibility in typical programs (assuming they don't explicitly do different things depending on non-controlled properties; and don't use libraries doing that either, which is also a non-trivial ask; and that there's no overly-aggressive optimizer that does incorrect transformations).
I'd hope that languages/libraries providing seeded random sources with a guarantee of equal behavior across systems would explicitly note which operations aren't reproducible though, otherwise seeding is rather pointless; no clue if R does that.
I mean, if you get different results for the same input on the same machine with the same code, I'd be concerned. But that's not that interesting really, as you actually care that you get the right answer, not the same answer (otherwise you get https://en.wikipedia.org/wiki/Oil_drop_experiment). Stability and correctness will get you closer to a right answer than demanding you get the same answer to someone else who's solving the wrong problem the wrong way.
I found this an enjoyable read. I also have Wilkinson, both the text and the Algol book, which I used many years ago to write a Fortran eigenvalue/eigenvector routine. It worked very nicely. It was done in VAX Fortran, and it showed me that having subscript checking on added 30% to the run time.
I don't grok this but if you had to describe it in a nutshell, is this because of a race condition? Differences in HW? Floating point ops have some randomness built in?
Super rough summary of the first half: in order to pick out random vectors with a given shape (where the "shape" is determined by the covariance matrix), MASS::mvrnorm() computes some eigenvectors, and eigenvectors are only well defined up to a sign flip. This means tiny floating-point differences between machines can result in one machine choosing v_1, v_2, v_3, ... as eigenvectors, while another machine chooses -v_1, v_2, -v_3, ... The result for sampling random numbers is totally different with the sign flips (but still "correct" because we only care about the overall distribution--these are random numbers after all). The section around "Q1 / Q2" is the core of the article.
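The sign ambiguity is easy to see by hand with a 2x2 covariance matrix (pure Python here; in the article it's LAPACK, underneath MASS::mvrnorm, making the arbitrary choice):

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]  # symmetric, eigenvalues 3 and 1

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

s = 1 / math.sqrt(2)
v = [s, s]    # unit eigenvector for eigenvalue 3
w = [-s, -s]  # its negation is an equally valid unit eigenvector; nothing
              # in the definition prefers one, so different builds or BLAS
              # libraries may legitimately return either.
```

A common defence is to impose a convention after the fact, e.g. flip each eigenvector so that its largest-magnitude component is positive.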
There's a lot of other stuff here too: mvtnorm::rmvnorm() also can use eigendecomp to generate your numbers, but it does some other stuff to eliminate the effect of the sign flips so you don't see this reproducibility issue. mvtnorm::rmvnorm also supports a second method (Cholesky decomp) that is uniquely defined and avoids eigenvectors entirely, so it's more stable. And there's some stuff on condition numbers not really mattering for this problem--turns out you can't describe all possible floating point problems a matrix could have with a single number.
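For contrast, here is a minimal Cholesky factorization (a pure-Python sketch, not mvtnorm's actual code): requiring a positive diagonal makes the factor unique, so there is no sign left to flip.

```python
import math

def cholesky(A):
    # Lower-triangular L with L @ L.T == A, for symmetric positive-definite
    # A. With the diagonal of L taken positive, L is unique.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

L = cholesky([[4.0, 2.0], [2.0, 3.0]])
```

Sampling is then `mu + L z` for a standard-normal vector `z`, with no eigenvectors anywhere in the pipeline.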
So, you want to use the random function but want a constant output. Simpler to just use a constant array, rather than imposing your corner-case interpretation on random.
You can’t always do that, because you often don’t know how many pseudorandom numbers you will need. Search for randomized numerical linear algebra for more, but say you’re doing research into genetic conditions. Your data might be a matrix of samples vs. genes, where each cell records how much a gene is expressed in a certain sample. So you have a very big matrix on which you need to do a singular value decomposition. The standard way to do this involves random sampling of the columns to keep the computational complexity manageable. You would still want to seed the RNG so your results are reproducible.
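A sketch of that seeding pattern with Python's stdlib `random` (names are illustrative; and, per the thread's earlier caveat, a seeded stream is only guaranteed to replay within a given library version):

```python
import random

def sample_columns(n_cols: int, k: int, seed: int) -> list[int]:
    # A private, seeded stream: it doesn't touch global RNG state, and it
    # replays the same k column indices every time the sketch is rebuilt.
    rng = random.Random(seed)
    return rng.sample(range(n_cols), k)

cols = sample_columns(10_000, 50, seed=2024)
```

The key point is that the seed, not a precomputed constant array, is what you record: the algorithm decides at run time how much randomness it consumes.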