item 27990839

Switching from GMP to GCC's __int128 reduced run time by 95% (2016)

71 points | optimalsolver | 4 years ago | nu42.com | reply

9 comments

[+] nanis|4 years ago|reply
Note that soon after I posted this, someone noticed [1,2] that it is possible to avoid a whole lot of work, but I am still glad I realized how much performance one can gain by dropping GMP in favor of `__int128`.

It seems that for numbers on this side of 1024 bits, writing purpose-built routines using `__int128` or SSE/AVX instructions rather than relying on general-purpose library functions is likely to pay off.

[1]: https://news.ycombinator.com/item?id=10915461

[2]: https://www.nu42.com/2016/01/excellent-numbers-explicit-solu...

[+] olliej|4 years ago|reply
This is presumably due to eliminating many/most of the function calls (I can't imagine division being inlined, but maybe I'm wrong) and to fixed-size allocation.

My guess is that the rest of the overhead comes from things like GMP needing to allocate heap storage, since it doesn't get the benefit of fixed-size storage.

[+] madars|4 years ago|reply
Yep, the program referenced in the blog post https://github.com/briandfoy/excellent_numbers/blob/master/c... is using GMP's variable-width numbers (mpz_* routines). If you know the bit lengths in advance, you can provide pre-allocated C arrays to the mpn_* routines: https://gmplib.org/manual/Low_002dlevel-Functions . This is likely to be much faster (you avoid the bookkeeping and malloc inside mpz_*), and it is what e.g. libsnark does for its general Fp arithmetic. (E.g. addition becomes https://github.com/scipr-lab/libff/blob/develop/libff/algebr... ) Of course, the mpn_* routines still loop where __int128 code would presumably be unrolled, but the bulk of the gain should come from shedding the mpz_* overhead.
[+] anthk|4 years ago|reply
What about MPI?
[+] borramakot|4 years ago|reply
I don't know much about MPI, but I thought it related to message passing/distributed workloads rather than arithmetic performance?