
Improving Redis CRC performance

112 points | 11 years ago | matt.sh

38 comments

[+] colanderman|11 years ago|reply
> CRCs are inherently non-parallelizable

No.

CRC is a bitwise linear function over XOR (+). Digesting a message ABCD, crc(ABCD), is simply the expression crc(crc(crc(crc(A) + B) + C) + D). To parallelize, this can be rearranged (via distributivity of CRC over XOR) as crc(crc(crc(crc(A) + B))) + crc(crc(C) + D). This is simply the CRC of the second half, XORed with the CRC of the first half passed through the function crc(crc(_)) (crc^2), which is itself linear and can be precomputed just like CRC is.

This extends to arbitrary block sizes and/or arbitrary strides through the message. So e.g. you can have four threads compute the CRCs of separate 64-byte (cache-line-sized) blocks within 256-byte chunks, separating each block within a thread using crc^192, offset the results as appropriate with crc^64^k, and then combine them with XOR.
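For concreteness, here's a sketch of that split/combine identity with a raw CRC (zero init, no final XOR; the polynomial is the reflected CRC-64-Jones value, used here purely for illustration — implementations that add init/final XOR constants need those folded out before the raw CRC is linear like this):

```python
POLY = 0xad93d23594c935a9  # reflected CRC-64-Jones polynomial

def crc64_raw(data, crc=0):
    """Bit-at-a-time reflected CRC-64 with zero init and no final XOR."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    return crc

# Linearity over XOR: crc(x ^ y) == crc(x) ^ crc(y) for equal-length inputs.
x, y = b"abcdefgh", b"12345678"
xy = bytes(p ^ q for p, q in zip(x, y))
assert crc64_raw(xy) == crc64_raw(x) ^ crc64_raw(y)

# Split/combine: crc(a + b) == crc(a + zeros(len(b))) ^ crc(b), so each half
# can be digested independently (on its own thread) and XORed together.
# "crc(a + zeros)" is the first half pushed through the precomputable shift.
a, b = b"hello, ", b"world!"
combined = crc64_raw(a + b"\x00" * len(b)) ^ crc64_raw(b)
assert combined == crc64_raw(a + b)
```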

(Protip: these fun properties are exactly what make CRCs a TERRIBLE choice for cryptography!)

[+] seiji|11 years ago|reply
> you can have four threads compute

You could. You could also have a billion threads. But the article is talking about single-core instruction level parallelism where normal tricks like loop unrolling won't work.

[+] cperciva|11 years ago|reply
Two things come to mind here:

1. I don't know why Redis is using CRC-16 and CRC-64, but if you want CRC-32 then using the CRC32 instruction in SSE 4.2 yields pretty impressive performance.

2. CRCs are parallelizable, it just takes a bit more work (and math).

[+] fitshipit|11 years ago|reply
I believe, based on context, that you and the article are talking about different kinds of parallelism. Yours is the more usual sense; the author seems to be referring to the fact that CRC does not exhibit the instruction-level parallelism that out-of-order execution exploits. This is because the loop is very tight (probably a few instructions) and each iteration depends on the results of the previous one, so there is only a tiny window for reordering instructions.
[+] SFjulie1|11 years ago|reply
2. The theory of CRC comes from noisy telecom lines, where noise has a weird non-linear random distribution that makes it impossible to "vectorize" (i.e. parallelize) the problem (http://scienceblogs.com/goodmath/2007/07/27/fractal-dust-and...)

A dict/document/transaction may be a fractal-like structure, but the probability of alteration does not follow the same shape/locality.

After reading this http://www.slideshare.net/Dataversity/redis-in-action , I am surprised.

Why the heck did they decide to focus on CRC? Premature optimization?

First, CRC16 is in fact used as a "perfect hash function". It is confusing to focus on the implementation and not the purpose: it is used to distribute data evenly across buckets. It is the "perfect hash function" they should use and focus on (CRC being a special subcase of the totally legitimate candidates for a perfect hash function; but if tomorrow a super hyper fast perfect hash function comes out ... ).
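As a sketch of CRC-16 used this way, here is the XModem/CCITT variant (the one Redis's crc16 implements) spreading keys over buckets; the 16384 bucket count is an assumption for illustration:

```python
def crc16(data, crc=0):
    """Bit-at-a-time CRC-16, poly 0x1021, init 0, no reflection (XModem)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

BUCKETS = 16384  # illustrative bucket count

def bucket_for(key: bytes) -> int:
    # Any key hashes to a stable bucket; even distribution is the goal here,
    # not error detection.
    return crc16(key) % BUCKETS

assert crc16(b"123456789") == 0x31C3  # standard XModem check value
```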

The CRC64 is used for a valid case of CRC BUT in an invalid domain.

Mathematically, I could prove they are wrong. I know where they have a probability problem, how they are sharpening the transition between the functional and dysfunctional states, and how they are making it so irreversible that they will be FUBAR.

1) They are of course leaving their ass open to an attack by image from a malevolent attacker with access to the storage they are checksumming (it is almost used as a digital signature, so it should be a crypto hash function).

2) The probability of random collisions is normally balanced by the fact that a cause altering the signal also has some odds of altering the CRC, and the signal normally also validates the CRC; that is why, in real life, the CRC should be a fixed fraction of the signal, set according to the probability of co-alteration of both. So if you have a tolerance for faulty storage, you will use it, and you might end up reconstructing data from randomly corrupted data with the same CRC. It will be a mess on the very exceptional but predictable day shit happens, that's all I am saying. Every safeguard will be green :)

Something seems fishy.

[+] seiji|11 years ago|reply
1. from the antirez interview currently on the front page: 'One of the main characteristics of stuff I make is that they are "strange", don't resemble how a given problem was solved in the past' — sure, SSE CRC32 exists, but it's safer to use a bigger CRC, so performance wasn't a deciding factor there.

2. Are they? Do explain.

[+] Scaevolus|11 years ago|reply
I doubt this is as big a win as claimed. Sure, a benchmark managed to achieve 1.6GB/s CRC-64 throughput, but that requires 16KB (!) of lookup tables. If those don't reliably stay in at least L2 cache, the throughput could be very poor in practice.

If speed is important, xxHash is probably a better choice than CRC-64. http://fastcompression.blogspot.com/2012/04/selecting-checks...

If you want both speed and security, SipHash-2-4 is a cryptographically secure hash algorithm that runs at ~1.4 cycles/byte for long inputs (faster than even crc16speed), and yet is still fast for short inputs (2.4 cpb for 64 bytes). It also doesn't require 16KB of lookup tables. https://131002.net/siphash/ http://bench.cr.yp.to/results-auth.html#amd64-titan0

[+] seiji|11 years ago|reply
The constraint was: the output must be CRC-64-Jones, so switching implementations wasn't an option.

So, given that constraint, any better solutions are welcome. :)

[+] baruch|11 years ago|reply
I've collected quite a few CRC32 methods and compared them all; it would be interesting to do the same for CRC64 to get the whole range of options too.

The results are here: https://github.com/baruch/crcbench/blob/master/log.i3-2330M-...

[+] b3tta|11 years ago|reply
"crc-mark-adler-hw" uses SSE4.2 just like your "crc-sse42". The latter one is just very poorly written, since it only uses _mm_crc32_u8() once per iteration (instead of 3 times).
[+] MrBuddyCasino|11 years ago|reply
Nice! I've ported it to Java, and got roughly a 2x speed increase over the version posted at http://stackoverflow.com/a/20586626/58962 (same question Mark Adler posted in).

On Java 8, I get 647 MB/s on a 2.3 GHz MBP (the non-optimized version was 320 MB/s). Not bad, but Matt's C version manages 1628.09 MB/s.

Update: with some optimizations and eliminating I/O in the benchmark, we now clock in at a respectable 1150 MB/s (up from 390 MB/s unoptimized).

[+] Isamu|11 years ago|reply
The improvement is to switch from a 256-entry table to an 8x256 table, apparently unrolling the 8-bit loop to get a full 64 bits in each iteration.

So if speed is the thing, why not a 16-bit table? Same idea, faster. 8*64k = 512 kbyte table.
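Quick arithmetic on the table footprints being compared, assuming 8-byte entries for a 64-bit CRC:

```python
entry = 8  # bytes per table entry for a 64-bit CRC

byte_table   = 256 * entry      # classic byte-at-a-time table
slice8_table = 8 * 256 * entry  # the 8x256 layout from the article
table_16bit  = 65536 * entry    # a single 16-bit-indexed table

assert byte_table == 2 * 1024      # 2 KB: fits easily in L1
assert slice8_table == 16 * 1024   # 16 KB: already pressures a 32 KB L1
assert table_16bit == 512 * 1024   # 512 KB: spills past L1 and most L2
```

The cache-pressure trade-off is exactly the concern raised elsewhere in the thread: a bigger table saves iterations but loses if its working set no longer stays resident.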

[+] seiji|11 years ago|reply
If your lookup table is too big it'll blow out your L1 cache too.

> apparently unrolling

It's not just _unrolling_: it's merging position-dependent pre-computed values and XORing them together. A little more complicated than a quick loop unrolling (hence nobody discovered it until 2006).
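A sketch of that idea in Python: eight position-dependent 256-entry tables (16 KB total for 64-bit entries) whose lookups are XORed together, so one loop iteration consumes 8 bytes. The polynomial is the reflected CRC-64-Jones value, and the table construction is the standard slicing-by-8 recurrence, assumed here for illustration; the result is checked against the plain byte-at-a-time loop:

```python
POLY = 0xad93d23594c935a9  # reflected CRC-64-Jones polynomial

# T[0] is the classic byte-at-a-time table; T[k][i] is byte i advanced by
# k additional zero bytes, i.e. the position-dependent pre-computation.
T = [[0] * 256 for _ in range(8)]
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    T[0][i] = crc
for k in range(1, 8):
    for i in range(256):
        T[k][i] = T[0][T[k - 1][i] & 0xFF] ^ (T[k - 1][i] >> 8)

def crc64_bytewise(data, crc=0):
    for b in data:
        crc = T[0][(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc

def crc64_slice8(data, crc=0):
    i, n = 0, len(data)
    while n - i >= 8:
        crc ^= int.from_bytes(data[i:i + 8], "little")
        # Eight independent table lookups XORed together; none depends on
        # another, which is what exposes instruction-level parallelism.
        crc = (T[7][crc & 0xFF] ^ T[6][(crc >> 8) & 0xFF] ^
               T[5][(crc >> 16) & 0xFF] ^ T[4][(crc >> 24) & 0xFF] ^
               T[3][(crc >> 32) & 0xFF] ^ T[2][(crc >> 40) & 0xFF] ^
               T[1][(crc >> 48) & 0xFF] ^ T[0][(crc >> 56) & 0xFF])
        i += 8
    for b in data[i:]:  # leftover tail bytes, one at a time
        crc = T[0][(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc

data = bytes(range(256)) * 10
assert crc64_slice8(data) == crc64_bytewise(data)
```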

[+] tlipcon|11 years ago|reply
Existing library which does the same thing, and faster, with arbitrary polynomials: http://code.google.com/p/crcutil/
[+] seiji|11 years ago|reply
Not entirely the same. The problem isn't the polynomial, but the entire model.

crcutil also has many fatal flaws for inclusion in other projects: the code is horrifically ugly, it's hosted on google code so it looks unmaintained and abandoned, it's C++, it uses unreadable ASM to be "fast" but that's not portable across all the architectures Redis runs on, and it has a horrible build system (because of ASM and C++ and other issues).

[+] MichaelGG|11 years ago|reply
Why not use xxHash? https://code.google.com/p/xxhash/

Docs say it hits over 13 GB/s on a Core i5-3340M @ 2.7 GHz.

Or if this is an older design decision, why not use MurmurHash, which is also pretty fast and has been around for a while.

[+] seiji|11 years ago|reply
That's almost like answering "do you want chicken or steak?" with "I'd like a balloon!"

The article didn't pretend to compare all possible options; it aimed to improve existing options while maintaining backwards compatibility.

Redis doesn't make design decisions based on what's fast, it prefers to use what's simple.

[+] tedunangst|11 years ago|reply
> This is a pure performance optimization, not a correctness requirement. Removing the initial pre-alignment loop doesn't change any results.

Somebody needs to test their code on a CPU that requires alignment before loading 64-bit values.