I believe, based on context, that you and the article are talking about different kinds of parallelism. Yours is the more usual sense; the author seems to be referring to the fact that CRC does not exhibit the instruction-level parallelism that out-of-order execution exploits. This is because the loop is very tight (probably a few instructions) and each iteration depends on the results of the previous one, so there is only a tiny window for reordering instructions.
1. from the antirez interview currently on the front page: 'One of the main characteristics of stuff I make is that they are "strange", don't resemble how a given problem was solved in the past' — sure, SSE CRC32 exists, but it's safer to use a bigger CRC, so performance wasn't a deciding factor there.
I doubt this is as big a win as claimed. Sure, a benchmark managed to achieve 1.6GB/s CRC-64 throughput, but that requires 16KB (!) of lookup tables. If those don't reliably stay in at least L2 cache, the throughput could be very poor in practice.
I've collected quite a few CRC32 methods and compared them all; it would be interesting to do the same for CRC64 to get the whole range of options too.
"crc-mark-adler-hw" uses SSE4.2 just like your "crc-sse42". The latter one is just very poorly written, since it only uses _mm_crc32_u8() once per iteration (instead of 3 times).
Nice! I've ported it to Java, and got roughly a 2x speed increase over the version posted at http://stackoverflow.com/a/20586626/58962 (same question Mark Adler posted in).
If your lookup table is too big it'll blow out your L1 cache too.
Not entirely the same. The problem isn't the polynomial, but the entire model.
colanderman | 11 years ago
No.
CRC is a bitwise linear function over XOR (+). Digesting a message ABCD, crc(ABCD), is simply the expression crc(crc(crc(crc(A) + B) + C) + D). To parallelize, this can be rearranged (via distributivity of CRC over XOR) as crc(crc(crc(crc(A) + B))) + crc(crc(C) + D). This is simply the CRC of the second half, XORed with the CRC of the first half passed through the function crc(crc(_)) (crc^2), which is itself linear and can be precomputed just like CRC is.
This extends to arbitrary block sizes and/or arbitrary strides through the message. So e.g. you can have four threads compute the CRCs of separate 64-byte (cache-line-sized) blocks within 256-byte chunks, separating each block within a thread using crc^192, offset the results as appropriate with crc^64^k, and then combine them with XOR.
(Protip: these fun properties are exactly what make CRCs a TERRIBLE choice for cryptography!)
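For concreteness, here is a minimal sketch (mine, not from the thread) of the split-and-combine property. It uses a raw CRC-32 (reflected polynomial 0xEDB88320, init 0, no final XOR) because only the raw form is strictly linear; standard CRC-32 adds init/final XOR constants, which make it affine instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Raw CRC-32: reflected poly 0xEDB88320, init 0, no final XOR.
 * With init 0, leading zero bytes are a no-op, and the function is
 * linear over XOR: crc(a ^ b) == crc(a) ^ crc(b) for equal lengths. */
static uint32_t raw_crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return crc;
}
```

Since the whole message is the XOR of (first half || zeros) and (zeros || second half), and the leading zeros are a no-op, crc(whole) equals crc(first half zero-extended) XOR crc(second half). Appending the trailing zeros to the first half is exactly the precomputable linear operator the parent calls crc^2.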
seiji | 11 years ago
You could. You could also have a billion threads. But the article is talking about single-core instruction level parallelism where normal tricks like loop unrolling won't work.
seiji | 11 years ago
Improving the Redis list data structure: 1.) https://matt.sh/redis-quicklist 2.) https://matt.sh/redis-quicklist-adaptive 3.) https://matt.sh/redis-quicklist-visions
Adding geo commands to Redis as a loadable module: https://matt.sh/redis-geo
Prototype of native JSON support for Redis: https://matt.sh/redis-json
cperciva | 11 years ago
1. I don't know why Redis is using CRC-16 and CRC-64, but if you want CRC-32 then using the CRC32 instruction in SSE 4.2 yields pretty impressive performance.
2. CRCs are parallelizable, it just takes a bit more work (and math).
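As a sketch of point 1 (illustrative, not Redis code): CRC-32C computed with the SSE 4.2 CRC32 instruction, with a portable bit-at-a-time fallback. The GCC/Clang `target` attribute and the `crc32c_*` names are my assumptions here; a real implementation would also use wider `_mm_crc32_u64` loads:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h>

/* Hardware CRC-32C via the SSE 4.2 CRC32 instruction. The target
 * attribute lets this compile without -msse4.2; callers must verify
 * CPU support before calling. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len >= 4) {
        uint32_t v;
        memcpy(&v, buf, 4);
        crc = _mm_crc32_u32(crc, v);
        buf += 4; len -= 4;
    }
    while (len--)
        crc = _mm_crc32_u8(crc, *buf++);
    return ~crc;
}
#endif

/* Portable bit-at-a-time CRC-32C (Castagnoli, reflected poly 0x82F63B78). */
static uint32_t crc32c_sw(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* Dispatch: use the instruction when the CPU has it, else fall back. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(buf, len);
#endif
    return crc32c_sw(buf, len);
}
```

Note the instruction is hard-wired to the Castagnoli polynomial, so it computes CRC-32C specifically, not an arbitrary CRC-32.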
fitshipit | 11 years ago
SFjulie1 | 11 years ago
A dict/document/transaction may be a fractal-like structure, but the probability of alteration does not follow the same shape/locality.
After reading this http://www.slideshare.net/Dataversity/redis-in-action , I am surprised.
Why the heck did they decide to focus on CRC? Premature optimization?
First, CRC16 is in fact used as a "perfect hash function". It is confusing to focus on the implementation rather than the purpose: it is used to distribute data evenly into buckets. It is the 'perfect hash function' they should focus on (CRC being one special case among totally legitimate candidates for a perfect hash function, and if tomorrow a super hyper fast perfect hash function comes along ...).
CRC64 is used for a valid use case of CRC, BUT in an invalid domain.
Mathematically, I could prove they are wrong. I know where they have a probability problem, how they are sharpening the transition between functional and dysfunctional states, and how they are making it so irreversible that they will be FUBAR.
1) They are of course leaving themselves open to a preimage attack by a malevolent attacker with access to the storage they checksum (it is almost used as a digital signature, so it should be a cryptographic hash function). 2) The probability of random collisions is not balanced by the fact that a cause altering the signal normally also has some odds of altering the CRC, and that the signal normally also validates the CRC; that is why, in real life, the CRC should be a fixed fraction of the signal, sized according to the probability of both being altered together. So if you have a tolerance for faulty storage you will use it, and you might end up reconstructing data from randomly corrupted data with the same CRC. It will be a mess on the very exceptional but predictable day something goes wrong, that's all I am saying. Every safeguard will be green :)
Something seems fishy.
seiji | 11 years ago
2. Are they? Do explain.
Scaevolus | 11 years ago
If speed is important, xxHash is probably a better choice than CRC-64. http://fastcompression.blogspot.com/2012/04/selecting-checks...
If you want both speed and security, Siphash-2-4 is a cryptographically secure hash algorithm that runs at ~1.4cycles/byte for long inputs (faster than even crc16speed), and yet is still fast for short inputs (2.4cpb for 64 bytes). It also doesn't require 16KB of lookup tables. https://131002.net/siphash/ http://bench.cr.yp.to/results-auth.html#amd64-titan0
seiji | 11 years ago
So, given that constraint, any better solutions are welcome. :)
baruch | 11 years ago
The results are here: https://github.com/baruch/crcbench/blob/master/log.i3-2330M-...
b3tta | 11 years ago
MrBuddyCasino | 11 years ago
On Java 8, I get 647MB/s on a 2,3GHz MBP (the non-optimized version was 320 MB/s). Not bad, but Matt's C version manages 1628.09 MB/s.
Update: with some optimizations and eliminating I/O in the benchmark, we now clock in at a respectable 1150 MB/s (up from 390 MB/s unoptimized).
istvan__ | 11 years ago
URL related: http://java-is-the-new-c.blogspot.com/2014/12/a-persistent-k...
Isamu | 11 years ago
So if speed is the thing, why not a 16-bit table? Same idea, faster. 8*64k = 512 kbyte table.
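A sketch of that 16-bit-table idea for CRC-64, consuming two input bytes per lookup. The polynomial constant here assumes the reflected "Jones" polynomial (and init 0 for simplicity); the table-doubling trick itself is polynomial-agnostic, and the 2-byte load assumes a little-endian host:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POLY64 0x95AC9329AC4BC9B5ULL  /* assumed reflected Jones poly */

static uint64_t tab8[256];      /* classic byte-at-a-time table        */
static uint64_t tab16[65536];   /* 64K entries * 8 bytes = 512KB table */

static void init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint64_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (POLY64 & (uint64_t)-(int64_t)(c & 1));
        tab8[i] = c;
    }
    for (uint32_t i = 0; i < 65536; i++) {
        /* CRC of the two little-endian bytes of i, starting from state 0 */
        uint64_t c = tab8[i & 0xFF];
        c = tab8[(c ^ (i >> 8)) & 0xFF] ^ (c >> 8);
        tab16[i] = c;
    }
}

static uint64_t crc64_bytewise(uint64_t crc, const uint8_t *buf, size_t len)
{
    while (len--)
        crc = tab8[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return crc;
}

/* Same CRC, two bytes per step: by linearity the two byte-steps
 * collapse into one 16-bit lookup plus a 16-bit shift. */
static uint64_t crc64_16bit(uint64_t crc, const uint8_t *buf, size_t len)
{
    while (len >= 2) {
        uint16_t w;
        memcpy(&w, buf, 2);          /* assumes little-endian host */
        crc = tab16[(crc ^ w) & 0xFFFF] ^ (crc >> 16);
        buf += 2; len -= 2;
    }
    if (len)  /* odd trailing byte */
        crc = tab8[(crc ^ *buf) & 0xFF] ^ (crc >> 8);
    return crc;
}
```

The trade-off the thread discusses applies directly: halving the number of dependent steps only pays off if that 512KB table actually stays cached.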
seiji | 11 years ago
apparently unrolling
It's not just _unrolling_, but it's merging positional-dependent pre-computed values and xor'ing them together. A little more complicated than just a quick loop unrolling (hence, nobody discovered it until 2006).
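The 2006 technique described above is known as "slicing": per-position precomputed tables whose lookups are XORed together, breaking the byte-to-byte dependency chain. A minimal slicing-by-4 sketch for reflected CRC-32 (illustrative, not the article's code; the 4-byte load assumes a little-endian host):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* t[k][b] = CRC of byte b followed by k zero bytes (reflected poly
 * 0xEDB88320). Four input bytes get independent lookups, XORed together. */
static uint32_t t[4][256];

static void init_slicing_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(c & 1));
        t[0][i] = c;
    }
    for (int k = 1; k < 4; k++)
        for (uint32_t i = 0; i < 256; i++)  /* advance by one zero byte */
            t[k][i] = (t[k-1][i] >> 8) ^ t[0][t[k-1][i] & 0xFF];
}

/* reference byte-at-a-time version, for comparison */
static uint32_t crc32_ref(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = t[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}

static uint32_t crc32_slice4(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, buf, 4);          /* assumes little-endian host */
        w ^= crc;
        /* four independent lookups, no per-byte dependency chain */
        crc = t[3][w & 0xFF] ^ t[2][(w >> 8) & 0xFF]
            ^ t[1][(w >> 16) & 0xFF] ^ t[0][w >> 24];
        buf += 4; len -= 4;
    }
    while (len--)                    /* trailing bytes */
        crc = t[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```

The earliest byte in each word pairs with the most-advanced table (t[3]) because it is followed by three more bytes; that positional offsetting is the "merging positional-dependent pre-computed values" part.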
tlipcon | 11 years ago
seiji | 11 years ago
crcutil also has many fatal flaws for inclusion in other projects: the code is horrifically ugly, it's hosted on google code so it looks unmaintained and abandoned, it's C++, it uses unreadable ASM to be "fast" but that's not portable across all the architectures Redis runs on, and it has a horrible build system (because of ASM and C++ and other issues).
MichaelGG | 11 years ago
Docs say it hits over 13GB/sec on a Core i5 3340M @ 2.7GHz.
Or if this is an older design decision, why not use MurmurHash, which is also pretty fast and has been around for a while.
seiji | 11 years ago
The article didn't pretend to compare all possible options; it aimed to improve existing options while maintaining backwards compatibility.
Redis doesn't make design decisions based on what's fast, it prefers to use what's simple.
tedunangst | 11 years ago
Somebody needs to test their code on a CPU that requires alignment before loading 64bit values.
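One portable way to handle that in C is to load through `memcpy` instead of dereferencing a cast pointer (which is undefined behavior when misaligned and traps on strict-alignment CPUs). A small sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable unaligned 64-bit load: a fixed-size memcpy compiles to a
 * single load on architectures that permit unaligned access, and to
 * safe byte loads on those that require alignment. */
static inline uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;   /* byte order is the host's; swap if a fixed order is needed */
}
```

Compilers recognize this pattern, so there is typically no performance cost on x86 while remaining correct on strict-alignment targets.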