top | item 46780057

(no title)

jrockway | 1 month ago

sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.

Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/

It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.

discuss

EdSchouten|1 month ago

It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reasons being that most modern ARM64 CPUs have native SHA-256 instructions, and lack an equivalent of AVX-512.

Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.

For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.

oconnor663|1 month ago

As far as I know, most CDC schemes requires a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.

grumbelbart2|1 month ago

Is that even when using the SHA256 hardware extensions? https://en.wikipedia.org/wiki/SHA_instruction_set

oconnor663|1 month ago

It's mixed. You get something in the neighborhood of a 3-4x speedup with SHA-NI, but the algorithm is fundamentally serial. Fully parallel algorithms like BLAKE3 and K12, which can use wide vector extensions like AVX-512, can be substantially faster (10x+) even on one core. And multithreading compounds with that, if you have enough input to keep a lot of cores occupied. On the other hand, if you're limited to one thread and older/smaller vector extensions (SSE, NEON), hardware-accelerated SHA-256 can win. It can also win in the short input regime where parallelism isn't possible (< 4 KiB for BLAKE3).