(no title)
jrockway | 1 month ago
Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/
It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.
EdSchouten|1 month ago
Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.
For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.
oconnor663|1 month ago
grumbelbart2|1 month ago
oconnor663|1 month ago