top | item 46704320

(no title)

zigzag312 | 1 month ago

> the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first, the GPU isn't going to let you create a texture directly from JPEG data.

It seems that JPEG can be decoded on the GPU [1] [2]

> CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation.

According to smhasher tests [3] CRC32 is not limited by memory bandwidth. Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128 bit wide results), we still don't get close to memory bandwidth.

The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.

> to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash

Why would you need to use a cryptographic hash function to check integrity of archived files? Quality a non-cryptographic hash function will detect corruptions due to things like bit-rot, bad RAM, etc. just the same.

And why is 256 bits needed here? Kopia developers, for example, think 128 bit hashes are big enough for backup archives [4].

[1] https://docs.nvidia.com/cuda/nvjpeg/index.html

[2] https://github.com/CESNET/GPUJPEG

[3] https://github.com/rurban/smhasher

[4] https://github.com/kopia/kopia/issues/692

discuss

myrmidon|1 month ago

Maybe the CRC32 implementations in the smasher suite just aren't that fast?

[1] claims 15 GB/s for the slowest implementation (Chromium) they compared (all vectorized).

> The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.

Why? What kind of error rate do you expect, and what kind of reliability do you want to achieve? Assumptions that would lead to a >32bit checksum requirement seem outlandish to me.

[1] https://github.com/corsix/fast-crc32?tab=readme-ov-file#x86_...

zigzag312|1 month ago

From SMHasher test results quality of xxhash seems higher. It has less bias / higher uniformity that CRC.

What bothers me with probability calculations, is that they always assume perfect uniformity. I've never seen any estimates how bias affects collision probability and how to modify the probability formula to account for non-perfect uniformity of a hash function.

fc417fc802|1 month ago

> The 32 bit hash of CRC32 is too low for file checksums.

What makes you say this? I agree that there are better algorithms than CRC32 for this usecase, but if I was implementing something I'd most likely still truncate the hash to somewhere in the same ballpark (likely either 32, 48, or 64 bits).

Note that the purpose of the hash is important. These aren't being used for deduplication where you need a guaranteed unique value between all independently queried pieces of data globally but rather just to detect file corruption. At 32 bits you have only a 1 out of 2^(32-1) chance of a false negative. That should be more than enough. By the time you make it to 64 bits, if you encounter a corrupted file once _every nanosecond_ for the next 500 years or so you would expect to miss only a single event. That is a rather absurd level of reliability in my view.

zigzag312|1 month ago

I've seen few arguments that with the amount of data we have today the 2^(32-1) chance can happen, but I can't vouch their calculations were done correctly.

Readme in SMHasher test suite also seems to indicate that 32 bits might be too few for file checksums:

"Hash functions for symbol tables or hash tables typically use 32 bit hashes, for databases, file systems and file checksums typically 64 or 128bit, for crypto now starting with 256 bit."

jmillikin|1 month ago

  > It seems that JPEG can be decoded on the GPU [1] [2]

Sure, but you wouldn't want to. Many algorithms can be executed on a GPU via CUDA/ROCm, but the use cases for on-GPU JPEG/PNG decoding (mostly AI model training? maybe some sort of giant megapixel texture?) are unrelated to anything you'd use CBZ for.

For a comic book the performance-sensitive part is loading the current and adjoining pages, which can be done fast enough to appear instant on the CPU. If the program does bulk loading then it's for thumbnail generation which would also be on the CPU.

Loading compressed comic pages directly to the GPU would be if you needed to ... I dunno, have some sort of VR library browser? It's difficult to think of a use case.

  > According to smhasher tests [3] CRC32 is not limited by memory bandwidth.
  > Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128
  > bit wide results), we still don't get close to memory bandwidth.

Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s) which indicates it's either very old or isn't measuring pure CRC32 throughput (I see stuff about the C++ STL in the logs).

Look at https://github.com/corsix/fast-crc32 for example, which measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems. Obviously if you solder a Raspberry Pi to some GDDR then the ratio differs.

  > The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely
  > an improvement over CRC32.

You don't want to use xxhash (or crc32, or cityhash, ...) for checksums of archived files, that's not what they're designed for. Use them as the key function for hash tables. That's why their output is 32- or 64-bits, they're designed to fit into a machine integer.

File checksums don't have the same size limit so it's fine to use 256- or 512-bit checksum algorithms, which means you're not limited to xxhash.

  > Why would you need to use a cryptographic hash function to check integrity
  > of archived files? Quality a non-cryptographic hash function will detect
  > corruptions due to things like bit-rot, bad RAM, etc. just the same.

I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.

  > And why is 256 bits needed here? Kopia developers, for example, think 128
  > bit hashes are big enough for backup archives [4].

The checksum algorithm doesn't need to be cryptographically strong, but if you're using software written in the past decade then SHA256 is supported everywhere by everything so might as well use it by default unless there's a compelling reason not to.

For archival you only need to compute the checksums on file transfer and/or periodic archive scrubbing, so the overhead of SHA256 vs SHA1/MD5 doesn't really matter.

I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.

zigzag312|1 month ago

> which can be done fast enough to appear instant on the CPU

Big scanned PDFs can be problfrom more efficient processing (if it had HW support for such technique)

> Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s) which indicates it's either very old or isn't measuring pure CRC32 throughput

It may not be fastest implementation of CRC32, but it's also done on old Ryzen 5 3350G 3.6GHz. Below the table are results done on different HW. On Intel i7-6820HQ CRC32 achieves 27.6 GB/s.

> measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems.

That looks incredibly suspicious since Apple M1 has maximum memory bandwidth of 68.25 GB/s [1].

> I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.

Your argument is meaningless without more details. xxhash supports 128 bits, which I doubt wouldn't be able to catch an error in you case.

SHA256 is an order of magnitude or more slower than non-cryptographic hashes. In my experience archival process usually has big enough effect on performance to care about it.

I'm beginning to suspect your primary reason for disliking xxhash is because it's not de facto standard like CRC or SHA. I agree that this is a big one, but you constantly imply like there's more to why xxhash is bad. Maybe my knowledge is lacking, care to explain? Why wouldn't 128 bit xxhash be more than enough for checksums of files. AFAIK the only thing it doesn't do is protect you against tampering.

> I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.

Kopia uses hashes for block level deduplication. What would be an issue, if they used 128 bit xxhash instead of 128 bit cryptographic hash like they do now (if we assume we don't need to protection from tampering)?

[1] https://en.wikipedia.org/wiki/Apple_M1