
Prefix sum: 20 GB/s (2.6x baseline)

89 points | ashtonsix | 4 months ago | github.com

32 comments

[+] jasonthorsness|4 months ago|reply
"We achieve 19.8 GB/s prefix sum throughput—1.8x faster than a naive implementation and 2.6x faster than FastPFoR"

"FastPFoR is well-established in both industry and academia. However, on our target platform (Graviton4, SIMDe-compiled) it benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at ~10.8 GB/s."

I thought the first bit was a typo, but it was correct; the naive approach was faster than a "better" method. Another demonstration of how important it is to actually benchmark on the target platform!
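
For reference, the "naive implementation" is presumably something like this serial loop (a sketch; the article's actual baseline may differ):

  #include <stddef.h>
  #include <stdint.h>

  /* Naive scalar prefix sum: each iteration depends on the previous
     one, so throughput is bound by the add latency chain. */
  static void prefix_sum_naive(const uint32_t *in, uint32_t *out, size_t n)
  {
      uint32_t acc = 0;
      for (size_t i = 0; i < n; i++) {
          acc += in[i];
          out[i] = acc;
      }
  }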

[+] Galanwe|4 months ago|reply
While the results look impressive, I can't help but think "yeah but had you stored an absolute value every X deltas instead of just a stream of deltas, you would have had a perfectly scalable parallel decoding"
[+] ashtonsix|4 months ago|reply
I just did a mini-ablation study for this (prefix sum). By getting rid of the cross-block carry (16 values), you can increase perf from 19.85 to 23.45 GB/s: the gain is modest as most performance is lost on accumulator carry within the block.

An absolute value every 16 deltas would undermine compression: a greater interval would lose even the modest performance gain, while a smaller interval would completely lose the compressibility benefits of delta coding.

That said, there is definitely plausible motivation for absolute values every X deltas, though it's a different matter: query/update locality (at the mini-partition level). You wouldn't want to transcode a huge number of values just to access or modify a small subset.
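
For illustration, a hypothetical layout with an absolute value every X deltas lets each block decode independently (the names and layout here are my assumption, not the repo's actual format):

  #include <stddef.h>
  #include <stdint.h>

  #define X 16  /* interval between absolute values, as discussed above */

  /* Decode one self-contained block: entry 0 is an absolute value that
     resets the carry; the remaining X-1 entries are deltas. */
  static void decode_block(const uint32_t *block, uint32_t *out)
  {
      uint32_t acc = block[0];
      out[0] = acc;
      for (size_t i = 1; i < X; i++) {
          acc += block[i];
          out[i] = acc;
      }
  }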

[+] shoo|4 months ago|reply
fgiesen has a 3-part series of blog posts from 2018, "reading bits in far too many ways", focusing on the problem of reading data encoded with a variable-bit-length code.

Part 3 explores a straightforward idea that can be used to improve performance of bitstream decoding - decode multiple streams at once.

> if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once?

> So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine.

> If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams

For delta coding, this suggests that instead of resetting the delta stream with an absolute value every X deltas for a small value of X, we could use a much larger X: partition the input into coarse chunks, delta-encode each, and then rewrite our decoder to decode a pair (or more) of the resulting streams at once.

Unclear if that would help at all here, or if there's already enough independent work for the CPU to keep busy with.
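
A minimal sketch of that idea (hypothetical names; assumes two equal-length chunks, each stored with its own base value):

  #include <stddef.h>
  #include <stdint.h>

  /* Decode two independently delta-coded chunks in one loop. Each chunk
     carries its own base, so the two running sums form independent
     dependency chains the CPU can overlap. */
  static void decode_two_chunks(const int32_t *da, int32_t base_a,
                                const int32_t *db, int32_t base_b,
                                int32_t *out_a, int32_t *out_b, size_t n)
  {
      int32_t a = base_a, b = base_b;
      for (size_t i = 0; i < n; i++) {
          a += da[i]; out_a[i] = a;  /* chain 1 */
          b += db[i]; out_b[i] = b;  /* chain 2, independent of chain 1 */
      }
  }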

Part 1 https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...

Part 2 https://fgiesen.wordpress.com/2018/02/20/reading-bits-in-far...

Part 3 https://fgiesen.wordpress.com/2018/09/27/reading-bits-in-far...

[+] jdonaldson|4 months ago|reply
While that sounds like a dealbreaker, I can't help but think "yeah but if a decoding method took advantage of the prefix in a similarly scalable way, one would reap the same benefits".
[+] yogishbaliga|4 months ago|reply
Way back in time, I used delta encoding for storing posting lists (the inverted index for a search index). I experimented with using GPUs for decoding the posting lists. It turned out that, as another reply mentioned, copying a posting list from CPU memory to GPU memory was taking way too long. If the posting list is static, it can be copied to GPU memory once, which makes the decoding faster. But there is still the bottleneck of copying the result back into CPU memory.

Nvidia's unified memory architecture may make this better, as the same memory can be shared between the CPU and GPU.

[+] Certhas|4 months ago|reply
AMD has had unified memory for ages in HPC and for a while now in the Strix Halo systems. I haven't had the chance to play with one yet, but I have high hopes for some of our complex simulation workloads.
[+] ashtonsix|4 months ago|reply
Oh neat. I have some related unpublished SOTA results I want to release soon: PEF/BIC-like compression ratios, with faster boolean algebra than Roaring Bitmaps.
[+] hughw|4 months ago|reply
The shared memory architecture doesn't eliminate copying the data across to the device. Edit: or back.
[+] nullbyte|4 months ago|reply
This code looks like an alien language to me. Or maybe I'm just rusty at C.
[+] mananaysiempre|4 months ago|reply
SIMD intrinsics are less C and more assembly with overlong mnemonics and a register allocator, so even reading them is something of a separate skill. Unlike the skill of achieving meaningful speedups by writing them (i.e. low-level optimization), it’s nothing special, but expect to spend a lot of time jumping between the code and the reference manuals[1,2] at first.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[2] https://developer.arm.com/architectures/instruction-sets/int...

[+] ashtonsix|4 months ago|reply
The weirdness probably comes from heavy use of "SIMD intrinsics" (Googleable term). These are functions with a 1:1 correspondence to assembly instructions, used for processing multiple values per instruction.
[+] ack_complete|4 months ago|reply
This is partially due to the compromises of mapping vector intrinsics into C (with C++ only being marginally better). In a more vector-oriented language, such as a shader language, this:

  s1 = vaddq_u32(s1, vextq_u32(z, s1, 2));
  s1 = vaddq_u32(s1, vdupq_laneq_u32(s0, 3));
would be more like this:

  s1.zw += s1.xy;
  s1 += s0.w;
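
For context, the same shift-and-add pattern gives a full in-register prefix sum; a minimal NEON sketch (my own illustration, not the repo's actual kernel):

  #include <arm_neon.h>

  /* Inclusive prefix sum of one uint32x4_t via shift-and-add.
     vextq_u32(z, v, n) shifts v's lanes up by (4 - n), filling the
     vacated low lanes with zeros from z. */
  static inline uint32x4_t prefix_sum_u32x4(uint32x4_t v)
  {
      uint32x4_t z = vdupq_n_u32(0);
      v = vaddq_u32(v, vextq_u32(z, v, 3));  /* lane i += lane i-1 */
      v = vaddq_u32(v, vextq_u32(z, v, 2));  /* lane i += lane i-2 */
      return v;
  }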
[+] coolThingsFirst|4 months ago|reply
I had the opportunity (misfortune) to click on that resume and it could use some aesthetic improvements. :)

Fun project btw, HPC is always interesting.

[+] tonetegeatinst|4 months ago|reply
Wonder if PTX programming for a GPU would accelerate this.
[+] ashtonsix|4 months ago|reply
If the data is already in GPU memory, yes. Otherwise you'll be limited by the DRAM<->VRAM memory bottleneck.

When we consider that delta coding (and family) is typically applied as one step in a series of CPU-first transforms, and benefits from L1-L3 caching, we find CPU throughput pulls far ahead of GPU-based approaches for typical workloads.

This note holds for all GPU-based approaches, not just PTX.

[+] bassp|4 months ago|reply
Yes! There’s a canonical algorithm called the “Blelloch scan” for prefix sum (aka prefix scan, because you can generalize “sum” to “any binary associative function”) that’s very GPU-friendly. I have… fond is the wrong word, but “strong” memories of implementing it in a parallel programming class :)

Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
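
A serial C sketch of the up-sweep/down-sweep structure (on a GPU, each inner loop's iterations run as parallel threads; assumes n is a power of two):

  #include <stddef.h>
  #include <stdint.h>

  /* Blelloch (work-efficient) exclusive scan, in place. */
  static void blelloch_scan(uint32_t *a, size_t n)
  {
      /* up-sweep (reduce): build partial sums in a tree */
      for (size_t d = 1; d < n; d <<= 1)
          for (size_t i = 2 * d - 1; i < n; i += 2 * d)
              a[i] += a[i - d];

      a[n - 1] = 0;  /* clear the root for an exclusive scan */

      /* down-sweep: propagate the partial sums back down the tree */
      for (size_t d = n >> 1; d >= 1; d >>= 1)
          for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
              uint32_t t = a[i - d];
              a[i - d] = a[i];
              a[i] += t;
          }
  }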

[+] TinkersW|4 months ago|reply
Your average non-shared-memory GPU communicates with the CPU over PCIe, which is dogshit slow, like 100x slower than DRAM.

I can upload an average of about 3.7 MB per millisecond to my GPU (PCIe gen 3, x8), but it can be spiky and sometimes takes longer than you might expect.

By comparison, a byte-based AVX2 prefix scan can pretty much run at the speed of DRAM, so there is never any reason to transfer to the GPU.
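
For example, a byte-wise in-register prefix sum with AVX2 might look like this sketch (assuming wrapping 8-bit sums are acceptable; a real kernel would widen or carry between registers):

  #include <immintrin.h>

  /* Inclusive prefix sum of 32 bytes. _mm256_slli_si256 shifts within
     each 128-bit lane, so a cross-lane carry fix-up is needed at the end. */
  static inline __m256i prefix_sum_u8x32(__m256i v)
  {
      v = _mm256_add_epi8(v, _mm256_slli_si256(v, 1));
      v = _mm256_add_epi8(v, _mm256_slli_si256(v, 2));
      v = _mm256_add_epi8(v, _mm256_slli_si256(v, 4));
      v = _mm256_add_epi8(v, _mm256_slli_si256(v, 8));
      /* broadcast the low lane's total (byte 15), add it to the high lane */
      __m128i lo = _mm256_castsi256_si128(v);
      __m128i carry = _mm_shuffle_epi8(lo, _mm_set1_epi8(15));
      v = _mm256_add_epi8(v,
          _mm256_inserti128_si256(_mm256_setzero_si256(), carry, 1));
      return v;
  }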