Imo the most "unhinged" CPUs for AVX-512 are the early batches of Alder Lake, the only CPU family with nearly full coverage of all existing AVX-512 subsets.
It's a shame that Intel really seemed to not want people to use it: they started disabling it in later microcode updates, and fused it off entirely in later parts.
I think I have two of these sitting in a box, one prototype with avx512 and one retail without. Is it worth me breaking these out for ML experiments and such?
And, as noted in the article, that's an instruction which only works on two desktop CPU architectures (Tiger Lake and Zen 5), and on one of them (Tiger Lake) it's arguably slower than not using it.
Meaning... this entire effort was for something that's faster on only a single kind of CPU (Zen 5).
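For readers who haven't met it, a minimal sketch of what the instruction in question does (presumably AVX512-VP2INTERSECT, given the Tiger Lake and Zen 5 support matrix described above). This is an illustration, not the article's code:

    #include <immintrin.h>
    #include <stdint.h>

    /* One step of a vectorized posting-list intersection: compare 16 ids
     * from each list in a single instruction. Requires AVX512-VP2INTERSECT;
     * compile with -mavx512vp2intersect. */
    __mmask16 matches_in_a(const int32_t *a, const int32_t *b)
    {
        __m512i va = _mm512_loadu_si512((const void *)a);
        __m512i vb = _mm512_loadu_si512((const void *)b);
        __mmask16 ka, kb;
        _mm512_2intersect_epi32(va, vb, &ka, &kb);
        return ka; /* bit i set: a[i] also occurs somewhere in b[0..15] */
    }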
This article is honestly one of the best I've read in a long time. It's esoteric and the result is 99.5% pointless objectively, but in reality it's incredibly useful and a wonderful guide to low-level x86 optimization end to end. The sections on cache alignment and uiCA + analysis notes are a perfect illustration of "how it's done."
What a post. It must have taken a week just to write, never mind the time it took to actually come up with all this stuff and overcome all the obstacles mentioned. What dedication to improving the performance of phrase search.
Fascinating blog post. Having said that, it may seem like nitpicking, but I have to take issue with the point about recursion, which is often far too easily blamed for inefficiency.
The blog post mentions it as one of the reasons for the inefficiency of the conventional algorithm.
A glance at the algorithm shows that the recursion in question is a tail call. This means that any overhead can be readily eliminated using a technique known for nearly fifty years already.
Steele, Guy Lewis (1977). "Debunking the 'Expensive Procedure Call' Myth, or, Procedure Call Implementations Considered Harmful, or, LAMBDA: The Ultimate GOTO". Proceedings of the 1977 Annual Conference (ACM '77).
Isn't it true that if an algorithm can be tail-call optimized, it can be rewritten to not use recursion at all? (And, by contraposition, if something can't be rewritten without recursion, it can't be tail-call optimized?)
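Yes: for a tail call, nothing remains to be done after the call returns, so the rewrite is mechanical. A hypothetical example (not from the article), showing the loop a tail-call-eliminating compiler effectively produces:

    #include <stddef.h>

    /* Tail-recursive form: the recursive call is the last action, so no
     * stack frame needs to survive it. */
    long sum_rec(const int *xs, size_t n, long acc)
    {
        if (n == 0)
            return acc;
        return sum_rec(xs + 1, n - 1, acc + xs[0]); /* tail call */
    }

    /* The mechanical rewrite tail-call elimination performs: rebind the
     * parameters and jump back to the top, i.e. a plain loop. */
    long sum_loop(const int *xs, size_t n, long acc)
    {
        for (;;) {
            if (n == 0)
                return acc;
            acc += xs[0];
            xs += 1;
            n -= 1;
        }
    }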
> Why are you merging up to one rare token at the beginning or at the end? Let’s consider that someone searched for C_0 R_1 C_2 C_3. If we don’t do this merge, we would end up searching for C_0, R_1, C_2 C_3, and this is bad. As established, intersecting common tokens is a problem, so it’s way better to search C_0 R_1, C_2 C_3. I learned this the hard way…
But since R_1 C_2 C_3 is in the index as well, instead of searching for C_0 R_1, C_2 C_3 with a distance of 2, you can instead search for C_0 R_1, R_1 C_2 C_3 with a distance of 1 (overlapping), which hopefully means that the lists to intersect are smaller.
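A sketch of the positional intersection this implies, with the distance as a parameter (sorted position lists assumed; illustrative, not the article's code). The overlapping plan swaps an intersection at distance 2 for one at distance 1 over hopefully shorter lists:

    #include <stddef.h>
    #include <stdint.h>

    /* Keep positions p in `a` such that p + d occurs in `b`; both lists
     * are sorted. The classic two-pointer merge, offset by d. */
    size_t intersect_at(const uint32_t *a, size_t na,
                        const uint32_t *b, size_t nb,
                        uint32_t d, uint32_t *out)
    {
        size_t i = 0, j = 0, k = 0;
        while (i < na && j < nb) {
            if (a[i] + d < b[j])
                i++;
            else if (a[i] + d > b[j])
                j++;
            else {
                out[k++] = a[i];
                i++;
                j++;
            }
        }
        return k;
    }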
From my 1980s 8-bit CPU perspective, the instruction is unhinged based solely on the number of letters. Compared to LDA, STA, RTS, that's not an assembler mnemonic, it's a novel. :-)
"This is an instruction that doesn't exist in any computer right now, so why should I put it in a machine, if it's supposed to be realistic? Well, it's because it's ahead of time."
Sometimes I read through the intrinsics guide just to play the game of spotting instructions defined primarily because certain cryptologic agencies asked for them.
I'm trying to follow the example and getting lost.
> Imagine the scenario where "mary had" occurs in the following positions: [1, 6] and "a" appears in the position [2], so "mary had a" occurs in the positions [2]
Okay so basically this means "mary had a" ends at token position 2 (assuming "mary" is token position 0), and you're trying to create an efficient algorithm to do this backwards linking process.
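A minimal sketch of that linking step under the reading above (end positions, sorted lists; the names are mine, not the article's):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Is x present in the sorted list xs? */
    static bool contains(const uint32_t *xs, size_t n, uint32_t x)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (xs[mid] < x)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo < n && xs[lo] == x;
    }

    /* One backwards-linking step: keep positions p of the next token such
     * that p - 1 is an end position of the phrase so far. With
     * ends("mary had") = {1, 6} and pos("a") = {2}, the result is {2}. */
    size_t extend_phrase(const uint32_t *ends, size_t ne,
                         const uint32_t *next, size_t nn, uint32_t *out)
    {
        size_t k = 0;
        for (size_t j = 0; j < nn; j++)
            if (next[j] > 0 && contains(ends, ne, next[j] - 1))
                out[k++] = next[j];
        return k;
    }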
It's not entirely clear from the article what's pre-computed ahead of time (when documents are added / indexed by the system) and what's done on-the-fly (in response to a user's specific search query).
Based on skimming the article it appears that you're doing this backwards linking process on the fly for a specific phrase the user enters (but I could be wrong about that).
> allows us to decompose this value
What value is being decomposed? It is not clear what "this value" refers to.
> one representing the group and the other the value
Ctrl+F is telling me that's the first occurrence of the word "group" in this post. What is the "group" being represented?
> pos = group * 16 + value
Okay so you're bit-packing two fields into pos. One of the things being packed is called "group," and it seems to be 16 bits. The other thing being packed is "value," and it seems to be 4 bits. So in total "pos" has 20 bits.
> the maximum document length is 1048576 tokens
It seems that "pos" simultaneously corresponds to a token position and our two-field bit-packed thing above. I can't figure out how these two things are in one-to-one correspondence.
I stopped reading there; given my confusion so far, it seems unlikely I'll be able to understand much of what follows.
PS: I skimmed the article on roaring bitmaps. Seems they're basically bitmaps where each 64K-value chunk (a 65536-bit bitmap when stored dense) has a storage format (dense, sparse, or RLE), plus algorithms for quickly doing intersection, union, etc., with the optimized case chosen roughly by the number of 1's. (Intersecting two dense bitmaps with, say, ~50% randomly distributed 1's, you can't do better than ANDing the bit vectors together. But if you're, say, intersecting a small sparse bitmap with ~5 1's against a big dense bitmap with ~4k 1's, you can iterate over the 1's in the sparse bitmap, checking whether each one is set in the big bitmap.)
So in my mind I'm basically just blackboxing roaring bitmaps as "java.util.BitSet or vector<bool> with some extra optimizations that kick in if your data has sections where most of the bits are the same".
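A sketch of the asymmetric case from the PS above (illustrative, not the actual Roaring implementation): probe each value of a sparse container directly against a dense 65536-bit container:

    #include <stddef.h>
    #include <stdint.h>

    /* Intersect a sparse container (sorted 16-bit values) with a dense one
     * (65536-bit bitmap = 1024 x u64) by testing each sparse value. */
    size_t intersect_sparse_dense(const uint16_t *sparse, size_t n,
                                  const uint64_t dense[1024], uint16_t *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            uint16_t v = sparse[i];
            if (dense[v >> 6] & ((uint64_t)1 << (v & 63)))
                out[k++] = v;
        }
        return k;
    }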
nine_k|1 year ago
Also, there's the fact that the instruction is deprecated by Intel.
slashdev|1 year ago
Regardless of how valid the excuse is, tail-call elimination is very poorly supported for such an obvious and old optimization.
camel-cdr|1 year ago
https://www.youtube.com/watch?v=r_pPF5npnio&t=3300 (55:00)
"This is an instruction that doesn't exist in any computer right now, so why should I put it in a machine, if it's supposed to be realistic? Well, it's because it's ahead of time."
rstuart4133|1 year ago
> The inverted index will look something like this:
He isn't wrong. It is indeed "something like".