top | item 42563357


sujayakar | 1 year ago

this is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.

it's interesting that the wins for batching start diminishing at 8. I'm curious then how the subsequent optimizations fare with batch size 8 (rather than 128).
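for anyone unfamiliar with the batching being discussed: the idea is to run several independent searches in lockstep so the memory latency of one lookup overlaps with the others. a minimal sketch of the control structure (plain lower-bound binary search; in compiled code each round would also issue prefetches, which Python can't express — this only shows the interleaving):

```python
def batched_binary_search(data, queries):
    """Run a batch of binary searches in lockstep, one level per round.

    In compiled code, each iteration of the outer loop would issue
    independent loads/prefetches for every query in the batch, hiding
    DRAM latency behind the other queries' work.
    """
    n = len(data)
    lo = [0] * len(queries)
    hi = [n] * len(queries)
    # All searches in the batch descend one level per outer iteration.
    while any(l < h for l, h in zip(lo, hi)):
        for i, q in enumerate(queries):
            if lo[i] < hi[i]:
                mid = (lo[i] + hi[i]) // 2
                if data[mid] < q:
                    lo[i] = mid + 1
                else:
                    hi[i] = mid
    return lo  # per query: index of the first element >= query

# e.g. batched_binary_search([1, 3, 5, 7], [4, 7, 0]) -> [2, 3, 0]
```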

smaller batch sizes are nice since they determine how much request throughput we'd need to saturate this system. at batch size 8, we'd need 1s / ~30ns * 8 ≈ 266M searches per second to fully utilize this algorithm.
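the arithmetic in the parent, written out (taking the ~30 ns figure as given, and per batch, which the reply below disputes):

```python
# Back-of-envelope: searches/sec needed to keep the pipeline full,
# under the assumption of ~30 ns per *batch* of 8 queries.
batch_latency_s = 30e-9   # ~30 ns, figure quoted above
batch_size = 8

searches_per_sec = batch_size / batch_latency_s
print(f"~{searches_per_sec / 1e6:.0f}M searches/sec")  # ~267M
```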

the multithreading results are also interesting -- going from 1 to 6 threads only improves per-query overhead by 4x. curious how this fares on a machine with a much higher core count.

curiouscoding | 1 year ago

Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :)

I suspect that at higher core counts we can already saturate the full RAM bandwidth with only 4-5 cores, so the marginal gains from additional cores will be very small. That's good though, because it frees up CPU time for the bigger problem: determining the right queries and dealing with the outputs (as long as that part isn't too memory-bound itself, although it probably is).
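a rough ceiling estimate for the memory-bound regime. all numbers here are illustrative assumptions, not measurements from the post: ~50 GB/s of DRAM bandwidth and ~8 cache lines touched per query (roughly one 64 B line per level of a 16-ary tree over 2^30 keys):

```python
# Rough ceiling on query throughput once the workload is memory-bound.
dram_bandwidth = 50e9     # bytes/sec, assumed machine bandwidth
cache_line = 64           # bytes per line
lines_per_query = 8       # ~log_16(2^30) levels, one line each (assumed)

max_queries_per_sec = dram_bandwidth / (lines_per_query * cache_line)
print(f"~{max_queries_per_sec / 1e6:.0f}M queries/sec ceiling")
```

once a handful of cores hit a ceiling like this, adding more cores buys nothing, which is the saturation effect described above.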

Bulat_Ziganshin | 1 year ago

with m/t, the algorithm is memory-bound, so the performance should be determined strictly by the memory throughput