buybackoff | 4 months ago

It looks suspicious at 25x. Even 2.5x would be suspicious unless reading very small records.

I assume in both cases the file, at a tiny 100MB, is already fully cached in RAM. But the file-read-based version actually copies the data into a given buffer, which involves cache misses to get the data from RAM into L1 for the copy. The mmap version just returns a slice that is discarded immediately; the actual data is never touched. Each record spans 2 cache lines, and with random indices it is not prefetched. For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.

The benchmark compares actually getting the data vs. getting the data's location. Single-digit nanos is the scale of good hash-table lookups with data already in CPU caches, not actual IO. For fairness, both sides should use/touch the data, e.g. copy it.
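A minimal sketch of the fix in C (the helper name `touch_record` is mine, not from the repo): fold each record's bytes into a checksum, so both the read() arm and the mmap arm actually have to move the data from RAM into L1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fairness fix: sum the record's bytes so the benchmark
 * must pull the data from RAM into L1 in both arms. 100-byte records
 * (two cache lines), as in the discussion above. */
static uint64_t touch_record(const uint8_t *rec, size_t len) {
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += rec[i];   /* forces the cache-line fills */
    return sum;          /* returning the sum keeps the loop from being optimized away */
}
```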

a-dub | 4 months ago

doing these sorts of benchmarks is actually quite tricky. you must clear the page cache by allocating >1x physical ram before each attempt.

moreover, mmap by default will load lazily, whereas mmap with MAP_POPULATE will prefetch. in the former case, reporting average operation times is not valid because the access time distributions are not gaussian (they have a one-time big hit at first touch). with MAP_POPULATE (linux only), there is a long loading delay when mmap is first called, but then the average access times will be very low. when pages are released is determined by the operating system's page cache eviction policy.

the data structure on top is best chosen based on desired runtime characteristics. if it's all going in ram, go ahead and use a standard randomized hash table. if it's too big to fit in ram, designing a structure that is aware of lru style page eviction semantics may make sense (ie, a hash table or other layout that preserves locality for things that are expected to be accessed in a temporally local fashion.)

codedokode | 4 months ago

> you must clear the page cache

In Linux there is a /proc/sys/vm/drop_caches pseudo file that does this. Look how great Linux is compared to other OSes.

kragen | 4 months ago

> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.

I think this is the wrong order of magnitude. One core of my Ryzen 5 3500U seems to be able to run memcpy() at 10 gigabytes per second (0.1 nanoseconds per byte) and memset() at 31 gigabytes per second (0.03 nanoseconds per byte). I'd expect a sequential read of 100 bytes to take about 3 nanoseconds, not 100 nanoseconds.

However, I think random accesses do take close to 100 nanoseconds to transmit the starting row and column address and open the row. I haven't measured this on this hardware because I don't have a test I'm confident in.

bcrl | 4 months ago

100 nanoseconds from RAM is correct. Latency != bandwidth. 3 nanoseconds would be from cache or so on a Ryzen. You ain't gonna get the benefits of prefetching on the first 100 bytes.

Tuna-Fish | 4 months ago

> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.

It's important to note that throughput is not just the inverse of latency, because modern OoO CPUs with modern memory subsystems can have hundreds of requests in flight. If your code doesn't serialize accesses, latency numbers are irrelevant to throughput.

Scaevolus | 4 months ago

Yeah, 3.3ns is about 12 CPU cycles. You can indeed create a pointer to a memory location that fast!