top | item 45133330

jared_hulbert | 5 months ago

Cool. Original author here. AMA.


whizzter|5 months ago

Like people mention, hugetlb etc. could be an improvement, but the core issue holding it down probably has to do with mmap, 4k pages, and paging behaviour: mmap will cause a fault for each "small" 4k page not in memory, causing a jump into the kernel and then whatever machinery fills in the page cache (and brings data up from disk, with the associated latency).

This is in contrast with the io_uring worker method, where you keep the thread busy by submitting requests and letting the kernel do the work without expensive crossings.

The 2GB fully in-mem run shows the CPU's real perf. The dip at 50GB is interesting; perhaps when going over 50% of memory the Linux kernel evicts pages or something similar that hurts perf. Maybe plot a graph of perf vs test size to see if there is an obvious cliff.

jared_hulbert|5 months ago

When I run the 50GB in-mem setup I still have 40GB+ of free memory, and I drop the page cache with "sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'" before I run, so there wouldn't really be anything to evict from the page cache, and swap isn't changing.

I think I'm crossing the NUMA boundary, which means some percentage of the accesses are higher latency.

pianom4n|5 months ago

The in-memory solution creates a 2nd copy of the data so 50GB doesn't fit in memory anymore. The kernel is forced to drop and then reload part of the cached file.

nchmy|5 months ago

I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...

And, even better, put all the lines on the same chart, or at least use the same y-axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rates of growth?

jared_hulbert|5 months ago

I tried the log scale before. It failed to express the exponential hockey-stick growth unless you really spend time with the charts and know what a log scale is. I'll work on incorporating log scale due to popular demand; it does show the progress has been nicely exponential over time.

When I put the lines on the same chart it made the y axis impossible to understand; the units are so different. Maybe I'll revisit that.

Yeah, around 2000-2010 the doubling is noticeable. Interestingly it's also when a lot of factors started to stagnate.

john-h-k|5 months ago

You mention modern server CPUs have capability to “read direct to L3, skipping memory”. Can you elaborate on this?

jared_hulbert|5 months ago

https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

AMD has something similar.

The PCIe bus and memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory, but you can direct it to cache instead and bypass sending the data out on the DRAM bus.

In theory... as for the specifics of what exactly is supported, I can't vouch for that.

Jap2-0|5 months ago

Would huge pages help with the mmap case?

jared_hulbert|5 months ago

Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no. Now with all the folio updates to the Linux kernel memory handling I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried to madvise() or whatever the buffer to use huge pages it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work and it's not clear how that impacts the page cache.

But the arm64 systems with 16K or 64K native pages would have fewer faults.

inetknght|5 months ago

> Would huge pages help with the mmap case?

Yes. Tens or hundreds of gigabytes' worth of 4K page table entries take a while for the OS to navigate.

comradesmith|5 months ago

Thanks for the article. What about using file reads from a mounted ramdisk?

jared_hulbert|5 months ago

Hmm. tmpfs was slower, and hugetlbfs wasn't working for me.