whizzter|5 months ago
As people mention, hugetlb etc. could be an improvement, but the core issue holding it down probably has to do with mmap, 4K pages, and paging behaviour: mmap causes a fault for each "small" 4K page not in memory, triggering a jump into the kernel and then whatever machinery fills in the page cache (and brings the data up from disk, with the associated latency).
Contrast this with the io_uring worker method, where you keep the thread busy by submitting requests and let the kernel do the work without expensive crossings.
The 2G fully in-memory run shows the CPU's real performance. The dip at 50GB is interesting; perhaps when going over 50% of memory the Linux kernel evicts pages, or something similar is hurting performance. Maybe plot a graph of performance vs. test size to see if there is an obvious cliff.
jared_hulbert|5 months ago
When I run the 50GB in-memory setup I still have 40GB+ of free memory, and I drop the page cache before each run with "sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'", so there wouldn't really be anything to evict from the page cache, and swap isn't changing.
I think I'm crossing the NUMA boundary, which means some percentage of the accesses have higher latency.
pianom4n|5 months ago
The in-memory solution creates a second copy of the data, so 50GB no longer fits in memory. The kernel is forced to drop and then reload part of the cached file.
nchmy|5 months ago
I just saw this post, so I'm starting with Part 1. Could you replace the charts with ones on some sort of log scale? They make it look like nothing happened until 2010, but I'd wager that's just an optical illusion...
And, even better, put all the lines on the same chart, or at least use the same y-axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth.
jared_hulbert|5 months ago
I tried log scale before. The charts failed to express the exponential hockey-stick growth unless you really spend time with them and know what a log scale is. I'll work on incorporating log scale due to popular demand; it does show that the progress has been nicely exponential over time.
When I put the lines on the same chart, the y axis became impossible to understand; the units are too different. Maybe I'll revisit that.
Yeah, around 2000-2010 the doubling is noticeable. Interestingly, that's also when a lot of factors started to stagnate.
john-h-k|5 months ago
jared_hulbert|5 months ago
AMD has something similar. The PCIe bus and the memory bus both originate from the processor or I/O die of the "CPU"; when you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory, so you can direct it into cache and bypass sending it out on the DRAM bus.
In theory... as for the specifics of exactly what is supported, I can't vouch for that.
Jap2-0|5 months ago
jared_hulbert|5 months ago
Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no; now, with all the folio updates to Linux kernel memory handling, I'm not sure. I think you'd have to take care that the data gets into the page cache as huge pages. If not, then when you tried to madvise() (or whatever) the buffer to use huge pages, it would likely just ignore you. In theory the kernel could aggregate the small pages into huge pages, but that would be more latency-bound work, and it's not clear how that would impact the page cache.
But arm64 systems with 16K or 64K native pages would have fewer faults.
inetknght|5 months ago
Yes. Tens or hundreds of gigabytes of 4K page table entries take a while for the OS to navigate.
comradesmith|5 months ago
jared_hulbert|5 months ago