ayende | 5 months ago

This is wrong, because your mmap code is being stalled by page faults (including soft page faults, which occur when the data is in memory but not yet mapped into your process).

The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

Do the same with 6 threads that first touch a byte on each page and then hand that page section to the counter, and you'll find similar performance.

And you can use both madvise / huge pages to control the mmap behavior.
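The madvise idea above can be sketched in Python's stdlib `mmap` module, which exposes `madvise()` on POSIX. The file, its size, and the scan are stand-ins for the article's workload; `MADV_WILLNEED` is a Linux/POSIX constant that asks the kernel to start readahead for the range instead of faulting pages in one at a time.

```python
import mmap
import os
import tempfile

# Stand-in data set: a 4 MiB scratch file (hypothetical; any file works).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (4 * 1024 * 1024))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # Hint the kernel to fetch the whole mapping up front, so the
    # later scan takes fewer demand faults (constant is POSIX-only).
    if hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)
    count = mm[:].count(b"x")  # scan; pages are likely resident by now
    mm.close()
os.unlink(path)
print(count)  # 4194304
```

The same `madvise()` call accepts `MADV_HUGEPAGE` on kernels with transparent huge pages enabled, which is the other knob mentioned above.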

mrlongroots|5 months ago

Yes, it doesn't take a benchmark to find out that storage cannot be faster than memory.

Even if you had a million SSDs and were somehow able to connect them all to a single machine, you would not outperform memory, because the data needs to be read into memory first, and can only then be processed by the CPU.

Basic `perf stat` and minor/major faults should be a first-line diagnostic.
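The fault counters that `perf stat` reports can also be read in-process via `getrusage(2)`; Python's `resource` module wraps it. This sketch (scratch file and sizes are hypothetical) shows minor faults accumulating as a fresh mapping is touched page by page, which is exactly the first-line signal described above.

```python
import mmap
import os
import resource
import tempfile

def minor_faults():
    # ru_minflt: soft page faults serviced without disk I/O.
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

# Stand-in data set: an 8 MiB scratch file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (8 * 1024 * 1024))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    before = minor_faults()
    # Touch one byte per page: each first touch of an unmapped page
    # is a fault (the kernel's fault-around may batch several pages).
    for off in range(0, len(mm), mmap.PAGESIZE):
        mm[off]
    after = minor_faults()
    mm.close()
os.unlink(path)
print(after - before)
```

`perf stat -e minor-faults,major-faults` on the whole benchmark gives the same numbers without instrumenting the code.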

johncolanduoni|5 months ago

This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status this would run the whole gamut of mapping existing memory pages, kernel-to-userspace memory copies, and actual disk access.

Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

alphazard|5 months ago

> storage can not be faster than memory

This is an oversimplification. It depends what you mean by memory. It may be true when using NVMe on modern architectures in a consumer use case, but it's not true about computer architecture in general.

External devices can have their memory mapped to virtual memory addresses. There are some network cards that do this for example. The CPU can load from these virtual addresses directly into registers, without needing to make a copy to the general purpose fast-but-volatile memory. In theory a storage device could also be implemented in this way.

hinkley|5 months ago

I’m pretty sure that as of PCI-E 2 this is not true.

It’s only true if you need to process the data before passing it on. You can do direct DMA transfers between devices.

In which case one needs to remember that memory isn’t on the CPU. It has to beg for data just about as much as any peripheral. It uses registers and L1, which are behind two other layers of cache and an MMU.

lucketone|5 months ago

It would seem you summarised the whole post.

That’s the point: “mmap” is slow because it is serial.

arghwhat|5 months ago

mmap isn't "serial", the code that was using the mapping was "serial". The kernel will happily fill different portions of the mapping in parallel if you have multiple threads fault on different pages.
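The multi-threaded faulting pattern can be sketched as below. The file and thread count are stand-ins; each thread touches a disjoint stride of pages before a serial scan. Note this is only an illustration of the access pattern: in C the kernel would service the threads' faults concurrently, while CPython's GIL serializes the accesses themselves.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Stand-in data set: ~2 MiB of newline-terminated records.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"a\n" * (1 << 20))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    page = mmap.PAGESIZE
    nthreads = 6  # matches the 6 io_uring workers discussed above

    def prefault(tid):
        # Each thread faults on its own disjoint set of pages.
        for off in range(tid * page, len(mm), nthreads * page):
            mm[off]

    with ThreadPoolExecutor(nthreads) as pool:
        list(pool.map(prefault, range(nthreads)))

    lines = mm[:].count(b"\n")  # pages are resident; the scan is serial
    mm.close()
os.unlink(path)
print(lines)  # 1048576
```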

(That doesn't undermine that io_uring and disk access can be fast, but it's comparing a lazy implementation using approach A with a quite optimized one using approach B, which does not make sense.)

guenthert|5 months ago

Well, yes, but isn't one motivation of io_uring to make user-space programming simpler and (hence) less error-prone? I mean, I/O error handling on mmap isn't exactly trivial.

arunc|5 months ago

Indeed. Use mmap with MAP_POPULATE, which will pre-populate the mapping.
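For reference, `MAP_POPULATE` is a Linux-only mmap flag that tells the kernel to fault the whole mapping in at `mmap()` time. Python exposes it since 3.9; this sketch (scratch file is a stand-in) falls back to no flag on other platforms.

```python
import mmap
import os
import tempfile

# Stand-in data set: 4 KiB of repeated records.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"data" * 1024)
    path = f.name

with open(path, "rb") as f:
    # MAP_POPULATE (Linux-only) pre-faults the whole mapping up front,
    # so subsequent reads take no page faults.
    flags = mmap.MAP_PRIVATE | getattr(mmap, "MAP_POPULATE", 0)
    mm = mmap.mmap(f.fileno(), 0, flags=flags, prot=mmap.PROT_READ)
    n = mm[:].count(b"data")
    mm.close()
os.unlink(path)
print(n)  # 1024
```

The trade-off is that `mmap()` itself then blocks until every page is resident, which moves the fault cost rather than removing it.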

jared_hulbert|5 months ago

Someone else suggested this; the results are even worse, by 2.5s.