top | item 45633241

(no title)

repstosb | 4 months ago

Of course you should offer some method to disable caching/compression/encryption/ECC/etc. in intermediate layers whenever those are non-zero cost and might be duplicated at application level... that's the ancient end-to-end argument.

But that method doesn't necessarily have to be "something like O_DIRECT", which turns into a composition/complexity nightmare all for the sake of preserving the traditional open()/write()/close() interface. If you're really that concerned about performance, it's probably better to use an API that reflects the OS-level view of your data, as Linus pointed out in this ancient (2002!) thread:

https://yarchive.net/comp/linux/o_direct.html

Or, as noted in the 2007 thread that someone else linked above, at least posix_fadvise() lets the user specify a definite extent for the uncached region, which is invaluable information for the block and FS layers but not usually communicated at the time of open().

I think it's quite reasonable to consider the real problem to be the user code that after 20 years hasn't managed to migrate to something more sophisticated than open(O_DIRECT), rather than Linux's ability to handle every single cache invalidation corner case in every possible composition of block device wrappers. It really is a poorly-thought-out API from the OS implementor's perspective, even if at first seemingly simple and welcoming to an unsophisticated user.

discuss

jandrewrogers|4 months ago

O_DIRECT isn't about bypassing the kernel for the sake of reducing overhead. The gains would be small if that was the only reason.

O_DIRECT is used to disable cache replacement algorithms entirely in contexts where their NP-hardness becomes unavoidably pathological. You can't fix "fundamentally broken algorithm" with more knobs.

The canonical solution for workloads that break cache replacement is to dynamically rewrite the workload execution schedule in realtime at a very granular level. A prerequisite for this when storage is involved is to have perfect visibility and control over what is in memory, what is on disk, and any inflight I/O operations. The execution sequencing and I/O schedule are intertwined to the point of being essentially the same bit of code. For things like database systems this provides qualitative integer factor throughput improvements for many workloads, so very much worth the effort.

Without O_DIRECT, Linux will demonstrably destroy the performance of the carefully orchestrated schedule by obliviously running it through cache replacement algorithms in an attempt to be helpful. More practically, O_DIRECT also gives you fast, efficient visibility over the state of all storage the process is working with, which you need regardless.

Even if Linux handed strict explicit control of the page cache to the database process it doesn't solve the problem. Rewriting the execution schedule requires running algorithms across the internal page cache metadata. In modern systems this may be done 100 million times per second in userspace. You aren't gatekeeping analysis of that metadata with a syscall. The way Linux organizes and manages this metadata couldn't support that operation rate regardless.

Linux still needs to work well for processes that are well-served by normal cache replacement algorithms. O_DIRECT is perfectly adequate for disabling cache replacement algorithms in contexts where no one should be using cache replacement algorithms.