I've said it before and I'll say it again: Modern filesystems straight out expose the wrong API for 99% of applications. App developers almost never think of data as a stream of bytes. We think of data as a set of records. Values in the set change through atomic modification events.
The fundamental API primitives should be atomic changes. Atomic write (bytes), and atomic append. The funny thing about it is that POSIX already supports basically this API (datagrams) for both IPC and networking. It just doesn't support this API in the one place it would be most useful - the filesystem.
Ideally I want:
- Write() to be blocking / atomic by default. Don't return until data is safely committed.
- A transactional API: begin(fd); write(); write(); err = commit(fd). If any error happens, commit returns the error and none of the data is stored.
- An IOCP-style API for non-blocking applications. This is the API databases want to use, with the loop being <get network request>, <write data to filesystem>, <yield>, <get write completion event>, <send confirmation to client>.
- Deprecate fsync & friends. If you don't want to wait for the data to get committed, write in non-blocking mode and ignore the completion event.
Solving this problem in end-user applications is really hard - almost no applications implement atomic writes on top of filesystems correctly. And they shouldn't have to - this should be the job of the filesystem. The filesystem can do this much more safely, with better guarantees, better performance and better error handling. Modern filesystems already have journals - buggy reimplementations of journals in userland don't help anyone. "Do not turn off console while game is saving" is an embarrassment to everyone.
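For concreteness, here is a minimal sketch (in Python, standing in for the C calls; names illustrative) of the dance an application has to hand-roll today to get one atomic, durable file replacement out of the stream-of-bytes API:

```python
import os
import tempfile

def atomic_save(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the
    new contents, never a torn mix - the dance every app must hand-roll."""
    dirname = os.path.dirname(path) or "."
    tmp = os.path.join(dirname, ".tmp." + os.path.basename(path))
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # 1. data durable in the temp file
    finally:
        os.close(fd)
    os.rename(tmp, path)          # 2. atomic swap (POSIX, same filesystem)
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)             # 3. make the rename itself durable
    finally:
        os.close(dfd)

savedir = tempfile.mkdtemp()
target = os.path.join(savedir, "save.dat")
atomic_save(target, b"old state")
atomic_save(target, b"new state")  # readers only ever see one or the other
```

Miss any one of the three sync points and a crash can lose or tear the file, which is exactly why so few applications get it right.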
This is why SQLite advertises itself as a replacement for fopen(). It's crazy hard to get right, and SQLite did the work so we don't have to: "Think of SQLite not as a replacement for Oracle but as a replacement for fopen()" - https://sqlite.org/about.html
Obviously SQLite would be a horrible replacement for fopen() in PG and other such systems, but for many, many use-cases SQLite is a good replacement for fopen().
> Modern filesystems straight out expose the wrong API for 99% of applications. App developers almost never think of data as a stream of bytes. We think of data as a set of records
As I recall, older operating systems like VMS and MULTICS all did this, and it was all tremendously complex. Unix’ simple stream-of-bytes file abstraction was a reaction to this, and it worked so well that it became the prevailing model. Before doing again what didn’t work before, check up on why it failed the previous time.
> - An IOCP-style API for non-blocking applications.
This exists, it's called Linux AIO (distinct from Posix AIO). The problem is, when you use it, you have to reimplement caching (and buffering) in userspace instead of journaling – which is just as hard to get right and can just as easily – maybe even more easily – lead to corruption. (Postgres, as an example, relies on the OS to buffer writes.)
> - Write() to be blocking / atomic by default. Don't return until data is safely committed.
This is the wrong thing for 99% of use cases. You get terrible performance unless you batch things, which means that casual use ends up with severe performance issues. (IIRC this was actually a problem on Android devices – many apps were misusing SQLite by not using transactions, resulting in every single database update being a separate atomic write to disk. Not only did this kill performance, but it caused excessive flash wear.)
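The Android failure mode is easy to reproduce; here is a small illustrative sqlite3 sketch (Python standing in for the Android API) of the difference between per-statement commits and a batched transaction:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

# Anti-pattern: each INSERT is its own transaction, i.e. its own journaled,
# synced commit - the misuse described above.
for i in range(100):
    with conn:
        conn.execute("INSERT INTO kv VALUES (?, ?)", (f"a{i}", "x"))

# Batched: one transaction, one commit, one sync for all 100 rows.
with conn:
    conn.executemany("INSERT INTO kv VALUES (?, ?)",
                     ((f"b{i}", "x") for i in range(100)))

count = conn.execute("SELECT count(*) FROM kv").fetchone()[0]
```

Both loops store the same rows; the first pays the atomic-write cost a hundred times, the second once.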
> Modern filesystems already have journals - buggy reimplementions of journals in userland doesn't help anyone.
Databases (among other systems) need features and control over the journal that a filesystem cannot provide. (Think MVCC, replication, etc.)
Besides – someone has to write the filesystem, and that filesystem uses largely the same mechanisms in the kernel that are exposed to userspace.
I agree fully that fsync ought to be deprecated for something with much more clearly-defined semantics. Both OS X and Linux have made attempts at this (F_FULLFSYNC and sync_file_range, respectively), though clearly at least Linux still has some work to do.
But – barring such unclear semantics – the general model of using fsync to guarantee ordering is not a complex one to understand, and matches most use cases well.
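A minimal sketch of that ordering model, using Python's os module as a stand-in for the C calls: fsync acts as a barrier between a journal write and the data write that must not precede it on disk:

```python
import os
import tempfile

d = tempfile.mkdtemp()
wal_path = os.path.join(d, "journal")
data_path = os.path.join(d, "table")
wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o644)

os.write(wal_fd, b"intent: set k=v\n")
os.fsync(wal_fd)    # barrier: the intent record is durable before ...
os.write(data_fd, b"k=v\n")    # ... the data file is touched
os.fsync(data_fd)   # barrier: data durable before the journal entry
                    # may safely be discarded
os.close(wal_fd)
os.close(data_fd)
```

After a crash, recovery replays any journal intents whose data writes never completed - the model works precisely because each fsync orders everything before it ahead of everything after it.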
Microsoft introduced transactional file system (and registry) APIs in Windows Vista [1]. But it was so complex that no one ever used it, and now it is semi-officially deprecated.
[1] https://en.wikipedia.org/wiki/Transactional_NTFS
No no no, we need async I/O for files. Writing should not be synchronous, but you should be able to find out about completion of each write.
Asynchrony is absolutely critical for performance.
Filesystem I/O, and, really, disk/SSD I/O, is very much the same as network I/O nowadays. You might be traversing a SAN, for example, or paying seek latency that is reasonable by HDD standards but nowadays is just way too high anyway.
Transactional APIs are very difficult for filesystems. Barriers are enough for many apps. A barrier is also much easier to start using in existing apps. You'd call a system call to request a write barrier, and wait, preferably asynchronously, for completion notice (whereupon you'd know all writes done before the barrier must have reached stable storage).
Everything you describe will murder IO performance, because you're essentially implementing a database with MVCC. It makes much more sense to implement what you described as a VFS library.
> - Write() to be blocking / atomic by default. Don't return until data is safely committed.
How do you deal with the user pulling out a USB stick?
I've been surprised to see an apparent consensus from the filesystem developers that Postgres should be using direct IO.
I worry that if the Postgres people do make that change, they'll find themselves hearing from a different set of kernel developers that they should have known direct IO doesn't work properly and they should be using buffered IO instead.
Previously I'd thought the latter was the general view from the kernel side.
For example this message from ten years ago, and other strongly-worded views in that thread: https://lkml.org/lkml/2007/1/10/235
In particular, I'd taken this bit as a suggestion that if people found problems with buffered IO then the right thing to do is to ask the kernel side to improve things, rather than switch:
« As a result, our madvise and/or posix_fadvise interfaces may not be all that strong, because people sadly don't use them that much. It's a sad example of a totally broken interface (O_DIRECT) resulting in better interfaces not getting used, and then not getting as much development effort put into them. »
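For reference, those interfaces are easy to reach even from Python's os module; a small illustrative sketch (Linux semantics assumed) of the kind of cache hints a database could give instead of using O_DIRECT:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 65536)
os.fsync(fd)

# Tell the kernel we'll scan this file once, front to back ...
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
# ... and that, once written back, its cached pages may be dropped rather
# than competing with more useful data in the page cache.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)
```

These are hints, not guarantees - which is part of why databases have found them a weaker tool than controlling the I/O directly.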
> I worry that if the Postgres people do make that change, they'll find themselves hearing from a different set of kernel developers that they should have known direct IO doesn't work properly and they should be using buffered IO instead.
That definitely will happen. But the fact remains that at the moment you'll get considerably higher performance when expertly using O_DIRECT, and there's nothing on the horizon to change that.
> In particular, I'd taken this bit as a suggestion that if people found problems with buffered IO then the right thing to do is to ask the kernel side to improve things, rather than switch:
I think partially that's just been overtaken by reality. A database is guaranteed to need its own buffer pool and you're a) going to have more information about recency in there b) the OS caching adds a good chunk of additional overhead. With buffered IO we (PostgreSQL) already had to add code to manage e.g. the amount of dirty data caching the OS does. The only reason DIO isn't always going to be beneficial after doing the necessary architectural improvements, is that the OS buffer pool is more adaptive in mixed use / not as well tuned databases.
Can we ignore for a second what the proper behavior should be, and instead focus on the documentation.
In my opinion, even a careful reading of the fsync man page does not cover what exactly happens if you close an fd, reopen the file in another process, and then call fsync. Am I supposed to read kernel source code? Ideally, after reading a man page, I should have no questions about exactly what guarantees are provided by an API.
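A concrete sketch of that ambiguous pattern (Python standing in for the C calls; paths illustrative). The writes land, but whether the second descriptor's fsync would report a writeback error from the first descriptor's writes is exactly what the man page leaves unstated:

```python
import os
import tempfile

path = tempfile.mkstemp()[1]

# "Process A": write through one descriptor and close without syncing.
fd_a = os.open(path, os.O_WRONLY)
os.write(fd_a, b"important bytes")
os.close(fd_a)        # dirty pages may still only be in the page cache

# "Process B": reopen the same file and fsync through a new descriptor.
fd_b = os.open(path, os.O_WRONLY)
os.fsync(fd_b)        # flushes the file's dirty pages - but would a
                      # writeback error from A's writes be reported here?
os.close(fd_b)
```

On a healthy disk this always succeeds; the documentation question is what guarantee the fsync return value carries when writeback of A's pages failed in between.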
I'm always surprised by what a mess doing what seem like simple file operations is. Maybe even more surprised that everything seems to generally work pretty well even with those issues. Even "I want to save this file" requires numerous sync operations on the file and the directory it's in.
I'm certainly not qualified to criticize anyone for the current situation, and, as the article points out, even some of the more egregious-sounding behavior (marking pages as clean after writing fails) has a pretty reasonable explanation. But, IIRC, as storage capacities continue to rise, error rates aren't falling nearly as fast. So, I'm left kinda wondering if there is some day in the future where the likelihood of encountering an error finally gets high enough that things don't work pretty well anymore.
Although it's not exactly associated with this as such, there is a growing understanding that SMB/CIFS shares have a nasty habit of reporting "on storage" before the data really is safe. That is a bit of a problem for many backup systems, unless you do a verify afterwards and pick up the pieces. Backups can involve massive files with odd write and read patterns, and databases generally involve quite large files with odd read and write patterns compared to, say, document storage.
Perhaps we need database and backup oriented filesystems and not funny looking files on top of generic filesystems.
> what a mess doing what seem like simple file operations is
Proper handling and reporting of hardware-level errors all the way up through the stack (driver, block layer, filesystem, C library) to the application so it can recover in a reliable way is not a simple operation!
Simple operations are open/close/read/write. Those work. Until they don't - and then you need to know how far back the operations you already did, and "assumed" had worked, actually didn't. And in this case the promise made to PostgreSQL by fsync() wasn't as firm as the "obvious" interpretation of the documentation would lead one to believe.
> When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean.
That behavior seems problematic. As always, there's a great Dan Luu blog post on the subject: https://danluu.com/filesystem-errors/
> Filesystem error handling seems to have improved. Reporting an error on a pwrite if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs if there are no complicating factors.
Now, taking the case of a user pulling out a USB thumb drive as an excuse for not keeping the dirty pages around seems ... disingenuous?
If the storage device has disappeared for good, you can just return EIO for all further I/O operations, and mark all open file descriptions for which there were dirty pages such that any further fsync() calls on the corresponding fds return an error?
I mean, either you think you can still retry, then you should keep the dirty page around, or you think retrying is futile, then feel free to drop the dirty pages, but make sure anyone who tries something that would make this loss visible gets an error, which should only require keeping flags on open file descriptions, and possibly pages/inodes/block devices that (semi-)persist the error at the desired resolution, which you can broaden if the bookkeeping uses too much memory.
Yeah. The USB case is a cop out. For USB, keeping the pages dirty and the fsyncs erroring (as seems consistent with Postgres' needs and common sense) seems fine.
The memory can be reclaimed when 'umount --force', or something like that, discards filesystem dirty state.
> Such a change, though, would take Linux behavior further away from what POSIX mandates and would raise some other questions, including: when and how would that flag ever be cleared? So this change seems unlikely to happen.
I saw this kind of problem when I was writing SSD firmware 10-15 years back. The operating systems just don't do much with the hardware-reported errors. There are some old research papers on "IRON filesystems" that are pretty good reading on how poor the error handling was and maybe still is.
There's no way to recover from a failed write (if the drive is still operating and could reallocate the sector, it would have already done that). So mark the pages damaged and deallocate their contents. Keep the metadata for the damaged pages around until someone tries to sync or close the associated file.
That's not exactly true. In the thin-provisioned block device case, administrator action can make it resume accepting writes.
If you take into consideration that there are alternatives like FreeBSD and SmartOS which do not suffer from such serious and basic functionality malfunctions, it is illogical to keep putting up with GNU/Linux on the basis of being the only thing one is comfortable with.
Comfort is of little consolation or use if the operating system is this unreliable, especially since making sure that data is safely and correctly stored is core, basic functionality.
This article is really good with lots of details in those linked discussions.
Wondering what happens to other critical libraries such as RocksDB/LevelDB: what actually happens when there is a hardware error, not limited to an unplugged USB cable?
Sorry for being snarky, but from my ops experience MySQL manages to lose data even without hardware errors[1].
[1] My last experience was due to a bug where a certain pattern of data made MySQL/MariaDB think the data page was encrypted, after which it proceeded to discard that page and crash complaining that the data was corrupted, and from that point on refused to start until the data got restored.
Here's a simpler fix: when the underlying device produces an error then mark the in-core inode (not on disk) as having an error and have all further writes return EIO. Then fsync() too can notice the error state flag being set and also return EIO.
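That proposal can be modeled in a few lines of userspace Python (purely illustrative; a real implementation would live in the kernel's inode and page-cache code):

```python
import errno
import os
import tempfile

class ErrorLatchingFile:
    """Userspace model of the proposed kernel behavior: once the device
    reports an error, the in-core error flag is set and every later write
    and fsync fails, rather than the error being reported once while the
    pages are quietly marked clean."""

    def __init__(self, fd):
        self.fd = fd
        self.errno = 0            # models the in-core inode error flag

    def write(self, data):
        if self.errno:
            raise OSError(self.errno, os.strerror(self.errno))
        try:
            return os.write(self.fd, data)
        except OSError as e:
            self.errno = e.errno  # latch; never cleared for this inode
            raise

    def fsync(self):
        if self.errno:
            raise OSError(self.errno, os.strerror(self.errno))
        os.fsync(self.fd)

f = ErrorLatchingFile(os.open(tempfile.mkstemp()[1], os.O_WRONLY))
f.write(b"ok")          # works while the device is healthy
f.errno = errno.EIO     # simulate the device failing mid-flight
```

With the flag latched, no caller can ever see a successful fsync after a lost write, which is the guarantee PostgreSQL assumed it had.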
"The job of calling fsync(), however, is handled in a single "checkpointer" process, which is concerned with keeping on-disk storage in a consistent state that can recover from failures"
And therein lies the rub. Sybase ASE calls fsync() upon every commit, which is the reason that database devices are still mostly implemented with raw devices. Before version 11.9.2 (as far as I recall) you ran the exact same risk if you used the file system as devices. Now it's safe, but performance can get pretty heinous on write-intensive systems.
Those are journal commits, not the commits that the piece you quote is talking about (actual data files).
Sybase ASE opens files with either the O_SYNC or the O_DIRECT flag, the latter being available since ASE 15.
The general good practice is to use raw devices or direct i/o for write intensive workloads.
BTW it uses asynchronous i/o too.
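A sketch of what that practice looks like at the syscall level (Python's os module standing in for the C API; Linux flags assumed):

```python
import os
import tempfile

path = tempfile.mkstemp()[1]

# O_SYNC: every write(2) also waits for the data to reach stable storage,
# so each write has commit semantics without a separate fsync call.
# (O_DIRECT additionally bypasses the page cache, but requires the buffer,
# offset and length to be aligned to the device's block size.)
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"committed when write() returns\n")
os.close(fd)
```

This trades throughput for the guarantee that a returned write is on stable storage - which is why it's reserved for write-intensive, durability-critical files rather than used by default.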
> Andres Freund, like a number of other PostgreSQL developers, has acknowledged that DIO is the best long-term solution. But he also noted that getting there is "a metric ton of work" that isn't going to happen anytime soon.
No, that is a "recipe for disaster", as they say. Not doing something that everyone acknowledges is important, because it's a "lot of work", is what makes projects a mess. I've seen that many times on various projects.
Diving in and quickly doing something complicated without a lot of careful consideration and testing is a recipe for disaster. Especially when there may be a simpler way of accomplishing the same goals, that just isn't available yet (be it future APIs, or some better way that no one has thought of yet).
And it'd not be the default anyway, as it requires more tuning.
We are working on getting there.
You sound like a Dilbert manager's "I don't care, just have it on my desk at all costs by Friday."
That is not sound engineering.