top | item 19119991

PostgreSQL used fsync incorrectly for 20 years

676 points | lelf | 7 years ago | fosdem.org

307 comments

[+] georgebarnett|7 years ago|reply
Years ago (2010 iirc), I reported to the Postgres list, with a patch, that (one of) the reasons Postgres “didn’t work on NFS” was that it couldn’t deal with short writes. I got told the usual “you’re holding it wrong” instead of an acknowledgement of PG not sticking to the spec.

I patched my own systems (wrap the calls in a loop as per standard practice) and then proceeded to run literally hundreds of thousands of PostgreSQL instances on NFS for many more years with no problems.

The patch was eventually integrated, it seems, but I never found out why because I lost interest in trying to work with the community after that experience.

[+] DannyBee|7 years ago|reply
This saddens me.

Looking at this thread - the patch is obviously right. It doesn't matter about NFS, Linux, or any of the other random crap people brought up. (I unfortunately suspect that if you hadn't mentioned any of that, and had just said "write doesn't guarantee full writes, this handles that", it might have gone better.)

The documentation (literally on every system) is quite clear:

"The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written."

(see https://www.gnu.org/software/libc/manual/html_node/I_002fO-P...)

Literally all documentation you can find about write on all systems that implement it clearly mentions it doesn't have to write all the bytes you ask it to.

Instead of saying "yeah, that's right", people go off in every random direction, arguing about your NFS mount options, what things can cause write to do that, and whether they are possible[1] or worth caring about.

Sigh.

[1] spoiler alert: it doesn't matter. The API allows for this, you need to handle it.
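The retry loop the documentation describes is only a few lines. Here is a minimal sketch in Python (`write_all` is a hypothetical helper name; the same shape applies to the C call):

```python
import os
import tempfile

def write_all(fd, data):
    # POSIX allows write() to accept fewer bytes than requested
    # (a "short write"); loop until everything has been written.
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # may be < len(view)
        view = view[written:]

# Usage: nothing is dropped even if individual writes come up short.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
fd = os.open(path, os.O_WRONLY)
write_all(fd, b"x" * 100_000)
os.close(fd)
size = os.path.getsize(path)
os.unlink(path)
```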

[+] jarym|7 years ago|reply
Actually, as great as Postgres is and as generally approachable as the community is - my experience was the same a few times, and I've read on the mailing list of it happening to others:

Someone comes along with a patch or idea. Bunch of big Postgres people come knock it and it dies right there.

Happened to me when I suggested a more multi-tenant approach back around 2010 and today we have Citus. I was told that (paraphrased) no users were asking for that sort of thing.

I see it kind of happening with the Foreign Key Array patch: the author asked for help rebasing it and no one bothered to reply.

Someone suggested replacing the Postmaster with a threaded approach so it could scale better (and showed benchmarks of their implementation handling more connections). The community response was that there were already third-party connection pools that do the job. To an outsider looking in this seems nuts - most people would not run additional components if they did not need to!

Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.
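For context, the clipping this limit causes can be sketched roughly as follows (a simplification of what Postgres does, with a hypothetical function name; the real code clips at character boundaries via pg_mbcliplen):

```python
def truncate_identifier(name: str, namedatalen: int = 64) -> str:
    # NAMEDATALEN counts a trailing NUL, leaving 63 usable bytes in a
    # stock build; clip at a character boundary, roughly as Postgres does.
    limit = namedatalen - 1
    clipped = name.encode("utf-8")[:limit]
    return clipped.decode("utf-8", errors="ignore")

# A long partition name silently loses its tail:
clipped = truncate_identifier("measurement_" + "x" * 80)
```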

Could name a few other examples too I’ve noticed over the years and just sighed at. I don’t expect every idea to be accepted or even welcomed - but there is that sense of bias against change.

[+] _pmf_|7 years ago|reply
I think your patch was too perfect. Always leave some small thing for the gatekeepers to latch onto, then be all "oh, yes, totally overlooked this, thank you, I'll fix it ASAP".
[+] garganzol|7 years ago|reply
That's an excellent patch. Partial writes (and reads) are a known behavior that is often overlooked. They rarely (if ever) occur on local file systems, and even the network ones manifest them only from time to time, usually when there is some kind of congestion or segmentation going on.

It's a pity a lot of developers out there don't have an ingrained awareness of this.

[+] networkimprov|7 years ago|reply
By "short writes" you mean partial writes?
[+] quotemstr|7 years ago|reply
The same pattern emerges in big commercial projects too. Why do you think it took 20 years to fix console resizing in Windows?
[+] cheez|7 years ago|reply
In any large organizations, fiefdoms are bound to develop. If you don't have principles for dealing with it, what ends up happening is that new ideas will be crowded out and claimed by one of the leaders of the fiefdom. The Linux model has formalized this process. I think more large open source projects should use this model.
[+] htfy96|7 years ago|reply
> In short, PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean. So a read of the blocks that were just written will likely return something other than the data that was written.

> Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.

https://lwn.net/Articles/752063/

A real-life example can be found at https://stackoverflow.com/questions/42434872/writing-program...

[+] solidsnack9000|7 years ago|reply
The linked LWN article (from April 2018) is a great summary of the problem, its cause, and potential solutions:

Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error occurs; in short, the most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else. So those pages cannot be kept if the user wants the system to remain usable after such an event.

[+] puzzle|7 years ago|reply
Bigtable, and presumably Spanner, goes beyond that. Data chunks have checksums and are immediately re-read and verified after compactions, because it turns out that errors and corruption do happen even when you use ECC - roughly once every 5 PB compacted, per Jeff Dean's figures.
[+] metildaa|7 years ago|reply
Sounds like Google is continuing to act poorly when it comes to upstreaming their code.
[+] chubot|7 years ago|reply
Doesn't this affect all databases? Or is it a different issue?

https://www.sqlite.org/atomiccommit.html

SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.

Also this seems related:

https://danluu.com/file-consistency/

That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.
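The OS X workaround quoted above is easy to demonstrate. A minimal, hedged sketch (the helper name is made up; `fcntl.F_FULLFSYNC` is only defined on macOS, hence the fallback):

```python
import fcntl
import os
import tempfile

def flush_to_disk(fd):
    # On OS X / macOS, fsync only pushes data to the drive, which may
    # still hold it in a volatile write cache; F_FULLFSYNC asks the
    # drive to flush that cache too. Fall back to plain fsync where
    # the fcntl is not available (e.g. Linux).
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        fcntl.fcntl(fd, full_fsync)
    else:
        os.fsync(fd)

with tempfile.NamedTemporaryFile() as f:
    f.write(b"durable")
    f.flush()
    flush_to_disk(f.fileno())
    flushed = True
```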

The linked OSDI '14 paper looks good:

We find that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which we term persistence properties. We develop a tool named BOB that empirically tests persistence properties, and use it to demonstrate that these properties vary widely among six popular Linux file systems.

[+] loeg|7 years ago|reply
Yeah, IMO it's a Linux fsync bug. fsync() should not succeed if writes failed. fsync() should not clean dirty pages if writes failed. Both of these behaviors directly contravene the goals of user applications invoking fsync() as well as any reasonable API contract for safely persisting data.

Arguably, POSIX needs a more explicit fsync interface. I.e., sync these ranges; all dirty data, or just dirty data since last checkpoint; how should write errors be handled; etc. That doesn't excuse that Linux's fsync is totally broken and designed to eat data in the face of hardware errors.

That Dan Luu blog post you linked is fantastic and one I really enjoyed.

[+] masklinn|7 years ago|reply
> Doesn't this affect all databases? Or is it a different issue?

Possibly, https://wiki.postgresql.org/wiki/Fsync_Errors notes both MySQL and MongoDB had to be changed.

Note that the issue here is different from either of the bits you quote. The problem is that if fsync(2) fails and you call fsync(2) again, on many systems the second call will always succeed, because the first one invalidated/cleared all extant buffers, and thus there is nothing left for the second one to sync. Which counts as a success.

AKA, because of systems' shortcuts, an fsync success effectively means "all writes since the last fsync have succeeded", not "all writes since the last successful fsync have succeeded". Writes between a success and a failure may be irrecoverably lost.
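This is why the fix adopted by PostgreSQL (and others) is to treat a failed fsync as fatal rather than retrying it. A rough sketch of that policy, with illustrative names (this is not PostgreSQL's actual code):

```python
import os
import tempfile

class FsyncPanic(Exception):
    """Stand-in for PostgreSQL's PANIC; the names here are illustrative."""

def checkpoint(fd):
    # Never retry a failed fsync: on Linux the failed pages were marked
    # clean, so a retry would report success while the data is already
    # gone. Crash and recover from the WAL instead.
    try:
        os.fsync(fd)
    except OSError as e:
        raise FsyncPanic(f"fsync failed, entering crash recovery: {e}")

# Usage: a healthy fd syncs fine; a bad fd (standing in for an I/O
# error) triggers the panic path instead of a misleading retry.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"wal record")
    f.flush()
    checkpoint(f.fileno())

bad_fd = os.open(os.devnull, os.O_WRONLY)
os.close(bad_fd)
try:
    checkpoint(bad_fd)
    panicked = False
except FsyncPanic:
    panicked = True
```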

[+] anarazel|7 years ago|reply
> Doesn't this affect all databases?

Yes, most. And several did similar changes (crash-restart -> recovery) to handle it too. It's also possible to avoid the issue by using direct IO, but often that's not the default mode of $database.

[+] jandrewrogers|7 years ago|reply
> Doesn't this affect all databases?

No, this is an artifact of storage engine design. Direct I/O is the norm for high-performance storage engines -- they don't use kernel cache or buffered I/O at all -- and many will bypass the file system given the opportunity (i.e. operate on raw block devices directly) which eliminates other classes of undesirable kernel behavior. Ironically, working with raw block devices requires much less code.

Fewer layers of abstraction between your database and the storage hardware make it easier to ensure correct behavior and high performance. Most open source storage engines leave those layers in because it reduces the level of expertise and sophistication required of the code designers -- it allows a broader pool of people to contribute -- though as this case shows, it doesn't necessarily make it any easier to ensure correctness.

[+] hyc_symas|7 years ago|reply
> Doesn't this affect all databases?

Didn't affect LMDB. If an fsync fails the entire transaction is aborted/discarded. Retrying was always inherently OS-dependent and unreliable, better to just toss it all and start over. Any dev who actually RTFM'd and followed POSIX specs would have been fine.

LMDB's crash reliability is flawless.

[+] coolestuk|7 years ago|reply
"have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk"

I remember highlighting this problem to the Firebird db developers maybe 13 years ago. AFAIR they were open to the problem I'd pointed out and went about fixing it on other platforms so that the db behaved everywhere as it did on OS X. I'm probably in the bottom 5% of IT professionals on this site. I'm rather amazed to find out that so late in the day Postgres has come round to fixing this.

I haven't used Firebird in years and can't find a link to the discussion (could have been via email).

[+] haberman|7 years ago|reply
> Both Chinner and Ts'o, along with others, said that the proper solution is for PostgreSQL to move to direct I/O (DIO) instead.

Wait, is "direct I/O" the same as O_DIRECT?

The same O_DIRECT that Linus skewered in 2007?

> There really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary.

https://lkml.org/lkml/2007/1/10/233

> Side note: the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way.

https://lkml.org/lkml/2007/1/10/235

More background from 2002-2007: https://yarchive.net/comp/linux/o_direct.html
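For anyone curious what the mechanics look like: a minimal, hedged O_DIRECT sketch (Linux-specific; the block size and file name are assumptions, and the alignment requirement is the whole trick):

```python
import mmap
import os
import tempfile

# O_DIRECT bypasses the page cache, but buffer, offset, and length
# must all be aligned; 4096 is an assumed block size. An anonymous
# mmap conveniently gives us a page-aligned buffer.
BLOCK = 4096
path = os.path.join(tempfile.gettempdir(), "odirect_demo.bin")  # illustrative

try:
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | getattr(os, "O_DIRECT", 0), 0o644)
except OSError:
    # Some filesystems (e.g. tmpfs) refuse O_DIRECT; fall back to buffered IO.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)

buf = mmap.mmap(-1, BLOCK)
buf[:5] = b"hello"
os.write(fd, buf)   # with O_DIRECT, the drive DMAs straight from our buffer
os.close(fd)

with open(path, "rb") as f:
    first_bytes = f.read(5)
os.unlink(path)
```

The alignment rules (and their fragility across filesystems) are a big part of why Linus objected to the interface.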

[+] anarazel|7 years ago|reply
Turns out sometimes people other than Linus have more experience with IO than Linus.

I think there's pretty good reasons to go for DIO for a database. But only when there's a good sysadmin/DBA and when the system is dedicated to the database. There's considerable performance gains in going for DIO (at the cost of significant software complexity), but it's much more sensitive to bad tuning and isn't at all adaptive to overall system demands.

[+] Unklejoe|7 years ago|reply
I have to say he’s wrong when he says there’s no reason for EVER using O_DIRECT.

One HUGE reason is performance for large sequential writes when using fast media and slow processors. Specifically, when the write speed of the media is in the same ballpark as the effective memcpy() speed of the processor itself, which, believe it or not, is very possible today (but was probably more unlikely in 2007) when working with embedded systems and modern NVMe media.

Consider a system with an SoC like the NXP T2080 that has a high speed internal memory bus, DDR3, and PCIe 3.0 support. The processor cores themselves are relatively slow PowerPC cores, but the chip has a ton of memory bandwidth.

Now assume you had a contiguous 512 MiB chunk of data to write to an NVMe drive capable of a sustained write rate of 2000 MB/s.

The processor core itself can barely move 2000 MB/s of data, so it's clear why direct IO would perform better, since you're telling the drive to pull the data directly from the buffer instead of memcpy-ing it into an intermediate kernel buffer first. With direct IO, you can perform zero-copy writes to the media.

This is why I'm able to achieve higher sequential write performance on some modern NVMe drives than most benchmark sites report, all while using a 1.2 GHz PowerPC system.

[+] icebraining|7 years ago|reply
> Wait, is "direct I/O" the same as O_DIRECT?

No, there are other ways of doing DIO besides that particular interface.

[+] bbunix|7 years ago|reply
Historic note (like from the 80's) - any time a machine was rebooted we'd type sync; sync; sync; reboot - the explanation was that the only guarantee was that the second sync wouldn't start until the first sync successfully completed, plus one for good luck...
[+] Jerry2|7 years ago|reply
I have a flash drive that I sometimes put a video on to watch on a small TV in the basement, and I've noticed that Linux doesn't copy the file right away. The 'cp' does finish quickly, but the data is not on the flash drive yet. You either have to eject and wait, or sync and wait, for it to actually transfer.

Needless to say, this tripped me up a few times and videos weren't fully transferred.

[+] hyc_symas|7 years ago|reply
"sync;sync;halt" - followed by turning off the power so we could open up the cabinet to do maintenance...
[+] blaisio|7 years ago|reply
I don't really understand why people went with this headline and not "PostgreSQL developers discover most programs have used fsync incorrectly for decades, including PostgreSQL".
[+] giovannibajo1|7 years ago|reply
Because most normal programs do open-write-sync-close, and that mostly works as expected.

Postgres does open-write-close and then later, in another unrelated process, open-fsync-close. They discovered this doesn't always work: if somebody somewhere fsyncs that file for any reason, their process might miss the error, as it doesn't stick.
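The two patterns above can be sketched side by side (file name illustrative; on a healthy disk both succeed, the difference only shows up when an I/O error lands between the write and the later fsync):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "fsync_pattern_demo.dat")

# Pattern 1 (most programs): write and fsync on the same descriptor
# before closing. An I/O error for these writes is reported to this
# fsync call.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"payload")
os.fsync(fd)
os.close(fd)

# Pattern 2 (pre-fix PostgreSQL): write, close, then fsync later on a
# freshly opened descriptor, possibly from another process. On older
# kernels, an error already consumed by someone else's fsync would not
# be reported here, so this call could "succeed" after a lost write.
fd2 = os.open(path, os.O_WRONLY)
os.fsync(fd2)
os.close(fd2)

size = os.path.getsize(path)
os.unlink(path)
```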

[+] macdice|7 years ago|reply
I agree, the initial report on pgsql-hackers had a subject that blamed PostgreSQL too strongly and I think that carried through to articles and blogs and Twitter etc.

You are right, this affects other things too. And indeed several projects took action to remove any notion of fsync retry, referencing the LWN article etc.

POSIX is probably underspecified. An Austin Group defect report would be interesting...

[+] anarazel|7 years ago|reply
I think it's worth noting that this, even before both the kernel and Postgres changes, really only is an issue if there are very serious storage problems, ones that commonly cause more corruption than just forgetting the data due to be fsynced. The kernel has its own timeout/retry logic, and if those retries succeed there's no issue. Additionally, most filesystems remount read-only if there are any accompanying journaling errors, and in a lot of cases PG IO will also have a metadata effect.
[+] mehrdadn|7 years ago|reply
Am I misunderstanding this or does this mean Linux literally does not provide any way to ensure writes are flushed to a file in the presence of transient errors?

Does anyone know if Windows's FlushFileBuffers is susceptible to this as well? (P.S., interesting bit about FILE_FLAG_WRITE_THROUGH not working as you might expect: https://blogs.msdn.microsoft.com/oldnewthing/20170510-00/?p=...)

[+] macdice|7 years ago|reply
For the record, and not mentioned in Tomas's talk: the PostgreSQL release due out next week will add a PANIC after any fsync failure (in other words, no longer retry). The same thing has been done by MySQL and MongoDB and probably everyone else doing (optional) buffered IO for database-ish stuff who followed this on LWN etc.
[+] pkaye|7 years ago|reply
There is an old paper http://pages.cs.wisc.edu/~remzi/Classes/736/Papers/iron.pdf that analyzes how various filesystems handle errors, and not too long ago it was fairly bad. My own experience was that some older Windows versions would not even check whether a write command failed. I used to laugh when developers would state how robust their databases were when the underlying filesystem does not even check for many errors. Hopefully things are better now.
[+] mindslight|7 years ago|reply
The kernel itself isn't really a transaction manager. If there is an I/O error on a file, then I'd only expect it to percolate up in that immediate "session". How long should that error flag be expected to carry over - until filesystem remount? Or even longer, with the talk of storing a flag in the superblock?

Specifically it seems like asking for trouble to open() a path a second time, and expect fsync() calls to apply to writes done on the other file descriptor object - there's no guarantee they're even the same file [0]! At the very least, pass the actual file descriptor to the other process. Linux's initial behavior was in violation of what I've said here, but the patch to propagate the error to every fd open at the time of the error should resolve this to satisfaction.

I would think about the only reasonable assumption you could make about concurrent cross-fd syncing is that after a successful fsync(fd), any successful read(fd) should return data that was on disk at the time of that fsync(fd) or later. In other words, dirty pages shouldn't be returned by read() and then subsequently fail being flushed out.

disclaimer: It's been a while since I've written syscall code and I've had the luxury of never really having to really lean into the spec to obtain every last bit of performance. Given how many guarantees the worse-is-better philosophy forewent, I don't see much point to internalize what few are there.

[0] Ya sure, you can assert that the user shouldn't be otherwise playing around with the files, I'm just pointing out that translating path->fd isn't a pure function.

[+] EGreg|7 years ago|reply
Have there been any real-world consequences from this, and how can they be prevented?

Does MySQL have the same flaw?

[+] ausjke|7 years ago|reply
I read this at lwn.net a while ago but it seems there is no fix to it. How is MySQL doing? I believe Oracle etc are not having this problem as they deal with disk directly.
[+] pyrus|7 years ago|reply
Does anyone know if FoundationDB is affected?
[+] xorgar831|7 years ago|reply
The meta point here is that in OSS, folks assume the code has been reviewed since it's open, but the reality is that unless someone actually does review it, you really don't know. The popularity of a project doesn't mean there aren't fatal flaws.