kbaker | 6 months ago

> If the directory containing the rollback journal is not fsynced after the journal file is deleted, then the journal file might rematerialize after a power failure, causing sqlite to roll back a committed transaction. And fsyncing the directory doesn't seem to happen unless you set synchronous to EXTRA, per the docs cited in the blog post.

I think this is the part that is confusing.

The fsyncing of the directory is supposed to be done by the filesystem/OS itself, not the application.

From man fsync,

    As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)).

So from SQLite's perspective in DELETE mode, a transaction is either before the fsync call and not committed, or after the fsync call and committed (or partially written somehow and needing rollback).

Unfortunately it seems like this has traditionally been broken on many systems, requiring workarounds, like SYNCHRONOUS = EXTRA.
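
For reference, a minimal sketch of selecting that setting from Python's standard `sqlite3` module. The helper name `open_with_extra_sync` and the file name `example.db` are made up for illustration; the integer values in the comment are the ones SQLite's pragma documentation lists, and EXTRA needs a reasonably recent SQLite build.

```python
import os
import sqlite3
import tempfile


def open_with_extra_sync(path):
    """Open a database in DELETE journal mode with synchronous=EXTRA."""
    conn = sqlite3.connect(path)
    # journal_mode reports the mode actually in effect, as a string.
    mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    conn.execute("PRAGMA synchronous=EXTRA")
    # synchronous reads back as an integer: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA.
    level = conn.execute("PRAGMA synchronous").fetchone()[0]
    return conn, mode, level


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        conn, mode, level = open_with_extra_sync(os.path.join(d, "example.db"))
        print(mode, level)
        conn.close()
```

Reading the pragma back is worth doing, since an unrecognized setting is not reported as an error.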

agwa|6 months ago

No, the metadata is information like the modification time and permissions, not the directory entry.

The next paragraph in the man page explains this:

> Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

https://man7.org/linux/man-pages/man2/fsync.2.html

Edit to add: I don't think there's a single Unix-like OS on which fsync would also fsync the directory, since a file can appear in an arbitrary number of directories, and the kernel doesn't know all the directories in which an open file appears.
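
A small sketch of the hard-link point, assuming a POSIX filesystem that supports hard links. The directory names `a` and `b` and the helper `link_into_two_dirs` are hypothetical.

```python
import os
import tempfile


def link_into_two_dirs(root):
    """Create one file and hard-link it into a second directory.

    Returns (inode_a, inode_b, link_count).
    """
    dir_a = os.path.join(root, "a")
    dir_b = os.path.join(root, "b")
    os.mkdir(dir_a)
    os.mkdir(dir_b)

    path_a = os.path.join(dir_a, "data")
    path_b = os.path.join(dir_b, "data")
    with open(path_a, "w") as f:
        f.write("hello")

    # A second directory entry that points at the same inode.
    os.link(path_a, path_b)

    st_a = os.stat(path_a)
    st_b = os.stat(path_b)
    return st_a.st_ino, st_b.st_ino, st_a.st_nlink


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(link_into_two_dirs(root))
```

Both paths report the same inode number, and the link count only says how many directory entries exist, not which directories hold them.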

This is a moot point anyways, because in DELETE mode, the operation that needs to be durably persisted is the unlinking of the journal file - what would you fsync for that besides the directory itself?
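
As a rough illustration of what "fsync the directory itself" looks like, here is a sketch for POSIX systems. The function name `durable_unlink` is made up, and this is not how SQLite's VFS layer is written; it only shows the two system calls involved.

```python
import os
import tempfile


def durable_unlink(path):
    """Unlink `path`, then fsync its parent directory (POSIX only)."""
    parent = os.path.dirname(os.path.abspath(path))
    os.unlink(path)
    # Open the directory itself; fsync on this descriptor is what asks the
    # kernel to persist the removed directory entry.
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        victim = os.path.join(d, "example-journal")
        with open(victim, "w") as f:
            f.write("x")
        durable_unlink(victim)
        print(os.path.exists(victim))
```

After the unlink there is no file descriptor left that refers to the deleted name, so the directory is the only thing available to sync.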

kbaker|6 months ago

OK, interesting, I think I see... So you are asking what happens if SQLite opens the database and finds a not-committed rollback journal that looks valid: does it roll it back?

I got curious, so I looked at the code here:

https://sqlite.org/src/file?name=src/pager.c&ci=trunk

and found something similar to what you are asking in this comment before `sqlite3PagerCommitPhaseTwo`:

    ** When this function is called, the database file has been completely
    ** updated to reflect the changes made by the current transaction and
    ** synced to disk. The journal file still exists in the file-system
    ** though, and if a failure occurs at this point it will eventually
    ** be used as a hot-journal and the current transaction rolled back.

So, it does this:

    ** This function finalizes the journal file, either by deleting,
    ** truncating or partially zeroing it, so that it cannot be used
    ** for hot-journal rollback. Once this is done the transaction is
    ** irrevocably committed.

Assuming fsync works on both the main database and the hot journal, I don't see a way that it is not durable? It has to write and sync the full hot journal, then write to the main database, then zero out the hot journal, sync that, and only then does it return from the commit (assuming FULL and DELETE).