SQLite: Wal2 Mode

[+] miroljub|2 years ago|reply

> In wal2 mode, the system uses two wal files instead of one. The files are named "<database>-wal" and "<database>-wal2", where "<database>" is of course the name of the database file. When data is written to the database, the writer begins by appending the new data to the first wal file. Once the first wal file has grown large enough, writers switch to appending data to the second wal file. At this point the first wal file can be checkpointed (after which it can be overwritten). Then, once the second wal file has grown large enough and the first wal file has been checkpointed, writers switch back to the first wal file. And so on.

Looks so logical that I don't understand why WAL mode was not implemented like this from the get go. Probably an optimization wrongly dismissed as premature?

Anyways, looking forward to this mode reaching general availability.

[+] athoscouto|2 years ago|reply

> Checkpointers do not block writers, and writers do not block checkpointers. This means that, even following the checkpoint, the wal file cannot be overwritten or deleted, and so all subsequent transactions must also be appended to the wal file. The work of the checkpointer is not wasted - SQLite remembers which parts of the wal file have already been copied into the db file so that the next checkpoint does not have to do so again

Probably because of this.

> but it does mean that the wal file may grow indefinitely if the checkpointer never gets a chance to finish without a writer appending to the wal file. There are also circumstances in which long-running readers may prevent a checkpointer from checkpointing the entire wal file - also causing the wal file to grow indefinitely in a busy system.

> Wal2 mode does not have this problem. In wal2 mode, wal files do not grow indefinitely even if the checkpointer never has a chance to finish uninterrupted.

I don't get how wal2 fixes the long-running reader problem though. Maybe they were just referring to the former problem?

[+] WolfOliver|2 years ago|reply

I'm not much into the details but I'm wondering why is it not always writing to "<database>-wal" once it is full, it is renamed to "<database>-wal-0" and it starts writing again into "<database>-wal", once it is full again it is renamed to "<database>-wal-1" ?

[+] dspillett|2 years ago|reply

> Looks so logical that I don't understand why WAL mode was not implemented like this from the get go. Probably an optimization wrongly dismissed as premature?

While it has advantages, it is also more code so more possible places to hide, and other disadvantages hence it doesn't completely deprecate the other WAL mode.

Also the advantages might not have been as commonly cared about in sqlite in earlier times, but it is being used in more & more places and sometimes at larger scales or with more significant concurrency needs, and the core has been pretty darn stable for quite some time, all of which factors change the dynamics of what is worth committing the dev/testing time to in terms of usefulness to the end users.

[+] tomashubelbauer|2 years ago|reply

Now I can't help but wonder if there should be a `waln` mode where the WAL files would round robin instead of alternate between just two potentially allowing for much more intense write cadence.

[+] foofie|2 years ago|reply

> Looks so logical (...)

https://en.wikipedia.org/wiki/Log-structured_merge-tree

[+] quietbritishjim|2 years ago|reply

It may simply have been a design oversight at the time. However, the page notes the following difference of WAL2 from WAL, which could be an issue in some cases:

> In wal mode, a checkpoint may be attempted at any time. In wal2 mode, the checkpointer has to wait until writers have switched to the "other" wal file before a checkpoint can take place.

[+] forinti|2 years ago|reply

I'm just speculating here, but in a normal database you would have different processes writing the wal files to the database or archives.

You don't have that with sqlite, so I don't see an obvious advantage for this, except if they now spawn a process or thread to do this concurrently.

Edit: so I read the doc (shame on me) and it has nothing to do with speed. Its purpose is to prevent a wal file from growing too large.

[+] alberth|2 years ago|reply

Bedrock

Bedrock is the more interesting branch.

It’s WAL2 + CONCURRENT

It’s also the branch Expensify uses to scale to 4M QPS, on a single node (6-years ago)

https://sqlite.org/src/timeline?r=bedrock

https://use.expensify.com/blog/scaling-sqlite-to-4m-qps-on-a...

[+] pstuart|2 years ago|reply

That certainly sounds appealing. Do you have any insights into if and when that might land on trunk?

[+] HALtheWise|2 years ago|reply

After investigating, it looks like there's actually two separate things here: "bedrock" for accessing the sqlite database over the network and "page-level locking" for allowing concurrent write transactions.

https://sqlite.org/src/doc/754ad35c/README-server-edition.ht...

I assume you could use the latter without the former.

[+] vdaea|2 years ago|reply

>In wal2 mode, the system uses two wal files instead of one. The files are named "<database>-wal" and "<database>-wal2",

Heh, I wonder how many people will delete the "wal" file thinking that, since they switched to wal2, the wal file must be a leftover.

[+] worksonmine|2 years ago|reply

If people just randomly delete files they don't fully understand on a production system maybe they should be bitten.

[+] ceeam|2 years ago|reply

Would it be a problem since the wal you delete, its inode, will still be open and processed at the DB closing normally? Just guessing, never tried that.

[+] fbdab103|2 years ago|reply

As opposed to the `-journal` file already created?

[+] maxbond|2 years ago|reply

Very similar to the left-right primitive.

https://docs.rs/left-right/latest/left_right/

My understanding is that this technique is older than the linked implementation (though independently rediscovered), but notably, this implementation was written to support a different high concurrency SQL database (for some definition of that) called Noria.

[+] magnat|2 years ago|reply

Microsoft SQL Server uses similar architecture [1], but instead of using separate log files, it allocates Virtual Log Files (VLF) inside a physical (on-disk) log file. VLFs are allocated from a ring buffer and apparently there can be several thousands of them before things start to break.

[1] https://learn.microsoft.com/en-us/sql/relational-databases/s...

[+] bob1029|2 years ago|reply

The design of SQL Server has grown increasingly interesting to me over the years. I still prefer SQLite for personal projects, but as the size of our team and average customer goes up, I am looking for something that others can reason with if need be. Ideally, something that I can throw over the fence and not have to answer a bunch of confused questions about pragmas and quirky connection reuse code.

This WAL2 feature is a perfect example of a new kind of concern I have. SQLite has a really competent facility for handling write-ahead today, but it has these edge cases where it may fail under adverse (but totally plausible) scenarios. I haven't yet had a completely corrupted SQLite database, but I have had one incident on a QA server where I had to delete the WAL/SHM files to get the database to work again.

[+] cjs_ac|2 years ago|reply

As far as I can tell, this feature has not yet been released.

[+] sgbeal|2 years ago|reply

> As far as I can tell, this feature has not yet been released.

It's been off-trunk since its inception in Oct. 2017 and there's been no discussion within the project of merging it into trunk (why that is i cannot speculate). It is actively maintained for use with the bedrock branch, as can be seen in the project's timeline:

https://sqlite.org/src/timeline?r=wal2

[+] raj555|2 years ago|reply

tangential point: one thing that always bothered me about WAL is that it is supposed to exist to help maintain data integrity, recover from crash etc but that file itself is written (committed to disk reliably) in batches and not after every change to the database, apparently to gain performance. Doesn't that defeat the purpose? How haven't things broken down despite this? Not specific to sqlite but databases in general. Never found an answer to this.

[+] bawolff|2 years ago|reply

I think that depends on the setting of PRAGMA synchronous.

I'm not an expert on this, but i think the idea is to separate durability from db corruption. (When synchronous = normal instead of full) you can potentially lose (comitted) data in WAL mode if a power failure happens at just the right moment, however your database won't be corrupt. No data will be half written. Each transaction will either be fully there or fully missing.

https://www.sqlite.org/pragma.html#pragma_synchronous

[+] eatonphil|2 years ago|reply

You can still batch and block before returning from a request to maintain durability. This improves throughout at the expense of latency.

Since SQLite is single writer I'm not sure if it does this. But this (batch yet block) is how I understood Postgres works.

Of course you can turn off the blocking too by setting postgres fsync configuration to an interval rather than synchronous.

[+] formerly_proven|2 years ago|reply

You only need to achieve durability on a COMMIT.

[+] runiq|2 years ago|reply

So it's basically double buffering, but for databases? That makes sense.

[+] gregw2|2 years ago|reply

In a high-level sense, yes! But it kind of depends on which part of the design pattern you mean by "double buffering for databases".

With double-buffering (2d/3d graphics) you are literally writing the final pixel-level data to the back buffer.[1]

In a database WAL scenario, to further analogize, it's more like you are writing the 2d/3d graphics commands to the buffer and executing them later. Because that is part of the point of the WAL -- it results in reduced disk writes because only the log file needs to be flushed to disk to guarantee a transaction is committed, rather than every data file/byte(/pixel) changed by the transaction.[2][3] (The WAL content is loosely a bit more like 3D (or 2D) vertex buffer objects/display lists [4] if you are familiar with those.)

Swapping the two WAL files though and alternating writing to each is yes like double buffering.

A third similar design pattern (to WALs) is used in operating systems' journaling filesystems[5] and actually was a contribution from OSes adopting database WAL techniques back in the 1990s.

Apologies if you know all this.

[1] https://en.wikipedia.org/wiki/Multiple_buffering#Double_buff...

[2] https://www.postgresql.org/docs/15/wal-intro.html

[3] https://en.wikipedia.org/wiki/Write-ahead_logging

[4] https://en.wikipedia.org/wiki/Display_list

[5] https://en.wikipedia.org/wiki/Journaling_file_system

[+] jbverschoor|2 years ago|reply

Tripple

[+] spiffytech|2 years ago|reply

Wal2 mode was included in the benchmarks for the HC-tree backend research:

https://sqlite.org/hctree/doc/hctree/doc/hctree/threadtest.w...

[+] alberth|2 years ago|reply

While HC-tree looks promising, keep in mind the very first sentence of your link

  This project contains no code stable enough to deploy. The database backend works well enough to run some test cases, but is still quite incomplete.

Link to shortcomings, some of which are significant

https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html#s...

[+] pstuart|2 years ago|reply

Yikes! WAL2 + begin-concurrent = minimal TPS. I was expecting the opposite.

[+] kevincox|2 years ago|reply

I wonder how this would compare to just punching a hole in the already applied section of the WAL (1) file. It seems that this should allow the filesystem to free the disk space similarly to if the file had been deleted while allowing the writers to keep writing. I guess eventually it would become a problem as you hit max logical file size but it seems that you can likely bet on getting a chance to truncate it before hitting exabytes.

[+] no_wizard|2 years ago|reply

I wonder how this with affect all the new-ish distributed SQLite systems like Litestream

[+] grogers|2 years ago|reply

How common is an automatically expanding WAL across other DBs? I'm most familiar with MySQL which uses a fixed size WAL (unless something changed recently). That of course comes with other tradeoffs like potentially blocking writes if checkpointing falls behind. But I'm curious if SQLite is an oddity in this respect compared to other DBs. Since it is used in embedded contexts it might prefer to save less data on disk in the common case, but with the edge case that wal2 mode now fixes.

[+] mike_hearn|2 years ago|reply

For example, RocksDB/LevelDB have done this from the very beginning. If a WAL file gets too long then a new one is created and the old one is sorted and written to an sstable asynchronously.

[+] dist-epoch|2 years ago|reply

Postgres creates multiple WAL files of fixed size. Old ones are deleted according to nebulous rules.

[+] yashthakker|2 years ago|reply

Great information! I'm looking forward to seeing this mode in action. Do you know when it will be available for general use?

[+] yihengx|2 years ago|reply

[deleted]

96 comments