> In wal2 mode, the system uses two wal files instead of one. The files are named "<database>-wal" and "<database>-wal2", where "<database>" is of course the name of the database file. When data is written to the database, the writer begins by appending the new data to the first wal file. Once the first wal file has grown large enough, writers switch to appending data to the second wal file. At this point the first wal file can be checkpointed (after which it can be overwritten). Then, once the second wal file has grown large enough and the first wal file has been checkpointed, writers switch back to the first wal file. And so on.
Looks so logical that I don't understand why WAL mode wasn't implemented like this from the get-go. Probably an optimization wrongly dismissed as premature?
Anyways, looking forward to this mode reaching general availability.
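The alternation the quoted docs describe can be condensed into a toy model. To be clear, this is not SQLite's implementation — the class, the frame-count threshold, and all names here are invented for illustration (real SQLite switches based on wal file size):

```python
# Toy model of the wal2 alternation described in the quoted docs. NOT
# SQLite's implementation -- names and the threshold are invented, and
# real SQLite switches based on file size, not a frame count.

WAL_SWITCH_THRESHOLD = 4  # frames per wal before trying to switch


class Wal2Model:
    def __init__(self):
        self.wals = [[], []]               # the two "wal files"
        self.current = 0                   # index writers append to
        self.checkpointed = [True, True]   # True = safe to overwrite

    def write(self, frame):
        other = 1 - self.current
        # Writers switch only when the current wal is large enough AND the
        # other wal has been checkpointed (so it may be overwritten).
        if (len(self.wals[self.current]) >= WAL_SWITCH_THRESHOLD
                and self.checkpointed[other]):
            self.wals[other].clear()       # safe: already applied to the db
            self.current = other
        self.checkpointed[self.current] = False
        self.wals[self.current].append(frame)

    def checkpoint(self):
        # The checkpointer only touches the wal writers are NOT appending
        # to, so neither side blocks the other.
        idle = 1 - self.current
        applied = list(self.wals[idle])    # "copy these frames into the db"
        self.checkpointed[idle] = True
        return applied
```

In this toy model, if the idle wal never gets checkpointed, writers simply keep appending to the current one rather than switching — which is the bounded-growth behavior the docs claim for wal2.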
> Checkpointers do not block writers, and writers do not block checkpointers. This means that, even following the checkpoint, the wal file cannot be overwritten or deleted, and so all subsequent transactions must also be appended to the wal file. The work of the checkpointer is not wasted - SQLite remembers which parts of the wal file have already been copied into the db file so that the next checkpoint does not have to do so again
Probably because of this.
> but it does mean that the wal file may grow indefinitely if the checkpointer never gets a chance to finish without a writer appending to the wal file. There are also circumstances in which long-running readers may prevent a checkpointer from checkpointing the entire wal file - also causing the wal file to grow indefinitely in a busy system.
> Wal2 mode does not have this problem. In wal2 mode, wal files do not grow indefinitely even if the checkpointer never has a chance to finish uninterrupted.
I don't get how wal2 fixes the long-running reader problem though. Maybe they were just referring to the former problem?
I'm not much into the details, but I'm wondering: why not always write to "<database>-wal"? Once it is full, it is renamed to "<database>-wal-0" and writing starts again into a fresh "<database>-wal"; once that is full, it is renamed to "<database>-wal-1", and so on.
> Looks so logical that I don't understand why WAL mode was not implemented like this from the get go. Probably an optimization wrongly dismissed as premature?
While it has advantages, it is also more code, so more possible places for bugs to hide, and it has other disadvantages; hence it doesn't completely deprecate the original WAL mode.
Also, the advantages might not have been as commonly cared about in SQLite's earlier days. But SQLite is being used in more and more places, sometimes at larger scales or with more significant concurrency needs, and the core has been pretty darn stable for quite some time. All of these factors change the dynamics of what is worth committing the dev/testing time to, in terms of usefulness to the end users.
Now I can't help but wonder if there should be a `waln` mode where the WAL files would round-robin instead of alternating between just two, potentially allowing a much more intense write cadence.
It may simply have been a design oversight at the time. However, the page notes the following difference of WAL2 from WAL, which could be an issue in some cases:
> In wal mode, a checkpoint may be attempted at any time. In wal2 mode, the checkpointer has to wait until writers have switched to the "other" wal file before a checkpoint can take place.
After investigating, it looks like there are actually two separate things here: "bedrock" for accessing the sqlite database over the network and "page-level locking" for allowing concurrent write transactions.
Would it even be a problem? The inode of the wal file you delete will still be open, so it would still be processed normally when the DB closes. Just guessing, never tried that.
My understanding is that this technique is older than the linked implementation (though independently rediscovered), but notably, this implementation was written to support a different high concurrency SQL database (for some definition of that) called Noria.
Microsoft SQL Server uses a similar architecture [1], but instead of using separate log files, it allocates Virtual Log Files (VLF) inside a physical (on-disk) log file. VLFs are allocated from a ring buffer, and apparently there can be several thousand of them before things start to break.
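To make the ring-buffer idea concrete, here's a toy sketch of VLF-style allocation. This is not SQL Server's actual on-disk format — the slot size, names, and the error are all invented — but it shows the characteristic tradeoff: the log never grows, and instead writes must stall when checkpointing falls behind and no slot is reusable.

```python
# Toy sketch of virtual-log-files in a ring: one fixed-size "physical"
# log carved into slots that are reused once checkpointed. Invented
# names/sizes -- not SQL Server's real format.

VLF_SIZE = 4  # records per virtual log file, tiny for illustration


class VlfRing:
    def __init__(self, n_vlfs):
        self.slots = [[] for _ in range(n_vlfs)]  # the "physical" log
        self.active = 0                           # VLF being appended to
        self.reusable = [True] * n_vlfs           # checkpointed -> reusable

    def append(self, rec):
        if len(self.slots[self.active]) >= VLF_SIZE:
            nxt = (self.active + 1) % len(self.slots)
            if not self.reusable[nxt]:
                # The fixed-size tradeoff: block instead of growing.
                raise RuntimeError("log full: checkpoint has fallen behind")
            self.slots[nxt].clear()
            self.active = nxt
        self.reusable[self.active] = False
        self.slots[self.active].append(rec)

    def checkpoint(self):
        # Mark every non-active VLF as applied, hence reusable.
        for i in range(len(self.slots)):
            if i != self.active:
                self.reusable[i] = True
```

Note that wal2's two alternating files are essentially this ring with n = 2, except SQLite lets the active wal keep growing instead of refusing the write.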
The design of SQL Server has grown increasingly interesting to me over the years. I still prefer SQLite for personal projects, but as the size of our team and of our average customer goes up, I am looking for something that others can reason about if need be. Ideally, something that I can throw over the fence without having to answer a bunch of confused questions about pragmas and quirky connection-reuse code.
This WAL2 feature is a perfect example of a new kind of concern I have. SQLite has a really competent facility for handling write-ahead today, but it has these edge cases where it may fail under adverse (but totally plausible) scenarios. I haven't yet had a completely corrupted SQLite database, but I have had one incident on a QA server where I had to delete the WAL/SHM files to get the database to work again.
> As far as I can tell, this feature has not yet been released.
It's been off-trunk since its inception in Oct. 2017 and there's been no discussion within the project of merging it into trunk (why that is, I cannot speculate). It is actively maintained for use with the bedrock branch, as can be seen in the wal2 branch's timeline.
Tangential point: one thing that always bothered me about WAL is that it is supposed to exist to help maintain data integrity, recover from crashes, etc., but that file itself is written (committed to disk reliably) in batches, not after every change to the database, apparently to gain performance. Doesn't that defeat the purpose? How haven't things broken down despite this? Not specific to sqlite, but databases in general. I never found an answer to this.
I think that depends on the setting of PRAGMA synchronous.
I'm not an expert on this, but I think the idea is to separate durability from db corruption. When synchronous = normal instead of full, you can potentially lose (committed) data in WAL mode if a power failure happens at just the right moment; however, your database won't be corrupt. No data will be half-written. Each transaction will either be fully there or fully missing.
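For the curious, here's what those pragmas look like with Python's built-in sqlite3 module — stock WAL mode, since wal2 isn't in released builds; the temp-file path is just a scratch location:

```python
import os
import sqlite3
import tempfile

# WAL needs a real file (in-memory databases ignore journal_mode=WAL).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)

# In WAL mode, a commit only has to flush the log file, not the main db.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# NORMAL: sync the wal only around checkpoints. A power cut may drop the
# last few committed transactions, but the database stays consistent.
con.execute("PRAGMA synchronous=NORMAL")

con.execute("CREATE TABLE t(x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

# The commit landed in demo.db-wal alongside the database file.
wal_exists = os.path.exists(path + "-wal")
con.close()
```

With synchronous=FULL instead, the wal is synced on every commit, trading throughput for durability of each individual transaction.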
In a high-level sense, yes! But it kind of depends on which part of the design pattern you mean by "double buffering for databases".
With double-buffering (2d/3d graphics) you are literally writing the final pixel-level data to the back buffer.[1]
In a database WAL scenario, to further analogize, it's more like you are writing the 2d/3d graphics commands to the buffer and executing them later. Because that is part of the point of the WAL -- it results in reduced disk writes because only the log file needs to be flushed to disk to guarantee a transaction is committed, rather than every data file/byte(/pixel) changed by the transaction.[2][3] (The WAL content is loosely a bit more like 3D (or 2D) vertex buffer objects/display lists [4] if you are familiar with those.)
Swapping the two WAL files, though, and alternating writes between them is, yes, like double buffering.
A third similar design pattern (to WALs) is used in operating systems' journaling filesystems[5] and actually was a contribution from OSes adopting database WAL techniques back in the 1990s.
I wonder how this would compare to just punching a hole in the already applied section of the WAL (1) file. It seems that this should allow the filesystem to free the disk space similarly to if the file had been deleted while allowing the writers to keep writing. I guess eventually it would become a problem as you hit max logical file size but it seems that you can likely bet on getting a chance to truncate it before hitting exabytes.
How common is an automatically expanding WAL across other DBs? I'm most familiar with MySQL which uses a fixed size WAL (unless something changed recently). That of course comes with other tradeoffs like potentially blocking writes if checkpointing falls behind. But I'm curious if SQLite is an oddity in this respect compared to other DBs. Since it is used in embedded contexts it might prefer to save less data on disk in the common case, but with the edge case that wal2 mode now fixes.
For example, RocksDB/LevelDB have done this from the very beginning. If a WAL file gets too long then a new one is created and the old one is sorted and written to an sstable asynchronously.
foofie | 2 years ago
https://en.wikipedia.org/wiki/Log-structured_merge-tree
forinti | 2 years ago
You don't have that with sqlite, so I don't see an obvious advantage to this, unless they now spawn a process or thread to do this concurrently.
Edit: so I read the doc (shame on me) and it has nothing to do with speed. Its purpose is to prevent a wal file from growing too large.
alberth | 2 years ago
Bedrock is the more interesting branch.
It’s WAL2 + CONCURRENT
It’s also the branch Expensify uses to scale to 4M QPS on a single node (6 years ago):
https://sqlite.org/src/timeline?r=bedrock
https://use.expensify.com/blog/scaling-sqlite-to-4m-qps-on-a...
HALtheWise | 2 years ago
https://sqlite.org/src/doc/754ad35c/README-server-edition.ht...
I assume you could use the latter without the former.
vdaea | 2 years ago
Heh, I wonder how many people will delete the "wal" file thinking that, since they switched to wal2, the wal file must be a leftover.
maxbond | 2 years ago
https://docs.rs/left-right/latest/left_right/
magnat | 2 years ago
[1] https://learn.microsoft.com/en-us/sql/relational-databases/s...
sgbeal | 2 years ago
https://sqlite.org/src/timeline?r=wal2
bawolff | 2 years ago
https://www.sqlite.org/pragma.html#pragma_synchronous
eatonphil | 2 years ago
Since SQLite is single-writer, I'm not sure it does this. But this (batch, yet block) is how I understood Postgres to work.
Of course, you can turn off the blocking too by setting the Postgres fsync configuration to an interval rather than synchronous.
gregw2 | 2 years ago
Apologies if you know all this.
[1] https://en.wikipedia.org/wiki/Multiple_buffering#Double_buff...
[2] https://www.postgresql.org/docs/15/wal-intro.html
[3] https://en.wikipedia.org/wiki/Write-ahead_logging
[4] https://en.wikipedia.org/wiki/Display_list
[5] https://en.wikipedia.org/wiki/Journaling_file_system
spiffytech | 2 years ago
https://sqlite.org/hctree/doc/hctree/doc/hctree/threadtest.w...
alberth | 2 years ago
https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html#s...