We've found that periodic invocations of pg_repack are necessary to keep things from bloating beyond the inflection point where vacuuming just can't keep up. I'd love to see VACUUM somehow able to take care of that, because the WAL volume of repacking multi-terabyte databases — even when you're just targeting the worst culprit tables — is absurd (especially when you're keeping all your WALs around forever, for PITR purposes). I'm just not sure what more it can do.
A set of patches to support REINDEX CONCURRENTLY has been floating around the community for a couple of years now, and there have been discussions about a CLUSTER CONCURRENTLY, but nothing has materialized. Pushing those patches forward would make sense: many users are asking for the ability to do tuple reordering without holding an exclusive lock on the relations involved.
Where do things stand with vacuum in Redshift, which was derived from Postgres?
Which version of the Postgres vacuum implementation is it derived from? Have they implemented or incorporated any of these improvements since their fork? (Obviously Redshift doesn't need the btree index ones, since it uses distribution keys and zone maps.)
I ask because, in my experience, Redshift's vacuum implementation is awful on large, heavily used data warehouses, and I wonder if there is any hope for improvement.
Yeah, true. Even though Redshift is based on Postgres, it does not leverage Postgres's automatic vacuuming facility.
So the vacuuming of tables to recover disk space from deleted rows needs to be done manually.
There are 3 types of vacuum in Redshift (DELETE ONLY, SORT, REINDEX). The vacuum SORT operation is done on tables that have a sort key. Because a vacuum SORT is an expensive (high-IO) operation, we recommend, when possible, avoiding the need to vacuum by loading rows in sort order. If you do that, you won't need to vacuum the table at all, which is the optimal solution for very long tables.
When dealing with a very long table (>10b rows) that has a sort key, it's useful to partition it into multiple tables whose names indicate the partition. Then, when pruning the table, you can simply drop old partitions instead of running VACUUM DELETE ONLY.
The key is to vacuum on a regular basis and keep your stats_off below 10%. If you wait too long, you might end up in a place where you have to resize your cluster or do a deep copy.
One more thing: Run your vacuum jobs in the queue with the most memory.
Amazon Redshift is not actually based on Postgres; it just supports a Postgres interface. It is built on top of technology from the massively parallel processing (MPP) data warehouse company ParAccel, which was acquired by Actian and licensed by Amazon.
A number of GC and other resource-reclamation systems, especially soft real-time ones, have worked on charging the cleanup costs to the parts of the system that make the mess in the first place.
In these systems, a function that is careful with resource allocation will almost never experience a significant slowdown due to the background processing, because it pays either no cost or an amortized cost. For instance, one collector I knew of would garbage-collect up to 10 objects on every allocation. If large allocations are front-loaded, then a full collection cycle almost never happens. In Swift, C++, or Rust, you pay for your allocations as soon as they go out of scope.
In the context of an MVCC system, running a long query causes considerably more versioned data to pile up. If part of the cost of retiring that data were paid at the end of the transaction, several things would happen. One, other transactions would be less impacted by the overhead. Two, you would end up throttling the bad actors, which encourages the problem to be fixed; in the meantime, the high-water mark for vacuuming is reduced.
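The amortization scheme described above can be sketched in a few lines of Python. This is a toy model with hypothetical names, not any particular collector's algorithm: releasing an object is cheap (it just joins a dead list), and every allocation pays down the cleanup debt by reclaiming a bounded batch.

```python
from collections import deque

class AmortizedPool:
    """Toy model of pay-as-you-go cleanup: each allocation also
    reclaims up to BATCH previously released objects, so a full
    stop-the-world collection cycle is rarely needed."""
    BATCH = 10

    def __init__(self):
        self.dead = deque()   # objects waiting to be reclaimed
        self.reclaimed = 0

    def alloc(self, obj):
        # Amortized cleanup: reclaim a bounded batch before allocating.
        for _ in range(min(self.BATCH, len(self.dead))):
            self.dead.popleft()
            self.reclaimed += 1
        return obj

    def release(self, obj):
        self.dead.append(obj)  # cheap: defer the real cleanup

pool = AmortizedPool()
for i in range(100):
    pool.release(i)        # a burst of garbage piles up
for i in range(5):
    pool.alloc(i)          # 5 allocations reclaim 5 * 10 = 50 objects
print(len(pool.dead))      # 50
```

The throttling effect falls out naturally: the code doing the allocating (making the mess) is the code paying the reclamation cost.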
What's slow is stuff like Java GC Eden management, where deletes are an afterthought to what amounts to a time-stamped LRU reference-management server, and a lot more besides.
So, no. I don't think you can compare C++ with systems that run full-fledged GCs and treat them as being in the same ballpark.
Is it really necessary to have a vacuum procedure?
Isn't it possible to just use a data structure to index rows by transaction ids, and on each transaction commit efficiently find all rows that aren't visible to any transaction, and add them to a free list?
Seems kind of a bad design to rely on periodic full data scans.
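The proposal can be sketched roughly like this. It's a naive illustration of the idea (all names hypothetical), using the oldest active transaction ID as the reclamation horizon; the replies explain why real snapshot visibility, contention, and paging make this much harder than it looks.

```python
import heapq

class VersionTracker:
    """Naive sketch of the proposed design: every deleted row version
    is registered in one shared structure keyed by the txid that
    deleted it. On each commit, versions older than every active
    transaction are moved to a free list."""

    def __init__(self):
        self.active = set()   # txids of in-flight transactions
        self.dead = []        # min-heap of (deleter_txid, row)
        self.free_list = []

    def begin(self, txid):
        self.active.add(txid)

    def delete_row(self, txid, row):
        heapq.heappush(self.dead, (txid, row))

    def commit(self, txid):
        self.active.discard(txid)
        # Naive horizon: reclaim anything deleted by a txid older than
        # every still-active transaction. Real MVCC visibility (which
        # snapshot each transaction took, and when) is hairier.
        horizon = min(self.active, default=float("inf"))
        while self.dead and self.dead[0][0] < horizon:
            _, row = heapq.heappop(self.dead)
            self.free_list.append(row)

tracker = VersionTracker()
tracker.begin(1)
tracker.begin(2)
tracker.delete_row(1, "row-a")   # tx 1 deletes a row version
tracker.commit(1)                # tx 2 is still active
print(tracker.free_list)         # ['row-a']
```

Note that every begin, delete, and commit writes to the same shared structure, which is exactly the contention problem raised below.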
Efficient resource recovery while preserving consistency and minimally impacting workload throughput is one of THE central problems in database engine design. You can move the pain around but it never goes away, and a poor solution will very negatively impact the throughput of normal workloads. People have been thinking about this problem for as long as we've been building database engines.
This is not to say vacuum-like mechanisms are necessarily the best method but it was a common architectural idiom for database engines from the 1990s because it works well with sequential storage devices. Newer techniques tend to put resource recovery inline with the workload rather than outside of it but that has other disadvantages.
Indexing all rows in a database by transaction ID would cause an extreme loss of throughput. It creates a continuously mutated (and therefore lock-contended) structure that every thread constantly uses and that is also paged to disk. And deletion of individual records is a problem for indexes generally.
In general, when a respected team sticks with a design for a long time it's safe to assume it's not a bad design but that the problem is harder than it might first appear.
In this case, I'd want to think carefully about the data volume: many places may have hundreds of transactions per second but also have queries which run for much longer periods of time, and the visible state has to be accurate for all of them. That sounds like a lot of contention for that shared data structure, and “efficiently find all rows that aren't visible to any transaction” sounds decidedly non-trivial for busy servers with lots of data, which are generally the only ones where this matters.
Sadly it's a bit more complex than that. Because other currently running transactions may hold a view of the database as it was before the transaction committed, what can actually be reclaimed depends on which transactions are currently active. Thus, upon commit, some tuples released by the committing transaction may still be visible to others.
Vacuum actually works by looking for tuples that no running transaction can see anymore. Postgres, in effect, stores for each tuple the minimum and maximum transaction IDs for which it is visible, and vacuum scans for tuples whose maximum visible transaction ID is set (meaning the tuple is a candidate for reclamation) and is less than the minimum active transaction ID.
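A rough model of that rule in Python (field names echo Postgres's xmin/xmax, but this is an illustration, not the real heap tuple format):

```python
def reclaimable(tuples, active_xids):
    """A dead tuple can be reclaimed once the transaction that
    deleted it (xmax) is older than every still-running transaction,
    i.e. no possible snapshot can still see it."""
    oldest_active = min(active_xids, default=float("inf"))
    return [t for t in tuples
            if t["xmax"] is not None and t["xmax"] < oldest_active]

tuples = [
    {"id": 1, "xmin": 5,  "xmax": 20},    # deleted by tx 20: reclaimable
    {"id": 2, "xmin": 8,  "xmax": None},  # still live
    {"id": 3, "xmin": 10, "xmax": 60},    # deleted, but tx 50 may still see it
]
print([t["id"] for t in reclaimable(tuples, active_xids={50, 70})])  # [1]
```

This is also why one long-running transaction holds back reclamation for the whole table: it pins the "oldest active" horizon.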
> Seems kind of a bad design to rely on periodic full data scans.
Note that it's not full scans: only pages that have been modified since the last vacuum, or that were in a state during the last vacuum such that they couldn't be processed, are vacuumed again. So there essentially is a block-level index for this: https://www.postgresql.org/docs/current/static/storage-vm.ht...
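The skipping logic amounts to something like this (illustration only; the real visibility map is a bitmap stored in a separate fork of the relation):

```python
def pages_to_vacuum(all_visible_bits):
    """One bit per heap page says "every tuple on this page is
    visible to all transactions"; vacuum skips those pages and only
    visits the rest."""
    return [page for page, all_visible in enumerate(all_visible_bits)
            if not all_visible]

# Only pages 1 and 4 were dirtied since the last vacuum:
print(pages_to_vacuum([True, False, True, True, False]))  # [1, 4]
```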
This is essentially what InnoDB does. It calls the operation "purge", as in purging dead MVCC rows, and it calls the data structure the history list (IIRC). I don't know what the exact data structure is, though, or how it avoids contention when finding purgeable rows. The main difference is that InnoDB runs purge in background thread(s) rather than blocking foreground operations.
It has potential pitfalls too, though. On a system with a very high mutation rate, purging can fall behind. This mainly happens when the history list overflows the available buffer pool space and starts getting paged to disk (e.g. because of one very long-lived transaction). I don't know if newer MySQL versions make purge running off disk more efficient, but 5.5 added a config option for multiple purge threads.
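The background-purge idea can be sketched like this (a generic producer/consumer toy, not InnoDB's actual implementation):

```python
import queue
import threading

class PurgeWorker:
    """Foreground transactions only append dead versions to a history
    list; a dedicated background thread drains it, so cleanup never
    blocks the foreground path. If producers outrun the purge thread,
    the history list grows: the "purge lag" pitfall."""

    def __init__(self):
        self.history = queue.Queue()   # the "history list"
        self.purged = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def record_dead_version(self, version):
        self.history.put(version)      # cheap, non-blocking append

    def _run(self):
        # Keep purging until asked to stop AND the backlog is drained.
        while not self._stop.is_set() or not self.history.empty():
            try:
                self.history.get(timeout=0.05)
                self.purged += 1
            except queue.Empty:
                pass

    def shutdown(self):
        self._stop.set()
        self._thread.join()

w = PurgeWorker()
for v in range(1000):
    w.record_dead_version(v)
w.shutdown()
print(w.purged)  # 1000
```

Multiple purge threads, as added in MySQL 5.5, would correspond to several consumers draining the same queue.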
Many things can seem this way from the armchair, but you can extend some benefit of the doubt. Postgres is worked on by very smart people, and it's not the only system that works this way. Everything has tradeoffs.
Wouldn't a write lock on that structure become a hotly contended resource? Globally shared, globally writable structures tend not to scale very well.
Postgres already attaches creating and destroying transaction IDs to each row version. That distributes your "structure" across the rows involved in it. Deleting a row version (whether by deleting the row, or by updating it, which in MVCC means synchronously marking the old version dead and inserting a new one) doesn't mean that version isn't still visible to other concurrent activity, specifically including read-only activity. You'd therefore have to write to this proposed structure on completing every read or write.
Well, this might not answer your question directly, but it shows how VACUUM solves problems such as transaction visibility. It's a great video and covers the topic in depth: https://www.youtube.com/watch?v=ZxhBkBNxvR0
Why isn’t the rate of vacuuming in Postgres determined by the rate of churn instead of things like wall clock time?
These weights and heuristics are going to either punish small databases or sabotage large ones, unless they factor in a predictor of how much there is to vacuum.

It is determined by the rate of churn.
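For reference, autovacuum's trigger condition is in fact churn-based rather than wall-clock-based: with the default settings, a table is queued for vacuuming once its dead tuples exceed a fixed threshold plus a fraction of the table's size.

```python
def autovacuum_triggered(n_dead_tuples, reltuples,
                         threshold=50, scale_factor=0.2):
    """Postgres's autovacuum trigger (defaults shown): vacuum a table
    once
        n_dead_tuples > autovacuum_vacuum_threshold
                        + autovacuum_vacuum_scale_factor * reltuples
    Both parameters can be overridden per table, which is how you
    avoid punishing small tables or sabotaging large ones."""
    return n_dead_tuples > threshold + scale_factor * reltuples

# A 1M-row table isn't vacuumed until ~200k dead tuples accumulate:
print(autovacuum_triggered(150_000, 1_000_000))  # False
print(autovacuum_triggered(250_000, 1_000_000))  # True
```

For very large tables the default 20% scale factor is often too lax, which is why per-table `autovacuum_vacuum_scale_factor` overrides are common.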
Edit: This is above my technical paygrade so I'll let others be the judge of whether these are in fact related! I believed so, from my earlier readings about the issue. Here's an article from the same blog in the OP about the Uber migration: http://rhaas.blogspot.in/2016/08/ubers-move-away-from-postgr...
Vacuuming is not why Uber flounced off PostgreSQL. They were naïvely using it as a KV store, and indexing every column, which caused a write-amplification cascade.
Edit, re: the parent's edit: Quoting that article, "Perhaps the thorniest problem which Uber raises is that of write amplification caused by secondary index updates." The word "vacuum" is mentioned exactly once, in the context of keeping up with index insertions (writes).