item 24757333

Zheap – Reinvented PostgreSQL Storage

263 points | PhilipTrauner | 5 years ago | cybertec-postgresql.github.io

69 comments

[+] malisper|5 years ago|reply
For those unaware, Zheap is a new storage engine for Postgres that handles updates and deletes in a different way. Currently, when you update a row in Postgres, Postgres creates a new copy of the row and marks the old one as deleted. This is done for several reasons, specifically it makes handling concurrency easier and it makes rolling back transactions easier.

The issue with this approach is over time this leads to lots of "dead rows" - deleted rows that are taking up space in your DB. Postgres has a background job called the "vacuum" which effectively garbage collects all deleted rows. Depending on your settings, the vacuum can be a pretty expensive job and even with the vacuum, you can still wind up using a lot more space than you actually need.
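The append-and-mark-dead behavior described above can be sketched as a toy Python model (my illustration, not actual Postgres internals): every UPDATE appends a new tuple version, so the table grows until a vacuum pass reclaims the dead versions.

```python
# Toy model of append-style MVCC: an UPDATE appends a new tuple
# version and marks the old one dead; the heap only shrinks when
# vacuum() garbage-collects the dead versions.

class ToyHeap:
    def __init__(self):
        self.tuples = []  # each tuple is [key, value, dead_flag]

    def insert(self, key, value):
        self.tuples.append([key, value, False])

    def update(self, key, value):
        for t in self.tuples:
            if t[0] == key and not t[2]:
                t[2] = True                      # mark old version dead
        self.tuples.append([key, value, False])  # append new version

    def vacuum(self):
        # garbage-collect all dead versions
        self.tuples = [t for t in self.tuples if not t[2]]

heap = ToyHeap()
heap.insert("x", 0)
for i in range(1, 101):     # update the same row 100 times
    heap.update("x", i)

print(len(heap.tuples))     # 101 versions on disk: 1 live + 100 dead
heap.vacuum()
print(len(heap.tuples))     # 1 after vacuum
```

One update of one row costs one extra tuple until vacuum runs, which is why update-heavy workloads can bloat far beyond their live data size.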

Zheap addresses these problems by using a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres will replace the old row with the new one and write a copy of the old row to a separate file. This means the main table file doesn't need to be larger than necessary to keep track of the dead rows.

Zheap does lead to lots of tricky scenarios. If for any reason you need to access the old copy of the row, you have to fetch it from the separate file. If the transaction that performed that update is rolled back, you need to replace the new version of the row with the old version of the row. This sounds straightforward, but gets really tricky really fast. For example, what happens if I have row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b. If you want to rollback the original transaction, row X will no longer fit in its original spot because row Z is now taking up part of the space it used to occupy.
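The rollback problem in the last paragraph can be made concrete with a toy model (a hypothetical bump-allocated page, nothing like zheap's real layout): rows live at byte offsets in a fixed-size page, updates happen in place with the old image saved to an undo log, and a later insert can steal the space a rolled-back row needs to grow back into.

```python
# Toy sketch of the X/Y/Z rollback scenario: shrink row X (1 kB) to
# Y (500 B) in place, insert Z (500 B) into the freed space, then try
# to roll back and restore X's full 1 kB image.

class ToyPage:
    def __init__(self):
        self.rows = {}        # row_id -> (offset, size)
        self.free_start = 0   # next free byte (naive bump allocator)
        self.undo = []        # undo log: (row_id, old_offset, old_size)

    def insert(self, row_id, size):
        self.rows[row_id] = (self.free_start, size)
        self.free_start += size

    def update_in_place(self, row_id, new_size):
        off, old_size = self.rows[row_id]
        self.undo.append((row_id, off, old_size))  # save old image
        self.rows[row_id] = (off, new_size)        # overwrite in place
        if new_size < old_size:
            # toy shortcut: hand the freed tail back to the allocator
            self.free_start = off + new_size

    def rollback(self):
        row_id, off, old_size = self.undo.pop()
        end = off + old_size
        # does the old image still fit in its original slot?
        for other, (o, s) in self.rows.items():
            if other != row_id and o < end and off < o + s:
                return False   # another row now overlaps the old space
        self.rows[row_id] = (off, old_size)
        return True

page = ToyPage()
page.insert("X", 1000)            # row X: 1 kB at offset 0
page.update_in_place("X", 500)    # replace with row Y: 500 B, in place
page.insert("Z", 500)             # row Z lands right after Y
print(page.rollback())            # False: X no longer fits in its slot
```

A real implementation has to handle this by relocating the restored tuple or reorganizing the page, which is part of why in-place MVCC gets tricky fast.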

[+] castorp|5 years ago|reply
> Zheap does lead to lots of tricky scenarios. If for any reason you need to access the old copy of the row, you have to fetch it from the separate file. If the transaction that performed that update is rolled back, you need to replace the new version of the row with the old version of the row. This sounds straightforward, but gets really tricky really fast.

This is pretty much what Oracle is doing.

[+] stingraycharles|5 years ago|reply
Thanks for this. So I'm fairly familiar with how MVCC relates to CoW, and the way I read this is that it's kind of the inverse of CoW: it's an in-place update plus a copy of the old data.

I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.

It's a very interesting approach though. Once all the corner cases are solved and dealt with, I wonder: are there any reasons not to choose zheap over the standard storage engine?

[+] comboy|5 years ago|reply
> row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b.

I thought there was a fixed amount of space allocated per row, with variable-length blobs stored elsewhere. Is my understanding wrong?

[+] jmnicolas|5 years ago|reply
Your write-up (or something similar) should be on their front page. Most projects assume you already know everything about them.

I don't remember the name but there was a NoSQL DB on HN a couple weeks ago. I had to Google and read different websites for a solid 10 minutes before I got that it was indeed a NoSQL DB and what was the differentiating factor from other DBs.

[+] aidos|5 years ago|reply
Oof. That sounds super hairy. Don’t you run into all sorts of weirdness with things like bitmaps expecting to find a specific version of a tuple at a specific location in the heap?
[+] sixdimensional|5 years ago|reply
This sounds like temporal tables, does it not - except deep inside the storage engine? The historical copy goes to a separate table, the current row stays in the current table.

I wonder if there is a logical overlap in the ideas between temporal table handling and this low level storage engine optimization.

[+] pbalau|5 years ago|reply
> I have row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b. If you want to rollback the original transaction, row X will no longer fit in its original spot because row Z is now taking up part of the space it used to occupy.

But this can only happen if replacing X with Y and adding Z happen in the same transaction. If you roll back the transaction, then you start by removing Z, then replacing Y with X. What am I missing here?

[+] abhinav22|5 years ago|reply
Thanks for the explanation. In layman terms how much is the benefit from separating the old rows from the database vs the cost of accessing it from a separate file?

Just curious why it's being done now, as it sounds like a major design decision that would have been considered a long time ago.

[+] magicalhippo|5 years ago|reply
> replace the old row with the new one and write a copy of the old row to a separate file

Does it do this in the reverse order? If not, how does it prevent data loss in case the main table update succeeds but writing to the "old copy" table fails?

[+] arthurcolle|5 years ago|reply
Have you stress tested it? How does this end up faring with large scale deployments?
[+] jacques_chester|5 years ago|reply
Confusingly, there is another new PostgreSQL storage effort called "Zedstore": https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5...

Disclosure: I work for VMware, who are sponsoring zedstore development via Greenplum.

[+] darksaints|5 years ago|reply
Are you working on zedstore? It is one of the postgres developments that I am anticipating the most. cstore_fdw looked pretty cool but it never lived up to the hype and had a lot of not so intuitive limitations. A storage engine built using the new table access APIs seems like the right foundation.

Any word on when you are expecting stability? I'd love to see this in RDS.

[+] throwdbaaway|5 years ago|reply
> What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.

I would be wary of this. InnoDB, for example, also has an optimistic (in-place) UPDATE mode and a pessimistic UPDATE mode.

Repeatedly updating the same row under the pessimistic mode would end up stalling the database, even at a rather low QPS.

https://bugs.mysql.com/bug.php?id=53825 was originally reported by Facebook 10 years ago, and is still not fixed in 8.0.21 with VATS / CATS scheduling.

And then there is also the performance gap between the ideal case where undo log is still in memory, versus the pathological case where undo log needs to be fetched from disk.

The last thing postgres needs is something that looks good on paper / in benchmark, but has a bunch of issues in production.

[+] sega_sai|5 years ago|reply
It is nice that PG got another storage engine. I hope that will ease the implementation of column-based storage in official PG, as row-based storage is a major problem with very large datasets.
[+] ksec|5 years ago|reply
When can we expect Zheap to land? 14?

With the new pluggable storage API, are there any other new storage engines besides Zheap?

[+] macdice|5 years ago|reply
Great to see Antonin and others working on this! I did some work on some pieces of this (along with many others), and I gave a talk about an earlier prototype of the undo log machinery and how it fits at PGCon 2019 (which now seems like a whole lifetime ago!). Slides and recording here in case the background is interesting:

https://speakerdeck.com/macdice/transactions-in-postgresql-a...

[+] teej|5 years ago|reply
This seems neat!

Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turned out that having slightly less I/O capacity than needed caused the auto-VACUUM process to fall further and further behind each time it ran.

My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.

[+] andreypopp|5 years ago|reply
Such workflows with a scheduled DELETE/INSERT often mean that the data is "derived"; PostgreSQL has an unlogged tables feature for that. Tables configured as unlogged are not written to the WAL and thus generate much less I/O. The downside is that after a crash the table might be empty (until it is repopulated by the DELETE/INSERT workflow again).
[+] merb|5 years ago|reply
Well, there is TRUNCATE for that (not MVCC safe).
[+] yayzheap|5 years ago|reply
This should solve one of the problems Uber had with PostgreSQL, described by Christophe Pettus back in 2017[1].

1- https://thebuild.com/presentations/uber-perconalive-2017.pdf

Previous discussion from 2018 https://news.ycombinator.com/item?id=16526623

[+] petergeoghegan|5 years ago|reply
I am cautiously optimistic that my own recent work on version churn in B-Tree indexes will go a long way towards fixing those problems:

https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFAST...

This can be used without changing anything in the heap. It's less than a thousand lines of C. You could say that it's complementary to zheap, actually.

zheap makes it possible to update the same row many times without requiring new index entries, even when there is a long running transaction that holds back VACUUM. However, it does not avoid the problem of requiring a whole new set of index entries for all indexes in the event of even one indexed column being modified by updates.

Strictly speaking my patch doesn't "fix" that problem, either, but it comes pretty close. It teaches the indexes to "fight back" against version churn caused by updates that cannot use the HOT optimization. It makes non-HOT updates responsible for cleaning up their own mess -- no more negative externalities. In practice this seems to more or less fix the exact thing that the Uber blog post complained about, which was the "write amplification" incurred in all indexes when only one indexed column was changed by an update.

(I am omitting some subtleties here, but that's the general thrust of it.)

[+] AdamProut|5 years ago|reply
I wonder how this approach will compare with an LSM tree style index that has gained popularity recently (mostly via RocksDB). LSM trees are also write optimized and allow for some optimizations by batching up updates and merging them in bulk.
[+] ronlobo|5 years ago|reply
Look no further, after playing around with TimescaleDB, there are no updates anymore and we can put this problem into a vacuum :)