top | item 39620486


Spiwux | 2 years ago

Calling Postgres experts:

Why, exactly, do we need to put a memory cache such as Redis in front of Postgres? Postgres has its own in-memory cache that it updates on reads and writes, right? What makes Postgres' cache so much worse than a dedicated Redis?


tetha|2 years ago

Postgres can develop problematic behavior if you have high churn tables - tables with lots of deletes on them.

If you have many inserts and deletes on a table, the table will build up dead tuples (tombstones) and postgres will eventually be forced to vacuum it. This doesn't block normal operation, but auto vacuums on large tables can be resource intensive - especially on the storage/IO side. At worst, this turns into resource contention, so you either end up with a never-ending auto vacuum (because the vacuum can't keep up), or a severe performance impact on all queries on the system (and since this is your postgres-as-redis, there is a good chance all of the hot paths rely on the cache and get slowed down significantly).

Both of these result in different kinds of fun - either your applications just stop working because postgres is busy cleaning up, or you end up with some horrible table bloat in the future, which will take hours and hours of application downtime to fix, because your drives are fast, but not that fast.

There are ways to work around this, naturally. You could have an expiration key with an index on it, and do "select * from cache order by expiration_key desc limit 1", and throw pg_partman at it to partition the table based on the expiration key, and drop old values by dropping partitions and such... but at some point you start wondering if using a system meant for this kinda workload is easier.
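A minimal sketch of that workaround, using native range partitioning rather than pg_partman (all table, column, and key names here are made up for illustration):

```sql
-- Hypothetical cache table partitioned by expiry, so old entries are
-- removed by dropping whole partitions instead of DELETEs that leave
-- dead tuples behind for vacuum.
CREATE TABLE cache (
    key        text NOT NULL,
    value      jsonb,
    expires_at timestamptz NOT NULL
) PARTITION BY RANGE (expires_at);

CREATE TABLE cache_2024_06_01 PARTITION OF cache
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');

-- Reads filter out expired rows; the partition key keeps this cheap.
SELECT value FROM cache
 WHERE key = 'user:42' AND expires_at > now();

-- Expiry is a metadata operation, not row-by-row deletion, so it
-- creates no vacuum debt.
DROP TABLE cache_2024_06_01;
```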

jasonwatkinspdx|2 years ago

The buffer pool in a rdbms ends up intimately connected with the concurrency control and durability protocols. There's also a variety of tradeoffs in how to handle conflicts between transactions (steal vs no steal, force vs no force, etc). You need deadlock detection or prevention. That creates a necessary minimum of complexity and overhead.

By comparison an in memory kv cache is much more streamlined. They basically just need to move bytes from a hash table to a network socket as fast as possible, with no transactional concerns.

The semantics matter as well. PostgreSQL has to assume all data needs to be retained. Memcached can always just throw something away. Redis persistence is best effort with an explicit loss window. That has enormous practical implications for their internals.

So in practical terms this means they're in different universes performance-wise. If your workload is semantically compatible with a kv cache, adding memcached to your infrastructure will probably result in a net savings.

vbezhenar|2 years ago

Because Redis is almost infinitely scalable while Postgres is not. You have a limited vertical scalability budget for your database. The more things you put into your database, the more of that budget you spend on things that could be done elsewhere.

Sometimes that's fine, when your workload is not going to hit the limits of your available hardware.

But generally you should be prepared to move everything you can out of the database, so that the database does not spend CPU on things that could be computed on another machine. And cache is one of those things. If you can avoid hitting the database by hitting another server instead, that's a great thing to do.

Of course you should not prematurely optimize. Start simple, hit your database limits, then introduce a cache.

dpedu|2 years ago

How would the architecture in the OP mesh with master-slave postgres setups? If I write a cache item, how can I be certain the freshest entry is read back from the read-only slave? Can/do I pay a performance penalty on writes while waiting for it to be synchronized? Is it better, when it comes to caching, to ignore the slave and send all read/write cache-related queries to the master?

All of these questions go away or are greatly simplified with redis.
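For the write-penalty question specifically, Postgres does expose a knob that trades write latency for read-your-writes on a replica. A sketch, assuming a streaming-replication standby is configured (table and key names are made up):

```sql
-- Assumes a standby listed in synchronous_standby_names.
-- remote_apply makes COMMIT wait until the standby has *applied* the
-- WAL, so a subsequent read on the replica is guaranteed to see the
-- new row -- at the cost of slower writes on this session.
SET synchronous_commit = remote_apply;

INSERT INTO cache (key, value, expires_at)
VALUES ('user:42', '{"plan": "pro"}', now() + interval '5 minutes');
-- The commit now blocks until the standby has applied this change.
```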

watermelon0|2 years ago

They don't really go away, because if you need read-only replicas with PostgreSQL, there is a good chance that you will also need read-only replicas with Redis.

As with Postgres, Redis replication is async by default, which means that replicas can be out of sync for a brief period of time.

mjdijk|2 years ago

IME -- and I've just replaced a Postgres-only unlogged cache table with Redis -- it's not about the storage or caching, but about the locking. Postgres needs to acquire (or at least check) a lock for reading as well as writing. Although some optimizations have been made for mostly-read workloads (search for Postgres fast-path locking), you'll still run into lock contention problems relatively quickly.

renegade-otter|2 years ago

Add read replicas before doing cache "optimization", because cache timing bugs are a special kind of hell.

ww520|2 years ago

Machines used to have limited memory, so distributed caching could utilize many machines to form the overall cache. Nowadays machines have plenty of memory, numerous cores, and fast bandwidth. The need for a large network of cache servers has waned.

stonemetal12|2 years ago

Even though PG caches, it is still parsing, planning, and executing the query. It is like asking why a 3d render takes so long to produce an image when the same image saved to a PNG opens so much faster.

The article talks about using unlogged tables, which double write speed by forgoing the durability and safety of the WAL. It doesn't mention query speed because that is completely unaffected by the change.
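The setup being referred to is a single line of DDL; a sketch with assumed names:

```sql
-- An unlogged table skips the WAL entirely: writes are roughly twice
-- as fast, but the table is truncated after a crash and is not
-- replicated to standbys -- acceptable trade-offs for a cache.
-- Reads go through the same planner, executor, and buffer cache as
-- any other table, so query speed is unchanged.
CREATE UNLOGGED TABLE cache (
    key        text PRIMARY KEY,
    value      jsonb,
    expires_at timestamptz NOT NULL
);
```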

hamandcheese|2 years ago

As far as I know there is no way to tell Postgres to keep a particular index or table in memory, which is one reason to be weary of using one PG instance for many varied workloads. You might solve this by earmarking workload-specific replicas, though.

If you can keep your entire working set in memory, though, then it probably doesn't matter that much.
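The closest built-in tool is the pg_prewarm extension, which loads a relation into shared_buffers on demand but, as the comment says, cannot pin it there; normal eviction still applies. A sketch (the table name is an assumption):

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Loads the table's pages into shared_buffers and returns the number
-- of blocks read. The pages can still be evicted under memory
-- pressure later -- there is no way to pin them.
SELECT pg_prewarm('cache');
```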

gjvc|2 years ago

Being wary for too long can make one weary.

darth_avocado|2 years ago

Redis is completely in memory, therefore all data is in memory. Postgres, on the other hand, does have a cache of its own, but does not give you fine-grained control over what stays in it; that depends on data access patterns. E.g. I cannot force an entire table of my choosing to stay in cache.

kapperchino|2 years ago

Because you can’t scale out just the cache part of Postgres, and one machine can only have so much memory.

10000truths|2 years ago

If you have a second machine, why not just put a Postgres read replica on it? Letting the WAL deal with replica consistency is much simpler than making the client responsible for keeping an external cache in sync, and you get the benefit of keeping everything in Postgres.

candiddevmike|2 years ago

Caching with postgres also lets you do cache invalidation via triggers!
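A sketch of what that could look like, pairing a trigger with LISTEN/NOTIFY so application listeners can evict stale entries; all table, function, and channel names here are made up:

```sql
-- Hypothetical: whenever a user row changes, notify listeners so they
-- can drop the corresponding cache entry.
CREATE OR REPLACE FUNCTION notify_cache_invalidation() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('cache_invalidation', OLD.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_cache_invalidation
    AFTER UPDATE OR DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION notify_cache_invalidation();

-- Application side: run LISTEN cache_invalidation; then evict the
-- cache key named in each notification payload.
```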

dvfjsdhgfv|2 years ago

> What makes Postgres' cache so much worse than a dedicated Redis?

It's not worse; this is just a cheap way to increase performance without having to scale the main instance vertically.

VWWHFSfQ|2 years ago

> Why, exactly, do we need to put a memory cache such as Redis in front of Postgres?

Maybe you don't want to run the same expensive queries all the time to serve your json API?

There's a million reasons you might want to cache expensive queries somewhere upstream from the actual database.