I like this article. Lots of comments are stating that they are "using it wrong" and I'm sure they are. However, it does help to contrast the much more common, "use Postgres for everything" type sentiment. It is pretty hard to use Postgres wrong for relational things in the sense that everyone knows about indexes and so on. But using something like L/N comes with a separate learning curve anyway - evidenced in this case by someone having to read comments in the Postgres source code itself. Then if it turns out that it cannot work for your situation it may be very hard to back away from as you may have tightly integrated it with your normal Postgres stuff.
I've landed on Postgres/ClickHouse/NATS since together they handle nearly any conceivable workload managing relational, columnar, messaging/streaming very well. It is also not painful at all to use as it is lightweight and fast/easy to spin up in a simple docker compose. Postgres is of course the core and you don't always need all three but compliment each other very well imo. This has been my "go to" for a while.
"use Postgres for everything" is certainly wrong, eventually. It's still the second-best choice for every new project, and most products will never see the traffic levels that justify using something more specialized. Obviously, recall.ai hit the level of traffic where Postgres was no longer ideal. I bet they don't regret it for the other parts of their product.
Actually LISTEN/NOTIFY does also not scale the other way. Immich also moved to that pg for everything mentality (trying to remove redis dependencies). The problem: postgres needs a WAL flush for all notifications. I ran immich on my HDD-NAS. The result was constant noise because the pg backed socket.io backend issues constant keep alive messages.
There’s no reason this article and start with Postgres for everything can’t be true.
In the beginning having fewer parts to connect and maintain lets the needs and bottlenecks of the actual application emerge.
If it was listen/notify in such a scenario at some volume where optimizing it isn’t in the cards… so be it. It would be some time down the road before sharding a function into a specific subsystem like what you described.
Appreciate learning about the Postgres/Clickhouse/nats combo. If there might be an article if the three together that you liked would be happy to read and learn.
Honestly whatever kind of DB you are speaking about always be wary of "niche/side features" which don't fit it's core design goals, they tend to have unexpected limitations.
listen/notify isn't necessary a replacement for redis or other pub/sub systems, redis pub/sub and similar isn't necessary a replacement for idk. Kafka or similar queue/messaging system
but a lot of companies have (for modern standards) surprisingly small amounts of data, very even a increase by 2,3,4x still isn't that big. In that case listen/notify and similar might just work fine :shrug:
also same is true the other way around, depending on you application you can go redis only, as long as you data volume stays small enough and needs for transactional/sync are reasonable simple enough (with watch+exec, NX,XX options etc. and maybe some redis side lua scripts you can do quite a lot for data synchronization). Issue with that is that stylistically redis data sync/transaction code is often much more similar to writing atomic data-structures then to SQL transactions, and even for SQL transactions there is a trend of devs severely overestimating what they provide, so often you are better of not touching on it when you can avoid it, also BTW. redis has something very similar to sqlite or Notify where "basically" (oversimplified by a lot) there is only one set of writes done at a time ;) (and then afterwards distributed to replicas), just that outside of some micro lua scripts you don't really run much logic outside of some NX, XX checks etc. so it's not blocking much and it's "more or less" all just in memory not touching a WAL (again oversimplified).
Largely agree. Functionality wise if you don't have many jobs, using the database as the queue is fine.
However, I've been in several situations where scaling the queue brings down the database, and therefore the app, and am thus of the opinion you probably shouldn't couple these systems too tightly.
Maybe throw in a dedicated key-value store like Redis or Valkey.
Oh and maybe something S3 compatible like MinIO, Garage or SeaweedFS for storing bunches of binary data.
With all of that, honestly it should cover most of the common workloads out there! Of course, depends on how specialized vs generic you like your software to be.
I’ve been meaning to check out NATS - I’ve tended to default to Redis for pubsub. What are the main advantages? I use clickhouse and Postgres extensively
> However, it does help to contrast the much more common, "use Postgres for everything" type sentiment.
I think sentiment is to use "for everything in 99% business cases", which involves few 100GB of data with some thousands QPS, and could be handled by PG very well.
I think PG could relax the ordering thing with NOTIFYs since... it seems a bit silly, but NOTIFYs already are unsafe to use because there is no authorization around channel access, so one might as well use change data capture (logical replication, basically) instead.
This kind of issue always comes up when people put business logic inside the database. Databases are for data. The data goes in and the data goes out, but the data does not get to decide what happens next based on itself. That's what application code is for.
Postgres LISTEN/NOTIFY was a consistent pain point for Oban (background job processing framework for Elixir) for a while. The payload size limitations and connection pooler issues alone would cause subtle breakage.
It was particularly ironic because Elixir has a fantastic distribution and pubsub story thanks to distributed Erlang. That’s much more commonly used in apps now compared to 5 or so years ago when 40-50% of apps didn’t weren’t clustered. Thanks to the rise of platforms like Fly that made it easier, and the decline of Heroku that made it nearly impossible.
Hey folks, I ran into similar scalability issues and ended up building a benchmark tool to analyze exactly how LISTEN/NOTIFY behaves as you scale up the number of listeners.
Turns out that all Postgres versions from 9.6 through current master scale linearly with the number of idle listeners — about 13 μs extra latency per connection. That adds up fast: with 1,000 idle listeners, a NOTIFY round-trip goes from ~0.4 ms to ~14 ms.
To better understand the bottlenecks, I wrote both a benchmark tool and a proof-of-concept patch that replaces the O(N) backend scan with a shared hash table for the single-listener case — and it brings latency down to near-O(1), even with thousands of listeners.
IMO LISTEN/NOTIFY is badly designed as an interface to begin with because there is no way to enforce access controls (who can notify; who can listen) nor is there any way to enforce payload content type (e.g., JSON). It's very unlike SQL to not have a `CREATE CHANNEL` and `GRANT` commands for dealing with authorization to listen/notify.
If you have authz then the lack of payload content type constraints becomes more tolerable, but if you add a `CREATE CHANNEL` you might as well add something there regarding payload types, or you might as well just make it so it has to always be JSON.
With a `CREATE CHANNEL` PG could provide:
- authz for listen
- authz for notify
- payload content type constraints
(maybe always JSON if you CREATE
the channel)
- select different serialization
semantics (to avoid this horrible,
no good, very bad locking behavior)
- backwards-compatibility for listen/
notify on non-created channels
Thanks for attacking this issue (even if still in a research phase, that's definitely a needed start).
I'm amused at how op brags about the huge scale at which they operate, but instead of even considering fixing the issue (both for themselves and for others), they just switched to something else for pubsub.
I'd be interested as to how dumb-ol' polling would compare here (the FOR UPDATE SKIP LOCKED method https://leontrolski.github.io/postgres-as-queue.html). One day I will set up some benchmarks as this is the kind of thing people argue about a lot without much evidence either way.
My colleague did some internal benchmarking and found that LISTEN/NOTIFY performs well under low to moderate load, but doesn't scale well with a large number of listeners. Our findings were pretty consistent with this blog post.
(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.
Would love to hear feedback from you and others building similar systems.
Polling is the way to go, but it's also very tricky to get right. In particular, it's non-trivial to make a reliable queue that's also fast when transactions are held open and vacuum isn't able to clean tuples. E.g. "get the first available tuple" might have to skip over 1000s of dead tuples.
Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction part of the time too. It's also nice if your database doesn't melt whenever this happens on accident.
I have implemented polling against a cluster of mixed mariadb/mysql databases which do not offer listen/notify. It was a pain in the neck to get right.
- The batch size needs to be adaptative for performance, latency, and recovering smoothly after downtime.
- The polling timeouts, frequency etc the same.
- You need to avoid hysteresis.
- You want to be super careful about not disturbing the main application by placing heavy load on the database or accidentally locking tables/rows
- You likely want multiple distributed workers in case of a network partition to keep handling events
It’s hard to get right especially when the databases at the time did not support SKIP LOCKED.
In retrospect I wish I had listened to the WAL. Much easier.
I use polling with back off up to one minute. So when a workload is done, it immediately polls for more work. If nothing found, wait for 5 seconds, still nothing 10 seconds, ... until one minute and from then on it polls every minute until it finds work again and the back off timer resets to 0 again.
With that experience behind you, would you have feedback for Chancy[1]? It aims to be a batteries-included offering for postgres+python, aiming for hundreds of millions of jobs a day, not massive horizontal worker scaling.
It both polls (configurable per queue) and supports listen/notify simply to inform workers that it can wake up early to trigger polling, and this can be turned off globally with a notifications=false flag.
Instead of LISTEN/NOTIFY you could listen to the wal / logical replication stream.
Or you could have a worker whose only job is to listen to the wal / logical replication stream and then NOTIFY. Being the only one to do so would not burden other transactions.
Or you could have a worker whose only job is to listen to the wal / logical replication stream and then publish on some non-PG pubsub system.
Ping requires something persistent to check. That requires creating tuples, and most likely deleting them after they’ve been consumed. That puts pressure on the database and requires vacuuming in ways that pubsub doesn’t because it’s entirely ephemeral.
Not to mention that pubsub allows multiple consumers for a single message, whereas FOR UPDATE is single consumer by design.
I found recently that you can write directly to the WAL with transactional guarantees, without writing to an actual table. This sounds like it would be amazing for queue/outbox purposes, as the normal approaches of actually inserting data in a table cause a lot of resource usage (autovacuum is a major concern for these use cases).
Can’t find the function that does that, and I’ve not seen it used in the wild yet, idk if there’s gotchas
Yeah until vendors butcher Postgres replication behaviors and prevent common paths of integrating these capabilities into other tools. Looking at you AWS
Many here recommend using Kafka or RabbitMQ for real-time notifications. While these tools work well with a relatively stable, limited set of topics, they become costly and inefficient when dealing with a large number of dynamic subscribers, such as in a messaging app where users frequently come and go. In RabbitMQ, queue bindings are resource-intensive, and in Kafka, creating new subscriptions often triggers expensive rebalancing operations. I've seen a use case for a messenger app with 100k concurrent subscribers where developers used RabbitMQ and individual queues for each user. It worked at 60 CPU on Rabbit side during normal situation and during mass reconnections of users (due to some proxy reload in infra) – it took up to several minutes for users to reconnect. I suggested switching to https://github.com/centrifugal/centrifugo with Redis engine (combines PUB/SUB + Redis streams for individual queues) – and it went to 0.3 CPU on Redis side. Now the system serves about 2 million concurrent connections.
Agreed, I am struggling to understand why "it does not scale" is not "we used it wrong and hit the point where it's a problem" here.
Like if it needs to be very consistent I would use an unlogged table (since we're worried about "scale" here) and then `FOR UPDATE SKIP LOCKED` like others have mentioned. Otherwise what exactly is notify doing that can't be done after the first transaction?
Edit: in-fact, how can they send an HTTP call for something and not be able to do a `NOTIFY` after as well?
One possible way I could understand what they wrote is that somewhere in their code, within the same transaction, there are notifies which conditionally trigger and it would be difficult to know which ones to notify again in another transaction after the fact. But they must know enough to make the HTTP call, so why not NOTIFY?
Yeah, the way I've always used LISTEN/NOTIFY is just to tell some pool of workers that they should wake up and check some transactional outbox for new work. False positives are basically harmless and therefore don't need to be transactional. If you're sending sophisticated messages with NOTIFY (which is a reasonable thing to think you can do) you're probably headed for pain at some point.
Assuming you skip select transaction, or require logging on it because your regulated industry had bad auditors, then every transaction changes something.
My thought as well. You could add notify commands to a temp table during the transaction, then run NOTIFY on each row in that temp table after the transaction commits successfully?
Sounds like a deliberate attempt to avoid spinning up Redis, Kafka, or an outbox system early on.. and then underestimated how quickly their scale would make it blow up. Story as old as time.
Because documentation doesn’t warn about this well-loved feature effectively ruins the ability to perform parallel writes, and because everything else in Postgres scales well.
I think it’s a reasonable assumption. Based on the second half of your comment, you clearly don’t think highly of “AI companies,” but I think that’s a separate issue.
Facebook’s wormhole seems like a better approach here - just tailing the MySQL bin log gets you commit safety for messages without running into this kind of locking behavior.
If I understood correctly, the global lock is so that notify events are emitted in order. Would it make sense to have a variant that doesn't make this ordering guarantee if you don't care about it, so that you can "notify" within transactions without locking the whole thing?
possibly, but i think at that point it would make more sense to move the business logic outside of the database (you can wait for a successful commit before triggering an external process via the originating app, or monitor the WAL with an external pub/sub system, or something else more clever than i can think of).
If I’m not mistaken LISTEN/NOTIFY doesn’t work with connection poolers, and you can’t have tens of thousands of connections to a Postgres database. Not sure you need a more elaborate analysis than that to reach the same conclusion.
Postgres is a great DB, but it's the wrong tool for a write-heavy, high-concurrency, real-time system with pub-sub needs.
You should split your system into specialized components:
- Kafka for event transport (you're likely already doing this).
- An LSM-tree DB for write-heavy structured data (eg: Cassandra)
- Keep Postgres for queries that benefit from relational features in certain parts of your architecture
IMO They don’t have a high concurrency DB writing system, they just think they do.
Recordings can and should be streamed to an object store. Parallel processes can do transcription on those objects; bonus: when they inevitably have a bug in transcription, retranscribing meetings is easy.
The output of transcription can be a single file also stored in the object store with a single completion message notification, or if they really insist on “near real-time”, a message on a queue for every N seconds. Much easier to scale your queue than your DB, eg Kafka partitions.
A handful of consumers can read those messages and insert into the DB. Benefit is you have a fixed and controllable write load into the database, and your client workload never overloads the DB because you’re buffering that with the much more distributed object store (which is way simpler than running another database engine).
Neither do foreign keys the moment you need to shard. Turns out that there's no free lunch when you ask your database to do "secret extra work" that's supposed to be transparent-ish to the user.
> Unique indexes slow inserts since db has to acquire a full table lock
An INSERT never results in a full table lock (as in "the lock would prevent other inserts or selects on the table)
Any expression used in the WHERE clause that isn't indexed will probably result in a Seq Scan. CASE expressions are no different than e.g. a function call regarding this.
A stable function marked as "STABLE" (or even immutable) can be optimized differently (e.g. can be "inlined"), so yes that's a good recommendation.
Can you provide more details? Inserting with unique indexes do not lock the table. Case statements are ok in where clause, use expression indexes to index it
For real-time notifications, I believe Nats (https://nats.io) or Centrifugo (https://centrifugal.dev) are worth checking out these days. Messages may be delivered to those systems from PostgreSQL over replication protocol through Kafka as an intermediary buffer. Reliable real-time messaging comes with lots of complexities though, like late message delivery, duplicate message delivery. If the system can be built around at most once guarantees – can help to simplify the design dramatically. Depends on the use case of course, often both at least once and at most once should co-exist in one app.
This is part of the basis for Supabase offering their realtime service, and broadcast, rather than supporting native LISTEN/NOTIFY. The scaling issues are well known.
Sounds like one centralized Postgres instance, am I understanding that correctly? Wouldn’t meeting bots be very easy to parallelize across single-tenant instances?
You had one problem with listen notify which was a fair one, but now you have a problem with http latency, network issues, DNS, retries, self-DDoS, etc.
it sounds like the impact of LISTEN/NOTIFY scaling issues was much greater on the overall DB performance than the actual load/scope of the task being performed (based on the end of the article), and they're aware that if they needed something more performant for that offloaded task, they have options (pub/sub via redis or w/e).
LISTEN/NOTIFY was always a bit of a puzzler for me. Using it means you can't use things like pgbouncer/pgpool and there are so many other ways to do this, polling included. I guess it could be handy for an application where you know it won't scale and you just want a simple, one-dependency database.
> I guess it could be handy for an application where you know it won't scale and you just want a simple, one-dependency database
That's where we use it at my work. We have host/networking deployment pipelines that used to have up to one minute latency on each step because each was ran on a one-minute cron. A short python script/service that handled the LISTENing + adding NOTIFYs when the next step was ready removed the latency and we'll never do enough for the load on the db to matter
You can setup notify to run as a trigger on an events table. The job that listens shouldn't need a pool, it's a long lived connection anyway. Now you can keep using pgbouncer everywhere else.
1) the Postgres documentation does not mention that Notify causes a global lock or lock of any sort (I checked). That’s crazy to me; if something causes a lock, the documentation should tell you it does and what kind. Performance notes also belong in documentation for dbs.
2) why the hell does notify require a lock in the first place? Reading the comment this design seems insane; there’s no good reason to queue up notifications for transactions that aren’t committed. Just add the notifications in commit order with no lock, you’re building a db with concurrency, get used to it.
The pattern I've always used for this, which I suspect is what they landed on, is to have an optimistic notification method in a separate message queue that says "something changed that's relevant to you". Then you can dedupe that, etc. Then structure the data to easily sync what's new, and let the client respond to that notification by calling the sync API. That even lets you use multiple notification methods for notification. None of that involves having to have the database coordinate notifications in the middle of a transaction.
Honestly this article is ridiculous. Most people do not have tens of thousands of concurrent writers. And most applications out there are read heavy, not write. Which means you probably have read replicas distributing loads.
Use LISTEN/NOTIFY. You will get a lot of utility out of it before you’re anywhere close to these problems.
I would phrase this as “know where your approach hits scaling walls”. You’re right that many people never need more than LISTEN/NOTIFY but the reason that advice became so popular was the wave of people who had jumped straight into running some complicated system like Kafka when they hadn’t done any analysis to justify it; it would be nice if the lesson we taught was that you should do some analysis rather than just picking one popular option.
I think the title is stating this:
"Postgres LISTEN/NOTIFY does not scale"
That means for moderate cases you do not even have to care about this. 99% of PostgreSQL instances out there are not big "scale".
As a sr. engineer is your responsibility to make a decision if you will build for "scale" from day zero or ignore this as you are mindful that this will not affect you until a certain point.
Listen/Notify is potentially lossy and should not be used. At one of my previous companies we used it for cache invalidation (a trigger on tables would sent notify messages to invalidation Redis keys for potentially affected cache entries). We ended up ripping it out and replacing it with NSQ.io as a reliable transport.
> When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits.
It only serializes commits where NOTIFY was issued as part of the transaction, right? Transactions which did not call NOTIFY should not be affected?
My understanding:
i think as postgres takes an exclusive lock to enqueue the notifications into a shared queue in PreCommit_Notify(), as the actual commit happens after notification was enqueued into the queue,as other transactions also try to notify but wait becacause of the lock ,so does the commit waits.
RBDMS are not designed for write-heavy applications, they are designed for read-heavy analysis. Also, an RDBMS is not a message queue or an RPC transport.
I feel like somebody needs to write a book on system architecture for Gen Z that's just filled with memes. A funny cat pic telling people not to use the wrong tool will probably make more of an impact than an old fogey in a comment section wagging his finger.
People have been using RDBMS' for write-heavy workflows for forever. Some people even use stored procs or triggers for getting complicated write operations to work properly.
Databases can do a lot of stuff, and if you're not hurting for DB performance it can be a good idea to just... do it in the database. The advantage is that, if the DB does it, you're much less likely to break things. Putting data constraints in application code can be done, but then you're just waiting for the day those constraints are broken.
RDBMS (you spelled it wrong) are good for many things. Postgres is a veritable swiss army knife - have a look through the manual, scroll through all the features you don't care about, be amazed it has so many.
RDBMS are an old fogey tool. It takes a really old fogey to suggest storing records at fixed byte intervals directly on the disk - is that your proposed alternative? Or perhaps you grew up in the microservices era and that's already become old fogey.
But those rules of thumb aren't true. People use Postgres for job queues and write-heavy applications.
You'd have to at least accompany your memes with empirics. What is write-heavy? A number you might hit if your startup succeeds with thousands of concurrent users on your v1 naive implementation?
Else you just get another repeat of everyone cargo-culting Mongo because they heard that Postgres wasn't web scale for their app with 0 users.
Got up to the TL;DR paragraph. This was a major red flag given the initial presentation of the discovery of a bottleneck:
'''
When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits.
'''
Am I missing something - this seems like something the original authors of the system should have done due diligence on before implementing a write heavy work load.
I think it's just difficult to predict how heavy is heavy enough to make this a problem. FWIW I had worked at a startup with a much more primitive data storage system where serialized commits were actually totally fine. The startup never outgrew that bottleneck.
If “doing due diligence” involves reading the source code of a database server to verify a design, I doubt many people writing such systems do due diligence.
The documentation doesn’t mention any caveats in this direction, and they had 3 periods of downtime in 4 days, so I don’t think it’s a given that testing would have hit this problem.
Transactional databases are not really the best tool for writing tons of (presumably) immutable records. Why are you using it for this? Why not Elastic?
Transactional databases are great, provided your write workload is low enough to fit on one server. If you have to scale up past that, you might have to use a different kind of database. But if transactions work for you, as they do for 99% of small-medium sites, they're amazing.
Multi-master transactional databases are an open area of research, as far as I'm aware, but read-only replication is a solved problem. Therefore your write traffic, including your transaction overhead, has to fit within one server's capacity, while your read traffic can scale horizontally as much as you like.
Oof. Now I’m looking into using an extension someone wrote to publish to Redis straight from Postgres, as a replacement for NOTIFY statements in triggers. Kind of a mess but any other way of waking up our app logic that walks change queues in Postgres seems worse
> The structured data gets written to our Postgres database by tens of thousands of simultaneous writers. Each of these writers is a “meeting bot”, which joins a video call and captures the data in real-time.
Maybe I missed it in some folded up embedded content, or some graph (or maybe I'm probably just blind...), but is it mentioned at which point they started running into issues? The quoted bit about "10s of thousands of simultaneous writers" is all I can find.
What is the qualitative and quantitative nature of relevant workloads? Depending on the answers, some people may not care.
I asked ChatGPT to research it and this is the executive summary:
For PostgreSQL’s LISTEN/NOTIFY, a realistic safe throughput is:
Up to ~100–500 notifications/sec: Handles well on most systems with minimal tuning. Low risk of contention.
~500–2,000 notifications/sec: Reasonable with good tuning (short transactions, fast listeners, few concurrent writers). May start to see lock contention.
~2,000–5,000 notifications/sec: Pushing the upper bounds. Requires careful batching, dedicated listeners, possibly separate Postgres instances for pub/sub.
>5,000 notifications/sec: Not recommended for sustained load. You’ll likely hit serialization bottlenecks due to the global commit lock held during NOTIFY.
osigurdson|7 months ago
I've landed on Postgres/ClickHouse/NATS since together they handle nearly any conceivable workload managing relational, columnar, messaging/streaming very well. It is also not painful at all to use as it is lightweight and fast/easy to spin up in a simple docker compose. Postgres is of course the core and you don't always need all three but compliment each other very well imo. This has been my "go to" for a while.
jelder|7 months ago
riedel|7 months ago
j45|7 months ago
In the beginning having fewer parts to connect and maintain lets the needs and bottlenecks of the actual application emerge.
If it was listen/notify in such a scenario at some volume where optimizing it isn’t in the cards… so be it. It would be some time down the road before sharding a function into a specific subsystem like what you described.
Appreciate learning about the Postgres/Clickhouse/nats combo. If there might be an article if the three together that you liked would be happy to read and learn.
dathinab|7 months ago
listen/notify isn't necessary a replacement for redis or other pub/sub systems, redis pub/sub and similar isn't necessary a replacement for idk. Kafka or similar queue/messaging system
but a lot of companies have (for modern standards) surprisingly small amounts of data, very even a increase by 2,3,4x still isn't that big. In that case listen/notify and similar might just work fine :shrug:
also same is true the other way around, depending on you application you can go redis only, as long as you data volume stays small enough and needs for transactional/sync are reasonable simple enough (with watch+exec, NX,XX options etc. and maybe some redis side lua scripts you can do quite a lot for data synchronization). Issue with that is that stylistically redis data sync/transaction code is often much more similar to writing atomic data-structures then to SQL transactions, and even for SQL transactions there is a trend of devs severely overestimating what they provide, so often you are better of not touching on it when you can avoid it, also BTW. redis has something very similar to sqlite or Notify where "basically" (oversimplified by a lot) there is only one set of writes done at a time ;) (and then afterwards distributed to replicas), just that outside of some micro lua scripts you don't really run much logic outside of some NX, XX checks etc. so it's not blocking much and it's "more or less" all just in memory not touching a WAL (again oversimplified).
ownagefool|7 months ago
However, I've been in several situations where scaling the queue brings down the database, and therefore the app, and am thus of the opinion you probably shouldn't couple these systems too tightly.
There are pros and cons, of course.
KronisLV|7 months ago
Maybe throw in a dedicated key-value store like Redis or Valkey.
Oh and maybe something S3 compatible like MinIO, Garage or SeaweedFS for storing bunches of binary data.
With all of that, honestly it should cover most of the common workloads out there! Of course, depends on how specialized vs generic you like your software to be.
goodkiwi|7 months ago
riku_iki|7 months ago
I think sentiment is to use "for everything in 99% business cases", which involves few 100GB of data with some thousands QPS, and could be handled by PG very well.
cryptonector|7 months ago
fathomdeez|7 months ago
unknown|7 months ago
[deleted]
v5v3|7 months ago
I.e. use Kafka unless you have a explicit reason not to?
So why Nats?
sorentwo|7 months ago
It was particularly ironic because Elixir has a fantastic distribution and pubsub story thanks to distributed Erlang. That’s much more commonly used in apps now compared to 5 or so years ago when 40-50% of apps didn’t weren’t clustered. Thanks to the rise of platforms like Fly that made it easier, and the decline of Heroku that made it nearly impossible.
cpursley|7 months ago
nightpool|7 months ago
alberth|7 months ago
JoelJacobson|7 months ago
Turns out that all Postgres versions from 9.6 through current master scale linearly with the number of idle listeners — about 13 μs extra latency per connection. That adds up fast: with 1,000 idle listeners, a NOTIFY round-trip goes from ~0.4 ms to ~14 ms.
To better understand the bottlenecks, I wrote both a benchmark tool and a proof-of-concept patch that replaces the O(N) backend scan with a shared hash table for the single-listener case — and it brings latency down to near-O(1), even with thousands of listeners.
Full benchmark, source, and analysis here: https://github.com/joelonsql/pg-bench-listen-notify
No proposals yet on what to do upstream, just trying to gather interest and surface the performance cliff. Feedback welcome.
cryptonector|7 months ago
IMO LISTEN/NOTIFY is badly designed as an interface to begin with because there is no way to enforce access controls (who can notify; who can listen) nor is there any way to enforce payload content type (e.g., JSON). It's very unlike SQL to not have a `CREATE CHANNEL` and `GRANT` commands for dealing with authorization to listen/notify.
If you have authz then the lack of payload content type constraints becomes more tolerable, but if you add a `CREATE CHANNEL` you might as well add something there regarding payload types, or you might as well just make it so it has to always be JSON.
With a `CREATE CHANNEL` PG could provide:
infogulch|7 months ago
https://www.postgresql.org/message-id/flat/CAM527d_s8coiXDA4...
https://www.postgresql.org/message-id/flat/175222328116.3157...
WhyNotHugo|7 months ago
I'm amused at how op brags about the huge scale at which they operate, but instead of even considering fixing the issue (both for themselves and for others), they just switched to something else for pubsub.
leontrolski|7 months ago
Wasn't aware of this AccessExclusiveLock behaviour - a reminder (and shameless plug 2) of how Postgres locks interact: https://leontrolski.github.io/pglockpy.html
qianli_cs|7 months ago
(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.
Would love to hear feedback from you and others building similar systems.
[1] https://github.com/dbos-inc/dbos-transact-py
singron|7 months ago
Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction part of the time too. It's also nice if your database doesn't melt whenever this happens on accident.
broken_broken_|7 months ago
- The batch size needs to be adaptative for performance, latency, and recovering smoothly after downtime.
- The polling timeouts, frequency etc the same.
- You need to avoid hysteresis.
- You want to be super careful about not disturbing the main application by placing heavy load on the database or accidentally locking tables/rows
- You likely want multiple distributed workers in case of a network partition to keep handling events
It’s hard to get right especially when the databases at the time did not support SKIP LOCKED.
In retrospect I wish I had listened to the WAL. Much easier.
cpursley|7 months ago
RedShift1|7 months ago
TkTech|7 months ago
It both polls (configurable per queue) and supports listen/notify simply to inform workers that it can wake up early to trigger polling, and this can be turned off globally with a notifications=false flag.
[1]: https://github.com/tktech/chancy
cryptonector|7 months ago
Or you could have a worker whose only job is to listen to the wal / logical replication stream and then NOTIFY. Being the only one to do so would not burden other transactions.
Or you could have a worker whose only job is to listen to the wal / logical replication stream and then publish on some non-PG pubsub system.
aurumque|7 months ago
sorentwo|7 months ago
Not to mention that pubsub allows multiple consumers for a single message, whereas FOR UPDATE is single consumer by design.
cpursley|7 months ago
https://github.com/cpursley/walex?tab=readme-ov-file#walex (there's a few useful links in here)
williamdclt|7 months ago
Can’t find the function that does that, and I’ve not seen it used in the wild yet, idk if there’s gotchas
Edit: found it, it’s pg_logical_emit_message
denysonique|7 months ago
BowBun|7 months ago
FZambia|7 months ago
odie5533|7 months ago
CaliforniaKarl|7 months ago
* It gives an indication of how much you need to grow before this Postgres functionality starts being a blocker.
* Folks encountering this issue—and its confusing log line—in the future will be able to find this post and quickly understand the issue.
Gigachad|7 months ago
andrewstuart|7 months ago
The post author is too focused on using NOTIFY in only one way.
This post fails to explain WHY they are sending a NOTIFY. Not much use telling us what doesn’t work without telling us the actual business goal.
It’s crazy to send a notify for every transaction, they should be debounced/grouped.
The point of a NOTIFY is to let some other system know something has changed. Don’t do it every transaction.
0xCMP|7 months ago
Like if it needs to be very consistent I would use an unlogged table (since we're worried about "scale" here) and then `FOR UPDATE SKIP LOCKED` like others have mentioned. Otherwise what exactly is notify doing that can't be done after the first transaction?
Edit: in-fact, how can they send an HTTP call for something and not be able to do a `NOTIFY` after as well?
One possible way I could understand what they wrote is that somewhere in their code, within the same transaction, there are notifies which conditionally trigger and it would be difficult to know which ones to notify again in another transaction after the fact. But they must know enough to make the HTTP call, so why not NOTIFY?
thom|7 months ago
tomrod|7 months ago
hombre_fatal|7 months ago
nick_|7 months ago
parthdesai|7 months ago
soursoup|7 months ago
zerd|7 months ago
callamdelaney|7 months ago
It’s unsurprising to me that an AI company appears to have chosen exactly the wrong tool for the job.
kristianc|7 months ago
bravesoul2|7 months ago
SQS may have been a good "boring" choice for this?
TheTaytay|7 months ago
I think it’s a reasonable assumption. Based on the second half of your comment, you clearly don’t think highly of “AI companies,” but I think that’s a separate issue.
NightMKoder|7 months ago
cshimmin|7 months ago
GuinansEyebrows|7 months ago
bjornsing|7 months ago
calderwoodra|7 months ago
baristaGeek|7 months ago
You should split your system into specialized components: - Kafka for event transport (you're likely already doing this). - An LSM-tree DB for write-heavy structured data (eg: Cassandra) - Keep Postgres for queries that benefit from relational features in certain parts of your architecture
ryanjshaw|7 months ago
Recordings can and should be streamed to an object store. Parallel processes can do transcription on those objects; bonus: when they inevitably have a bug in transcription, retranscribing meetings is easy.
The output of transcription can be a single file also stored in the object store with a single completion message notification, or if they really insist on “near real-time”, a message on a queue for every N seconds. Much easier to scale your queue than your DB, eg Kafka partitions.
A handful of consumers can read those messages and insert into the DB. Benefit is you have a fixed and controllable write load into the database, and your client workload never overloads the DB because you’re buffering that with the much more distributed object store (which is way simpler than running another database engine).
baristaGeek|7 months ago
merb|7 months ago
polote|7 months ago
shivasaxena|7 months ago
Becomes a problem if you are inserting 40 items to order_items table.
Spivak|7 months ago
shivasaxena|7 months ago
What I already know
- Unique indexes slow inserts since db has to acquire a full table lock
- Case statements in Where break query planner/optimizer and require full table scans
- Read only postgres functions should be marked as `STABLE PARALLEL SAFE`
hans_castorp|7 months ago
An INSERT never results in a full table lock (as in "the lock would prevent other inserts or selects on the table)
Any expression used in the WHERE clause that isn't indexed will probably result in a Seq Scan. CASE expressions are no different than e.g. a function call regarding this.
A stable function marked as "STABLE" (or even immutable) can be optimized differently (e.g. can be "inlined"), so yes that's a good recommendation.
franckpachot|7 months ago
1a527dd5|7 months ago
My other reference for a slightly different problem is https://www.thatguyfromdelhi.com/2020/12/what-postgres-sql-c...
FZambia|7 months ago
cryptonector|7 months ago
spoaceman7777|7 months ago
mattxxx|7 months ago
mulmen|7 months ago
to11mtm|7 months ago
caleblloyd|7 months ago
h1fra|7 months ago
GuinansEyebrows|7 months ago
sleepy_keita|7 months ago
nightfly|7 months ago
That's where we use it at my work. We have host/networking deployment pipelines that used to have up to one minute latency on each step because each was ran on a one-minute cron. A short python script/service that handled the LISTENing + adding NOTIFYs when the next step was ready removed the latency and we'll never do enough for the load on the db to matter
nhumrich|7 months ago
valenterry|7 months ago
grumple|7 months ago
1) the Postgres documentation does not mention that Notify causes a global lock or lock of any sort (I checked). That’s crazy to me; if something causes a lock, the documentation should tell you it does and what kind. Performance notes also belong in documentation for dbs.
2) why the hell does notify require a lock in the first place? Reading the comment this design seems insane; there’s no good reason to queue up notifications for transactions that aren’t committed. Just add the notifications in commit order with no lock, you’re building a db with concurrency, get used to it.
bhollis|7 months ago
deadbabe|7 months ago
Use LISTEN/NOTIFY. You will get a lot of utility out of it before you’re anywhere close to these problems.
acdha|7 months ago
konsalexee|7 months ago
That means for moderate cases you do not even have to care about this. 99% of PostgreSQL instances out there are not big "scale".
As a sr. engineer is your responsibility to make a decision if you will build for "scale" from day zero or ignore this as you are mindful that this will not affect you until a certain point.
freeasinbeer2|7 months ago
What were the TPS numbers? What was the workload like? How big is the difference in %?
supportengineer|7 months ago
Features that seem harmless at small scale can break everything at large scale.
edoceo|7 months ago
However, in 2025 I'd pick Redis or MQTT for this kind of role. I'm typically in multi-lamg environments. Is there something better?
fmajid|7 months ago
daitangio|7 months ago
https://github.com/daitangio/pque
I evaluated Listen/notify but it seems to loose messages if no one is listening, so its use case seems pretty limited to me (my 2 cents).
Anyway, If you need to scale, I suggest an ad hoc queue server like rabbitmq.
winterrx|7 months ago
Matthias247|7 months ago
> When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits.
It only serializes commits where NOTIFY was issued as part of the transaction, right? Transactions which did not call NOTIFY should not be affected?
DumBthInker007|7 months ago
redskyluan|7 months ago
For startups, Postgres is a fantastic first choice. But plan ahead: as your workload grows, you’ll likely need to migrate or augment your stack.
westurner|7 months ago
cellis|7 months ago
gwbas1c|7 months ago
> tens of thousands of simultaneous writers
I'm surprised they aren't sharding at this scale. I wonder why?
0xbadcafebee|7 months ago
I feel like somebody needs to write a book on system architecture for Gen Z that's just filled with memes. A funny cat pic telling people not to use the wrong tool will probably make more of an impact than an old fogey in a comment section wagging his finger.
const_cast|7 months ago
Databases can do a lot of stuff, and if you're not hurting for DB performance it can be a good idea to just... do it in the database. The advantage is that, if the DB does it, you're much less likely to break things. Putting data constraints in application code can be done, but then you're just waiting for the day those constraints are broken.
immibis|7 months ago
RDBMS are an old fogey tool. It takes a really old fogey to suggest storing records at fixed byte intervals directly on the disk - is that your proposed alternative? Or perhaps you grew up in the microservices era and that's already become old fogey.
hombre_fatal|7 months ago
You'd have to at least accompany your memes with empirics. What is write-heavy? A number you might hit if your startup succeeds with thousands of concurrent users on your v1 naive implementation?
Else you just get another repeat of everyone cargo-culting Mongo because they heard that Postgres wasn't web scale for their app with 0 users.
kccqzy|7 months ago
vb-8448|7 months ago
doc_manhat|7 months ago
''' When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits. '''
Am I missing something - this seems like something the original authors of the system should have done due diligence on before implementing a write heavy work load.
kccqzy|7 months ago
Someone|7 months ago
The documentation doesn’t mention any caveats in this direction, and they had 3 periods of downtime in 4 days, so I don’t think it’s a given that testing would have hit this problem.
whatevaa|7 months ago
seunosewa|7 months ago
winterrx|7 months ago
aryav07|7 months ago
dumbfounder|7 months ago
incoming1211|7 months ago
immibis|7 months ago
Multi-master transactional databases are an open area of research, as far as I'm aware, but read-only replication is a solved problem. Therefore your write traffic, including your transaction overhead, has to fit within one server's capacity, while your read traffic can scale horizontally as much as you like.
Kwpolska|7 months ago
maxdo|7 months ago
jedwards1211|7 months ago
randall|7 months ago
wordofx|7 months ago
anonu|7 months ago
cool writeup!
threecheese|7 months ago
unknown|7 months ago
[deleted]
fatih-erikli-cg|7 months ago
[deleted]
winterissnowing|7 months ago
[deleted]
aaa12365|7 months ago
ilitirit|7 months ago
Maybe I missed it in some folded up embedded content, or some graph (or maybe I'm probably just blind...), but is it mentioned at which point they started running into issues? The quoted bit about "10s of thousands of simultaneous writers" is all I can find.
What is the qualitative and quantitative nature of relevant workloads? Depending on the answers, some people may not care.
I asked ChatGPT to research it and this is the executive summary:
cap11235|7 months ago
[deleted]