SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue

[+] mark242|2 years ago|reply

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

It ultimately doesn't matter because of the low volume they're dealing with, but gang, "just slap a queue on it" gets you the same results as "just slap a cache on it" if you don't understand the tool you're working with. If they knew that some jobs would take hours and some jobs would take seconds, why would you not immediately spin up four queues. Two for the short jobs (one acting as a DLQ), and two for the long jobs (again, one acting as a DLQ). Your DLQ queues have a low TTL, and on expiration those messages get placed back onto the tail of the original queues. Any failure by your consumer, and that message gets dropped onto the DLQ and your overall throughput is determined by the number * velocity of your consumers, and not on your queue architecture.

This pg queue will last a very long time for them. Great! They're willing to give up the easy fanout architecture for simplicity, which again at their volume, sure, that's a valid trade. At higher volumes, they should go back to the drawing board.

[+] btilly|2 years ago|reply

If they knew that some jobs would take hours and some jobs would take seconds, why would you not immediately spin up four queues. Two for the short jobs (one acting as a DLQ), and two for the long jobs (again, one acting as a DLQ). Your DLQ queues have a low TTL, and on expiration those messages get placed back onto the tail of the original queues.

Here is why I would not recommend that.

Do that and you have to rewrite your system around predictions how long each job will take, deal with 4 sources of failure, and have more complicated code. All this to maintain the complication of an ultimately unneeded queueing system. You call it an "easy fanout architecture for simplicity." I call it, "an entirely unnecessary complication that they have no business having to deal with."

If they get to a volume where they should go back to the drawing board, then they can worry about it then. This would be a good problem to have. But there is no need to worry about it now.

[+] mekoka|2 years ago|reply

> In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ.

What do you mean by "broken"? Are you implying that the behavior they're describing is not the way the consumer library is supposed to work? They linked to RabbitMQ's documentation basically saying that's exactly how it works. Also, where do you get the sense that they've misconfigured it? You've made these statements, but did not exactly enlightened us as to how one should set things up to have consumers handle exactly one job at a time. That was their only problem (Edit: an answer by @whakim https://news.ycombinator.com/item?id=35530108 is providing more light on this).

The rest of your answer sanctimoniously presumes that they don't know how to use the tool, but your own proposed solution is moot, as it seems to address a different problem, not the one that they have (1 job per consumer max).

[+] ryanjshaw|2 years ago|reply

> I've never seen this in years of dealing with RabbitMQ.

Did you do long running jobs like they did? It's a stereotype, but I don't think they used the technology correctly here -- you're not supposed to hold onto messages for hours before acknowledging. They should have used RabbitMQ just to kick off the job, immediately ACKing the request, and job tracking/completion handled inside... a database.

[+] aidos|2 years ago|reply

It may be a misconfiguration but I’m fairly sure you couldn’t change this behaviour in the past. Each worker would take a job in advance and you could not prevent it (I might be misremembering but I think I checked the source at the time).

In my experience, RabbitMQ isn’t a good fit for long running tasks. This was 10 years ago. But honestly, if you have a short number of long running tasks, Postgres is probably a better fit. You get transactional control and you remove a load of complexity from the system.

[+] dpflan|2 years ago|reply

Your last comment is the key, they had an issue and not the scale so a simpler approach works, but then I imagine that this company, which is a new company and growing, will have a future blogpost about switching from pg queue to something that fits their scale...

[+] seunosewa|2 years ago|reply

Not exactly. For performance (I guess) each worker fetches an extra job while it's working on the current job. If the current job happens to be very long, then the extra job it fetched will be stuck waiting for a long time.

Your multiple queue solution might work but it is most efficient to have just one queue with a pool of workers where where each worker doesn't pop a job unless it's ready to process it immediately. In my experience, this is the optimal solution.

[+] tracker1|2 years ago|reply

Yeah, my first thought was curiosity about their volume needs. DB based queues are fine if you don't need more than a few messages a second of transport. For that matter, I've found Azure's Storage Queues probably the simplest and most reliable easy button for queues that don't need a lot of complexity. Once you need more than that, it gets... complicated.

Also, sharing queues for multiple types of jobs just feels like frustration waiting to happen.

[+] raverbashing|2 years ago|reply

It might be because RabbitMQ libraries and tooling is awfully unintuitive and confusing (without some higher level library on top).

At some point using what you understand is easier

[+] stolsvik|2 years ago|reply

Their solution does not preclude fanout. Fetching work items from such a work queue by multiple nodes/servers should be no problem. One solution that also would be good for monitoring would be to have a state column, and a "handled by node" column. So, in a transaction, find a row that is not taken, take it by setting the state to PROCESSING, and set the handled_by_node to the nodename. Shove in a timestamp for good measure. When it is done, set the state to DONE, or delete the row.

Monitor by having some health check evaluating that no row stays in PROCESSING for too long, and that no row stays NOT_STARTED for too long, etc. Introspect by making a nice little HTML screen that shows this work queue and its states.

As I wrote in another comment, this is somewhat similar to a "work queue pattern" I've described here: https://mats3.io/patterns/work-queues/

[+] rawoke083600|2 years ago|reply

This! In all my years of working with RabbitMQ is been super solid! Throwing all that out in exchange for a db queue, honestly doesnt seem smart (with respect)

[+] echelon|2 years ago|reply

I'm at a point where I built a low volume queue in MySQL and need to rip it out and replace it with something that does 100+ QPS, exactly once dispatch / single worker processing, job priority level, job topics, sampling from the queue without dequeuing, and limited on failure retry.

I can probably bolt some of these properties onto a queue that doesn't support all the features I need.

[+] whalesalad|2 years ago|reply

Sounds like a prefetch issue. Or auto ack.

Rabbit is a phenomenal tool but you need to know how to use it.

[+] mastermedo|2 years ago|reply

What about messages that uncover a bug in the consumer/message format? After X failed attempts, the data shouldn’t be enqueued anymore, but rather logged somewhere for engineers to inspect.

[+] ftkftk|2 years ago|reply

My thoughts exactly half way through the article.

[+] datavirtue|2 years ago|reply

This.

[+] nemothekid|2 years ago|reply

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

[+] hinkley|2 years ago|reply

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.

[+] MuffinFlavored|2 years ago|reply

> Even the most unoptimized database should be able to handle this.

Anybody had any success running a queue on top of... sqlite?

With the way the sqlite file locking mechanisms work, are you basically guaranteed really low concurrency? You can have lots of readers but not really a lot of writers, and in order to pop a job off of the queue you need to have a process spinning waiting for work, move its status from "to do" to "in progress" and then "done" or "error", which is sort of "write" heavy?

> An EXCLUSIVE lock is needed in order to write to the database file. Only one EXCLUSIVE lock is allowed on the file and no other locks of any kind are allowed to coexist with an EXCLUSIVE lock. In order to maximize concurrency, SQLite works to minimize the amount of time that EXCLUSIVE locks are held.

> You can avoid locks when reading, if you set database journal mode to Write-Ahead Logging (see: http://www.sqlite.org/wal.html).

[+] aetherson|2 years ago|reply

And, on the other hand, people shouldn't kid themselves about the ability of Postgres to handle millions of jobs per day as a queue.

[+] simonw|2 years ago|reply

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

[+] sa46|2 years ago|reply

Here are a couple of tips if you want to use postgres queues:

- You probably want FOR NO KEY UPDATE instead of FOR UPDATE so you don't block inserts into tables that have a foreign key relationship with the job table. [1]

- If you need to process messages in order, you don't want SKIP LOCKED. Also, make sure you have an ORDER BY clause.

My main use-case for queues is syncing resources in our database to QuickBooks. The overall structure looks like:

    BEGIN; -- start a transaction

    SELECT job.job_id, rm.data
    FROM qbo.transmit_job job
      JOIN resource_mutation rm USING (tenant_id, resource_mutation_id)
    WHERE job.state = 'pending'
    ORDER BY job.create_time
    LIMIT 1 FOR NO KEY UPDATE OF job NOWAIT;

    -- External API call to QuickBooks.

    -- If successsful:
    UPDATE qbo.transmit_job
    SET state = 'transmitted'
    WHERE job_id = $1;

    COMMIT;

This code will serialize access to the transmit_job table. A more clever approach would be to serialize access by tenant_id. I haven't figured out how to do that yet (probably lock on a tenant ID first, then lock on the job ID).

Somewhat annoyingly, Postgres will log an error if another worker holds the row lock (since we're not using SKIP LOCKED). It won't block because of NOWAIT.

CrunchyData also has a good overview of Postgres queues: [2]

[1]: https://www.migops.com/blog/2021/10/05/select-for-update-and...

[2]: https://blog.crunchydata.com/blog/message-queuing-using-nati...

[+] dilyevsky|2 years ago|reply

Not doing SKIP LOCKED will make it basically single threaded, no? I’m of the opinion that you should just use Temporal if you don’t need inter-job order guarantees

[+] eckesicle|2 years ago|reply

Postgres is probably the best solution for every type of data store for 95-99% of projects. The operational complexity of maintaining other attached resources far exceed the benefit they realise over just using Postgres.

You don’t need a queue, a database, a blob store, and a cache. You just need Postgres for all of these use cases. Once your project scales past what Postgres can handle along one of these dimensions, replace it (but most of the time this will never happen)

It also does wonders for your uptime and SLO.

[+] fullstop|2 years ago|reply

We collect messages from tens of thousands of devices and use RabbitMQ specifically because it is uncoupled from the Postgres databases. If the shit hits the fan and a database needs to be taken offline the messages can pool up in RabbitMQ until we are in a state where things can be processed again.

[+] alberth|2 years ago|reply

> Postgres is probably the best solution for every type of data store for 95-99% of projects.

I'd say it's more like:

- 95.0%: SQLite

- 4.9%: Postgres

- 0.1%: Other

[+] adverbly|2 years ago|reply

Careful with using postgres as a blob store. That can go bad fast...

[+] jrochkind1|2 years ago|reply

While it makes sense to use postgres for a queue where latency isn't a big issue, I've always thought that the latency needs of many kinds of caches are such that postgres wouldn't suffice, and that's why people use (say) redis or memcached.

But do you think postgres latency is actually going to be fine for many things people use redis or memcached for?

[+] armatav|2 years ago|reply

Honestly. This is 100% correct.

[+] code-e|2 years ago|reply

As the maintainer of a rabbitmq client library (not the golang one mentioned in the article) the bit about dealing with reconnections really range true. Something about the AMQP protocol seems to make library authors just... avoid dealing with it, forcing the work onto users, or wrapper libraries. It's a real frustration across languages, golang, python, JS, etc. Retry/reconnect is built in to HTTP libraries, and database drivers. Why don't more authors consider this a core component of a RabbitMQ client?

[+] borplk|2 years ago|reply

In many scenarios a DB/SQL-backed queue is far superior to the fancy queue solutions such as RabbitMQ because it gives you instantaneous granular control over 'your queue' (since it is the result set of your query to reserve the next job).

Historically people like to point out the common locking issues etc... with SQL but in modern datbases you have a good number of tools to deal with that ("select for update nowait").

If you think about it a queue is just a performance optimisation (it helps you get the 'next' item in a cheap way, that's it).

So you can get away with "just a db" for a long time and just query the DB to get the next job (with some 'reservations' to avoid duplicate processing).

At some point you may overload the DB if you have too many workers asking the DB for the next job. At that point you can add a queue to relieve that pressure.

This way you can keep a super dynamic process by periodically selecting 'next 50 things to do' and injecting those job IDs in the queue.

This gives you the best of both worlds because you can maintain granular control of the process by not having large queues (you drip feed from DB to queue in small batches) and the DB is not overly burdened.

[+] jrib|2 years ago|reply

> One of our team members has gotten into the habit of pointing out that “you can do this in Postgres” whenever we do some kind of system design or talk about implementing a new feature. So much so that it’s kind of become a meme at the company.

love it

[+] autospeaker22|2 years ago|reply

We do just about everything with one or more Postgres databases. We have workers that query the db for tasks, do the work, and update the db. Portals that are the read-only view of the work being performed, and it's pretty amazing how far we've gotten with just Postgres and no real tuning on our end. There's been a couple scenarios where query time was excessive and we solved by learning a bit more about how Postgres worked and how to redefine our data model. It seems to be the swiss army knife that allows you to excel at most general cases, and if you need to do something very specific, well at that point you probably need a different type of database.

[+] sass_muffin|2 years ago|reply

I find it funny how sometimes there are two sides to the same coin, and articles like these rarely talk about engineering tradeoffs. Just one solution good, other solution bad. I think it is a mistake for a technical discussion to not talk in terms of tradeoffs.

Obviously it makes sense to not use complex tech when simple tech works, especially at companies with a lower traffic volume. That is just practical engineering.

The inverse, however, can also be true. At super high volumes you run into issues really quickly. Just got off a 3 hour site-wide outage due to the database unable to keep up with the unprecedented queue load, and the db system basically ground to a halt. The proposed solution is actually to move off of a dedicated db queue for SQS.

This was a system running that has run well for about 10 years. Granted there was an unprecedented queue volume for this system, but sometimes a scaling ceiling is hit, and it is hit faster than you might expect from all these comments saying to always use a db always, even with all the proper indexing and optimizations.

[+] smallerfish|2 years ago|reply

We've inadvertently "load tested" our distributed locking / queue impl on postgres in production, and so I know that it can handle hundreds of thousands of "what should I run / try to take lock on task" queries per minute, with a schema designed to avoid bloat/vacuuming, tuned indices, and reasonably beefy hardware.

[+] concerned_|2 years ago|reply

RabbitMQ may have been overkill for the need, but it's also clear that there was an implementation bug which was missed.

Db queues are simple to implement and so given the volume it's one way to approach working around an mq client issue.

Personally, and I mean personally, I have found messaging platforms to be full of complexity, fluff, and non-standard "standards", it's just alot of baggage and in the case of messaging alot of bugs.

I have seen Kafka deployed and ripped out a year later, and countless bugs in client implementations due to developer misunderstanding, poor documentation, and unnecessary complexity.

For this reason, I refer to event driven systems as "expert systems" to be avoided. But in your life "there will be queues"

[+] chime|2 years ago|reply

If you don't want to roll your own, look into https://github.com/timgit/pg-boss

[+] andrewstuart|2 years ago|reply

I wrote a message queue in Python called StarQueue.

It’s meant to be a simpler reimagining of Amazon SQS.

It has an HTTP API and behaves mostly like SQS.

I wrote it to support Postgres, Microsoft’s SQL server and also MySQL because they all support SKIP LOCKED.

At some point I turned it into a hosted service and only maintained the Postgres implementation though the MySQL and SQL server code is still in there.

It’s not an active project but the code is at https://github.com/starqueue/starqueue/

After that I wanted to write the worlds fastest message queue so I implemented an HTTP message queue in Rust. It maxed out the disk at about 50,000 messages a second I vaguely recall, so I switched to purely memory only and in the biggest EC2 instance I could run it on it did about 7 million messages a second. That was just a crappy prototype so I never released the code.

After that I wanted to make the simplest possible message queue so I discovered that Linux atomic moves are the basis of a perfectly acceptable message queue that is simply file system based. I didn’t put it into a message queue, but close enough to be the same I wrote an SMTP buffer called Arnie. It’s only about 100 lines of Python. https://github.com/bootrino/arniesmtpbufferserver

[+] jasonlotito|2 years ago|reply

So, this article contains a serious issue.

What is the prefetch value for RabbitMQ mean? > The value defines the max number of unacknowledged deliveries that are permitted on a channel.

From the Article: > Turns out each RabbitMQ consumer was prefetching the next message (job) when it picked up the current one.

that's a prefetch count of 2.

The first message is unacknowledged, and if you have a prefetch count of 1, you'll only get 1 message because you've set the maximum number of unacknowledged messages to 1.

So, I'm curious what the actual issue is. I'm sure someone checked things, and I'm sure they saw something, but this isn't right.

tl;dr: prefetch count of 1 only gets one message, it doesn't get one message, and then a second.

Note: I didn't test this, so there could be some weird issue, or the documentation is wrong, but I've never seen this as an issue in all the years I've used RabbitMQ.

[+] stuff4ben|2 years ago|reply

That's my thinking as well. Seems like they're not using the tool correctly and didn't read the documentation. Oh well, let's switch to Postgres because "reasons". And now to get the features of a queuing system, you have to build it yourself. Little bit of Not Invented Here syndrome it sounds like.

[+] mannyv|2 years ago|reply

The RabbitMQ stuff seems pretty straightforward:

Channel prefetch:

https://www.rabbitmq.com/confirms.html

"Once the number reaches the configured count, RabbitMQ will stop delivering more messages on the channel unless at least one of the outstanding ones is acknowledged"

consumer prefetch:

https://www.rabbitmq.com/consumer-prefetch.html

So a prefetch count of 1 = 1 un-ACKed message -> what they want

[+] binaryBandicoot|2 years ago|reply

Agreed !

If prefetch was the issue; they could have even used AMQP's basic.get - https://www.rabbitmq.com/amqp-0-9-1-quickref.html#basic.get

[+] hangonhn|2 years ago|reply

I'm wondering if they made the mistake of acknowledging the message before the processing was done. From the article it sounds like their jobs take a long time to run so they may have acked the message to stop RabbitMQ from delivering the message to another worker for retry but IIRC there is a setting that allows you to extend the "lease" time on a message before retry.

[+] ketchupdebugger|2 years ago|reply

Maybe this is a case of an engineer who just wanted to put 'implementing a queue using postqres' on their resume.

> took half a day to implement + test

so seems like there are maybe 2 or 3 services using rabbitmq

[+] dapearce|2 years ago|reply

Love to see it. We (CoreDB) recently released PGMQ, a message queue extension for Postgres: https://github.com/CoreDB-io/coredb/tree/main/extensions/pgm...

[+] CobaltHorizon|2 years ago|reply

This is interesting because I’ve seen a queue that was implemented in Postgres that had performance problems before: the job which wrote new work to the queue table would have DB contention with the queue marking the rows as processed. I wonder if they have the same problem but the scale is such that it doesn’t matter or if they’re marking the rows as processed in a way that doesn’t interfere with rows being added.

[+] sevenf0ur|2 years ago|reply

Sounds like a poorly written AMQP client of which there are many. Either you go bare bones and write wrappers to implement basic functionality or find a fully fleshed out opinionated client. If you can get away with using PostgreSQL go for it.

[+] gorjusborg|2 years ago|reply

I'm all for simplifying stacks by removing stuff that isn't needed.

I've also used in-database queuing, and it worked well enough for some use cases.

However, most importantly: calling yourself a maxi multiple times is cringey and you should stop immediately :)

[+] TexanFeller|2 years ago|reply

Using a DB as an event queue opens up many options not easily possible with traditional queues. You can dedupe your events by upserting. You can easily implement dynamic priority adjustment to adjust processing order. Dedupe and priority adjustment feels like an operational superpower.

[+] throwaway74567|2 years ago|reply

You can get dedup with some queues, like SQS

361 comments