top | item 31743047


motakuk | 3 years ago

I agree that a multi-component architecture is harder to deploy. We did our best to prepare tooling that makes deployment easy.

There's a Helm chart (https://github.com/grafana/oncall/tree/dev/helm/oncall) and docker-compose files for hobby and dev environments.

Besides deployment, there are two main priorities for the OnCall architecture: 1) it should be as "default" as possible: no fancy tech, no hacking around; 2) it should deliver notifications no matter what.

We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.

It's important for such a tool to be built on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data: the pipeline will keep delivering to other destinations and will deliver to Slack once it's back up.
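To make the idea concrete, here is a toy, stdlib-only sketch of that retry behavior; it is not OnCall's actual code (all names here are hypothetical), and a real deployment would use a durable broker like RabbitMQ rather than an in-process queue. The point it illustrates: a task that fails against one destination is re-queued instead of dropped, so any surviving worker can retry it.

```python
import queue
import threading

# Hypothetical stand-in for a message bus. In the real stack this role
# is played by a durable broker (RabbitMQ) driving Celery workers.
bus = queue.Queue()
delivered = []

def deliver(destination, alert):
    """Pretend delivery: 'slack' is down in this toy, others succeed."""
    if destination == "slack":
        raise ConnectionError("Slack is down")
    delivered.append((destination, alert))

def worker():
    while True:
        task = bus.get()
        if task is None:  # shutdown sentinel
            bus.task_done()
            return
        destination, alert, attempts = task
        try:
            deliver(destination, alert)
        except ConnectionError:
            if attempts < 3:
                # Re-queue instead of dropping: any worker can retry later.
                bus.put((destination, alert, attempts + 1))
        bus.task_done()

# Fan one alert out to several destinations; run two workers.
for dest in ("slack", "sms", "email"):
    bus.put((dest, "disk full on db-1", 0))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
bus.join()  # waits for all tasks, including retries
for _ in threads:
    bus.put(None)
for t in threads:
    t.join()
```

In this toy, sms and email are delivered while the Slack task is retried a few times; with a persistent broker, that retry loop is what lets delivery resume once Slack comes back.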

The architecture you see in the repo has been live for 3+ years now. We've performed a few hundred data migrations without downtime, and have had no major outages or data loss. So I'm pretty happy with this choice.


gen220|3 years ago

I think your decisions were reasonable, as is the opinion of the person you're responding to.

To be fair, even in its current form, it should be possible to operate this system with SQLite (i.e. no DB server) and in-process Celery workers (i.e. no RabbitMQ) if configured correctly, assuming they're not using MySQL-specific features in the app.
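For illustration, a sketch of what that single-process configuration might look like in Django settings. `CELERY_TASK_ALWAYS_EAGER`, the sqlite3 engine, and the locmem cache backend are real Django/Celery settings, but whether OnCall would actually run correctly in this mode is the commenter's assumption, not a supported configuration:

```python
# settings_minimal.py -- hypothetical single-process setup, not from
# the OnCall repo. Assumes no MySQL-specific features are used.

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",  # no database server
        "NAME": "oncall.sqlite3",
    }
}

# Run Celery tasks synchronously in-process: no RabbitMQ broker needed.
CELERY_TASK_ALWAYS_EAGER = True
CELERY_TASK_EAGER_PROPAGATES = True

# Local-memory cache instead of Redis.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}
```

The trade-off is exactly the one discussed downthread: you lose the no-single-point-of-failure property, since everything now lives in one process on one node.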

Using a message bus, a persistent data store behind a SQL interface, and a caching layer are all good design choices. I think the OP's concern is less with your particular implementations, and more with the principle of preventing operators from bringing their own preferred implementation of those interfaces to the table.

They mentioned that it makes sense because you were a standalone product, so stack portability was less of a concern. But as FOSS, you're opening yourself up to different standards on portability.

It requires some work from the maintainer to make the application tolerant to different fulfillments of the same interfaces. But it's good work. It usually results in cleaner separation of concerns between application logic and caching/message bus/persistence logic, for one. It also allows your app to serve a wider audience: for example, those who are locked into Postgres/Kafka/Memcached.

raffraffraff|3 years ago

Nothing wrong with that. I managed 7+ Sensu "clusters" at a previous job, and its stack was a Ruby server, Redis, and RabbitMQ. But I completely ditched RabbitMQ and used Redis for both the queue and the data. Simpler, more performant, and more reliable (even if the feature was marked experimental). Our alerts were really spammy, and we had ~8k servers (each running a bunch of containers) per cluster, so these things were busy. Each cluster was 3x small nodes (6GB memory, 2 CPUs). Memory usage was minuscule, typically <300MB. Any box could be restarted without any impact because Redis just failed over and Sensu was horizontally scalable.

I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.
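Redis-as-queue setups like the one described above usually lean on Redis's reliable-queue pattern: an item is atomically moved to a per-worker processing list (BRPOPLPUSH, or LMOVE in newer Redis), so a crashed worker's in-flight items can be pushed back onto the main queue. A stand-in sketch with plain Python structures, purely to show the shape of the pattern (the function names are hypothetical):

```python
from collections import deque

main_queue: deque = deque()
processing: list = []  # per-worker "processing list", as Redis LMOVE would keep

def enqueue(item: str) -> None:
    main_queue.appendleft(item)  # like LPUSH

def reserve() -> "str | None":
    """Atomically move an item to the processing list (like BRPOPLPUSH/LMOVE)."""
    if not main_queue:
        return None
    item = main_queue.pop()  # like RPOP
    processing.append(item)
    return item

def ack(item: str) -> None:
    """Remove from the processing list once handled (like LREM)."""
    processing.remove(item)

def recover() -> None:
    """If a worker died, push its in-flight items back onto the main queue."""
    while processing:
        enqueue(processing.pop())
```

Because an item is never simply popped and forgotten, a worker crash between `reserve` and `ack` loses nothing; a supervisor calling `recover` restores the item for another worker.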

Deritio|3 years ago

Your message bus assumption sounds like one of the most ridiculous claims I've heard.

Sorry, but why is RabbitMQ really necessary?

slotrans|3 years ago

You don't need Rabbit, Celery, or Redis. You should be able to replace MySQL with SQLite. Then it would be radically easier to deploy.

throwaway892238|3 years ago

A MySQL database cluster, and a local copy of a SQL database on a single file on a single filesystem, are not close to the same thing. Except they both have "SQL" in the name.

One of them allows a thousand different nodes on different networks to share a single dataset with high availability. The other can't share data with any other application, doesn't have high availability, is constrained by the resources of the executing application node, has obvious performance limits, limited functionality, no commercial support, etc etc.

And we're talking about a product that's intended for dealing with on-call alerts. The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

I know the HN hipsters are all gung-ho for SQLite, but let's try to rein in the hype train.

sergiomattei|3 years ago

It’s curious to see people questioning the stack choices of apps they haven’t built, for problems they haven’t faced.

They chose this stack, it works for them. They’ve put it through its paces in production.

It’s as boring as it gets.