Looking forward to trying this out. I've always felt that PagerDuty was absurdly expensive for the feature set they were offering. It costs at least $250 per user for organizations larger than 5 people - even if you're not an engineer who is ever directly on call. At my previous company, IT had to regularly send surveys to employees to assess whether they really needed a PagerDuty account. Alerts are key information in an organization that runs software in production, and you shouldn't have to pay $250 / month just to have some visibility into them. I'm hoping Grafana OnCall is able to fully replace PagerDuty.
* Outsource in-house infra to cloud. This begets lock-in as every engineer is doing heaven knows what with Lambda. Still need that huge infra team to manage AWS.
* Outsource in-house metrics and visibility to SignalFx, Splunk, DataDog, NewRelic, etc. Still need a team to manage it. Costs get raised by more than double because we're beholden, so now we need to fund 20+ engineer quarters to migrate everything ASAP.
* Feature flagging system built in house works like a charm and needs one engineer for maintenance. Let's fund a team to migrate it all to LaunchDarkly. Year+ later and we still don't have proper support or rollout and their stuff doesn't work as expected.
Madness.
Expensive madness.
SaaS won't magically reduce your staffing needs. Open source solutions won't reduce your staffing needs either, but they'll make costs predictable. As these tools become more prevalent and standard, you can even hire experts for them.
> I've always felt that PagerDuty was absurdly expensive for the feature set they were offering
For anyone out there in the same spot, I'll say that I switched my last company to Atlassian's OpsGenie and it was a 10x cost savings for the same feature set.
I knew Pagerduty was going down the toilet when their sales folks started aggressively pitching BS products nobody really needed before their IPO. They couldn’t even release a proper incident management tool.
I really hope this project gets good enough to ditch PD. PD should literally lay off most of its staff and just maintain the existing product, cut costs and focus mostly on integrations. There is no way they have any other future.
Agreed. Cost really is the big selling point of Grafana Cloud - it’s far, far cheaper than most competitors, and good enough. Not as good as NewRelic, DataDog, etc., but you get good enough metrics, logs, alerts, distributed tracing, and now incident management, at an excellent price.
Matvey Kukuy, ex-CEO of Amixr and head of the OnCall project here. We've been working hard for a few months to make this OSS release happen. I believe it should make incident response features (on-call rotations, escalations, multi-channel notifications) and best practices more accessible to the wider audience of SRE and DevOps engineers.
Hope someone will be able to finally sleep well at night being sure that OnCall will handle escalations and will alert the right person :)
Please join our community on GitHub! The whole Grafana OnCall team is there to help you and to make this thing better.
> it should make incident response features (on-call rotations, escalations, multi-channel notifications) and best practices more accessible to the wider audience
I love Grafana, don't get me wrong, but I have the sensation they are now in that position where companies that got a massive capital injection, and therefore a massive increase in engineering capacity, release too much, too soon.
It doesn't have anything to do, of course, with the fact that this morning we suddenly found that all our dashboards had stopped working because we were upgraded to Grafana v9, for which there is no stable release nor documentation of breaking changes.
I apologize for the disruption we caused you when rolling out Grafana 9. We are working on improving our releases to Grafana Cloud and also on making sure that errors due to breaking changes in a major release won't affect customers in the future. As a Grafana Cloud customer, you shouldn't need to read docs about breaking changes when we upgrade your instance.
It's surprising how seemingly difficult it is to build a good on-call scheduling system. Everything I tried so far (not naming the companies here) felt like the UX was the last thing on the developers' minds. Which is tolerable during business hours but really annoying at 2am.
Is there some hidden complexity or is it just a consequence of engineers building a product for other engineers? Also, any tips what worked for you?
Have had lots of bad experiences with that from Pagerduty at least. Want to generate a schedule far in advance, so people know when they will be oncall and can plan/switch.
Of course, in a few months we may have some new people having joined, some quit, or other circumstances. A single misclick when fixing that can invalidate the whole schedule and generate another. Infuriating.
Or the UI itself. It might have become better the last two years, but having to click "next week" tens of times to see when I was scheduled (since I wasn't just interested in my next scheduled time but all of them) was annoying.
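A naive schedule generator shows why this is so fragile. This is a hypothetical sketch (not how PagerDuty or OnCall actually computes schedules): when the rotation is a pure function of the member list, one personnel change re-indexes every future shift.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Naive round-robin: week i belongs to engineers[i % len(engineers)]."""
    return [(start + timedelta(weeks=i), engineers[i % len(engineers)])
            for i in range(weeks)]

# Generate six weeks in advance so people can plan around it.
before = weekly_rotation(["alice", "bob", "carol"], date(2022, 6, 13), 6)

# One change (bob quits) reshuffles every future shift, invalidating
# plans alice and carol already made around the old schedule.
after = weekly_rotation(["alice", "carol"], date(2022, 6, 13), 6)
```

A tool that regenerates the schedule from scratch on every edit behaves exactly like this sketch, which is why a single misclick can invalidate everything downstream; preserving already-published shifts requires storing them, not recomputing them.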
A bit disappointed by the architecture -- it's a Django stack with MySQL, Redis, RabbitMQ, and Celery -- for what is effectively AlertManager (a single golang binary) with a nicer web frontend + Grafana integration + etc.
I'm curious why/if this architecture was chosen. I get that it started as a standalone product (Amixr), but in the current state it is hard to rationalize deploying this next to Grafana in my current containerless setting.
Besides deployment, there are two main priorities for OnCall architecture:
1) It should be as "default" as possible. No fancy tech, no hacking around
2) It should deliver notifications no matter what.
We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.
It's important for such a tool to be built on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data: delivery continues to the other destinations, and Slack gets the alert once it's back up.
The architecture you see in the repo has been live for 3+ years now. We were able to perform a few hundred data migrations without downtime, and had no major outages or data loss. So I'm pretty happy with this choice.
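In OnCall this fan-out runs as Celery tasks on RabbitMQ; as an illustration only (function and destination names here are made up), the per-destination delivery with independent retries can be sketched in plain Python:

```python
def deliver(alert, destinations, max_attempts=3):
    """Fan one alert out to every destination independently.

    A failing destination (e.g. Slack being down) is retried and never
    blocks delivery to the others -- the property described above.
    In a real message-bus system each send would be a queued task.
    """
    results = {}
    for name, send in destinations.items():
        for _ in range(max_attempts):
            try:
                send(alert)
                results[name] = "delivered"
                break
            except ConnectionError:
                continue
        else:
            results[name] = "re-enqueued"  # a worker retries it later
    return results

attempts = {"n": 0}
def flaky_slack(alert):          # fails once, then recovers
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("Slack is down")

print(deliver("CPU high", {"slack": flaky_slack, "sms": lambda a: None}))
# → {'slack': 'delivered', 'sms': 'delivered'}
```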
> Django stack with MySQL, Redis, RabbitMQ, and Celery
MySQL is a weird if not slightly disturbing choice. Other than that it's a boring, battle-tested stack that is relatively easy to scale. I agree that Go is nicer, but I'm biased by several years of dealing with horrific Flask / Django projects.
One of the most frustrating aspects of being a software engineer is dealing with others who love to over-engineer. Unfortunately, they make enough noise about complex solutions being necessary that managers get scared of choosing easier, simpler options.
This. I find open source projects written in Go or Rust are usually more pleasant to work with than Java, Django or Rails, etc. They have less clunky dependencies, are less resource-hungry, and can ship with single executables which make people's life much easier.
That's a tried and true stack, and a very good one for maintaining sane levels of reliability, consistency, durability etc. Resource wise, at least with Celery, RabbitMQ and Django, they're also pretty lean.
It even ships in containers along with Docker Compose files and Helm charts, which would suit the deployment use cases of 99% of users. I understand that you're not using containers, but I don't think that's a limitation that many are inflicting upon themselves as of late, and if pressed, installing Docker Compose takes about 5 minutes and you don't have to think about it again.
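For the curious, the moving parts look roughly like this in Compose terms. This is an illustrative sketch only (service names and commands are made up; the repo's own docker-compose.yml is the authoritative version):

```yaml
version: "3.9"
services:
  engine:             # Django web/API process
    image: grafana/oncall
    depends_on: [mysql, redis, rabbitmq]
  celery:             # background workers that actually deliver notifications
    image: grafana/oncall
    command: celery   # placeholder; see the repo for the real entrypoint
    depends_on: [mysql, redis, rabbitmq]
  mysql:
    image: mysql:5.7
  redis:
    image: redis:6
  rabbitmq:
    image: rabbitmq:3
```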
not gonna argue that a single binary is the ultimate deploy solution but running a django app is not that difficult (although i am biased cause i do that for a living).
i love django projects but mysql, celery and rabbitmq -- no thanks.
Hey HN, Ildar here, one of the co-founders of Amixr and one of the software engineers behind Grafana OnCall. We finally open-sourced the product, and I'm really excited about that. Please try it out and leave your feedback!
It's very much our aim to make this mix of self-hosted and cloud services as easy as going all-cloud, but I agree we're not quite there yet.
Do you mind if I ask what isn't super-easy about linking self-hosted loki search queries with SaaS-Prometheus? You should be e.g. able to add a Prometheus data source to your local Grafana (or securely expose your Loki to the internet and add a Loki data source to your Cloud Grafana)
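For reference, pointing a local Grafana at a hosted Prometheus can be done with one provisioning file. The values below are placeholders; the real URL, user ID, and API key come from your Grafana Cloud stack details page:

```yaml
# e.g. /etc/grafana/provisioning/datasources/cloud-prometheus.yaml (placeholder path)
apiVersion: 1
datasources:
  - name: Cloud Prometheus
    type: prometheus
    access: proxy
    url: https://prometheus.example.grafana.net/api/prom   # placeholder
    basicAuth: true
    basicAuthUser: "123456"                                # placeholder stack user ID
    secureJsonData:
      basicAuthPassword: $CLOUD_API_KEY
```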
But apparently when you are grafana everything looks like a dashboard UI?
Joke aside I will have a look but I didn't like the screenshots before already. I like the dashboardy thing for dashboards but otherwise it's not a really good UI system for everything else.
I would give a huge marketing bullshit award for the following sentence:
<<We offered Grafana OnCall to users as a SaaS tool first for a few reasons. It’s a commonly shared belief that the more independent your on-call management system is, the better it will be for your entire operation. If something goes wrong, there will be a “designated survivor” outside of your infrastructure to help identify any issues. >>
They tried to ensure that you use their SaaS offering because they care more about your own good than yourself. So humanist...
The point isn't that their infrastructure is more reliable than yours, but that it's decoupled from yours. If you run your monitoring on the same infra as production, it's liable to go down when production does, i.e. just when you need it most. This is a real reason to outsource monitoring to a SaaS, just like there are real reasons to self-host.
I mean, obviously they chose to address the segment of the market they could get more money out of first; I'm not contesting that. But the bit you quoted is low-grade bullshit at best. Hardly award-winning.
Why is that unfortunate? Unless you're looking to make proprietary changes to Grafana OnCall and host it as a SaaS, it's the same as running any other GPL software.
echelon|3 years ago
"Business should focus on its core competency"
the_duke|3 years ago
Seems like the /main is the culprit.
[1] https://grafana.com/docs/oncall/main/.
Tao3300|3 years ago
Is that a net positive?
pachico|3 years ago
Luckily they rolled back our account.
anyfactor|3 years ago
AGPL 3.0
motakuk|3 years ago
Helm (https://github.com/grafana/oncall/tree/dev/helm/oncall), docker-composes for hobby and dev environments.
vhold|3 years ago
https://prometheus.io/docs/introduction/overview/#architectu...
https://kubernetes.io/docs/concepts/overview/components/
goodpoint|3 years ago
Complexity comes at a steep price when something critical (e.g. OnCall) breaks and you have to debug it in a hurry.
Shoving everything in a container and closing the lid does not help.
MarquesMa|3 years ago
Just think about Gitea vs GitLab.
martypitt|3 years ago
A minor note, if anyone from Grafana is around - a bunch of the links on the bottom of the announcement go to a 404.
dString|3 years ago
A quick look at OnCall suggests it is more for managing fired alerts than firing alerts.
Their own screenshot has AlertManager as an alert source.
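That matches how such an integration is usually wired: Alertmanager keeps doing the evaluation and grouping, and a webhook receiver hands fired alerts to OnCall for escalation. A hedged sketch (the URL is a placeholder; OnCall generates one per integration):

```yaml
# Fragment of an Alertmanager configuration routing alerts to OnCall
route:
  receiver: grafana-oncall
receivers:
  - name: grafana-oncall
    webhook_configs:
      - url: https://oncall.example.com/integrations/v1/alertmanager/<token>/  # placeholder
        send_resolved: true
```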
sandstrom|3 years ago
For example, we need to run Loki ourselves, for security / privacy reasons, but wouldn't mind using hosted versions of Tempo, Prometheus and OnCall.
Right now it isn't super-easy to link e.g. self-hosted loki search queries with SaaS-Prometheus.
Deritio|3 years ago
I'm annoyed by their license choice.
NonNefarious|3 years ago
Of course the article isn't much better. It reads like a joke, the joke being that "on-call management" doesn't mean anything.
machinerychorus|3 years ago
https://pushover.net/
JimXugle|3 years ago
https://goalert.me/
this_was_posted|3 years ago
for someone at grafana; noticed a dead link in the post: https://grafana.com/docs/oncall/main/