For a product that's been around 12 years, I've been surprised at how minimally featured PagerDuty is.
Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.
PD schedule checking and trade negotiation become yet another item on the long list of things I need to do when taking a day off: HR system request off, department Outlook calendar update, PagerDuty coverage check, Outlook out-of-office status & auto-replies, Slack set away, update status AND pause notifications.
I suppose that's because as an on-call developer I am not the user. The user, management who bought the product, gets KPIs & pretty graphs, so they are happy.
My least favorite thing about PagerDuty is the phone call notification. I drive a car from 2001, and with a cheap Bluetooth upgrade, I can do all of these with my voice while driving:
- Get directions to anywhere on the continent
- Send and receive texts to my friends
- Answer and take a call from a human
But if PagerDuty calls me, Stephen Hawking's speech synthesizer brusquely yells at me and demands I take my hands off the wheel and press a button on my phone to acknowledge the alert. No voice recognition, no ability to kick off an automated play. It's a time portal to 1997! Even the _banks_ have friendlier phone automation these days!
> Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.
Do you shut down your service for Labor Day? I don't.
I do agree that trading on-call shifts is not very easy within the UI. Part of me dreams of being able to make enough advantaged trades to end up never on-call, like the padre who doubled his holdings in a WW2 POW camp: https://www.ft.com/content/c523efe6-9973-11e1-9a57-00144feab...
Hey everyone, Matvey, ex-CEO of Amixr, here. Ildar Iskhakov and I started this project three years ago because we used to be on-call ourselves and needed better tools. It was an amazing journey from 0 to 1. Tons of coding, first customers, fundraising, iterating, and finally the honor to join Grafana Labs and build Grafana OnCall! I'll be happy to answer your questions if you have any.
It's great to see more competition in this space. Generally speaking, what I miss in these "incident management" products is also an integrated, flawless way to handle incidents when they're happening. I'm talking about:
1. Quickly creating a proper chat
2. Quickly creating an incident document where you can pin chat messages and use it in the post-mortem. Ideally, pinning some graphs that you'd extract from your observability solutions
3. Having a status page to put a small description for non-technical stakeholders.
PagerDuty covers some of this. Monzo's Response [1] and now incident.io [2] try to cover it too. I'd like to have this experience end-to-end.
This looks really neat. We don't use Grafana today. We're running CloudWatch/insights and Squadcast for alerting, but deep integration with the monitoring tool looks cool. Is this usable with self-hosted or AWS managed Grafana?
Product looks great but those API request limits are too low, because alerts rain when you are having incidents and rate limiting all of them is harmful. That's why other products have deduplication keys / aliases so you don't miss important ones.
I'd think that receiving even 1/5th the rate limit in a 5 minute window would be disorienting enough to render alerting effectively useless.
I'd question the configuration which fires that many alerts in that time frame, and suggest improving alert aggregations and dependencies to get the number down to one or a handful of meaningful alerts.
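To make the deduplication-key and aggregation idea from the comments above concrete, here is a minimal sketch in Python. It does not use any vendor's API: the alert fields (`service`, `check`, `message`) and the choice of `service:check` as the key are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class AlertGroup:
    """One page-worthy group of alerts that share a deduplication key."""
    dedup_key: str
    first_message: str
    count: int = 0


def group_alerts(alerts):
    """Collapse raw alerts into one group per deduplication key.

    `alerts` is an iterable of dicts with the hypothetical fields
    "service", "check" and "message". The key used here (service + check)
    is an assumption; a real setup would pick whatever identifies
    "the same problem" for that team.
    """
    groups = {}  # dicts keep insertion order, so groups stay in arrival order
    for alert in alerts:
        key = f'{alert["service"]}:{alert["check"]}'
        if key not in groups:
            groups[key] = AlertGroup(dedup_key=key, first_message=alert["message"])
        groups[key].count += 1
    return list(groups.values())


raw_alerts = [
    {"service": "api", "check": "latency", "message": "p99 above 2s on host-1"},
    {"service": "api", "check": "latency", "message": "p99 above 2s on host-2"},
    {"service": "db", "check": "disk", "message": "disk 95% full on db-1"},
    {"service": "api", "check": "latency", "message": "p99 above 2s on host-3"},
]

grouped = group_alerts(raw_alerts)
for group in grouped:
    print(f"{group.dedup_key}: {group.count} alert(s), first: {group.first_message}")
```

With grouping like this, four raw alerts turn into two notifications, which is the kind of reduction both sides of the rate-limit argument are asking for.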
I was once in a job where I was solo on call for tens of thousands of cores globally and at worst we had like 2000 alerts in a week. These limits seem quite high to me.
Technically Splunk On-Call. But I have a few pain points with it, and I miss PagerDuty.
If you want to see which teams you are on as the currently logged-in user, the only way to do it, as far as support told me, is to search for yourself and then check that result.
I've been seeing them recommended more and more, and I've been keeping a passive eye on BetterUptime (which has an on-call feature): https://betteruptime.com/incident-management
Their free and lower-priced tiers offer a lot of what others have on their top/most expensive tiers. Also, integrations with various alert sources are just easier in most cases. I spent I don't know how long trying to get OpsGenie to work before I gave up.
I may be biased as a co-founder of Spike.sh, but I think we have one of the best-designed incident management products out there. We've focused on making it easy to create on-call schedules and overrides, and added templates for escalation, on-call and alert rules.
I use VictorOps (now Splunk On-Call) currently and it does the job. Its shift override functionality is quite confusing to get your head around at first but makes sense after the first few times.
I've also used OpsGenie (Atlassian now) and really enjoyed it. The number of integrations they have is staggering.
I'm a Grafana fan and a current user of PagerDuty. Maybe there's more to the story, but after reading the post I feel like using a calendar integration to manage on-call schedules is the wrong approach. Calendar events are a result of overlaying a rotation on a date range: they're the output, not the input. I'm sure the designers here have looked at how PD enables creating and editing rotations. Curious to know their views on it.
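The "rotation is the input, calendar events are the output" point can be sketched in a few lines of Python. This is only an illustration of the idea, not how PagerDuty or Grafana OnCall actually model schedules; the function name, the people and the dates are all made up.

```python
from datetime import date, timedelta


def expand_rotation(people, start, end, shift_days=7):
    """Return (shift_start, shift_end, person) tuples covering [start, end).

    The rotation (an ordered list of people plus a shift length) is the
    input; the calendar entries are derived from it by walking the date
    range one shift at a time.
    """
    if not people:
        raise ValueError("a rotation needs at least one person")
    shifts = []
    current = start
    index = 0
    while current < end:
        # Clip the last shift so it never runs past the requested range.
        shift_end = min(current + timedelta(days=shift_days), end)
        shifts.append((current, shift_end, people[index % len(people)]))
        current = shift_end
        index += 1
    return shifts


# Hypothetical three-person weekly rotation over four weeks.
shifts = expand_rotation(["ana", "bo", "cy"], date(2022, 1, 3), date(2022, 1, 31))
for shift_start, shift_end, person in shifts:
    print(f"{shift_start} to {shift_end}: {person}")
```

Editing the rotation (adding a person, changing `shift_days`) and regenerating the events is then a single step, whereas editing individual calendar events leaves the underlying rotation implicit.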
A few more screenshots of the "Scheduling" options would've been great...
We're (more or less) using OpsGenie's free tier, however their scheduling never really "clicked" with me... not sure if I'm special in that regard, however I find the UI/UX pretty... weird...
I'm not sure what this is competing with in its current incarnation.
I need corresponding mobile phone applications for any alert product I intend to use that can override DND/volume etc. on my phone so I can get woken up at night and respond to problems.
[+] [-] steveBK123|4 years ago|reply
[+] [-] dharmab|4 years ago|reply
[+] [-] ethbr0|4 years ago|reply
[+] [-] jldugger|4 years ago|reply
[+] [-] motakuk|4 years ago|reply
[+] [-] joaoqalves|4 years ago|reply
[1] https://github.com/monzo/response
[2] https://incident.io/
[+] [-] SeriousM|4 years ago|reply
[+] [-] bilalq|4 years ago|reply
[+] [-] tex0|4 years ago|reply
[+] [-] CSDude|4 years ago|reply
> Alerts from the whole team 500 5 minutes
> API requests per API key 300 5 minutes
https://grafana.com/docs/grafana-cloud/oncall/oncall-api-ref...
[+] [-] deeblering4|4 years ago|reply
[+] [-] dharmab|4 years ago|reply
[+] [-] CameronNemo|4 years ago|reply
Care to link to the docs? I'm interested.
[+] [-] named-user|4 years ago|reply
[+] [-] halfmatthalfcat|4 years ago|reply
[+] [-] Forfold|4 years ago|reply
[+] [-] therealdrag0|4 years ago|reply
[+] [-] saminzadeh|4 years ago|reply
[+] [-] haliskerbas|4 years ago|reply
[+] [-] dvtrn|4 years ago|reply
[+] [-] fredman|4 years ago|reply
Disclaimer: I work at xMatters.
[+] [-] bilalq|4 years ago|reply
[+] [-] armiiller|4 years ago|reply
[+] [-] kenrose|4 years ago|reply
[+] [-] varlogix|4 years ago|reply
[+] [-] itsjloh|4 years ago|reply
[+] [-] bgm1975|4 years ago|reply
[+] [-] abhishekjha|4 years ago|reply
[+] [-] aiisjustanif|4 years ago|reply
[+] [-] bboreham|4 years ago|reply
[+] [-] markbnj|4 years ago|reply
[+] [-] vvoyer|4 years ago|reply
It's a team calendar to share recurring tasks as a team. Things like PR reviews, who's on support, or who's qualifying leads.
It has far fewer features than PagerDuty or Grafana OnCall, but it serves a bunch of customers well who are looking for a simple tool to manage team schedules.
[+] [-] moepstar|4 years ago|reply
[+] [-] kungfufrog|4 years ago|reply
[+] [-] marcoboffi|4 years ago|reply