Grafana OnCall: an easy-to-use on-call management tool

232 points| sciurus | 4 years ago |grafana.com | reply

74 comments

[+] steveBK123|4 years ago|reply
For a product that's been around 12 years, I've been surprised at how minimally featured PagerDuty is.

Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.

PD schedule checking and trade negotiation becomes yet another thing in the long list of things I need to do when taking a day off. HR system request off, Department Outlook calendar update, PagerDuty coverage check, Outlook out-of-office status & auto-replies, Slack set away, update status AND pause notifications.

I suppose that's because as an on-call developer I am not the user. The user, management who bought the product, gets KPIs & pretty graphs, so they are happy.

[+] dharmab|4 years ago|reply
My least favorite thing about PagerDuty is the phone call notification. I drive a car from 2001, and with a cheap Bluetooth upgrade, I can do all of these with my voice while driving:

- Get directions to anywhere on the continent

- Send and receive texts to my friends

- Answer and take a call from a human

But if PagerDuty calls me, Stephen Hawking's speech synthesizer brusquely yells at me and demands I take my hands off the wheel and press a button on my phone to acknowledge the alert. No voice recognition, no ability to kick off an automated play. It's a time portal to 1997! Even the _banks_ have friendlier phone automation these days!

[+] ethbr0|4 years ago|reply
Every delightful, successful developer product is eventually doomed to become JIRA.
[+] jldugger|4 years ago|reply
> Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.

Do you shut down your service for Labor Day? I don't.

I do agree that trading on-call shifts is not very easy within the UI. Part of me dreams of being able to make enough advantaged trades to end up never on-call, like the padre who doubled his holdings in a WW2 POW camp: https://www.ft.com/content/c523efe6-9973-11e1-9a57-00144feab...

[+] motakuk|4 years ago|reply
Hey everyone, Matvey, ex-CEO of Amixr, here. Ildar Iskhakov and I started this project three years ago because we used to be on-call ourselves and needed better tools. It was an amazing journey from 0 to 1. Tons of coding, first customers, fundraising, iterating, and finally the honor to join Grafana Labs and build Grafana OnCall! I'll be happy to answer your questions if you have any.
[+] joaoqalves|4 years ago|reply
It's great to see more competition in this space. Generally speaking, what I miss in these "incident management" products is also an integrated, flawless way to handle incidents when they're happening. I'm talking about:

1. Quickly creating a proper chat

2. Quickly creating an incident document where you can pin chat messages and use it in the post-mortem. Ideally, pinning some graphs that you'd extract from your observability solutions

3. Having a status page to put a small description for non-technical stakeholders.

PagerDuty covers some of this. Monzo's Response [1] and now incident.io [2] try to cover it too. I'd like to have this experience end-to-end.

[1] https://github.com/monzo/response

[2] https://incident.io/

[+] SeriousM|4 years ago|reply
Hi! Thanks for sharing this news. Will this be available for on-premise installations, and when?
[+] bilalq|4 years ago|reply
This looks really neat. We don't use Grafana today. We're running CloudWatch/insights and Squadcast for alerting, but deep integration with the monitoring tool looks cool. Is this usable with self-hosted or AWS managed Grafana?
[+] tex0|4 years ago|reply
Is there automatic planning of upcoming shifts and compensation accounting? Or do you have to do that manually?
[+] CSDude|4 years ago|reply
> Alerts from each integration: 300 per 5 minutes

> Alerts from the whole team: 500 per 5 minutes

> API requests per API key: 300 per 5 minutes

Product looks great but those API request limits are too low, because alerts rain when you are having incidents and rate limiting all of them is harmful. That's why other products have deduplication keys / aliases so you don't miss important ones.

https://grafana.com/docs/grafana-cloud/oncall/oncall-api-ref...
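To make the deduplication-key idea concrete, here is a minimal sketch of how a receiver could collapse an alert storm into a handful of groups before any rate limit applies. This is an illustration only, not how Grafana OnCall or any other product implements it; `Deduplicator`, `AlertGroup`, and the `dedup_key` argument are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AlertGroup:
    """One open group per dedup key (hypothetical model)."""
    dedup_key: str
    first_payload: dict
    count: int = 1


class Deduplicator:
    """Collapse alerts that share a dedup key into a single group."""

    def __init__(self) -> None:
        self.groups = {}  # dedup_key -> AlertGroup

    def ingest(self, dedup_key: str, payload: dict) -> bool:
        """Return True if the alert opened a new group, False if it was merged."""
        group = self.groups.get(dedup_key)
        if group is None:
            self.groups[dedup_key] = AlertGroup(dedup_key, payload)
            return True
        group.count += 1
        return False


if __name__ == "__main__":
    dedup = Deduplicator()
    # Simulate a storm: 1000 alerts that come from only three failing checks.
    opened = [dedup.ingest(f"db-{i % 3}", {"seq": i}) for i in range(1000)]
    print(sum(opened), "groups opened for", len(opened), "alerts")
```

With a scheme like this, only alerts that open a new group would need to notify anyone, so a storm from a few failing checks stays far below a per-integration limit.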

[+] deeblering4|4 years ago|reply
I'd think that receiving even 1/5th the rate limit in a 5 minute window would be disorienting enough to render alerting effectively useless.

I'd question the configuration which fires that many alerts in that time frame, and suggest improving alert aggregations and dependencies to get the number down to one or a handful of meaningful alerts.

[+] dharmab|4 years ago|reply
I was once in a job where I was solo on call for tens of thousands of cores globally and at worst we had like 2000 alerts in a week. These limits seem quite high to me.
[+] CameronNemo|4 years ago|reply
> That's why other products have deduplication keys / aliases so you don't miss important ones.

Care to link to the docs? I'm interested.

[+] named-user|4 years ago|reply
How else do you think they are gonna make money?
[+] halfmatthalfcat|4 years ago|reply
Is there really anybody else in the "Pager" category of SaaS products other than PagerDuty that has any traction?
[+] therealdrag0|4 years ago|reply
We use OpsGenie. Not sure how widely it's used, but given it's Atlassian, I'd guess a non-trivial amount.
[+] haliskerbas|4 years ago|reply
Technically Splunk On-Call. But I have a few pain points with it, and I miss PagerDuty.

If you want to see which teams you are on as the currently logged-in user, the only way to do it, as far as support told me, is to search for yourself and then check that result.

[+] bilalq|4 years ago|reply
We started using Squadcast: https://squadcast.com

Their free and lower-priced tiers offer a lot of what others have on their top/most expensive tiers. Also, integrations with various alert sources are just easier in most cases. I spent I don't know how long trying to get OpsGenie to work before I gave up.

[+] kenrose|4 years ago|reply
PD does two big things:

  1. Alerting: Phones you when your servers are down.
  2. Incident Management: Helps coordinate a response across multiple people.

For the first, there's also:

  - OpsGenie (owned by Atlassian)
  - Squadcast
  - VictorOps (now Splunk On-Call)
  - xMatters
  - PagerTree

For the second, there's a bunch of new contenders:

  - Datadog now has an IM product
  - Blameless
  - Rootly
  - Incident.io
  - FireHydrant
[+] varlogix|4 years ago|reply
Spike.sh - https://spike.sh

I may be biased as a co-founder of Spike.sh, but I think we have one of the best-designed incident management products out there. We've focused on making it easy to create on-call schedules and overrides, and added templates for escalation, on-call and alert rules.

[+] itsjloh|4 years ago|reply
I use VictorOps (Now Splunk On-Call) currently and it does the job. Its shift override functionality is quite confusing to get your head around at first but makes sense after the first few times.

I've also used OpsGenie (Atlassian now) and really enjoyed it. The number of integrations they have is staggering.

[+] bgm1975|4 years ago|reply
There's Splunk On-Call (formerly known as VictorOps). It's a very decent solution.
[+] abhishekjha|4 years ago|reply
Also, what happens if PagerDuty goes down?
[+] markbnj|4 years ago|reply
I'm a Grafana fan and a current user of PagerDuty. Maybe there's more to the story, but after reading the post I feel like using a calendar integration to manage on-call schedules is the wrong approach. Calendar events are the result of overlaying a rotation on a date range: they're the output, not the input. I'm sure the designers here have looked at how PD enables creating and editing rotations. Curious to know their views on it.
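A rough sketch of the "rotation in, calendar events out" idea, assuming a simple round-robin rotation with fixed-length shifts. The function name and parameters are made up for illustration and do not reflect how PagerDuty or Grafana OnCall model schedules.

```python
from datetime import date, timedelta


def expand_rotation(members, start, shift_days, until):
    """Yield (shift_start, shift_end, member) for a round-robin rotation.

    shift_end is exclusive, and the last shift is clipped to `until`.
    All names here are illustrative assumptions.
    """
    shift = timedelta(days=shift_days)
    current, index = start, 0
    while current < until:
        yield current, min(current + shift, until), members[index % len(members)]
        current += shift
        index += 1


if __name__ == "__main__":
    # The rotation is the input; the printed shifts are the derived "calendar events".
    for begin, end, who in expand_rotation(
        ["ana", "bo"], date(2021, 11, 1), 7, date(2021, 11, 22)
    ):
        print(begin, "to", end, who)
```

Editing the rotation (members, start date, shift length) and regenerating the events keeps the calendar a derived view rather than the source of truth.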
[+] vvoyer|4 years ago|reply
Shameless plug, if you're looking for a simple shift scheduling calendar connected to Slack, I built this: https://turnshift.app.

It's a team calendar to share recurring tasks as a team. Things like PR reviews, who's on support, or who's qualifying leads.

It has far fewer features than PagerDuty or Grafana OnCall, but it serves a bunch of customers well who are looking for a simple tool to manage team schedules.

[+] moepstar|4 years ago|reply
A few more screenshots of the "Scheduling" options would've been great...

We're (more or less) using OpsGenie's free tier, however their scheduling never really "clicked" with me... not sure if I'm special in that regard, however I find the UI/UX pretty... weird...

[+] kungfufrog|4 years ago|reply
I'm not sure what this is competing with in its current incarnation.

I need corresponding mobile phone applications for any alert product I intend to use that can override DND/volume etc. on my phone so I can get woken up at night and respond to problems.

[+] marcoboffi|4 years ago|reply
But is it possible to send SMS/phone calls directly from Grafana OnCall? If yes, what is the pricing?