lngarner|3 years ago
Hi! Thanks for asking. Basically, status pages get updated manually: people decide whether and when an outage is bad enough to warrant a status page update. We monitor actual functionality, so we capture smaller glitches that either escape human attention altogether or never get escalated to the point where the status page is updated.
In more detail, this can happen for three reasons:
1.) We use functional testing, so we're simply showing which aspects of the platform are working and which aren't (see the sketch after this list). Because of how "outages" are defined in SLAs, vendors like Datadog might not disclose or categorize certain dysfunctions as outages, so they won't show them on their status page. In other words, some outages are considered too "minor" to make the status page.
2.) Status pages are manual; Metrist is automatic. DD might not have updated yet, or might not even be fully aware of the outage. Our tests just show the objective data as it happens.
3.) Everyone experiences outages differently. The data in the demo is Metrist's experience with Datadog, which can differ slightly from other people's (another reason status pages can be vague). That's why we have an orchestrator that lets people set up personalized monitoring, so they know exactly how a vendor is affecting them in real time, and whether an outage is even relevant to them.
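To make "functional testing" a bit more concrete: each check is essentially a scripted call against the vendor's real API that records pass/fail and latency. A heavily simplified sketch (hypothetical endpoint, not our actual probe code):

    # A minimal functional probe -- hypothetical endpoint, not production code.
    import time
    import urllib.request

    def check_endpoint(url, timeout=10.0):
        """Call a real vendor endpoint the way a customer would."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = 200 <= resp.status < 300
        except Exception:
            ok = False
        return {"url": url, "ok": ok, "latency_s": time.monotonic() - start}

    # Run this on a schedule, per feature, and you get objective up/down data:
    print(check_endpoint("https://api.example.com/v1/health"))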
Does that answer your question? LMK if I can follow up with more info. :)
TrueGeek|3 years ago
This bugs me to no end. I don't want to name names, but I had a devops service that was returning an odd error implying I was doing something wrong. The status page said everything was good. After several hours I emailed, only to be told it was actually down, they were aware, and they were working on it. It eventually got fixed, they emailed back, and all was well. The status page never did show any downtime.
vinayan3|3 years ago
One follow-up: there are instances where Datadog reports an outage but Metrist says it's green.
Is that because the functional tests are still passing while some other part of Datadog was reported as down?
ozten|3 years ago
My guess would be that Metrist made one or more API calls that failed within a time slice (hopefully more than one failure). They then mark the entire day orange or red and compare it to AWS's green. Which is true: for the entire day, their status symbol probably was green.
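Something like this, roughly (my assumption about the rollup, not Metrist's actual logic):

    # Probe results land in time slices, and any slice with failures taints
    # the whole day, even though the vendor's own status stayed green all day.

    def day_color(slices):
        """slices: list of (failures, successes) tuples, e.g. one per hour."""
        bad = [s for s in slices if s[0] > 0]
        if not bad:
            return "green"
        # A blip in one or two slices -> orange; widespread failures -> red.
        return "orange" if len(bad) <= 2 else "red"

    day = [(0, 60)] * 24    # 24 hourly slices of probe results
    day[9] = (2, 58)        # two failed API calls in one hour
    print(day_color(day))   # "orange" for the day vs. AWS's all-green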
The AWS team faces a hard challenge in reporting availability and deciding when a system is not green, across dozens of API use cases per service, hundreds of services, hundreds of data centers, dozens of availability zones, and millions of clients.
Metrist has no visibility into a service's internal SLAs, SLOs, and SLIs. [1]
[1] https://cloud.google.com/blog/products/devops-sre/sre-fundam...
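For anyone unfamiliar with the jargon in [1]: an SLI is a measured ratio of good events to total events, and an SLO is the internal target that ratio is held to. With illustrative numbers only:

    # These are internal measurements an outside prober never sees.
    good_requests = 999_412
    total_requests = 1_000_000

    sli = good_requests / total_requests   # SLI: the measured ratio
    slo = 0.999                            # SLO: the internal target (99.9%)

    print(f"SLI = {sli:.4%}, SLO met: {sli >= slo}")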
capableweb|3 years ago
Here are some examples where the SaaS says they are down/degraded, but Metrist thinks they're up:
https://app.metrist.io/demo/jira
https://app.metrist.io/demo/circleci
Here is another where Metrist thinks the service is down, but it self-reports as up:
https://app.metrist.io/demo/newrelic
lngarner|3 years ago
Thanks for pointing that out! Since status pages are updated manually while we monitor actual functionality, we often see that services functionally recover long before the status page reports that everything is back in working order. Again, that's because updates are manual, and status pages are often more for marketing than for development purposes.
And also, we're in "Show HN" and may not be 100% perfect ;) but we stick to the above explanation :)