Monitoring Is a Pain

[+] jillesvangurp|2 years ago|reply

It's hard because it's being pushed and done by people with different agendas that also aren't necessarily the right agendas to be prioritizing.

- mistake #1, assuming it's a technical thingy that some techie person should do and they should just get on with it and do it. Whatever it is. The mistake here is that without guidance this person is going to be selecting some random tools mostly focused on operational things (logging, infrastructure). These have low value for other stakeholders. Yes you need that but these are also commodities.

- mistake #2, each of the stakeholders adds their preferred tools to the mix and the end result is a bunch of poorly integrated tools with a lot of complexity.

- mistake #3, assuming that selecting a tool means the job is done. It's not. Most of these tools assume that you have some notion of how to use the tool and what you want out of the tool. Getting the tool is a the beginning, not the end.

The combination is lethal. Lots of complexity, lots of data, lots of cost, and not a whole lot of value being delivered.

The fix is an observability mindset that exists in product, tech, and business departments. Select 1 tool, not 2, 5, or 10. The more tools, the more fragmented the effort and the more duplication of effort and mistakes. Have a plan, engineer for the system to be observable, have a notion of things that are important to track, etc. Don't dump the responsibility on some techie but actually reflect a bit on what you are looking to get out of this and what this is worth to you.

[+] KaiserPro|2 years ago|reply

The Way(tm) that I was taught/experienced is as follows:

o Logs are there for ignoring. You need them precisely twice: once when you are developing, and once when the thing's gone to shit, but they are never verbose enough when you need them.

o Using logs to derive metrics is an expensive fools errand pushed by splunk and the cloud equivalents(ie cloudwatch and the like). Its slow, inaccurate and horrendously expensive.

o using logs for monitoring is a fools errand. Its always too slow, and really really fucking brittle.

o metrics are king.

o pull model metrics is an antipattern

o Graphite + grafana is still actually quite good, although time resolution isn't there.

o You need to raid your metrics stores

o We had a bunch of metrics servers in a raid 1, which were then in a raid 0 for performance, all behind loadbalancers and DNS Cnames with a really low TTL.

o Cloudwatch metrics are utterly shite

o Cloudwatch is actually entirely shit.

o tracing is great, and brilliant for performance monitoring.

o Xray from AWS is good, but only really for lambdas.

o tracing is fragile and doesn't really plug and play end to end, unless you have the engineering discipline to enforce "the one true" tracing system everywhere

but what do you monitor?

http://widgetsandshit.com/teddziuba/2011/03/monitoring-theor... this still is canonical.

In short, everything should have a minimum set of graphs, CPU, Memory, connections, upstream service response times hits per second and query time, at a minimum.

You can then aggregate those metrics into a "service health" gauge, where you set a minimum level of service (ie no response time greater than 600ms, and no 5xx/4xx errors or similar) red == the service isn't performing within spec, yellow == its close to being outside spec, green == its inside spec.

if you are running a monolith, then each subsection needs to have a "gauge". for microservice people, every microservice. You can aggregate all those gauges into "business services" to make a dashboard that even CEOs can understand.

[+] perlgeek|2 years ago|reply

I think it hasn't really been settled if pushing or pulling metrics is the anti-pattern, it seems to change every 5 to 10 years which one is currently hot.

[+] betaby|2 years ago|reply

> o pull model metrics is an antipattern

And sadly somehow whole Prometheus/Cloud is built on idea of pulling GET /metrics I personally also think it's an antipattern, yet such design is dominant. Streaming telemetry via GRPC is rarity.

[+] pnt12|2 years ago|reply

I don't know a lot about monitoring, why do you say pulling is an anti pattern?

My understanding is that with the push pattern you submit a metric when it's available, with the pull pattern you make them available via an interface.

I've read the first is not as performant, as it leads to submitting lots of metrics. Although I can think of an alternative, which is storing them and pushing batches periodically?

[+] guhidalg|2 years ago|reply

This matches my experience in Azure. I'd add a few more:

o Log program inputs, outputs, and state changes. If you log something like "We are in function X", that's useless without a stack trace showing how we got to that state.

o Assume your developers are going to forgo logging "the right thing" and instrument the ** out of your application infrastructure (request/response processing, DB accesses, 3rd-party API calls, process lifecycle events, etc...).

o Make logging easy. I may get flamed but I think dependency injecting loggers is an anti-pattern. Logging is such a cross-cutting concern that its easier to have a static Logger or Metric object that is configured at startup. You do need some magic to prevent context from spilling from one request to another, but at least C# has these language features (AsyncLocal<T>) and I assume others do too.

o Your monitoring alert configuration should be in source control.

[+] citrin_ru|2 years ago|reply

> Logs are there for ignoring. You need them precisely twice: once when you are developing, and once when the thing's gone to shit

good logs are very valuable both when things are still working and when there is an outage.

E. g. I with nginx even on a loaded servers error.log is small and it is possible to skim it from time to time to ensure there are no unexpected problems. access log is useful for ops tasks to - when possible I use tab separated access log which can be queried using clickhouse-local.

Sometimes software write bad (useless) logs, but this is not a problem with logs in general.

Also it may be useful to collect metrics for number of messages on different levels (info/warn/error/crit) - sudden changes in rate of errors is something which should be investigated.

[+] kccqzy|2 years ago|reply

While I agree generally, I think it's useful to distinguish between freeform logs and structured logs. Freeform logs are those typical logs developers form by string concatenation. Structured logs have a schema, and are generally stored in a proper database (maybe even a SQL database to allow easy processing). Imagine that each request in a RPC system results in a row in Postgres. Those are very useful and you can derive metrics from them reasonably well.

[+] klysm|2 years ago|reply

I strongly disagree with the link you’ve provided saying that queue bases systems are hard to measure the health of. I think the queue size is a good way to monitor the health of things, assuming you have some not super pathological workload. If your queues have a bounded size then alerting at various points towards that upper bound makes perfect sense

[+] kbar13|2 years ago|reply

i agree with your statement on logs and i see where you're coming from with cloudwatch. i think if you're coming from graphite and grafana then it makes total sense why you would dislike cloudwatch. cloudwatch requires a bit more planning than the intuitive metrics you can typically throw at and query from graphite. it also definitely does feel weird to have to ship multiple dimensions of seemingly the same metric to do the queries that you really want. however, once you design the rollups of metrics, you get everything you need, plus you don't have to worry about operating a mission-critical monitoring stack. and you can still build your dashboards in grafana and alert as you like.

i've never really found tracing to be useful - i've used pprof and an internally developed version which uploads profiles from production instances and shows you a heatmap for stuff that require deeper digging than metrics

[+] kunley|2 years ago|reply

Two big minuses of Graphite:

- not scalable, even at a moderately sized level; at one work we had graphite workers trying to flush-write the data for dozens of hours on a regular basis, and numerous engineers spent a lot of time trying to optimize that with no good outcome

- no labels, so you must flatten them into a metric name

[+] VectorLock|2 years ago|reply

>pull model metrics is an antipattern

While I agree there are a lot of Prometheus people who would disagree.

[+] brad0|2 years ago|reply

Could you elaborate on Cloudwatch metrics?

[+] trabant00|2 years ago|reply

Good luck trying to monitor using debugging tools like logs and metrics. Devs thinking they can do Ops because reasons. What you really need is state monitoring. You know, the old Nagios style that everybody thinks is out of date because, again, reasons. Just start by having e2e tests for everything and add intermediate tests when you discover points of failure and bottlenecks. And don't give me that "Nagios does not scale" bullshit. You don't need to and can't monitor 9 trillion things. You also have Icinga and other options that scale for state monitoring.

In place of argumentation which would be too long, let me give you an analogy. Monitoring is telling you the fire alarm went off. The exact reason is left for a investigation and that's where saving data like logs and metrics helps. But first you need to evacuate the building and call the fire department. If instead of detecting the smoke (e2e) you try to monitor all the data like pressure in gas pipes, all the pipes fittings, how many lighters are in the building and where, etc, you will wake up every night 10 times for nothing and when a fire starts you will burn sound asleep.

Oh, and don't try to scale Prometheus, that's what things like victoriametrics are for.

[+] wspeirs|2 years ago|reply

>If instead of detecting the smoke (e2e) you try to monitor all the data like pressure in gas pipes, all the pipes fittings, how many lighters are in the building and where, etc, you will wake up every night 10 times for nothing

Without this you'll always be awoken by smoke, and at that point it's too late... there's already a fire. However, if you can monitor other things (gas pipes, lighters, etc), you might be able to remediate a problem before it starts smoking or burning.

There's no one-size-fits-all. Start with some obvious failure (e2e tests/checks; aka smoke), and when you root-cause, add in additional checks to hopefully catch things before the smoke. However, this also requires you update (or remove) these additional checks as your software and infra change... that's what most people forget, and then alert fatigue sets in, and everything becomes a non-alert.

[+] pphysch|2 years ago|reply

> What you really need is state monitoring. You know, the old Nagios style that everybody thinks is out of date because, again, reasons. Just start by having e2e tests for everything and add intermediate tests when you discover points of failure and bottlenecks.

"State monitoring" are just different metrics. State of X at time Y was Z. Blackbox Exporter covers a lot of ground here. If you need more advanced application instrumentation, you can build that, but there is real value in having all of your metrics in one place instead of siloing different bits out to Nagios.

[+] buro9|2 years ago|reply

Monitoring all the things can be hard.

There are a lot of different solutions, that have different trade-offs for different volumes of data, cardinality of data, how out of order / async the data may arrive, whether you need push or pull delivery, etc, etc.

These apply to you and only you know the parameters and considerations of your application, and the environment it runs in (perhaps you come with the baggage of "this is instrumented in x and I can't change that" in addition to volume and cardinality considerations, perhaps it's a point-of-sale in a shop that only connects to the internet once a day for upload of the data).

But for the simplest solution, you're probably talking Prometheus + Loki/Elastic + Tempo + Pyroscope/Polar Signals + Grafana. This is simple as in "all pieces are proven and work in the small and get you fairly far"... but you still have to compose the solution out of those parts. And later you may wish to graduate to Grafana Cloud when the pain of running it all yourself is taking more time and effort than running the systems you're supposed to be working on. Oh, and if you really want to lean in on traces for everything, take look at Honeycomb too.

As to "the monitoring costs more than the app"... yup, at Cloudflare for a very very long time there was little monitoring of DNS requests or billing of it... because the act of counting it and recording it cost more than just serving the request. This principle scales well (or badly if you're looking at your credit card bill)... and is also why those who make these systems really have to keep in mind how to make the instrumentation cost less (eBPF?) as well as the ingestion and retention cost less (various methods, including aggregation, sampling, etc).

[+] hinkley|2 years ago|reply

My company does a lot of things the dumbest way possible but they’re always been good at tracing service requests and telemetry. Any time correlation IDs were missing was considered a regression. And we are quick to agree to action items that increase visibility from a postmortem. The one complaint I have is we don’t move things out of Splunk into graphs as fast as I’m comfortable with.

But the thing I see we are all missing for telemetry is a sense of coverage. We changed backends last year and as we were wrapping that up I found big chunks of code that were missing coverage. Meanwhile we found a bunch of stats that nobody looks at, meaning the SNR goes down over time. There are only so many dashboards I can watch.

So some sense of code blocks missing telemetry, and data that is collected but has no queries, queries that exist but never run (that dashboard nobody remembers making and isn’t in the runbooks anywhere) would be a great goal for a next level of maturity.

[+] awinter-py|2 years ago|reply

Strong yes, particularly this line:

> the cost of monitoring an application can easily exceed the cost of hosting the application even for simple applications

(especially if you are looking at saas monitoring for a single-box or serverless thing)

I suspect SIEM is the most established use case for ingesting logs to something actionable; wonder if there are concepts there which port generally.

Datomic's approach to change capture seemed very good when it came out -- I wish this were a thing I could cheaply turn on and off on SQL columns. As things are, every ORM seems to resist having a table with no primary key so 'log important things to the DB' is hard in practice.

In postgres, mvcc makes it expensive to edit a scalar, so it is impractical to store counts. (There is an old uber blogpost I think about why they use mysql for this use case).

Specifically for alerting, I wish I could define application logic statistics in my application rather than PromQL / datadog json. Would be much easier to express 'warn me if this ever fails', 'expect 50% failure', 'warn me if too many retries' type stuff.

On the 'looking at logs' side, the hosted monitoring tools I've used are often uncomfortably slow, even when my data is small. I understand it's a hard problem but I miss graphite / whisper.

(not even getting into mixpanel / posthog type product-level observability)

[+] mst|2 years ago|reply

> every ORM seems to resist having a table with no primary key

perl's DBIx::Class doesn't honestly give a shit.

I understand there are many perfectly understandable reasons why people dislike perl itself, but it's really annoying to use ORMs in more popular languages and keep finding things perl developers have been able to take for granted for 15+ years that Just. Aren't. There.

I'm not asking anybody to like perl who doesn't already, it's definitely an acquired taste, but I think it's understandable for me to wish people developing libraries for other languages would get around to stealing some of DBIx::Class' less commonly available features (I may eventually get around to doing so myself but I strongly suspect that for any given language there are quite a few people out there who could do a better job of it, faster, were they to decide it sounded like sufficient fun to spend time on it, so unless/until -I- decide it sounds like sufficient fun I'm going to keep mentioning it in the hopes somebody else does ;).

[+] rjbwork|2 years ago|reply

>I wish this were a thing I could cheaply turn on and off on SQL columns.

You can turn it on and off for whole tables in systems that have decided to implement SQL:2011 such as MSSQL, Oracle, MariaDB, etc. and then issue time travel queries to see what the state of the database was at any given point in time.

Keyword here is "Temporal Table".

[+] lukeasrodgers|2 years ago|reply

What do you mean "expensive to edit a scalar"? If we're thinking of the same Uber article on mysql -> postgres, IIRC it had to do with write amplification related to updates to indexed columns. I don't think postgres has any problems storing counts, but would be interested to know if I'm wrong.

[+] j3s|2 years ago|reply

> I miss graphite / whisper.

as someone who had to manage a system like this, i certainly don't. it doesn't scale well, it requires way too much upfront design thought (since metrics are stored in arbitrary dirs), and it's a massive pain to install and manage. not to mention, it's barely developed anymore so much needed features (like carbon cache restarts not taking a million years bc of the slow sequential flush to disk) will probably never see the light of day.

for better or worse, the industry moved on. speaking personally, prometheus is so much better than graphite ever was, in every conceivable way.

[+] roflyear|2 years ago|reply

Well, "easily" - I am not totally sure, but it can happen without much trouble. I guess depends on your definition of easily.

[+] jeffbee|2 years ago|reply

One thing I like to evangelize is the similarity between logging and tracing. A trace could be assembled from properly-formatted distributed log messages, and some tracing systems are implemented this way. If you have a working tracing setup, and you trace every request as some people do, then using a span event [1] can be an easier way to get information about what happened during the request than it would be to find all the associated debug logs.

1: https://opentelemetry.io/docs/concepts/signals/traces/#span-... A span event is data that is associated with a point in time rather than an interval.

[+] phillipcarter|2 years ago|reply

> The problem with me and tracing is nobody uses it. When I monitor the teams usage of traces, it is always a small fraction of the development team that ever logs in to use them. I don't know why the tool hasn't gotten more popularity among developers.

I think there's two primary reasons:

1. Historically it was too hard to get easy value early. That's probably changed with OpenTelemetry since it has so much automatic instrumentation you can install. Even 3 years ago it just wasn't feasible to install an agent or write 5 lines of code and then get most of the libraries you use instrumented for you. Now it is, and it's fully portable across tools.

2. Cost, which still isn't solved yet. Tail-based sampling is still too hard, but you need it if you want to do stuff like always capture 100% of error traces but get only a statistically representative sample of 200 OK traces. There's some solutions for that today (e.g., my employer Honeycomb has such a tool), but it's still locked away with vendors and too hard to use.

I do expect the narrative on tracing to change in the coming years, though. OpenTelemetry is growing in popularity and it has enough features and stability for a wide range of people to adopt it. I think the cost issue will also be dealt with over time.

[+] wspeirs|2 years ago|reply

>Cost, which still isn't solved yet.

I'd argue it _is_ solved... store your logs in S3. At ~$0.02/GB you can store a _lot_ of logs for like $20. The problem is that _most_ solutions (Honeycomb included) are SaaS-based solutions, and so they have to charge a margin on top of whatever provider is charging them.

You just need a tool (like log-store.com) that can search through logs when they're stored in S3!

[+] doctorpangloss|2 years ago|reply

If I'm going to pay for a vendor APM, who's going to outpace NewRelic? They already do all this stuff - automatic instrumenting, "tail based sampling," etc.

[+] prpl|2 years ago|reply

Another thing - Traces (and often metrics for that matter) are often treated as something that is investigated one at a time. There's a lot of tools out there to help visualize a trace. They are not, usually, treated as something that should be analyzed at scale, and the tooling there is not great.

I think PromScale looked promising for some of those problems but it's also a deprecated product.

I think the state of the art is/will be ingesting traces into Iceberg for some form of analysis with whatever execution engine you want (Trino/Spark/etc...). A lot of places are doing this.

OpenTelemetry is clearly the future of monitoring, probably with a lot of lifting from Prometheus and other things.

[+] fbdab103|2 years ago|reply

>When I monitor the teams usage of traces, it is always a small fraction of the development team that ever logs in to use them...

Like fire extinguishers, they are nice to have when you need it.

[+] rsync|2 years ago|reply

Higher value needs to be placed on "feel" and the ineffable heuristics that human brains automatically build up as they operate a system.

This is one of the reasons I strongly dislike the idea of remote or telepresence inspections of factories. You need to just walk the factory floor and use all of your senses (including smell). If you do this enough, you'll know when something is wrong. The pitch of the machines, the vibrations, the hum ... and you'll catch it before your rules do.

A personal example:

rsync.net has almost zero fraud detection for payments and signups. However, I, personally, have been watching the real-time order flow[1] for decades and I just know. A fraudulent signup just sticks out like a flashing red light. I could not teach this to you and I could not codify my heuristics.

I have spent decades thinking about how to augment - or replace - these real-time human observations with warnings and alerts and thresholds and I am deeply skeptical of their value relative to human interaction with the functioning system.

[1] ... with tail ...

[+] giovannibonetti|2 years ago|reply

When it comes to monitoring, the first thing I look for is a good data store. In the open-source world, probably the best one is Clickhouse, and there are multiple solutions built on top of it like SigNoz [1] and Uptrace [2].

[1] https://github.com/SigNoz/signoz [2] https://github.com/uptrace/uptrace

[+] serverlessmom|2 years ago|reply

Definitely +1 for SigNoz, much more cohesive view of metrics, logs, and traces

[+] lrem|2 years ago|reply

I share the love of tracing. Internally at Google I sometimes stare at traces traversing hundreds of RPCs. It's about the only tool where I can very quickly go "wait, this shouldn't be the costly part" and "ugh, why are these sequential", then look exactly what the particular requests were and where the QoS was set wrong.

[+] XorNot|2 years ago|reply

The rule I follow personally for log levels is:

DEBUG - anything which would otherwise be a 1 line code comment

INFO - all business process relevant actions. Anytime something is committed to a database is a fairly good place to start.

WARN - abnormal outcomes which collectively might indicate a problem. i.e. a warning log should be printed when something succeeds but takes much longer then it should, or if I'm going to fail an action which will be retried later. Anything which logs at this level should be suppressible by reconfiguring the application to define the behavior as normal.

ERROR - Anything which should work, didn't, and isn't recoverable. Actions requested by the user which don't succeed should log at this level, since we're failing to meet a user request.

[+] doctorpangloss|2 years ago|reply

> You can now track all sorts of things across the stack and compare things like "how successful was a marketing campaign". "Hey we need to know if Big Customer suddenly gets 5xxs on their API integration so we can tell their account manager." "Can you tell us if a customer stops using the platform so we know to reach out to them with a discount code?" These are all requests I've gotten and so many more, at multiple jobs.

But now you can ask ChatGPT GPT4 to author PromQL and Jaeger queries for you, which it is incredibly good at.

The real, actual pain point - not cost, not maintenance, etc. - between the bottom line affecting stuff like customer success management and metrics was always turning the English language problem into the thing that goes into the computer. And that is seriously going away, the only people who think it's not are the experienced ones who, for some reason, months into the distribution of this insane technology, haven't tried asking for a PromQL query yet!

[+] gofreddygo|2 years ago|reply

> the English language problem

This is so true, and IMO underlooked for other flashy capabilities. GPT transformers (even bad ones) have nailed english grammar. This opens up a lot of doors for a lot of small businesses.

[+] SoftTalker|2 years ago|reply

I've outsourced monitoring to my users. They let me know when something isn't working.

[+] awestroke|2 years ago|reply

Your increased churn rate is probably cheaper than a Datadog plan

[+] knorker|2 years ago|reply

Crowd sourced consensus driven monitoring based on multiple instances of general AI.

[+] spentu|2 years ago|reply

I also have done this for some projects. Sadly they seem to be unreliable to degree of using alerts with ms Teams..

[+] hagen1778|2 years ago|reply

How do you silence noisy alerts?

[+] ckozlowski|2 years ago|reply

This was a good read. But the question that I kept waiting to be asked was "What do I need to monitor?"

In my opinion, the way to avoid many of the problems and complexities that author lists is to start with the goals. What are the business/mission goals? What is the workload (not the monitoring tool) trying to accomplish? What does the thing do?

Once you have that, you can go about selecting the right tool for the job. The author notes that there's no consensus on what logs are for. Or is buried under signal-to-noise ratio for metrics. But when you start with a goal, and then determine a KPI you want to obtain to measure that, you'll then be able to ascertain where a log, metric, trace, or combination thereof can reflect that. And you only need create the ones needed to measure that. The scope of the problem becomes far more manageable.

It make take a couple tries to get right, and a user may find that they need to add more over time. But the monitoring doesn't need to change often unless the deliverables or architecture of the workload does. And creating those should be part of the development process.

By starting with goals and working from there, monitoring becomes a manageable task. The myriad of other logs and metrics created can be safely stored and kept quiet until they're needed to troubleshoot or review an incident. That falls into observability.

Observability is something you have. Monitoring is something you do.

I haven't addressed the difficulties the author states in creating various traceability mechanisms, how to parse all of these logs from various platforms, or some of the other technical challenges he states. There are usually solutions for these (usually...), some easier than others. But by narrowing the scope down to what's needed to achieve your given insights, the problem set becomes only what's required to achieve that aim, and not trying to build for every conceivable scenario.

[+] llama052|2 years ago|reply

Honestly we've had great luck with HA prometheus instances shipping data to Thanos. All in kubernetes so we get to use pre-built operators. We don't really have to touch it other than updates.

We are a 3 man team and we are holding around ~4m active series with around a year of downsampled data in a storage bucket. Doesn't really cost us anything other than some servers on a cluster. We also use Loki for logs and Tempo for tracing in the same pattern.

One of the reasons I like Kubernetes, it's rigid enough compared to the wild west of VM land, and utilizing operators and tooling can get you solutions that just work provided you have enough core knowledge to fill in blanks.

[+] perpil|2 years ago|reply

If you want to do monitoring and logging at scale, the majority of the best practices are captured in this article: https://aws.amazon.com/builders-library/instrumenting-distri...

These are the key insights we learned running hundreds of services at Amazon scale and you'll learn something whether you're just starting, you think you know it all or you need some actionable ways of improving your operational excellence.

[+] jeffbee|2 years ago|reply

Notable they are not discussing centralized collection and indexing of debug logs, they implicitly leave the logs on the disk of the machine where they were produced, and go out to read them when called for. This is an important lesson because centralizing and indexing logs is very foolish unless you are a Splunk shareholder.

[+] hagen1778|2 years ago|reply

Monitoring is a pain because we need to monitor more and more complex things. And not only that. Mostly, the problems described by OP are connected to the scale. So the main source of the "pain" is the volume of data we need to process with monitoring systems.

That's why Graphite is fading off - it just can't keep up with the growth of data volume.

Prometheus is a great tool, but now it is not enough of one server. So you are either looking for workarounds with federation, or running more complex monitoring systems like Thanos, Cortex, VictoriaMetrics.

[+] mvdtnz|2 years ago|reply

I just think that this is an incredibly whiny post. These are all solved problems but he brushes away the solutions with a hand wave in order to continue being relentlessly negative.

> Maybe you start with an ELK stack, but running Elasticsearch is actually a giant pain in the ass.

And that's the last he says of it. But I've seen ELK stacks in organizations from 30 employees / 5,000 users all the way to 5,000 employees / millions of users and it scales well. I'm sure it continues to scale above that.

Or for Prometheus scaling options,

> Adopt a hierarchical federation which is one Prometheus server scraping higher-level metrics from another server. [...] The complexity jump here cannot be overstated. You go from store everything and let god figure it out to needing to understand both what metrics matter and which ones matter less, how to do aggregations inside of Prometheus and you need to add out-of-band monitoring for all these new services. I've done it, it's doable but it is a pain in the ass.

It's not a pain, it's like an afternoon of work. I don't see why you need to be deciding which metrics matter - collect all of them. And if you don't know how to do aggregations in Prom, learn! You'll need to know this to use it anyway.

[+] dbg31415|2 years ago|reply

I think you can monitor things "as a developer" and you can, and should... but let's face it, nobody is really watching that stuff.

Better to tell the marketing teams to set up their own monitors and metrics.

It could be Google Analytics, it could be UptimeRobot. Or whatever they want to use.

Good for them to set monitors of every page that gets more than 1% traffic.

Good to monitor for the HTTP Response Code (especially if there's a WAF involved), simple content check (like the page title, to make sure something didn't get redirected), "too many hops test" for more than 2 hops to get to a destination, page load speed, and of course uptime. Most tools also will alert you if your SSL Cert or Domain is about to expire. Kinda useful. Good to monitor on every major vanity URL redirect as well... so like mysite.com/some-promo.

Setting this stuff up is SO EASY... but few people do it correctly / consistently -- so it's good to ask them to check once a month if they have all the monitors they need. Just grab the GA data and set a monitor on any page URL that gets more than 1% traffic... easy to just drop the data from GA into a CSV and import it into UptimeRobot. 2 minutes a month and you have amazing monitoring. Really important for monitoring to be run from a 3rd party location -- not from the same server, or behind the same firewall as your corporate network, but run from a location similar to what your real users are seeing.

From the dev side... having good logs so you can diagnose issues. But Let's face it, nobody has time to monitor and respond to things unless it's been escalated. So just let the marketing team / ecommerce team escalate the things they care about, and just make sure you've got enough logs and ways to diagnose the issues they report.

[+] fpanzer|2 years ago|reply

OP needs to install naemon and chill. Monitoring means alerting, and not collecting metrics. Monitoring means being notified of outages. Metrics are for troubleshooting and post mortems.

If you monitor CPU or RAM usage, you are doing it wrong. Oh and by the way, you want your ram to be full.

209 comments