The Modern Observability Problem

[+] a_c|3 years ago|reply

People are seeing observability as a separate subject from product making. I see it as two side of the same "does this thing work" coin. To a product person, does it work mean are people using the tool to solve their problem. To engineer, it means the tool is solving the problem correctly. The former is answered by various analytics tool while the latter is answered by QA and observability tools.

IMO this is doubling effort. Given all the logging,metrics and traces, can we not tell how good the product is performing? Given we know how a feature is used by users, all the clicks and APIs, can we not guaranteed future changes not breaking a user journey? I'm at the moment prototyping a tool to capture my thinking in product and software development. Having OtEL is making my idea so much easier to realise

[+] phillipcarter|3 years ago|reply

One major issue to reconcile is the granularity of data and retention. For engineers incentivized to fix problems, they need extremely high granularity, which is expensive to store long-term. Product Managers typically care more about aggregations that they want to track over a long period of time. In theory you could source these concerns from the same place, but it's challenging to do it, since you'd need to process, store, and query that data in different places. This doesn't make it impossible, and I think people should try it - but it may be too hard for a lot of organizations.

[+] mkl95|3 years ago|reply

> People are seeing observability as a separate subject from product making. I see it as two side of the same "does this thing work" coin.

I have noticed the same thing with performance and UX. Most SaaS vendors only care about client / frontend UX and neglect the backend aspect of it. After a few years of intense development, they own good looking software that is also mindbogglingly slow and in some extreme cases unusable.

[+] jahewson|3 years ago|reply

Mixing up these two concerns is a bad idea. Your product manager now has to convince engineers to implement the metrics they want. Data that is irrelevant to engineers now needs dragging along to places inside the system so that events are useful to both engineers who car about data and latency and product managers who care about individual user behaviour.

So while there is indeed overlap, the two types of metrics involve different people, different teams, different concerns, different data, different privacy concerns, and different reliability/uptime concerns.

[+] vrnvu|3 years ago|reply

Observability is the other side of statistics.

[+] YZF|3 years ago|reply

We run a pretty complicated SaaS system.

All these tools have their limitations (and we have all of them, we use Prometheus, we have tracing, we have logs - your entire stack of everything ;) ). There is a limit to your ability to tell what's going on inside a black box based on those, sometimes they'll answer the question you're interested in, and sometimes they will not. As pointed somewhere else in the comments tracing every single interaction in your system doesn't work/scale and often the one failure you care about is not going to leave a trace. Similarly with metrics at some point just measuring everything with the right labels becomes too expensive. More than once I'm looking for a specific metric to help troubleshoot something and we don't have it (despite having a ton of metrics for everything). Alerting on metrics can be very tricky because you may not have good context, some requests might be slow because they're big, some might be fast, finding rules that tell you when the system isn't behaving is extremely difficult. Usually it's the users/customers that are going to tell you that.

Adding metrics, tracing, alerts, dashboards etc. etc. takes time/effort. This needs to be weighed against time spent on other things that can improve the quality of the product. Like design, testing, etc. Really understanding what the requirements are and how the system behaves. Just because Google or Meta set that balance somewhere doesn't mean you need to. Likely your system is significantly smaller and less complex.

This is not a new problem, logging and other methods of observability have been with us since the beginning of time, and it's always been something that needs to be approached with balance. There's some logging that adds value and there's some point where it is counter-productive. When things break, more than often the logging just gives you a starting point for debugging- not the answer.

My personal philosophy is invest in quality early on and you will reduce your operational costs. Simpler and more reliable software needs less monitoring and conversely no amount of monitoring is going to turn poor quality software into reliable software. There are many domains where the software just has to be right (lessay in your car or airplane) and you can't rely on someone monitoring the software to go and fix things if they go wrong... That said, it's always about the balance. You shouldn't care about the fashion of the day or what Google does. You need to decide where the balance is for your product that optimizes things over its lifetime with the given constraints. Every project is going to have a different balance.

[+] edenfed|3 years ago|reply

If you are thinking on adopting OpenTelemetry, you should check out Odigos: https://github.com/keyval-dev/odigos (I’m the author). This tools handles instrumentation for any application (even including Go) and also manages the collectors pipeline.

[+] nickstinemates|3 years ago|reply

Increasingly, the network is a failure boundary people take for granted.

Micro/3rd party services exacerbate this problem. You may see latency increase for a particular call, but what tells you why latency is increasing? What's measuring all of these tech choices, how do you know your 3rd party api is serving you traffic reliably?

[+] dboreham|3 years ago|reply

I've become a proponent of OtEL tracing in recent months having used it successfully to diagnose some performance issues in multi-language, multi-service systems. I've found it also useful in single-process scenarios where heavy use of "async" prevails. Async-ish things (Kotlin coroutines and Scala futures in this particular case) make it hard to reason about the linear behavior of code using traditional debugging tools, I find. Disclosure: I've also made a couple of very small contributions to the project.

[+] afandian|3 years ago|reply

I’m curious, do you have more details?

[+] preseinger|3 years ago|reply

OpenTelemetry is moribund and a technological dead-end. Godspeed to any org that builds on it.

[+] mianos|3 years ago|reply

Too much telemetry is more a problem that not enough in my recent experience. I am 100% sure the line I need to find out what happened is there in kibana but every extra filter term to make a trillion lines of log output down to a specific time and sequence, adds a risk of filtering out exactly what I want.

[+] phillipcarter|3 years ago|reply

What you're describing is why tracing (it's not just for distributed systems!) and tail sampling are employed in practice over just tons of log lines. Structuring the data your generate and sampling it is how to approach this. If you sample 100% of errors (or some other meaningful signal), 1% or so of the rest - just so you have a good baseline to compare - and attach metadata for a tool to reweight counts you can more-or-less have your cake and eat it too.

[+] Joel_Mckay|3 years ago|reply

In general, Erlang/Elixir solved this challenge in an interesting way: with peer state awareness and channels (functionally equivalent to most micro-services use cases). It is commonly the back-end secret in a lot of low-latency game back-ends.

If you are stuck on a polyglot loving project, than RabbitMQ or Kafka can bolt on most functions with the standard AMQP services. Erlang/Elixir is weird, but is a single kind of weird... which even has built-in profiling tools without external dependencies.

Best of luck, =)

[+] preseinger|3 years ago|reply

Erlang asserts two simplifying assumptions about distributed computation:

1. Actors can define incoming message queues with unbounded capacity

2. Actors can always crash in response to any given error condition

These assumptions produce a coherent system model, which is unfortunately incompatible with any physical system that can implement it -- unbounded queues are a fiction -- and which is effectively non-deterministic and unpredictable -- crash-only software makes it impossible to assert any guarantees on a callstack.

Erlang represents a sub-optimal local maximum. It's no panacea.

[+] KronisLV|3 years ago|reply

I've generally seen three approaches to this:

  - use cloud offerings because it's easy to integrate with them and they're one of the better options out there; which isn't viable in some contexts, or when you don't have money allocated for this
  - setup the full Elastic Stack or Sentry, or something enterprise like that and have your stack be composed of multiple interconnected pieces of software, that need constant maintenance or even people to constantly manage them, as well as a non-insignificant amount of resources
  - go for a lightweight offering, like JavaMelody for Java applications, or some of the simpler fully featured stacks, like Apache Skywalking and try to make do with their more limited feature sets and possibly more limited documentation

For me, Apache Skywalking feels "good enough", although definitely not perfect: https://skywalking.apache.org/

The Docker Compose stack for it doesn't look as complicated as that of Sentry, it's basically an almost monolithic piece of software like Zabbix is and it works okay. The UI is reasonably sane to navigate and you have agents that you can connect with most popular languages out there.

That said, the UI sometimes feels a bit janky, the documentation isn't exactly ideal and the community could definitely be bigger (niche language support). Also, ElasticSearch as the data store feels too resource intensive, I wonder if I could move to MySQL/MariaDB/PostgreSQL for smaller amounts of data.

Then again, if I could make monitoring and observability someone else's problem, I'd prosper more, so it depends on your circumstances.

[+] latifk|3 years ago|reply

What I like about ETL tools like Dagster and Prefect is that you get observability “for free”. You can set the granularity by deciding what is a task/job/flow/op and how they’re grouped together. And then in one UI you get logs, metrics, a waterfall view with timed executions, all kinds of useful information.

It’s so useful that sometimes I’m tempted to reach for it in non-ETL contexts. My problem is that these tools generally don’t mesh well with real-time streaming requirements.

[+] harryvederci|3 years ago|reply

Adding telemetry plus yet another service to fix the complexity of having a lot of services sounds a bit like de-escalating a conflict by slapping someone in the face.

It always depends on the specific use case of course, but maybe it's also worth investigating if reducing the overall complexity of the system + amount of microservices could be a solution.

[+] unknown|3 years ago|reply

[deleted]

[+] geodel|3 years ago|reply

Well, you are gonna shutdown this budding cottage industry of Observability Consultants :)

[+] P5fRxh5kUvp2th|3 years ago|reply

heh, seriously.

53 comments