
A simple way to get more value from metrics

280 points | waffle_ss | 5 years ago | danluu.com

68 comments

[+] bitcharmer|5 years ago|reply
You'd be surprised how many serious tech shops have close to zero performance metrics collected and utilised.

I've done this in fintech a few times already and the best stack that worked from my experience was telegraf + influxdb + grafana. There are many things you can get wrong with metrics collection (what you collect, how, units, how the data is stored, aggregated and eventually presented) and I learned about most of them the hard way.

However, when done right and covering all layers of your software execution stack this can be a game changer, both in terms of capacity planning/picking low-hanging perf fruit and day-to-day operations.

Highly recommend giving it a try as the tools are free, mature and cover a wide spectrum of platforms and services.

[+] chucky_z|5 years ago|reply
I'd like to followup on this and say InfluxDB is fantastic -- with the caveat that you know exactly what you want, and use it with that in mind. It's great at storing some metrics with high volume, but I've found the OSS version falls over hard in some circumstances (high cardinality in tags absolutely murders performance, making it nearly unusable).
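
To see why high-cardinality tags hurt so much: in tag-indexed stores like InfluxDB, every distinct combination of tag values becomes its own time series, so the worst-case series count is the product of the distinct values per tag. A minimal sketch (the tag names and counts are made up for illustration):

```python
from math import prod

def series_cardinality(tag_value_counts):
    """Worst-case number of distinct time series for one measurement:
    the product of the number of distinct values per tag."""
    return prod(tag_value_counts.values())

# Modest, bounded tags: perfectly manageable.
ok = series_cardinality({"host": 50, "region": 5, "status": 3})

# Add one unbounded tag (say, a per-user ID) and the series count explodes.
bad = series_cardinality({"host": 50, "region": 5, "status": 3, "user_id": 100_000})

print(ok)   # -> 750
print(bad)  # -> 75000000
```

This is why the usual advice is to keep unbounded identifiers (user IDs, request IDs) out of tags and in fields or logs instead.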

I'm in the middle of a test run of vmagent + victoria-metrics as a mostly-replacement (i.e. not replacing the cases where we know exactly what's wanted and use InfluxDB explicitly for that), and victoria-metrics fully supports telegraf pushes, which is a big bonus.

Also, they're apparently going to drop InfluxQL entirely in favour of Flux, which is weird to me. The docs don't outright state this, but InfluxDB 2.0 has only Flux examples in the docs I can find. It's a better language, but the learning curve is not trivial.

[+] benraskin92|5 years ago|reply
Might also want to give Prometheus a try, as it's extremely simple to set up and very well supported in the open source community.
[+] chrisweekly|5 years ago|reply
Agreed; same experience (with many years of web perf optimization as a UI architect and/or perf strategy consultant). Even after all these years, simply using WebPageTest to capture test runs and share the results is often a transformational experience.
[+] duckmysick|5 years ago|reply
> There are many things you can get wrong with metrics collection

Can you share the important things that are overlooked?

[+] jamessun|5 years ago|reply
"I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing."
[+] Simulacra|5 years ago|reply
The most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka!” (I found it!) but “That’s funny …” — Isaac Asimov
[+] m463|5 years ago|reply
But that's not as fun as writing a Scheme-to-Rust cross-compiler on Kubernetes
[+] resu_nimda|5 years ago|reply
Starting the article off with "I did this in one day" - complete with a massive footnote disclaiming that it obviously took a lot more than one day - kinda ruined it for me. Why even bother with that totally unnecessary claim?
[+] eshyong|5 years ago|reply
My read on it is that the author is saying seemingly small changes can have big impacts. I agree it could have been worded better, though I doubt he's trying to promote himself as a genius (as other people are saying), because he clearly highlights the effort his team put into the project in the footnote.
[+] brmgb|5 years ago|reply
I was really off-put by it too.

"I did it by myself in one day, well actually it was one week but had I known the stack I would likely have done it in one day. Oh, and by the way, after that week, there was yet another month of work involving at least two other persons from my team and then even more work from other teams. But let's not dwell on boring details".

It's nearly as infuriating as the "Appendix: stuff I screwed up", which doesn't contain any actual screw-ups. It's a shame because the rest of the writing is interesting and doesn't need to be propped up.

[+] caiobegotti|5 years ago|reply
It's kind of a personal marketing thing these days to project this maverick/hero aura of genius instead of the "unproductive" but real, hard grinding work it takes to get something done and delivered. It worked for a few, so thousands try the same, and here we are, I guess?
[+] derivativethrow|5 years ago|reply
That strikes me as a short tolerance for feeling something is ruined. He appropriately highlighted the real time estimate of the more involved work in a footnote. He didn't literally mean all of the work was one day, he's trying to convey a larger point about outsized engineering returns from comparatively small person-hours of work.

Were you able to move past this to read the rest of the article? Because it's a very good article.

[+] waheoo|5 years ago|reply
The writing style is dense. I suspect a voice fresh out of academia.

The post about salary reads much better so might just be an experience thing.

https://youtu.be/vtIzMaLkCaM

[+] roskilli|5 years ago|reply
There's a lot of interest in this space with respect to analytics on top of monitoring and observability data.

Anyone interested in this topic might want to check out an issue thread on the Thanos GitHub project. I would love to see M3, Thanos, Cortex and other Prometheus long term storage solutions all be able to benefit from a project in this space that could dynamically pull back data from any form of Prometheus long term storage using the Prometheus Remote Read protocol: https://github.com/thanos-io/thanos/issues/2682

Spark and Presto both support predicate push down to a data layer, which can be a Prometheus long term metrics store, and are able to perform queries on arbitrary sets of data.

Spark is also super useful for ETLing data into a warehouse (such as HDFS or other backends, i.e. see the BigQuery connector for Spark[1] that could write a query from say a Prometheus long term store metrics and export it into BigQuery for further querying).

[1]: https://cloud.google.com/dataproc/docs/tutorials/bigquery-co...

[+] tixocloud|5 years ago|reply
Thanks for sharing. It’s interesting to see the space gaining steam. What sort of things are people looking at?
[+] gigatexal|5 years ago|reply
This is a really awesome blog. The post about programmer salaries is insightful: https://danluu.com/bimodal-compensation/
[+] julianeon|5 years ago|reply
I was thinking the answer to 'why are programmers paid more than other white-collar professions that are similarly profitable' is: because programmers control the means of production.

I might be a great telecomm tech, a genius even, but once I'm out of a job, I can't build my own telecomm system - that would cost billions. I have to go back to some other telecomm system to start making money again.

But, at a startup, a kicked-out senior engineer can actually pretty much exactly recreate the company; they can do the equivalent of a laid-off telecomm employee starting a new, almost-as-good (except for branding) telecomm company.

No billions in infrastructure required: within a month or two, the cloned company could be near-indistinguishable from the original.

So companies have to pay employees more like partners, instead of employees, because either they pay them as equals or they'll be forced to compete against them, as equal rivals.

[+] sprt|5 years ago|reply
Interesting, wonder if he's looked at the data from levels.fyi. Although it's surely not representative.
[+] renewiltord|5 years ago|reply
Just so I understand, the simple way the headline talks about was "collect all metrics, but store the small fraction you care about in an easily accessible place; delete the raw data every week"?

Title didn't live up to article imho. But I get it. Thanks for sharing your methods.

[+] simonw|5 years ago|reply
Love the section in this about using "boring technology" - and then writing about how you used it, to help counter the much more common narrative of using something exciting and new.
[+] heliodor|5 years ago|reply
If we consider Graphite, InfluxDB, and Prometheus: at this point in the monitoring industry's evolution, we can easily criss-cross between them, converting metrics generated in the format of one system and storing them in one of the others.

The missing piece remains to be able to query one system with the query language of the others. For example, query Prometheus using Graphite's query language.
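
As far as I know no such cross-query layer exists today; a hypothetical shim might start by rewriting the handful of Graphite functions that have direct PromQL equivalents (everything here — the mapping table and the assumption that metric names carry over unchanged — is invented for illustration):

```python
import re

# Hypothetical sketch: rewrite a few Graphite functions as PromQL.
# A real shim would also have to map dotted Graphite paths onto
# Prometheus metric names and label matchers, which is the hard part.
GRAPHITE_TO_PROMQL = {
    "sumSeries": "sum",
    "averageSeries": "avg",
    "maxSeries": "max",
}

def translate(target: str) -> str:
    for g_fn, p_fn in GRAPHITE_TO_PROMQL.items():
        target = re.sub(rf"\b{g_fn}\(", p_fn + "(", target)
    return target

print(translate("sumSeries(requests_total)"))  # -> sum(requests_total)
```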

[+] sa46|5 years ago|reply
Speaking of high cardinality metrics, what are good options that aren’t as custom as map reduce jobs and a bit more real time?

We killed our influx cluster numerous times with high cardinality metrics. We migrated to Datadog which charges based on cardinality so we actively avoid useful tags that have too much cardinality. I’m investigating Timescale since our data isn’t that big and btrees are unaffected by cardinality.

[+] staysaasy|5 years ago|reply
The boring technology observation (here referring to the challenge of getting publicity for "solving a 'boring' problem") is really true.

It extends very well to something that we constantly hammer home on my team: using boring tools is often best because it's easier to manage around known problems than forge into the unknown, especially for use-cases that don't have to do with your core business. Extreme & contrived example: it's much better to build your web backend in PHP over Rust because you're standing on the shoulders of decades of prior work, although people will definitely make fun of you at your next webdev meetup.

(Functionality that is core to your business is where you should differentiate and push the boundaries on interesting technology e.g. Search for Google, streaming infrastructure for Netflix. All bets are off here and this is where to reinvent the wheel if you must – this is where you earn your paycheck!)

[+] mv4|5 years ago|reply
Thank you for sharing this. I recently started working on metrics at a FAANG and saw some of the challenges you mentioned... the fact that you were able to get good results so quickly is super inspiring!
[+] tixocloud|5 years ago|reply
Interesting. I thought this would already be a solved problem at FAANG.
[+] chrchang523|5 years ago|reply
Minor nit: long -> double -> long cannot introduce more rounding error than long -> double, if the same long type is at both ends.
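
A quick way to convince yourself of this, using Python ints and floats (floats are IEEE-754 doubles, so this mirrors the long -> double -> long round trip for values within the long's range):

```python
# 2**53 + 1 is the first integer a double cannot represent exactly.
x = 2**53 + 1

d = float(x)        # long -> double: rounds to 2**53 (all the error is here)
back = int(d)       # double -> long: exact for in-range values, no new error

print(d == 2.0**53)   # True
print(back == 2**53)  # True: the round-trip error equals the
                      # long -> double error; nothing was added on the way back
```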
[+] dirtydroog|5 years ago|reply
What's the standard for metrics gathering, push or pull? I prefer pull, but depending on the app it can mean you need to build in a micro HTTP server so there's something to query. That can be a PITA, but pushing a stat on every event seems wasteful, especially if there's a hot path in the code.
[+] bbrazil|5 years ago|reply
I don't think there's any clear standard. There's many confusions about push vs pull that make the discussions hard to follow, as they often make apples to oranges comparisons. For example the push you're talking about in your comment is events, whereas a fair comparison for Prometheus would be with pushing metrics to Graphite. https://www.robustperception.io/which-kind-of-push-events-or... covers this in more detail.

Taking your example you could push without sending a packet on every event by instead accumulating a counter in memory, and pushing out the current total every N seconds to your preferred push-based monitoring system. You could even do this on top of a Prometheus client library, some of the official ones even as a demo allow pushing to Graphite with just two lines of code: https://github.com/prometheus/client_python#graphite

In my personal opinion, pull is overall better than push but only very slightly. Each have their own problems you'll hit as you scale, but those problems can be engineered around in both cases.

Disclaimer: Prometheus developer
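
A minimal sketch of the push-by-accumulation idea described above: increment a counter in memory on the hot path, and let a background loop push the current total every N seconds using Graphite's plaintext protocol ("path value timestamp\n"). The metric name, host, and interval are placeholder assumptions:

```python
import socket
import threading
import time

class PushCounter:
    """Accumulate events in memory; a background loop pushes the
    current total periodically instead of one packet per event."""

    def __init__(self, name):
        self.name = name
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, n=1):
        # Hot path: just an in-memory add, no network I/O.
        with self._lock:
            self._value += n

    def snapshot(self):
        with self._lock:
            return self._value

def graphite_line(name, value, ts):
    # Graphite plaintext protocol: "<path> <value> <unix_ts>\n"
    return f"{name} {value} {ts}\n"

def push_loop(counter, host="localhost", port=2003, interval=15):
    # Assumed carbon endpoint; adjust for your setup.
    while True:
        time.sleep(interval)
        line = graphite_line(counter.name, counter.snapshot(), int(time.time()))
        with socket.create_connection((host, port)) as s:
            s.sendall(line.encode())
```

Run the loop in a daemon thread (`threading.Thread(target=push_loop, args=(counter,), daemon=True).start()`) and the hot path only ever touches the lock.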

[+] halbritt|5 years ago|reply
The hot new technology for metrics is Prometheus and its ilk, which is pull-based.
[+] chris_f|5 years ago|reply
There have been a lot of articles posted recently about the 'old' web, and while I like the concept I still have a hard time finding quality information in many of the directories and webrings posted. The level of research and density of information in this blog is very good.
[+] neoplatonian|5 years ago|reply
This is a great post! We should have more of these out there. Does anyone have any recommendations for similar posts for Node.js (instead of JVM)?

Or any good resource which discusses possible optimizations in the infra stack at a more theoretical, abstract, generalizable level?

[+] dmos62|5 years ago|reply
I'm not sure I understood the solution there. Storing only 0.1%-0.01% of the interesting metrics data makes sense in the same way that you'd take a poll of a small fraction of the population to make guesses about the whole?
[+] yellowstuff|5 years ago|reply
I believe he means they are storing all the data, but for a subset of types of data, sorta like extracting just a few columns out of a big table. Presumably someone on some other team gets use out of having access to the 99.9% of data stored that is not relevant to "performance or capacity related queries."
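The distinction between the two readings can be sketched in a few lines — column subsetting keeps every row but only a few fields, while row sampling (the poll analogy) keeps every field for a small random fraction of rows. The field names and counts here are made up for illustration:

```python
import random

rows = [
    {"ts": i, "cpu": i % 100, "mem": i % 7, "build_id": f"b{i}", "trace": "..."}
    for i in range(10_000)
]

# Column subsetting: keep all rows, but only the fields
# needed for perf/capacity queries.
perf_view = [{"ts": r["ts"], "cpu": r["cpu"], "mem": r["mem"]} for r in rows]

# Row sampling: keep all fields for a random 0.1% of rows.
sample = random.sample(rows, k=len(rows) // 1000)

print(len(perf_view))  # -> 10000 (every row, fewer columns)
print(len(sample))     # -> 10 (few rows, all columns)
```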
[+] wwarner|5 years ago|reply
Would be a very natural AWS dashboard.
[+] Aperocky|5 years ago|reply
Cloudwatch is pretty awesome.
[+] m0zg|5 years ago|reply
Funny how stuff like this is "groundbreaking" outside of e.g. Google, where you've been able to collect and query metrics in realtime for more than a decade now.
[+] dandare|5 years ago|reply
> since i like boring, descriptive, names..

I feel like I'm having an inception moment. Shouldn't "boring, descriptive, names" be the default in all of IT?

[+] ertian|5 years ago|reply
The problem with that is that you end up with tons of confusing name collisions.