
Monitoring your own infrastructure using Grafana, InfluxDB, and CollectD

385 points | serhack_ | 5 years ago | serhack.me | reply

292 comments

[+] avthar|5 years ago|reply
Echoing the sentiment expressed by others here, for a scalable time-series database that continues to invest in its community and plays well with others, please check out TimescaleDB.

We (I work at TimescaleDB) recently announced that multi-node TimescaleDB will be available for free, specifically as a way to keep investing in our community: https://blog.timescale.com/blog/multi-node-petabyte-scale-ti...

Today TimescaleDB outperforms InfluxDB across almost all dimensions (credit goes to our database team!), especially for high-cardinality workloads: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

TimescaleDB also works with Grafana, Prometheus, Telegraf, Kafka, Apache Spark, Tableau, Django, Rails, anything that speaks SQL...
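
As a rough illustration of the "anything that speaks SQL" point, here is a sketch of querying a TimescaleDB hypertable from Python through a plain Postgres driver (the table, columns, and connection string below are made up):

    # Sketch of querying a TimescaleDB hypertable from Python with a stock
    # Postgres driver; table, columns, and connection details are made up.
    import psycopg2

    conn = psycopg2.connect("host=localhost dbname=metrics user=grafana password=secret")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT time_bucket('1 hour', time) AS hour,  -- TimescaleDB bucketing function
                   host,
                   avg(cpu_usage) AS avg_cpu
            FROM cpu_metrics
            WHERE time > now() - interval '1 day'
            GROUP BY hour, host
            ORDER BY hour
        """)
        for hour, host, avg_cpu in cur.fetchall():
            print(hour, host, round(avg_cpu, 2))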

[+] linsomniac|5 years ago|reply
I've been monitoring our ~120 machines for 3-4 years now using Influx+Telegraf+Grafana, and have been really happy with it. Prior to that we were using collectd+graphite, and with 1-minute stats it was adding double-digit percentage utilization on our infrastructure (I don't remember exactly how much, but I want to say 30% CPU+disk).

InfluxDB has been a real workhorse. We suffered through some of their early issues, but since then it's been extremely solid. It just runs, is space-efficient, and very robust.

We almost went with Prometheus instead of Influx (as I said, early growing pains), but I had just struggled through managing a central inventory and hated it, so I really wanted a push rather than pull architecture. But from my limited playing with it, Prometheus seemed solid.

[+] carlosdp|5 years ago|reply
Gotta say though, having rolled grafana and prometheus and such on my own plenty of times before, if you are a startup and can afford Datadog, use Datadog.
[+] dijit|5 years ago|reply
I just costed a datadog deployment (based on your comment) and it would cost me my yearly salary every month.

No thanks. :/

[+] Axsuul|5 years ago|reply
As a counterpoint: I have been rolling my own Grafana/Prometheus for many years for my startup and it's been pretty trivial
[+] ciguy|5 years ago|reply
This is good advice; however, you will also want to make sure you have a plan to get off Datadog when you grow. Datadog is one of the easiest to use and most comprehensive out of the box, but it gets really expensive as you begin to scale up and add servers, cloud accounts, services, etc.

At a certain scale, rolling your own monitoring and alerting becomes cost effective again as Datadog begins to charge an arm and a leg. I've seen Datadog bills that could easily pay for 2 full time engineers.

[+] jakozaur|5 years ago|reply
Datadog's per-host pricing can be very expensive. Metrics are provided by many platforms. If you need logs too, you might look at Sumo Logic, which has much cheaper metrics in the typical use case.

Disclaimer: I work at Sumo Logic.

[+] scaryclam|5 years ago|reply
This is not my experience at all. We've got it set up and monitoring all sorts of things. After the initial setup, the only reason we've needed to touch it is when we've introduced new things and wanted to update the config. In fact, Datadog was far more of a pain, and far less useful, than Prometheus has been.
[+] Hates_|5 years ago|reply
I'd be interested to know a bit of detail, as we're looking into Grafana/Prometheus a bit.
[+] pachico|5 years ago|reply
I approached InfluxDB since it looked promising. It did actually serve its purpose when things were simple, and Telegraf was indeed handy. Now that I have more mature requirements I can't wait to move away from it. It freezes frequently, its UI (Chronograf) is really rubbish, functions are very limited, and managing continuous queries is tiresome.

I'm now having better results and a better experience storing data in ClickHouse (yes, not a time-series DB).

From time to time I also follow what's coming in InfluxDB 2.0 but I must confess that 16 betas in 8 months are not very promising.

It might just be me.

[+] overcast|5 years ago|reply
Really, REALLY tried to love InfluxDB. But its system requirements, performance, and features are poor compared to things like TimescaleDB.
[+] candiddevmike|5 years ago|reply
Influx is really shitting the bed with how they're handling the InfluxDB 2.0 release. The docs are a mess, and the migration tool seems like the result of a weekend hackathon. They're leaving a lot of customers with long term metrics in a tough spot.

If you're thinking about using Influx for long term data storage, look elsewhere. The company continuously burns customer goodwill by going against the grain, and bucking the ecosystem by trying to control the entire stack.

[+] goliatone|5 years ago|reply
I'm in the same boat, I tried very hard. Went through a few version upgrades with data incompatibility and other issues. Used it for personal projects as well as for different projects at work. I truly regret ever recommending it to different teams at work.

I ended up moving away from it to TimescaleDB and I’m pretty happy so far.

[+] majkinetor|5 years ago|reply
I use InfluxDB on a number of big government production services and have never had a single issue with it. I use it for all metrics (application and infrastructure).

It's easy to deploy and it's cross-platform. Not sure what all the comments here are about - maybe it's different for really huge loads, which I don't have experience with; I usually use a separate InfluxDB per service.

[+] mbell|5 years ago|reply
We had major issues with scaling InfluxDB. We use ClickHouse (Graphite table engine) now and it is more than an order of magnitude more resource-efficient.
[+] julioo|5 years ago|reply
What about storage? We are running InfluxDB and are looking for an alternative, but one area where Influx is good is storage.
[+] RedShift1|5 years ago|reply
InfluxDB is pretty good if you don't need to do any advanced querying like grouping by month or formulas. Its strong points are a low disk-space footprint and very fast queries, even over long periods of time.
[+] svd4anything|5 years ago|reply
Is that really true? Do you have any reference for that, I’m curious how different it is. FWIW I use InfluxDB and haven’t really thought it was a huge resource hog.
[+] viraptor|5 years ago|reply
Do any of those matter for small-scale monitoring (say <= 10 hosts)? I've got Influx sitting pretty much idle in that kind of environment, and any data preprocessing/collation in Grafana works just fine.
[+] jakobdabo|5 years ago|reply
I still can't find any alternative to the old, RRD-based Munin. It is so simple. You want to add a new server to monitor? Just install the node part there, enable any required additional plugins (just by creating a couple of soft links), add a one-line configuration with the new node's IP address to the main server, and you are done.

Also, the aesthetics of the UX: you see all the graphs on one single page[1], no additional clicks required - a quick glance with a slow scroll and you can see if there was anything unusual during the last day/week.

[1] - publicly available example, found by googling - https://ansible.fr/munin/agate/agate/index.html

[+] Papric0re|5 years ago|reply
[Offtopic (a bit)] Lots of you are talking about metrics monitoring, but do you have recommendations when it comes to (basic) security monitoring? I would usually go for the Elastic Stack for that purpose, especially because Kibana offers lots of features for security monitoring, but I feel like these stacks are so big and bloated. I basically need something to monitor network traffic (flows and off-database retention of PCAPs) and save some security logs (I'm not intending to alert based on logs, just retention). But being able to have a network overview, with insight into current connections (including history), is a very useful thing. Can anybody recommend something that's maybe a bit lighter than an entire Elastic Stack?
[+] hnarn|5 years ago|reply
Some people are probably going to throw some shade at me for saying this since it's so out of fashion, but in my mind, when it comes to some types of basic monitoring (SNMP monitoring of switches/Linux servers, disk space usage, backups running and handling them when they don't), Nagios does get the job done. It's definitely olives and not candy[1], but it's stable, modular, relatively easy to configure (when you get it), and it just keeps chugging away.

If anyone has any Nagios questions, I'd be happy to answer them. I'm a Nagios masochist.

(Also, I can recommend Nagflux[2] to export performance data (metrics) to InfluxDB, because no one should have to touch RRD files.)

[1]: https://lambdaisland.com/blog/2019-08-07-advice-to-younger-s...

[2]: https://github.com/Griesbacher/nagflux

[+] ddevault|5 years ago|reply
I found the software in this stack to be very bloated and difficult to maintain. Large, complicated software has a tendency to fall flat on its face when something goes wrong, and this is a domain where reliability is paramount. I wrote about a different approach (Prometheus + Alertmanager) for sourcehut, if you're curious:

https://sourcehut.org/blog/2020-07-03-how-we-monitor-our-ser...

[+] aequitas|5 years ago|reply
For those who still remember Graphite: the team over at Grafana Labs has been maintaining Graphite-web and Carbon since 2017, and both are still in active development, getting improvements and feature updates. It might not scale as well as some of the other solutions, but for medium-size or homelab setups it's still a nice option if you don't like PromQL or InfluxQL.

https://grafana.com/oss/graphite/

[+] x87678r|5 years ago|reply
I love Prometheus. It's simple, and the built-in charts are enough without having to use Grafana on top.
[+] halfmatthalfcat|5 years ago|reply
The question is how to do long-term storage though. Something I've had a bit of trouble reasoning about.

Right now all of my metrics are sitting in a PVC with a 30d retention period, so we're probably fine but for longer term cold storage the options aren't great unless you want to run a custom Postgres instance with the Timescale plugin or something else more managed.

[+] speedgoose|5 years ago|reply
I don't. It's quite difficult to do complex queries and its query language has a few gotchas. I think it's quite good and one of the best solutions today, but I look forward to something as simple and as fast, but with a proper query language.
[+] roland35|5 years ago|reply
InfluxDB and Grafana worked great for us when I created a live monitoring system for a fleet of prototype test robots. It was simple to set up new data streams. We started with Graphite but switched to InfluxDB for its flexibility (Grafana works with both!).

I would add to the guide that you need to be careful about how you format the lines you send to InfluxDB, because where you put the spaces and commas determines what is indexed or not! Also, data types should be explicit (i.e. make sure you are setting integer vs. float correctly) - see the sketch below.
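
A rough sketch of what that looks like in line protocol (the measurement, tag, and field names below are made up): everything before the first space (the measurement plus its tags) is indexed, everything after it (the fields) is not, and integer fields need a trailing "i" or they are stored as floats.

    # Sketch of InfluxDB line protocol formatting; the measurement, tag, and
    # field names are illustrative.
    import time

    def robot_point(robot_id: str, axis: str, position: float, error_count: int) -> str:
        ts_ns = time.time_ns()  # line protocol timestamps default to nanoseconds
        return (
            f"robot_telemetry,robot={robot_id},axis={axis} "  # measurement + tags: indexed
            f"position={position},errors={error_count}i "     # fields: not indexed; "i" marks an integer
            f"{ts_ns}"
        )

    # e.g. robot_telemetry,robot=proto-07,axis=x position=12.5,errors=3i <ns timestamp>
    print(robot_point("proto-07", "x", 12.5, 3))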

[+] majkinetor|5 years ago|reply
You can quickly do this on Windows:

    # Install InfluxDB 1.x and Grafana via Chocolatey (cinst = choco install)
    cinst influxdb1 /Service
    cinst grafana
    # Launch the Grafana server straight from the Chocolatey package directory
    start $Env:ChocolateyInstall\lib\grafana\tools\grafana-*\bin\grafana-server.exe
    # psinflux is a small PowerShell module for sending data to InfluxDB
    git clone https://github.com/majkinetor/psinflux
    import-module ./psinflux
    # Write ten sample points in line protocol ("test1 value=<n>"), one per second
    1..10 | % { $x = 10*$_ + (Get-Random 10); Send-Data "test1 value=$x"; sleep 1 }
[+] rattray|5 years ago|reply
I've only ever used third party monitoring tools, but hope to set up a startup again soon and want to do OSS if I can.

Can anyone comment on Prometheus vs Timescale? What are the tradeoffs? Or would I use Prometheus on top of Timescale?

[+] k-rus|5 years ago|reply
You can use Prometheus on top of TimescaleDB. Timescale builds a connector and an entire workflow to run Prometheus on top of TimescaleDB and supports Grafana in a flexible way. Sorry for the promo :) check the details at https://github.com/timescale/timescale-prometheus
[+] valyala|5 years ago|reply
Prometheus is a monitoring system (it collects metrics and evaluates alerts on them), while TimescaleDB is a database. While it is possible to use TimescaleDB as remote storage for Prometheus, I'd recommend taking a look at other remote storage integrations as well [1]. Some of them natively support PromQL (like VictoriaMetrics, M3DB, Cortex), others may be easier to set up and operate. For instance, VictoriaMetrics works out of the box without complex configuration. Disclaimer: I work for VictoriaMetrics :).

[1] https://prometheus.io/docs/operating/integrations/#remote-en...

[+] ecoqba11|5 years ago|reply
We switched from InfluxDB to TimescaleDB for our IoT solutions. InfluxDB is very difficult to work with for large datasets and enterprise/regional compliance. We ingest around 100MB of data per day and growing.
[+] mhall119|5 years ago|reply
That doesn't actually sound like a large dataset. Can you describe what kind of problems you faced with InfluxDB?
[+] second--shift|5 years ago|reply
I've looked at these before, and I remember a few years ago when Grafana was really starting to get big, but I guess I have a bona-fide question: Who really needs this?

I manage a small homelab infra, but also an enterprise infra at work with >1,000 endpoints to monitor, and I/we use simple shell scripts, text files, and rsync/ssh. We monitor cpu load, network load, disk/io load, all the good stuff basically. The monitor server is just a DO droplet and our collectors require zero overhead.

The spec requirements and setup costs in time and complexity are steep with a Grafana stack - is there any value besides just the visuals? I know they have the ability to do all manner of custom plugins, dashboards, etc., but if you just care about the good stuff (uptime + performance), what does Grafana give you that rsync'ing sar data can't?

PS: we have a graphical parser of the data written using python and matplotlib. very lightweight, and we also get pretty graphs to print and give to upstairs.

[+] viraptor|5 years ago|reply
What sar+rsync doesn't provide:

- app-specific metrics

- quick and easy way to build a number of graphs searching for correlation (how to slice the data to get results that explain issues)

- log/metrics correlation

- unified way to build alerts

- ad-hoc changes - while you're in the middle of an incident and want to get information that's just slightly different from existing, or filter out some values, or overlay a trend - how long would it take in your custom solution vs grafana?

And finally - Grafana exists. Why would I write a custom graph generator on top of a custom data store if I can set up a collector + Influx + Grafana in a fraction of that time and get more back?

[+] gjulianm|5 years ago|reply
I'm not experienced with the CollectD stack, but I use Prometheus + Grafana to monitor probes. My two cents:

- Fairly lightweight. Prometheus deals with quite a lot of series without much memory or CPU usage.

- Integration with a lot of applications. Prometheus lets me monitor not only the system, but other applications such as Elastic, Nginx, PostgreSQL, network drivers... Sometimes I need an extra exporter, but they tend to be very light on resources. Also, with mtail (which is again super lightweight) I can convert logs to metrics with simple regexes.

- Number of metrics. For instance, several times I've needed to diagnose an outage and wanted a metric that I hadn't thought about, and it turned out that the exporter I was using did actually collect it; I just hadn't included it in the dashboard. As an example, the default node exporter has very detailed I/O metrics, systemd collectors, network metrics... They're quite useful.

- Metric correctness. Prometheus appears to be at least decent at dealing with rate calculations and counter resets. Other monitoring systems are worse, and it wasn't unusual to find a 20000% CPU usage alert due to a counter reset.

- Alerts. Prometheus can generate alerts with quite a lot of flexibility, and the AlertManager is a pretty nice router for those alerts (e.g., I can receive all alerts in a separate mail, but critical alerts are also sent in a Slack channel).

- Community support. It seems the community is adopting the Prometheus format for exposing metrics, and there are packages for Python, Go, and probably more languages (a minimal Python sketch follows at the end of this comment). Also, the people who make the exporters tend to also make dashboards, so you almost always have a starting point that you can fine-tune later.

- Ease of setup. It's just YAML files; I have an Ansible role for automation, but you can get by with just installing one or two packages on clients and adding a line to a configuration file on the Prometheus master node.

- Ease of use. It's incredibly easy to make new graphs and dashboards with Prometheus and Grafana, no matter if they're simple or complex.

For me, the main points that make me use Prometheus (or any other monitoring config above simple scripts) is alerting and the amount of metrics. If you just need to monitor CPU load and simple stats, maybe Prometheus is too much, but it's not that hard to set up anyways.
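
To make the "packages for Python" point a bit more concrete, a minimal sketch with the prometheus_client package (the metric names, values, and port are arbitrary placeholders); pointing Prometheus at it is then just one more target in the scrape config:

    # Minimal sketch of exposing custom app metrics with the Python
    # prometheus_client package; metric names, values, and port are arbitrary.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    REQUESTS = Counter("myapp_requests_total", "Total requests handled")
    QUEUE_DEPTH = Gauge("myapp_queue_depth", "Current length of the work queue")

    if __name__ == "__main__":
        start_http_server(8000)  # serves http://localhost:8000/metrics for Prometheus to scrape
        while True:
            REQUESTS.inc()                          # counters only go up; rate() handles resets
            QUEUE_DEPTH.set(random.randint(0, 50))  # gauges can go up and down
            time.sleep(5)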

[+] Nextgrid|5 years ago|reply
The biggest advantage is the near-real-time aspect of this. During an outage, having live metrics is essential. Does your custom system allow you to see live metrics as they happen, or do you need to re-run your aggregation scripts every 5 minutes?
[+] dewey|5 years ago|reply
At work and in my personal projects I use a Prometheus + Grafana setup and it's very easy to set up. Not sure what you mean by the complexity and setup costs of a Grafana stack?

Alerting with the Prometheus AlertManager is also pretty straightforward, and I look at dashboards every day to see if everything is running smoothly, or to track down what's not working well if there are any issues. Grafana dashboards are always the second thing I look at after an alert fires somewhere, and they have been invaluable.

[+] VectorLock|5 years ago|reply
It sounds like you home-rolled a version of Grafana, CollectD, et al., which probably took more time and effort than just installing Grafana, CollectD, et al.
[+] ggregoire|5 years ago|reply
I think you are overestimating the time and complexity of installing and setting up Prometheus + Grafana on a box, node exporter on your hosts, and copy/pasting a Grafana dashboard for node exporter (which is your use case).

It gets complex only when you start monitoring your apps (i.e. using a prometheus client library to generate and export custom app metrics) and create custom grafana dashboards for these metrics. Or if you need to monitor some niche technology without its own existing prometheus exporter. Then yes, you need to read the docs, think about what you need to monitor and how, write code...

[+] lifeisstillgood|5 years ago|reply
I love this attitude (simplest thing that can possibly work) and am trying to write a book about how to run a whole syseng department this way.

In my view Grafana shines in app data collection, and that is what we want a lot of (my simplest thing that can possibly work is just a web server accepting metrics à la Carbon/Graphite - sketched below).

So one is likely to have Grafana already lying around from the infra monitoring - I guess the choice now is how to leverage your setup to do app monitoring?
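
For what it's worth, that "web server accepting metrics" really can be tiny. A sketch in Python, assuming Graphite/Carbon's plaintext protocol ("metric.path value timestamp", one line per metric) and just appending to a flat file; the port and file path are arbitrary:

    # Sketch of a bare-bones receiver: accept Carbon-style plaintext lines
    # ("metric.path value timestamp") over TCP and append them to a flat file.
    import socketserver

    LOG_FILE = "metrics.log"  # arbitrary destination

    class PlaintextHandler(socketserver.StreamRequestHandler):
        def handle(self):
            with open(LOG_FILE, "a") as out:
                for raw in self.rfile:                       # one metric per line
                    parts = raw.decode("utf-8", "replace").split()
                    if len(parts) != 3:                      # skip malformed lines
                        continue
                    path, value, timestamp = parts
                    out.write(f"{path} {value} {timestamp}\n")

    if __name__ == "__main__":
        # 2003 is Carbon's usual plaintext port; anything free works
        with socketserver.TCPServer(("0.0.0.0", 2003), PlaintextHandler) as server:
            server.serve_forever()

No retention, downsampling, or indexing, of course - which is roughly where the Influx/Graphite-style stacks start to earn their keep.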

[+] abhishekjha|5 years ago|reply
This is exactly what I have been planning to do to monitor my two Raspberry Pis. I am still debating the metrics collector though. My workplace uses monitord for AINT and Telegraf for Wavefront. I have no idea how well collectd works.
[+] zytek|5 years ago|reply
VictoriaMetrics eats other TSDBs for breakfast.

PromQL support (with extensions) and clustered / HA mode. Great storage efficiency. Plays well for monitoring multiple k8s clusters, works great with Grafana, pretty easily deployed on k8s.

No affiliation, just a happy user.

[+] mekster|5 years ago|reply
I just don't get why VictoriaMetrics doesn't get more visibility.

Maybe they need a PR person.