item 30854734

Grafana Mimir – Horizontally scalable long-term storage for Prometheus

262 points | devsecopsify | 3 years ago | grafana.com

119 comments

[+] nosequel|3 years ago|reply
Grafana Labs needs to make a convincing comparison chart of some kind between Mimir, Thanos, and Cortex. Thanos and Cortex are both mature projects and are both CNCF Incubating projects. Why would anyone switch to a new Prometheus long-term storage solution from those?

*EDIT*: I see from another reply there is a basic comparison to Cortex here: https://grafana.com/blog/2022/03/30/announcing-grafana-mimir... To the Mimir folks, I'd love to see something similar for Mimir vs. Thanos.

[+] sciurus|3 years ago|reply
It looks like this is a fork of Cortex driven by the maintainers employed by Grafana Labs, done so they can change the license to one that will prevent cloud providers like Amazon from offering it without contributing changes back.

This is interesting, since Amazon offers both hosted Grafana and Cortex today. I was under the impression Amazon and Grafana Labs were successfully collaborating (unlike e.g. AWS and Elastic), but seems like that's not the case.

[+] fishpen0|3 years ago|reply
> Cortex is used by some of the world’s largest cloud providers and ISVs, who are able to offer Cortex at a lower cost because they do not invest the same amount in developing the project.

> ...

> All CNCF projects must be Apache 2.0-licensed. This restriction also prevents us from contributing our improvements back to Cortex.

I read this as "Amazon has destroyed the CNCF by not playing nice"

[+] smw|3 years ago|reply
Seems like people should throw VictoriaMetrics into comparisons like this, as well?
[+] mekster|3 years ago|reply
You're forgetting VictoriaMetrics, which is presumably the best choice for Prometheus long-term storage.

Such a solid solution exists, and yet another competitor? Not sure why they didn't just buy VictoriaMetrics and possibly rebrand it.

[+] mr-karan|3 years ago|reply
Folks looking for a solution to storing Prometheus metrics from multiple places, definitely consider exploring Victoriametrics.

I'm running a single Victoriametrics instance which has 230bn metrics, consuming ~4GB of memory and barely 200m of CPU utilization (only spikes to ~1.5cores when it flushes these datapoints from RAM to disk). I've previously[1] shared my experience of setting up Victoriametrics for long term Prometheus storage back in 2020 and since then this product has just kept getting better.

Over time, I switched to `vmagent` and `vmalert` as well, which offer some nice little things (like, did you know you can't break up the scrape config of Prometheus into multiple files? `vmagent` does that happily). The whole setup is very easy to manage for an Ops person (as compared to Thanos/Cortex - yet to check out Mimir though!). I've barely had to tweak any default configs that come with Victoriametrics, and I even increased the retention of metrics from a month to multiple months after gaining confidence in prod.

[1]: https://zerodha.tech/blog/infra-monitoring-at-zerodha/
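To illustrate the multi-file scrape config point: as far as I remember the vmagent docs, its config accepts a `scrape_config_files` section that plain Prometheus lacks (a sketch only - the file paths below are made up):

```yaml
# config passed via: vmagent -promscrape.config=prometheus.yml
global:
  scrape_interval: 15s

# vmagent extension: load additional scrape_configs from separate files (globs allowed)
scrape_config_files:
  - scrape_configs/databases.yml
  - scrape_configs/apps/*.yml
```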

[+] halfmatthalfcat|3 years ago|reply
How does this stack up against https://github.com/thanos-io/thanos, which I've used with pretty good success?

The only criticism I have of Thanos is the amount of moving pieces to maintain.

[+] netingle|3 years ago|reply
(Tom here; I started the Cortex project on which Mimir is based and lead the team behind Mimir)

Thanos is an awesome piece of software, and the Thanos team have done a great job building a vibrant community. I'm a big fan - so much so that we used Thanos' storage in Cortex.

Mimir builds on this and makes it even more scalable and performant (with a sharded compactor and query engine). Mimir is multitenant from day 1, whereas this is a relatively new thing in Thanos, I believe. Mimir has a slightly different deployment model to Thanos, but honestly even this is converging.

Generally: choosing Thanos is always going to be a good choice, but IMO choosing Mimir is an even better one :-p

[+] witcher|3 years ago|reply
(Bartek here: I co-started Thanos and maintain it with other companies)

Thanks for this - it's good feedback. It's funny you mention that, because we actively try to reduce the number of running pieces, e.g. while we design our query sharding (parallelization) and pushdown features.

As Cortex/Mimir shows, it's hard - if you want to scale out every tiny piece of functionality in your system, you end up with twenty different microservices. But it's an interesting challenge to have - eventually it comes down to the trade-offs we try to make in Thanos between simplicity, reliability and cost vs. ultra max performance (Mimir/Cortex).

[+] pracucci|3 years ago|reply
Mimir has a microservices architecture. However, Mimir supports two deployment modes: monolithic and microservices.

In monolithic mode you deploy Mimir as a single process and all microservices (Mimir components) run inside the same process. Then you scale it out running more replicas. Deployment modes are documented here: https://grafana.com/docs/mimir/latest/operators-guide/archit...

[+] eatonphil|3 years ago|reply
There isn't a link to the project on the page (that I could find) so it almost looked like it's not open source. But here it is: https://github.com/grafana/mimir.
[+] notamy|3 years ago|reply
You have to find the "Download" button and click it, it's very non-obvious :< The entire page seems to be designed to funnel you into signing up for their paid service, which makes sense, but still doesn't feel great...
[+] MindTooth|3 years ago|reply
How does this compare to https://www.timescale.com/promscale

I’m looking into choosing a backend for my metrics and am always open to suggestions.

[+] vineeth0297|3 years ago|reply
Hey!

Promscale PM here :)

Promscale is the open source observability backend for metrics and traces, powered by SQL.

Mimir/Cortex, by contrast, is designed only for metrics.

Key differences:

1. Promscale is lighter in architecture: all you need is the Promscale connector + TimescaleDB to store and analyse metrics and traces, whereas Cortex comes with a highly scalable microservices architecture that requires deploying tens of services like the ingester, distributor, querier, etc.

2. Promscale offers storage for metrics, traces and logs (in the future) - one system for all observability data - whereas Mimir/Cortex is purpose-built for metrics.

3. Promscale supports querying metrics using PromQL and SQL, and traces using Jaeger query and SQL, whereas in Cortex/Mimir all you can use is PromQL for metrics querying.

4. The observability data in Cortex/Mimir is stored in an object store like S3 or GCS, whereas in Promscale the data is stored in a relational database, i.e. TimescaleDB. This means Promscale can support more complex analytics via SQL, but Cortex is better for horizontal scalability at really large scales.

5. Promscale offers per-metric retention, whereas Cortex/Mimir offers a global retention policy across all metrics.

I hope this answers your question!

[+] SuperQue|3 years ago|reply
One interesting question I have is in regard to global availability.

With our current Thanos deployment, we can tie a single geo regional deployment together with a tiered query engine.

Basically like this:

"Global Query Layer" -> "Zone Cluster Query Layer" -> "Prom Sidecar / Thanos Store"

We can duplicate the "Global Query Layer" in multiple geo regions with their own replicated Grafana instances. If a single region/zone has trouble, we can still access metrics in the other regions/zones. This avoids Thanos having any SPoFs for large multi-user (Dev/SRE) orgs.
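The tiered layout above can be sketched as a toy model - this is my own illustration, not the real Thanos API, and all names are made up:

```python
# Toy model of a tiered query layout: a querier merges series from its
# children, and since queriers can themselves be children of other queriers,
# a "global" layer is just a querier over zone-level queriers.

def merge(series_sets):
    """Union series by label set; merge samples for series seen in several children."""
    out = {}
    for series in series_sets:
        for labels, samples in series.items():
            out.setdefault(labels, {}).update(samples)
    return out

class Store:
    """Stands in for a Prom sidecar / Thanos store."""
    def __init__(self, series):
        self.series = series  # {frozenset of (label, value): {ts: value}}

    def query(self, metric):
        return {ls: dict(ss) for ls, ss in self.series.items()
                if dict(ls).get("__name__") == metric}

class Querier:
    """Stands in for a zone-level or global query layer."""
    def __init__(self, children):
        self.children = children  # stores or other queriers

    def query(self, metric):
        return merge(c.query(metric) for c in self.children)
```

Because the global layer is just another `Querier` instance over the zone queriers, replicating it per region is cheap, which is what removes the single point of failure.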

[+] ddreier|3 years ago|reply
This is one of my favorite things about Thanos. We run Prometheus in multiple private datacenters, multiple AWS regions across multiple AWS accounts, and multiple Azure regions across multiple subscriptions. We have three global labels: cloud, region, and environment. With Thanos's Store/Querier architecture we have a single Datasource in Grafana where we can quickly query any metric from any environment across the breadth of our infrastructure.

It's really a shame that Loki in particular doesn't share this kind of architecture. Seems like Mimir, frustratingly, will share this deficiency.

[+] bboreham|3 years ago|reply
The typical way to run Mimir is centralised, with different regions/datacenters feeding metrics into one place. You can run that central system across multiple AZs.

If you run Mimir with an object store (e.g. S3) that supports replication then you can have copies in multiple geographies and query them, but the copies will not have the most recent data.

(Note I work on Mimir)

[+] dikei|3 years ago|reply
Sad news for Cortex: with most of the maintainers moving on to Mimir, I fear it's pretty much dead in the water.
[+] Thaxll|3 years ago|reply
So many solutions to the same problem, how does it compare to Victoria Metrics?
[+] hagen1778|3 years ago|reply
VictoriaMetrics co-founder here.

There are many similar features between Mimir and VictoriaMetrics: multi-tenancy, horizontal and vertical scalability, high availability. Features like ingestion of the Graphite and Influx protocols and a Graphite query engine are already supported by VictoriaMetrics. I didn't find references to downsampling in Mimir's docs, but I believe it supports it too.

There are architectural differences. For example, Mimir stores the last 2h of data on the local filesystem (and mmaps it, I assume) and once every 2h uploads it to the object storage (long-term storage). VictoriaMetrics doesn't support object storage and prefers to use the local filesystem for the sake of query performance. Both VictoriaMetrics and Mimir can be used as a single binary (monolithic mode in Mimir's docs) and in cluster mode (microservices mode in Mimir's docs). The set of cluster components (microservices) is different, though.

It is hard to say anything about ingestion and query performance or resource usage so far. Since benchmarks from the project owners can't be 100% objective, I hope the community will perform unbiased tests soon.

[+] outsb|3 years ago|reply
Given that VictoriaMetrics is the only solution I've seen that makes data comparing it to other systems easily accessible as part of its official documentation, it's the only one I pay attention to.

I knew from reading the docs what VM excelled at and the areas it was weak in, long before I ever ran it (and my experience running it matched the documentation). I hate aspirational, marketing-saturated campaigns for deep-tech projects, where standards should obviously be higher; it speaks more to the intended audience than to the solution. In this respect, VM is automatically a cut above the rest.

[+] bbu|3 years ago|reply
i don't get why there's so much hate here.

cortex is a pain to configure and maintain. would be awesome to have mimir address these issues!

[+] jaigupta|3 years ago|reply
This is about Prometheus, but Mimir makes it interesting. I can't find any other open source time series database except Mimir/Cortex that allows this much scale (clustering options in the open source version). Our use case will have high cardinality, and Mimir seems to fit very well.

Can we use Prometheus/Mimir as a general-purpose time series database? Prometheus is built for monitoring and may not suit general-purpose time series workloads the way InfluxDB does (I am hoping to be wrong). What are the disadvantages/limitations of using Prometheus/Mimir as a general-purpose time series database?

[+] valyala|3 years ago|reply
> I can't find any other open source time series database except Mimir/Cortex which allows this much scale (clustering options in their open source version)

The following open source time series databases also can scale horizontally to many nodes:

- Thanos - https://github.com/thanos-io/thanos/

- M3 - https://github.com/m3db/m3

- Cluster version of VictoriaMetrics - https://docs.victoriametrics.com/Cluster-VictoriaMetrics.htm... (I'm CTO at VictoriaMetrics)

> Can we use Prometheus/Mimir as general purpose time series database?

This depends on what you mean by "general purpose time series database". Prometheus/Mimir are optimized for storing (timestamp, value) series, where the timestamp is a unix timestamp in milliseconds and the value is a floating-point number. Each series has a name and can have an arbitrary set of additional (label=value) labels. Prometheus/Mimir aren't optimized for storing and processing series of other value types, such as strings (aka logs) and complex data structures (aka events and traces).

So, if you need to store time series with floating-point values, then Prometheus/Mimir may be a good fit. Otherwise, take a look at ClickHouse [1] - it can efficiently store and process time series with values of arbitrary types.

[1] https://clickhouse.com/
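The data model described above can be made concrete with a toy sketch (my illustration only - not how Prometheus/Mimir actually store chunks; the metric and label names are arbitrary):

```python
# Toy model of the Prometheus data model: each series is identified by a name
# plus an arbitrary set of (label=value) pairs, and holds (unix_ms, float)
# samples - values are always floats, which is the restriction discussed above.
from collections import defaultdict

class TinyTSDB:
    def __init__(self):
        # key: frozenset of (label, value) pairs, including the series name
        self.series = defaultdict(list)

    def append(self, name, labels, ts_ms, value):
        key = frozenset({"__name__": name, **labels}.items())
        self.series[key].append((ts_ms, float(value)))  # coerced to float

    def select(self, name):
        return {k: v for k, v in self.series.items()
                if dict(k)["__name__"] == name}

db = TinyTSDB()
db.append("http_requests_total", {"job": "api", "instance": "a:9090"},
          1_650_000_000_000, 42)
```

Strings or structured events simply don't fit this (timestamp, float) shape, which is why logs and traces end up in different systems.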

[+] ankitnayan|3 years ago|reply
I would love to see some benchmarks when making such a heavy claim. I would be interested in knowing the ingestion rate, query timings and resource usage.
[+] eatonphil|3 years ago|reply
It's hard to tell exactly how this works, but judging from the tutorial's docker-compose.yml [0] it looks like this runs as a separate API next to Prometheus, and you tell Prometheus to write [1] to Mimir. I'm unclear on how reads work from it - or maybe there is no read?

Maybe I'm completely misunderstanding.

[0] https://github.com/grafana/mimir/blob/main/docs/sources/tuto...

[1] https://github.com/grafana/mimir/blob/main/docs/sources/tuto...

[+] pracucci|3 years ago|reply
Mimir exposes both remote write API and Prometheus compatible API. The typical setup is that you configure Prometheus (or Grafana Agent) to remote write to Mimir and then you configure Grafana (or your preferred query tool) to query metrics from Mimir.

You may also be interested in looking at a 5-minute introduction video, where I cover the overall architecture too: https://www.youtube.com/watch?v=ej9y3KILV8g
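A sketch of that setup as a prometheus.yml fragment - the hostname, port and tenant ID are placeholders, though `/api/v1/push` and the `/prometheus` query prefix match the Mimir docs as I read them:

```yaml
# prometheus.yml fragment: ship samples to Mimir via remote write
remote_write:
  - url: http://mimir:8080/api/v1/push
    headers:
      X-Scope-OrgID: demo   # tenant ID, for multi-tenant setups

# Then point Grafana's Prometheus datasource at the query endpoint:
#   http://mimir:8080/prometheus
```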

[+] bboreham|3 years ago|reply
It’s a centralised multi-tenant store, supporting the Prometheus query API. So you can point clients directly at Mimir; they send in PromQL and they get data back as JSON.

(Note I work on Mimir)

[+] young_unixer|3 years ago|reply
Coincidentally, "mimir" is a funny, baby-like way of saying "dormir" (to sleep) in Spanish.
[+] estebarb|3 years ago|reply
Technical meetings are going to be fun with hispanic devs...

"And finally we sent the metrics to Mimir /giggles/"

Sadly they don't support encryption at rest (sorry, I really had to do one more pun)

[+] vladsanchez|3 years ago|reply
So true!!! LOL I related to "Vamos a mimir!" when I read it!!! ROFL
[+] sriv1211|3 years ago|reply
What's the latency between sending a metric and being able to query it when using object storage (s3) instead of block storage?

How do the transfer/retrieval (GET/PUT) costs factor in as well?

[+] pracucci|3 years ago|reply
Good question! Grafana Mimir guarantees read-after-write consistency: if a write request succeeds, the metric samples you've written are guaranteed to be visible to any subsequent query.

Mimir employs write de-amplification: it doesn't write immediately to the object storage but keeps the most recently written data in memory and/or on local disk.

Mimir also employs several shared caches (Memcached is supported) to reduce object storage (S3) access as much as possible.

You can learn more here in the Mimir architecture documentation: https://grafana.com/docs/mimir/latest/operators-guide/archit...
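The write de-amplification and read-after-write ideas described above can be sketched as a toy model (illustration only, not Mimir code; the 2h block range mirrors the TSDB block convention mentioned elsewhere in the thread):

```python
# Samples accumulate in an in-memory "head" and are shipped to a simulated
# object store as one block per interval: one PUT per block instead of one
# per sample. Queries merge shipped blocks with the head, so a sample is
# queryable the moment its write succeeds (read-after-write).

class ToyIngester:
    def __init__(self, block_range_ms=2 * 60 * 60 * 1000):  # 2h blocks
        self.block_range_ms = block_range_ms
        self.head = []           # most recent samples, in memory / local disk
        self.object_store = []   # each entry stands for one uploaded block
        self.block_start = None

    def write(self, ts_ms, value):
        if self.block_start is None:
            self.block_start = ts_ms
        self.head.append((ts_ms, value))
        if ts_ms - self.block_start >= self.block_range_ms:
            self.object_store.append(self.head)  # single PUT for the whole block
            self.head = []
            self.block_start = None

    def query(self):
        flushed = [s for block in self.object_store for s in block]
        return flushed + self.head  # head included => read-after-write
```

This is also why GET/PUT costs stay bounded: object storage traffic scales with blocks (and cache misses), not with the per-sample ingest rate.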

[+] camel_gopher|3 years ago|reply
"the most scalable open source TSDB in the world"

You can be scalable, and still cost a lot of money to scale out. Unit economics are important.

[+] firstSpeaker|3 years ago|reply
How does it work with rules? So far I can't tell whether this can be a replacement for Prometheus, since I can't see how we can re-use our Prometheus rules with Mimir. Does anyone know anything about that?