top | item 28902403


proddata | 4 years ago

That article takes various concepts from typical TSDB solutions and seemingly only looks at the bad sides. Time-series data comes in many different forms, and not every form works with every TSDB solution.

For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.


akulkarni|4 years ago

(TimescaleDB co-founder)

Thanks for the mention, and I completely agree :-)

Personally, I think there is a lot in this article that is misguided.

For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.

  TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.
This is a broad generalization. Some time-series databases are better at high cardinality than others. Also, what is "high-cardinality" - 100K? 1M? 10M? (We in fact are designed for _higher cardinalities_ than most other time-series databases [0])
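The cardinality explosion mentioned in the quoted passage is just multiplicative: the worst-case number of distinct series per metric is the product of the cardinalities of its tags, so a single high-cardinality tag dominates everything else. A minimal Python sketch (tag names and counts are made up for illustration):

```python
from math import prod

def series_count(tag_cardinalities):
    """Worst-case number of distinct series for one metric:
    the product of the cardinalities of its tags."""
    return prod(tag_cardinalities)

# region (10) x host (100) x endpoint (50) -> 50,000 series: manageable
modest = series_count([10, 100, 50])

# add one user_id tag with 1M values -> 50 billion potential series
exploded = series_count([10, 100, 50, 1_000_000])
```

This is why the "wrong tag" matters so much more than the number of tags: three modest tags multiply out to thousands of series, while one unbounded tag multiplies the whole store by millions.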

  In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data. 
We just launched tracing and metrics support in the same backend - in Promscale, built on TimescaleDB [1]

I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed to educate. But I wish they had done more research - because without it, this article (IMO) ends up confusing more than educating.

For anyone curious on our definition of "time-series data" and "time-series databases": https://blog.timescale.com/blog/what-the-heck-is-time-series...

[0] https://blog.timescale.com/blog/what-is-high-cardinality-how...

[1] https://blog.timescale.com/blog/what-are-traces-and-how-sql-...

ignoramous|4 years ago

How does Timescale (a single-purpose database) hold up against SingleStore (a multi-purpose database)? Of course, Timescale is cheaper, but other than that, have you folks compared/contrasted against SingleStore as a TSDB?

PS https://www.timescale.com/papers/timescaledb.pdf is 404

eska|4 years ago

I had a serious case of deja vu reminding me of your article on compression in timescaledb :-D

oconnore|4 years ago

> Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution

This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics (as in [1])? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.

In this context I have seen other references to: Sawzall (google), Lingo (google), MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?

[1]: https://danluu.com/metrics-analytics/
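The combined workflow suggested above can be sketched in a few lines: a batch pass (a stand-in for the Spark/Hadoop stage) filters, groups, and extracts a compact summary, which is what you would then load into an OLAP store for ad-hoc querying. Event fields here (service, status, latency_ms) are invented for illustration:

```python
from collections import defaultdict

raw_events = [
    {"service": "api", "status": 200, "latency_ms": 12},
    {"service": "api", "status": 500, "latency_ms": 340},
    {"service": "web", "status": 200, "latency_ms": 55},
    {"service": "api", "status": 200, "latency_ms": 18},
]

def batch_rollup(events):
    """Filter/group/extract: keep only the aggregates that
    later ad-hoc queries will actually need."""
    agg = defaultdict(lambda: {"count": 0, "latency_sum": 0})
    for e in events:
        key = (e["service"], e["status"])
        agg[key]["count"] += 1
        agg[key]["latency_sum"] += e["latency_ms"]
    return dict(agg)

summary = batch_rollup(raw_events)
```

The point of the split is that the summary is orders of magnitude smaller than the raw log/event data, so the interactive store never has to hold the full 600 GB in memory.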

proddata|4 years ago

> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?

I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there): it can definitely handle 600 GB and query across it in an ad-hoc manner.

We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.

jpgvm|4 years ago

For longer-scale time series I still recommend Druid as the go-to, mainly because if you make use of its ahead-of-time aggregations (which you can do for real-time or scale-out batch ingestion), your ad-hoc queries can execute extremely quickly even over very large datasets.
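For the curious, the effect of ahead-of-time aggregation ("rollup") can be shown with a toy sketch: events are truncated to a query granularity and metrics are pre-summed at ingest time, so later ad-hoc queries scan far fewer rows. This is a conceptual model only, not Druid's actual ingestion spec; field names are made up:

```python
from collections import defaultdict

def rollup(events, granularity_s=60):
    """Pre-aggregate (ts, dimension, requests, bytes) rows into
    one stored row per (time bucket, dimension)."""
    out = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for ts, dim, requests, nbytes in events:
        bucket = ts - ts % granularity_s  # truncate to the granularity
        out[(bucket, dim)]["requests"] += requests
        out[(bucket, dim)]["bytes"] += nbytes
    return dict(out)

events = [
    (960, "us-east", 1, 512),
    (1010, "us-east", 1, 256),  # same minute bucket -> merged at ingest
    (1075, "us-east", 1, 128),  # next minute bucket
]
rolled = rollup(events)  # 3 raw rows collapse into 2 stored rows
```

The trade-off is that you give up per-event detail below the chosen granularity in exchange for query speed, which is exactly why it suits longer-scale time series.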

Druid only really has one downside: it's still a bit of a pain to set up. It has gotten a ton better in recent times, and I have been contributing changes to make it work better out of the box with common big data tooling like Avro.

For performance it's the top dog, except for really naive queries that are dominated by scan performance. For those you are best off with Clickhouse; its vectorized query engine is extremely fast for simpler, scan-heavy workloads.

shaklee3|4 years ago

We used Clickhouse on about 80 TB in a RAID 10 setup. It was extremely fast.

alfiedotwtf|4 years ago

What are the best books out there for learning about time-series databases? (There are already a million for relational and graph databases, but I haven't seen one for time series.) Bonus points for covering how to implement one.

jpgvm|4 years ago

If you want something like Honeycomb that scales better, maybe look at Druid.

dikei|4 years ago

Last time I checked, Druid was not very good at ad-hoc tasks because it lacked joins and its SQL support was sketchy. How is it now?

camel_gopher|4 years ago

Take a look at IronDB too. High-scale distributed implementation.