It’s About Time for Time Series Databases

[+] kodablah|8 years ago|reply

Sorry to leave the technical detail part real quick. But is anyone else concerned about using a DB solely from a company built specifically around that DB? After Rethink DB (sustainability issue) and Foundation DB (bought and shuttered/hidden) and Riak (admittedly haven't kept up but I saw [0]), I am wary of using any DB that is not built by a large community or is not built as a non-core project from a large tech company. Sorry TimescaleDB, I see you have raised a decent amount of funding, but I have to choose my DBs w/ trepidation these days.

0 - https://www.theregister.co.uk/2017/07/13/will_the_last_perso...

[+] lbruder|8 years ago|reply

We use a combination of SQLite and a sharding frontend service. One SQLite database file per device, one table per sensor, table contents are timestamp and measured value. As simple as it gets, easy to scale, and damn fast.

But try telling people you're using SQLite to store critical data...

[+] lima|8 years ago|reply

ClickHouse (the analytics DMBS by Yandex), while not explicitly designed as such, is a fantastic time series database.

There's even a special backend, the GraphiteMergeTree, which does staggered downsampling, something most TSDBs aren't able to.

It's the most promising development in this space I've seen in a long time.

https://clickhouse.yandex/

https://clickhouse.yandex/docs/en/table_engines/graphitemerg...

[+] manigandham|8 years ago|reply

I'd also recommend Druid, MemSQL, SnappyData, MapD and other column-oriented databases. Any of them can partition on a time column with full SQL and extremely fast aggregations and high compression that come from columnar storage.

[+] cevian|8 years ago|reply

Clickhouse is very cool. But note that it does not support transactional and relational semantics and does not have real-time updates or deletes. Thus, its meant for very different applications than TimescaleDB. I would classify Clickhouse more in the data-warehouse space...

[+] betaby|8 years ago|reply

Direct link how to use as a graphite whisper replacement https://github.com/yandex/graphouse

Telegram channel https://t.me/clickhouse_en

[+] nawgszy|8 years ago|reply

Completely unrelated question, but how on earth is 'clickhouse.yandex' a valid web address?

[+] YCode|8 years ago|reply

> Back in 2014...developers...could use relational databases with SQL interfaces, which are easy to use but they don’t scale well.

I'm skeptical.

[+] g09980|8 years ago|reply

Guessing they meant to write "... for time series aggregation"

[+] stdbrouw|8 years ago|reply

> nobody wants to have large grain snapshots of data for any dataset that is actually comprised of a continuous stream of data points

Except, of course, for those who realize that the precision of a statistic only increases at sqrt(n) and that a biased dataset will remain biased regardless of how much data you have. I'll take a large grain dataset that I can load on my computer and analyze in five minutes over a finer grained dataset where I need to set up a cluster before I can even get started. Enough with the "let's store everything" fetishism already.

(Somewhat tangential to the blog post, I realize.)

[+] teej|8 years ago|reply

The reason we need to store everything is less about needing perfect accuracy of measurement (though I think we do want it) and more about the curse of dimensionality[0]. We want to slice, pivot, and filter datasets more aggressively than ever before which helps drive aggressive data collection.

[0] - https://en.wikipedia.org/wiki/Curse_of_dimensionality

[+] jandrewrogers|8 years ago|reply

For sensor data analytics, you are frequently using many orthogonal sensor data sources to measure the same thing, precisely so that you can remove source bias. And most non-trivial sensor analytics are not statistical aggregates but graph reconstructions, the latter being greatly helped by having as much data as you can get your hands on.

The "let's store everything" isn't being done for fun; it is rather expensive. For sophisticated sensor analytics though, it is essentially table stakes. There are data models where it is difficult to get reliable insights with less than a 100 trillion records. (Tangent: you can start to see the limit of 64-bit integers on the far horizon, same way it was with 32-bit integers decades ago.)

[+] RobAtticus|8 years ago|reply

Re: setting up a cluster

With TimescaleDB we've focused on single node performance to try and reduce the need for clustering. We've found performance scales very well just by adding more disk space if needed and more cores. So maybe some datasets are not practical for your laptop necessarily but a single instance on Azure/AWS/GCP is workable. No need for a cluster to get started :)

(Read scale out is available today and we are working on write scale out, hopefully later this year)

[+] qaq|8 years ago|reply

There are a lot of use cases (for example security) where it is not the case.

[+] shangxiao|8 years ago|reply

This is a bit easier to grok: http://www.timescale.com/how-it-works

(if you're new to this like me)

[+] bitoneill|8 years ago|reply

https://en.wikipedia.org/wiki/Time_series_database

[+] acomjean|8 years ago|reply

I worked at a place that monitored power usage minute by minute across 1000s of locations. We just used MySQL with a time column. Maybe I'm not the target audience but I'm failing to see what this gets me.

The problem is they say the data is imuatable and stored sequentially, allthough our data was imutable with devices on the net the data comes in random order when these ineviatably have connection problems.

We always aggregated our data into larger time blocks. Storage was cheap and doing comparative analysis across locations and time zones was always our pain point.

I think using a Postgres database is wise.

[+] kev009|8 years ago|reply

Then effectively your insert rate is between 16 and 60 per second, sure, you don't really need a sophisticated partitioning or log structured DB. Native static partitioning would give you a decent speedup without much thought.

It's intro computer science, if you have a tree structure and fill it up, you spend a lot of time in the corners of theta notation. Timescale uses tightly integrated partitioning on the time axis to deal with write performance and aging data out. Other popular TSDBs are plays on log structured merge trees etc

[+] Radim|8 years ago|reply

Perhaps not as lofty goal as time series, but a good, commercial database for vectors (aka the building representation block of modern machine learning) is long overdue too.

A few open source options exist (Spotify's Annoy, Facebook's FAISS, NMSLIB) but these are rather low-level, more hobby projects than enterprise level engines (index management, transactions, sharding…).

After building a few document similarity engines for our clients we took up the gauntlet and created ScaleText, https://scaletext.ai. It's still early days but the demand for a sane, scalable, cross-vertical and well-supported NLP technology is encouraging.

[+] amyjess|8 years ago|reply

At my employer, we've recently (as of the middle of last year) been making a considerable effort to use InfluxDB to track our KPIs. It's working out wonderfully for us, and I'm expecting it'll get used more and more as the year goes on.

What really floors me about Influx is how fast it is. A query that used to take hours in Oracle takes seconds in Influx. And the influx query is readable: rolling data up to various intervals produced nightmare queries in Oracle but is short and sweet in Influx. Want to take data gathered every 5 minutes and give a daily average? Yeah, good luck with that in Oracle. It's doable, sure, but I've seen the code. And the speed. Both are ugly.

The only problem I have with InfluxDB is it's still a little immature, and the tooling isn't where it should be. This isn't entirely Influx's fault; most third parties aren't aware it exists or don't care. Our reporting team uses Crystal Reports, which can't talk to InfluxDB. So I end up having to write a Python script that runs in cron every night to perform InfluxDB queries for the previous day's data doing all the rolling and average/min/max calculations and then inserting the results into Oracle, just so our reporting team can get to the data. For some KPIs I'm working on right now, we decided to not go through the reporting team, and I'm writing a webapp in Python/Bottle to display the report, and we're probably going to augment that with graphs from Grafana.

Grafana is beautiful, by the way.

[+] alistairbayley|8 years ago|reply

How big is your database, and how long does it take to restart? We have what I would describe as a fairly small influx database, and it takes a long time to restart (20-30 mins). And the time to restart seems to be growing linearly with the db size. Not cool if you do regular server patching with reboots.

We also had a problem where user queries that return too many columns cause the DB server process to OOM. And then it restarts, so another 20 mins of downtime. Also not cool.

We liked the tagging and rollup features, and automatic retention management, but those first 2 problems really turned us off.

[+] Krunkel|8 years ago|reply

I can attest to how great InfluxDB is. Have used it in a few applications that rely on time based data and I am constantly blown away by how fast the query times are.

[+] scrollaway|8 years ago|reply

I think you're saying a lot more about Oracle than about Influx :)

I also am using Influx at my company and the experience is not that great, mostly due to it being immature yes. For example, I currently have runaway disk usage and I have no way to know what table is using it. So I have to choose between losing data or continuously increasing storage.

[+] tzakrajs|8 years ago|reply

Apache Hive is also a great piece of software that I would recommend for these use cases.

[+] olympus|8 years ago|reply

The article recognizes that several time series databases already exist. They also say, "we aren't trying to compete against kdb+." They explain how they can handle time series data better than NoSQL databases that aren't time series focused.

But what are they doing better than the existing time series databases? Surely they must have some advantage or they wouldn't have raised 16.1 million dollars.

[+] busterarm|8 years ago|reply

Every time I read about some new solution to storing time series data, I always feel like I must be doing something wrong, but I've run into _zero_ problems yet.

Every time I have to store time series data, I never really need ACID transactions. I definitely don't ever need to do updates or upserts. It's always write-once-read-many. ElasticSearch has always been the obvious choice to me and it has worked extremely well. Information retrieval is incredibly robust and for times that I'm worried about consistency, I use Postgres's JSON capabilities and write first to there. You can have your application sanity-check between the two if you're worried about ElasticSearch not receiving/losing data.

I find it really hard to beat.

[+] akulkarni|8 years ago|reply

A few reasons why you may want to use TimescaleDB vs other time series DBs:

1. For some developers, just having a SQL interface to time-series data (while maintaining insert/query performance at scale) is good enough reason to use TimescaleDB. For example, when trying to express complex queries, or when trying to connect to SQL-based visualization tools (e.g., Tableau), or anything else in the PostgreSQL ecosystem.

2 For others, the difference is being able to fully utilize JOINs, adding context to time-series data via a relational table. Other time-series DBs would require you to denormalize your data, which adds unnecessary data bloat (if your metadata doesn't change too often), and data rigidity (when you do need to change your metadata).

To quote a user in our Slack Support forums[1]: "Retroactively evolving the data model is a huge pain, and tagging “everything” is often not viable. Keeping series metadata in a separate table will allow you to do all kinds of slicing and dicing in your queries while keeping it sane to modify the model of those metadata as you gain more knowledge about the product needs. At least that is something we have been successful with"

3. For others, it's some of the other benefits of a relational database: e.g., arbitrary secondary indexes.

One thing to think about with NoSQL systems (including every other time-series db) is that their indexing is much more limited (eg, often no numerics) or you need to be very careful about what you collect, as costs grow with cross-product of cardinality of your label sets (strings). We have heard from multiple people that they explicitly didn't collect labels they actually wanted when using [another time series db], because it would have blown up memory requirements and basically OOM'd DB. (Timescale doesn't have this issue with index sizes in memory, and you can create or drop indexes at any time.) You often hear about this as the "cardinality problem", which TimescaleDB does not have.

4. There's also the benefit of being able to inherit the 20+ years of work of reliability on Postgres.

There was no database that did all of this when we decided to build Timescale. (If one did exist, we would have used it).

You can read more about that story here: https://blog.timescale.com/when-boring-is-awesome-building-a...

And for more on our technical approach: https://blog.timescale.com/time-series-data-why-and-how-to-u...

[1] http://slack-login.timescale.com/

[+] arbesfeld|8 years ago|reply

We use Cassandra extensively at https://logrocket.com. How does performance compare vs Cassandra when you don't need the semantics or transactions of SQL? I'm surprised the article only provides benchmarks against PostgreSQL.

[+] RobAtticus|8 years ago|reply

That is currently next in our pipeline for a benchmark blog post. Early results look good on both the read and write side in terms of raw performance, with the benefit of more complex queries being more easily expressable.

[+] tw1010|8 years ago|reply

I have a hard time shaking the feeling that database innovation is being forced not by necessity, but by a culture of blind relentless change.

[+] codingdave|8 years ago|reply

OK, but is that bad?

It might be bad for the specific organization pushing their new solutions, but for the software industry as a whole, it seems like the good side of "throw spaghetti against the wall and see what sticks". Many ideas will fail, but the good ones will stick around. So I'd say by all means, let people feel free to innovate. And those of us who consume the innovations just need to remember the mantra of "leading edge, not bleeding edge."

[+] akulkarni|8 years ago|reply

It (and tech innovation in general) is also driven by people scratching their own itch. That's where we started. And we actually didn't start as a DB company, but as an IoT company storing lots of sensor data. We needed a certain kind of time-series database, tried several options, but nothing worked. So we built our own, and then realized other people needed it as well.

And we picked Postgres as our starting point exactly because it wasn't new and shiny but boring and it worked.

More on this here [1] if you're interested.

[1] https://blog.timescale.com/when-boring-is-awesome-building-a...

[+] cevian|8 years ago|reply

I have to say that a sentiment similar to this is why we built Timescale as a Postgres extension and not started a new database from scratch. We didn't want a new shiny thing just for kicks but rather a product focused on solving a particular problem.

[+] postwait|8 years ago|reply

Relentless change is most likely the motivator for these innovations. However, I don't think it is needless innovation or innovation for the same of innovation. Personal opinion that there isn't much innovation in what this article covers, but still it is happening elsewhere.

Changes in telemetry production (IoT and others) have fundamentally changed the requirements... millions (and often billions) of data points points per-second are now happening. This is being driven by RSM and IoT. RSM is alluded to in my ACM Queue article: https://queue.acm.org/detail.cfm?id=3178371

[+] analognoise|8 years ago|reply

Silicon Valley on reinventing the wheel: "Ours is rounder!"

[+] marknadal|8 years ago|reply

We had major success by simply batching incoming writes into ~15 second chunks and writing that as a file to S3 and an index that tracks how the files are split / chunked to make read performance decent.

This alone gave us an insanely scalable (load tested against 100GB/day ~100M records/day) for a grand cheap total cost of $10/day for everything, server, disk, and S3. https://youtu.be/x_WqBuEA7s8

Works great for timeseries data, super scalable, no devops or managing database servers, simple, and works.

I believe the Discord guys also did something similar and had written some good engineering articles on it, give it a Google as well.

[+] postwait|8 years ago|reply

Storing and retrieving data has never been all that hard. The challenge is having user-interactive performance on complex queries against the data. Comparing and correlating and deriving and integrating and ... (lots of other analysis). For many "scaled" systems, 100M records/minute isn't uncommon... and while that's very likely possible with your design the question of economic feasibility enters. Solving these problems at scale with good economics is the playground of TSDB vendors today.

[+] jnordwick|8 years ago|reply

I think I say this about every time, because I see it as the major flaw in almost every approach to this problem. I say this from the perspective of having used a wide range of TS databases, but mostly KDB, Oracle, and prop solutions with Cassandra and Mongo products thrown in.

There are two big problems with relational SQL databases: the storage and the query language. The two aren't separate concerns. While projects like this might fix the storage issue, they will never be simple or fast without also changing the query language too.

SQL doesn't fit the column/array paradigm well. There are extensions that attempt to close the gap, but they are slow and complex compared to just fixing the query language. Operations like windowed queries and aggregates can be painfully slow because of the mismatch between SQL and the ordered data.

I would absolutely love to work on a product that attempts to fix the query issue at the same time as working on the storage and interface issues. You don't need an obscure APL-like language, just something that promotes using the properties inherent in time series to the front.

Also, it is a bit weird to say you are not trying to compete with the likes of KDB then go on to say how good you are at its core competencies. As always, eagerly awaiting the TPC-D benchmarks and comparisons to the heavy hitting time series dbs out there, not the easily crushed FOSS products (except for maybe AeroSpike that I wish I had more experience with).

[+] cobbzilla|8 years ago|reply

PostgreSQL with jsonb columns can be quite powerful, best of both worlds; but yes the syntax for querying within json is a bit arcane.

[+] dustingetz|8 years ago|reply

How about one made by Rich Hickey — https://www.datomic.com/

[+] dragandj|8 years ago|reply

Sadly, Datomic lacks two of (arguably) very important features in this space:

1. It seems to not be optimized for speed (but it's difficult to say since the license forbids publishing benchmarks).

2. It's not open source.

[+] witwughu391|8 years ago|reply

is Datomic good for timeseries ? AFAIWA it's good for rare-write often-read applications

[+] cbcoutinho|8 years ago|reply

We use a proprietary database system that uses a 'flat-file' format (no idea what that means) and is primarily time series based due to the fact that we're logging sensor data. Since it's primarily a backend, you can't access it outside of their proprietary gui. It's also accessible as a linked server via SQL Server, but this is slow as hell for non-trivial queries. We use it within a power plant setting where we heavily prioritize db-writes, which this software is apparently very good in, and db-reads are less of a focus.

I'm not sure if moving to another db system would be beneficial, but I would be very grateful if accessiblity could be much less of a hack

[+] dx034|8 years ago|reply

What I often miss with these kinds of databases is compression capabilities. I currently use my own delta-of-delta encoding for time series data (stored in postgres) as that gave me much better compression than any of the time series databases I tested.

It's not that storage isn't available, it's easy to store terrabytes of data. But really fast storage is still expensive and rare. Being able to compress time series data to 3-5% of the raw value allows keeping most in memory, speeding up reading writing and analysis.

InfluxDB is quite good at that but unfortunately, they struggle storing many small datasets.

[+] augustl|8 years ago|reply

Isn't it a bit weird to launch a "never delete anything" store today and not mention anything about European GDPR requirements?

How would you go about deleting a users data upon request?

[+] manigandham|8 years ago|reply

You delete it. It's still a database. What's being done here is just smart partition management as an extension to Postgres, similar to CitusDB.

You can also use any column-oriented relational database with a time-based partition key and do the same thing.

[+] chrismccabe|8 years ago|reply

What is the benefits of timescaledb or influxdb? surely the enterprise pricing is going to be similar to KDB which has been around for years and a very mature product

[+] RegBarclay|8 years ago|reply

How is TimescaleDB different from OSIsoft PI?

[+] apohn|8 years ago|reply

Not just OSISoft PI, but there are many other historians for time series data.

https://en.wikipedia.org/wiki/Operational_historian

At least in my experience, historians are rarely recommend for complex or ad-hoc queries. Typically you just pull the data (by tags) into another application and do your data processing there. It looks like in timeseriesdb lets you execute complex queries in the database. Historians typically only let you fetch data by tag and you need a metadata (e.g. asset management) framework on top to organize the data (e.g. give me avg temp every 5 minutes by sensor). It looks like with timeseriesdb you can have strings/text as fields within the timeseries table, which removes the need (to some degree) to join the data with a seperate metadata database.

I've also never heard of anybody using these commercial historians for time series data you'd see from non industrial processes (e.g. stock data, price tracking, GPS location of people or moving assets, time between clicks on a website,etc).

All that being said, OSISoft PI and AF have their warts, but OSISoft has been around for a while and PI has been battle tested in various industries (e.g. Oil & Gas, Manufacturing). It's closed source and you have to pay for it, so it's probably not attractive to startups and smaller companies. But it does come with a support organization if you need it and can pay for it. And IME data retrieval from PI is extremely performant!

[+] anonu|8 years ago|reply

I almost thought this was going to be an ad for KX as part of the subject is their slogan...

Glad to see open source tools in this space gaining traction.

148 comments