Non sequitur hint: if you are storing data like this yourself, consider storing the sum of the squares of the samples along with their sum and count in each aggregate (count initially 1 for a single sample). This lets you compute the standard deviation for display, and it has the nice property that after merging aggregates you can still compute the standard deviation of the combined data.
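A minimal sketch of that trick (the function names are mine, not from the comment): each aggregate is a `(count, sum, sum_of_squares)` triple, merging is just component-wise addition, and the population standard deviation falls out of the sums.

```python
import math

def make_agg(x):
    # aggregate of a single sample: (count, sum, sum of squares)
    return (1, x, x * x)

def merge(a, b):
    # merging two aggregates is component-wise addition,
    # which is why this survives downsampling/rollups
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def stddev(agg):
    n, s, sq = agg
    mean = s / n
    # population stddev: sqrt(E[x^2] - E[x]^2); clamp for rounding error
    return math.sqrt(max(sq / n - mean * mean, 0.0))

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
total = make_agg(samples[0])
for x in samples[1:]:
    total = merge(total, make_agg(x))
print(stddev(total))  # 2.0 for this data set
```

Note that the naive `E[x^2] - E[x]^2` form can lose precision when the mean is large relative to the spread; for long-lived aggregates a Welford-style update is numerically safer.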
RRDtool is pretty nice, but it has a fair number of scalability issues too:
* Once you create an RRA (archive file) you can't modify it to add or remove metrics, or change their properties. This makes them relatively inflexible.
* Updating RRAs is I/O heavy. Every time an update comes in, the OS must read, modify and write a page.
* rrdcached mitigates this somewhat by deferring flushes, but there are diminishing returns (eventually the rate of incoming writes will push the cache-flush and filesystem-metadata-update rate past the available IOPS), and you risk data loss in the event of a power outage or if the OOM killer kills the process.
Time-series data access patterns tend to be write-heavy. Storing first in an append-only log is a big win here; Cassandra and MySQL are both good choices, though you do have to think about the schemata first. And disk is so cheap now that expiration can be an afterthought.
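A sketch of the append-only-log idea (hypothetical helper, not from the thread): incoming samples are only ever appended, never updated in place, so every write is sequential; expiration, rollups, and indexing can all happen later in batch.

```python
import json
import os
import tempfile
import time

def append_sample(path, metric, value, ts=None):
    # hypothetical helper: one JSON record per line, append-only,
    # so each incoming update is a single sequential write
    record = {"ts": time.time() if ts is None else ts,
              "metric": metric, "value": value}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "metrics.log")
append_sample(log_path, "cpu.user", 0.42, ts=1300000000)
append_sample(log_path, "cpu.user", 0.57, ts=1300000010)

with open(log_path) as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 2
```

The same write pattern is what a Cassandra commit log or a MySQL table with an auto-increment key and no in-place updates gives you, just with real durability and query machinery on top.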
To handle very high throughput, storing RRD files on a ramdisk works surprisingly well, if you can afford the cost and the loss of a few seconds of data - which most of the time you can.
A simple tar + gzip is all you need to flush to disk, at the frequency of your choice. It turns out RRD write operations are safe enough to do this without corruption. And the I/O cost is minimal compared to rrdcached: RRD data compresses extremely well.
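A sketch of that flush step (hypothetical function name; the fake `.rrd` files below just stand in for a ramdisk directory): snapshot the whole directory into one gzipped tarball, which you'd run from cron or a timer at whatever loss window you can tolerate.

```python
import os
import tarfile
import tempfile
import time

def snapshot_rrds(src_dir, dest_dir):
    # hypothetical flush: one gzipped tarball of the whole RRD directory;
    # RRD files are mostly repetitive binary, so they compress very well
    name = os.path.join(dest_dir, "rrds-%d.tar.gz" % int(time.time()))
    with tarfile.open(name, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return name

# demo: fake .rrd files standing in for the ramdisk contents
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, "metric%d.rrd" % i), "wb") as f:
        f.write(b"\x00" * 4096)

archive = snapshot_rrds(src, tempfile.mkdtemp())
with tarfile.open(archive) as tar:
    print(len([m for m in tar.getmembers() if m.isfile()]))  # 3
```

In practice you would restore the latest tarball back onto the ramdisk at boot, accepting the loss of whatever was written since the last snapshot.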
You could do the same thing with MongoDB and capped collections, although aging the data the way RRD does would require MongoDB to provide a callback for when the capped collection is full.
That was one of the clearest explanations of the strengths of RRDtool that I've read. You can spend a lot of time massaging a more general database to store time series data, or you can use RRDtool.
Pity there's no mention that RRDtool has been around for over a decade and has been remarkably stable. It's worth remembering that old tools aren't necessarily obsolete.