Efficient Tabular Storage

[+] SloopJon|10 years ago|reply

It looks like the NYCTaxi dataset is here:

http://www.andresmh.com/nyctaxitrips/

Some background on this data:

http://chriswhong.com/data-visualization/taxitechblog1/

And data for 2014 directly from the city:

https://data.cityofnewyork.us/view/gn7m-em8n

[+] TheGuyWhoCodes|10 years ago|reply

Vertica has all those performance enhancements. It's a great DB, but I can't recommend it because of the pricing :(

[+] beagle3|10 years ago|reply

kdb+ answers the same description.

And it's a 300KB executable with no dependencies (other than glibc/MSVCRT).

[+] owlish|10 years ago|reply

How do databases like MySQL store data efficiently for querying? It seems like something like protobuf would do well here, though you'd need to generate code for each dataset.

[+] kragen|10 years ago|reply

Typically they use row-oriented binary storage, optionally with individual columns or subsets of columns duplicated into indices for fast querying. Have you tried protobufs? How many hundreds of megs per second do you get? I think it is remarkably slow on the scales we're talking about here.
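A rough sketch of what "row-oriented binary storage plus an index" can look like, in case it helps. This is not how MySQL actually lays out its pages; the column layout (`trip_id`, `passengers`, `fare`) and the helper names `pack_rows`, `build_index`, and `lookup` are made up for illustration. Rows are packed back to back at a fixed width, and one column is duplicated into a sorted list so a lookup can binary-search instead of scanning.

```python
import bisect
import struct

# Hypothetical fixed-width row: trip_id (uint32), passengers (uint8), fare (float64).
# "<" means little-endian with no padding, so every row is exactly 13 bytes.
ROW = struct.Struct("<IBd")

def pack_rows(rows):
    """Serialize rows back to back into one bytes object (row-oriented storage)."""
    return b"".join(ROW.pack(*r) for r in rows)

def build_index(table, column):
    """Duplicate one column into a sorted list of (value, row_number) pairs."""
    n = len(table) // ROW.size
    return sorted((ROW.unpack_from(table, i * ROW.size)[column], i) for i in range(n))

def lookup(table, index, value):
    """Binary-search the index, then decode only the matching rows."""
    start = bisect.bisect_left(index, (value, -1))
    out = []
    for v, i in index[start:]:
        if v != value:
            break
        out.append(ROW.unpack_from(table, i * ROW.size))
    return out

rows = [(1, 2, 9.5), (2, 1, 4.0), (3, 2, 12.25), (4, 5, 30.0)]
table = pack_rows(rows)
by_passengers = build_index(table, column=1)
print(lookup(table, by_passengers, 2))
```

Because every row has the same width, row `i` starts at byte `i * ROW.size`, which is why the index only needs to store row numbers. Real engines add pages, variable-length fields, and B-trees on top of this idea.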

[+] brudgers|10 years ago|reply

Traditional DBMSs get performance by optimizing storage down to the physical layout of the data on the hardware. So MySQL makes a lot of assumptions based on the mechanics of spinning disks, with buffers tailored to their physics. Database Systems: The Complete Book is a good text on the subject, and the second half is all about the hardware and software used in implementing traditional systems.

9 comments