top | item 7147153

With Amazon Redshift SSD, querying a TB of data took less than 10 seconds

73 points | fujibee | 12 years ago | flydata.com | reply

16 comments

[+] jandrewrogers|12 years ago|reply
These numbers are not that surprising for an OLAP cluster. Even though Redshift was architected to run on spinning disks, SSDs will almost always improve performance.

On the other hand, the load performance is quite poor. On the 12x dw2.large hardware, a good clustered analytical database engine should be able to easily load 1.2TB in less than 15 minutes while the database tables are online and being queried. That it took well over an hour, and with a very simple data model at that, would argue against it being good for "real-time" even with SSDs. (This is not a surprising result though; Redshift is just a clustered PostgreSQL variant, which does not have the best internals for real-time.)
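A quick back-of-envelope check of that load-rate claim. The 1.2 TB and 15-minute figures come from the comment; the per-node split is just straight division across the 12 nodes:

```python
# Implied load throughput if 1.2 TB loads in 15 minutes on 12 nodes
# (figures taken from the comment above; decimal TB assumed).
total_bytes = 1.2e12
seconds = 15 * 60
nodes = 12

cluster_rate = total_bytes / seconds      # bytes/s for the whole cluster
per_node_rate = cluster_rate / nodes      # bytes/s each node must sustain

print(f"cluster: {cluster_rate / 1e9:.2f} GB/s, "
      f"per node: {per_node_rate / 1e6:.0f} MB/s")
# cluster: 1.33 GB/s, per node: 111 MB/s
```

About 111 MB/s per node, i.e. roughly one SSD's worth of sequential write bandwidth each, which is why the bar sounds plausible for a well-built engine.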

[+] kevindication|12 years ago|reply
It's not a Postgres variant at all. Postgres is emulated as an interface to the columnar ParAccel database underneath. ParAccel does neat things (compiles your SQL into a program that it runs to answer the question, for instance) and really rips if you can order your data on good keys up front (and then use those keys in your query, of course).

Source: I helped build a very high speed network data analytics tool on top of ParAccel (before it was bought by Amazon and rolled into Redshift).

[+] fear91|12 years ago|reply
An SSD saved my life when I had to query a 300 GB MySQL table that couldn't fit in my RAM. Since the data was organized by the primary key (which the SELECT queries hit in random order), both reads and writes came from random locations and the whole process became IOPS-bound (an ordinary HDD can serve only around 75-150 random seeks per second). So while a normal HDD can achieve good sequential read speed, it SUCKS when it comes to reading data spread randomly.

I was amazed at how much improvement I saw just by getting an SSD, and at how cheap it was compared to all the other solutions.
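The effect described above is easy to model: with a random access pattern every page fetch costs one seek, so total time is roughly page count divided by the drive's random IOPS. A rough sketch; the 16 KB page size and the IOPS figures are illustrative assumptions, not measurements from the original workload:

```python
# Why random reads over a 300 GB table are IOPS-bound, not bandwidth-bound.
# Page size and IOPS numbers below are assumptions for illustration.
table_bytes = 300e9
page_bytes = 16 * 1024            # InnoDB's default page size
pages = table_bytes / page_bytes  # each random fetch reads one page

hdd_iops = 100                    # ~75-150 random seeks/s for a spinning disk
ssd_iops = 50_000                 # a modest consumer SSD

hdd_hours = pages / hdd_iops / 3600
ssd_hours = pages / ssd_iops / 3600
print(f"random-reading every page: HDD ~{hdd_hours:.0f} h, SSD ~{ssd_hours:.1f} h")
```

Under these assumptions the HDD needs on the order of days while the SSD finishes in minutes, which matches the "saved my life" experience.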

[+] leobelle|12 years ago|reply
It's not cheap. Base price is $0.25 per hour:

http://www.wolframalpha.com/input/?i=%240.25+per+hour+for+a+...

$183 a month.
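The monthly figure is straight multiplication. A sketch assuming the common 730-hour billing month (365 × 24 / 12), which lands within rounding of the $183 quoted:

```python
# $0.25/hour over an assumed 730-hour billing month (365 * 24 / 12).
hourly_rate = 0.25
hours_per_month = 365 * 24 / 12   # 730.0
monthly = hourly_rate * hours_per_month
print(f"${monthly:.2f}/month")    # $182.50/month
```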

[+] gtaylor|12 years ago|reply
As far as the target audience for this goes, $183/month is a pittance. From their product site:

"Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools."

That to me screams "enterprise" and "big data" and all sorts of other silly buzz words. Your average startup is probably not going to need this, but their target audience may view that $183/month base price tag favorably.

[+] flavor8|12 years ago|reply
That's still a bargain compared to running your own Vertica or Greenplum cluster.
[+] userbinator|12 years ago|reply
On the other hand, you can process a lot more data in an hour, so it's fair to charge more.
[+] antonmks|12 years ago|reply
Is it possible to generate the dataset that you used? I would like to run the benchmark myself, and downloading a 1 TB file from Amazon unfortunately is not an option.
[+] CompleteMoron|12 years ago|reply
whoa! sign me up! I wanna develop something with this speed
[+] goldenkey|12 years ago|reply
Are you by chance, a complete moron? Wait a minute...