top | item 7147153

With Amazon Redshift SSD, querying a TB of data took less than 10 seconds

73 points | fujibee | 12 years ago | flydata.com | reply

16 comments

[+] jandrewrogers|12 years ago|reply
These numbers are not that surprising for an OLAP cluster. Even though Redshift was architected to run on spinning disks, SSDs will almost always improve performance.

On the other hand, the load performance is quite poor. On the 12x dw2.large hardware, a good clustered analytical database engine should be able to easily load 1.2TB in less than 15 minutes while the database tables are online and being queried. That it took well over an hour, and with a very simple data model at that, would argue against it being good for "real-time" even with SSDs. (This is not a surprising result though; Redshift is just a clustered PostgreSQL variant, which does not have the best internals for real-time.)
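A quick back-of-envelope check of that load-rate claim. The 1.2 TB and 15-minute figures come from the comment; the per-node split is just straight division across the 12 nodes:

```python
# Implied load throughput if 1.2 TB loads in 15 minutes on 12 nodes
# (figures taken from the comment above; decimal TB assumed).
total_bytes = 1.2e12
seconds = 15 * 60
nodes = 12

cluster_rate = total_bytes / seconds      # bytes/s for the whole cluster
per_node_rate = cluster_rate / nodes      # bytes/s each node must sustain

print(f"cluster: {cluster_rate / 1e9:.2f} GB/s, "
      f"per node: {per_node_rate / 1e6:.0f} MB/s")
# cluster: 1.33 GB/s, per node: 111 MB/s
```

About 111 MB/s per node, i.e. roughly one SSD's worth of sequential write bandwidth each, which is why the bar sounds plausible for a well-built engine.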

[+] kevindication|12 years ago|reply
It's not a Postgres variant at all. Postgres is emulated as an interface to the columnar ParAccel database underneath. ParAccel does neat things (compiles your SQL into a program that it runs to answer the question, for instance) and really rips if you can order your data on good keys up front (and then use those keys in your query, of course).

Source: I helped build a very high speed network data analytics tool on top of ParAccel (before it was bought by Amazon and rolled into Redshift).

[+] fear91|12 years ago|reply
An SSD saved my life when I had to query a 300 GB MySQL table that couldn't fit in my RAM. Since the data was organized by the primary key (which the SELECT queries hit in random order), both reads and writes came from random locations and the whole process became IOPS-bound (an ordinary HDD can serve only around 75-150 random seeks per second). So while a normal HDD can achieve good sequential read speed, it SUCKS when it comes to reading data spread randomly.

I was amazed at how much improvement I saw just by getting an SSD, and at how cheap it was compared to all the other solutions.
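The effect described above is easy to model: with a random access pattern every page fetch costs one seek, so total time is roughly page count divided by the drive's random IOPS. A rough sketch; the 16 KB page size and the IOPS figures are illustrative assumptions, not measurements from the original workload:

```python
# Why random reads over a 300 GB table are IOPS-bound, not bandwidth-bound.
# Page size and IOPS numbers below are assumptions for illustration.
table_bytes = 300e9
page_bytes = 16 * 1024            # InnoDB's default page size
pages = table_bytes / page_bytes  # each random fetch reads one page

hdd_iops = 100                    # ~75-150 random seeks/s for a spinning disk
ssd_iops = 50_000                 # a modest consumer SSD

hdd_hours = pages / hdd_iops / 3600
ssd_hours = pages / ssd_iops / 3600
print(f"random-reading every page: HDD ~{hdd_hours:.0f} h, SSD ~{ssd_hours:.1f} h")
```

Under these assumptions the HDD needs on the order of days while the SSD finishes in minutes, which matches the "saved my life" experience.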

[+] leobelle|12 years ago|reply
It's not cheap. Base price is $0.25 per hour:

http://www.wolframalpha.com/input/?i=%240.25+per+hour+for+a+...

$183 a month.
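The monthly figure is straight multiplication. A sketch assuming the common 730-hour billing month (365 × 24 / 12), which lands within rounding of the $183 quoted:

```python
# $0.25/hour over an assumed 730-hour billing month (365 * 24 / 12).
hourly_rate = 0.25
hours_per_month = 365 * 24 / 12   # 730.0
monthly = hourly_rate * hours_per_month
print(f"${monthly:.2f}/month")    # $182.50/month
```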

[+] gtaylor|12 years ago|reply
As far as the target audience for this goes, $183/month is a pittance. From their product site:

"Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools."

That to me screams "enterprise" and "big data" and all sorts of other silly buzz words. Your average startup is probably not going to need this, but their target audience may view that $183/month base price tag favorably.

[+] flavor8|12 years ago|reply
That's still a bargain compared to running your own Vertica or Greenplum cluster.
[+] userbinator|12 years ago|reply
On the other hand, you can process a lot more data in an hour, so it's fair to charge more.
[+] antonmks|12 years ago|reply
Is it possible to generate the dataset that you used? I would like to run the benchmark myself, and downloading a 1 TB file from Amazon unfortunately is not an option.
[+] CompleteMoron|12 years ago|reply
whoa! sign me up! I wanna develop something with this speed
[+] goldenkey|12 years ago|reply
Are you by chance, a complete moron? Wait a minute...