top | item 15438287


ozataman | 8 years ago

There is a different side to the cost comparison that isn't captured by the description here. If your use case needs a lot of stored data but not a matching degree of peak CPU (even if your query load is otherwise fairly consistent), Redshift becomes really expensive really fast and feels like a waste. BigQuery, meanwhile, keeps costs (almost) linear in your actual query usage, with very low storage costs.

For example, you may need to provision a 20-node cluster only because you need 10+ terabytes of storage across several datasets you must keep "hot" for sporadic use throughout the day or week, without needing anywhere near that computational capacity around the clock. Unlike BigQuery, Redshift doesn't separate storage from compute. Redshift also doesn't offer a practically acceptable way to scale up and down: resizes at that scale can take up to a day, deleting and restoring datasets creates lots of administrative overhead, and even capacity tuning between multiple users is a frequent concern.
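To make the cost asymmetry concrete, here's a back-of-envelope sketch of the scenario above. All prices, node counts, and scan volumes are illustrative assumptions for the sake of the arithmetic, not quoted list prices; check current pricing before relying on any of it.

```python
# Back-of-envelope comparison of the storage-heavy, low-query scenario.
# All figures below are assumptions for illustration only.

HOURS_PER_MONTH = 730

# Redshift: the cluster is billed per node-hour whether or not queries run.
# Assumed price: ~$4.80/hr for a large compute node.
redshift_nodes = 20
redshift_node_hourly = 4.80
redshift_monthly = redshift_nodes * redshift_node_hourly * HOURS_PER_MONTH

# BigQuery: storage and query scans are billed separately.
# Assumed prices: $0.02 per GB-month stored, $5 per TB scanned.
stored_tb = 10
scanned_tb_per_month = 50  # sporadic use; most queries hit small tables
bigquery_monthly = stored_tb * 1024 * 0.02 + scanned_tb_per_month * 5.0

print(f"Redshift (always-on cluster): ${redshift_monthly:,.0f}/month")
print(f"BigQuery (storage + scans):   ${bigquery_monthly:,.0f}/month")
```

Under these assumptions the always-on cluster costs two orders of magnitude more, and the gap is driven entirely by compute you provisioned for storage's sake.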

Making matters worse, it is common for a small number of large "source of truth" tables to be the ones you need to keep around to re-populate various intermediate tables, even if they themselves aren't queried that often. In Redshift, you end up provisioning a large cluster just to keep them around, even though 99% of your queries hit one of the smaller tables.

That said, I haven't tried the relatively new "query data on S3" Redshift functionality. It doesn't seem quite equivalent to what BigQuery does, but it may alleviate this issue.

Side note: I have been a huge Redshift fan pretty much since its release on AWS. I do, however, think it is starting to lose its edge and show its age amid recent advances in the space; I have been increasingly impressed with the ease of use (including intra-team and even inter-team collaboration) in the BigQuery camp.



joeharris76 | 8 years ago

Redshift offers hard-disk-based nodes with huge amounts of storage at low cost for precisely the use case you mention. Their performance is actually very good, especially with a little effort put into choosing sort keys and dist keys.
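For readers unfamiliar with that tuning, here's a minimal sketch of what choosing sort and dist keys looks like in Redshift DDL. The table and column names are hypothetical, invented for illustration; the `DISTKEY`/`SORTKEY` table attributes themselves are standard Redshift syntax.

```python
# Hypothetical Redshift DDL illustrating dist-key / sort-key tuning.
# Table and column names are made up for illustration.
ddl = """
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_time TIMESTAMP,
    payload    VARCHAR(4096)
)
DISTKEY (user_id)       -- co-locate each user's rows on one node for joins
SORTKEY (event_time);   -- time-range scans can skip out-of-range blocks
""".strip()

print(ddl)
```

The dist key controls how rows are spread across nodes (good choices avoid network shuffles on joins), while the sort key lets range-restricted scans skip blocks entirely.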

Spectrum extends that even further, allowing you to keep recent and reference data stored locally while archival data sits in S3, available for query at any time.
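A minimal sketch of that split, assuming hypothetical names throughout (the schema, table, bucket, and IAM role ARN are all invented for illustration; the `CREATE EXTERNAL SCHEMA` / `CREATE EXTERNAL TABLE` statements are standard Spectrum syntax):

```python
# Hypothetical Spectrum setup: archival data stays in S3 as an external
# table, queryable alongside local Redshift tables. All names are made up.
ddl = """
CREATE EXTERNAL SCHEMA s3_archive
FROM DATA CATALOG DATABASE 'archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

CREATE EXTERNAL TABLE s3_archive.events_2016 (
    event_id   BIGINT,
    user_id    BIGINT,
    event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-archive-bucket/events/2016/';
""".strip()

print(ddl)
```

Queries can then join `s3_archive.events_2016` with local tables directly, so the big "source of truth" data no longer forces the cluster size.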