top | item 29233100


drej | 4 years ago

What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).

While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller), and performance predictably tanked compared to our other solutions (Postgres, Redshift, pandas, etc.), never mind the immense cost of migrating everything and training everyone up.
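For a sense of scale, "tens of gigabytes or smaller" is squarely in-memory territory for tools like pandas. A minimal sketch (the column names and values here are invented for illustration) of the kind of aggregation that runs locally in seconds, with no warehouse queueing or shuffling:

```python
import pandas as pd

# Hypothetical small dataset: a few million rows of this shape
# fits comfortably in RAM on a laptop.
df = pd.DataFrame({
    "region": ["eu", "us", "eu", "apac"],
    "revenue": [120.0, 340.5, 99.9, 210.0],
})

# The same GROUP BY a distributed warehouse would plan and shuffle for.
totals = df.groupby("region")["revenue"].sum()
print(totals.to_dict())
```

At this scale the fixed per-query overhead of a distributed engine dominates, which is the performance gap described above.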

Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).



sanketsarang | 4 years ago

I worked on building a database myself, and I must say that querying 100 TB fast, let alone storing 100 TB of data, is a real problem. A few companies have no choice but to use a DB that works at that scale. If you have small data, you have a lot of options; if your data is large, you have very few. So it makes sense to compete on how fast a DB can query 100 TB of data while being slow on just 10 GB. Some databases are designed only for large data and should not be used if your data is small.

doppelganger1 | 4 years ago

The larger your data, the more building and maintaining indexes hurts you. This is why these systems do much better on large datasets than on small ones. It's all about trade-offs.

To overcome this, they make use of caching: if the small data is frequently accessed, performance is generally pretty good and acceptable for most use cases.
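The caching idea can be sketched as a memoized query layer (the function name and queries are invented; real engines cache materialized result sets rather than Python calls, but the effect on repeated small queries is analogous):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def run_query(sql: str) -> tuple:
    # Stand-in for an expensive full scan over a large table.
    # Only the first call for a given query string pays this cost.
    return ("row1", "row2")

run_query("SELECT * FROM small_table")  # cold: would hit storage
run_query("SELECT * FROM small_table")  # warm: served from cache
```

This is why frequently repeated queries over small, hot data can still feel fast on these platforms even though a cold query is slow.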

tshanmu | 4 years ago

Resume driven development FTW!

StephenJGL | 4 years ago

Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte-scale datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL, though.

autokad | 4 years ago

It's my experience that if it's just tens of GBs, you should use 'normal' solutions; if it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, not Snowflake.

jeltz | 4 years ago

PostgreSQL and MySQL can handle a few TB just fine. It is when you reach over 10TB that you need something else.