top | item 30598438

riknox | 4 years ago

I haven't seen Storm used anywhere sane for at least a few years now, and from a glance at job postings that seems unlikely to change. Spark, Kafka Streams, etc. are definitely used in modern data platforms in my experience.

I think we're seeing a big shift with Hadoop-like workloads being moved onto cloud providers, so BigQuery, Amazon EMR etc.

bsenftner | 4 years ago

I'm curious what constitutes "big data" anymore. In an intermediate machine learning course, we train on nearly a petabyte of data using Google Colab and Jupyter Notebooks. Nobody discusses the data needing any special treatment due to its size... wouldn't 95% of a petabyte be "big data"?

happymellon | 4 years ago

Big data is a shifting concept as computers gain more storage and faster commodity processors.

My general rule of thumb is whether it's too big to fit on my laptop, so greater than a couple of TB.

pilotneko | 4 years ago

What course are you taking? ImageNet is only 150 GB, and Common Crawl is only 320 TB.

Big data is a moving target, but I'm comfortable defining it as data too large to fit in memory. Obviously, you can always get a bigger node; my rule of thumb is that if you need generators, you are working with big data.
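The "generators" rule of thumb above can be sketched in a few lines of plain Python (the file name `huge.csv` is a hypothetical example, assumed to hold one numeric value per line): aggregate records lazily instead of materializing the whole dataset in memory.

```python
def running_sum(records):
    """Consume any iterable of numeric strings one record at a time."""
    total = 0.0
    for rec in records:
        total += float(rec)
    return total

# Small data: a list comprehension loads every value into memory at once.
# values = [float(line) for line in open("huge.csv")]  # fails once data > RAM

# Big data: a file object is already a lazy iterator over its lines, so the
# same aggregation runs in roughly constant memory regardless of file size.
def stream_sum(path):
    with open(path) as f:
        return running_sum(f)
```

Once the per-record processing is written against an iterator like this, swapping the in-memory list for a streamed source is the point where, by this definition, the data has become "big".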