hobos_delight|2 years ago

…yes - processing 3.2G of data will be quicker on a single machine. This is not the scale of Hadoop or any other distributed compute platform.

The reason we use these is for when we have a data set _larger_ than what can be done on a single machine.
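The single-machine approach being alluded to is usually just a Unix pipeline. A minimal sketch in that spirit (the file name and contents here are made up for illustration, not taken from the article):

```shell
# Create a tiny stand-in input file; a real job would read e.g. *.pgn files.
printf 'Result "1-0"\nResult "0-1"\nResult "1-0"\n' > games.txt

# Count occurrences of each result line with plain Unix tools,
# most frequent first.
grep 'Result' games.txt | sort | uniq -c | sort -rn
```

For multi-gigabyte inputs split across files, something like `xargs -P` or GNU `parallel` can fan the `grep` stage out across cores, which is typically where the single-machine speedup comes from.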

ralph84|2 years ago

Most people who wasted $millions setting up Hadoop didn’t have data sets larger than could fit on a single machine.

faet|2 years ago

I've worked places where it would be 1000x harder getting a spare laptop from the IT closet to run some processing than it would be to spend $50k-100k at Azure.

hobos_delight|2 years ago

I completely agree. I love the tech and have spent a lot of time in it - but come on people, let’s use the right tool for the right job!

saberience|2 years ago

Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?

I’ve heard this anecdote on HN before, but without ever seeing actual evidence that it happened, it reads like an old wives’ tale and I’m not sure I believe it.

I’ve worked on a Hadoop cluster, and setting it up and running it takes quite serious technical skills and experience. Those same skills and experience would mean the team wouldn’t be doing it unless they needed it.

Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60GB of data? Does that make any sense at all?

hiAndrewQuinn|2 years ago

Moore's law and its analogues make this harder to back-predict than one might think, though. A decade ago, computers had only about an eighth (rough upper bound) of the resources modern machines tend to have at similar price points.

OskarS|2 years ago

This is exactly the point of the article. From the conclusion:

> Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.

the8472|2 years ago

What can be done on a single machine grows with time though. You can have terabytes of ram and petabytes of flash in a single machine now.

MrBuddyCasino|2 years ago

This will not stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their "Data Lake" via Spark.

And this isn’t even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives loss of institutional knowledge three layoffs down the line.

hobos_delight|2 years ago

Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume that they will need to scale to millions of QPS; most of the time this is incorrect, and when it is not, the requirements have changed and it needs to be rebuilt anyway.

dagw|2 years ago

The long-term maintainability is an important point that most comments here ignore. If you only need to run the command once or twice every now and then in an ad hoc way, then sure, hack together a command line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and have it work long after Jeff quits.

Sometimes the killer feature of that data analytics pipeline isn't scalability, but robustness, reproducibility, and consistency.