nevi-me | 3 months ago

The main reason clusters still make sense is that you'll have a bunch of people regularly accessing subsets of much larger data, or competing processes that need their output ready at around the same time. You distribute not only compute but also I/O, which, as others here point out, likely dominates the runtime of the benchmarks.

Beyond Spark (one shouldn't really be using vanilla Spark anyway; see Apache Comet or Databricks Photon), distributing my compute makes sense because if a job takes an hour to run (ignoring overnight jobs), there will be a bunch of people waiting on that data for an hour.

If I run a 6-node cluster that makes the data available in 10 minutes, then everyone waiting on that job saves 50 minutes. And if I have 10 of those jobs that need to run at the same time, then I need a burst of compute to handle them all.

That 6-node cluster might not make sense on-prem unless I can use the compute for something else, which is where pay-as-you-go (PAYG) pricing on a cloud vendor makes sense.
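The arithmetic above can be sketched in a few lines. The numbers are the comment's assumed figures (1 hour on one node vs. 10 minutes on six, with some people blocked on the output), and the near-linear speedup is itself an assumption that tends to hold for I/O-bound jobs:

```python
# Back-of-envelope sketch of the trade-off described above.
# All figures are the assumed numbers from the comment, not measurements.
single_node_minutes = 60   # job runtime on one node
cluster_nodes = 6
cluster_minutes = 10       # assumes near-linear speedup across 6 nodes
waiters = 10               # people blocked on the job's output

# Node-minutes of compute spent are the same either way under linear scaling
single_compute = 1 * single_node_minutes            # 60 node-minutes
cluster_compute = cluster_nodes * cluster_minutes   # 60 node-minutes

# Human waiting time saved across everyone blocked on the job
saved_minutes = waiters * (single_node_minutes - cluster_minutes)

print(f"compute spent: {single_compute} vs {cluster_compute} node-minutes")
print(f"waiting time saved: {saved_minutes} person-minutes")
```

The point of the sketch: the compute bill is roughly a wash, so the cluster is paying for reduced human waiting time, and PAYG pricing lets you buy that burst only when the 10 concurrent jobs actually run.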
