top | item 25350474

(no title)

frankmcsherry | 5 years ago

There are a few differences, the main one between Spark and timely dataflow is that TD operators can be stateful, and so can respond to new rounds of input data in time proportional to the new input data, rather than that plus accumulated state.

So, streaming one new record in and seeing how this changes the results of a multi-way join with many other large relations can happen in milliseconds in TD, vs batch systems which will re-read the large inputs as well.

This isn't a fundamentally new difference; Flink had this difference from Spark as far back as 2014. There are other differences between Flink and TD that have to do with state sharing and iteration, but I'd crack open the papers and check out the obligatory "related work" sections each should have.

For example, here's the first para of the Related Work section from the Naiad paper:

> Dataflow Recent systems such as CIEL [30], Spark [42], Spark Streaming [43], and Optimus [19] extend acyclic batch dataflow [15, 18] to allow dynamic modification of the dataflow graph, and thus support iteration and incremental computation without adding cycles to the dataflow. By adopting a batch-computation model, these systems inherit powerful existing techniques including fault tolerance with parallel recovery; in exchange each requires centralized modifications to the dataflow graph, which introduce substantial overhead that Naiad avoids. For example, Spark Streaming can process incremental updates in around one second, while in Section 6 we show that Naiad can iterate and perform incremental updates in tens of milliseconds.

discuss

order

shay_ker|5 years ago

That's very helpful, thanks! I think I still have to wrap my head around _why_ being stateful allows TD to respond faster, but maybe I just gotta dig deeper and see for myself.

frankmcsherry|5 years ago

It just comes down to something as simple as: "if I have shown you 1M different things, and now show you one more thing, what do you have to do to tell me whether that one thing is new or not?"

If you can keep a hash map of the things you've seen, then it is easy to respond quickly to that one new thing. If you are not allowed to maintain any state, then you don't have a lot of options to efficiently respond to the new thing, and most likely need to re-read the 1M things.

That's the benefit of being stateful. There is a cost too, which is that you need to be able to reconstruct your state in the case of a failure, but fortunately things like differential dataflow (built on TD) are effectively deterministic.

Also, I suspect "Spark" is a moving target. The original paper described something that was very much a batch processor; they've been trying to fix that since, and perhaps they've made some progress in the intervening years.

astn-austin|5 years ago

You should check out Apache Flink. It does a bunch of those things that Spark doesn't, though it's also missing a few things that Spark has. https://flink.apache.org/