top | item 25869677


james_woods | 5 years ago

Where and how is late data handled in dataflow? How can I configure the ways in which refinements relate to each other? These are the standard "What, Where, When, How" questions I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

https://www.oreilly.com/radar/the-world-beyond-batch-streami...
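The four questions map onto concrete configuration knobs. Here is a minimal, hypothetical sketch in plain Python (not any framework's actual API) of what answering them looks like: a sum as the "What", tumbling event-time windows as the "Where", a watermark plus allowed lateness as the "When", and accumulating refinements as the "How":

```python
# Hypothetical sketch: the four streaming questions as explicit knobs.
# What = the aggregation, Where = event-time windows,
# When = watermark + allowed lateness, How = accumulating refinements.

WINDOW_SIZE = 60          # Where: 60 s tumbling windows
ALLOWED_LATENESS = 1800   # When: keep windows open 30 min past the watermark

def window_for(ts):
    """Assign an event timestamp to its tumbling window [start, end)."""
    start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

class AccumulatingWindows:
    """How: every firing re-emits the full (accumulated) pane result."""
    def __init__(self):
        self.panes = {}   # window -> list of values

    def add(self, ts, value, watermark):
        win = window_for(ts)
        # Drop data arriving after window end + allowed lateness.
        if watermark > win[1] + ALLOWED_LATENESS:
            return None
        self.panes.setdefault(win, []).append(value)
        # What: a simple sum, refined on every (possibly late) arrival.
        return win, sum(self.panes[win])

agg = AccumulatingWindows()
print(agg.add(ts=10, value=3, watermark=0))     # on-time pane: ((0, 60), 3)
print(agg.add(ts=20, value=4, watermark=70))    # allowed late data: ((0, 60), 7)
print(agg.add(ts=5, value=1, watermark=2000))   # too late, dropped: None
```

In Beam these knobs correspond roughly to the window function, the trigger, allowed lateness, and the accumulation mode; the sketch only illustrates how they interact.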

Also, "Materialize" does not seem to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally, "Materialize" states in their docs: "State is all in totally volatile memory; if materialized dies, so too does all of the data." This is not true, for example, of Apache Flink, which stores its state in systems like RocksDB.

Having side inputs or seeds is pretty neat: imagine you have two tables of several TiBs or larger. This is also something "Materialize" currently lacks: "Streaming sources must receive all of their data from the stream itself; there is no way to 'seed' a streaming source with static data."
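For illustration, a side input is essentially a pre-loaded (possibly huge) static table that every streamed element can be joined against. A minimal, hypothetical sketch, with a toy dict standing in for the TiB-scale table:

```python
# Hypothetical sketch of a "side input": a static table loaded once up
# front and joined against a live stream, i.e. the kind of "seeding"
# that Materialize's docs (quoted above) said it could not do.

static_table = {"u1": "Berlin", "u2": "Tokyo"}  # stand-in for a huge seeded table

def enrich(stream):
    """Join each streamed (user, amount) event against the side input."""
    for user, amount in stream:
        yield user, amount, static_table.get(user, "unknown")

events = [("u1", 10), ("u3", 5)]
print(list(enrich(events)))  # [('u1', 10, 'Berlin'), ('u3', 5, 'unknown')]
```

In Beam this corresponds to passing a `PCollection` as a side input to a `ParDo`; the point is only that the static data never has to arrive via the stream itself.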



namibj | 5 years ago

Late data is very deliberately not handled; the reasoning for that is best laid out in [0]. Now, there are ways [1] to handle bitemporal data, but they have fairly significant ergonomic and performance issues, due to the additional work needed to support bitemporal aggregations.

As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).

Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020...

[1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018...

[2]: https://github.com/TimelyDataflow/differential-dataflow/issu...

james_woods | 5 years ago

Taken from [0]: "If you wanted to use the information above to make decisions, it could often be wrong. Let's say you want to wait for it to be correct; how long do you wait?"

I know how long I want to wait: 30 minutes, in one of my cases, because I know I've seen 95% of the important data by then. In the streaming world there is _always_ late data, so being able to specify what should happen when the rest (the 5%) arrives is crucial for me.
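What happens when that remaining 5% arrives is exactly the "how refinements relate" question: an accumulating pane re-emits the whole corrected result, while a discarding pane emits only the late contribution. A minimal, hypothetical sketch (not any framework's API):

```python
# Hypothetical sketch: the same late pane under the two refinement modes.
# ACCUMULATING re-emits the whole updated result; DISCARDING emits only
# the delta contributed by the late data.

def fire_panes(on_time, late, mode):
    """Return the sequence of pane results a window would emit."""
    panes = [sum(on_time)]                      # pane fired at the watermark
    if mode == "accumulating":
        panes.append(sum(on_time) + sum(late))  # refined, corrected total
    else:  # "discarding"
        panes.append(sum(late))                 # only the late contribution
    return panes

on_time = [95]  # the ~95% seen within the 30-minute wait
late = [5]      # the ~5% stragglers
print(fire_panes(on_time, late, "accumulating"))  # [95, 100]
print(fire_panes(on_time, late, "discarding"))    # [95, 5]
```

Downstream consumers that overwrite a result want accumulating panes; consumers that sum deltas want discarding ones, which is why this has to be configurable per use case.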

This differs from use case to use case for me, and being able to configure this behavior and handle out-of-order data at scale is key when I select a framework for stream processing. Apache Beam and Apache Flink do this very well.

Taken from [1]: "Apache Beam has some other approach where you use both and there is some magical timeout and it only works for windows or something and blah blah blah... If any of you all know the details, drop me a note."

It obviously only works when you window your data, as the state needs to fit in memory. The event-time and system-time concepts from Beam and Flink are very similar, as is the watermark approach.

Thank you for sharing the links. It is now clearer to me where the difference lies between differential dataflow and stream-processing frameworks (which also offer SQL and even ACID conformity!). I'm using Beam/Flink in production, and missing out on any of these points is a deal-breaker for me.