Indirectly answering the question - I've skimmed through the git README, the abstract and all the pictures in the academic paper that it references.
I have no idea what this thing does. Can someone explain in simple terms what it does?
My organisation is currently investigating installing Spark on the theory that it connects to databases and we need analytics. As far as I can tell, it breaks analytics work into parallel workloads.
(disclaimer: I work at Materialize and I work with Differential regularly)
Differential dataflow lets you write code such that the resulting programs are incremental. For example, if you were computing the most retweeted tweet across all of Twitter and 5 minutes later 1,000 new tweets showed up, it would only take work proportional to those 1,000 new tweets to update the results. It wouldn't need to redo the computation across all tweets.
Unlike every other similar framework I know of, Differential can also do this for programs with loops / recursion, which makes it practical to express iterative algorithms (graph computations, fixed points, and so on).
Beyond that, as you've noted it parallelizes work nicely.
I wrote a blog post that was meant to explain "what does Differential do" and "when it is or isn't useful" and give some concrete examples that might be helpful: https://materialize.com/life-in-differential-dataflow/
And also this: http://muratbuffalo.blogspot.com/2017/11/on-dataflow-systems... - timely dataflow was inspired by Naiad.
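To make the incremental part concrete, here's a minimal sketch of a maintained count, modeled on the examples in the project's README (the tweet framing and the numbers are made up for illustration, and the API details may differ slightly across versions):

```rust
extern crate timely;
extern crate differential_dataflow;

use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Count;

fn main() {
    timely::execute_from_args(std::env::args(), move |worker| {
        // an input collection of (retweeter, tweet_id) pairs
        let mut retweets = InputSession::<u32, (u64, u64), isize>::new();

        let probe = worker.dataflow(|scope| {
            retweets
                .to_collection(scope)
                .map(|(_user, tweet)| tweet)
                .count() // maintains (tweet_id, retweet_count)
                .inspect(|change| println!("count changed: {:?}", change))
                .probe()
        });

        // initial batch of retweets
        for user in 0..1_000u64 {
            retweets.insert((user, user % 10));
        }
        retweets.advance_to(1);
        retweets.flush();
        worker.step_while(|| probe.less_than(retweets.time()));

        // later: a handful of new retweets (and one retraction) arrive;
        // only these deltas flow through, not the original 1,000 rows.
        retweets.insert((1_000, 7));
        retweets.insert((1_001, 7));
        retweets.remove((3, 3));
        retweets.advance_to(2);
        retweets.flush();
        worker.step_while(|| probe.less_than(retweets.time()));
    })
    .expect("timely computation failed");
}
```

The README's own hello world does the same thing for vertex degrees in a graph rather than retweet counts.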
Let's say you have three numbers, a, b, c, and you want to add them together to get the total. Then later, "c" changes, and you'd like to re-compute the total. One option would be to re-run the full sum, a + b + c, which would be fine. However, that repeats the "a + b" calculation.
Would it be possible to improve the efficiency of the total calculation by re-using the pre-computed "a + b" if only "c" changes?
Differential dataflow is one way to do that, but only really applies if you have lots of data with complex calculations. For analytics, maybe the "a + b" calculation would cover your last 5 years of operations, and then when a new day's worth of data comes in, you just compute the changes to the totals, rather than re-computing the analytics for all those years, all without manually having to write distinct "total" and "update" code.
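A hand-rolled illustration of that idea (plain Rust, no framework involved): cache the "a + b" part once, so that a change to "c" only touches "c":

```rust
// Cache "a + b" so that a change to "c" does not redo the whole sum.
struct CachedTotal {
    partial_ab: i64, // a + b, computed once
    c: i64,
}

impl CachedTotal {
    fn new(a: i64, b: i64, c: i64) -> Self {
        CachedTotal { partial_ab: a + b, c }
    }

    fn total(&self) -> i64 {
        self.partial_ab + self.c
    }

    // Only "c" changed: the cached "a + b" is reused.
    fn set_c(&mut self, new_c: i64) {
        self.c = new_c;
    }
}

fn main() {
    let mut t = CachedTotal::new(1, 2, 3);
    assert_eq!(t.total(), 6);

    t.set_c(10);
    assert_eq!(t.total(), 13);
    println!("total = {}", t.total());
}
```

Differential dataflow generalizes this bookkeeping to whole dataflow graphs of joins, aggregations and iteration, so you don't have to write the "total" and "update" paths by hand.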
For a large table, an aggregation over some column A will take some time to compute the result. Now assume we append a new record and want to get the new result. The traditional approach is to execute the query again. A better approach is to process only this new record, adding its value of A to the result of the previous query. This is important in (stateful) stream processing.
Something similar is implemented in these libraries, which however rely on a different data-processing concept (an alternative to map-reduce):
https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.
https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!
My personal favorite is "Functional hybrid modelling" - https://github.com/giorgidze/Hydra
I've always been interested in distributed stream processing platforms (Storm, Spark, Samza, Flink, etc.), and in particular in one that wasn't on the JVM (there used to be one called Concord). That said, I came across differential dataflow a while ago (as I also began writing more and more Rust).
I think the biggest issue is the documentation, not so much on writing code, but on building an actual production service using it. I think most of us can now grok that you have a Kafka Stream on one end and a datastore on the other, and the quintessential map/reduce hello world is WordCount.java. That isn't clear from the differential dataflow documentation - I remember thinking, how are they getting data from the outside world into this thing, and then thinking maybe I don't understand this project at all.
Consider the example in the README - the hello world is "counting degrees in a graph". While it gives you an idea of how simple it is to express that computation, it isn't interactive - it's unclear how one might change the input parameters (or if that's even possible). The hardest part of most of these frameworks is glue - but once you have that running, exploring what's possible is much easier. Differential Dataflow doesn't provide that for me right off the bat.
That said - I'm not surprised, when I last checked it out Rust Kafka drivers weren't all there and it seemed to be evolving parallel to everything else. I think what would make it more popular is a mental translation of common Spark tasks (like WordCount) to differential dataflow.
> it's unclear how one might change the input parameters (or if that's even possible).
Yeah, the readme is pretty dense on terminology that is unfamiliar (at least to me).
It answers that question like this:
> In the examples above, we can add to and remove from edges, dynamically altering the graph, and get immediate feedback on how the results change
but it would be great to show an example of that in code, as otherwise it is easy to assume that "reachable" is a fixed result set, when the whole point of the system is that presumably you can subscribe to changes in "reachable" as "roots" or "edges" change.
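For what it's worth, I'd expect the interactive part to look roughly like the fragment below (not a full program; it assumes the README's reachability example has already been set up, with an `edges` input session, a `probe` handle, a `worker`, and a current `round`, and it only uses methods the repo's own examples use):

```rust
// assumed context: `edges` is the InputSession feeding the "reachable"
// dataflow, `probe` probes its output, `worker` is the timely worker,
// and `round` is the logical time reached so far.
edges.insert((3, 7));          // add an edge
edges.remove((1, 2));          // retract an edge
edges.advance_to(round + 1);   // declare this batch of changes complete
edges.flush();

// run until the changes have propagated; updates to "reachable" are
// reported through whatever `.inspect(...)` was attached when the
// dataflow was built.
worker.step_while(|| probe.less_than(edges.time()));
```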
This was exactly my experience as well. I needed a non-JVM streaming platform. Differential Dataflow seemed like a possible fit, but I wasn't able to unlock the magic. Most likely a product-maturity issue (docs, examples, defined use cases, etc.) rather than a technical one.
I don't think it is possible to compare "differential dataflow" with projects like Spark. (don't know about kafka streams)
Spark is a "product" -- it has extensive documentation, supports multiple languages, and generally is production-grade. It is full of "nice-to-haves", like interactive status monitors, helper functions, serialization layers, and so on. It integrates with existing technologies (Hadoop, K8S, HDFS, etc..). It has this "finished product" feel.
"differential dataflow" seems to be a library. It only supports a single language. The documentation is very basic (it was not even clear if there is a way to run this on multiple machines or not). It is very bare-bones -- there are only a few dozens functions, and no resource monitors and interactive shells. It does not seem to integrate with anything. It has "research software" feel -- there are random directories in top-level repo, academic papers, and so on.
(it would probably be more fair to compare "materialize" to Spark...)
Ok. That sort of lines up with "lack of docs" and "missing features" but maybe the bigger thing is the general impression and expectations.
Having spent much of last year pointing kafka at differential dataflow and watching kafka fall over or fail to start, I definitely feel like I would trust differential dataflow more. But I agree that nothing about the presentation of the project gives that impression.
Compute and storage capabilities keep growing rapidly. If one structures their data well, uses a reasonable query processor, and some form of underlying columnar storage, then computing calculations on TBs of data can be accomplished in seconds at low cost.
Being able to recompute the world from scratch is a p0 requirement for most analytic workloads, as otherwise migrations and unforeseen computation changes due to new product requirements and other activities become painful.
This leaves differential techniques in an awkward spot where, to be effective, they need to:
1) Operate on vast quantities of data or sufficiently complex calculations such that optimization of compute is a concern to the end-user.
2) Operate in a computational environment that is sufficiently constrained such that all present and future changes to the computation can be reasonably accounted for.
3) Be transparent enough that engineers don't feel that they are duplicating logic.
It's hard to think of applications that would meet these criteria outside of intelligent caching of computation within DB engines.
Depends how you look at it. Single-thread perf is pretty much stagnant. Only recently has Ryzen budged multi-core perf, and we're still to see how long this growth path can last.
I agree that cloud has made computing power more accessible, but there are known limits regarding scalability. Moreover, using distributed computing (spark, etc) kills efficiency (see all those posts about a laptop beating a medium sized big data cluster).
>It's missing some important feature, like persistence?
Hard to answer, considering....
>It's had very little advertising?
Personally, I've never heard of it and have no idea what differential dataflow is. Maybe I've used the idea, but I never gave it or discovered a name for it. I don't know what Spark or Kafka Streams are either. Maybe that's because I've never had a use case for them that wasn't satisfied by a tool that was "good enough", or, more likely, because I haven't come across anyone recommending them on projects - they also don't know what those tools are. I would never have known what RabbitMQ was if a coworker hadn't suggested we use it to build queues, and it turned out to be cumbersome to use and 100x more complicated than writing a stored procedure that turned out to be "good enough".
Most tools fall into that space where they are marginally better in some regards than "good enough", but not better enough to make up for the learning curve for other developers, changes in maintenance or design, cost, etc. Advertising is pretty general, and it's hard to say which of these things it's doing wrong; depending on their market, none of them might be wrong for them - the potential market is just content with "good enough" and has no need to search for tools like this.
>Rust is intimidating?
I'm not sure what the stats on Rust are, but I don't think it's popular enough among business developers that you could point to it as the reason a tool has failed to gain adoption.
"Build a better mousetrap and the world will beat a path to your door." is bunk. People don't automatically adopt new better things. I don't know why though.
In my teens and twenties I collected ideas the way some folk collect stamps. The simple fact of the matter is that there are amazing things out there that you've never heard of, and no one really seems to care.
(As an aside, I hate the question, "If FOO is so great why doesn't everyone use it?" I do not know. That's not my department.)
These days some of these things are better known and some even have Wikipedia pages and stuff ( E.g. https://en.wikipedia.org/wiki/Vaneless_ion_wind_generator ) but a lot of others are still obscure (trawl through Rex Research if you want to look for weird tech.)
Like, there's a mechanism that can absorb kinetic energy. The demo has a little car on rails with a ramp at one end and a wall at the other. They put a wineglass at the wall and they put the car on the ramp and let it go: the glass shatters. They activate the device and repeat: car hits glass and halts, glass does not shatter. Messed up, right? They're from Poland IIRC, they've been doing demos at trade shows. I bet you've never heard of them. (Bug me and I'll try to dig up a link; They're in Rex Research.)
I already mentioned the "Vaneless" ion wind generator, an efficient solid-state device for converting wind into electric power without e.g. killing birds with spinning vanes. Cheap, simple, easy, durable, been around for decades, and you just now heard about it, eh? :)
There's a battery that desalinizes salt water. A nuclear reactor made of molten salt. Balloons stronger than steel. There's a guy in Michigan, Wally Wallington, who figured out how to move monoliths single-handedly, they walk just like the old stories say!
Anyway, I'm getting ranty here. To veer back on topic: yeah, it's a bummer, you build an awesome mousetrap and even the people with lots of mice ignore it. I wish I knew what to tell you. Maybe paint it mauve?
These examples provide your answer - digging just past the surface, they're not better at giving people what they want.
The vaneless ion generator is >5x less efficient.
Molten Salt Reactors, on the surface, sound quite safe, but nuclear power's primary problem is economics, and this is an unproven technology operating over the long time scales needed to amortize reactor construction costs. Nuclear power systems present novel failure mechanisms that don't exist in everyday technology, such as the corrosion mechanisms caused by coupling of mechanical, thermal, chemical, and radiological stresses, and MSRs present a new set of these that are poorly understood. Additionally, MSRs exhibit thermal shocks in normal operation orders of magnitude above what may be produced in a water-cooled reactor.
It's just not that simple - it looks simple to you.
There are usually multiple sane reasons why these ideas don't make it; either technological, practical, business or social hurdles. Conformism is one of them. They aren't actually better than what they are going against until they overcome that, or offer some kind of massive advantage that makes it dumb not to switch.
This is a great thread, thanks. Familiarity is a powerful thing. It helps explain why excel remains the most widely used “programming language” in enterprise.
Part of it is habits, which are not easy to change. And the fact that when they are changed, it’s often incrementally, by substituting them with something similar enough.
Thanks for the reminder of the amazing things that are out there but just haven’t cracked the mainstream. I think about it a lot with software from the 90s. Things like HyperCard. Stuff that, despite slower processors and less memory, seemed MORE sophisticated than a lot of what’s standard today.
I've looked at this and thought it looked amazing, but also haven't used it for anything. Some thoughts...
Rust is a blessing and a curse. It seems like the obvious choice for data pipelines, but everything big currently exists in Java and the small stuff is in JavaScript, Python or R. Maybe this will slowly change, but it's a big ship to turn. I'm hopeful that tools like this and Ballista [1] will eventually get things moving.
Since the Rust community is relatively small, language bindings would be very helpful. Being able to configure pipelines from Java or Typescript(!) would be great.
Or maybe it's just that this form of computation is too foreign. By the time you need it, the project is so large that it's too late to redesign it to use it. I'm also unclear on how it would handle changing requirements and recomputing new aggregations over old data. Better docs with more convincing examples would be helpful here. The GitHub page showing counting isn't very compelling.
[1] https://github.com/ballista-compute/ballista
These products are competing for mindshare in an incredibly saturated market. There are a lot of ways to skin the data-pipeline cat. I think a lot of companies have already founded data engineering teams, all of whom have established tech stacks for data engineering tasks.
Personally, I keep an eye out on new technologies, but I'm not likely to embrace them without good reason. A fragmented tech stack is annoying.
This looks an awful lot like Spark to me. And doesn't seem to really solve the problems I typically experience with data engineering. For me, the biggest issue is orchestration. I don't see any facilities here for managing and executing data pipelines.
So, it seems to me that people aren't using dataflow more because it looks a lot like legacy products on the market. And it doesn't solve the massive problem of job orchestration and management. Apache Airflow + python + BigQuery is immensely powerful and dead simple to use. It's going to be hard to compete with.
> The api is too hard to use? The docs / tutorials are not good enough?
DD falls into an uncanny valley where the API surface is simple enough to grasp quickly yet foreign enough that actually grokking it is pretty hard, let alone applying it in an organization where maintenance is a top concern. To do anything nontrivial, you need knowledge of timely-dataflow too, and the DD documentation doesn't do a good job of integrating knowledge from the TD docs - they're written by someone who has already internalized that knowledge, so it's an afterthought. Getting data in and out of the dataflow and orchestrating workers is pretty much undocumented outside of GitHub issue discussions. Trying to abstract away dataflows behind types and functions turns into a big ol' generic mess. There are a lot of rough edges like that (and the abomination crate is... well... an abomination).
McSherry's blog posts, while tantalizing, are often focused too much on static examples (entire dataset is available upfront) and are too academic-focused to make up for holes in the book. As far as I can tell, the library hasn't seen enough use for best practices to emerge and there's almost no guidance on how to build a real world system with DD.
By far the biggest problem I've had: I can avoid a DD project for a week or two at most before enough knowledge leaves my memory that I have to spend days rereading my own code to get reoriented and productive again. You either use unlabeled tuples which turns the dataflow into an unholy mess or you spend half your time writing and deleting boilerplate when doing R&D. DD is just too weird and the API too awkward - I haven't figured out a method for writing straightforward DD code.
That said, when I have gotten it to work on nontrivial problems, the performance and capabilities have been really impressive. I've just never been able to get the stars to align to use it in a professional context with future maintainers.
I think what DD needs is a LINQ-like composable query language that abstracts away the tuple datatypes and provides an ORM/query-builder layer on top of dataflow statements. Most developers are familiar with SQL statements, which would make DD a lot easier to adopt.
Personally, now that I've seen this, I think it's going to solve a problem we're going to run into in the medium term, so that's pretty neat: we are dealing with a lot of data, and re-computing certain things from scratch would be potentially prohibitive at our scale.
I think the main issue is that in most shops the scale of the data isn't so large that re-computing a query over new data takes long enough for them to want to put in the engineering effort to switch off more common tools like Spark, Airflow and columnar-storage DBs. They're also likely, with decent engineering, not yet at a point where they run into tuning issues on their ingest side. An ETL taking an hour every night, and then a couple of seconds to run that query (or a job that just sends out a report), isn't really an issue for most small-to-medium-sized companies, and even at larger ones, if your data throughput isn't particularly high, I don't see people needing to reach for this for the same reasons.
You obviously can do those less intensive tasks in DDF, but it doesn't make a strong case for itself there, largely because DDF doesn't seem to offer much more benefit on those smaller tasks. 15s to 230ms is a really tremendous leap in performance, but for many companies I doubt the 15s is a bottleneck in the first place, so it's not actually solving a problem there; it would be a nice-to-have.
A possible reason not mentioned in the post is that writing efficient incremental algorithms is just fundamentally hard, despite the primitives and tooling afforded by the differential dataflow library. For example, even with a lot of machine learning libraries targeting python, there are only a couple that really implement online algorithms.
Can confirm, am still unsure if DDlog [0] can be switched to Worst-case optimal joins (WCOJ) [1] with the recent (unreleased, but almost 1y old) calculus operators of DDflow [2][3], because at least the original dogs^3 approach supposedly doesn't work in iterative contexts (which are necessary for recursive operations, like graph computations). The calculus blog post ends on a promising note, however.
I'm trying to help a couple (friends?) with getting the analysis of rslint [4] running well on DDlog or at least DDflow, with the end-goal being a perceptually zero-latency linter that typically responds faster than a human types.
We're currently seeing initial delays in the single-digit second range, and that's not even on large projects (the incremental performance is far better, but we would like to out-compete the official TS typechecker even in CI settings that don't keep the linter's state across runs).
The good news: we're making nice progress on profiling tools and I might get to trying some WCOJ code later today.
[0]: https://github.com/vmware/differential-datalog
[1]: https://github.com/TimelyDataflow/differential-dataflow/tree...
[2]: https://github.com/frankmcsherry/blog/blob/master/posts/2020...
[3]: https://mtrlz.dev/api/rust/dogsdogsdogs/calculus/index.html (3rd-party hosting of docs for the calculus/dogs^3 crate)
[4]: https://github.com/rslint/rslint
> It's missing some important feature, like persistence?
For the use cases I'm envisioning, this strikes me as a nice-to-have, and even then only if the persistence API were sufficiently easy to use (or at least to avoid).
> It's had very little advertising?
I hadn't heard much about it till now.
> Rust is intimidating?
At work I need a killer reason to inflict _any_ language on everyone else. We have a lot of shallow computation graphs (really the same few graphs on different datasets) and a few deep graphs which need incremental updates. The cost of an ad-hoc solution is less than the perceived cost (maybe the real cost) of adopting an additional language.
> Other
Broad classes of algorithms will basically expand to being full re-computations with this framework (based on a quick read of the whitepaper), and adopting a tool for efficient incremental updates is less enticing if I'm going to have to manually fiddle with a bunch of update algorithms anyway. E.g., kernel density estimation needs to be designed from the ground up to support incremental updates; a naive translation of those algorithms to dataflow would efficiently update some sub-components like covariance matrices, but you'd still wind up doing O(full_rebuild) work.
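To make that concrete, here's a hand-rolled illustration (plain Rust, nothing to do with the library itself): the running moments update in O(1) per new point, but evaluating the KDE still touches every sample, so the incremental win doesn't carry through to the final answer:

```rust
// Sub-component that updates cheaply: running mean/variance via Welford's algorithm.
struct RunningMoments {
    n: f64,
    mean: f64,
    m2: f64, // sum of squared deviations from the mean
}

impl RunningMoments {
    fn new() -> Self {
        RunningMoments { n: 0.0, mean: 0.0, m2: 0.0 }
    }

    // O(1) incremental update when a new sample arrives.
    fn push(&mut self, x: f64) {
        self.n += 1.0;
        let delta = x - self.mean;
        self.mean += delta / self.n;
        self.m2 += delta * (x - self.mean);
    }

    fn variance(&self) -> f64 {
        if self.n > 1.0 { self.m2 / self.n } else { 0.0 }
    }
}

// The part that does not update cheaply: a Gaussian KDE evaluated at `x`
// still sums over every sample, i.e. O(n) per query point.
fn kde_at(samples: &[f64], bandwidth: f64, x: f64) -> f64 {
    let norm = (2.0 * std::f64::consts::PI).sqrt() * bandwidth * samples.len() as f64;
    samples
        .iter()
        .map(|&s| (-0.5 * ((x - s) / bandwidth).powi(2)).exp())
        .sum::<f64>()
        / norm
}

fn main() {
    let samples = vec![1.0, 2.0, 2.5, 4.0];
    let mut moments = RunningMoments::new();
    for &s in &samples {
        moments.push(s);
    }
    println!("variance = {:.3}", moments.variance());
    println!("kde(2.0) = {:.3}", kde_at(&samples, 0.5, 2.0));
}
```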
I missed the edit window, but I'm rethinking that last point. It's not clear to me yet whether the current implementation supports this, but I don't see any fundamental reason why one couldn't extend the framework with user-defined operators, and that could make for an extremely pleasant end-user experience.
We have something very similar to differential dataflow implemented at my current place of work, with our own home-brewed libraries and patterns that leverage the relatively unique way we store our data (most similar to TimeScaleDB).
Like map reduce, most people do not understand how it works and why it is a useful paradigm.
Unlike map reduce, there is not an entire sub-industry of companies offering it as a service, and engineers who have used it for years without contemplating its alternatives. In absence of this background noise, people assume DD is niche, and even "wrong" or harmful.
We have new people who come in from time to time, who have experience working at a giant MR shop, who spend the first few months wondering aloud why we don't "just use <MR Framework>". They usually come around (if they care to understand how this new system works) or give up (usually because they never understood the MR trade-offs in the first place, but were unwilling to part with its style of thinking/working).
One thing I'll note is our jargon around it is extremely minimal and literal. The diction employed by DD (and TimeScaleDB!) feels very formal in comparison, which can be off-putting to prospective users.
I'm not one to advocate for dumbing down your tone (quite the opposite). However, it's interesting to note that the successful-yet-complicated projects (like MR, kafka) have an accretion disk around them of dumbed-down explanations on Medium, youtube and the like, that can lure in people who are curious but less-academic.
I don't think you can manufacture these. It's just a matter of time, until things like this appear for DD.
I’ve been curious about it, but it’s difficult to wrap my mind around. I’ve read a lot of frank mcsherry’s blog posts, watched his videos, been through the book, and I guess it just hasn’t clicked for me! I also don’t have any use cases that make sense as a hobby project, and abstractly I know it could be useful at work but I can’t evangelize something I don’t really understand.
Rust took me around three attempts to get into, and it took a motivated project to really seal the deal, but at some point I understood enough that it just became programming again. Haven’t reached that with differential dataflow yet, but I’ll keep trying.
in case you're interested, we're a loose group working on DDlog [0] and DDflow to be able to use them for a JS/TS linter [1].
There are a couple fairly concrete and varyingly-isolated tasks on Timely [2], DDlog, as well as DDflow's dogs^3 [3](for the latter, boilerplate-encapsulating tooling with documentation for WCOJ [4]).
Let me know if you'd like to talk.
[0]: https://github.com/vmware/differential-datalog
[1]: https://github.com/rslint/rslint
[2]: https://github.com/TimelyDataflow/timely-dataflow
[3]: https://mtrlz.dev/api/rust/dogsdogsdogs/index.html
[4]: https://github.com/TimelyDataflow/differential-dataflow/tree...
I wanted to pick it up, I feel it's under-appreciated technology that has lots of potential. Reasons why I didn't:
- It's somewhat hard to sell to management. There was (for a long time) no company behind it to provide support, and it's not a "successful Apache project" with a large-ish community either. And generally, for a long while it was a passion project more than something Frank McSherry would actively encourage you to use in production.
- As others have said, the "hello world" is somewhat tricky. Not a lot of people know Rust. If you say "let's do this project in Rust", this will likely not go well; if I were able to use it from .NET and the JVM, as a library, it might be an easier sell (I'm personally more invested in .NET now, but earlier in my career it would've been the JVM).
- Last but not least: the "productization" story is a bit tricky; comparing it to Spark does it no service. For Spark, not only do I have managed clusters like EMR, but I have a decent amount of tooling to see how the cluster executes (the Spark web UI). Also, I can deploy it on Mesos or YARN, not just standalone (and Mesos/YARN have their own tooling). For differential dataflow, one had none of that (at least last time I checked). Maybe it'd be more fair to compare it to Kafka Streams?
* Might I add: spark-ec2 was a huge help for me in picking up Spark, since before the 1.0 version. You can do tons of work on a single machine, yes... but for this kind of system, the very first question is "how do you distribute that?". And you have the story that "it's possible", but you don't have easy examples of "word count, done on 3 machines, not because it's necessary but because we demonstrate how easy it is to distribute the computation across machines".
* Compared to Kafka Streams: the thing about Kafka Streams is that you know what to use it for (Kafka!) and one immediately groks how one uses this in production (all state management is delegated to Kafka, this is truly just a library that helps you work better with Kafka). With differential dataflow, it's much less clear. You could use it with Kafka, but also with Twitter directly, or with something else. And what happens if it crashes? How do you recover from that? What are the data loss risks? Does it give you any guarantees or do you have to manage that?
I am a huge fan of Frank McSherry's work and don't necessarily agree with the premise that DD is somehow failing. However,...
Batch data processing is very well understood, cheap and getting cheaper every year. So, if you can afford to boil the ocean every night, DD is a tough sell.
The addressable market (customers with problems which can only be solved with DD, i.e. instantaneous, exactly correct answers) is probably small right now.
I think the killer app for differential dataflow would be an easy to set up realtime database like Firebase, but with much richer real-time queries and materialized views.
Materialize (built on differential dataflow) is cool but doesn't have the complete package of a persisted database.
Do you happen to have any examples of real-time queries or apps you would be interested in?
Re: the second point — you’re right, Materialize has historically leveraged existing upstream systems (like Kafka) for things like persistence. But we also hear you loud and clear that not everyone wants to stand up Kafka :)
Being able to update the (query) output with new input data, rather than processing the whole input again even if the changes are very small, is indeed a very useful feature. Assume that you have one huge input table and you computed a result consisting of a few rows. Now you add 1 record to the input. A traditional data processing system will again process all the input records, while the differential system will update the existing output result.
There are the following difficulties in implementing such systems:
o (Small) changes in input have to be incrementally propagated to the output as updates rather than new results. This changes the paradigm of data processing because now any new operator has to be "update-aware" (a hand-rolled sketch of such an operator follows this list)
o Only simple operators can be easily implemented as "update-aware". For more complex operators like aggregation or rolling aggregations, it is frequently not clear how it can be done conceptually (efficiently)
o Differential updates have to be propagated through a graph of operations (topology) which makes the task more difficult.
o Currently popular data processing approaches (SQL or map-reduce) were not designed for such a scenario so some adaptation might be needed
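A toy version of what "update-aware" means for the simplest operator, a count (this is a hand-rolled sketch to illustrate the idea, not how differential dataflow or any particular library implements it):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// An "update-aware" count: input is a batch of (key, diff) changes,
// output is one (key, old_count, new_count) record per input change.
struct IncrementalCount<K> {
    counts: HashMap<K, i64>,
}

impl<K: Hash + Eq + Clone> IncrementalCount<K> {
    fn new() -> Self {
        IncrementalCount { counts: HashMap::new() }
    }

    fn update(&mut self, changes: Vec<(K, i64)>) -> Vec<(K, i64, i64)> {
        let mut out = Vec::new();
        for (key, diff) in changes {
            let entry = self.counts.entry(key.clone()).or_insert(0);
            let old = *entry;
            *entry += diff;
            out.push((key, old, *entry));
        }
        out
    }
}

fn main() {
    let mut counter = IncrementalCount::new();
    // initial data: three "a", one "b"
    println!("{:?}", counter.update(vec![("a", 1), ("a", 1), ("a", 1), ("b", 1)]));
    // later: one "a" is retracted; only "a"'s output row changes
    println!("{:?}", counter.update(vec![("a", -1)]));
}
```

The hard part, as noted above, is doing the same for operators like joins, aggregations and rolling windows, and then composing the resulting deltas through a whole topology.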
Another system where such an approach was implemented, called incremental evaluation, is Lambdo:
https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!
Yet, this Python library relies on a different novel data processing paradigm where operations are applied to columns. Mathematically, it uses two types of operations: set operations and function operations, as opposed to traditional approaches based on only set operations.
A new implementation is here:
https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.
Yet, currently incremental evaluation is implemented only for simple operations (calculated columns).
How is a rolling aggregate hard to update? If the value at index i is changed, just update everything from i-n to i+n (where n is the rolling window size).
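A hand-rolled sketch of exactly that update rule (plain Rust; `w` is the window size and the change is at index `i`), touching only the window sums that cover `i`:

```rust
// sums[j] holds the sum of data[j .. j + w]; when data[i] changes by `delta`,
// only the windows whose start j satisfies j <= i < j + w need adjusting.
fn update_rolling_sums(sums: &mut [i64], data_len: usize, w: usize, i: usize, delta: i64) {
    let first = i.saturating_sub(w - 1);
    let last = i.min(data_len - w); // last valid window start covering i
    for j in first..=last {
        sums[j] += delta;
    }
}

fn main() {
    let mut data: Vec<i64> = vec![1, 2, 3, 4, 5];
    let w = 2;
    let mut sums: Vec<i64> = data.windows(w).map(|win| win.iter().sum()).collect();
    assert_eq!(sums, vec![3, 5, 7, 9]);

    // change data[2] from 3 to 10: only the two windows covering index 2 change
    let (i, new_val) = (2usize, 10i64);
    let delta = new_val - data[i];
    data[i] = new_val;
    update_rolling_sums(&mut sums, data.len(), w, i, delta);

    assert_eq!(sums, vec![3, 12, 14, 9]);
    println!("{:?}", sums);
}
```

The trickier cases are aggregates where one input change can affect an unbounded amount of output, or where the aggregate isn't invertible under retractions (min/max, for example).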
I am not working in this domain, but why not put some (a lot of) numbers on the claim that it is dramatically faster than Spark etc.? Maybe show how a 10-hour Spark problem can be reduced to minutes.
Frank McSherry, the one behind the timely dataflow library and materialize.com, did show many years ago that there is plenty of performance headroom above Spark, by showing how one laptop can beat a Spark cluster:
"Scalability .. but at what COST?" http://www.frankmcsherry.org/assets/COST.pdf
Where and how in dataflow is late data being handled? How can I configure in which ways refinements relate? These questions are the standard "What, Where, When, How" I want to answer and put into code when dealing with streaming data:
https://www.oreilly.com/radar/the-world-beyond-batch-streami...
https://www.oreilly.com/radar/the-world-beyond-batch-streami...
I was not able to find this in the documentation, but I only spent a few minutes scanning it.
Also "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133
Additionally "Materialize" states in their doc: State is all in totally volatile memory; if materialized dies, so too does all of the data. - this is not true for example for Apache Flink which stores its state in systems like RocksDB.
Having SideInputs or seeds is pretty neat, imagine you have two tables of several TiBs or larger. This is also something that "Materialize" currently lacks:
Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data.
Late data is very deliberately not handled. The reasoning for that is best available at [0].
Now, there are ways [1] to handle bitemporal data, but they have fairly significant issues in ergonomics and performance, due to the additional work needed to allow the bitemporal aggregations.
As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).
Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.
[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020...
[1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018...
[2]: https://github.com/TimelyDataflow/differential-dataflow/issu...