rkhaitan's comments

rkhaitan | 5 years ago | on: Why isn't differential dataflow more popular?

(disclaimer: I work at Materialize and I work with Differential regularly)

Differential dataflow lets you write code such that the resulting programs are incremental. For example, if you were computing the most-retweeted tweet across all of Twitter and 1,000 new tweets showed up five minutes later, updating the result would only take work proportional to those 1,000 new tweets. It wouldn't need to redo the computation across all tweets.
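To make that concrete, here's a toy Python sketch of the idea (my own illustration, not the differential-dataflow API): fold each batch of new counts into running state, doing work proportional to the batch size rather than the full history. This additions-only version gets away with tracking a single leader; handling retractions would need more machinery.

```python
from collections import Counter

# Toy sketch (NOT the differential-dataflow API): maintain running
# retweet counts and the current leader incrementally. Each batch
# costs work proportional to the batch, not the total history.

counts = Counter()            # tweet_id -> retweet count
top = (0, None)               # (count, tweet_id) of current leader

def apply_batch(batch):
    """batch: iterable of (tweet_id, delta) with delta > 0."""
    global top
    for tweet_id, delta in batch:
        counts[tweet_id] += delta
        # only touched keys can change the leader
        top = max(top, (counts[tweet_id], tweet_id))
    return top

apply_batch([("t1", 3), ("t2", 5)])   # leader is now ("t2", 5 retweets)
apply_batch([("t1", 4)])              # leader is now ("t1", 7 retweets)
```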

Unlike every other similar framework I know of, Differential can also do this for programs with loops / recursion, which makes it possible to express iterative algorithms.

Beyond that, as you've noted it parallelizes work nicely.

I wrote a blog post that was meant to explain "what does Differential do" and "when it is or isn't useful" and give some concrete examples that might be helpful. https://materialize.com/life-in-differential-dataflow/

rkhaitan | 5 years ago | on: Materialize Raises a $32M Series B

I'm not entirely sure what you mean by n per x, but if by "top" you mean something like "get the top k records per group", then we support that. See [1] for more details. Top-k is actually also rendered with a heap-like dataflow.
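As a rough sketch of that heap-like idea (a toy of my own, not Materialize's actual dataflow), a per-group top-k can be maintained with a bounded min-heap per group, so each insertion costs O(log k):

```python
import heapq
from collections import defaultdict

# Toy sketch of per-group top-k maintenance (not Materialize's actual
# implementation): each group keeps a size-K min-heap whose root is the
# smallest retained score, so inserting a record costs O(log K).

K = 3
heaps = defaultdict(list)   # group -> min-heap of up to K scores

def insert(group, score):
    h = heaps[group]
    if len(h) < K:
        heapq.heappush(h, score)
    elif score > h[0]:
        heapq.heapreplace(h, score)   # evict the current smallest

for g, s in [("a", 5), ("a", 9), ("a", 1), ("a", 7), ("b", 2)]:
    insert(g, s)

print(sorted(heaps["a"], reverse=True))   # top 3 scores in group "a"
```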

When we plan queries, we render them into dataflow graphs that consist of one or more dataflow operators transforming data and sending it on to other operators. Every single operator is designed to do work proportional to the number of changes in its inputs / outputs. For us, optimizing performance is a little less a matter of picking the right data structures, and more about expressing things as a dataflow that can handle changes to its inputs robustly. But that robustness is more a question of "what are my constant factors when updating results" and not "is this being incrementally maintained or not".

We have a known-limitations page in our docs [2], but it mostly covers things like incompleteness in our SQL support or Postgres compatibility. We published our roadmap in a blog post a few months ago [3]. Beyond that, everything is public on GitHub [4].

[1]: https://materialize.com/docs/sql/idioms/
[2]: https://materialize.com/docs/known-limitations/
[3]: https://materialize.com/blog-roadmap/
[4]: https://github.com/MaterializeInc/materialize

rkhaitan | 5 years ago | on: Materialize Raises a $32M Series B

(Disclaimer: I'm one of the engineers at Materialize)

> for example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record

This isn't a requirement in Materialize, because Materialize will store values in a reduction tree (which is basically like a min / max heap), so that when we add or remove a record, we can compute the new min / max in O(log(total_number_of_records)) time in the worst case (when that record was the min / max). Realistically, that log term is bounded by 16: it's a 16-ary heap, we don't support more than 2^64 records, and log_16(2^64) = 16. Computing the min / max this way is substantially better than having to recompute with a linear scan. This post [1] gives a lot more detail on how we compute reductions in Materialize.
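Here's a toy Python sketch of the reduction-tree idea (my own illustration, not Materialize's actual code): an array-backed 16-ary tree where each internal node holds the min of its children, so updating a single record touches only O(log_16 n) nodes, and retracting the current min still yields the new min cheaply.

```python
# Toy sketch of a reduction tree for min (not Materialize's actual
# implementation): a 16-ary tree of layers, each internal node storing
# the min of its up-to-16 children. One update recomputes one node per
# layer, i.e. O(log_16 n) work.

B = 16                  # branching factor, as in the comment above
INF = float("inf")

class MinTree:
    def __init__(self, n):
        self.vals = [INF] * n          # leaf values, INF = absent
        self.layers = []               # successively smaller parent layers
        size = n
        while size > 1:
            size = (size + B - 1) // B
            self.layers.append([INF] * size)

    def update(self, i, v):
        """Set leaf i to v (use INF to retract a record)."""
        self.vals[i] = v
        child_layer = self.vals
        for layer in self.layers:      # one node per layer: O(log_B n)
            p = i // B
            lo, hi = p * B, min((p + 1) * B, len(child_layer))
            layer[p] = min(child_layer[lo:hi])
            child_layer, i = layer, p

    def min(self):
        if not self.layers:
            return min(self.vals, default=INF)
        return self.layers[-1][0]      # the root holds the global min

t = MinTree(100)
t.update(5, 42)
t.update(80, 7)
print(t.min())      # 7
t.update(80, INF)   # retract record 80
print(t.min())      # 42, without a linear rescan of all leaves
```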

> there are obviously limits to what can be efficiently maintained

I think we fundamentally disagree here. In our view, we should be able to maintain every view either in time linear in the number of updates or sublinear in the size of the overall dataset, and every case where we don't is a bug. The underlying computational frameworks [2] we're using are designed for exactly that, so this isn't just wishful thinking.

> if Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions

Thank you! I certainly hope so!

[1]: https://materialize.com/robust-reductions-in-materialize/
[2]: https://github.com/timelydataflow/differential-dataflow/blob...

rkhaitan | 5 years ago | on: Eventual Consistency isn’t for Streaming

The aim of the examples is to show what goes wrong in eventually consistent systems where it's possible that two reads of a stream may not be consistent with respect to each other. The examples are not intended to say that such anomalies can't be fixed by providing stronger consistency guarantees by using timestamps.
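As a toy illustration of that kind of anomaly (my own sketch, not an example from the article): if two reads observe different prefixes of the same stream, a computation over them can report a state that never existed, whereas reading both at a common timestamp would not.

```python
# Toy sketch of an eventual-consistency anomaly (my own illustration):
# one atomic transfer appears in the stream as two events, but a reader
# that sees only a prefix observes money that has "vanished".

transfers = [("alice", -100), ("bob", +100)]   # one atomic transfer

read1 = transfers[:1]   # first read: only the debit has arrived
read2 = transfers[:2]   # later read: both events have arrived

total_after_read1 = sum(delta for _, delta in read1)   # -100: impossible state
total_after_read2 = sum(delta for _, delta in read2)   #    0: consistent again
```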