Materialize has tackled the hardest problem in data warehousing, materialized views, which has never really been solved, and built its solution on a completely new architecture. The solution is useful by itself, but I'm also watching eagerly to see how their roadmap [1] plays out, as they go back and build out features like persistence and start to look more like a full-fledged data warehouse, but one with the first correct implementation of materialized views.
For a primer on materialized views, and one of the key rationales for Materialize's existence, there's no better presentation than Martin Kleppmann's "Turning the Database Inside-Out" (2015). (At my company it's required viewing for engineers across our stack, because every data structure is a materialized view, no matter whether that data structure lives on the frontend or the backend.)
Confluent is building an incredible business helping companies to build these types of systems on top of Kafka, Samza, and architectural principles originally developed at LinkedIn, but more along the lines of "if you'd like this query to be answered, or this recommender system to be deployed for every user, we can reliably code a data pipeline to do so at LinkedIn scale" than "you can run this query right away against our OLAP warehouse without knowing about distributed systems." (If it's more nuanced than this please correct me!)
On the other hand, Materialize could allow businesses to realize this architecture, with its vast benefits to millisecond-scale data freshness and analytical flexibility, simply by writing SQL queries as if it were a traditional system. As its capabilities expand beyond parity with SQL (though I agree that's absolutely the best place for them to start and optimize), there are tremendous wins here that could power the next generation of real-time systems.
Materialize can help us manifest The Web After Tomorrow [^1].
My previous comments arguing why DDF (differential dataflow) is so crucial to the future of the Web:
> "There is a big upset coming in the UX world as we converge toward a generalized implementation of the "diff & patch" pattern which underlies Git, React, compiler optimization, scene rendering, and query optimization." — https://news.ycombinator.com/item?id=21683385 also with links to prior art like Adapton and Incremental.
Thanks for this, I'm glad to see I'm not the only one tired of writing everything twice (once in the frontend and once in the backend). I'll revisit the links later.
I'm glad more people are tackling this problem. There still isn't a good solution for real-time aggregation of data at large scale.
At a previous company, we dealt with huge data streams (~1TB data / minute) and our customers expected real-time aggregations.
Making an in-house solution for this was incredibly difficult because each customer's data differed wildly. For example:
- Customer A's shards might have so much cardinality that memory becomes an issue.
- Customer B's shards might have so much throughput that CPU becomes the constraint. Sometimes a single aggregation may have so much throughput that you need to artificially increase the cardinality and aggregate the aggregations!
This makes the optimal sharding strategy very complex. Ideally, you want to bin-pack memory-constrained aggregations with CPU-constrained aggregations. In my opinion, the ideal approach involves detecting the cardinality of each shard and bin-packing them.
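To make that concrete, here's a rough sketch of the kind of bin-packing I mean. All numbers, budgets, and names below are made up for illustration; this is not our production code:

```python
# Greedy bin-packing sketch: pair memory-heavy aggregations with
# CPU-heavy ones so no single worker is constrained on one resource.
# Costs are normalized fractions of a worker's budget (illustrative).

def pack_shards(shards, mem_budget, cpu_budget):
    """shards: list of (name, mem_cost, cpu_cost). Returns a list of bins."""
    # Sort by memory descending; place each shard into the first bin
    # with room on BOTH axes, else open a new bin. CPU-heavy shards
    # naturally fill the memory headroom left by memory-heavy ones.
    bins = []  # each bin: {"shards": [...], "mem": float, "cpu": float}
    for name, mem, cpu in sorted(shards, key=lambda s: -s[1]):
        for b in bins:
            if b["mem"] + mem <= mem_budget and b["cpu"] + cpu <= cpu_budget:
                b["shards"].append(name)
                b["mem"] += mem
                b["cpu"] += cpu
                break
        else:
            bins.append({"shards": [name], "mem": mem, "cpu": cpu})
    return bins

shards = [
    ("A1", 0.8, 0.1),  # high cardinality: memory-bound
    ("A2", 0.7, 0.2),
    ("B1", 0.1, 0.8),  # high throughput: CPU-bound
    ("B2", 0.2, 0.7),
]
bins = pack_shards(shards, mem_budget=1.0, cpu_budget=1.0)
```

With these example costs, the greedy pass ends up pairing each memory-bound shard with a CPU-bound one, which is exactly the balance you want.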
I've always found that when you are solving a concrete problem, like you were, it's vastly easier than building a general-purpose database, because you can make all the tradeoffs that benefit your exact use case. But it sounds like that's not what you experienced. Was it just how heterogeneous the clients' needs were? I guess what I'm saying is: if you're capable of handling 1TB/minute, it seems like you're plenty able to design the system yourself, and would want to. But I'm interested in what I'm missing about this.
Late to the post, but if anyone wants a good primer on Materialize (beyond what their actual engineers and a cofounder are saying in the comments), check out the Materialize Quarantine Database Lecture: https://db.cs.cmu.edu/events/db-seminar-spring-2020-db-group...
The headline refers to "incrementally updated materialize views". How does a company get funding for a feature that has already existed in other DBs for at least a decade?
E.g., Vertica refers to this as Live Aggregate Projections.
It's a cool concept but comes with huge caveats: keeping track of non-estimated cardinality for COUNT DISTINCT-type queries, as an example.
(Disclaimer: I'm one of the engineers at Materialize.)
> How does a company get funding for a feature that has already existed in other DBs for at least a decade? ... It's a cool concept but comes with huge caveats.
I think you answered your own question here. Incrementally-maintained views in existing database systems typically come with huge caveats. In Materialize, they largely don't.
Most other systems place severe restrictions on the kind of queries that can be incrementally maintained, limiting the queries to certain functions only, or aggregations only, or only queries without joins—or if they do support maintaining joins, often the joins must occur only on the involved tables' keys. In Materialize, by contrast, there are approximately no such restrictions. Want to incrementally-maintain a five-way join where some of the join keys are expressions, not key columns? No problem.
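To illustrate what "maintaining a join on expressions" means, here's a deliberately tiny sketch of delta-based join maintenance. This is emphatically not how Materialize works internally (differential dataflow handles retractions, timestamps, and much more); it only shows the core idea that an arriving row is joined against the other side's current state, producing just the change rather than re-running the query:

```python
# Toy delta-join: maintain R JOIN S on arbitrary key *expressions*
# by indexing each side and emitting only the new matches per insert.
from collections import defaultdict

class DeltaJoin:
    def __init__(self, key_r, key_s):
        self.key_r, self.key_s = key_r, key_s  # key expressions (functions)
        self.r = defaultdict(list)  # join key -> rows of R seen so far
        self.s = defaultdict(list)  # join key -> rows of S seen so far
        self.view = []              # the maintained join result

    def insert_r(self, row):
        k = self.key_r(row)
        self.r[k].append(row)
        delta = [(row, s_row) for s_row in self.s[k]]  # only new matches
        self.view.extend(delta)
        return delta

    def insert_s(self, row):
        k = self.key_s(row)
        self.s[k].append(row)
        delta = [(r_row, row) for r_row in self.r[k]]
        self.view.extend(delta)
        return delta

# The join key can be an expression, not a key column:
j = DeltaJoin(key_r=lambda r: r["a"] + 1, key_s=lambda s: s["b"])
j.insert_r({"a": 1})      # no matches yet
d = j.insert_s({"b": 2})  # matches the existing R row, since 1 + 1 == 2
```

Each insert does work proportional to the number of new matches, not the size of the relations, which is the essence of incremental maintenance.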
That's not to say there aren't some caveats. We don't yet have a good story for incrementally-maintaining queries that observe the current wall-clock time [0]. And our query optimizer is still young (optimization of streaming queries is a rather open research problem), so for some more complicated queries you may not get the resource utilization you want out of the box.
But, for many queries of impressive complexity, Materialize can incrementally-maintain results far faster than competing products—if those products can incrementally maintain those queries at all.
The technology that makes Materialize special, in our opinion, is a novel incremental-compute framework called differential dataflow. There was an extensive HN discussion on the subject a while back that you might be interested in [1].
In particular, there are important constraints like (among others)
> The projections can reference only one table.
In Materialize you can spin up just about any SQL92 query, join eight relations together, have correlated subqueries, count distinct if you want. It is then all maintained incrementally.
The lack of caveats is the main difference from the existing systems.
> The headline refers to "incrementally updated materialize views". How does a company get funding for a feature that has already existed in other DBs for at least a decade?
They're getting funding for doing it much more efficiently.
I read into the background papers when it first popped up. This is legitimate, deep computer science that other DBs don't yet have.
> All of this comes in a single binary that is easy to install, easy to use, and easy to deploy.
And it looks like they chose a sensible license for that binary [1], so they're not giving too much away.
I wonder though if they could have made this work as a bootstrapped business, so they would answer only to customers, not to investors chasing growth at all costs.
An old coworker explained to me how his previous company used dbt to create many different projections of messy data to serve many applications, rather than trying to come up with the One Canonical Representation. It truly blew my mind in terms of how to think about modeling data within a business.
The huge limitation with this vision is that it only works in places where you can tolerate some pretty significant staleness. So the promise of this approach excludes most OLTP applications. I simply assumed it wouldn't be reasonable to create something that allows for unconstrained SQL-based transformations in real time, and that no one was working on this. Oh well.
But several months back, I discovered Materialize and it was an "oh shit" moment. Someone was actually doing this, and in a very first principles-driven approach. I'm really excited for how this project evolves.
Don’t forget to keep your eyes on the architectural concept of Command Query Responsibility Segregation (CQRS).
When combined with event sourcing [1], a new unified architecture becomes possible, one that solves both the data-fragmentation problem that microservices create [2] and the need for performant querying on data that updates in real time.
This architecture represents more complexity but increased flexibility.
I recently saw this article about federated GraphQL [3], and while it's a cool idea and probably the ultimate solution (API composition), I expect that with network and physical boundaries between services still adding latency, we'll need materialized views as part of the architecture to compensate for the overhead of bringing together aggregate root objects from multiple systems.
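For readers new to the pattern, here's a minimal sketch of CQRS with event sourcing (all names are illustrative): the command side appends immutable events to a log, and the query side folds those events into a materialized projection that reads hit directly.

```python
# Minimal CQRS + event-sourcing sketch (illustrative names only).
events = []        # the append-only event log: the source of truth
balance_view = {}  # materialized projection: account -> balance

def handle_command(cmd):
    # Command side: validate, then record what happened as an event.
    if cmd["type"] == "deposit":
        event = {"type": "deposited", "account": cmd["account"],
                 "amount": cmd["amount"]}
        events.append(event)
        apply_event(event)  # keep projections up to date incrementally

def apply_event(event):
    # Query side: fold each event into the view. Replaying the whole
    # log from scratch rebuilds the identical view.
    if event["type"] == "deposited":
        acct = event["account"]
        balance_view[acct] = balance_view.get(acct, 0) + event["amount"]

handle_command({"type": "deposit", "account": "alice", "amount": 50})
handle_command({"type": "deposit", "account": "alice", "amount": 25})
```

Because the projection is derived from the log, you can add a second, differently-shaped projection later and backfill it by replaying the same events.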
Can you point me at documentation for the fault tolerance of the system? A huge issue for streaming systems (and largely unsolved AFAIK) is being able to guarantee that counts aren't duplicated when things fail. How does Materialize handle the relevant failure scenarios in order to prevent inaccurate counts/sums/etc?
I think the right starter take is that Materialize is a deterministic compute engine, one that relies on other infrastructure to act as the source of truth for your data. It can pull data out of your RDBMS's binlog, out of Debezium events you've put into Kafka, out of local files, etc.
On failure and restart, Materialize leans on the ability to return to the assumed source of truth, again an RDBMS + CDC, or perhaps Kafka. I don't recommend thinking of Materialize as a place to sink your streaming events at the moment (though there is movement in that direction, because the operational overhead of things like Kafka is real).
The main difference is that unlike an OLTP system, Materialize doesn't have to make and persist non-deterministic choices about e.g. which transactions commit and which do not. That makes fault-tolerance a performance feature rather than a correctness feature, at which point there are a few other options as well (e.g. active-active).
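To make the determinism point concrete with a toy (not Materialize's actual recovery code): if the maintained view is a pure fold over the upstream CDC log, then replaying the log after a restart reproduces exactly the same state, so a crash costs recovery time rather than correctness.

```python
# Deterministic replay sketch: the maintained view is a pure function
# of the upstream CDC log, so recovery is just re-reading the log.

def apply(view, change):
    # change: (key, delta) from an upstream binlog/CDC stream
    key, delta = change
    view[key] = view.get(key, 0) + delta
    if view[key] == 0:
        del view[key]  # fully retracted keys disappear from the view
    return view

def replay(log):
    view = {}
    for change in log:
        apply(view, change)
    return view

log = [("a", 2), ("b", 1), ("a", -2)]
live = replay(log)       # state before the "crash"
recovered = replay(log)  # state after restart, from the same log
```

No matter how many times the node restarts, re-folding the same log yields the same view, which is why durability can live upstream.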
This is a solved problem, and has been for a few years now. The basic trick is to publish "pending" messages to the broker which are ACK'd by a later written message, only after the transaction and all its effects have been committed to stable storage (somewhere). Meanwhile, you also capture consumption state (e.g., offsets) into the same database and transaction within which you're updating the materialization results of a streaming computation.
Here's [1] a nice blog post from the Kafka folks on how they approached it.
Gazette [2] (I'm the primary architect) also solves this, with some different trade-offs: a "thicker" client, but with no head-of-line blocking and reduced end-to-end latency.
Estuary Flow [3], built on Gazette, leverages this to provide exactly-once, incremental map/reduce and materializations into arbitrary databases.
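Here's a minimal sketch of the offsets-in-the-same-transaction half of the trick, with SQLite standing in for "stable storage" and the pending/ACK broker protocol omitted (all table and function names are illustrative):

```python
# Exactly-once sketch: commit consumed offsets and materialized results
# in ONE transaction, so a crash between "process" and "ack" cannot
# double-count. Redelivered messages are detected via the stored offset.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER)")
db.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, pos INTEGER)")

def consume(part, offset, key):
    """Process one message; idempotent under redelivery."""
    row = db.execute(
        "SELECT pos FROM offsets WHERE part = ?", (part,)).fetchone()
    if row is not None and offset <= row[0]:
        return  # already processed: a redelivered duplicate, skip it
    # The result update and the offset advance commit atomically.
    with db:
        db.execute(
            "INSERT INTO counts VALUES (?, 1) "
            "ON CONFLICT(key) DO UPDATE SET n = n + 1", (key,))
        db.execute(
            "INSERT INTO offsets VALUES (?, ?) "
            "ON CONFLICT(part) DO UPDATE SET pos = ?",
            (part, offset, offset))

consume(0, 1, "a")
consume(0, 2, "a")
consume(0, 2, "a")  # duplicate delivery after a "crash": ignored
```

The count for "a" ends up at 2, not 3, despite the redelivery, because the offset check and the aggregate live in the same database.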
This is maybe a silly question, but what's the difference between timely dataflow and Spark's execution engine? From my understanding they're doing very similar things - break down a sequence of functions on a stream of data, parallelize them on several machines, and then gather the results.
I understand that the feature set of timely dataflow is more flexible than Spark - I just don't understand why (I couldn't figure it out from the paper, academic papers really go over my head).
Does anyone know how Materialize stacks up against VIATRA in terms of performance? VIATRA seems very similar to Materialize. They have multiple algorithms implemented to incrementalize queries, including Differential Dataflow. The main difference seems to be that it's based on Graph Patterns instead of SQL.
It's a good question, but you'd have to ask them I think. Tamas (from Itemis) and I were in touch for a while, mostly shaking out why DD was out-performing their previous approach, but I haven't heard from him since.
My context at the time was that they were focused on doing single rounds of incremental updates, as in a PL UX, whereas DD aims at high throughput changes across multiple concurrent timestamps. That's old information though, so it could be very different now!
Is there a single comprehensive list of restrictions on what can and can't be materialized? For example, if SQL Server can't efficiently maintain your materialized view, it doesn't let you create it; the whole list of restrictions is here: https://docs.microsoft.com/en-us/sql/relational-databases/vi...
I'd love to be able to directly compare this with what Materialize is capable of. Does a similar document exist?
It's easier to describe the things that cannot be materialized.
The only rule at the moment is that you cannot currently maintain queries that use the functions `current_time()`, `now()`, and `mz_logical_timestamp()`. These are quantities that change automatically without data changing, and shaking out what maintaining them should mean is still open.
Other than that, any SELECT query you can write can be materialized and incrementally maintained.
I was wondering if Materialize is meant to be used in analytical workloads only, or would it be equally up to the task for consumer app kind of workloads as well?
[1] https://materialize.com/blog-roadmap/
https://www.confluent.io/blog/turning-the-database-inside-ou...
https://flink.apache.org/2020/07/28/flink-sql-demo-building-...
> "DD (Differential Dataflow) is commercialized in Materialize" — https://news.ycombinator.com/item?id=24846119
> "Materialize exists to efficiently solve the view maintenance problem" https://news.ycombinator.com/item?id=22888396
[0]: https://github.com/MaterializeInc/materialize/issues/2439
[1]: https://news.ycombinator.com/item?id=22359769
You can read about Vertica's "Live Aggregate Projections" here:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/An...
https://docs.rs/timely/0.11.1/timely/
[1]: https://materialize.com/download/
If your goal is an exit, and you can raise this much, why not.
[1] https://www.confluent.io/blog/event-sourcing-cqrs-stream-pro...
[2] https://microservices.io/patterns/data/cqrs.html
[3] https://netflixtechblog.com/how-netflix-scales-its-api-with-...
[1]: https://www.confluent.io/blog/exactly-once-semantics-are-pos...
[2]: https://gazette.readthedocs.io/en/latest/architecture-exactl...
[3]: https://estuary.readthedocs.io/en/latest/README.html
https://news.ycombinator.com/item?id=25267734
https://materialize.com/docs/sql/select/
https://materialize.com/lateral-joins-and-demand-driven-quer...