Materialize: A Streaming Data Warehouse

luhn|6 years ago

I didn't really understand what the product actually did after reading this blog post or the products page. I found the docs much more edifying:

> Materialize lets you ask questions about your data, and then get the answers in real time.

> Why not just use your database’s built-in functionality to perform these same computations? Because your database often acts as if it’s never been asked that question before, which means it can take a long time to come up with an answer, each and every time you pose the query.

> Materialize instead keeps the results of the queries and incrementally updates them as new data comes in. So, rather than recalculating the answer each time it’s asked, Materialize continually updates the answer and gives you the answer’s current state from memory.

> Importantly, Materialize supports incrementally updating a much broader set of views than is common in traditional databases (e.g. views over multi-way joins with complex aggregations), and can do incremental updates in the presence of arbitrary inserts, updates, and deletes in the input streams.

https://materialize.io/docs/

yomly|6 years ago

This reminds me a lot about Noria DB. Wonder if anyone familiar with both can shed any further light?

pedalpete|6 years ago

I had the same response. So reading what you've posted here, it appears to be a smart(er) cache.

Do you think it's valuable to have this as a service rather than roll your own solution?

slap_shot|6 years ago

> We believe that streaming architectures are the only ones that can produce this ideal data infrastructure.

I just want to say this is a very dangerous assumption to make.

I run a company that helps our customers consolidate and transform data from virtually anywhere in their data warehouses. When we first started, the engineer in me made the same declaration, and I worked to get data into warehouses seconds after and event or record was generated in an origin system (website, app, database, salesforce, etc).

What I quickly learned was that analysts and data scientists simply didn't want or need this. Refreshing the data every five minutes in batches was more than sufficient.

Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

To be clear, I don't discount the need of telemetry and _some_ data to be actionable in a smaller time frame, I'm just weary of a data warehouse fulfilling that obligation.

In any event, I do think this direction is the future (an overwhelming amount of data sources allow change data capture almost immediately after an event occurs), I just don't think it's only architecture that can satisfy most analysts'/data scientists' needs today.

I would love to hear the use cases that your customers have that made Materialize a good fit!

Darkphibre|6 years ago

I architected and implemented a true-realtime telemetry pipeline. The requirement was subsecond per-user aggregation and round-trip notification of thresholds exceeded. Took us a couple years, but when Halo 5 launched, we handled 2.5B events/hour without breaking a sweat (AMQP over Websockets). It's since been rolled out to multiple Microsoft 1st-party games.

The round-trip requirement was dropped before we launched, reducing the usage of the technology stack to pure telemetry gathering.

The analysts are all perfectly happy with 5-10 minute delays.

Link to my GDC talk, in case people are interested: https://www.youtube.com/watch?v=o098roxWAkA

benesch|6 years ago

> I just want to say this is a very dangerous assumption to make.

I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.

But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.

And once you have streaming data warehouse in place for the use cases that really demand the single-digit latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.

To give you an example, let me respond to this point directly:

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.

[0]: https://github.com/TimelyDataflow/differential-dataflow

thom|6 years ago

I work in the sports stats industry and we've been talking to Materialize for a while. We have a lot of demand for real time data, especially from the media and gambling sectors. The SLAs are mostly in seconds not milliseconds, so we currently don't have to wring every last ounce of latency out of our pipeline, but I'm very excited about the product because it offers a good balance of performance and developer productivity. I can even imagine a universe in which we directly offer the capability to customers to be able to build their own real-time KPIs in a language that is fairly accessible and easy to hire for.

derefr|6 years ago

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data.

The point of Materialize, from my understanding, is that you don't put things into the data warehouse and then, as a separate step, run these enrichment/reporting jobs on it.

Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.

I believe you end up with a consistent MVCC snapshot between the source data and its dependent inc-matviews, where you can never see new source data and old derived data in the same query; sort of as if the inc-matview were being updated by a synchronous ON INSERT trigger on the original raw-data ingest transaction. (Or—closer analogy—sort of like the raw table has a computed-expression index visible as a standalone relation.)

Also, the job itself is written as a regular OLAP query over a fixed-snapshot dataset, as a data-scientist would expect; but this gets compiled by the query planner into a continuous CTE/window query sorta thing, that spits out running totals/averages/etc. as it gets fed incremental row-batches—but only at the granularity that consumers are demanding from it, either by directly querying the inc-matview, or by sourcing it in their own streaming jobs. Which, of course, you do by just using a regular OLAP query on inc-matview X in your definition of inc-matview Y.

> I would love to hear the use cases that your customers have that made Materialize a good fit!

We haven't used Materialize yet, but it's a very promising fit for our use-case: high-level financial-instrument analytics on blockchain data. We need to take realtime-ish batches of raw transaction data and extract, in est, arbitrarily-formatted logs from arbitrary program-runs embedded in the transaction data (using heavy OLAP queries); recognize patterns in those that correspond to abstract financial events (more heavy OLAP queries); enrich those pattern-events with things like fiscal unit conversions; and then emit the enriched events as, among other things, realtime notifications, while also ending up with a data warehouse populated with both the high- and low-level events, amenable to arbitrary further OLAP querying by data scientists.

je42|6 years ago

I think it would be helpful if you could dive deeper why you think " Refreshing the data every five minutes in batches" is "sufficient".

From my perspective: batching is more complicated, than batching. (Batching requires you to define parameters like batch size and interval, while streaming does not for example). But may be batching tools are simpler than streaming tools, but i am not so sure.

Batching in general has also high(er) latency. That's why I usually don't prefer it unless:

That said batching has an advantage over streaming, it can ammortise a cost that you only pay once per batch process. With streaming you would pay the cost for each items as it arrives.

Further, the mindset requirements for engineers that work with batching is different than for streaming.

Each of these items can be valid concern for batching vs streaming. However, I find it difficult to value statements like "Batching" is the default because the industry has been doing this for years by default.

I think the industry as a whole benefits when engineers in these kind of discussions repeat why certain conditions lead to a choice like batching.

chrisjc|6 years ago

Would it be fair to say this is a more OLAP-oriented approach to what KSqlDB (not KSql, but https://ksqldb.io/) does?

Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.

Eg:

KC-source --> ksql --> materialize --> kafka --> KC-sink

Question to Materialize...

What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.

Why would I use this over KSqlDB?

Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?

benesch|6 years ago

> Would it be fair to say this is a more OLAP-oriented approach to what KSqlDB (not KSql, but https://ksqldb.io/) does?

I'm not sure I'd say it's "more OLAP." ksqlDB is about as OLAP as it gets, considering it doesn't support any sort of transactions or consistency. We think Materialize is quite a bit more powerful than what ksqlDB offers, thanks to the underlying technologies (timely/differential). For example, our joins are proper SQL joins, and don't require you to reason about the complicated and confusing difference between a stream and a table (https://docs.ksqldb.io/en/latest/developer-guide/joins/join-...). We also have preliminary support for maintaining the consistency properties of upstream OLTP data sources, and we'll be rolling out a more complete story here shortly.

> Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.

Is there something in particular about ksqlDB connectors that we don't seem to support? Our CREATE SOURCE command is quite powerful: https://materialize.io/docs/sql/create-source/.

> What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.

We already support file sources in a variety of formats, and support for Amazon Kinesis is on the short-term roadmap: https://github.com/MaterializeInc/materialize/issues/1239.

Are there other connector types you'd like to see?

> Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?

At the moment you can't snapshot and resume, but support for this is planned.

dswalter|6 years ago

Presumably one reason to use this is latency: materialize is built on differential and timely dataflow, some innovative frameworks (by Frank McSherry, who was at Microsoft Silicon Valley Research back in the day and is elsewhere on this HN discussion) that are intended to reduce the amount of computation in certain kinds of calculations. Materialized views are particularly ripe for those advances.

It's also written in rust instead of Java, so there's no JVM RAM penalty or GC to contend with.

drej|6 years ago

I really like the pg protocol (like e.g. Cockroach), it let me use my usual tools. There are a few things I noticed:

1. It has a fairly rich support for types - these new-ish SQL engines often lack quite a lot of things, but this seems pretty decent. 2. I don't see any comparisons to KSQL, which seems to be the primary competitor. 3. Read the license. Read it carefully. It has a weird "will become open source in four years" clause, so keep that in mind. It also disallows it being hosted for clients to use (esentially as a DBaaS).

jamii|6 years ago

The main difference vs KSQL is that we support standard SQL (roughly SQL92 + correlated subqueries + json so far) and provide strong consistency:

KSQL has a distinction between streams and tables, effectively giving you control over how views are materialized but also forcing you to do that work yourself. In Materialize you just write the same SQL that you would for a batch job and the planner figures out how to transform it into a streaming dataflow.

KSQL is also eventually consistent - late arriving records might cause changes to previously emitted answers. Materialize won't give you an answer for a given timestamp until it can guarantee that the answer is correct and won't be retroactively updated later.

Expect to see some performance comparisons soon too.

irfansharif|6 years ago

(Linked in the post but) github repo: https://github.com/MaterializeInc/materialize

samuell|6 years ago

For anyone interested in the details behind all of this, you should check out Frank's blog:

https://github.com/frankmcsherry/blog

kiwicopple|6 years ago

For anyone that might be considering trying something similar with their own Postgres database (PG10+), we recently opensourced this: https://github.com/supabase/realtime

It's an Elixir (Phoenix) server that listens to PostgreSQL's native replication, transforms it into JSON, then blasts it over websockets.

I see that Materialize are using Debezium, which will give you a similar result, just with connectors to Kafka etc

thom|6 years ago

The magic here isn't that it's giving you streaming updates of a database, it's that it's making some of those updates ridiculously fast. The point is that if you have a materialized view that takes a long time to update (or just a stack of views that are slow to return results), you now have results instantly whenever changes come in. If you built a bunch of code to work around this and do clever invalidation of cached data, you get to throw all that away. The input/output into the system is less of an issue overall.

yayr|6 years ago

I am curious about the physical storage. Is it purely in-memory or is there a disk persistency possible? Is there some kind of data compression applied or what are the memory needs of it? Is it a row or column based data persistence pattern?

gaogao|6 years ago

The "you may not cluster any server instances of the Licensed Work together for one use" in the license is a fairly tricky clause. Under this clause, how would one run a fault-tolerant instance of Materialize?

benesch|6 years ago

You would need to contact us to buy an enterprise license. (Or wait four years for the license to convert to Apache 2.)

streblo|6 years ago

How does materialize compare in performance (especially ingress/egress latency) to other OLAP systems like Druid or ClickHouse? Would love to see some benchmarks.

jamii|6 years ago

We're working on benchmarks, but I expect that at the moment Materialize will be slower for one-off queries but faster for frequently repeated queries.

grumblestumble|6 years ago

This appears to not be fully open source: https://github.com/MaterializeInc/materialize/blob/master/LI...

solidangle|6 years ago

> Blazing fast results

I highly doubt this, given that the query engine is interpreted and non-vectorized. Queries are 10x to a 100x slower on a simple query, and 100x to 1000x slower on a query with large aggregations and joins without compilation of vectorization.

> Full SQL Exploration

Except for window functions it seems. These actually matter to data analysts.

benesch|6 years ago

Considerations are completely different in a streaming context. It’s not so much about how fast you can churn through terabytes of data; it’s more about how quickly you can turn around the incremental computation with each new datum. There’s some serious research behind this product, in timely and differential dataflow, and I’d encourage you to check out some of that research before making sweeping performance claims. Frank’s blog post on TPC-H is a good place to start: https://github.com/frankmcsherry/blog/blob/master/posts/2017...

We definitely have some performance engineering work to do in Materialize, but don’t let the lack of vectorization scare you off. It’s just not as important for a streaming engine.

edmundsauto|6 years ago

It's one thing to be skeptical and ask for evidence of speed, another to dismiss them out of hand due to a casual review of their website. Or did I miss it that you tried it out and found it wanting?

I work in an org with > 100 data scientists. I bet that 50% have never used window functions. I would guess than fewer than 20% know how to write one.

Your intent in this comment is unclear, but if you were looking to provide actionable feedback, you might want to reconsider your tone. This project looks like a pretty impressive feat of applied CS theory to doing useful stuff.

solidangle|6 years ago

FWIW, I do think this project is really cool. I should have taken a little bit more time to write this comment, as it's overly negative right now.

1290cc|6 years ago

Pretty cool tech although I feel they may have missed the moment as AWS, Azure and GCP are becoming hypercompetitive to solve all things related to data/storage. Azure has been churning out major updates to its services and clearly taking inspiration from companies like Snowflake. AWS I think hesitated to compete with Snowflake as they were running on AWS anyway - win/win for them.

Snowflake had incredible timing as they hit the market just before CFO's and non-tech business leaders realized the cost and talent needed to pull off a datalake successfully was more than they'd like. Those that were sick of the management jumped to Snowflake fast and AWS/Azure never really responded until recently.

Awesome to see all the innovative takes on solving these extremely technical problems! I love it!

manigandham|6 years ago

Google's data product suite seems to be the most advanced today with a completely no-ops approach and very solid primitives to work with.

Snowflake still doesn't an answer to streaming data (other than their fragile Kafka connector) and Azure Synapse still isn't publically available even months after their announcement.

AWS had a good headstart but they keep piling on more products and features that it's now a big mess and requires yet another tool (AWS Lake Formation) just to wire it all up.

cbdumas|6 years ago

I'm not sure if it's a mistake or just some cheeky humor but the "Known Limitations" link on their Docs page returns a 404.

frankmcsherry|6 years ago

It looks like a caching bug somewhere in the web pipeline. It isn't running on Materialize, sadly.

There are a pile of known limitations, and our goal is to be 100% clear about them.

manigandham|6 years ago

Congrats on the launch, always nice to see new products.

This is an interesting mix between the (now obsolete) PipelineDB, TimescaleDB with continuous aggregates, Kafka and other message systems with KSQL/ksqlDB/KarelDB, stream processing engines like Spark, and typical RDBMS like SQL Server with materialized views.

The amount of research to support complex and layered queries definitely sets this apart.

mason55|6 years ago

Not sure how the featuresets compare but AWS is releasing materialized views for Redshift sometime soon and one of the things it will support is incremental refresh (assuming your view meets some criteria).

I'm sure Materialize is better at this since it's purpose-built but if you're on Redshift you can get at least some of the benefits of incremental materialize.

jerglingu|6 years ago

Materialized views, the upcoming ability to query directly from an RDS transactional DB through Redshift, and finally true separation of compute vs storage with RA nodes IMO make Redshift the market leader by a huge margin now. I haven't actually tested RA nodes yet, but if performance is even just a fraction of legacy nodes then the competition is already dead. Redshift already is one of the best engines to optimize a query

chrisjc|6 years ago

It's been a while since I've used Redshift, but isn't it still dependent on data coming in via a COPY from S3? Any sort of Redshift materialized view offering would depend on batches of data landing in an underlying table or tables. The closest service offering from AWS is probably using Kinesis analytics (or Flink on KA) using their flavor of streaming SQL to join Kinesis streams forming new ones.

wiradikusuma|6 years ago

Materialize connects directly to event stream processors (like Kafka) --- how about Pulsar? (Goggling doesn't yield anything useful, Materialize and Pulsar are both name of brands of other things)

chris_overseas|6 years ago

Connecting Pulsar and Materialize is of interest to me too and something I might try when I find time to do so. Note that Pulsar does have a Kafka compatibility layer already[0], so it might just work out of the box. If you try this I'd be keen to hear how it goes.

[0] https://pulsar.apache.org/docs/en/adaptors-kafka/

EDIT: I don't think this adaptor will work after all, it works by replacing the Kafka Java client library with its own, so is only applicable to Java applications.

simo7|6 years ago

I'm wondering how this technology could work for OLAP cubes.

An OLAP cube that is automatically & incrementally kept in sync with the changes in the source data sounds promising.

Is that a potential use case?

polskibus|6 years ago

Looking back at the project, knowing what you know now, if you were to start again (but without obtained rust skills), would you go with rust again or pick another toolbox?

jamii|6 years ago

> would you go with rust again

Absolutely. Even aside from safety, rust has so many quality of life improvements over C++ - algebraic data types, pattern matching, checked errors instead of exceptions, sane standard library, easy packaging, readable error messages, #[must_use], immutable utf8 strings, traits instead of templates, hygienic macros etc.

Other than compile times and editor tooling, which are being actively worked on, the only real pain point I can think of is the poor support for custom allocators.

As for other alternatives, I don't personally have strong opinions on go but I think it's notable that none of the ex-cockroachdb folks at materialize suggested using it instead of rust.

mnkmnk|6 years ago

How does this compare to Spark structured streaming? That too allows writing SQL on streams and having the state update incrementally.

gbrits|6 years ago

Is this similar to TimescaleDB's Continuous Aggregates? Interested in knowing the overlap / differences.

benesch|6 years ago

Similar in concept, but much more powerful in execution. We can incrementally materialize practically any SQL 92 query, with the killer features being joins and correlated subqueries. I’m not super familiar, but TimescaleDB’s continuous aggregates (just on a cursory glance) don’t support stacked aggregates, for example: https://github.com/timescale/timescaledb/issues/1400

rl3|6 years ago

Their careers page says it's built in Rust.

That's always nice to see, since Rust jobs are somewhat rare.

gbrits|6 years ago

Looks promising! Can materialized views be backfilled?

vhold|6 years ago

According to the documentation at https://materialize.io/docs/overview/architecture/

> Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data. However, you can union streaming and file sources in views, which accomplishes a similar outcome.

What I find odd is that the documentation doesn't show UNION support in the SELECT documentation : https://materialize.io/docs/sql/select/

Perhaps just an oversight in the documentation?

unknown|6 years ago

[deleted]

sashavingardt2|6 years ago

I'm intrigued by the product. The SQL examples on the product page are atrocious though ((((

programmarchy|6 years ago

Isn't BigQuery real-time as well? For me, the wow factor is that you can host this yourself.

chrisjc|6 years ago

I'm unfamiliar with BQ, but from what I understand BQ doesn't have any concept of streams... with maybe an exception for change-streams (change tracking) and that's very different and less powerful than what Materialize is able to do.

After all, in BQ you're only able to query what has been already has been ingested or is being digested and that is somewhat by definition not 'real-time'.

manigandham|6 years ago

BigQuery does have streaming ingest and can pickup new rows immediately but you still have to run a query to get the latest results.

This system is more like BigQuery with materialized views that are constantly updated as new data arrives.

tschmidleithner|6 years ago

dang|6 years ago

This comment was briefly downvoted and flagged because the convention on HN is only to link to past threads that have comments. Otherwise users click on the link, don't find the discussion they expected, and get ornery.

Since you're new and likely didn't know this, I've restored your post. (And welcome!)

mrtweetyhack|6 years ago

[deleted]

justlexi93|6 years ago

Clickhouse has materilized views and is free.

vhold|6 years ago

I think the biggest difference is that Materialize can do any kind of SQL join on many tables at once. Clickhouse materialized views can only reference one table.

What I'd like to know is if that would enable basically implementing social networks as just 3 tables and one materialized view, and how it would scale and perform.

Users, Relationships, Post, and a Feed materialized view that simply joins them together with an index of user_id and post_timestamp.

As relationships and messages are created or deleted, the feed view is nearly instantly updated. The whole entire view service logic then is just one really fast query. "select user,post,post_timestamp from feed where user_id = current_user and post_timestamp <= last_page_post_timestamp order by post_timestamp desc limit page_size"

100 comments