top | item 26218859

nchammas | 5 years ago

> If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementational detail, the only thing you care about is the latency target.

Author here. 100% agreed.

As an aside, I just came across your post about how Databricks is an RDBMS [0]. I recently wrote a similar article from a slightly more abstract perspective [1].

Having worked heavily with RDBMSs in the first part of my career, I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling. And that was part of my inspiration for this post about data pipelines.

[0] https://fivetran.com/blog/databricks-is-an-rdbms

[1] https://nchammas.com/writing/modern-data-lake-database


lmm|5 years ago

https://www.confluent.io/blog/turning-the-database-inside-ou... might be interesting if you haven't seen it already. The way I see it, RDBMSes already had most of the tech, but they wrapped it up in an opaque black box that can only be used one way, like those automagical frameworks that generate a full webapp from a couple of class definitions but then fall apart as soon as you want to do something slightly specialised. So the data revolution is really about unbundling all the cool RDBMS tech and moving from a shrinkwrapped database product to something more like a library that gives you full control over what happens when, letting you integrate your business logic wherever it makes sense.
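The "database inside out" idea can be sketched in a few lines: an append-only change log is the source of truth, and any materialized view is just derived state you can rebuild by replaying the log. This is a toy illustration of the concept, not any real system's API; all names are made up.

```python
# Toy sketch of the "database inside out" pattern: an append-only
# change log plus a fold that derives a materialized view from it.
# Illustrative only -- not Kafka's or any real system's API.

def apply_change(view, change):
    """Fold one change event into the materialized view (a dict)."""
    op, key, value = change
    if op == "upsert":
        view[key] = value
    elif op == "delete":
        view.pop(key, None)
    return view

def materialize(log):
    """Replay the log from scratch -- the view is purely derived state."""
    view = {}
    for change in log:
        apply_change(view, change)
    return view

log = [
    ("upsert", "user:1", {"name": "Ada"}),
    ("upsert", "user:2", {"name": "Alan"}),
    ("delete", "user:2", None),
]
print(materialize(log))  # {'user:1': {'name': 'Ada'}}
```

The point of the unbundling: because the view is a pure fold over the log, you can swap in any "index" you like (an RDBMS table, a key/value store, a cache) as the materialization target, and your business logic lives in the fold rather than inside the database's black box.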

jgraettinger1|5 years ago

> I feel like so many of the concepts and patterns I learned about there are being re-expressed today with modern, distributed data tooling

So much this.

I work on Flow [0] at Estuary [1]; you might be interested in what we're doing. It offers millisecond-latency continuous datasets that are _also_ plain JSON files in cloud storage -- a real-time data lake -- which can be declaratively transformed and materialized into other systems (RDBMS, key/value, even pub/sub) using precise, continuous updates. It's the materialized view model you articulate, and Flow is in essence composing data lake-ing with continuous map/combine/reduce to make it happen.
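The "continuous map/combine/reduce" shape described above can be sketched generically: each incoming JSON document is mapped to keyed partial aggregates, which are reduced into a continuously updated view. This is a hypothetical illustration of the pattern, not Flow's actual API; the function names and document schema are invented.

```python
# Generic sketch of continuous map/combine/reduce over a stream of
# JSON documents. Hypothetical names and schema -- not Flow's API.
import json

def map_doc(doc):
    # map: emit (key, partial-aggregate) pairs from one document
    yield doc["user"], {"count": 1, "total": doc["amount"]}

def reduce_partials(a, b):
    # reduce: merge two partial aggregates for the same key
    return {"count": a["count"] + b["count"], "total": a["total"] + b["total"]}

view = {}

def on_document(line):
    """Fold one newline-delimited JSON document into the live view."""
    for key, partial in map_doc(json.loads(line)):
        view[key] = reduce_partials(view[key], partial) if key in view else partial

for line in [
    '{"user": "a", "amount": 5}',
    '{"user": "b", "amount": 3}',
    '{"user": "a", "amount": 2}',
]:
    on_document(line)

print(view)  # {'a': {'count': 2, 'total': 7}, 'b': {'count': 1, 'total': 3}}
```

Because the reduce step is associative, partial aggregates can be combined anywhere in the pipeline, which is what lets the same logic materialize into an RDBMS, a key/value store, or pub/sub with only the final sink changing.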

I was asked the other day if Flow "is a database" by someone who only wanted a 2-3 sentence answer, and I waffled badly. The very nature of "databases" is so fuzzy today. They're increasingly unboxed across the Cambrian explosion of storage, query, and compute options now available to us: S3 and friends for primary storage; on-demand MPP compute for query and transformation; wide varieties of RDBMSs, key/value stores, OLAP systems, even pub/sub as specialized indexes for materializations. Flow's goal, in this worldview, is to be a hypervisor and orchestrator for this evolving cloud database. Not a punchy elevator pitch, but there it is.

[0] https://estuary.readthedocs.io/en/latest/README.html

[1] https://estuary.dev/

kthejoker2|5 years ago

Strong concur. The idea of "database" just doesn't make sense anymore; I've (d?)evolved to "storage, streaming, and compute" (as it seems you have as well) in all my discussions.

Have you read ThoughtWorks material on the "data mesh"? Sounds like your product is looking to be a part of that kind of new data ecology.

https://martinfowler.com/articles/data-mesh-principles.html

It is definitely anticlimactic, after data warehousing, data lakes, data lakehouses, etc., to just throw up your hands and say "data whenever you want it, wherever you want it, in whatever form you want it" (at whatever price you're willing to pay!). So I feel your pain on marketing your product, but I think the next 5 years or so will be heavily focused on automating data quality and standardized pipelines, computational governance and optimizing workloads, and intelligent "just in time" materialization, caching, HTAP, etc.

Your big play in my mind is helping customers optimize the (literal financial) cost/benefit tradeoff of all those compute and query engines.

nchammas|5 years ago

Interesting project! (The interactive slides are cool btw.)

Could you share a bit about how engineers express data transformations in Flow?

From a quick look at the docs, it doesn't look like you are using SQL to do this, which is interesting since it bucks the general trend in data tooling.