psfried's comments

psfried | 1 year ago | on: Streaming joins are hard

Yes, and this is an important point! It's the reason for our current approach to sqlite derivations. You can absolutely just store all the data in the sqlite database, as long as it actually fits. And there are cases where people actually do this on our platform, though I don't think we have an example in our docs.

A lot of people just learning about streaming systems don't come in with useful intuitions about when they can and can't use that approach, or even that it's an option. We're hoping to build up to some documentation that can help new people learn what their options are, and when to use each one.

psfried | 1 year ago | on: Streaming joins are hard

I agree completely! We've always talked about this, but we haven't really seen a clear way to package it into a good developer UX. We've got some ideas, though, so maybe one day we'll take a stab at it. For now we've been more focused on integrations and just building out the platform.

psfried | 1 year ago | on: Streaming joins are hard

The main benefit isn't necessarily that it's _streaming_ per se, but that it's _incremental_. We typically see people start by just incrementally materializing their data to a destination in more or less the same set of tables that exist in the source system. Then they develop downstream applications on top of the destination tables, and they start to identify queries that could be sped up by incrementally pre-computing some portion of them before materializing.

There are also cases where you just want real-time results. For example, if you want to take action based on a joined result set, then in the RDBMS world you might periodically run a query that joins the tables and check whether you need to act. But polling becomes increasingly inefficient as you shorten the polling interval. So it can work better to incrementally compute the join results, letting you take action the moment something appears in the output. Think use cases like monitoring, fraud detection, etc.
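To make the contrast concrete, here's a minimal sketch (in Python, with hypothetical user/order events) of an incremental two-way join: instead of re-running the join query on a timer, each side's rows are indexed by key as events arrive, and joined results are emitted immediately.

```python
# Minimal sketch of an incremental two-way join (hypothetical schema).
# Instead of polling "SELECT ... JOIN ..." on a timer, we index each
# side's rows by key as events arrive and emit joined results at once.
from collections import defaultdict

left_index = defaultdict(list)   # key -> left-side rows seen so far
right_index = defaultdict(list)  # key -> right-side rows seen so far

def on_event(side, key, row, emit):
    """Index the new row, then join it against the other side's index."""
    if side == "left":
        left_index[key].append(row)
        for other in right_index[key]:
            emit((key, row, other))
    else:
        right_index[key].append(row)
        for other in left_index[key]:
            emit((key, other, row))

results = []
on_event("left", 42, {"name": "alice"}, results.append)
on_event("right", 42, {"order": "A-1"}, results.append)
# The joined pair is available as soon as the second event arrives,
# with no polling loop in between.
```

A real system also has to worry about state size and retention windows, which is exactly where the "store it all, as long as it fits" trade-off above comes in.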

psfried | 1 year ago | on: Gazette: Cloud-native millisecond-latency streaming

To my knowledge, nobody's implemented parquet fragment files. But it supports compression of JSONL out of the box. JSON compresses very well, and compression ratios approaching 10:1 are not uncommon.
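As a rough illustration of that ballpark figure, here's gzip compressing some synthetic, repetitive JSONL; the exact ratio depends entirely on the data, so treat 10:1 as a rule of thumb, not a guarantee.

```python
# Rough illustration of how well repetitive JSONL compresses (synthetic data).
import gzip, json

lines = "".join(
    json.dumps({"event": "page_view", "user_id": i % 100, "path": "/home"}) + "\n"
    for i in range(10_000)
)
raw = lines.encode("utf-8")
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)
# Repeated keys and values compress extremely well; real-world event data
# with more entropy will land somewhere lower.
```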

But more to the point, journals are meant for things that are written _and read_ sequentially. Parquet wasn't really designed for sequential reads, so it's unclear to me whether there would be much benefit. IMHO it's better to use journals for sequential data (think change events) and other systems (e.g. RDBMS or parquet + pick-your-compute-flavor) for querying it. I don't think there's yet a storage format that works equally well for both.

psfried | 1 year ago | on: Gazette: Cloud-native millisecond-latency streaming

I don't think it's correct to say that JSONL is any more vulnerable to invalid data than other message framings. There's literally no system out there that can fully protect you from bugs in your own application. But the client libraries do validate the framing for you automatically, so in practice the risk is low. I've been running decently large Gazette clusters for years now using the JSONL framing, and have never seen a consumer write invalid JSON to a journal.
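To show how simple the framing check is, here's a sketch in Python (not Gazette's actual client code) of writing and validating line-delimited JSON frames:

```python
# Sketch of the kind of framing check a client library can do automatically:
# every record serializes to a single line of valid JSON, and embedded
# newlines are escaped by the encoder, so they can't break the framing.
import json

def frame(record) -> bytes:
    line = json.dumps(record)            # json.dumps never emits raw newlines
    return (line + "\n").encode("utf-8")

def read_frames(data: bytes):
    for line in data.splitlines():
        yield json.loads(line)           # raises if a writer produced garbage

buf = frame({"id": 1}) + frame({"id": 2, "note": "line\nbreaks get escaped"})
records = list(read_frames(buf))
```

Any framing bug shows up as a parse error on read, rather than silently corrupting downstream consumers.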

The choice of message framing is left to the writers/consumers, so there's also nothing preventing you from using a message framing that you like better. Similarly, there's nothing preventing you from adding metadata that identifies the writer. Having this flexibility can be seen as either a benefit or a pain. If you see it as a pain and want something that's more high-level but less flexible, then you can check out Estuary Flow, which builds on Gazette journals to provide higher-level "Collections" that support many more features.

psfried | 2 years ago | on: Data-Oriented Design Principles

In contrast to jebarker's comment, I actually think it's really interesting that a concept coming from game engine development actually seems quite applicable in some very different domains.

We (https://estuary.dev/) ended up arriving at a very similar design for transformations in streaming analytics pipelines: https://docs.estuary.dev/concepts/derivations/

To paraphrase, each derivation produces a collection of data by reading from one or more source collections (DOD calls these "streams"), optionally updating some internal state (sqlite), and emitting zero or more documents to add to the collection. We've been experimenting with this paradigm for a few years now in various forms, and I've found it surprisingly capable and expressive. One nice property of this system is that every transform becomes testable by just providing an ordered list of inputs and expectations about outputs. Another nice property is that it's relatively easy to apply generic and broadly applicable scale-out strategies. For example, we support horizontal scaling using consistent hashing of one or more values extracted from each input.
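A toy version of that loop, using illustrative names rather than Flow's actual API, with in-memory sqlite standing in for the internal state:

```python
# Toy version of the derivation model described above: read source documents,
# update internal state (here, in-memory sqlite), emit zero or more outputs.
# All names are made up for the example; this is not Flow's API.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER)")

def derive(doc):
    """For each input document, bump a per-key count and emit the running total."""
    key = doc["key"]
    db.execute(
        "INSERT INTO counts (key, n) VALUES (?, 1) "
        "ON CONFLICT (key) DO UPDATE SET n = n + 1",
        (key,),
    )
    (n,) = db.execute("SELECT n FROM counts WHERE key = ?", (key,)).fetchone()
    return [{"key": key, "count": n}]  # zero or more output documents

# Testable exactly as described: an ordered list of inputs, expected outputs.
inputs = [{"key": "a"}, {"key": "a"}, {"key": "b"}]
outputs = [out for doc in inputs for out in derive(doc)]
```

Because the transform is a pure function of (state, input), scale-out by hashing the key to a shard is mechanical: each shard owns a disjoint slice of the state.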

Putting it all together, it's not hard to imagine building real-world web applications using this. Our system is more focused on analytics pipelines, so you probably don't want to build a whole application out of Flow derivations. But it would be really interesting to see a more generic DOD-based web application platform, as I'd bet it could be quite a nice way to build web apps.

psfried | 2 years ago | on: An off the shelf solution for data products

I feel like "data products" was a great idea, but difficult to implement in practice. There's kind of a paradox where you need a platform in order to host your data products, but any data products that are tied to a specific platform are almost by definition _not_ data products. Our solution was to focus on _delivery_ of data products to the systems that you're already using instead of making consumers of data products use our platform. I think it's turning out pretty well, so I thought I'd share and see what y'all think.

psfried | 3 years ago | on: Why isn’t there a decent file format for tabular data?

> But it is binary, so can’t be viewed or edited with standard tools, which is a pain.

I've heard this sentiment expressed multiple times before, and a minor quibble I have with it is that the fact that it's binary has nothing to do with whether or not it's a pain. It's a pain because the tools aren't ubiquitous, so you can't count on them always being installed everywhere. But I'd argue that sqlite _is_ ubiquitous at this point and, as others have mentioned, it's a _great_ format for storing tabular data.
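For instance, Python ships a SQLite driver in its standard library, so storing and reading tabular data takes only a few lines (the table here is invented for the example):

```python
# SQLite really is ubiquitous: Python bundles a driver in the stdlib, and the
# resulting file is readable from nearly every language and platform.
import sqlite3

conn = sqlite3.connect(":memory:")   # use a filename to get a shareable file
conn.execute("CREATE TABLE measurements (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("Oslo", -3.0), ("Lisbon", 17.5)],
)
rows = conn.execute(
    "SELECT city, temp_c FROM measurements ORDER BY city"
).fetchall()
```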

JSON is also a fine choice, if you want it to be human readable, and I'm not sure why the article claims it's "highly sub-optimal" (which I read as dev-speak for 'absolute trash'). JSON is extremely flexible, compresses very well, has great support for viewing in lots of editors, and even has a decent schema specification. Oh, and line-delimited JSON is used in lots of places, and it allows readers to begin at arbitrary points in the file.
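That arbitrary-starting-point property falls out of the newline framing: after seeking anywhere in the file, skip to the next newline and you're realigned on a record boundary. A small sketch:

```python
# Why line-delimited JSON supports reads from arbitrary offsets: seek anywhere,
# discard the (possibly partial) current line, and every subsequent line is a
# complete record.
import io, json

data = b"".join(json.dumps({"n": i}).encode() + b"\n" for i in range(100))

def read_from(buf: bytes, offset: int):
    f = io.BytesIO(buf)
    f.seek(offset)
    if offset != 0:
        f.readline()            # realign on the next record boundary
    return [json.loads(line) for line in f]

tail = read_from(data, len(data) // 2)  # start mid-file, still parses cleanly
```

This is what makes parallel and resumable readers cheap: no index or footer is needed to find a safe starting point.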

psfried | 4 years ago | on: Programmers’ emotions

I get the gist of what you're saying, and broadly agree that seasoned programmers tend to develop a strong sense of professional humility. I have to say that I think the analogy goes a bit too far, though. Even very poor programmers can get things to compile and tests to pass. The things that make a programmer very successful are still very nebulous and difficult to measure, just like in creative professions.

psfried | 4 years ago | on: MapReduce is making a comeback

First let me say that I think Timely Dataflow and Materialize are both super cool. The two approaches are quite different, in part because they solve slightly different problems. Or maybe it's more fair to say that they think of the world in somewhat different ways. Probably most of the differences can be traced back to how Timely Dataflow relies on the expiration of timestamps in order to coordinate updates to its results. You can read the details on that in their docs (https://timelydataflow.github.io/timely-dataflow/chapter_5/c...).

I think a reasonable TLDR might be to say that continuous map reduce has a better fault-tolerance story, while timely dataflow is more efficient for things like reactive joins. They both have their purpose, though, and I imagine that both Flow and Materialize will go on to co-exist as successful products.

psfried | 4 years ago | on: MapReduce is making a comeback

IANAE on Flink, especially when it comes to the internals. But I think that the decomposition of computations into distinct map and reduce functions seems to afford a bit more flexibility, since it can be useful to apply reductions separately from map functions, and vice versa. For example, you could roll up updates to entities over time just with a reduce function, and you could easily do so eagerly (when the data is ingested) or lazily (when the data is materialized into an external system). That type of flexibility is important when you want a realtime data platform that needs to serve a broad range of use cases.
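An illustration of that decoupling, with made-up names: a standalone reduce function that rolls up entity updates, folded either eagerly (at ingest) or lazily (at materialization), with identical results.

```python
# Sketch of a standalone reduce function rolling up entity updates over time.
# Because it's decoupled from any map step, the same function can run eagerly
# (as data is ingested) or lazily (as it's materialized to a destination).
from functools import reduce

def reduce_entity(state: dict, update: dict) -> dict:
    """Merge an update into accumulated entity state; last write wins per field."""
    merged = dict(state)
    merged.update(update)
    return merged

updates = [
    {"id": 7, "name": "widget"},
    {"id": 7, "price": 9.99},
    {"id": 7, "price": 8.49},
]

# Eager: fold as events arrive.
state = {}
for u in updates:
    state = reduce_entity(state, u)

# Lazy: defer the identical fold until read time.
lazy_state = reduce(reduce_entity, updates, {})
```

The platform can then choose eager or lazy evaluation per use case without touching the transform itself.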

psfried | 4 years ago | on: MapReduce is making a comeback

Apart from Google, which has a patent related to their 2004 paper, I don't know how much people are trying to "take credit" for map-reduce. I'm certainly not. But I do think the approach of running map-reduce continuously in realtime is interesting and worth sharing. And I hope that some folks will be interested enough to try it out, either with Flow or in a system of their own design, and report on how it goes for them.

psfried | 4 years ago | on: MapReduce is making a comeback

I agree with this. As soon as the MapReduce paper came out, people were criticizing it for a lack of novelty, claiming that so-and-so had been using the same techniques for years. And of course those critics are still around saying the same things. But I think there's a reason we keep going back to these techniques: they repeatedly prove to be practical and effective.

psfried | 4 years ago | on: MapReduce is making a comeback

Versioning is indeed an issue, but that's the case for anything with long-lived state. Our current approach relies on JSON schemas, TypeScript, and built-in testing support to help ensure compatibility. Those things actually help quite a bit in practice. But I think we may also want to build some more powerful features for managing versions of datasets, since there's a real need there, regardless of the processing model you use to derive the data.

psfried | 4 years ago | on: Green vs. Brown Programming Languages

Another possible explanation is simply that people have gotten better at designing programming languages. Or that newer languages are better adapted to the problems we now want to solve. That really shouldn't be too hard to swallow.