Eventual Consistency isn’t for Streaming

[+] kqr|5 years ago|reply

I agree with the other commenter. Eventual consistency has always been roughly a synonym for "tactical lack of consistency." The reason this works is that inconsistency is, in many business domains, not such a big deal as we make it out to be. Most business are used to data lagging behind, documents being filed incorrectly, decisions being changed and half of documents referring to the old decision, to mention just a few possibilities. As long as everything is dated and there are corroborating versions of all facts, this can be untangled by experts in the few cases it really matters. Most of the time, it doesn't matter that much.

Eventual consistency is embracing this philosophy of a lack of consistency for computer systems too, on the basis that maintaining actual consistency would be too expensive/complex/slow, which is frequently the case.

This of course, in principle, can lead to ever degrading consistency and since you can't assume everything is consistent, you also cannot really verify consistency in any other way than heuristically, as another commenter suggested.

Eventual consistency is a design driven by practical needs. It is never a path to reach complete data purity.

And this applies both to streaming and batch tasks alike.

[+] virgilp|5 years ago|reply

> the basis that maintaining actual consistency would be too expensive/complex/slow, which is frequently the case.

Maintaining actual consistency is seldom more complex - the opposite is true, eventual consistency can lead to mind boggling complexity (because it's very hard to reason about your guarantees anymore... even the "eventual correctness" guarantee; in practice it's more often than not a handwavy "yeah, it's likely probably correct in many cases, and if you find something wrong, we'll take it as a bug and fix it. Or at least claim to fix it, because you know, it might be hard to reproduce". Good enough for usecases like advertising, I guess)

Too expensive/slow is the typical reason for eventual consistency - but the whole point of materialize.io is to challenge this "too expensive/slow" assumption.

[+] dustingetz|5 years ago|reply

enterprise is rapidly approaching a data quality crisis where they have all these data warehouses but the final analytic artifacts end up being garbage and unusable for data science ... you will be hearing a lot more of this in the 2020s

[+] PeterCorless|5 years ago|reply

If the term "asynchronous consistency" was adopted, I wonder if people would grok it easier.

[+] asdfasgasdgasdg|5 years ago|reply

This article isn't very convincing to me. I mean, I one hundred percent buy that eventually consistent stream processing systems can theoretically be subject to unbounded error. But eventual consistency isn't just a theoretical model. It's also a practical engineering decision, and so in order to evaluate its use for any given business purpose we have to see how it performs in practice. That is, what is the average/99.9%/max error? And we have to understand how business-critical the correct answer is. This article has some great examples of theoretical issues with eventually consistent stream processing computation, but it doesn't demonstrate that any real systems evince these problems under any given workload.

[+] dominotw|5 years ago|reply

> Not all is lost! There are stream processing systems that provide strong consistency guarantees. Materialize and Differential Dataflow both avoid these classes of errors by providing always correct answers

yeah i was expecting to see what tradeoffs materialize made to get 'always correct' result. There is definitely something 'lost' for 'always correct' too.

I can only attribute this one sided take to deviousness. Personally , I would avoid whatever this company is selling.

[+] cs702|5 years ago|reply

For more concise and precise explanations of the rationale for these kinds of tools, see this paper: https://github.com/TimelyDataflow/differential-dataflow/raw/... -- here's the abstract:

> Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative dataparallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.

See also this friendlier (and lengthier) online book: https://timelydataflow.github.io/differential-dataflow/

[+] virgilp|5 years ago|reply

materialize.io is literally timelydataflow/ differential dataflow... same product, developed by Frank McSherry. It's not "the other tool", it's the very same.

[+] alextheparrot|5 years ago|reply

I'm actually just fundamentally confused about what is being argued.

I'm familiar with streaming, as a concept, from the likes of Beam, Spark, Flink, Samza - they do computations over data, producing intermediate results consistent with the data seen so far. These results are, of course, not necessarily consistent with the larger world because there could be unprocessed or late events in a stream, but they are consistent with the part of the world seen so far.

The advantage of streaming is the ability to compute and expose intermediate snapshots of the world that don't rely on the stream closing (As many streams found in reality are not bounded, meaning intermediate results are the only realizable result set). These intermediate results can have value, but that depends on the problem statement.

To examine one of the examples, let's use example 2, this aligns with the idea that we actually don't have a traditional streaming problem. The question being asked is "What is the key which contains the maximum value". There is a difference between asking "What is the maximum so far today" and "What was the maximum result today" -- the tense change is important because in the former the user cares about the results as they exist in the present moment, whereas the other cares about a view of the world in a time frame that is complete. It seems like the idea of "consistent" is being conflated with "complete", wherein "complete" is not a guaranteed feature of an input stream.

If anyone could clarify why the examples here isn't just a case of expecting bounded vs unbounded streams?

[+] arjunnarayan|5 years ago|reply

The argument is that without stronger consistency guarantees you can't do joins between two streams (or even something like argmax over a single stream, since it splits the stream into two subcomputations, which then have to be joined back together).

I think when folks say that eventual consistency is okay, they're thinking about simple aggregates - where transient incorrectness in the result is indistinguishable from noise.

But if you want to do joins, you really want to be able to reason about your unbounded streams causally - Flink, Beam, (and as another commenter points out, Firebase as well) provide stronger consistency guarantees on computations over unbounded streams.

[+] bkirwi|5 years ago|reply

The issue being pointed out here isn't that the computation is out of sync with the outside world... it's that it's our of sync with _itself_! It will return answers that are not just stale, but inaccurate for any time in history.

This might still be fine, depending on your needs, but IMO a legitimate distinction.

[+] nikhilsimha|5 years ago|reply

In both examples 2 and 3, the author reads the same stream twice independently and assumes that a join is not synchronized between the transformed streams. This seems like a fundamental flaw in their offering.

Pushing in a timestamp along with the max/variance change stream[1]. And then using the timestamp to synchronize the join[2] would naturally produce a consistent output stream.

I quoted flink because they have the best docs around. But it should be possible in most streaming systems. Disclaimer, I used to work for the fb streaming group and have collaborated with the flink team very briefly.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...

[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11...

[+] rkhaitan|5 years ago|reply

The aim of the examples is to show what goes wrong in eventually consistent systems where it's possible that two reads of a stream may not be consistent with respect to each other. The examples are not intended to say that such anomalies can't be fixed by providing stronger consistency guarantees by using timestamps.

[+] dekimir|5 years ago|reply

> you should be prepared for your results to be never-consistent

Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?

And finally, how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (ie, a logical timestamp)?

[+] jlokier|5 years ago|reply

> Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?

This is covered by the CAP theorem. https://en.wikipedia.org/wiki/CAP_theorem

The basic solution is: If you need consistency and there's too much network failure (or delay), you'll have to pause operations and wait until the network is fixed.

If there's only a bit of network failure (or delay), consistency stays possible using quorum protocols such as Paxos and Raft.

> how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (ie, a logical timestamp)?

Implicit causality helps.

You're right that there may be no definite logical time, but it often doesn't matter.

When a program issues a read command, the logical timestamp is, implicitly, greater than the timestamp of all results previously received from the network that were inputs to produce the read command.

So the rest of the network "knows" something about the logical time of the read command. It's not an exact logical time, and if the timestamps aren't passed around, it might not even be an inequality. It's more like a logical property that relates dependent values.

If done right, that's enough to ensure strict consistency in observable results.

Unless the program issuing reads does wild things with value speculation. You may have heard how much things can go wrong with speculative execution...

[+] PeterCorless|5 years ago|reply

I'm still trying to figure out how you avoid speed of light propogation delays to get immediate consistency.

[+] anonymousDan|5 years ago|reply

There's been plenty of work in the past on weaker correctness guarantees for stream processing system (e.g. concepts like rollback and gap recovery from Aurora). Not sure it's an either/or between eventually consistent and strong consistency.

[+] satyrnein|5 years ago|reply

Side question - has anyone tried using Materialize beyond toy workloads? Can I move billions of rows off of a batch workflow on Snowflake onto Materialize and suddenly everything is near realtime?

[+] DevKoala|5 years ago|reply

I keep falling for these clickbait titles in the hopes I will find a fair argument. However, the moment I realize the article is trying to sell me a product based around an argument, I lose faith on the perspective of the writer.

If the title was something more honest such as “How product X solves for Y” I’d feel more compelled to put trust on the analysis being objective.

[+] tlarkworthy|5 years ago|reply

Firebase provides causal consistency. By subscribing to streams (listen), the client opts into which data sources it was consistent snapshots of, then all distinct client streams are bundled up and delivered in order over the wire. It's a very elegant model which does not get in the way and has nice ergonomics.

[+] andrekandre|5 years ago|reply

so, if i understand the article correctly, for purposes of realtime reporting/monitoring (streaming, as stated), eventual consistency is not an appropriate "store" to hook into because you cant know when things have become consitstent, and reliable streaming of (near?) realtime data requires some chance for that to occur

is that a correct interpretation?

[+] erikerikson|5 years ago|reply

TL;DR: accessing materializations is necessarily a snapshot.

This article reads as though the author hadn't shifted mindset from "the database will solve it for me" to "I'm taking on the relevant subset of problems in my use case". This seems off given that they're trying to sell a streaming product. They claim their product avoids problems by offering "always correct" answers which requires a footnote at the very least but none was given.

Point of note: The consistency guarantee is that upon processing to the same offset in the log that, given that you have taken no other non-constant input, you will have the same computational result as all other processes executing semantically equivalent code.

I take this sort of comment as abusive of the reader:

> What does a naive application of eventual consistency have to say about > > -- count the records in `data` > select count(*) from data > > It’s not really clear, is it?

A naive application of eventual consistency declares that along some equivalent of a Lamport time stamp across the offsets of shards in the stream, the system will calculate account of records in data as of that offset. Given the ongoing transmission of events that can alter the set data, that value will continue changing as appropriate and in a manner consistent with the data it processes. The new answers will be given when the query is run again or it may even issue an ongoing stream of updates to that value.

Maybe it got better as the article went on...

[+] erikerikson|5 years ago|reply

I appreciate that the downvote mechanism is low friction. I wish it were easier to learn and improve from it too.

[+] unknown|5 years ago|reply

[deleted]

[+] unknown|5 years ago|reply

[deleted]

[+] ecopoesis|5 years ago|reply

Almost every distributed system (including "simple" client-server systems) is eventually consistent. And all systems are distributed.

It's great that your DB is ACID and anyone who queries it gets the latest greatest but in reality you also have out of date caches, ORM models that haven't been persisted, apps where users modifying data that hasn't been pushed back to the server and a million other examples.

I'm sure it's possible to create a consistent system but I'm also sure it's not practical. No one does it.

Instead of constantly fighting eventual consistency just learn to embrace it and its shortcomings. Design systems and write code that are resilient to splits in HEAD and provide easy methods to merge back to a single truth.

[+] bcrosby95|5 years ago|reply

There is a huge difference between having an ACID store of Truth surrounded by eventual consistency, vs making even your store of Truth eventually consistent. You're basically doubling or tripling your work for any given constraint because you have to both monitor after-the-fact violations and build in a way to resolve those violations.

This is on top of regular "nope, can't do that" code that you would write in both systems.

[+] jasonwatkinspdx|5 years ago|reply

> I'm sure it's possible to create a consistent system but I'm also sure it's not practical. No one does it.

Billions of dollars flow through fully consistent systems every day. The basic IT concept for smaller hedge funds is "buy the biggest MSSQL machine available on the planet and move on." The big ones have custom frameworks that resemble Frank's arguments here, though the abstractions are different.

[+] Supermancho|5 years ago|reply

> No one does it.

Oh, some people do. I used this EXACT phrase when I came in to fix an analytics system at a healthcare company that was plagued with analytics problems. They had 5 senior engineers, fulltime, working on this system for years. It had persistent problems and could not be modified in any meaningful way. Upstream systems sent data through multiple SQS topics (duplicate and out of order data) fed into lambda, fed into a giant cache-db which tried to catch dupes and order data, fed into files, processed in batch. It was a horror show in complexity and costing (despite the near-free lambdas). A distributed set of large data streams we feeding into a singular database which was processed, multiple times and put back in the same database. What's billions of inserts into an amazon postgres db, per hour? The company cloud infrastructure gave 0 other tools to work with. I shored up the batch processing (which had all kinds of try catch everywhere, despite a fixed schema) and went on to another company. Medical company software is always a ball of fail.

[+] BillinghamJ|5 years ago|reply

My company is an automated insurance broking service which issues about 4% of all UK motor insurance policies (by number, not value).

We have consistency across our distributed system (~75 services currently) for all the fundamentals of our business. It is not difficult to do at all.

49 comments