
Why disaster happens at the edges: An introduction to queue theory

245 points| ayewo | 4 years ago |thenewstack.io | reply

56 comments

[+] roenxi|4 years ago|reply
Queue theory is almost a distraction because it overcomplicates the situation. The underlying insight is that rates matter. If a system can handle 100 events a second, then at 99 events per second everything is fine and at 101 events per second the system is down. There is a threshold where everything falls apart.

Queue theory encourages people to think in terms of flow rates, and it correctly de-rates nominal capacity to account for variance. But normally that is overkill except in highly engineered systems.

The battle in an organisation is convincing people to look at flows at all. In my experience people love metrics that track stock (we have X widgets or can handle Y orders) and not flows (we built A widgets and sold B widgets). Anyone who has training in queue theory knows when it is appropriate to look at flows rather than stocks and that is where the value is. The formulas and variances tend to just scare people away from monitoring flows because they don't understand what is happening.

[+] tharkun__|4 years ago|reply
Why is 101 events per second == system down?

Of course there are systems where not being able to process everything right when it comes in means that you're "down". It's not a given though and queuing systems are actually perfect for use cases where not being able to process at the speed of incoming requests is completely fine. Eventual consistency is a thing.

I have the same experience though that it seems to be hard for folks to accept that yes, 100k messages in _that_ queue is totally fine. Some process dumped 100k messages to then be processed over the next hour and that's totally fine. That way we don't have to actually spend a small fortune to process them at the speed at which they can be generated.

Of course it depends on the problem at hand. If that particular flow in the system is one that is highly user interactive, then yes, this means 'system down' for your users. Never mind that the system will eventually process it all. I always smile when I see another system than ours and I can spot distributed processing w/ queues in between in how the system behaves in certain circumstances.
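The arithmetic behind "processed over the next hour" is simple; a sketch with made-up rates (the 100k figure is from above, the service rate is illustrative):

```python
# Back-of-the-envelope drain time for a dumped backlog (illustrative numbers).
backlog = 100_000      # messages dumped into the queue at once
service_rate = 30      # messages/sec the consumer can sustain

drain_minutes = backlog / service_rate / 60
print(f"backlog drains in about {drain_minutes:.0f} minutes")  # ~56 minutes
```

A consumer an order of magnitude slower than the producer still clears the backlog within the hour, which is exactly the "small fortune" you avoid spending.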

[+] antman|4 years ago|reply
That is absolutely not what queuing theory says. In equally simplified terms, queuing theory says that if a system receives 99 events per second on average and you size it for 100, it will probably fail, because 99 is just the mean: when arrivals overshoot (in events per second) you don’t have the capacity to catch up during the undershoot periods, because events have piled up.

For a numerical example of queuing theory 101, and why variance is very important rather than something that “scares people away”, see [0]. Flow tracking is not enough.

[0] https://www.johndcook.com/blog/2008/10/21/what-happens-when-...
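To see the pile-up numerically, here is a toy simulation (my own sketch, numbers chosen to match the 99-vs-100 example upthread): arrivals average 99/s against a server that clears 100/s, yet Poisson burstiness still builds a backlog during overshoot periods:

```python
import math
import random

def poisson(lam, rng):
    # Knuth's method for drawing Poisson samples (fine for moderate lam)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(42)
capacity = 100     # events/sec the system can process
mean_rate = 99     # average arrival rate, just under capacity

backlog = 0
max_backlog = 0
for second in range(3600):                 # simulate one hour
    arrivals = poisson(mean_rate, rng)     # mean 99, but bursty
    backlog = max(0, backlog + arrivals - capacity)
    max_backlog = max(max_backlog, backlog)

print("worst backlog during the hour:", max_backlog)
```

The mean leaves 1/s of headroom, but the per-second standard deviation is about 10, so overshoot seconds are common and the queue repeatedly builds up before the undershoot periods can drain it.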

[+] joshuanapoli|4 years ago|reply
> The battle in an organization is convincing people to look at flow at all.

I guess that's the point... an organization needs enough people to know about queue theory "to look at flow" and understand that queues are ubiquitous and anticipate that these queues experience a latency phase change at some point depending on utilization (and job duration variance). If an organization doesn't have anyone who understands queue theory, then it's probably more likely to have queue-related failures.

[+] justicezyx|4 years ago|reply
Another random software engineer claiming a fundamental theory is irrelevant...

Queuing theory deals with the stochastic behavior of discrete events at a single queue, and possibly at multiple queues connected in some structured manner.

Its importance is that it can determine beforehand the stochastic boundaries of the system.

Your whole framing of 99 events vs. 101 events is such a strawman that I don't know where to start poking holes in it...

For one thing, the rate of anything has jitter and burstiness built into its definition. To say something has one steady rate is plainly an impractical concept, as nothing in the network has a steady rate...

Edit: I should stand corrected that the parent does not really miss the meaning of queuing theory. It was just that I only read the first paragraph.

[+] kccqzy|4 years ago|reply
> The underlying insight is that rates matter. If a system can handle 100 events a second, then 99 events per second everything is fine and 101 events per second the system is down.

Not really. You need actual theory to inform you whether 99 events per second is okay. See https://www.johndcook.com/blog/2008/10/21/what-happens-when-...
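The linked post works the M/M/1 case, where expected time in the system is W = 1/(μ − λ); a quick sketch (service rate of 100/s assumed, to match the example) of how sharply that degrades as arrivals approach capacity:

```python
# M/M/1 queue: expected time in system W = 1 / (mu - lam)
mu = 100.0                          # service rate, events/sec
for lam in (50, 90, 99, 99.9):      # arrival rates approaching capacity
    w_ms = 1000.0 / (mu - lam)
    print(f"lam={lam:>5}: average time in system {w_ms:8.1f} ms")
```

So "is 99 events per second okay?" becomes "is an average of a full second in the system okay?", and the answer flips from 20 ms at half load to 10 s at 99.9/s.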

[+] jimkleiber|4 years ago|reply
> The battle in an organisation is convincing people to look at flows at all. In my experience people love metrics that track stock (we have X widgets or can handle Y orders) and not flows (we built A widgets and sold B widgets).

This reminds me of an entrepreneurship class I took in college where the professor emphasized the importance of cash flow, something I hadn't considered much and found somewhat counterintuitive.

I'd love to learn more about this perspective on when to focus on the flows over stocks, are there any resources you'd recommend to help me better understand it?

[+] whatever1|4 years ago|reply
The main question posed to the business is the following:

Do you want to provision your system for the worst-case volume, and which worst-case scenario are you willing to pay to insure your system against?

It’s an easy question to ask, but I don’t see how queueing theory can provide answers to it. It is, at its root, a subjective business decision.

[+] efitz|4 years ago|reply
There are secondary non-obvious effects of caching and queueing.

Cold start is an important scenario that lots of people overlook - “we won’t restart it during busy times”, etc. however if you have a bug or outage then that might be precisely when you restart. During a cold start you will often find your system behaves as it does at the tails of the performance curve, either because you are getting 100% cache misses, or because all your queues are full due to a client pile-on or queued upstream work, or some combination.

The article described back pressure; I can’t emphasize that enough as part of any resilient queuing system. Chains of queues are only as performant as their weakest link.

[+] jph|4 years ago|reply
Queueing theory-- my notes for programmers:

https://github.com/joelparkerhenderson/queueing-theory

Queueing theory is the mathematical study of waiting lines, or queues. We use queueing theory in our software development, for purposes such as project management kanban boards, inter-process communication message queues, and devops continuous deployment pipelines.

[+] layer8|4 years ago|reply
The notes don’t explain what problems queue theory solves, i.e. why would I want to use it, as a programmer. The Wikipedia article’s second sentence is a bit more helpful in that respect: "A queueing model is constructed so that queue lengths and waiting time can be predicted."
[+] jimmySixDOF|4 years ago|reply
All kinds of queue analysis are also done extensively in Six Sigma practice
[+] kesor|4 years ago|reply
TheNewStack needs more articles like this one and fewer articles about some new feature where adding an extra five lines of YAML gets you something or other in your Kubernetes.
[+] crmd|4 years ago|reply
I worked at IBM in earlier days, and a couple of the Research people in the data storage group at Almaden and Poughkeepsie gave some great lectures and made excellent tools for modeling cache disk storage array performance that were based heavily on queue theory. This brings back great memories, thanks!
[+] xmcqdpt2|4 years ago|reply
I'm being pedantic but,

>Every distribution can be described as a curve. The width of this curve is known as its variance.

Not every distribution has a variance. Some notable examples include the Cauchy distribution (or Lorentz lineshape in physics),

https://en.m.wikipedia.org/wiki/Cauchy_distribution

the Pareto (power-law) distribution for tail exponents less than or equal to 2,

https://en.m.wikipedia.org/wiki/Pareto_distribution

and the Levy distribution,

https://en.m.wikipedia.org/wiki/Lévy_distribution

These are not mathematical curiosities but actually describe real physical systems.

Additionally, the variance works as a "width" only for unimodal distributions. Any variance-based metric or analysis should only be used after checking for multimodality and, for multimodal data, with extreme caution thereafter.
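A quick empirical illustration (my own sketch): the sample variance of Cauchy draws never converges as the sample grows, so any variance-based "width" is meaningless for it:

```python
import math
import random
import statistics

rng = random.Random(0)
# Standard Cauchy draws via the inverse CDF: tan(pi * (u - 0.5))
samples = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(100_000)]

# The running sample variance never settles; it is dominated by outliers.
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: sample variance {statistics.pvariance(samples[:n]):.3g}")
```

With a fixed seed the numbers are deterministic, but rerun with different seeds and the estimates jump by orders of magnitude, because the largest single outlier dominates the sum.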

[+] thingsgoup|4 years ago|reply
A frustrating headline. Where else would it happen? The middle? Crystals fracture on their faces. Things tend to break along their boundaries.
[+] bonestormii|4 years ago|reply
I feel like you are understanding the title as I first understood it, meaning like, "the external facing portions of infrastructure". However, reading the article, it seems clear that he's referring to the edges of a distribution curve (i.e. Infrequent events that impact experience nonetheless).

From the article: "It’s tempting to focus on the peak of the curve. That’s where most of the results are. But the edges are where the action is. Events out on the tails may happen less frequently, but they still happen. In digital systems, where billions of events take place in a matter of seconds, one-in-a-million occurrences happen all the time. And they have an outsize impact on user experience."

EDIT: IMO, the title is still a little annoying in this respect. I think everyone would agree if a request to your site fails 5% of the time, that is unacceptable, even though it "usually works." The discussion of the distribution curve simply to make the point that spikes in usage cause backed up queues which impact performance isn't necessarily helpful as far as I can tell, and it seems done largely in service to the title. In my mind while reading this, I'm thinking, "Okay, cool, but how does the fact that this interesting issue exists at the edge of the curve help me identify it?" Answer: It doesn't. If you see errors occurring, you will investigate them once they are noticed. Being at the edge of the curve may mean it takes longer to notice, but like, what kind of alerting system are you using that discriminates against rare issues in favor of common ones?

Discussing queues, over provisioning, back pressure, etc. are all super interesting and helpful.

[+] ignoramous|4 years ago|reply
Bitten by queues in every team I was at AWS:

> As a rule of thumb, target utilization below 75%

This is one good reason to rely on serverless. Simply outsource the problem to a system that knows how to handle this better.
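That 75% rule of thumb falls out of the M/M/1 delay factor, which scales like ρ/(1 − ρ); a quick sketch:

```python
# M/M/1: average queue length (and delay) scales like rho / (1 - rho),
# which is why rules of thumb cap utilization well below 100%.
for rho in (0.50, 0.75, 0.90, 0.95, 0.99):
    factor = rho / (1 - rho)
    print(f"utilization {rho:.0%}: queueing factor {factor:5.1f}")
```

At 75% you pay about a 3x queueing penalty; past 90% the factor explodes, which is the latency phase change the article describes.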

> Steer slower workloads to paths with lower utilization

This is fraught with all sorts of perils [0], so take caution going down this valid but tricky route.

> Limit variance as much as possible when utilization is high

This is key, and one of the most elegant solutions to this problem I know of comes from a Facebook talk on CoDel + Adaptive LIFO. [1]

> Implement backpressure in systems where it is not built-in

Making downstream dependencies behave is never an option. Selectively isolating noisy neighbours, if possible, from other well behaved clients, tends to work well for multi-tenant systems [2], in addition to monitoring long work (by dropping work that takes forever) [3], or better yet, doing constant amount of work [4] (aka eliminating modes) [5].

> Use throttling and load shedding to reduce pressure on downstream queues.

One way is to implement admission control (à la Token Bucket) to minimise the impact of thundering herds. Though, clients do not take kindly to being throttled. [6]
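A minimal token-bucket sketch of that admission-control idea (the class name and numbers are mine for illustration, not from any AWS library):

```python
import time

class TokenBucket:
    """Toy admission control: refuse work once the bucket is empty."""

    def __init__(self, rate, burst):
        self.rate = rate              # tokens added per second
        self.capacity = burst         # burst size (max stored tokens)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # caller sheds/throttles (e.g. HTTP 429)

# A thundering herd of 1000 near-instant requests: only roughly the
# burst (plus whatever refills during the loop) gets through.
bucket = TokenBucket(rate=100, burst=20)
admitted = sum(bucket.allow() for _ in range(1000))
print("admitted:", admitted)
```

The herd is smoothed to the sustained rate instead of hammering the downstream queue, at the cost of rejecting the excess up front.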

A great article; many have been written at this point, and very many have still been burned by queues. Though, in my experience, without exception, some component somewhere was always building up that backlog! [7][8][9]

[0] Interns with Toasters: How I taught people about Load Balancers, https://news.ycombinator.com/item?id=16894946

[1] Fail at scale: Controlling queue delays, https://blog.acolyer.org/2015/11/19/fail-at-scale-controllin...

[2] Workload isolation, https://aws.amazon.com/builders-library/workload-isolation-u...

[3] Avoiding insurmountable queue backlogs, https://aws.amazon.com/builders-library/avoiding-insurmounta...

[4] Constant work, https://aws.amazon.com/builders-library/reliability-and-cons...

[5] Cache, modes, and unstable systems, https://news.ycombinator.com/item?id=28344561

[6] Fairness in multi-tenant systems, https://aws.amazon.com/builders-library/fairness-in-multi-te...

[7] Using load shedding to avoid overload, https://aws.amazon.com/builders-library/using-load-shedding-...

[8] Treadmill: Precise load testing, https://research.fb.com/publications/treadmill-attributing-t...

[9] Cascading failures, https://sre.google/sre-book/addressing-cascading-failures/

[+] polskibus|4 years ago|reply
I would love to read more on queue theory + web applications - is there anything out there worth reading for a more in-depth understanding (applied to applications), but with key takeaways/rules of thumb presented like in this article?
[+] lelanthran|4 years ago|reply
> The TCP protocol, for instance, generates backpressure with code 503

Surely they mean 'HTTP protocol'? TCP doesn't do back pressure but IP does at the router level, IIRC.

[+] dboreham|4 years ago|reply
Throughout my career, the one thing I've noticed engineers/developers consistently have trouble modeling in their heads is queuing.
[+] sparsely|4 years ago|reply
They absolutely love adding them to previously functional systems even so!
[+] Neverending|4 years ago|reply
Don’t canceled/rejected requests have infinite latency by definition?
[+] bonestormii|4 years ago|reply
For the one request that is cancelled/rejected, but not for every other request behind them in a queue.
[+] overkalix|4 years ago|reply
Seems like this guy is about to discover six sigma...
[+] wyager|4 years ago|reply
> The TCP protocol, for instance, generates backpressure with code 503

Doesn't TCP typically use sliding window flow control? I'm not sure what "code" is referring to in this context.

[+] addisonj|4 years ago|reply
I think this might be an error in the text; HTTP status code 503 can be used for backpressure, which I think is what is being referred to.

But yes, TCP uses windows for back-pressure. That isn't really useful for application-level backpressure though, as the OS controls the queue sizes, so pretty much all systems have their own backpressure on top.

[+] detaro|4 years ago|reply
That seems like a typo or confusion and they actually mean HTTP.