top | item 7652036

Knightmare: A DevOps Cautionary Tale

101 points | nattaylor | 12 years ago | dougseven.com | reply

60 comments

[+] AndrewBissell|12 years ago|reply
This story puts the lie to a couple of canards about HFT:

- "It's risk free." Any time you put headless trading code into the market you are risking a catastrophic loss. That risk can be managed to a degree with many layers of programmatic safeties, and other practices like having your operations people look for warning emails the day after you've deployed new code. But the risk is always present.

- "It makes the market more unstable." The most important market-maker in U.S. equities blew itself up in spectacular fashion and had to remove itself from trading entirely. Sending unaccounted orders into the market in an endless loop is about the worst mistake an algorithmic trading firm can make. Can anyone pick the day this happened out of a long-term chart of the S&P 500?

[+] ams6110|12 years ago|reply
Automated deployment would not necessarily have prevented this. Errors happen when humans deploy software manually, and errors happen when humans configure automated deployment tools. The real problem was lack of a "kill switch" to shut down the system when it became obvious something was wrong.
[+] 30thElement|12 years ago|reply
A kill switch wouldn't have saved them. What killed Knight wasn't the $400 million loss, it was the lack of confidence all other firms had in them afterwards. Brokers can't just shut down in the middle of the trading day.

They managed to raise the money to cover the loss, but afterwards they were getting around 10% of their normal order volume [1].

Somewhat ironically, the closest thing they had to a kill switch, backing out the code, actually made the situation worse as it made all 8 servers misbehave instead of just the first one[2].

The full SEC report in [2] is an interesting read, just skip the parts about "regulation 5c-15...".

[1] http://www.businessinsider.com/look-how-knight-capitals-trad...

[2] http://www.sec.gov/litigation/admin/2013/34-70694.pdf

[+] dsr_|12 years ago|reply
An operations group should:

1. know what a normal morning looks like

2. and recognize the abnormality

3. and have the authority to shut down all trading immediately

DevOps is not an excuse to fire your operations staff, it's a requirement that your developers work with and understand your operations staff and vice-versa.

[+] siliconc0w|12 years ago|reply
I think the lesson is actually more about how to do proper versioning and message serialization in higher-risk distributed systems. Higher message versions should fail to deserialize and cause the message to re-queue (or go to a dead letter queue). Then you monitor queue length like a hawk and have a plan in place for rolling back not just the consumers but the producers as well.
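A minimal sketch of that version gate in Python. The schema and the list-based queues are hypothetical stand-ins; a real broker (e.g. RabbitMQ) has native requeue and dead-letter support:

```python
import json

SUPPORTED_VERSION = 2  # highest message schema version this consumer understands

def handle(raw, requeue, dead_letters):
    """Deserialize a message, refusing versions newer than we support.

    `requeue` and `dead_letters` are plain lists standing in for a real
    broker's requeue and dead-letter queues.
    """
    msg = json.loads(raw)
    if msg.get("version", 1) > SUPPORTED_VERSION:
        dead_letters.append(raw)  # don't guess at a future schema's meaning
        return None
    return msg

requeue, dlq = [], []
ok = handle('{"version": 2, "side": "buy"}', requeue, dlq)
unknown = handle('{"version": 3, "side": "buy"}', requeue, dlq)
```

The point is that an unknown version is rejected loudly (queue length grows, alarms fire) instead of being half-interpreted by old code.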
[+] fizx|12 years ago|reply
Every company I've worked at had similar (though far less costly) issues.

Put an API method in every service that exposes the SHA of the running code, the build time of the binary (if compiled), and the timestamp when the deploy was initiated. (btw, having this information in the filesystem is insufficient, because what if a post-deploy restart failed?) Verify after every deploy.
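A sketch of what that check might look like. The constants, host names, and report shape are invented for illustration; in practice `version_info` would sit behind an HTTP handler and the SHA would be injected at build time:

```python
import time

# Hypothetical build-time constants (in practice injected by the build system).
BUILD_SHA = "0f3a9c1"
BUILD_TIME = "2024-01-15T09:30:00Z"
DEPLOY_STARTED_AT = time.time()  # recorded when the deploy restarted this process

def version_info():
    """What a /version API method might return for post-deploy verification."""
    return {"sha": BUILD_SHA, "built": BUILD_TIME, "deployed": DEPLOY_STARTED_AT}

def verify_deploy(host_reports, expected_sha):
    """Return hosts still reporting a stale SHA; an empty list means success."""
    return [h for h, info in host_reports.items() if info["sha"] != expected_sha]

# One host never restarted after the deploy, so it reports the old SHA:
reports = {"app1": version_info(),
           "app8": {"sha": "9d2b770", "built": "?", "deployed": 0}}
stale = verify_deploy(reports, BUILD_SHA)
```

Asking the running process (rather than the filesystem) is exactly what catches the failed-restart case the parent mentions.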

[+] Havoc|12 years ago|reply
>BTW – if there is an SEC filing about your deployment something may have gone terribly wrong

I can't help but smile at this comment. Production servers crashing is bad news, but the above is a whole new level of bad.

[+] bowlofpetunias|12 years ago|reply
The quite common lack of a kill switch is something that never fails to amaze me. Especially in large scale deployments with all kinds of distributed processes where you cannot simply turn off "the" server.

Everybody is worried about downtime, but downtime is rarely the worst that can happen.

Things turning to shit fast and not being able to stop it is both much more common and much harder to recover from.

So many organizations have dutifully implemented a single command deployment but don't even have a playbook for simply pulling the plug.

[+] hft_throwaway|12 years ago|reply
That's very true in this case. The issue here was that Knight isn't just trading for its own account. They're a broker that likely has some SLA-ish agreement with clients, or faces reputational risk at the very least. As a registered market-maker they're obligated to quote two-way prices. Shutting down costs them money and exposes them to regulatory risk.
[+] coldcode|12 years ago|reply
In addition to having a repeatable and dependable deployment process, it's also a good idea to remove unused functionality completely from deployed products. I also don't understand why people insist on using integer flags for things that matter. I've seen this type of error before, people reuse a bit pattern to mean something different and two pieces of code start doing different things.
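A toy illustration of that failure mode, with made-up flag names: two generations of code read the same bit and do different things.

```python
OLD_TEST_MODE = 0x01  # original meaning of the bit, still checked by dead code
NEW_ROUTING   = 0x01  # repurposed meaning in the new release (same bit!)

def old_server(flags):
    # Dead code path that nobody remembered was still reachable.
    return "test loop: fire child orders forever" if flags & OLD_TEST_MODE else "idle"

def new_server(flags):
    return "route order normally" if flags & NEW_ROUTING else "idle"

flags = NEW_ROUTING  # the operator intends the new behavior...
assert new_server(flags) == "route order normally"
# ...but any server still running old code reactivates the dead path:
assert old_server(flags) == "test loop: fire child orders forever"
```

Deleting the dead code, or allocating a fresh bit, removes the hazard entirely.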
[+] iandanforth|12 years ago|reply
I think this says more about their business than their deployment process. It might be a good rule of thumb to say, "If your business can lose $400M in 45 minutes, you're not in business, you're playing poker."
[+] rcxdude|12 years ago|reply
Many businesses can lose that much or more in a very short time if something goes wrong, by destruction of product, damage to equipment, or damage to environment. Software can be responsible for all of these.
[+] protomyth|12 years ago|reply
The market is tough and I know a real business (commodity) that can lose about a million a minute in market shifts if done really wrong (like this).

If you can lose $400M in 45 minutes, you need an actual deployment team with actual procedures and triple check code verifications.

[+] hga|12 years ago|reply
From the article, which even has a Wikipedia link to market maker:

"Knight Capital Group is an American global financial services firm engaging in market making, electronic execution, and institutional sales and trading."

Institutions and other big entities buy and sell stocks, in huge quantities. Someone has to execute these trades, and doing it electronically is infinitely faster and more efficient, and usually less error prone. And the platforms for doing this are therefore very "powerful".

But "With great power comes great responsibility", and this company was manifestly grossly irresponsible on many levels, it was likely only a matter of time before something like this would kill them.

[+] gcb0|12 years ago|reply
you are right in this case.

those companies exist for one reason... in the past there were rules so people don't send money to the wrong place in the stock exchange. those brokers and speed traders got ahead of everyone by bypassing those safeties with little regard for safety. the only sad part in this story is that it still hasn't happened to all of them.

[+] teyc|12 years ago|reply
This case is extremely interesting, because it presents a very difficult problem. What is it that Knight could have done to prevent such a serious error from occurring?

At the core, it seems, is that each application server is effectively running as root, with enormous capacity to cause immediate damage. The lesson from http://thecodelesscode.com/case/140 is to "trust no-one". This implies having automated supervisors with the capacity and authority to shut down machines. This is difficult to build, and difficult to reason about.

Secondly, it warns us of the dangers of sharing global variables/flags. Humans lack the capacity to reason effectively about what happens when a repurposed flag gets used by an old piece of code. That alone should be a sufficient heuristic to avoid doing so. This is utterly preventable.

Thirdly, incomplete/partial deployment is extremely dangerous. While assembly signing and other approaches work for binaries, nothing is said about configuration files. Perhaps best practice in highly dangerous situations requires configuration to be versioned and checked by the binaries as they load. After all, a configuration represents an executable specification. Similarly, relying on environment variables is extremely risky as well.
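A sketch of that versioned-configuration idea, assuming a JSON config and a version string compiled into the binary (all names illustrative):

```python
import json

# Hypothetical: the config schema version this binary was built against.
EXPECTED_CONFIG_VERSION = "2.1"

def load_config(text):
    """Parse config; refuse to start if its version doesn't match the binary's."""
    cfg = json.loads(text)
    found = cfg.get("config_version")
    if found != EXPECTED_CONFIG_VERSION:
        raise RuntimeError(
            f"config version {found!r} != binary expectation "
            f"{EXPECTED_CONFIG_VERSION!r}; refusing to start")
    return cfg

cfg = load_config('{"config_version": "2.1", "max_position": 100}')
```

A binary that refuses to boot against a mismatched config fails loudly at deploy time, instead of running half-new, half-old on one server out of eight.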

[+] rbc|12 years ago|reply
I think it shows how release engineering hasn't been seriously funded at many companies. There are a bunch of tools serving different communities, but most of them operate in the context of a single server. Production reality is clusters of machines. We need better tools for managing cluster deployments and hot-swapping code. The Erlang platform takes on some of this, but doesn't seem to have picked up the following it probably deserves. I bet there are some lessons to be learned there.
[+] chris_wot|12 years ago|reply
Why didn't they remove the old code first?

In LibreOffice, we are spending a LOT of time trying to remove unused and outdated code.

[+] MartinCron|12 years ago|reply
That's what I'm thinking as well. You can't leave land mines lying around and then blame the poor guy who steps on one.

If you find yourself afraid to pull old code out, you've probably got a combination of technological and cultural problems.

[+] EliRivers|12 years ago|reply
Because time (and money) spent removing old code that's not used right now is time (and money) spent for zero short-term profit increase.
[+] personZ|12 years ago|reply
There are many interesting lessons from this tale, but the conclusion that automated deployment would have saved the day seems a bit of a jump.

Automation does not protect you from either automated devastation, gaps, or human errors. Your automation tools, just as with written instructions, require configuration -- a list of servers, for instance.

Automation can be bullet-proof when it's a continuous deployment situation, but it is less reliable when you do infrequent deployments, as such a financial firm does. I say this having been at a firm where we moved from "a list of deployment steps" to "fully automated" for our quarterly builds, and the result was much, much, much worse than before. We could certainly have resolved this (for instance, by having a perfect replica of production), but the amount of work and testing we put into our deployment process exceeded our manual process by several magnitudes.

An observer did not validate the deployment (which should be the case, automated or not, for such a deploy). They ignored critical warning messages sent by the system pre-trading (the system was warning them that it was a SNAFU situation). Systems in a cluster didn't verify versions with each other. Configuration did not demand a version. Most importantly for a system of this sort, they didn't have a trade gateway through which they could easily see what the system was doing, and gate abnormal behaviors quickly and easily (such a system should be as simple as possible, the premise being that it's an intermediate step between the decision/action systems and the market. The principle is exactly the same as sending a mass customer mailing to a holding pen for validation: to ensure that your macros are correct, that people aren't multi-sent, to do throttling, etc.).
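A minimal sketch of such a gating layer. The class name, per-symbol limit, and halt behavior are invented for illustration, not Knight's actual design:

```python
class OrderGateway:
    """Simple intermediate gate between trading logic and the market.

    Counts orders per symbol and halts all flow when volume looks abnormal.
    """
    def __init__(self, max_orders_per_symbol=1000):
        self.max = max_orders_per_symbol
        self.counts = {}
        self.halted = False

    def submit(self, symbol):
        if self.halted:
            return "rejected: gateway halted"
        self.counts[symbol] = self.counts.get(symbol, 0) + 1
        if self.counts[symbol] > self.max:
            self.halted = True  # runaway order flow detected: stop everything
            return "rejected: abnormal volume, halting"
        return "sent"

gw = OrderGateway(max_orders_per_symbol=2)
results = [gw.submit("ABC") for _ in range(3)] + [gw.submit("XYZ")]
```

Keeping this layer dumb and separate from the strategy code is what makes it a credible last line of defense: it has no reason to "agree" with a runaway algorithm.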

[+] vacri|12 years ago|reply
The bit that surprised me the most was the lack of killswitch (or a halt or a pause); that a human supervisor couldn't invoke a "holy shit!" button.
[+] rossjudson|12 years ago|reply
This has nothing to do with automation of deployments. Any part of an automated deployment can fail. At scale a single failure cannot be allowed to halt the deployment, either.

This is an architectural mistake. Distributed systems must always be able to operate in an environment with a variety of versions, without ill effects.

They repurposed a flag and then failed to test the mixed environment.

Hindsight is 20/20, of course.

[+] MartinCron|12 years ago|reply
> Automation can be bullet-proof when it's a continuous deployment situation, but is less reliable when you do infrequent deployments

If your deployment pipeline is fully automated, why aren't you making lots of little deployments? The safest change to make is the smallest change possible, after all.

[+] codr|12 years ago|reply
No rollback plan? Wtf.
[+] micro-ram|12 years ago|reply

  repurposed an old flag
The flag was not OLD since there was still code in the CURRENT code base which COULD use it.
[+] VintageCool|12 years ago|reply
There was live code in the current code base which could use the flag, but it hadn't been active in 8 years.
[+] chollida1|12 years ago|reply
The blog spam isn't necessary. To get the actual findings, check out the SEC post-mortem located here.

http://www.sec.gov/litigation/admin/2013/34-70694.pdf (PDF warning).

[+] greenyoda|12 years ago|reply
The posted article is much more readable than this SEC document. I wouldn't call it "blog spam" at all. (I'd define "blog spam" as a blog post that links to another article without adding any value.)