Why the CrowdStrike bug hit banks hard

jmuguy|1 year ago

Maybe the IT departments at the affected orgs take solace in the fact that so many other orgs had issues that the heat is off - but in my opinion this was still a failure of IT itself. There's no reason that update should have been pushed automatically to the entire fleet. If Crowdstrike's software doesn't give you a way to rollout updates on a portion of your network before the entire fleet, it shouldn't be used.

candiddevmike|1 year ago

The update bypassed the controls orgs had in place to defer/schedule updates, AFAIK.

lucianbr|1 year ago

Management decides to use Crowdstrike, not IT, and IT has no way to rollout updates in controlled fashion.

So not really a failure of IT, at least not for this reason.

toddmorey|1 year ago

Was anyone else surprised how little disruption they personally experienced? I had braced for impact that weekend. But all my flights were perfectly on time, all my banking worked, providers worked, and sites & resources were available.

I don’t know if I somehow just have little exposure to Windows in my life or if there’s an untold resiliency story for the global internet in the face of such a massive outage.

All I can say is THANK YOU to all the unsung heroes who answered the call and worked their butts off. Infrastructure doesn’t work without you. We see you & we thank you!

davio|1 year ago

I was unaffected on my work laptop. One of my coworkers is a long-timer and said when the company first got laptops there was a huge "OMG leave your laptops on overnight" push to make sure updates were applied. I always at least sleep if not shut-down after work so I guess I missed out

blackoil|1 year ago

IIRC only 5% of Windows machines were affected. So, it is very probable that most people just saw the news but have no real impact on them. Some had minor and maybe memorable impact, like Indian airlines giving handwritten boarding passes.

yowzadave|1 year ago

I wish that were the case for me! My in-laws had their flight out of JFK delayed by 2-3 days, as did my daughter who was supposed to fly as an unaccompanied minor.

duderific|1 year ago

I was flying back to the US from Mexico on United the day after the meltdown. Reading the news, I was obviously quite concerned about how it was going to go (I was traveling with my 10 and 6 year old kids). Amazingly, everything went off without a hitch; not even the slightest delay.

I asked the guy at the luggage counter, and he said the day before was pretty crazy, but they had everything straightened out by the next day.

marcosdumay|1 year ago

People found a really quick workaround. It would take a couple more days to fix if there wasn't any.

bdjsiqoocwk|1 year ago

> all my banking worked

Vanguard.co.uk was down.

But yes, I echo your feelings. When you examine how complex everything is under the hood it's almost unbelievable that anything works.

mikeocool|1 year ago

I had a 7am flight on Delta from LGA to MSP. Seeing all of the blue screens in the airport was pretty surreal and our flight was delayed four hours.

But yeah, other than that, the only issue we ran into was that the Jimmy John’s we stopped at for lunch outside of MSP was slammed because Delta had ordered hundreds of sandwiches for their staff.

I’ve definitely experienced much worse travel disruptions due to normal weather (though obviously we got real lucky compared to some Delta customers).

pantulis|1 year ago

This is a good writeup, but to be fair it's just not a matter of banking regulations. Basically all big companies are under similar obligations regarding endpoint protection.

candiddevmike|1 year ago

Should endpoint protection require kernel level access? At what point does it stop becoming protection and start becoming a liability? Obligatory who watches/protects the watchmen/protector...

SkyPuncher|1 year ago

Basically all B2B companies are under some sort of obligation to have endpoint protection.

All of these requirements essentially become transitive across a company's entire supply chain.

* Big bank needs to comply with X, so do all of their vendors.

* Vendor wants to sell to big bank, so they comply with X. They also need all of their vendors to comply with X.

* So on and so on.

----

Ultimately, there are a lot more options than CrowdStrike, but this is a case of "Nobody gets fired for buying IBM". Even if CrowdStrike isn't the "best", it's good enough. Because it's use is sooo widespread, an issue with it often affects dozens and dozens of other companies when you're affected. One of the great things about this effect is everyone "goes down at the same time", so people don't tend to point fingers at you. In fact, they might not have any clue you're down because some other, more critical system is down internally and preventing them from accessing you.

I remember a similiar situation happening a few years back. A big outage hit large parts of the internet. A pretty major part of our app got taken offline with this outage. This was a known risk and something that we accepted. We expected some backlash and inquires if this situation should ever happen. It was a calculated risk to dedicate more effort towards building customer-facing value.

I think we got one inquiry. It was basically just an FYI. This person had so many things broken on their end that "one more thing" being broken was just a drop in the bucket.

SoftTalker|1 year ago

If not regulations, then demands by insurers for cyberattack insurance coverage.

taeric|1 year ago

Any explanation that doesn't boil this down to "software required by corporate policy checklist not written by technical team" is almost certainly missing something here. This is almost definitionally policy capture by a security team and the all too common consequences that attach.

The section that goes over why this wasn't federally pushed is largely accurate, mind. Not all capture is at the federal level. Is why you can get frustrated with customer support for asking you a checklist of unrelated questions to the problem you have called in.

And the super frustrating thing is that these checklists are often very effective for why they exist.

shadowgovt|1 year ago

Just as a general comment on this whole affair:

This would be the third incident I'm familiar with of a file of entirely zeroes breaking something big.

Folks, as much as we wish it weren't true, null comes up all the damn time, and if you don't have tests trying to force-feed null into your system in novel and exciting ways, production will demonstrate them for you.

Never assume 'zero' (for whatever form zero takes in context) can't be an input.

tempodox|1 year ago

As long as the botchers get away with impunity, they won't “waste” resources on higher standards.

saltminer|1 year ago

> This created a minor emergency for me, because it was an other-than-minor emergency for some contractors I was working with.

> Many contractors are small businesses. Many small businesses are very thinly capitalized. Many employees of small businesses are extremely dependent on receiving compensation exactly on payday and not after it. And so, while many people in Chicago were basically unaffected on that Friday because their money kept working (on mobile apps, via Venmo/Cash App, via credit cards, etc), cash-dependent people got an enormous wrench thrown into their plans.

I never really thought about not having to worry about cashflow problems as a privilege before, but it makes sense, considering having access to the banking system to begin with is a privilege. I remember my bank's website and app were offline, but card processing was unaffected - you could still swipe your cards at retailers. For me, the disruption was a minor annoyance since I couldn't check my balance, but I imagine many people were probably panicking about making rent and buying groceries while everything was playing out.

HeyLaughingBoy|1 year ago

The really admirable thing about this is that Patrick acknowledged that it was "an other-than-minor emergency" for the contractors and took steps to ensure that they were paid rapidly. In a similar situation many people would have shrugged and taken an attitude of "sorry, bank's down. I'll pay you when it comes back up."

MadVikingGod|1 year ago

While reading this I was struck with an interesting question: What risk does any particular software vendor pose to an industry at large?

For example (making up numbers here): if 75% of all airline computers have croudstrike falcon installed that seems like a very concentrated risk.

I actually wouldn't be surprised if we had this we would see really high concentrations of a small number of vendors in any industry.

anticristi|1 year ago

The EU DORA regulation (Digital Operational Resilience Act for Financial Entities) has explicit provisions to avoid concentration risks. I heard a story that a bank was forced to use Google Cloud, because two other banks were already on AWS and Azure.

saltminer|1 year ago

Alternatively, if Oracle hikes the price on an industry-specific product by 75%, how much of that industry goes under?

adrr|1 year ago

Did it really hit banks hard? Core banking systems don't run windows, they run on mainframes typically on IBM z/OS. I know it hit the financial firms hard and knocked out their trading systems but I don't know of any major bank losing their core bank system due to crowdstrike.

Australia got hit hard because they modernized their bank systems and now most are cloud based. I am not aware of any major bank running their core systems on the cloud or on windows.

tempodox|1 year ago

> they modernized their bank systems

You mean they made them more vulnerable?

josephthejoe|1 year ago

[deleted]

HeyLaughingBoy|1 year ago

> Configuration bugs are a disturbingly large portion of engineering decisions which cause outages

I work in medical device software -- the stuff that runs on machines in hospital labs, ER's or at patient bedside.

The first "ohmigod do we need to recall this?" bug I remember was an innocuous piece of code that was inserted to debug a specific problem, but which was supposed to be disabled in the "non-debug" configuration.

Then somehow, the software update shipped with a change to the configuration file that enabled that code to run. Timing-critical debug code running on a real-time system with a hard deadline is a recipe for disaster.

Thankfully, we got out of that pretty easily before it affected more than a small handful of users, but things could have been a lot worse.

kristaps|1 year ago

The article specifically mentions US banks and as I personally didn't see any disruption over here - is there (anec)data on how popular CrowdStrike is in the US vs the EU?

Muromec|1 year ago

Can't have disruption from CrowdStrike if you run on IBM mainframes with cobol coz your math only opens gates for new technologies once in 25 years.

Ekaros|1 year ago

Might be question what type of disruption it is. Transfers and web bank is likely to work. Branches offices and ATMs might have issues. So if you try to do anything in person or negotiate anything with workers in bank there could be issues.

bob1029|1 year ago

I feel like this only impacted the larger banks. I've heard absolutely no explosion noises coming from smaller institutions. The effect of regulations and their enforcement is felt differently across the spectrum.

There is something to be said for a diverse banking industry when it comes to this kind of problem. Also, this event is a powerful argument for keeping the core systems on unusual mainframe architectures. I think building a bank core on windows would be a really bad choice, but some vendors have already done this.

unknown|1 year ago

[deleted]

waihtis|1 year ago

Regulations are a big reason why this happened, sure, but also it hit the companies with great security budgets more.

Hospitals, for instance, weren't that widely affected as they barely have any money to buy security tooling.

Silver linings and all that, I guess.

cookiengineer|1 year ago

> Hospitals

Everybody seems to be quick to forget about WannaCry.

hpen|1 year ago

We blame car manufacturers for defects from suppliers, but we don't blame platform manufacturers (Microsoft) for holes in their architecture?

cibyr|1 year ago

You don't blame your car's manufacturer if it won't start because the monitoring dongle your insurance provider sent you in exchange for a discount drained the car's battery.

SirMittens|1 year ago

I think that's the wrong analogy. A more correct one would be "Should we blame a car company for a broken engine, that was modified after it was sold to you?".

A kernel level driver from a 3rd party is something that you willingly add to the OS, it wasn't there.

Just because windows allow you to do it, doesn't mean you should.

I mean, you can apply some dangerous mods to your car's engine, but you probably shouldn't, and if you do, it's your responsibility, not the car company.

vel0city|1 year ago

If I add a NOS kit to my car and it blows up my engine, is that Honda's fault?

Retr0id|1 year ago

> For historical reasons, that area where almost everything executes is called “userspace.”

It's an old term at this point, but I don't think the reasons for it being called "userspace" have changed or become outdated since then, so I wouldn't call them historic per se.

Macha|1 year ago

Things have gotten messier with virtualization, containerisation, hypervisors etc. The internet loves to produce pedants to argue the post should go into the finer points of these even when it's not relevant to the message. And so people like the author have a defensive reflex to throw in some language to bounce the pedants away.

SoftTalker|1 year ago

I used to like Patrick's posts but lately they are way to long and full of irrelevant minutia.

Decide who you're writing for, and write to that audience.

shadowgovt|1 year ago

Why is it called "userspace" when all it runs is some Docker containers hosting a web frontend's server, and no human being ever telnets into it? Where's the "user" in that story?

Where is the "user" when the machine is a Windows box stuffed behind a façade wall that displays airport directions, notifications, and ads on rotate?

unknown|1 year ago

[deleted]

btbuildem|1 year ago

The takeaway from this article seems to be: buy crowdstrike shares, because major corps are unable to make any changes, and will continue to pay licensing fees for this "service" for the foreseeable future.

tootie|1 year ago

This is going to crush their sales pipeline and lead to at least a few attempting a migration off. Crowdstrike is unlikely to go out of business, but this is not a good time to buy.

candiddevmike|1 year ago

The lawsuits alone are going to be eyewatering. But sure, buy those shares.

unknown|1 year ago

[deleted]

unknown|1 year ago

[deleted]

deepsun|1 year ago

I'm still amazed how the blame shifted from Microsoft to CrowdStrike. Yes, CrowdStrike update caused that -- but applications fail all the time. It was Microsoft's oversight to put it on Windows critical path.

And banks/airlines etc were hit hard because their _Windows_ didn't boot, not because of an application crash on a perfectly working Windows.

ctxc|1 year ago

The application (Crowdstrike) was part of Windows' booting process.

Windows cannot simply "skip" failed drivers. Say Crowdstrike driver failed as a one time thing, Windows skipped it instead of retrying which led to the endpoint being vulnerable and a ransomware happens. We'd be saying the opposite now.

This is a high-impact ability Windows offers to applications - and applications should take responsibility and treat it as such.

I spoke to another EDR lead I know - they said they had provisions in place to read the dump if boot crashed, check if it was due to their driver and skip it if it was (and then send telemetry after startup so that it can be fixed, probably). Crowdstrike should have done the same.

One more thing to note is that we cannot say Windows shouldn't provide this ability - that becomes an anti-trust monopoly, because MS themselves are a competitor in this space.

makeitdouble|1 year ago

Windows could sure handle this kind of error better, but IMHO it would be a mistake to require Microsoft to absolutely block any path Windows could be crashing due to third party software.

We'd end in a situation similar to Mac OS where there's a single gatekeeper and whole industries are subjected to the will of the platform owner.

Enterprises have chosen Windows because of that flexibility and control, while having a business partner they don't get with linux. If anything the blame should fall on them for getting hosed even as they fully had the means to avoid that situation.

CydeWeys|1 year ago

I don't think "Microsoft should lock down Windows so hard" is the solution we want here. I don't want my desktop OS to be a walled garden like iOS is. I want to be able to install software on it that does anything I need to be able to do -- and yes, having that capability to run software at the lowest possible level in the OS does also mean that that software has extra responsibility to be well-behaved, as the OS can't protect the system from it. But I still would rather have that option than not have it (and also I wouldn't use CrowdStrike).

klodolph|1 year ago

How did Microsoft put it on the Windows critical path? (Informational question—I’m not following the issue super closely, but I thought CrowdStrike was a third-party system. Crowdstrike was wrong to put so much code in the kernel. Microsoft was reportedly legally bound to provide this access and allow third-party code to run in the kernel.)

umanwizard|1 year ago

Microsoft is not who made the decision to put this on Windows' critical path; CrowdStrike was. Nothing stops you from running whatever dodgy third-party kernel modules you like on Linux or FreeBSD and they could easily cause the same sort of problem.

imiric|1 year ago

To be fair, AFAIK the CrowdStrike driver was WHQL-certified. The loophole is that the driver loaded files at runtime, which made it impossible to predict every failure scenario.

Maybe this is the loophole that needs closing. You can't claim a driver is certified for Windows if the manufacturer can push arbitrary files that change its behavior. Especially if that manufacturer has sloppy development practices.

I understand that a primary goal of endpoint monitoring software is to be able to quickly react to new threats, and that the turn around time for Windows certification is surely unacceptable in this scenario, but this functionality can never be allowed to jeopardize the stability of the system it's supposed to protect. So it's ultimately on Microsoft to fix this for their users.

Cthulhu_|1 year ago

In the article it states that Microsoft HAD to allow Crowdstrike to run in kernelspace by EU laws, because else MS would have the monopoly on kernel-level security solutions / integrations.

asr|1 year ago

Not MSFT’s fault: https://stratechery.com/2024/crashes-and-competition/

999900000999|1 year ago

I’ve used this analogy before.

If I sell you a bike and you remove the breaks you can’t sue me when you crash.

Any OS which allows users to do what they generally want to do, also allows users to fubar their own systems.

Saris|1 year ago

What about the previous crowdstrike bugs that hit Linux systems in a similar fashion?

I don't understand how this has anything to do with Windows, Crowdstrike is the one who built the application.

btbuildem|1 year ago

Isn't corporate malware by definition on the "critical path"? The article outlines the reasons why that jank runs in kernel space, and why MS is unable to "downgrade" it to userspace.

ClumsyPilot|1 year ago

This is the comment I expected, begging to handover your freedoms to run software to a big carry.

If you replace parts in your BMW, and put in some garbage or incompatible parts, it your fault if it doesn’t run.

You expect to sue your mechanic if he messed up, and for him to cover the full cost. For some reason people do not expect CrowdStrike to pay for their stupidity, which is the root of the problem. And the management that installed crowdstrike without due diligence

ClumsyPilot|1 year ago

This is the comment I expected, begging to handover your freedoms to run software to a big carry.

If you replace parts in your BMW, and put in some garbage or incompatible parts, it your fault if it doesn’t run.

You expect to sue your mechanic if he messed up, and for him to cover the full cost. For some reason people do not expect CrowdStrike to pay for their stupidity, which is the root of the problem. And the management that installed crowdstrike without due diligence

unknown|1 year ago

[deleted]

ortusdux|1 year ago

Is there any merit to Microsoft's argument that the EU forced them into keeping their kernel accessible by 3rd parties?

https://www.theregister.com/2024/07/22/windows_crowdstrike_k...

gciguy|1 year ago

Dave's Garage has a great video on this: https://www.youtube.com/watch?v=wAzEJxOo1ts

kmeisthax|1 year ago

Microsoft didn't write the Falcon sensor software nor did they put it in the kernel. In fact, Microsoft has been shouting to the heavens trying to shift the blame from CrowdStrike onto the European Commission, because they want people to irrationally hate antitrust so they can turn Windows into shitty iOS and monopolize the security market (and applications market) for it.

Furthermore, Microsoft does actually have some rules regarding what you can and can't put into a signed kernel driver. Specifically, they won't sign kernel code unless they've seen and tested it first. CrowdStrike deliberately circumvented this rule by implementing their own configuration format - really, just a fancy way of loading code into the kernel that Microsoft doesn't have signing control over.

If there is blame to be had here for Microsoft, maybe it's that their kernel code signing program doesn't scrutinize third-party configuration formats hard enough. I mean, if you sign a code loader, you're really signing all possible programs, making code signing irrelevant. And configuration is more often than not, code in a trenchcoat. It's often Turing-complete, and almost certainly more complicated than the actual programming languages used to write the compiled code being signed off on.

But at the same time I imagine Microsoft tried this and got pushback. That might be why they feel (incorrectly) like they can blame the EU for this. Every third-party security solution does absolutely unspeakable things in kernel space that no one with actual computer science training would sign off on, using configuration to wrestle signing control away from Microsoft. Remember: Crowdstrike is designed to backdoor Windows systems so that their owners know if an attack has succeeded, not to make them more secure from attacks in the first place. Corporations are states[0], and states fundamentally suffer from poor legibility: they own and operate far too much stuff for a tribe[1] of humans to meaningfully control or remember.

The problem is that we have two different entities that all have the ability to stop this madness. When states run into this situation, they impose "joint and several liability", which means "I don't care how we precisely assign blame, I'm just going to say you all caused it and move on". In other words, it's Microsoft's fault and it's CrowdStrike's fault.

[0] ancaps fite me

[1] Maximally connected social graph with node degree below Dunbar's number.

hulitu|1 year ago

I think they said it was a windows driver, not a normal application. Running crap in kernel mode does not end well on any OS.

hpen|1 year ago

This is a valid opinion and I don't know why you were downvoted (well other than the hacker news bubble mindset (or mindless-set).

How is Microsoft not to blame, it's their product? We wouldn't blame a Toyota supplier for a failure in a car, but we somehow segment that in the software world?

jsmith99|1 year ago

Yes, use a different operating system, one that gracefully handles null pointer dereferencing by third party kernel modules? /s

unknown|1 year ago

[deleted]

voytec|1 year ago

> Another way is if it has recently joined a botnet orchestrated from a geopolitical adversary of the United States after one of your junior programmers decided to install warez because the six figure annual salary was too little to fund their video game habit.

Fictional statements like this make me reluctant to read further, and ignore source of such "news" in the future.

bdamm|1 year ago

It's obviously fictional, but let's call it contemporary drama based on a true story. I thought the point was well made. The author already noted this was a handwaving segment.

samspot|1 year ago

I got in trouble for something like this early in my career (running bittorrent over my work vpn).

davidgerard|1 year ago

what makes you think it was fictional?

also, bragging about your inability to read text seems an odd way to interact.

__MatrixMan__|1 year ago

I like the technical stuff here.

I'm not so sure about this:

> money is core societal infrastructure, like the power grid and transportation systems are. It would be really bad if hackers working for a foreign government could just turn off money.

Sure, it would be inconvenient in the short term. But I think the current design is holding us back.

I suspect that most of us would have more to gain than to lose if we managed to shut off money-as-we-know-it and keep it off for long enough to iterate on alternatives. Any design that even tried to step beyond "well that's how we've always done it" would likely land somewhere better than what we're doing. Much has changed since Alexander Hamilton.

Joker_vD|1 year ago

In the early 90's Russia, essentially, voided almost all of the Soviet money that remained in monetary system (most of which were bank deposits; they simply vanished with zero compensation), allowing rather small upper limit on the amount of old Soviet roubles one person was allowed to exchange for the new Russian roubles.

Believe it or not, that really did not help the low and low-middle classes with their growing financial problems; and the upper-middle and top classes mostly operated in dollars (or less often, in deutschmarks) by this time anyhow, so that didn't inconvenience them much at all.

gadders|1 year ago

In the short term people would probably starve to death.

imtringued|1 year ago

I agree there needs to be more competition, but that doesn't mean you need to get rid of the old way. It is better when two approaches run in parallel, to compensate the other's shortcomings.

vel0city|1 year ago

The uber-wealthy don't have most of their assets in currency. Its in stocks, houses, cars, boats, etc. Delete the dollars, it'll hurt them a bit, but in the end they still have a house(es).

But now all those people who were using currency to trade for housing now suddenly need to find a new way to trade for shelter.

Who got hurt worse here?

238 comments