
CrowdStrike fixes start at "reboot up to 15 times", get more complex from there

203 points | thunderbong | 1 year ago | arstechnica.com | reply

234 comments

[+] ziizii|1 year ago|reply
Has anyone discerned the root cause of this in the software?

As in, what exactly is wrong in these C-00000291-*.sys files that triggers the crash in csagent.sys, and why?

[+] crawancon|1 year ago|reply
It seems the affected 42 KB update file was overwritten with zeros, whereas the before and after .sys files contain obfuscated sys/config data as expected.
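The symptom described above, a channel file consisting entirely of null bytes, is trivial to detect before parsing. A minimal sketch (the function name and approach are illustrative, not anything CrowdStrike ships):

```python
def is_null_filled(path: str, chunk_size: int = 65536) -> bool:
    """Return True if the file is non-empty and contains only 0x00 bytes."""
    seen_any = False
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            seen_any = True
            # count(0) == len(chunk) means every byte in this chunk is zero
            if chunk.count(0) != len(chunk):
                return False
    return seen_any
```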
[+] surfingdino|1 year ago|reply
This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where there were no such designs in place.
[+] LordKeren|1 year ago|reply
While I will be the last person in line to defend Microsoft, I am not sure that disallowing 3P kernel mods is a workable solution. Crowdstrike and companies like it exist to fill a very real need within the windows ecosystem. I don’t foresee that suddenly going away now or Microsoft unilaterally forcing every company like crowdstrike out of business and taking over this role themselves
[+] santoshalper|1 year ago|reply
Literally every OS allows you to install 3rd party kernel modules or plugins. If Microsoft banned them, people would be up in arms about them being a controlling walled garden. There is no winning.
[+] Connector2542|1 year ago|reply
Hello, IT, have you tried turning it on and off again 15 times?

Seriously though - this entire outage is the poster child for why you NEVER have software that updates without explicit permission from a sysadmin. If I were in congress, I would make it illegal, it's an obvious national security issue.

[+] MBCook|1 year ago|reply
Nah. That’s not the problem.

Kernel level code blindly loading arbitrary files?

Panicking when the file doesn’t parse because it’s not a memory safe language?

Not validating the files before loading them?

Not validating the files before SHIPPING them? No CI? No safety net?

No staged rollout in case of explosion?

There are far FAR bigger mistakes here than “sys admin didn’t have to press button”.
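The "validate before loading" point above can be sketched concretely. The header layout here (4 magic bytes plus an embedded CRC32) is entirely made up, since CrowdStrike's real channel-file format is not public; the point is that a driver which fails such a check can skip the file instead of crashing:

```python
import binascii
import struct

MAGIC = b"CHNL"  # hypothetical 4-byte magic, not CrowdStrike's actual format

def validate_channel_file(data: bytes) -> bool:
    """Reject files that lack the expected magic or fail the embedded checksum."""
    if len(data) < 8 or data[:4] != MAGIC:
        return False
    (expected_crc,) = struct.unpack("<I", data[4:8])
    return binascii.crc32(data[8:]) & 0xFFFFFFFF == expected_crc
```

A null-filled file like the one reported fails the magic check immediately, so a fail-closed driver would never reach the parser that crashed.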

[+] pas|1 year ago|reply
That's not the big no-no here. Lack of any real DRP is. Sure, it's cheaper to just buy CS Falcon (and who knows what other amazing vendor-supplied timebombs are ticking silently) than paying sysadmins and developers ... and letting them build something that does what it needs, not much else, so there's no need to put these fantastic "single agents" from these RCE-as-a-service vendors on all the fucking servers.
[+] johnkizer|1 year ago|reply
What % of those sysadmins are then going to turn around and script something to auto-approve those updates, once they realize that they are A) requested at inconvenient times and B) are related to security?

Who's going to take the risk of appearing to have sat on an important update, while the org they support is ravaged by ThreatOfTheDay, because they thought they knew better than a multi-billion dollar, tops-in-their-field company?

(I'm not necessarily saying that's actually objectively correct, but I can't imagine that many folks are willing to risk the downside)

[+] rfoo|1 year ago|reply
> why you NEVER have software that updates without explicit permission from a sysadmin

In general I agree, but this case is quite messy. It's more like your anti-virus had a bug since forever where loading a broken virus definition bricks your system. And a broken virus definition finally happened today.

Do you want every virus definition (that is updated every few hours) to require explicit permission from a sysadmin?

[+] more_corn|1 year ago|reply
You’re learning the wrong lesson here. Automatic security updates in Debian and Ubuntu actually get tested and work. The RCE in ssh a week ago is an argument for enabling automatic security updates. (And for defense in depth, putting everything behind a VPN for example)

This example is probably an argument for not running Windows on critical systems, due to an insufficient focus on security from the beginning, which has led to a need for things like CrowdStrike.

They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.

[+] a0123|1 year ago|reply
They still run Windows XP (og edition, not this patched rubbish) to make sure national security isn't compromised.

The really important machines are still on Win 3.1.

[+] avs733|1 year ago|reply
I understand the logic of this, but it is somewhat based on the assumption, which most industries hold in droves, that people in THAT industry are the competent bulwark against stupidity.

I consulted for a company for a while where the 'sysadmin' was the owner's mother, who bought laptops from Walmart. Not only could she NOT have approved updates like this; even if she could have, she wouldn't have had any knowledge whatsoever with which to determine whether an update worked.

In the abstract, the problem really is externalities. These approaches to updates exist because people who CAN'T do what you describe are likely a more dominant part of the threat model than this happening to the people you do describe. The resulting fix, as we're seeing, is very reliable until it isn't... and when the "isn't" is enormous in scale, the systems aren't set up to fail gracefully.

If you want to make a rule...require graceful failure.

[+] mardifoufs|1 year ago|reply
What would the sysadmins do in this context? Read the release notes of the update? The only thing they would do is update and then be responsible for the problem, and in that case you're back to this exact problem.

It's not like they'd read the source code or examine every file that's been changed or downloaded for a proprietary kernel module for every crowdstrike update (there must be a LOT of them).

[+] travoc|1 year ago|reply
It was a data update that triggered a software bug. It was not a software update. I don't think it's reasonable to make data updates illegal.
[+] scrollaway|1 year ago|reply
Those focusing on QA, staged rollouts, permission management etc are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.

They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”. The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).

I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)

[+] noduerme|1 year ago|reply
In a slightly less threatening but equally noxious box-checking racket, a company I work with is being sued for their website not being sufficiently ADA-compliant. But the first they heard of the lawsuit, before they were even served, was an email from a vendor who specializes in adding junk code to your website that's supposed to tick this box. The vendor happens to work closely with several of the law firms who file and defend these suits.
[+] JCM9|1 year ago|reply
It’s looking like many impacted end-user machines are hard bricked unless you can get into the hard drive to delete the file causing this. Even if you can do that, it’s not something that is easy (or potentially even possible) to automate at scale, so it looks like this is going to be an ugly fix for many impacted devices. This is basically the nightmare scenario for fleet management… devices are broken and you can’t remotely fix them. You need to send hands-on-keyboard folks into the field to touch each device.
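The widely reported manual workaround (boot into safe mode or the recovery environment, then delete the bad channel file) amounts to something like the sketch below. The directory and filename pattern are the publicly reported ones; this is an illustration of the per-machine fix, not an official remediation tool:

```python
from pathlib import Path

# Publicly reported location of the broken channel file; deleting it requires
# safe mode / recovery environment and admin rights on the affected machine.
DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

def remove_bad_channel_files(driver_dir: str = DRIVER_DIR,
                             pattern: str = "C-00000291*.sys") -> list:
    """Delete matching channel files and return the names removed."""
    removed = []
    for f in Path(driver_dir).glob(pattern):
        f.unlink()
        removed.append(f.name)
    return removed
```

The catch, as the comment says, is that this has to run on a machine that can no longer boot normally, which is exactly why it doesn't automate at scale.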
[+] bluedino|1 year ago|reply
DevSecOps should have, you know, tested these updates before they were approved for release company-wide.

If I can't commit code to our app without a branch, pull requests, code review...why can the infrastructure team just send shit out willy-nilly?

"Always allow new updates" must have been checked, or someone just goes through a dashboard and blindly clicks "Approve"

[+] jug|1 year ago|reply
That is what has surprised me. I can understand if small businesses were caught here because they lack financial resources for the infrastructure and staff, but those large corporations like airlines etc... Why don't they have a staging environment where everything goes first? I naively assumed this was established best practice due to the risk of update issues bricking your organization.

But maybe anti-malware is given a blind eye because instant updates for zero day security issues are obviously attractive.

Still, though... in hindsight it's not workable, especially for anything running system drivers with liberal kernel access.

[+] pas|1 year ago|reply
It's automatic, no? The whole "promise" (oh sorry, the "added value proposition") of CS is that they "keep you safe" automatically! It was a content update. Meaning basically antivirus signatures ... and oops, some minor non-functional changes to the filtering kernel driver.
[+] ploxiln|1 year ago|reply
Security and Compliance gets to violate all good sense, because it's just sooo important. They can run un-reviewed un-sandboxed daemons as root on every system if they really want, they can have changes pushed automatically without review or control, because "security" is just so important, and due to "compliance" you really have no choice as your company gets larger, you just have to do it. That's why, despite being obviously pretty dumb to many skilled engineers, it seems like everyone does it. No choice. Security, Compliance. So dumb ...
[+] fire_lake|1 year ago|reply
Maybe it was checked but the CI didn’t cover this edge case.

I think the team writing the parsers for these data files deserves some blame. This should have been fuzzed, property tested, etc.

[+] Klonoar|1 year ago|reply
Who says they sent it out willy-nilly?

It’s not unheard of for things to slip by testing and CI.

[+] munchler|1 year ago|reply
So, in other words, there's a race condition in the CrowdStrike Falcon driver at startup time. That, in itself, should be a major cause for alarm, but here we are depending on it to fix this problem.
[+] rahkiin|1 year ago|reply
No, it takes a while to load that definition file. Before loading it, the driver _might_ be able to pull the update that fixes it. If you keep trying, the chance that this update gets pulled in time increases.
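The "reboot up to 15 times" advice is just repeated trials of that race: if each boot wins it with probability p, the chance of at least one win in n boots is 1 - (1 - p)^n. The per-boot probability here is an assumed number, not anything CrowdStrike published:

```python
def chance_of_fix(p_per_boot: float, reboots: int) -> float:
    """Probability that at least one of `reboots` attempts pulls the fixed
    channel file before the broken one is loaded."""
    return 1 - (1 - p_per_boot) ** reboots
```

With an assumed 20% chance per boot, 15 reboots give about a 96% chance of recovery, which is roughly why "up to 15 times" is plausible advice.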
[+] t-writescode|1 year ago|reply
The individual person that pressed the "go" button (if there was a person), is going to henceforth be __the best__ DevOps person to ever have on your team. They have learned a multi-trillion-dollar lesson that no amount of training could have prepared them for.

And the Crowdstrike CTO has either been given the ammunition to get __whatever they ask for, ever again__ with regard to appropriate allocation of resources for devops *or* they'll be fired (whether or not it's their fault).

And let me be very clear. This is absolutely, positively and wholly not the person that pressed the button's fault. Not even a little. At a company as integral as CrowdStrike, the number of mistakes and errors that had to have happened long before it got to "Joe the Intern Press Button" is huge and absurd. But many of us have been in (a much, much, *MUCH* smaller version of) Joe's shoes, and we know the gut sinking feeling that hits when something bad happens. A good company and team won't blame Joe and will do everything they can to protect Joe from the hilariously bad systemic issues that allowed this to happen.

[+] ilkkao|1 year ago|reply
Some government should force them to release a technical postmortem. It feels like they won't do it otherwise.
[+] educasean|1 year ago|reply
There should be congressional hearings on this. Not just post mortems.
[+] gen3|1 year ago|reply
I don’t think a cybersecurity company can take down half the US and not release a postmortem
[+] AlienRobot|1 year ago|reply
>The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike's non-broken update before the bad driver can cause the BSOD.

I thought it was BSOD'ing on boot? I don't understand how this works. It auto-updates on boot? From the internet?

[+] tux3|1 year ago|reply
One of the first things the falcon driver does on boot is connect to the server, report some basic info, and start loading these data files, the "channel" files that Crowdstrike frequently updates.

The BSOD is because one of the data files that they previously pushed is horribly mangled, and their driver explodes about it. But if you get lucky, the driver can receive an update notification on boot, connect to the separate file server, and finish overwriting the broken file on disk before the rest of the driver (that would crash) has loaded the broken file

And they do all of that very early on boot. The justification being that you don't want the antivirus to start booting after a rootkit has already installed itself
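In pseudocode-ish terms, the early-boot race described above looks roughly like this. All names are illustrative models of the behavior, not CrowdStrike internals:

```python
def boot_sequence(update_wins_race: bool) -> str:
    """Illustrative model of the early-boot race: whichever of the updater
    and the config loader touches the channel file first decides whether
    the machine BSODs or comes up clean."""
    channel_file = {"broken": True}

    def updater():
        # Connects out and overwrites the broken file, if it gets there in time.
        channel_file["broken"] = False

    def loader():
        # The part of the driver that parses the file and crashes on bad data.
        if channel_file["broken"]:
            raise RuntimeError("BSOD: driver parsed the broken channel file")

    if update_wins_race:
        updater()
    loader()
    return "booted"
```

Each reboot is one run of this race, which is why repeated reboots eventually succeed.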

[+] Ekaros|1 year ago|reply
I love how this is the solution for security, while sounding like the most insanely insecure thing...
[+] ukuina|1 year ago|reply
Yes, that is as bad as it sounds.
[+] JumpCrisscross|1 year ago|reply
> It auto-updates on boot? From the internet?

Apparently!

[+] TestingWithEdd|1 year ago|reply
Does this mean a computer without internet access and with CrowdStrike would be unable to start up?
[+] peterleiser|1 year ago|reply
They should change their name to "IT CrowdStrike"
[+] greenavocado|1 year ago|reply
Who bought massive quantities of put options in anticipation of this event?
[+] smsm42|1 year ago|reply
Wow we're progressing from "if it doesn't work just reboot it" to "if the reboot doesn't fix it, you're just not rebooting it hard enough!"
[+] devwastaken|1 year ago|reply
Fine CrowdStrike for 10% of their company's value. Only way to ensure they won't try to kill people in the future.
[+] MangoCoffee|1 year ago|reply
All the comments are asking why run Windows. CrowdStrike runs on macOS and Linux too. It’s just that this time, CrowdStrike fucked up on Windows. That doesn't mean CrowdStrike won't fuck up on other OSes, and it seems CrowdStrike has fucked up on Linux as well. https://news.ycombinator.com/item?id=41005936

I feel like we are better off running open-source software. Everyone can see where the mistakes are instead of running around like a chicken with its head cut off.

[+] seydor|1 year ago|reply
I would like to have the power to press the button that deploys this update
[+] sershe|1 year ago|reply
It's surprising that people mention all kinds of bogeymen but don't mention automatic updates.

Automatic updates should be considered harmful. At a minimum, there should be staged rollouts, with a significant gap (days) for issues to arise in the consumer case. Ideally, in the banks/hospitals/... case, their IT should be reading release notes and pushing the update only when necessary, starting with their own machines in a staged manner. As one '90s IT guy I worked with used to say, "you don't roll out a new Windows version before SP1 comes out".
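The staged rollout being described can be sketched as a set of rings with a promotion gate. The ring names, fractions, and crash-rate threshold below are all made up for illustration:

```python
# Hypothetical rollout rings: (name, fraction of fleet covered so far)
RINGS = [
    ("internal", 0.001),  # vendor's own machines first
    ("canary",   0.01),   # ~1% of customer fleet
    ("early",    0.10),
    ("general",  1.00),
]

def next_ring(current_ring: str, crash_rate: float,
              threshold: float = 0.001):
    """Promote an update to the next ring only if the observed crash rate
    in the current ring stays at or below the threshold; otherwise halt.
    Returns the next ring name, or None if halted / fully rolled out."""
    names = [name for name, _ in RINGS]
    i = names.index(current_ring)
    if crash_rate > threshold:
        return None  # halt the rollout, broken update never reaches everyone
    if i + 1 < len(RINGS):
        return names[i + 1]
    return None  # already at full rollout
```

Under a scheme like this, an update that BSODs every machine it touches would have died in the first ring instead of reaching the whole fleet at once.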