I've been wondering the same. I did just see [1], where it's apparently trying to read memory from an unmapped address, but I haven't seen anything about how r8 got to the point of having said unmapped address.
The affected update file seems to have been overwritten with 0s across the whole 42 KB file, whereas the before and after .sys files have obfuscated ays/config file info as expected.
This is a global multi-layer failure: Microsoft allowing kernel mods by third-party software, CrowdStrike not testing this, DevSecOps not doing a staged/canary deployment, half the world running the same OS, things that should not be connected to the internet but are by default. Microsoft and CrowdStrike drove a horse and a cart through all redundancy and failover designs and showed very clearly where there were no such designs in place.
While I will be the last person in line to defend Microsoft, I am not sure that disallowing 3P kernel mods is a workable solution. CrowdStrike and companies like it exist to fill a very real need within the Windows ecosystem. I don’t foresee that suddenly going away now, or Microsoft unilaterally forcing every company like CrowdStrike out of business and taking over this role itself.
Literally every OS allows you to install 3rd party kernel modules or plugins. If Microsoft banned them, people would be up in arms about them being a controlling walled garden. There is no winning.
Hello, IT, have you tried turning it on and off again 15 times?
Seriously though - this entire outage is the poster child for why you NEVER have software that updates without explicit permission from a sysadmin. If I were in congress, I would make it illegal, it's an obvious national security issue.
That's not the big no-no here. Lack of any real DRP is. Sure, it's cheaper to just buy CS Falcon (and who knows what other amazing vendor-supplied timebombs are ticking silently) than paying sysadmins and developers ... and letting them build something that does what it needs, not much else, so there's no need to put these fantastic "single agents" from these RCE-as-a-service vendors on all the fucking servers.
What % of those sysadmins are then going to turn around and script something to auto-approve those updates, once they realize that the updates are A) requested at inconvenient times and B) related to security?
Who's going to take the risk of appearing to have sat on an important update, while the org they support is ravaged by ThreatOfTheDay, because they thought they knew better than a multi-billion dollar, tops-in-their-field company?
(I'm not necessarily saying that's actually objectively correct, but I can't imagine that many folks are willing to risk the downside)
> why you NEVER have software that updates without explicit permission from a sysadmin
In general I agree, but this case is quite messy. It's more like your anti-virus had a bug since forever that if it loads a broken virus definition it bricks your system. And a broken virus definition finally happened today.
Do you want every virus definition (that is updated every few hours) to require explicit permission from a sysadmin?
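Probably not every definition, but the agent could at least refuse to hand an obviously broken definition to the kernel-side parser. A minimal sketch of that idea; the `CDEF` magic bytes, sizes, and hash scheme here are all invented for illustration, not CrowdStrike's actual format:

```python
import hashlib


def safe_to_load(blob: bytes, expected_sha256: str) -> bool:
    """Sanity-check a definition file before the kernel parser ever sees it.

    All checks are hypothetical stand-ins: a magic header, a minimum size,
    a reject for all-zero files (as reported for the bad channel file),
    and a hash published alongside the update.
    """
    if len(blob) < 8:
        return False                 # truncated file
    if blob[:4] != b"CDEF":          # assumed magic bytes
        return False
    if not any(blob):                # file of nothing but zeros
        return False
    return hashlib.sha256(blob).hexdigest() == expected_sha256
```

A gate like this doesn't need a sysadmin in the loop; it only needs the vendor to ship a checksum next to the content, and the agent to fail safe when they disagree.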
You’re learning the wrong lesson here. Automatic security updates in Debian and Ubuntu actually get tested and work.
The RCE in ssh a week ago is an argument for enabling automatic security updates. (And for defense in depth: putting everything behind a VPN, for example.)
This example is probably an argument for not running Windows on critical systems, due to an insufficient focus on security from the beginning which has led to a need for things like CrowdStrike.
They do make a version of CS for Linux but nobody runs it unless they’re forced to by overzealous compliance drones.
I understand the logic of this, but it is somewhat based on the assumption - which most industries hold in droves - that people in THAT industry are the competent bulwark against stupidity.
I consulted for a company for a while where the 'sysadmin' was the owner's mother - who bought laptops from Walmart. Not only could she NOT have approved updates like this; even if she could have, she wouldn't have had any knowledge whatsoever with which to determine whether an update worked.
In the abstract, the problem really is with externalities. These approaches to updates exist because people who CAN'T do what you describe are likely a more dominant part of the threat model than this happening to the people you do describe. The resulting fix, as we're seeing, is very reliable until it isn't... and if the "isn't" is enormous in scale, the systems aren't set up to fail gracefully.
If you want to make a rule...require graceful failure.
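"Require graceful failure" has a concrete shape: a bad input file degrades the product, it does not take the host down. A sketch with hypothetical names (whether a security product should fail open like this, or fail closed, is a real policy question; the point is only that a kernel panic is not an acceptable third option):

```python
def load_rules(data: bytes) -> bytes:
    """Hypothetical rules parser: raises on bad input instead of
    crashing the process (or, in kernel terms, the whole machine)."""
    if not data or not any(data):
        raise ValueError("empty or zeroed rules file")
    return data


def start_agent(data: bytes) -> tuple[str, str]:
    """Degrade gracefully: if the rules can't be parsed, run without
    them and raise an alert, rather than refusing to boot."""
    try:
        load_rules(data)
        return ("protected", "rules loaded")
    except ValueError as err:
        return ("degraded", f"running without rules: {err}")
```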
What would the sysadmins do in this context? Read the release notes of the update? The only thing they would do is update and then be responsible for the problem, and in that case you're back to this exact problem.
It's not like they'd read the source code or examine every file that's been changed or downloaded for a proprietary kernel module for every crowdstrike update (there must be a LOT of them).
Those focusing on QA, staged rollouts, permission management etc are misguided. Yes of course a serious company should do it but CrowdStrike is a compliance checkbox ticker.
They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”.
The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).
I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)
In a slightly less threatening but equally noxious box-checking racket, a company I work with is being sued for their website not being sufficiently ADA-compliant. But the first they heard of the lawsuit, before they were even served, was an email from a vendor who specializes in adding junk code to your website that's supposed to tick this box. The vendor happens to work closely with several of the law firms who file and defend these suits.
It’s looking like many impacted end-user machines are hard bricked unless you can get into the hard drive to delete the file causing this. Even if you can do that, it's not something that is easy (or potentially even possible) to automate at scale, so this is going to be an ugly fix for many impacted devices. This is basically the nightmare scenario for fleet management… devices broken and you can't remotely fix them. You need to send hands-on-keyboard folks into the field to touch each device.
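The widely circulated manual fix was to boot into safe mode and delete the bad channel file by hand. Scripted per machine, it's only a few lines; the "C-00000291*.sys" pattern comes from the public advisories, while the directory is parameterized here for safety (on a real host it would be the CrowdStrike folder under System32's drivers directory, and you still need hands or a remote console to reach safe mode at all):

```python
import glob
import os


def remove_bad_channel_files(driver_dir: str) -> list[str]:
    """Delete channel files matching the published workaround pattern.

    Returns the names of the files removed, so a technician can log
    what was actually touched on each machine.
    """
    removed = []
    for path in sorted(glob.glob(os.path.join(driver_dir, "C-00000291*.sys"))):
        os.remove(path)
        removed.append(os.path.basename(path))
    return removed
```

The hard part was never the deletion; it was getting to a running shell on a machine that BSODs before login, times a few million machines.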
That is what has surprised me. I can understand if small businesses were caught here because they lack financial resources for the infrastructure and staff, but those large corporations like airlines etc... Why don't they have a staging environment where everything goes first? I naively assumed this was established best practice due to the risk of update issues bricking your organization.
But maybe anti-malware is given a blind eye because instant updates for zero day security issues are obviously attractive.
Still, though... in hindsight it's not workable, especially for anything running system drivers with liberal kernel access.
It's automatic, no? The whole "promise" (oh sorry, the "added value proposition") of CS is that they "keep you safe" automatically! It was a content update. Meaning basically antivirus signatures ... and oops, some minor non-functional changes to the filtering kernel driver.
Security and Compliance gets to violate all good sense, because it's just sooo important. They can run un-reviewed un-sandboxed daemons as root on every system if they really want, they can have changes pushed automatically without review or control, because "security" is just so important, and due to "compliance" you really have no choice as your company gets larger, you just have to do it. That's why, despite being obviously pretty dumb to many skilled engineers, it seems like everyone does it. No choice. Security, Compliance. So dumb ...
So, in other words, there's a race condition in the CrowdStrike Falcon driver at startup time. That, in itself, should be a major cause for alarm, but here we are depending on it to fix this problem.
No, it takes a while to load that definition file. Before loading it, the driver _might_ be able to pull the update that fixes it. If you keep trying, the chance that this update is pulled increases.
The individual person that pressed the "go" button (if there was a person), is going to henceforth be __the best__ DevOps person to ever have on your team. They have learned a multi-trillion-dollar lesson that no amount of training could have prepared them for.
And the Crowdstrike CTO has either been given the ammunition to get __whatever they ask for, ever again__ with regard to appropriate allocation of resources for devops *or* they'll be fired (whether or not it's their fault).
And let me be very clear. This is absolutely, positively and wholly not the person that pressed the button's fault. Not even a little. At a company as integral as CrowdStrike, the number of mistakes and errors that had to have happened long before it got to "Joe the Intern Press Button" is huge and absurd. But many of us have been in (a much, much, *MUCH* smaller version of) Joe's shoes, and we know the gut sinking feeling that hits when something bad happens. A good company and team won't blame Joe and will do everything they can to protect Joe from the hilariously bad systemic issues that allowed this to happen.
>The first and easiest is simply to try to reboot affected machines over and over, which gives affected machines multiple chances to try to grab CrowdStrike's non-broken update before the bad driver can cause the BSOD.
I thought it was BSOD'ing on boot? I don't understand how this works. It auto-updates on boot? From the internet?
One of the first things the falcon driver does on boot is connect to the server, report some basic info, and start loading these data files, the "channel" files that Crowdstrike frequently updates.
The BSOD is because one of the data files that they previously pushed is horribly mangled, and their driver explodes about it. But if you get lucky, the driver can receive an update notification on boot, connect to the separate file server, and finish overwriting the broken file on disk before the rest of the driver (that would crash) has loaded the broken file.
And they do all of that very early on boot, the justification being that you don't want the antivirus to start booting after a rootkit has already installed itself.
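So the "reboot 15 times" advice is literally a race between the updater and the crashing loader, replayed once per boot. A toy simulation of that race; every delay here is invented for illustration, not measured from the real driver:

```python
import random


def boot_once(rng: random.Random, load_delay: float = 1.0) -> bool:
    """One simulated boot. The driver parses the broken channel file
    after `load_delay` seconds; the updater replaces it on disk after a
    random network-dependent delay. The boot survives iff the updater
    wins the race. All timings are illustrative."""
    update_done_at = rng.uniform(0.2, 3.0)
    return update_done_at < load_delay


def boots_until_fixed(seed: int = 0) -> int:
    """Count power cycles until one boot wins the race."""
    rng = random.Random(seed)
    boots = 1
    while not boot_once(rng):
        boots += 1          # BSOD, power-cycle, race again
    return boots
```

With these made-up numbers the updater wins roughly 29% of boots, so a handful of reboots usually suffices; that matches the flavor of the advice, not any real probability.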
All the comments are asking why run Windows. CrowdStrike runs on macOS and Linux too. It’s just that this time, CrowdStrike fucked up on Windows. That doesn't mean CrowdStrike won't fuck up on other OSes, and it seems CrowdStrike has fucked up on Linux as well. https://news.ycombinator.com/item?id=41005936
I feel like we are better off running open-source software. Everyone can see where the mistakes are instead of running around like a chicken with its head cut off.
It's surprising that people mention all kinds of bogeymen but don't mention automatic updates.
Automatic updates should be considered harmful. At the minimum, there should be staged rollouts, with a significant gap (days) for issues to arise in the consumer case. Ideally, in the banks/hospitals/... example, their IT should be reading release notes and pushing the update only when necessary, starting with their own machines in a staged manner. As one '90s IT guy I worked with used to say, "you don't roll out a new Windows version before SP1 comes out".
ziizii | 1 year ago:
As in, what exactly is wrong in these C00000291-*.sys files that triggers the crash in csagent.sys, and why?
the_plus_one | 1 year ago:
[1]: https://x.com/patrickwardle/status/1814343502886477857
MBCook | 1 year ago:
Kernel level code blindly loading arbitrary files?
Panicking when the file doesn’t parse because it’s not a memory safe language?
Not validating the files before loading them?
Not validating the files before SHIPPING them? No CI? No safety net?
No staged rollout in case of explosion?
There are far FAR bigger mistakes here than “sys admin didn’t have to press button”.
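"Validate before shipping" is the cheapest item on that list: run the exact release artifacts through the same parser the driver uses, in CI, and refuse to ship on any failure. A sketch with a stand-in parser; the `CHNL` magic bytes and format are invented, not CrowdStrike's:

```python
def parse_channel_file(blob: bytes) -> dict:
    """Stand-in for the driver's channel-file parser. It raises cleanly
    on malformed input; the real one apparently dereferenced garbage."""
    if len(blob) < 4 or blob[:4] != b"CHNL":   # assumed magic bytes
        raise ValueError("bad header")
    return {"payload": blob[4:]}


def ship_gate(artifacts: dict[str, bytes]) -> list[str]:
    """Return the artifact names that fail to parse; the release only
    proceeds when this list is empty."""
    failures = []
    for name, blob in artifacts.items():
        try:
            parse_channel_file(blob)
        except ValueError:
            failures.append(name)
    return failures
```

A zeroed 42 KB file fails this gate on the first byte, before it ever reaches a customer kernel.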
a0123 | 1 year ago:
The really important machines are still on Win 3.1.
bluedino | 1 year ago:
If I can't commit code to our app without a branch, pull requests, code review...why can the infrastructure team just send shit out willy-nilly?
"Always allow new updates" must have been checked, or someone just goes through a dashboard and blindly clicks "Approve"
fire_lake | 1 year ago:
I think the team writing the parsers for these data files deserves some blame. This should have been fuzzed, property tested, etc.
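Even a crude random-bytes fuzzer tends to find "parser dereferences garbage" bugs quickly. A toy harness over a stand-in parser; the format (a little-endian length prefix that must be bounds-checked) is invented for illustration:

```python
import random


def parse(blob: bytes) -> int:
    """Toy channel-file parser: a 4-byte little-endian record count,
    then payload. The bounds check is exactly the kind of thing fuzzing
    keeps honest; without it, the count would index past the buffer."""
    if len(blob) < 4:
        raise ValueError("truncated")
    count = int.from_bytes(blob[:4], "little")
    if count > len(blob) - 4:
        raise ValueError("count exceeds payload")
    return len(blob[4:4 + count])


def fuzz(iterations: int = 5_000, seed: int = 1) -> bool:
    """Feed random garbage to the parser. The only acceptable outcomes
    are a parsed value or a clean ValueError; any other exception
    escaping this loop is the bug the fuzzer was looking for."""
    rng = random.Random(seed)
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse(blob)
        except ValueError:
            pass
    return True
```

A harness like this runs in seconds in CI; a kernel driver's parser deserves at least this much, plus coverage-guided fuzzing and property tests.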
Klonoar | 1 year ago:
It’s not unheard of for things to slip by testing and CI.
idiotlogical | 1 year ago:
I see my org's SCCM admins have been consulted.
aeyes | 1 year ago:
Supposedly they have all kinds of certifications but not even having basic QA demonstrates that this is all just a smokeshow: https://www.crowdstrike.com/why-crowdstrike/crowdstrike-comp...
JumpCrisscross | 1 year ago:
Apparently!
mystickphoenix | 1 year ago:
"the truth is everything is breaking all the time, everywhere, for everyone"
https://www.stilldrinking.org/programming-sucks