The moment I read 'it is a content update that causes the BSOD, deleting it solves the problem', I was immediately willing to bet a hundred quid (for the non-British, that's £100) that it was a combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data (in this case, reading an array of pointers without verifying that every one of them was both non-null and pointed to valid data/code).
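That hypothesised failure can be sketched as a toy parser. This is purely illustrative (in Rust): the format, field sizes, and checks here are invented and are not CrowdStrike's actual channel-file layout; the point is only that every offset read from untrusted bytes gets validated before use.

```rust
// Illustrative only: a toy "content file" holding a 4-byte count followed by
// 32-bit offsets into a data section. A careless parser dereferences the
// offsets blindly; a careful one validates every single one first.
fn parse_offsets(file: &[u8]) -> Result<Vec<u32>, &'static str> {
    // Need at least a 4-byte count header.
    if file.len() < 4 {
        return Err("truncated header");
    }
    let count = u32::from_le_bytes([file[0], file[1], file[2], file[3]]) as usize;

    // Reject counts that can't possibly fit in the file (overflow-safe).
    let table_bytes = count.checked_mul(4).ok_or("count overflows")?;
    let data_start = 4usize.checked_add(table_bytes).ok_or("table overflows")?;
    if data_start > file.len() {
        return Err("offset table truncated");
    }

    let mut offsets = Vec::with_capacity(count);
    for i in 0..count {
        let p = 4 + i * 4;
        let off = u32::from_le_bytes([file[p], file[p + 1], file[p + 2], file[p + 3]]);
        // The checks the hypothetical bad parser skipped:
        // non-null, and pointing inside the data section.
        if off == 0 {
            return Err("null offset");
        }
        if (off as usize) < data_start || (off as usize) >= file.len() {
            return Err("offset out of bounds");
        }
        offsets.push(off);
    }
    Ok(offsets)
}
```

Note that under this toy layout an all-zero file simply parses as "zero entries"; the point is that any malformed offset comes back as an `Err` instead of being dereferenced.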
In the past ten years or so of doing somewhat serious computing and zero cybersecurity whatsoever, my mind has reached a conclusion; feel free to disagree.
Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
This includes everything from: decompression algorithms; font outline readers; image, video, and audio parsers; video game data parsers; XML and HTML parsers; the various certificate/signature/key parsers in OpenSSL (and derivatives); and now, this CrowdStrike content parser in its EDR program.
That wager stands, by the way, and I'm happy to up the ante by £50 to account for my second theory.
There are at least five different things that went wrong simultaneously.
1. Poorly written code in the kernel module crashed the whole OS and kept trying to parse the corrupted files, causing a boot loop, instead of handling the error gracefully and deleting the files or marking them as corrupt.
2. Either the corrupted files slipped through internal testing, or there is no internal testing.
3. Individual settings for when to apply such updates were apparently ignored. It's unclear whether this was a glitch or standard practice. Either way, I consider it a bug (it's just a matter of whether it's a software bug or a bug in their procedures).
4. This was pushed out everywhere simultaneously instead of staggered to limit any potential damage.
5. Whatever caused the corruption in the first place, which is anyone's guess.
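What "handling the error gracefully" in point 1 could look like is sketched below. Everything here is hypothetical (the names, the quarantine mechanism, and the stand-in validator are all invented for illustration): a file that fails validation is set aside and the previous known-good definitions stay active, rather than the parse being retried on every boot.

```rust
// Hypothetical recovery logic: if a content file fails to parse, quarantine
// it and continue with the last known-good set instead of crashing the host.
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LoadOutcome {
    Loaded,
    Quarantined(&'static str), // reason the file was rejected
}

fn load_content(
    name: &str,
    bytes: &[u8],
    active: &mut HashMap<String, Vec<u8>>,
    quarantine: &mut Vec<String>,
) -> LoadOutcome {
    match validate(bytes) {
        Ok(()) => {
            active.insert(name.to_string(), bytes.to_vec());
            LoadOutcome::Loaded
        }
        Err(reason) => {
            // Mark as corrupt and move on -- previously loaded
            // definitions remain in `active` untouched.
            quarantine.push(name.to_string());
            LoadOutcome::Quarantined(reason)
        }
    }
}

// Stand-in validator: a real one would check magic bytes, lengths, offsets...
fn validate(bytes: &[u8]) -> Result<(), &'static str> {
    if bytes.is_empty() || bytes.iter().all(|&b| b == 0) {
        return Err("empty or all-zero content file");
    }
    Ok(())
}
```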
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures.
For the record, the top 25 common weaknesses for 2023 are listed at:
* https://cwe.mitre.org/top25/archive/2023/2023_top25_list.htm...
Deserialization of Untrusted Data (CWE-502) was number fifteen. Number one was Out-of-bounds Write (CWE-787); Use After Free (CWE-416) was number four.
CWEs that have been in every list since they started doing this (2019):
* https://cwe.mitre.org/top25/archive/2023/2023_stubborn_weakn...
> Approximately 100% of CVEs, crashes, bugs, [...], deserialising binary data
I'd make that 98%. Outside of rounding errors in the margins, the remaining two percent is made up of logic bugs, configuration errors, bad defaults, and outright insecure design choices.
Disclosure: infosec for more than three decades.
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
I wouldn't blame imperative programming.
E.g. Rust is imperative, and pretty good at telling you off when you forget a case in your switch.
By contrast, the variant of Scheme I used twenty years ago was functional, but didn't have checks for covering all cases. (And Haskell's GHC didn't have that check turned on by default a few years ago. Not sure if they've changed that.)
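The "telling you off" point can be seen in a toy example (the `Record` enum is invented for illustration): Rust's exhaustiveness checking turns a forgotten case into a compile error rather than a runtime surprise.

```rust
// Adding a new variant to this enum makes every `match` that doesn't
// cover it a compile error, so the "forgotten edge case" is caught
// before the code ever runs.
enum Record {
    Signature(u32),
    Heuristic(u32),
    Comment, // comment/padding record carrying no payload
}

fn describe(r: &Record) -> &'static str {
    match r {
        Record::Signature(_) => "signature",
        Record::Heuristic(_) => "heuristic",
        // Deleting this arm (or adding a fourth variant above without
        // handling it here) fails to compile.
        Record::Comment => "comment",
    }
}
```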
I can't decide what's more damning: the fact that there was effectively no error/failure handling, or this:
> Note "channel updates ...bypassed client's staging controls and was rolled out to everyone regardless"
> A few IT folks who had set the CS policy to ignore latest version confirmed this was, ya, bypassed, as this was "content" update (vs. a version update)
If your content updates can break clients, they should not be able to bypass staging controls or policies.
Also go read Parse, Don't Validate: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
This problem has a promising solution, WUFFS, "a memory-safe programming language (and a standard library written in that language) for Wrangling Untrusted File Formats Safely."
HN discussion: https://news.ycombinator.com/item?id=40378433
HN discussion of the Wuffs implementation of a PNG parser: https://news.ycombinator.com/item?id=26714831
No bet. There are two failures here: (1) failing to check the data for validity, and (2) failing to handle an error gracefully.
Both of these are undergraduate-level techniques. Heck, they are covered in most first-semester programming courses. Either of these failures is inexcusable in a professional product, much less one running with kernel-level privileges.
Bet: CrowdStrike has outsourced much of its development work.
This. One year ago, UK air traffic control collapsed due to an inability to properly parse a "faulty" flight plan: https://news.ycombinator.com/item?id=37461695
>Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
People are target-fixating too much. Sure, this parser crashed and caused the system to go down. But in an alternative universe they push a definition file that rejects every openat() or connect() syscall. Your system is now equally dead, except it probably won't even have the grace to restart.
The whole concept of "we fuck with the system in kernel based on data downloaded from the internet" is just not very sound and safe.
So, I also have near-zero cybersecurity expertise (I took an online intro course on cryptography out of curiosity) and actually no expertise in writing kernel modules either, but why would you ever parse an array of pointers... in a file... instead of some other way of serialising data that doesn't include hardcoded array offsets in an on-disk file?
Even ignoring this catastrophic failure, this was a bad design asking to be exploited by criminals.
> I'm happy to up the ante by £50 to account for my second theory
What's that, three pints in a pub inside the M25? :P
Completely agree with this sentiment though, we've known that handling of binary data in memory unsafe languages has been risky for yonks. At the very least, fuzzing should've been employed here to try and detect these sorts of issues. More fundamentally though, where was their QA? These "channel files" just went out of the door without any idea as to their validity? Was there no continuous integration check to just .. ensure they parsed with the same parser as was deployed to the endpoints? And why were the channel files not deployed gradually?
"human programmers forget to account for edge cases"
Which is precisely the rationale that led to Standard Operating Procedures and Best Practices (much like any other sector of business has developed).
I submit to you, respectfully, that a corporation shall never rise to a $75 billion market cap without bullet-proof adherence to such, and thus this "event" should properly be characterized and viewed as a very suspicious anomaly, at the least.
https://news.ycombinator.com/item?id=41023539 fleshes out the proper context.
28c3: The Science of Insecurity (2011): https://www.youtube.com/watch?v=3kEfedtQVOY
> combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data
By now, if you write any parser that deals with any outside data and don't fuzz the heck out of it, you are willfully negligent. Fuzzers are pretty easy to use, automatic and would likely catch any such problem pretty soon. So, did they fuzz and got very very unlucky or do they just like to live dangerously?
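The contract a fuzzer checks is simple to state: for arbitrary byte soup, the parser must return an error, never crash. Here is a dependency-free sketch of that loop (the `parse` function is a stand-in toy parser and the PRNG is a plain xorshift; real fuzzing would be coverage-guided, e.g. with libFuzzer/cargo-fuzz or AFL):

```rust
// Stand-in for any untrusted-input parser under test.
fn parse(input: &[u8]) -> Result<usize, &'static str> {
    if input.len() < 2 || input[0] != b'M' {
        return Err("bad magic");
    }
    Ok(input.len())
}

// Tiny deterministic PRNG (xorshift64) so the sketch has no dependencies.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

// Throw pseudo-random buffers at the parser; the property under test is
// "never panic: every input yields Ok or Err". Returns the rejection count.
fn fuzz(iterations: u32) -> u32 {
    let mut state = 0x2545F4914F6CDD1Du64;
    let mut rejected = 0;
    for _ in 0..iterations {
        let len = (xorshift(&mut state) % 64) as usize;
        let buf: Vec<u8> = (0..len).map(|_| xorshift(&mut state) as u8).collect();
        if parse(&buf).is_err() {
            rejected += 1;
        }
    }
    rejected
}
```

A coverage-guided fuzzer does the same thing but mutates inputs toward unexplored branches, which is what makes it find the deep edge cases a random loop like this would miss.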
More or less. Binary parsers are the easiest place to find exploits because of how hard they are to get right: bounds checks, overflow checks, pointer checks, etc. Especially when the data format is complicated.
Bypassing the discussion of whether one actually needs rootkit-powered endpoint surveillance software such as CS: perhaps an open-source solution would be a killer way to move this whole sector to more ethical standards. The main tool would be open source, and it would be transparent what exactly it does and that it is free of backdoors or really bad bugs. It could be audited by the public. On the other hand, it could still be a business model to supply malware signatures as a security team feeding this system.
I'd say no. Kolide is one such attempt, and their practices, and how it's used in companies, are as insidious as those from a proprietary product. As a user, it gives me no assurance that an open source surveillance rootkit is better tested and developed, or that it has my best interests in mind.
The problem is the entire category of surveillance software. It should not exist. Companies that use it don't understand security, and don't trust their employees. They're not good places to work at.
https://github.com/google/grr
Every Google client device has it.
> By-passing the discussion whether one actually needs root kit powered endpoint surveillance software such as CS perhaps an open-source solution would be a killer to move this whole sector to more ethical standards.
As a red teamer developing malware for my team to evade EDR solutions we come across, I can tell you that EDR systems are essential. The phrase "root kit powered endpoint surveillance" is a mischaracterization, often fueled by misconceptions from the gaming community. These tools provide essential protection against sophisticated threats, and they catch them. Without them, my job would be 90% easier when doing a test where Windows boxes are included.
> So the main tool would be open source and it would be transparent what it does exactly and that it is free of backdoors or really bad bugs.
Open-source EDR solutions, like OpenEDR [1], exist but are outdated and offer poor telemetry. Assembling various GitHub POCs that exist for production EDR is impractical and insecure.
The EDR sensor itself becomes the targeted thing. As a threat actor, the EDR is the only thing in your way most of the time. Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities. It becomes a nightmare for development, as you can't be sure who is on the other side of the pull request. TAs will do everything to slow down the development of a security sensor. It is a very adversarial atmosphere.
> On the other hand it could still be a business model to supply malware signatures as a security team feeding this system.
It is actually the other way around. Open-source malware heuristic rules do exist, such as Elastic Security's detection rules [2]. Elastic also provides EDR solutions that include kernel drivers and is, in my experience, the harder one to bypass. Again, please make an EDR without drivers for Windows, it makes my job easier.
> "It could be audited by the public."
The EDR sensors already do get "audited" by security researchers and by the threat actors themselves, who reverse engineer and debug the sensors to spot weaknesses that can be "abused." If I spot things like the EDR plainly accepting kernel-mode shellcode and executing it, I will, of course, publicly disclose that. EDR sensors are under a lot of scrutiny.
[1] https://github.com/ComodoSecurity/openedr [2] https://github.com/elastic/detection-rules
The value CrowdStrike provides is the maintenance of the signature database, and being able to monitor attack campaigns worldwide. That takes a fair amount of resources that an open source project wouldn’t have. It’s a bit more complicated than a basic hash lookup program.
DAT-style content updates and signature-based prevention are very archaic: directly loading content into memory, and a hard-coded list of threats? I was honestly shocked that CS was still doing DAT-style updates in an age of ML and real-time threat feeds. There are a number of vendors who've offered the latter for almost a decade. We use one; we have to run updates a couple of times a year.
SMH. The 90's want their endpoint tech back.
There are no "ethical standards" to move to. Nobody should be able to usurp control of our computers. That should simply be declared illegal. Creating contractual obligations that require people to cede control of their computers should also be prohibited. Anything that does this is malware and malware does not become justified or "ethical" when some corporation does it. Open source malware is still malware.
But how come they didn't catch it in the testing deployments? What was the difference that caused it to happen only when they deployed to the outside world? I find it hard to believe that they didn't test it before deployment. I also think companies should all have a testing environment before deploying third-party components. We all install packages during development that fail or cause problems, but nobody thinks it's a good idea to do that directly in production before testing; so how is this different?
The thing I don't understand about all of this is something else, much less technical and much more important:
Why was the blast radius so huge?
I have deployed much less important services much more slowly with automatic monitoring and rollback in place.
You first deploy to a beta environment, where you don't get customer traffic; then, if everything goes right, to a small part of your fleet, slowly increasing the percentage of hosts that receive the update.
This would have stopped the issue almost immediately, and somehow I thought it was common practice...
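The wave-based rollout described above can be sketched as simple gating logic. The wave percentages and the 1% failure threshold below are made-up policy numbers, not anyone's real deployment system; the idea is only that each wave proceeds solely on evidence from the previous one, and a spike in failures halts everything.

```rust
// canary -> 1% -> 5% -> 25% -> 100% (illustrative wave sizes)
fn next_wave_percent(current: f64) -> f64 {
    match current {
        c if c < 1.0 => 1.0,
        c if c < 5.0 => 5.0,
        c if c < 25.0 => 25.0,
        _ => 100.0,
    }
}

/// Returns the new rollout percentage, or None to halt and roll back.
fn advance(current: f64, failures: u32, hosts: u32) -> Option<f64> {
    let failure_rate = failures as f64 / hosts.max(1) as f64;
    if failure_rate > 0.01 {
        None // trip the breaker: stop the rollout, page a human
    } else {
        Some(next_wave_percent(current))
    }
}
```

Under gating like this, a content file that bluescreens every host it touches never gets past the canary wave.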
One thing I am surprised no one has been discussing is the role Microsoft have played in this and how they set the stage for the CrowdStrike outage through a lack of incentive (profit, competition) to make Windows resilient to this sort of situation.
While they were not directly responsible for the bug that caused the crashes, Microsoft holds an effective monopoly over the workstation computing space (I'd consider it infrastructure at this point) and therefore has a duty of care to ensure the security, reliability, and capabilities of its product.
Without competition, Microsoft have been asleep at the wheel on innovations to Windows - some of which could have prevented this outage.
For example: CrowdStrike runs in user space on macOS and Linux. Does Windows not provide the capabilities needed to run CrowdStrike in user space?
What about innovations in application sandboxing that could mitigate the need for the level of control CrowdStrike requires?
The fact is, Microsoft is largely uncontested in holding the keys to the world's computing infrastructure, and it has virtually no oversight.
Windows has fallen from making over 80% of Microsoft's revenue to 10% today - there is nothing wrong with being a private company chasing money - but when your product is critical to the operation of hospitals, airlines, critical infrastructure, you can't be out there tickling your undercarriage on AI assistants and advertisements to increase the product's profitability.
IMO Microsoft have dropped the ball on their duty of care to consumers, and CrowdStrike is a symptom of that. Governments need to seriously consider encouraging competition in the desktop workspace market. That, or regulate Microsoft's Windows product.
I don’t run CrowdStrike and to the best of my knowledge haven’t had it installed on one of my systems (something similar ran on my machine at the last corporate job I had), so correct me if I’m wrong.
It seems great pains are made to ensure the CS driver is installed first _and_ cannot be uninstalled (presumably the remote monitor will notice) or tampered with (signed driver).
Then the driver goes and loads unsigned data files that can be arbitrarily deleted by end users? Can these files also be arbitrarily added by end users to get the driver to behave in ways that it shouldn’t? What prevents a malicious actor from writing a malicious data file and starting another cascade of failing machines or worse, getting kernel privileges?
Do these customers of CrowdStrike even have a say in these updates going out, or do they all just bend over and let CrowdStrike have full RCE on every machine in their enterprise?
I sure hope the certificate authorities and other crypto folks get to keep that stuff off their systems at least.
Does anybody know if these “channel files” are signed and verified by the CS driver? Because if not, that seems like a gaping hole for a ring 0 rootkit. Yeah, you need privileges to install the channel files, but once you have it you can hide yourself much deeper in the system. If the channel files can cause a segfault, they can probably do more.
Any input for something that runs at such high privilege should be at least integrity checked. That’s the basics.
And the fact that you can simply delete these channel files suggests there isn’t even an anti-tamper mechanism.
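What "at least integrity checked" could mean concretely is sketched below. The file layout is hypothetical, and the toy FNV-1a hash stands in for the cryptographic signature that kernel-privileged input actually needs: a checksum only catches corruption, not tampering, so a real driver would verify, e.g., an Ed25519 signature instead.

```rust
// Toy FNV-1a hash -- stand-in for a real cryptographic signature check.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical file layout: payload bytes followed by an 8-byte LE digest.
/// Refuse to hand the payload to the parser unless the digest matches.
fn verify_and_extract(file: &[u8]) -> Result<&[u8], &'static str> {
    if file.len() < 8 {
        return Err("too short to carry a digest");
    }
    let (payload, tail) = file.split_at(file.len() - 8);
    let stored = u64::from_le_bytes(tail.try_into().unwrap());
    if fnv1a(payload) != stored {
        return Err("integrity check failed; refusing to load");
    }
    Ok(payload)
}
```

Note that even this trivial gate rejects a file of all zeroes before it ever reaches the parser, since the zeroed digest can't match the payload hash.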
This is a pretty brief 'analysis'. The poster traces back one stack frame in assembler; it basically amounts to reading out a stack dump from gdb. It's a good starting point, I guess.
These "channel files" sound like they could be used to execute arbitrary code... Would be a big embarrassment if it shows up in KDU as a provider...
(This is just an early guess from looking at some of the csagent in ida decompiler, haven't validated that all the sanity checks can be bypassed as these channel files appear to have some kind of signature attached to them.)
A 'channel file' is a file interpreted by their signature detection system. How far is this from a bytecode compiled domain specific language? Javascript anyone?
eBPF, much the same thing, is actually thought through and well designed. If it weren't, it would be easy to crash Linux.
This is what they do and they are doing badly. I bet it's just shit on shit under the hood, developed by somewhat competent engineers, all gone or promoted to management.
It's really difficult to evaluate the risk the CrowdStrike system imposed. Was this a confluence of improbable events or an inevitable disaster waiting to happen?
Some still-open questions in my mind:
- was the broken rule in the config file (C-00000291-...32.sys) human authored and reviewed or machine-generated?
- was the config file syntactically or semantically invalid according to its spec?
- what is the intended failure mode of the kernel driver that encounters an invalid config (presumably it's not "go into a boot loop")?
- what automated testing was done on both the file going out and the kernel driver code? Where would we have expected to catch this bug?
- what release strategy, if any, was in place to limit the blast radius of a bug? Was there a bug in the release gates or were there simply no release gates?
Given what we know so far, it seems much more likely that this was a "disaster waiting to happen" but I still think there's a lot more to know. I look forward to the public post-mortem.
This reminds me of the vulnerability that hit jwt tokens a few years ago, when you could set the 'alg' to 'none'.
Surely CrowdStrike encrypts and signs their channel files, and I'm wondering if a file full of 0's inadvertently signaled to the validating software that a 'null' or 'none' encryption algo was being used.
This could imply the file full of zeros is considered just fine, as the null-encryption check passes because it's not encrypted.
That could explain why it tried to reference the null memory location: the file full of zeroes forced it to jump to memory location zero.
The risk, if this is true, is that their channel-loading verification system is critically exposed, able to load malicious channel files through disabled encryption.
Just a hunch.
The only thing I know about CrowdStrike is that they hired a large percentage of the underperforming engineers we fired at multiple companies I've worked at.