
Systemic risk exposed: the Crowdstrike global outage

On July 19th, 2024, Crowdstrike pushed a bad update to all of its Windows clients, causing a global IT outage. Crowdstrike, a provider of security solutions to enterprises, is the most well-known and well-regarded vendor in its domain. However, its driver could not handle a corrupted update file consisting of all zeros, and sent affected systems into a boot loop ending in the Blue Screen of Death.
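To make the failure mode concrete, here is a minimal, purely hypothetical sketch in C. Crowdstrike has not published its channel-file format, so the layout and names below are invented; the sketch only illustrates the general class of bug, in which a value read from an unvalidated update file is used as an index or pointer, so that a file of all zeros sends execution through an invalid address. In kernel mode that means a bugcheck (blue screen), and if the bad file is reloaded at boot, a boot loop.

```c
/* Hypothetical illustration only: not Crowdstrike's real code or file format. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef int (*rule_handler)(const uint8_t *payload, size_t len);

static int handle_hash_rule(const uint8_t *p, size_t n) { (void)p; (void)n; return 0; }
static int handle_path_rule(const uint8_t *p, size_t n) { (void)p; (void)n; return 0; }

/* Slot 0 is unused -- exactly the slot an all-zeros file would select. */
static rule_handler handlers[] = { NULL, handle_hash_rule, handle_path_rule };

/* Unsafe: trusts the update file. An all-zeros file yields rule type 0,
 * and the call goes through a NULL function pointer. In user space that
 * is a crash; in kernel mode it is a system-wide failure, repeated on
 * every boot that reloads the same bad file. */
int apply_update_unsafe(const uint8_t *file, size_t len) {
    uint32_t type;
    memcpy(&type, file, sizeof type);
    return handlers[type](file + 4, len - 4);
}

/* Safer: validate before use and fail closed, keeping the previous rules. */
int apply_update_checked(const uint8_t *file, size_t len) {
    if (file == NULL || len < 4)
        return -1;
    uint32_t type;
    memcpy(&type, file, sizeof type);
    if (type >= sizeof handlers / sizeof handlers[0] || handlers[type] == NULL)
        return -1;                 /* unknown or empty rule type: reject */
    return handlers[type](file + 4, len - 4);
}

int main(void) {
    uint8_t zeros[64] = {0};       /* stand-in for the corrupted update */
    printf("checked loader result: %d\n", apply_update_checked(zeros, sizeof zeros));
    /* apply_update_unsafe(zeros, sizeof zeros);  would crash here */
    return 0;
}
```

The checked path also shows why failing closed matters: a sensor that rejects a malformed update and keeps its previous rules degrades gracefully instead of taking the whole machine down.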

During the outage, flights across the US and around the world were grounded, hospitals couldn’t treat patients, and government services were down. Although the fix ended up being relatively straightforward to perform on each individual system, a huge number of machines (estimated at 8.5 million) were brought down and needed manual intervention to come back online. This is likely the single largest cyber incident in terms of the number of people impacted.

The Crowdstrike incident highlights the problem of single points of failure and the correlated risk that arises when everyone runs the same stack. Because Crowdstrike had such a good reputation, many of the world’s most successful organizations used it. But the same could be said of any other software component that has achieved a monopoly in today’s digital monoculture. We discuss some potential mitigations for this problem.

@christian_tail: You’re correct. Full of zeros at least.
https://x.com/christian_tail/status/1814299095261147448

Microsoft-CrowdStrike outage: From airports, ATMs, banks and hospitals — we see what was impacted
https://www.livemint.com/companies/news/microsoft-crowdstrike-outage-software-what-was-impacted-it-upgrade-airports-chaos-atms-banks-hospitals-cybersecurity-11721451512275.html

Technical Details: Falcon Content Update for Windows Hosts
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/

Has something as catastrophic as Crowdstrike ever happened in the Linux world?
by u/cof666 in r/linux

When Security Becomes the Threat: The Crowdstrike Incident
https://medium.com/@confusedcyberwarrior/when-security-becomes-the-threat-the-crowdstrike-incident-9bbaeab9db9d

Ask HN: Can anyone from Crowdstrike explain the back story?
https://news.ycombinator.com/item?id=41016373

CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says
https://www.bbc.com/news/articles/cpe3zgznwjno

#crowdstrike #outage #systemicrisk

0:00 Intro
0:32 Contents
0:39 Part 1: What happened with Crowdstrike?
1:00 Details of Crowdstrike’s business
1:23 EDR tool, superset of antivirus
1:48 Crowdstrike outage July 19th, 2024
2:29 Comparison against other cyber incidents
2:53 Target machines were running critical infrastructure
3:21 Large economic impact
3:27 Example: airports coming to a standstill
4:03 Example: hospitals and pharmacies
4:14 Example: banks and financial transactions
4:38 Example: FBI, FedEx, US courts
5:05 Other companies do mess up too
5:30 Outages can be a business killer
5:54 McAfee also had an outage: George Kurtz
6:33 Part 2: Under the hood
6:41 My experience in similar domain
6:55 Crowdstrike and Sentinel One
7:32 Technical details of issue, auto updater
8:05 New malware discoveries require updates
8:39 What really happened with the bad update
9:22 Update file was full of zeros
10:17 Push code carefully…!
11:07 Windows driver choked on file of zeros
11:39 How much is Microsoft to blame?
12:31 Anti-malware hooks are harder to handle
13:16 Stringent driver signing process
13:39 Will Crowdstrike survive this as a company?
14:38 Part 3: Addressing systemic risk
14:52 Mitigate or diversify
15:08 Mitigation: Quality Assurance (QA) testing
16:00 Mitigation: staged deployment
16:22 Mitigation: stateless infrastructure
16:51 Mitigation and diversification: backup systems or hot spares
17:22 Diversification: replicate on different stack
17:57 Tabletop exercises to simulate security scenarios
18:50 Software is a winner-take-all market
19:19 Correlated failures have high probability
19:34 Attack waiting to happen
19:52 Attackers that use advanced AI
20:46 Need to start using defensive AI
21:02 Conclusion
21:27 Pushing frequent updates requires QA
22:06 Outro


by Dr Waku


26 thoughts on “Systemic risk exposed: the Crowdstrike global outage”

  • Do you think I should do more videos on security or the intersection of security and AI?

  • I have watched many videos about the Crowdstrike incident. This is by far my favorite. Dr. Waku, your breakdown and analysis is amazing. I particularly appreciate the way you use simple language to convey complex thoughts. Thank you!

  • Cease AI. This accident was caused by AI writing critical code. It will only get worse.

  • It's better to fail down than to fail up. One advantage: 8.5 million machines could not be hacked!

  • Dereferencing NULL Pointers: As you mentioned, this error occurs when a pointer that has not been initialized or has been set to NULL is dereferenced. This leads to system crashes or the notorious Blue Screen of Death (BSOD) because the NULL address (typically address 0) is invalid and not accessible by programs. This kind of error is particularly dangerous and common in kernel mode because of the high privileges and lack of safety nets found in user space.

    Memory Leaks: In kernel mode, memory management errors can be catastrophic. Memory leaks occur when allocated memory is not freed, causing the system to run out of resources. This can lead to performance degradation or system crashes.

    Buffer Overflows: These occur when data exceeds the allocated space, potentially overwriting other important data. This can lead to unexpected behavior, data corruption, security vulnerabilities, or system crashes (see the sketch just below this comment).

    Race Conditions: These happen when multiple threads or processes operate on shared data without proper synchronization, leading to unpredictable results. In driver code, where operations may be interrupt or device-driven, race conditions can be particularly problematic.

    Invalid Device Handle Usage: Errors can occur when drivers use invalid or previously closed device handles. Such usage can lead to unpredictable behavior or system crashes.

    Improper Synchronization: Failing to correctly synchronize access to shared resources can lead to corrupted data or crashes. This includes improper use of mutexes, semaphores, or other synchronization primitives.
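A small user-space illustration of the buffer-overflow class mentioned in the comment above (generic C, not related to any vendor's driver): an unbounded copy silently overwrites whatever field happens to sit next to the destination buffer, while a bounded copy truncates instead.

```c
/* Generic illustration of a buffer overflow, unrelated to any real driver. */
#include <stdio.h>
#include <string.h>

#define NAME_MAX_LEN 16

struct device_record {
    char name[NAME_MAX_LEN];
    int  flags;                       /* lives right after 'name' in memory */
};

/* Overflow: strcpy keeps writing past 'name' when the input is longer than
 * the buffer, clobbering 'flags' (and, in kernel code, potentially much more). */
void set_name_unsafe(struct device_record *rec, const char *input) {
    strcpy(rec->name, input);
}

/* Bounded copy: truncate instead of overflowing, and always NUL-terminate. */
void set_name_safe(struct device_record *rec, const char *input) {
    strncpy(rec->name, input, NAME_MAX_LEN - 1);
    rec->name[NAME_MAX_LEN - 1] = '\0';
}

int main(void) {
    struct device_record rec = { .name = "", .flags = 0x1 };
    set_name_safe(&rec, "a-very-long-device-name-from-untrusted-input");
    printf("name=%s flags=%#x\n", rec.name, rec.flags);   /* flags intact */
    return 0;
}
```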

  • Is it just a coincidence that a company that named itself "crowdstrike" has positioned itself in the business sphere to be able to strike at masses of people at once, and then did exactly that? Crowdstrike did, in fact, strike the crowd. Most people are too brain-dead and spiritually decimated to comprehend that these things are done intentionally, both the naming and the actions of the company, even when it is right in front of their faces.

  • I don't think Michigan First Credit Union is among the world's biggest organizations.

  • If it was deployed during office hours in India, then why did it start affecting payment systems in Australia first?

    In fact, the airports from India to Japan were the last to be affected.

    Now I am curious about the scheduling of the deployment.

  • So, I can't help myself, since I work with a similar product called Carbon Black EDR, but to push back on these channels.
    People were grounded, stranded, and suffered the outage because companies DO NOT support IT. How can I say that? Because with a tool operating at the kernel level, you ALWAYS test before you deploy. When you get the product from the vendor, YOU TEST before deploying to your fleet.

    I initially had some concerns that maybe an attacker got at Crowdstrike's build environment, or that a Windows patch was not caught. What it sounds like is that the update package was empty, and it broke the driver.
    If handling this manually, there would be md5sum checks to confirm a bit-for-bit copy of the files that Crowdstrike was sending.

    But here is the thing: stuff breaks. Adding any software is a risk. The most secure computer is one that is turned off and at the bottom of the ocean. There is risk in every update and every change.

    So why do these companies trust anyone to upload and configure their CRITICAL systems without testing? Because they don't want to pay for IT staff that will push back and insist on testing.

    Heck, they could have a small segment of machines and sensors that get the auto update and report on failures, before pushing to every other machine (see the sketch just below this comment).

    But instead, all these YouTube channels are popping up, claiming that it's Crowdstrike's fault or Windows' fault. The fault is in not respecting the software, especially the "sharp" software that cyber security products need to be in order to protect and remediate your systems. Do these same companies cry like this when Windows blue-screens a segment of the market? No, because more than likely they are testing Windows updates. Or worse, they update so late that criminal organizations have been using their systems as a malware distribution system since day 2 of the security update.

    So is it that they didn't know what they bought, or are they just completely trusting of everyone who operates in the cloud? None of this is magic. We've been building and breaking computers for years now. We have university programs, for God's sake.
    So, when an ops, dev, or security engineer makes any sort of noise along the lines of, WE SHOULD TEST ON A SMALL SECTION BEFORE WE FULLY DEPLOY ACROSS THE ENTERPRISE, don't argue with them and insist on auto updates. Instead, pay them a living wage and have them test before you deploy.

    In the long run, it will always be cheaper for your business. All you have to endure is someone saying, "Maybe this isn't a good idea for us."

    My $0.02. But also, to all the companies that were hit with this: please get your heads out of your asses. There is no way I would trust you with my data if you don't understand a simple test-before-deploy paradigm.
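As a sketch of the "small segment first" idea in the comment above (hypothetical, not tied to Crowdstrike, Carbon Black, or any vendor's actual tooling): hosts can be deterministically bucketed so that only a small canary percentage takes a new update immediately, and the rollout widens only if the canaries stay healthy.

```c
/* Hypothetical staged-rollout gate: bucket hosts deterministically so a
 * stable canary cohort receives an update before the rest of the fleet. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* FNV-1a hash of the host ID gives each host a stable bucket from 0-99. */
static unsigned bucket_for_host(const char *host_id) {
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a 64-bit offset basis */
    for (const char *p = host_id; *p; ++p) {
        h ^= (unsigned char)*p;
        h *= 0x100000001b3ULL;                 /* FNV-1a 64-bit prime */
    }
    return (unsigned)(h % 100);
}

/* A host takes the new content only if its bucket falls inside the current
 * rollout percentage; everyone else keeps the previous known-good version. */
static int should_update_now(const char *host_id, unsigned rollout_percent) {
    return bucket_for_host(host_id) < rollout_percent;
}

int main(void) {
    const char *fleet[] = { "atm-0017", "gate-kiosk-42", "pharmacy-pos-9" };
    unsigned stage = 5;                        /* canary stage: 5% of the fleet */
    for (size_t i = 0; i < sizeof fleet / sizeof fleet[0]; ++i)
        printf("%-16s -> %s\n", fleet[i],
               should_update_now(fleet[i], stage) ? "update now" : "wait for next stage");
    return 0;
}
```

Widening the rollout is then just raising the percentage in later stages, and only after the canary segment has reported back healthy.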

  • Although this was of course a very bad and carelessly caused outage, it was still nowhere near the "global internet outage" it is often called. The internet only went down at very few providers, all messengers worked just fine, and electricity didn't cut out. Only one of five supermarkets in my area had problems (cash only for a few hours). This was not good, but certainly no catastrophe.

  • This is truly excellent reporting. Outstanding work.
    Thank you.

  • When we say that we can understand why the immediate deployment of the update to all systems might make sense, we are pointing to a serious defect in the human adaptation to technology. To our armies of brilliant software engineers, now supported by AI capable of writing brilliant code, we need to add armies of human systems engineers.

    When you consider the unknown, or perhaps even unknowable, state of the future interaction of (1) OS code, (2) driver code, (3) hardware built by an uncoordinated collection of manufacturers, (4) data, and (5) encryption (in many cases with lost encryption keys), it is quite obvious that the arrival of update deployment accidents is far more a question of “when” than of “if”.

    This is so obvious in hindsight that our great corporate leaders should be deeply embarrassed by their failure to see this issue, and should reflect on how such an accident may play out in their organizations.

    This is simply not a question of finding the best software solution. At the very least the leaders of our powerful elites should reflect upon the national security vulnerabilities connected with how we have decided to deploy technology. When, for example, you see multiple banks and other great companies keep imploring us to ‘go paperless’, and accept that legal notices have been delivered merely because the deliverer has pressed a Send button at her/his computer, you can understand that the level of dumbness among us human beings is something over which we could easily shed tears.

  • 2:41 Oh, Crowdstrike was absolutely a cyberattack 💯.
    It's just that this cyberattack, unlike WannaCry, was 13 years in the making.

  • Microsoft entered into a 2009 regulatory agreement with EU regulators that required them to allow kernel-level access to security software providers.

  • So Microsoft is now blaming an EU commission decision, made to address competition concerns, that forced them to keep the kernel open to third-party vendors. But what I am seeing in the actual document that was published as part of the committee process is that "Microsoft should provide the ability to uninstall Internet Explorer, Microsoft should allow OEMs to preinstall the browser of their choice without retaliation, etc." So the decision Microsoft refers to isn't about granting third parties access to the kernel; it is about letting end users and OEMs decide which browser to install on their systems. In general, it was about the system not forcing a certain choice of software on its users.

  • They should likely be broken up, as they are essentially a monopoly at this point…

  • When I think of Crowdstrike … I can't help but think of the line from "A Few Good Men": "You want me on that wall." (firewall) "You NEED me on that wall." No one will complain, as the kernel itself needs protection of this magnitude because the kernel is designed poorly.

  • The arsehole politicians in Australia pushing the cashless agenda have become very quiet in the last week.

  • This is the most cogent presentation of the issues involved.

  • i said it was AGENNNNNNTICCC!!! best get fishing. Mother XXXXX[]

  • so dystopian… no worries about the coming AGI/ASI… lol

  • They have kernel access and do not have time for WHQL testing, I hear. The kernel access means that bad code would crash any OS; it might be that the updates for the other OSes were OK.

  • When I was a chairlift operator, a guy on my crew quit by dumping the emergency brake, then taking the handle for the hydraulic brake-release pump (150x) and chucking it into the forest. Was this a disgruntled employee nobody was listening to about how dangerous CS's code was? You Gen Z'ers change jobs all the time; maybe it was a disgruntled employee's last day…

  • I guess you think the Nord Stream accident was like the Crowdstrike critical infrastructure accident…

  • Amazon AWS RDS didn't want me to do incremental rollouts when I was there, from 2018 to 2022 or so. For my first couple of rollouts I did go incremental and actually caught some issues, but then, because of pressure and wanting to follow company alignment, I did all my rollouts in a single shot.

Comments are closed.