Kopec Explains Software
Computing Concepts Simplified
2 months ago

#134 The CrowdStrike Incident

Bringing Down Millions of Windows PCs

Transcript
David Kopec

On July 19, 2024, the large American cybersecurity firm CrowdStrike deployed an update for its Falcon Sensor security software that took down over 8 million Microsoft Windows computers around the world, including some involved in critical infrastructure. In this episode, we'll explain what happened and what some of the long-term implications are for the industry. Welcome to Kopec Explains Software, the podcast where we make computing intelligible. In this episode, we're going to discuss a major IT incident that affected over 8 million Windows computers around the world. But believe it or not, it wasn't caused by Microsoft. It was caused by a third-party software maker in the cybersecurity sphere called CrowdStrike. They make a piece of software called Falcon Sensor that's deployed in large organizations all around the world to defend against cybersecurity threats. But this wasn't actually a cybersecurity attack. No, this was just a bug, literally a programming mistake that the programmers at CrowdStrike had made, which the update triggered, causing many, many computers to crash.

Rebecca Kopec

Before we dive into the bug, let's talk a little bit just about CrowdStrike. Who are they? What do they do?

David Kopec

So to make an analogy, for those who don't know anything about cybersecurity, I think we all at least know the basic idea of an antivirus program. Now, that's not exactly what Falcon Sensor is, but an antivirus program, as most of you know, is something that runs on your computer and says, is this a virus? Is this a virus? Is this a virus? And if it finds a virus, it quarantines the files, deletes them, alerts you to the problem, or might even stop something from being transmitted onto your computer. So think about CrowdStrike's Falcon Sensor software as a much, much more advanced version of that. Not only is it doing things like analyzing data that's on the computer, but it's also analyzing traffic that's coming in and out of the computer and all across the network, gathering data from all the different nodes on the network, and asking: are there certain traffic patterns that look like the patterns of an attacker? Are they malicious patterns? And so it's using both traditional basic scanning techniques and machine learning techniques to analyze patterns, see malicious patterns as they happen, and stop them right when the incident is just starting, before it starts to spread all across the network.

Rebecca Kopec

And CrowdStrike isn't something just an everyday consumer would necessarily use. It's more of like an enterprise sort of software.

David Kopec

Right. CrowdStrike and their products mostly appeal to large organizations. We're talking about airlines, hospitals, schools, governments, and large corporations. In fact, I read that more than 50% of the Fortune 1000 uses CrowdStrike software on at least some of their computers. So these are deployed to very large organizations. It's not usually home users that are using CrowdStrike's Falcon Sensor software. That wouldn't really make sense, because it's analyzing threats from across the network. At the same time, though, of course, there are individual PCs and individual servers in these networks. And so it is taking down individual machines, kind of one at a time, when this bug happens.

Rebecca Kopec

How does Falcon Sensor analyze the traffic?

David Kopec

Right. In order to analyze everything that's going on in your machine, it needs increased privileges. It can't just operate like any other app. Now, back in the nineties, if you were running classic Mac OS or Windows 95, any piece of software could basically do anything it wanted to your system. There was almost no protection in place. But on modern operating systems like Windows 11 or the latest version of macOS, an individual app can't just go and analyze what other apps are doing. It can't go and change things that are fundamental to the system, or see network traffic, or see regions of memory that it doesn't control. In order to do that, Falcon Sensor on Windows is actually hooked into the kernel. That's the core of the Microsoft Windows operating system. The core of any operating system is called the kernel. That's the part that's responsible for actually launching individual processes, for managing memory, and for managing multitasking. And so having that direct access to the kernel of the operating system enables CrowdStrike's Falcon Sensor to go and analyze what all the other apps on the system are doing, and what all the different memory, traffic, and network patterns on the computer are. And so it has really direct access at a very, very low level. And that's one of the reasons this bug was so fatal. Because when you have an error at the kernel level of the operating system, it doesn't just take down one app, it takes down the whole system.

Rebecca Kopec

This is a good time to point out that we actually did a prior episode on operating systems, which would be a good one for folks to listen to, and which we'll link to in the show notes. So Falcon Sensor has to have this really base-level access to work effectively. Is that how it works for all computers, like Windows, macOS, Linux?

David Kopec

Yeah, unfortunately, this is really specific to Microsoft Windows. On Microsoft Windows, security software that works in this way needs to have kernel-level access. On Linux and on macOS, more modern versions actually have a framework for third parties (one that Apple makes, for example, on macOS) that enables security software to tap in without actually going and interfering at the kernel level. Now, I've read some things that say that Microsoft doesn't actually have that same sort of third-party access for security software because of some kind of government regulations. There was a whole argument made that this is an unintended consequence of some of the anti-monopoly regulations on Microsoft. I didn't dig into the details of whether that's really the case. But the fact of the matter is, no, CrowdStrike does not need to operate the same way on macOS as it does on Windows, because on macOS there is a security framework for third-party apps to tap into that does not require them to get such low-level access. But on Microsoft Windows, you need to have this kernel-level access if you're security software that does what CrowdStrike's Falcon Sensor does. Now, at the same time, it's important to mention that the update that came in is separate from the core software that's operating at the kernel level, in that the core kernel extension had to go through an approval process by Microsoft. So Microsoft actually certifies and cryptographically signs the core CrowdStrike Falcon Sensor software. But the updates that are sent to the software do not pass through that same certification process because they're not executable code.

Rebecca Kopec

And that loophole almost in some ways wound up being the big, big problem here, right?

David Kopec

So on July 19, they sent an update that modifies what Falcon Sensor looks for. And they do these updates all the time. They do them on a weekly basis because threats are constantly evolving. What security software needs to look for changes. As new viruses come out and new kinds of attacks are discovered, these pieces of software need to have updates. But this update triggered some bad code within Falcon Sensor. Specifically, the code was looking for 21 values when only 20 values were present. Now, that might sound absurd to you, but that's actually a classic bug that happens all the time in C and C++ programs. We run over the end of an array. We think we're looking for a certain number of values. We look one past where we were supposed to be looking, because there are actually fewer than that number of values there, and we end up going into some arbitrary part of memory and reading something that's not valid. And when we read that invalid piece of memory, what's the program going to do? It crashes. It doesn't know what to do. It executes things that it can't execute. And when a program that's tied into the kernel crashes, the whole system crashes.
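As a rough illustration of the bug class David describes here (this is not CrowdStrike's actual code), the Python sketch below mimics a reader that expects 21 values when only 20 were supplied. The names `EXPECTED_FIELDS`, `read_field_unchecked`, and `read_field_checked` are hypothetical and exist only for this example. Python raises an `IndexError` where C or C++ would read whatever memory follows the array, so the exception stands in for the crash.

```python
# Hypothetical sketch of an off-the-end read; not CrowdStrike's code.
EXPECTED_FIELDS = 21  # what the reading code assumes (illustrative value from the episode)


def read_field_unchecked(values, index):
    """Index directly, trusting that the value is there.

    In C or C++ an out-of-range index reads arbitrary memory.
    Python raises IndexError instead, which stands in for the crash.
    """
    return values[index]


def read_field_checked(values, index):
    """Check the bounds first and report a missing value as None."""
    if 0 <= index < len(values):
        return values[index]
    return None


update_values = [f"value{i}" for i in range(20)]  # only 20 values supplied
last_index = EXPECTED_FIELDS - 1  # index 20, one past the end of the data

try:
    read_field_unchecked(update_values, last_index)
    crashed = False
except IndexError:
    crashed = True

safe_result = read_field_checked(update_values, last_index)
print("unchecked read crashed:", crashed)
print("checked read returned:", safe_result)
```

The checked version costs one comparison per read, and it turns a crash into a value the caller can handle, which matters most in code where a crash takes the whole system down.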

Rebecca Kopec

It doesn't even get to boot up? Like, you don't even get past the welcome screen or the like?

David Kopec

Well, yeah, that did happen in this instance. So when it initially updated, the operating system might have been live and running. Then we're restarting, and the kernel is loading, and it's loading all the third-party software that's part of the kernel, and then that's causing a crash. And so we don't even get back to our desktop. That's what actually happened in the CrowdStrike instance. All of these 8.5 million Windows machines couldn't even get to the desktop unless they were restarted in recovery mode, or what used to be called safe mode. And then what had to be done to fix them was that an IT person had to restart in that safe mode and go in and delete the update file that was being read and triggering this bug. And so it was a manual process that took, they're saying, about an hour on average for each machine to get fixed. And imagine all the IT people who had to be woken up that morning: hey, we've got an emergency, you've got to come out. And then every individual computer, and there could be 10,000 at some place, had to be fixed, and somebody had to go to each one individually and fix it. So that was really terrible. But not only did this bring down individual people's PCs, it also brought down a lot of infrastructure Windows computers, some that were running hospitals, airlines, and schools. Almost definitely, I hate to say it, people died as a result of this incident. And I don't say that to be dramatic. I say that to illustrate the reality that when software is connected to the kernel, very bad things can happen. And you might say, why did people die? Okay, somebody's flight is delayed and they were going for a medical procedure. Or hospitals, and it's been written about, had to delay some of their procedures because the computers for organizing the drugs and the people that were supposed to perform them were not working. And surely when some procedures get delayed, some people die.
So, yes, people died as a result of this incident. And so these companies that make kernel level software have a huge responsibility, especially when it's deployed amongst so many machines that run in such critical industries.

Rebecca Kopec

How long did it take CrowdStrike to know there was a problem? Like, that must have been a very bad day.

David Kopec

To their credit, and I'm not going to give them a lot of credit in this episode, they actually realized the problem happened within an hour and a half of the update going out, and they shut down the update at that point, but it was already too late because these updates are deployed automatically and those 8.5 million machines had already received the update. But surely shutting it down so quickly did stop some machines that hadn't yet gotten the update from actually going into this failure mode.

Rebecca Kopec

So you said you're not going to give CrowdStrike a ton of credit. Why is that?

David Kopec

Well, three reasons. Number one is, obviously, the bug was their fault. This is a pretty basic kind of bug, and it can happen to anybody. I know I've written this bug before, reading off the end of an array. Okay, so bugs happen, but that's not it. There are two other reasons. Another reason is that they deployed their software, at least in this instance, to all the machines at once. What's usually done when you have millions of customers is you deploy to some machines, see what happens, and then a few hours later, if everything's okay, you deploy to everybody else. But they deployed to all of their customers at once. And when you make such critical infrastructure software, I really think that's a no-no, especially when you have that many customers. If you're somebody like me, some of my products have hundreds or thousands of customers. If I deploy to all of them, that's basically all I can do, because what am I going to do? Deploy to just 50 of them and try to see if those 50 machines were okay? I might not even get a report back from all those 50 machines. Right. So that doesn't make sense. But when you have millions of customers, you have a responsibility to do these deployments a lot more carefully, especially when you make infrastructure-critical software. And that's kind of standard practice in the industry today. The other reason comes from their own documents that they've put out there, which I'll link to in the show notes, describing what happened in the incident. They didn't do proper testing. They had this update that crashed every computer it was installed on. They obviously didn't have full integration tests where they tested it on the actual software in a real-world environment, because then they would have seen the crash themselves. It seems from their document that they did have all kinds of testing in place, just not that critical real-world integration test.
In fact, they had testing in place that specifically overlooked this kind of bug. And they describe in their document how their testing software overlooked this kind of bug. I won't get into those technical details in this episode. But so, three problems: they created the bug, they deployed to everybody at once, and they didn't have proper integration tests in place that tested the update in the same way that it would run in the real world.
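The staged deployment David describes can be sketched roughly as follows. This is a minimal Python simulation under stated assumptions, not CrowdStrike's or anyone's real deployment system: `staged_rollout`, `apply_update`, `is_healthy`, and `canary_percent` are hypothetical names, and the "fleet" is just a list of strings. The idea is to update a small sample first and continue only if every machine in that sample is still healthy.

```python
# Hypothetical sketch of a staged ("canary") rollout; names are illustrative.
import random


def staged_rollout(machines, apply_update, is_healthy, canary_percent=1, seed=0):
    """Update a small random sample first; continue only if all of it stays healthy.

    Returns (updated_machines, completed).
    """
    if not machines:
        return [], True

    rng = random.Random(seed)
    canary_count = max(1, len(machines) * canary_percent // 100)
    canaries = rng.sample(machines, canary_count)

    for machine in canaries:
        apply_update(machine)

    # A real system would wait here and collect crash reports before deciding.
    if not all(is_healthy(machine) for machine in canaries):
        return canaries, False  # halt: only the canaries received the update

    canary_set = set(canaries)
    rest = [m for m in machines if m not in canary_set]
    for machine in rest:
        apply_update(machine)
    return canaries + rest, True


# Simulated fleet: this "bad" update crashes every machine that receives it.
fleet = [f"pc-{i}" for i in range(10_000)]
crashed_machines = set()

updated, completed = staged_rollout(
    fleet,
    apply_update=crashed_machines.add,
    is_healthy=lambda m: m not in crashed_machines,
)
print(f"completed={completed}, machines affected={len(crashed_machines)}")
```

In this simulation the rollout halts after the first stage, so 100 of the 10,000 simulated machines receive the bad update instead of all of them. The canary stage also acts as a crude real-world integration test, since the update runs on actual machines before it reaches everyone.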

Rebecca Kopec

So what are going to be the repercussions for CrowdStrike?

David Kopec

So according to the terms of use of their software, they are only taking liability for the fees that companies pay to use their software. Now, of course, if you've ever read the fine print on some of the software products you use, basically all commercial software companies do this. They say they're not liable if anything happens to you, basically at all. Is that going to hold up in court? I don't know. And of course, there are many different jurisdictions here, not just the United States. So it's going to depend on how those kinds of terms of use are interpreted across legal systems around the world. So I don't think they're going to be liable for the billions and billions and billions of dollars of damage that actually happened here. But people should certainly be scrutinizing their use of a company that would have such poor deployment practices as well as such poor testing practices. For example, Elon Musk has said he's no longer going to use CrowdStrike software in any of his companies. I imagine other CEOs or CTOs are thinking some of the same things. So I think it'll certainly have some negative commercial consequences for the company. Some reporters say this is the largest incident that's ever happened in terms of bringing that number of machines, that number of systems, down from a bug. We're actually lucky in some sense that this was just a bug. If a malicious actor had somehow infiltrated CrowdStrike and been able to, you know, get into their systems, this could have been much, much worse. Imagine if it wasn't just shutting down the systems, but locking them up with ransomware or deleting their data. And so I think the other side of how we look at this going forward is: do we allow third parties to get into the kernel in the way that CrowdStrike's Falcon Sensor does? We see there already are alternatives to that, like what macOS has.
So should Microsoft Windows have something similar? I don't think we should actually completely blame Microsoft for this. This was not a Windows bug, and this was not something that Microsoft had certified. Microsoft had certified the core CrowdStrike software, but it had not certified the update, and the update didn't go through their certification mechanism. And the fact that they don't have this third-party framework for security software, some people say, is due to some government regulations on them. So we could actually blame the government in some sense, indirectly. But of course, the main thing we should be blaming is CrowdStrike themselves. Again, I don't think it would have been impossible for many other companies to have a similar bug. It's not that they had the bug, it's that they didn't have the proper testing and deployment processes in place to prevent such a large incident from happening. Thanks for listening to us this week. Rebecca, how can people get in touch with us on X?

Rebecca Kopec

We're @KopecExplains, K-O-P-E-C-E-X-P-L-A-I-N-S.

David Kopec

Thanks for listening, and we'll talk to you in a few weeks. Bye.

On July 19th, 2024, the cybersecurity firm CrowdStrike released an update for its Falcon Sensor software that brought down millions of Windows computers around the world. Some of these computers were involved in critical infrastructure like airlines, hospitals, and governments. Falcon Sensor, which hooks into the Windows kernel, was certified by Microsoft and cryptographically signed. The update triggered a classic (and common) bug reading past the end of an array. It not only crashed Windows, but didn't even allow it to complete its boot process. In this episode we explain what the bug was and why CrowdStrike did not catch it before it had the opportunity to bring down millions of machines.

Show Notes

Follow us on X @KopecExplains.

Theme “Place on Fire” Copyright 2019 Creo, CC BY 4.0

Find out more at http://kopec.live