CrowdStrike has hired two outside security firms to review the Falcon functionality that sparked a global IT outage last month – but they may not find an awful lot, because CrowdStrike has identified the simple mistake that caused the meltdown.
News of the external review emerged in a root cause analysis published on Tuesday by the infosec vendor.
As we learned from CrowdStrike’s earlier post-incident write-up of the flawed Falcon update, which boot-looped millions of Windows machines worldwide, the problem began back in February.
That was when CrowdStrike gave Falcon, its threat-detection suite, the ability to spot and block novel exploitation of named pipes and other Windows interprocess communication (IPC) mechanisms. Seeing such attacks in the wild is a strong indication that a box has been compromised, so they are worth flagging and stopping.
That new detection functionality went through the usual development and testing before CrowdStrike pushed it as a new “template type” to customers’ Falcon installations in sensor version 7.11.
These template types are, as the name suggests, templates: generalized routines, each of which picks up a different type of potentially bad activity on a system. For Falcon to use them to detect specific threats, CrowdStrike defines and issues so-called “template instances” that customize the template code to identify particular forms of exploitation, intrusions, and other bad stuff.
CrowdStrike explains this architecture thus: “Template Types represent a sensor capability that enables new telemetry and detection, and their runtime behavior is configured dynamically by the Template Instance (ie, Rapid Response Content).”
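To make the type/instance split concrete, here is a minimal Python sketch. It is not CrowdStrike’s code, which hasn’t been published; the names TemplateInstance and ipc_template_type, the glob-style matching, and the two-field events are all hypothetical. The template type is one generic routine, and each instance is just data telling it what to look for.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical template instance: pure configuration data, no code."""
    name: str
    criteria: tuple[str, ...]  # one glob pattern per input field; "*" matches anything


def ipc_template_type(instance: TemplateInstance, event_fields: tuple[str, ...]) -> bool:
    """Hypothetical generic IPC detector whose behavior is set entirely by the instance.

    Flags an event when every field matches the corresponding pattern.
    """
    # Refuse to evaluate if the event and the instance disagree on the number of fields.
    if len(event_fields) != len(instance.criteria):
        raise ValueError("field count mismatch")
    return all(
        fnmatchcase(value, pattern)
        for pattern, value in zip(instance.criteria, event_fields)
    )


# Example instance: a made-up pipe-name pattern, any client process.
suspicious_pipe = TemplateInstance("demo-pipe", ("evilpipe_*", "*"))

print(ipc_template_type(suspicious_pipe, ("evilpipe_1a2b", "rundll32.exe")))  # True
print(ipc_template_type(suspicious_pipe, ("spoolss", "spoolsv.exe")))  # False
```

In this model, a new detection is a new instance rather than new code, which is the point of the split.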
From March onward, CrowdStrike pushed a few template instances from its cloud to remote Falcon deployments, each using the IPC template type to detect specific threats. These updates, delivered as so-called Rapid Response Content, were stored in a channel file numbered 291. Falcon would download an updated channel 291 file and have its Content Interpreter parse the data.
The template instances in that data would tell Falcon how to use the template type to detect particular threats. The root causes analysis provides a deeper look at what went wrong next:
What this means is: The template type detecting malicious IPC use had 21 possible input values to customize its actions, though the code plugging the channel file’s instances into the type only provided 20. In the first supplied instances, this wasn’t a problem as the instances didn’t cause the interpreter to use the missing 21st parameter. All seemed fine.
Then, as CrowdStrike also previously explained, two further IPC-related template instances were automatically deployed to Falcon users on July 19 in that fateful channel file. One of them instructed the interpreter to use the 21st parameter, even though only 20 had been supplied. That caused the interpreter, which unfortunately runs in Windows kernel mode, to use an unpopulated field in the input array as a pointer and try to access the memory it pointed to, which ultimately crashed the operating system.
“The attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash,” the security shop explained in the root cause analysis.
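The mismatch can be reproduced in miniature. The Python sketch below is hypothetical: the interpreter loop, the "*" wildcard convention, and the sample values are invented for illustration, and only the 21-versus-20 counts come from CrowdStrike’s account. It assumes an instance only reads an input when its criterion for that field isn’t a wildcard, so an earlier-style instance never touches the missing 21st slot, while one that sets a real value there reads past the end of the array.

```python
WILDCARD = "*"
TEMPLATE_FIELD_COUNT = 21  # input fields the IPC template type defines
SUPPLIED_INPUT_COUNT = 20  # values the integration code actually passed in


def evaluate(criteria: list[str], inputs: list[str]) -> bool:
    """Hypothetical interpreter: reads an input only when its criterion isn't a wildcard."""
    for index, pattern in enumerate(criteria):
        if pattern == WILDCARD:
            continue  # wildcard: the input at this index is never read
        if inputs[index] != pattern:  # inputs[20] does not exist
            return False
    return True


inputs = [f"value{i}" for i in range(SUPPLIED_INPUT_COUNT)]

# Earlier-style instance: 21 criteria, but the 21st is a wildcard, so it is never read.
earlier_instance = ["value0"] + [WILDCARD] * (TEMPLATE_FIELD_COUNT - 1)
print(evaluate(earlier_instance, inputs))  # True

# July 19-style instance: a concrete value in the 21st field forces a read of inputs[20].
july_instance = [WILDCARD] * (TEMPLATE_FIELD_COUNT - 1) + ["some-value"]
try:
    evaluate(july_instance, inputs)
except IndexError as exc:
    # Python stops with an exception; unchecked native code has no such guard
    # and simply reads whatever memory follows the array.
    print("out-of-bounds read:", exc)  # out-of-bounds read: list index out of range
```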
CrowdStrike has written a fix to ensure that a mismatch between the number of inputs validated and the number actually supplied can’t happen again. It’s a patch for the Sensor Content Compiler – the function that validates the number of inputs provided by a template type – and it went into production on July 27.
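A compile-time check of the kind described might look like the following sketch. The function name, error type, and message are hypothetical; the idea is simply that a template type whose declared input count differs from what the sensor will supply gets rejected before any content built on it can ship.

```python
class TemplateCompileError(ValueError):
    """Raised when a template type's declared inputs don't match what the sensor supplies."""


def check_input_count(template_name: str, declared_fields: int, supplied_inputs: int) -> None:
    """Hypothetical compiler check: fail the build on any count mismatch."""
    if declared_fields != supplied_inputs:
        raise TemplateCompileError(
            f"{template_name}: declares {declared_fields} input fields, "
            f"but the sensor supplies {supplied_inputs}"
        )


check_input_count("IPC", declared_fields=20, supplied_inputs=20)  # passes silently
try:
    check_input_count("IPC", declared_fields=21, supplied_inputs=20)
except TemplateCompileError as exc:
    print("build rejected:", exc)
```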
CrowdStrike also wrote that it has added runtime input array bounds checks to the Content Interpreter for Rapid Response updates, to ensure the size of the input array matches the number of expected inputs. These fixes are currently being backported to all Windows sensor versions 7.11 and above with a sensor software hotfix. The release will be generally available by August 9.
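The runtime check is a second line of defense in case a bad file slips past the compiler anyway. In this hypothetical Python sketch, evaluate_checked verifies the array sizes before reading any field. Skipping a bad instance by returning None, rather than raising an error, is an assumption for illustration, not CrowdStrike’s documented behavior.

```python
from typing import Optional

WILDCARD = "*"


def evaluate_checked(criteria: list[str], inputs: list[str], expected_count: int) -> Optional[bool]:
    """Hypothetical bounds-checked interpreter.

    Returns True/False for a match/no match, or None if the instance was rejected.
    """
    # Bounds check before any field is touched: both arrays must have exactly the expected size.
    if len(inputs) != expected_count or len(criteria) != expected_count:
        return None  # hypothetical policy: reject the instance rather than read past the array
    for pattern, value in zip(criteria, inputs):
        if pattern != WILDCARD and value != pattern:
            return False
    return True


inputs = [f"value{i}" for i in range(20)]
# The July 19 scenario: 21 expected fields, only 20 supplied, so nothing is read.
print(evaluate_checked([WILDCARD] * 20 + ["some-value"], inputs, 21))  # None
```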
Additionally, the chastened security vendor is doing more tests – including checks to ensure that flawed files aren’t pushed to Falcon customers in the future. CrowdStrike’s validation engine missed the parameter mismatch and allowed the faulty channel file to go out to users.
Further, as CrowdStrike had noted in its earlier analysis, every template instance will henceforth be deployed to customers in a staged roll-out, rather than pushed to all users at once. That will reduce the blast radius of any further broken updates.
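Staged roll-outs are commonly built from deterministic rings. The sketch below is one such approach, not CrowdStrike’s published design: the ring sizes, the SHA-256 host bucketing, and the 0.1 percent crash-rate threshold are all assumptions. Hashing the host ID keeps each machine in the same ring from update to update, and the roll-out halts if the current ring looks unhealthy.

```python
import hashlib
from typing import Optional

# Hypothetical cumulative ring sizes, in basis points of the fleet (100 = 1%).
STAGES = (100, 1_000, 5_000, 10_000)


def rollout_bucket(host_id: str) -> int:
    """Map a host deterministically to a bucket in 0..9999."""
    digest = hashlib.sha256(host_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def eligible(host_id: str, stage: int) -> bool:
    """A host receives the update once the roll-out reaches a ring that covers its bucket."""
    return rollout_bucket(host_id) < STAGES[stage]


def next_stage(stage: int, crash_rate: float, threshold: float = 0.001) -> Optional[int]:
    """Widen the roll-out by one ring if the current ring is healthy; None means halt."""
    if crash_rate > threshold:
        return None
    return min(stage + 1, len(STAGES) - 1)


fleet = [f"host-{n}" for n in range(10_000)]
print(sum(eligible(h, 0) for h in fleet))  # about 1% of the fleet
```

Because each ring’s cutoff is larger than the last, a host that got the update in an early ring stays covered as the roll-out widens.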
It’s worth noting that the biz is being sued by investors for not using this type of phased approach in sending updates to customers in the first place.
“Looking ahead, CrowdStrike is focused on using the lessons learned from this incident to better serve our customers,” a spokesperson declared. “CrowdStrike remains steadfast in our mission to protect customers and stop breaches.”
But not so steadfast that it’s naming the partners it hired to review its programming. Those reviews have commenced, and are focused on the code and processes that led to the July 19 fiasco.
“We are not providing information on the vendors who are doing work for us beyond what is referenced in the root causes analysis,” the CrowdStrike spokesperson told The Register. ®