Expert Says There Are 5 Key Aspects For IT Leaders To Ponder After The CrowdStrike Outage

A risk management and software quality assurance expert weighs in on lessons that can be learned from the CrowdStrike debacle.

Presumably, there has been a lot of reflection in IT operations around the world since the massive CrowdStrike outage in July.

To recap: Businesses across the globe came to a near standstill after a software update from CrowdStrike crashed millions of Windows machines, leaving them stuck on the "blue screen of death" (BSOD).

The crash was specifically caused by an updated sensor configuration file that CrowdStrike released for its Falcon sensor software running on Windows machines. "This configuration update triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems," CrowdStrike wrote on its site.

The update was also pushed out to customers automatically, leaving them scrambling to recover crashed computers once it was applied.

Fallout from the outage continues. Delta, which had thousands of its customers stranded after its Microsoft-based terminals crashed, is reportedly preparing to seek compensation from CrowdStrike and Microsoft for the downtime.

CrowdStrike's investors have also filed a lawsuit against the cybersecurity company, Forbes reported.

CrowdStrike itself released a detailed post-incident review to the public, stating that it had "engaged two independent third-party software security vendors to conduct further review of the Falcon sensor code for both security and quality assurance. Additionally, we are conducting an independent review of the end-to-end quality process from development through deployment."

Amid the anger and lawsuits, however, one expert said there are lessons IT leaders can learn from the outage.

"Computer software runs the world economy, it is truly a mission-critical asset, but we don't treat it as such. We rely on volunteer open-source projects for our key infrastructure components and then blame them when things go wrong. We give short shrift to quality engineering and quality assurance and then wonder why half the world's economy can be impacted in one morning," said Adam Sandman, an expert in risk management and software quality assurance, and the CEO of Inflectra – a provider of software testing tools.

Sandman said that, in the wake of the CrowdStrike debacle, there are five key aspects IT leaders should ponder when it comes to software:

Risk

"Understanding the mission-critical nature of a piece of software,” Sandman said in a statement to MES Computing. “If it goes down, what is the impact?”

"One thing that could be a solution in the future would be to adopt a risk-based approach to testing (called risk-based testing in the industry) whereby the risks associated with each change in the system are evaluated for probability and impact," he said.

"The risks are mapped to all of the testing activities to see if there are any gaps. In this case, when CrowdStrike added the new template type, they could have documented all of the risks with the change at that point in time," he added.

"One of them might have been that the fields in the new template type had not been formally validated against the range of input data. Another would have been that the use of wildcards in the initial templates meant that future failures were being masked. Mitigations to these risks might have included more detailed automated validation of the interpreter with a large dataset of possible input permutations. There is a very high degree of likelihood that this approach would have caught the failure before rollout occurred," he said.

(Adam Sandman, CEO, Inflectra)
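Sandman's examples translate naturally into a simple risk register. Here is a minimal Python sketch of the risk-based testing approach he describes: each risk attached to a change is scored by probability times impact, and risks with no mapped testing activity are flagged as gaps. The risk names, scores, and test names below are hypothetical illustrations, not CrowdStrike's actual process.

```python
from dataclasses import dataclass, field

@dataclass
class Risk:
    name: str
    probability: int  # 1 (rare) to 5 (almost certain)
    impact: int       # 1 (minor) to 5 (catastrophic)
    covering_tests: list = field(default_factory=list)

    @property
    def score(self) -> int:
        # Classic risk-based testing score: probability x impact
        return self.probability * self.impact

# Hypothetical risks for a new template type, following Sandman's examples
risks = [
    Risk("New template fields not validated against input data range", 3, 5),
    Risk("Wildcards in initial templates mask future failures", 2, 5,
         covering_tests=["test_wildcard_expansion"]),
]

# Flag high-scoring risks that no testing activity covers
for r in sorted(risks, key=lambda r: r.score, reverse=True):
    status = "covered" if r.covering_tests else "GAP"
    print(f"[{status:7}] score={r.score:2}  {r.name}")
```

Run against this example data, the highest-scoring risk is exactly the uncovered one, which is the point of the exercise: the gaps surface before rollout, not after.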

Rollback

"If we deploy something and it fails, how easy is it to roll back. In [the case of the CrowdStrike outage], it required manual intervention per computer."

Deployment

"Why did we deploy to all devices immediately versus a rolling deployment that might have caught it before it became a global outage?"

Testing

"How did this bug get through testing when it appeared to affect all machines running Windows not just some subset?"

Development

"How did this bug get introduced in the first place, and what checks (e.g. code reviews) were in place?"