Midmarket Reacts, Recovers From CrowdStrike Outage

Needless to say, the outage placed additional burden on IT departments, particularly those in the midmarket where budgets and team sizes can be limited.

It was an all-hands-on-deck weekend for IT operations across the globe after a massive technology outage--reportedly the most widespread in history--brought down computer systems in airports, hospitals, first responder communications, retail and other industries for hours beginning Friday, with some businesses still recovering days later.

The business world came to almost a full halt after a software update from CrowdStrike brought the "blue screen of death" (BSOD) to millions of Windows machines, crashing them.

The crash was specifically caused by an updated sensor configuration file that CrowdStrike released for its Falcon sensor software running on Windows machines. "This configuration update triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems," CrowdStrike wrote on its site.

The update was pushed out automatically to customers, leaving them scrambling to recover computers that crashed once it was applied.

A 'Manual Effort' For Midmarket IT

Some midmarket IT executives spoke about the impact of the outage on their organizations.

"We were impacted. Took us three hours to get our servers online. Then, most of the day to fix the individual PCs," said one IT executive who wished to remain anonymous. "I get that we shouldn't have immediate updates. These were channel updates, aka virus definitions. To minimize risk, it's important to apply them quickly to reduce zero-day risk," he added.

Another IT executive said that while his organization was not a CrowdStrike customer, "we are, have, and will be impacted by some of our business partners that were directly impacted. ... The fix is simple enough, but it is a manual effort that requires touching every impacted device and this is an issue that will take days, if not weeks, for some organizations to fully recover depending upon their size."

Richard Richison, senior director of IT infrastructure and cybersecurity operations at Repligen, often provides advice to his IT executive peers, and he sees a valuable lesson in the fallout from the outage.

"This is a good reminder not to access vendor updates as soon as they are released," Richison said. "The CrowdStrike customers who control when their devices are updated are in a much better position than those who accept the updates as soon as they are released."

Needless to say, the outage placed an additional burden on IT departments, particularly those in the midmarket, where limited budgets and small teams make putting out IT fires harder. MSPs, many with midmarket customers, also spent significant time working with those affected by the outage.

Stuart Macintyre, head of infrastructure at Vesta Software Group, said that Hyve, their managed service provider, helped them get up and running again after the CrowdStrike incident.

"We started Friday with alerts that much of our production environment was down (for three separate business units). While we investigated, we logged a ticket with the Hyve team. Hyve immediately started recovering systems by way of a reboot. Once a CrowdStrike update was identified as the root cause, the fix was shared with the Hyve team, who, again, immediately started to help apply the fix to the remainder of our systems. All key systems were back online within a couple of hours with the remaining systems all recovered by mid-day. It gives us peace of mind knowing that when we need their help, the Hyve team are always there," Macintyre said.

Omer Grossman, CIO of CyberArk, said that the outage "will be one of the most significant cyber issues of 2024."

"The damage to business processes at the global level is dramatic," Grossman said in a statement to MES Computing. He also said that businesses were looking at days of recovery.

"The [issue] is how customers get back online and regain continuity of business processes. It turns out that because endpoints have crashed ... they cannot be updated remotely and this problem must be solved manually, endpoint by endpoint. This is expected to be a process that will take days."

Grossman said a deeper dive is also needed into how this happened.

"What caused the malfunction? The range of possibilities ranges from human error -- for instance, a developer who downloaded an update without sufficient quality control -- to the complex and intriguing scenario of a deep cyberattack, prepared ahead of time and involving an attacker activating a 'doomsday command' or 'kill switch.' CrowdStrike's analysis and updates in the coming days will be of the utmost interest."

Brian Gagnon, CTO at Uprise Partners, a managed service provider that offers IT services for organizations with a particular focus on the midmarket, said that while none of its customers had direct issues with CrowdStrike, they did have issues "indirect through other vendors."

"We did advise some midmarket customers" on a "move forward plan," Gagnon said. Some customers put too many of "their eggs in one basket" and become overly reliant on a single vendor, or too few vendors, to run their operations, he said.

In a blog post, Gagnon offered more advice for IT leaders, including using the incident to do a bit of housekeeping:

"Recent events, such as the CrowdStrike internet outage, have underscored the importance of maintaining continuous compliance and operational resilience. The outage, which affected a wide range of services provided by CrowdStrike, a leading cybersecurity firm, highlighted the vulnerabilities that can arise even in well-established IT environments," Gagnon wrote.

"Many businesses that rely on CrowdStrike for threat detection and incident response faced significant disruptions during the outage. This incident has driven some companies to rethink their approach to IT compliance and resilience," he added.

The outage highlighted the importance of continuous monitoring, Gagnon said. "Businesses are now prioritizing real-time monitoring solutions to detect and address issues before they escalate."

Investing in more resilient IT infrastructure to maintain compliance and operational continuity during a disaster is also critical, Gagnon said in his post.

The outage also highlighted the need for proactive risk management.

"Businesses are adopting a more proactive approach to compliance, regularly reviewing and updating their policies and procedures to address emerging threats and regulatory changes," he wrote.

While much of the chaos has subsided, trickle-down snarls remain: local news stations report that Delta is still experiencing delays and flight cancellations due to the outage.