Navigating the CrowdStrike Outage
Lessons and Insights for Business Continuity
This article shares expert insight on the CrowdStrike outage from a recent interview with Managed Solution's Director of Security and Identity Management, Richard Swaisgood.
In a recent interview, Richard Swaisgood from Managed Solution provided a detailed overview of the CrowdStrike outage that impacted organizations globally.
Continue reading to delve into the outage's immediate effects, Microsoft’s collaboration with third-party vendors, and the essential lessons learned for the business community.
Overview of the CrowdStrike Outage
The recent CrowdStrike outage primarily affected Windows 10 and newer workstations and Windows Server products. It was caused by a faulty CrowdStrike update rather than by Windows itself. According to Richard, CrowdStrike released a "rapid response" update, which, unlike a typical software update, is a definition or signature file that tells antivirus or EDR systems what to detect.
Because these updates are meant to address active threats swiftly, they usually skip the lengthier testing applied to full software releases. In this case, the update had not gone through standard testing and validation, and it caused boot failures that left devices stuck in a boot loop, including critical servers hosted in Azure, AWS, and GCP.
Microsoft's Role in Incident Management
Microsoft collaborates closely with third-party security vendors like CrowdStrike. Upon receiving reports of the issue, Microsoft worked with CrowdStrike to identify the problem and develop a remediation plan.
Advisories were posted on Azure, and remediation information was shared with partners like Managed Solution. Engineers were deployed to help remediate affected workstations and servers, although the fix required manual intervention.
Impact on Managed Solution’s Clients
Managed Solution had two clients using CrowdStrike, both of which were significantly impacted. The team mobilized staff to address the issues, working tirelessly to bring clients' systems back online.
They coordinated with internal IT staff and leveraged partnerships to expedite the remediation process.
Identification and Response to the Issue
Managed Solution identified the issue through its partnership with Microsoft, which provided the necessary fix. Although Managed Solution primarily uses SentinelOne rather than CrowdStrike, its engineers and the clients' IT staff worked diligently to restore systems for the affected clients.
Immediate Steps Taken for Impacted Clients
Managed Solution quickly reached out to Microsoft, obtained the fix, and disseminated the information internally.
Teams worked through the night to implement the remediation, showcasing the company's proactive approach and efficient collaboration with Microsoft.
Thanks to the hard work of these engineers, our clients were able to get back up and running with their data both restored and secured. Our team received the following feedback after completing the remediation:
“I wanted to take a moment to thank each of you for all of your assistance with the remediation efforts around the CrowdStrike debacle that took place last week.
Some of you were up overnight assisting and others driving all over the Denver area assisting with manual fixes that were necessary.
You’re all so appreciated, and we cannot thank you enough for stepping up when needed.”
Benefits of Having a Managed Service Provider (MSP)
This incident underscored the value of having an MSP. Managed Solution was able to quickly mobilize a large team of engineers, demonstrating the scalability and responsiveness that MSPs offer.
They paused other projects and reallocated resources to address the outage, minimizing downtime for clients.
Communication from CrowdStrike
CrowdStrike communicated the issue through blogs, social media, support portals, and direct outreach. Despite these efforts, the scale of the impact made it difficult for that communication to be fully effective.
Nonetheless, they engaged proactively with customers and partners like Microsoft and Amazon to disseminate information.
Key Lessons Learned
Key lessons from this incident include the importance of having documented disaster recovery scenarios and alternate ways to access critical systems.
Organizations should focus on communication and recovery plans, ensuring multiple communication methods and documenting all processes.
Influence on Vendor Management
The incident has refocused attention on ensuring vendors follow best practices for updates and have proper controls in place.
Managed Solution already raises these points with its vendors and clients, but continuous evaluation and transparency are crucial.
Purpose of the Rapid Response Update
Rapid response updates are intended for emerging threats, but this one was not responding to an active threat. It delivered threat markers rather than direct threat mitigation, which made the impact worse: the disruption was not offset by any urgent protection.
Criteria for Evaluating Security Vendors
Essential criteria for evaluating security vendors include a commitment to industry best practices, especially for update rollouts, and having control over updates.
Security products need high-level permissions, so strict deployment and update practices are necessary to prevent widespread issues.
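One way to picture the update controls described above is a staged (ring-based) rollout, in which an update reaches a small group of machines first and only proceeds if they stay healthy. The Python sketch below is illustrative only: the ring names, the `max_failure_rate` threshold, and the `deploy` callback are hypothetical and do not represent how CrowdStrike or any other vendor actually ships updates.

```python
from typing import Callable, Dict, List


def staged_rollout(
    rings: Dict[str, List[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy ring by ring and halt as soon as a ring exceeds the failure threshold.

    `rings` maps a ring name to its hosts, in rollout order.
    `deploy` is a hypothetical callback that returns True if the host
    is healthy after receiving the update.
    Returns the names of the rings that completed successfully.
    """
    completed: List[str] = []
    for ring_name, hosts in rings.items():
        if not hosts:
            # Nothing to deploy in an empty ring; treat it as complete.
            completed.append(ring_name)
            continue
        failures = sum(1 for host in hosts if not deploy(host))
        if failures / len(hosts) > max_failure_rate:
            # Stop here so later, larger rings never receive the update.
            break
        completed.append(ring_name)
    return completed


if __name__ == "__main__":
    # Hypothetical host names, ordered from smallest to largest ring.
    example_rings = {
        "canary": ["test-01", "test-02"],
        "pilot": ["pilot-01", "pilot-02"],
        "broad": ["prod-01", "prod-02", "prod-03"],
    }

    def simulated_deploy(host: str) -> bool:
        # Assumption for the demo: one pilot machine fails to boot.
        return host != "pilot-02"

    print(staged_rollout(example_rings, simulated_deploy))
```

In this simulation the rollout stops at the pilot ring, so the broad production ring is never touched. Asking a vendor whether its update process has an equivalent gate, and whether customers can control which ring their devices sit in, is one concrete way to apply the criteria above.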
Integrating Microsoft Security Solutions
Best practices for integrating Microsoft security solutions with third-party vendors include ensuring minimal disruption during outages, maintaining robust disaster recovery plans, and establishing alternate communication methods.
Preventive Measures for Minimizing Risk
Organizations should have detailed disaster recovery plans, document all steps and responsible parties, and ensure multiple communication methods.
Avoiding single points of failure by diversifying security measures and planning for various scenarios is also crucial.
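The checklist above can also be kept as structured data and checked automatically. The Python sketch below is a minimal illustration, assuming a hypothetical `RecoveryEntry` record per critical system; the field names and the "at least two" rule are assumptions made for this example, not a standard or a Managed Solution tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryEntry:
    """One critical system in a (hypothetical) disaster recovery plan."""
    system: str
    responsible_party: str = ""
    access_methods: List[str] = field(default_factory=list)
    contact_channels: List[str] = field(default_factory=list)


def find_gaps(plan: List[RecoveryEntry]) -> List[str]:
    """Return human-readable gaps: missing owners and single points of failure."""
    gaps: List[str] = []
    for entry in plan:
        if not entry.responsible_party.strip():
            gaps.append(f"{entry.system}: no responsible party documented")
        # Fewer than two distinct options means one failure blocks recovery.
        if len(set(entry.access_methods)) < 2:
            gaps.append(f"{entry.system}: fewer than two access methods")
        if len(set(entry.contact_channels)) < 2:
            gaps.append(f"{entry.system}: fewer than two communication channels")
    return gaps


if __name__ == "__main__":
    # Example data is invented for illustration.
    plan = [
        RecoveryEntry(
            system="File server",
            responsible_party="IT operations lead",
            access_methods=["remote management tool", "on-site console"],
            contact_channels=["email", "phone tree"],
        ),
        RecoveryEntry(
            system="Domain controller",
            access_methods=["remote management tool"],
            contact_channels=["email"],
        ),
    ]
    for gap in find_gaps(plan):
        print(gap)
```

Running a check like this during periodic plan reviews surfaces systems that depend on a single access path or a single way of reaching the people responsible for them.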
Microsoft Defender as a Security Option
Microsoft Defender for Endpoint is a solid EDR solution built into Windows. It integrates well with the OS and has minimal impact, making it a reliable option for organizations already using Microsoft products.
Additional Microsoft Security Tools
Tools like Endpoint Analytics (related to Intune) and Azure Monitor provide valuable insights into device performance and issues.
They help organizations monitor trends, and Azure Arc can extend these Azure features to on-premises environments.
Upcoming Enhancements in Microsoft Security Offerings
Microsoft aims to prevent similar incidents by potentially restricting developer access to certain OS controls.
They strive to balance developer freedom with security, learning from incidents like this to improve overall system resilience.
Best Practices for Handling Security Service Disruptions
Organizations should ensure robust backup and disaster recovery plans, clearly document all processes, and have alternate communication methods.
Regularly reviewing security infrastructure to identify and mitigate potential single points of failure is also essential.
The CrowdStrike outage highlighted the importance of preparedness, vendor management, and the benefits of having an MSP. By learning from this incident, organizations can enhance their resilience and ensure continuity during future disruptions.