The Day the World Went Blue: What Happened and What Does It Mean?
Friday morning started like any other for most people, apart from a few minor inconveniences: trains delayed or tickets not working (not unlike most days), the local supermarket unable to take card payments, or live news channels completely off the air.
Those trying to rectify these issues were met with the same sight on the back end of their systems: the Windows ‘blue screen of death’.
Due to one flawed update to a single piece of software, Microsoft estimates that 8.5 million Windows devices were affected, with potentially 20,000 companies impacted, over 7,000 flights cancelled, and multiple non-urgent operations delayed. This article covers what happened, how it happened, remediation help, the present and future impact of this incident, and how it can be prevented from happening again.
What caused the global IT outage?
Early on Friday morning, users began installing a routine update issued for their CrowdStrike Falcon sensor. Shortly afterwards, Windows devices running this piece of software began showing the ‘blue screen of death’, rendering numerous systems inoperable with no immediate fix available.
CrowdStrike is a well-known cyber security firm, with a reported customer base of close to 30,000 that includes half of the Fortune 500 companies, according to Reuters. This incident specifically related to CrowdStrike’s endpoint security monitoring services, which are intended to protect individual devices from cyber-attacks.
The cause of such widespread impact was one update in this endpoint monitoring software.
How did this happen?
All applications receive regular updates that must be installed to maintain their functionality and security capabilities. Applications monitoring endpoint security can receive updates daily, or sometimes even hourly, to ensure they have the rules and signatures required to stop the latest cyber-attacks.
All program changes, especially those from major technology companies such as Microsoft, Google, and CrowdStrike, go through a quality assurance (QA) and testing process to ensure they are compatible with the systems they are supplied to. During these tests, any issue in a software update should become apparent, at which point the change is returned to the developers to rectify.
For this update, however, that QA process appears to have been missed or incomplete. The incident has not been classified as a security breach, which means the problematic code was not intended to disrupt systems; rather, the issue was simply not picked up during QA checks.
At the time of writing, this remains unconfirmed, but CrowdStrike has reported that they “released a sensor configuration update to Windows systems. Sensor configuration updates are an ongoing part of the protection mechanisms of the Falcon platform. This configuration update triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems.”
The point of failure was a channel file, and CrowdStrike states, “Updates to Channel Files are a normal part of the sensor’s operation and occur several times a day in response to novel tactics, techniques, and procedures discovered by CrowdStrike”.
Are you still experiencing downtime?
There is no automated way to roll out a fix, which makes remediation complex in this instance. It requires physical access to devices, which is not simple for companies that rely on third-party IT providers or have remote workers spread across the world. The NHS, while starting to get devices operational again, currently says it won’t be able to get all endpoints back online until next week.
If you have physical access, the fix can be applied using the advice below; however, we recommend visiting CrowdStrike’s website, which has fixes for endpoints as well as Azure and AWS appliances.
Fix for CrowdStrike Falcon sensor
The workaround published by CrowdStrike is to boot the affected device into Safe Mode or the Windows Recovery Environment, delete the channel file(s) matching C-00000291*.sys from the C:\Windows\System32\drivers\CrowdStrike directory, and then restart the device normally. As of 21/07/2024, Microsoft has also released a recovery tool which scripts the fix while your device is blue-screened – Microsoft Tool
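For administrators who can mount an affected system drive from a recovery environment that has Python available (a strong assumption in practice; the manual steps above or Microsoft’s tool remain the recommended route), the clean-up amounts to removing the faulty channel files. A minimal illustrative sketch, not an official CrowdStrike script:

```python
# Illustrative sketch only: removes the faulty Falcon channel files from a
# mounted Windows system drive. Assumes the drive is accessible (for example
# from a recovery environment) and that Python can be run there, which is often
# not the case in practice; prefer CrowdStrike's official guidance or
# Microsoft's recovery tool.
from pathlib import Path


def remove_faulty_channel_files(system_root: str = "C:\\") -> None:
    crowdstrike_dir = Path(system_root) / "Windows" / "System32" / "drivers" / "CrowdStrike"
    if not crowdstrike_dir.is_dir():
        print(f"No CrowdStrike driver directory found under {system_root}")
        return
    # The problematic update shipped as channel files named C-00000291*.sys
    for channel_file in crowdstrike_dir.glob("C-00000291*.sys"):
        print(f"Deleting {channel_file}")
        channel_file.unlink()


if __name__ == "__main__":
    remove_faulty_channel_files()
```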
Another reason companies are taking a while to fix this issue is full disk encryption such as BitLocker. Full disk encryption protects the data on a device in the event it is lost or stolen; without it, if the data drive is removed from the computer, the data can be easily read and accessed. Companies that have recorded the recovery keys for the impacted systems are fine and can remediate, but companies that stored keys on another device (which might also be affected), or didn’t record them at all, are running into problems. Without this key, you can’t follow the remediation advice. CrowdStrike and Microsoft have produced guidance around this, but it is still considered experimental – BitLocker recovery without keys
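For devices that are still healthy, it is worth confirming now that their BitLocker recovery keys are recorded somewhere other than the device itself. Below is a minimal sketch using the built-in manage-bde utility, run from an elevated prompt on Windows; where and how you escrow the output is left to your own key management process:

```python
# Illustrative sketch: list the BitLocker key protectors (including the
# numerical recovery password) for a volume using the built-in manage-bde
# utility, so the key can be recorded away from the device. Run from an
# elevated prompt on a healthy Windows machine; storing the output securely
# is left to your own key-escrow process.
import subprocess


def get_bitlocker_protectors(volume: str = "C:") -> str:
    result = subprocess.run(
        ["manage-bde", "-protectors", "-get", volume],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(get_bitlocker_protectors("C:"))
```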
What is the current and future impact?
This incident affected all sectors and businesses running Windows systems with the CrowdStrike Falcon Sensor. The issue caused grounded flights, forced broadcasters off the air, and left customers without access to services such as healthcare or banking.
Affected sectors included:
Airlines – around 7,000 flights were cancelled worldwide.
Banks – some banking and financial institutions said digital services were affected by the outage on Friday, though many said throughout the day that most customer services had been restored.
Medical institutions – the IT outage is "causing disruption in the majority of GP practices" in England, NHS England has said (as of 20/07/2024, 11:00 am).
Media companies – numerous news channels were knocked off the air until systems could be returned to normal.
Logistics – both FedEx and UPS warned there could be delivery delays as the companies dealt with the global outage.
Retailers – multiple retailers said some parts of their operations had been affected, but widespread disruption for customers was minimal.
It is expected that systems will not be fully operational again until the end of the week, at which point the full impact of the issue will become apparent. Widespread supply chain issues are also expected, as problems in delivery systems, ports, and manufacturing cause a knock-on effect across various industries.
The cyber impact and implications
Across the sector, we are already seeing mass phishing emails from threat actors pretending to be CrowdStrike support or offering a solution to the outages. As with most phishing attacks, these threat actors prey on the emotional state of their targets, making them more likely to engage with a malicious email. The following phishing domains have been recorded by CrowdStrike:
crowdstrike.phpartners[.]org
crowdstrike0day[.]com
crowdstrikebluescreen[.]com
crowdstrike-bsod[.]com
crowdstrikeupdate[.]com
crowdstrikebsod[.]com
www.crowdstrike0day[.]com
www.fix-crowdstrike-bsod[.]com
crowdstrikeoutage[.]info
www.microsoftcrowdstrike[.]com
crowdstrikeodayl[.]com
crowdstrike[.]buzz
www.crowdstriketoken[.]com
www.crowdstrikefix[.]com
fix-crowdstrike-apocalypse[.]com
microsoftcrowdstrike[.]com
crowdstrikedoomsday[.]com
crowdstrikedown[.]com
whatiscrowdstrike[.]com
crowdstrike-helpdesk[.]com
crowdstrikefix[.]com
fix-crowdstrike-bsod[.]com
crowdstrikedown[.]site
crowdstuck[.]org
crowdfalcon-immed-update[.]com
crowdstriketoken[.]com
crowdstrikeclaim[.]com
crowdstrikeblueteam[.]com
crowdstrikefix[.]zip
crowdstrikereport[.]com
The phishing emails in circulation claim to be from CrowdStrike support and offer to help users get their systems operational again. For any information on this incident, please go directly to the source, and if you receive emails from what you believe to be CrowdStrike, only use a legitimate email address you have previously used to ensure you are communicating with CrowdStrike. A simple check of links and sender domains against the indicators above can also help catch these messages, as sketched below.
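The following is a minimal, illustrative sketch of such a check, using a subset of the domains listed above with the defanged [.] restored; it is not a complete detection tool, and the example URLs are hypothetical:

```python
# Illustrative sketch: flag URLs or sender domains that match phishing domains
# published by CrowdStrike. PHISHING_DOMAINS holds a subset of the indicators
# listed above with the defanged [.] restored; extend it with the full list
# before relying on it.
from urllib.parse import urlparse

PHISHING_DOMAINS = {
    "crowdstrikebluescreen.com",
    "crowdstrike-bsod.com",
    "crowdstrikefix.com",
    "crowdstrikeoutage.info",
    "crowdstrike-helpdesk.com",
    "microsoftcrowdstrike.com",
}


def is_suspicious(url_or_domain: str) -> bool:
    """Return True if the host matches, or is a subdomain of, a known indicator."""
    host = urlparse(url_or_domain).hostname or url_or_domain
    host = host.lower().removeprefix("www.")
    return any(host == d or host.endswith("." + d) for d in PHISHING_DOMAINS)


if __name__ == "__main__":
    print(is_suspicious("https://crowdstrike-helpdesk.com/fix"))  # True
    print(is_suspicious("crowdstrikefix.com"))                    # True (e.g. a sender domain)
    print(is_suspicious("https://www.crowdstrike.com/"))          # False
```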
Can this be stopped from happening again? What are the next steps?
This incident was caused by one simple mistake in the code for the channel update. Mistakes like this can be prevented with defined, strong, enforceable quality processes alongside thorough checks during testing. The issue was preventable, but it has highlighted a weakness in the global technology industry: one simple, non-malicious mistake caused this much disruption. What if, in the future, this wasn’t a mistake but an intentional change designed to cause mass chaos?
Most people had probably never heard of CrowdStrike, but this incident caused so much damage because of who it supplies services to: Microsoft, the NHS, major airlines, banks and more – industries and businesses that make up much of the world’s critical infrastructure. This wasn’t caused by an attack, but by a major vendor deploying software that, potentially, wasn’t properly tested or debugged, and wasn’t rolled out in stages.
This incident emphasises how important supply chain management is for the affected companies. Typically, software vendors have a process in place for rolling out new updates, testing them in a secure environment before all devices receive them. In this incident, it is likely that CrowdStrike’s internal processes for pushing out the update weren’t fully followed; however, this will only be confirmed once the organisation has completed its investigation. The incident, and the subsequent global disruption, could lead to more regulation of the technology industry in an attempt to prevent this type of incident happening again, although defining and enforcing such regulation will be difficult.
For the average company, what can you do?
Firstly, all organisations should be aware of the increase in phishing emails imitating CrowdStrike, and ensure all employees are aware too. These phishing emails are most likely to target organisations that use the software, but threat actors will often send the same messaging to many organisations quickly without researching exactly who uses it.
Once you're operational, we recommend reviewing your supply chain to determine whether your suppliers were using CrowdStrike and whether they are continuing to use it. Identifying where downtime could recur within your supply chain means you can create effective incident response and business continuity plans. Some organisations may also lose trust in software following a major incident and simply turn it off rather than properly decommission it; this has the potential to cause wider issues and create vulnerabilities as the software goes unpatched. Being aware of this will help your organisation assess its risks successfully.
As well as reviewing your supply chain, it would be beneficial to run incident response simulations based on scenarios like this one and evaluate what your organisation would do if critical suppliers or systems became inoperable. Having a plan of action for when something like this happens will reduce the impact and downtime for your systems.
It is also worthwhile revising your change management process so that updates from third-party vendors are not deployed to all systems at the same time. Whilst the QA/testing element should be conducted by the vendor, there’s no reason you can’t have an additional testing and rollout plan to minimise internal risk; a simple ring-based rollout, sketched below, is one way to stage such deployments. If your organisation conducts development work for customers, it is also worth reviewing your software development life cycle to ensure you are fully testing all new releases and deployments. Even if your organisation isn’t the size and scale of CrowdStrike, if a major incident affected your customer base, how effectively could you recover?
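As an illustration of that staged approach, the sketch below splits a fleet of hosts into deployment rings (a small canary group, a pilot group, then the broad estate) with a soak period and health check between rings. The host names, ring sizes, and the deploy_to/ring_is_healthy functions are placeholders to adapt to your own tooling, not a real product API:

```python
# Illustrative sketch of a ring-based (staged) rollout for third-party updates.
# deploy_to() and ring_is_healthy() are placeholders for your own deployment
# tooling and monitoring; the host names and ring sizes are example values.
import time


def deploy_to(hosts: list[str]) -> None:
    """Placeholder: push the update to the given hosts via your own tooling."""
    print(f"Deploying update to {len(hosts)} host(s)")


def ring_is_healthy(hosts: list[str]) -> bool:
    """Placeholder: check monitoring/telemetry for crashes, boot loops, etc."""
    return True


def staged_rollout(hosts: list[str], ring_fractions=(0.01, 0.10, 1.0), soak_seconds=3600) -> None:
    """Deploy to progressively larger rings, pausing and checking health between each."""
    deployed = 0
    for fraction in ring_fractions:
        target = max(1, int(len(hosts) * fraction))
        ring = hosts[deployed:target]
        if not ring:
            continue
        deploy_to(ring)
        deployed = target
        time.sleep(soak_seconds)  # soak period before widening the rollout
        if not ring_is_healthy(hosts[:deployed]):
            raise RuntimeError("Health check failed: halt the rollout and roll back.")


if __name__ == "__main__":
    fleet = [f"host-{i:03d}" for i in range(200)]
    staged_rollout(fleet, soak_seconds=1)  # short soak purely for demonstration
```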
For any help or guidance on this incident, please contact info@purecyber.com.
You can learn more about managing supply chain security in our webinar recording and free checklist resource.
**PureCyber does not use CrowdStrike in its technology stack and therefore there should be no impact on our clients through any of our systems.
Sources
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/