Please follow our System Status Page
Public Post Mortem (times in CEST):
On Friday, 19 July 2024, parts of our nShift infrastructure were affected by the global CrowdStrike outage; an incident timeline is provided below. Many of our services were unavailable for several hours. There was no data breach or suspicion of a cyber attack related to this incident.
Preventative actions:
Cyber security is very important to us at nShift, and CrowdStrike is a cyber security vendor trusted by many companies. Since this was a global IT incident, there was no immediate action we could have taken to prevent this scenario. We will continue our dialogue with CrowdStrike on how to avoid a recurrence of similar incidents and reduce recovery time.
Improved recovery time actions:
This is the first time we have experienced an outage of this magnitude, and we have made several observations about our infrastructure. As part of our commitment to continuous service improvement, we continue to evaluate our current infrastructure to improve our recovery time.
Incident resolution timeline:
06:40 Our first monitoring tools started alerting us of issues with our infrastructure.
06:47 Our 24/7 Emergency Response Team (ERC) started investigating whether they were able to recover some instances (servers hosting our applications).
07:00 ERC escalated to our Problem Manager.
07:10 The Problem Manager initiated processes to establish a Crisis Team and start the Major Incident Process.
07:15 The first status page notification was sent.
07:20 Direct cause confirmed: the Blue Screen of Death (BSOD) bug caused by the CrowdStrike services we use.
07:20 - 10:00 The Crisis Team evaluated several recovery strategies during this period. Throughout, we were in contact with AWS Support and CrowdStrike. Several of the recovery options published online were not applicable, as they targeted desktop workstations rather than cloud servers.
10:10 Together with AWS, we identified a recovery method and verified it working on some of our instances in AWS.
10:10 - 14:30 The Crisis Team recovered instances in prioritized order. All instances were back online at 14:30, though with some degradation of service because AWS was still experiencing issues (fully resolved by 15:00).
15:00 Major Incident was closed.
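For context, the recovery method found at 10:10 for cloud servers was likely along the lines of the widely published AWS guidance at the time: detach the root volume of an affected Windows instance, remove the faulty CrowdStrike channel file from a healthy rescue instance, then reattach. The sketch below only prints the AWS CLI command sequence (a dry run); the instance and volume IDs are placeholders, not our real resources, and this is an illustration rather than our exact procedure.

```shell
#!/bin/sh
# Dry-run sketch of the published AWS recovery steps for an instance stuck in
# a BSOD boot loop. All resource IDs are placeholders.
AFFECTED_INSTANCE="i-0123456789abcdef0"   # placeholder ID
ROOT_VOLUME="vol-0123456789abcdef0"       # placeholder ID
RESCUE_INSTANCE="i-0fedcba9876543210"     # placeholder ID

# Print each AWS CLI call instead of executing it.
run() { echo "+ $*"; }

# 1. Stop the affected instance and detach its root volume.
run aws ec2 stop-instances --instance-ids "$AFFECTED_INSTANCE"
run aws ec2 detach-volume --volume-id "$ROOT_VOLUME"

# 2. Attach the volume to a healthy rescue instance as a secondary disk.
run aws ec2 attach-volume --volume-id "$ROOT_VOLUME" \
    --instance-id "$RESCUE_INSTANCE" --device /dev/sdf

# 3. On the rescue instance, delete the faulty CrowdStrike channel file:
#    C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys

# 4. Move the cleaned volume back as the root device and restart.
run aws ec2 detach-volume --volume-id "$ROOT_VOLUME"
run aws ec2 attach-volume --volume-id "$ROOT_VOLUME" \
    --instance-id "$AFFECTED_INSTANCE" --device /dev/sda1
run aws ec2 start-instances --instance-ids "$AFFECTED_INSTANCE"
```

At scale this has to be repeated per instance, which is consistent with the several hours of prioritized recovery noted in the timeline.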