Accendo Reliability

Your Reliability Engineering Professional Development Site

by Christopher Jackson

And Now to the Biggest IT Outage … Ever!

It is no small irony that a software application designed to protect IT systems from malicious actors was behind the biggest IT outage in the history of computers. A company called Crowdstrike provides a ‘Falcon Sensor’ product that is intended to scan computers that use Microsoft operating systems for vulnerabilities. This product is deployed so deeply into its host operating system that it has access to the ‘kernel,’ which is the program that runs the basic code linking applications to the computer hardware (like memory, the central processing unit and other devices). Unfortunately, a Falcon Sensor update that Crowdstrike sent to its customers had a bug that was not picked up by Crowdstrike’s own validation program (because the validation program had a bug too). Worse, the faulty update accessed a ‘forbidden’ part of memory, which caused the infamous BSOD or ‘blue screen of death.’ So airlines, hospitals, banks, hotels and lots of other companies simply couldn’t operate.

A conservative estimate of losses sustained by the top 500 companies in the US was $5.4 billion. But because this outage impacted hundreds of thousands of smaller entities across the world, this represents a tiny fraction of the overall loss. For example, electronic purchasing companies (the ones that facilitate credit and debit card transactions) were affected, meaning all the corner stores that use them were also hit.

So what costs can be recovered?

Not many. 

First there is insurance. Many people see insurance as a ‘charitable’ service that you pay money to access. So instead of mitigating or worrying about risk and bad consequences, we can buy peace of mind with insurance.

Wrong. All insurance companies do is average the risk, add a margin, and charge customers accordingly. So if you make a claim on your vehicle after you get into an accident, your premiums will go up as your insurer tries to ‘average’ the amount of money they will end up paying you. Some of those 500 companies above were insured, but the best guess at costs that will be covered is around one tenth of the total losses. And the insurers will eventually get that money back when they increase premiums. 

But what about Crowdstrike? Aren’t they liable for the losses their product caused? That’s where terms and conditions are important. Crowdstrike’s terms and conditions essentially limit liability to the amount the customer pays to use the Falcon Sensor product. So you can get the purchase costs reimbursed if you use their software (which will be next to nothing compared to your business’s losses). But if you were that corner store that couldn’t process electronic transactions because the electronic purchasing company wasn’t able to provide its services … you get nothing because you aren’t a direct Crowdstrike customer.

But what about Microsoft … shouldn’t they ensure their operating system isn’t this ‘sensitive’?

Yes. Although Microsoft is arguing the point a little bit.

Microsoft is blaming the European Union (EU) for not being allowed to lock down its kernel. In reality, the EU simply requires Microsoft to give third party security applications access to the same Application Programming Interfaces (APIs) that Microsoft’s own security applications use. And since 2020, Apple has been able to ‘lock down’ its kernel with no pushback from the EU (and anyone who is familiar with the history between Apple and the EU knows that the EU is not afraid to push back when it feels the need to).

The comparison between Apple and Microsoft is sometimes difficult to make, given how Apple computing systems are not used in many of the same ‘security sensitive’ settings that Microsoft computing systems are. But there is certainly wiggle room for improvements. 

But … this shows how fragile ‘things’ are right now

Most of the world’s computers run Microsoft operating systems. This makes Microsoft’s kernel a ‘centralized’ and common element of lots of different services across the world. So if anything goes wrong with the kernel, bad things will happen all at once.

So you might have airlines offline at the same time that train and bus services are offline, leaving literally no alternative modes of transportation for stranded passengers. And when they try to check into a hotel … they can’t!

Is there an upside?

Yes.

Microsoft is better at providing these operating systems than anyone else. That is why they have an oligopoly (bordering on monopoly) on the market. Linux and Apple are relatively small competitors in the world of large scale commercial computing services. 

This means that we have an incredible level of functionality in today’s computing systems as the pace that Microsoft has historically set is very impressive. So the ability to concentrate resources into today’s operating systems means the functionality we currently enjoy offsets (at least in part) the costs we experience when there is an outage.

But this is itself fragile. Many companies (like Boeing) get into a space of market dominance and then stop doing the ‘core business’ that got them there, and instead turn into cynical, competition-killing hit squads. Their focus changes to squeezing as much money as possible out of every decision, and this leads to long term struggles.

Microsoft could fall into this trap (if it hasn’t already), so it is an ongoing concern when one company controls so much. But customers have lapped it up so far because the functionality Microsoft provides is more powerful and better priced than that of its competitors.

So this upside might be fleeting.

OK … but what about resilience in general?

Netflix is famous for creating one of the most resilient IT systems in the world. It has created what is called the ‘Simian Army’ which is a group of programs that deliberately ‘sabotage’ its own infrastructure. One program is called ‘Chaos Monkey’ and it is designed to randomly disable a server every hour. 

Why do they do this? It’s because it sends a message to its software engineers that their code needs to be able to tolerate the havoc that ‘Chaos Monkey,’ ‘Chaos Gorilla,’ ‘Chaos Kong’ and a plethora of other programs will cause when they run amok. And software that can tolerate these deliberate acts of damage is now automatically designed to overcome the vast majority of issues that the ‘real world’ will throw its way. 
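The idea can be illustrated with a toy sketch. This is not Netflix’s code: the `Server` class and the `chaos_monkey` and `resilient_request` helpers are hypothetical names invented here, and the ‘servers’ are just in-memory objects. It assumes only that a request can be retried on another server when one has been disabled.

```python
import random


class Server:
    """A stand-in for a real server (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(servers, rng):
    """Disable one randomly chosen live server and return it."""
    live = [s for s in servers if s.alive]
    victim = rng.choice(live)
    victim.alive = False
    return victim


def resilient_request(servers, request):
    """Try each server in turn so one dead server does not fail the request."""
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError:
            continue  # this server is down, so fall through to the next one
    raise RuntimeError("all servers are down")


rng = random.Random(0)  # seeded so the sketch is repeatable
pool = [Server(f"server-{i}") for i in range(3)]
victim = chaos_monkey(pool, rng)
reply = resilient_request(pool, "GET /video")
print(f"disabled {victim.name}; {reply}")
```

Code that only ever calls `pool[0].handle(...)` would fail whenever the monkey happens to pick that server. Code written like `resilient_request` keeps working, which is the behaviour the deliberate sabotage is meant to encourage.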

Can Microsoft learn from this? Yes. But to be clear, the kernel of an operating system is very different from a service like Netflix, whose main aim is to stream video content and little else. But having a kernel structured in a way that it can deal with routine havoc (like illegal access of memory) is not only useful for security, but also for business continuity.

Is there a lesson?

Yes.

Humans don’t deal with risk well. We often need the consequences of poor decisions to be thrown in our face every day for us to deal with these poor decisions in a logical and rational way. Let’s call everything else a ‘rare event.’

Eating a greasy hamburger won’t immediately cause our body weight to drastically increase. And by justifying one thing at a time, we can reason our way into a passive acceptance of long-term destructive behaviours.

Same goes with software. A very ‘easy’ risk management approach for patches and updates would be to have an option where we ‘wait a week.’ This means we spend another week exposed to emerging malicious threats, but it would also mean that any issues experienced by other customers would give us a week’s worth of notice before we install the same update. 
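A ‘wait a week’ rule like the one above can be sketched in a few lines. This is an illustration only: `should_install`, `DEFERRAL` and the `known_bad` flag are hypothetical names invented here, the dates are made up, and a real deployment would use whatever deferral settings its update tooling actually provides.

```python
from datetime import date, timedelta

# The 'wait a week' window. Seven days is an assumed, tunable value.
DEFERRAL = timedelta(days=7)


def should_install(released_on, today, known_bad=False, deferral=DEFERRAL):
    """Return True once an update has aged past the deferral window
    and nobody has reported a problem with it in the meantime."""
    if known_bad:
        return False  # other customers hit an issue, so keep holding off
    return today - released_on >= deferral


release = date(2024, 1, 1)  # illustrative release date
print(should_install(release, date(2024, 1, 2)))  # one day old: still waiting
print(should_install(release, date(2024, 1, 8)))  # a week old, no reports
print(should_install(release, date(2024, 1, 8), known_bad=True))  # reported bad
```

The trade-off is visible in the function itself: a longer `deferral` means more time exposed to whatever the update was meant to fix, in exchange for more time to hear about problems from others.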

Of course if everyone does this, then there is no risk being mitigated. But the fact that we rarely think about these sorts of things when we type emails on our computers shows how every day we don’t experience an outage builds our (lazy) confidence that everything is just fine. So we forget about the bad things that will eventually happen. 

And while regulation might be the law, it is not the answer. Governments these days think the only tools they can apply are restrictive laws. But there is scope for ongoing conversations and ‘back and forth’ that can help us move in the right direction.

The key for you (and your organization) is to force yourself to think about those ‘rare events’ that don’t make the agenda of your meetings and conferences. Just because they are rare doesn’t mean they aren’t crippling. And they are huge opportunities. For example, Southwest Airlines (for some reason) wasn’t affected by the Crowdstrike outage. They will have absolutely benefited from desperate passengers from other airlines who needed to get somewhere. And a fraction of those customers who were forced to use Southwest Airlines will remain Southwest Airlines customers. And Southwest Airlines now has another competitive advantage: it kept the cash that its competitors had to burn on rectifying their thousands of flight cancellations. This additional cash can go towards marketing, additional routes and so on.

It just so happens that those companies that are successful in the long term tend to be able to deal with (and perhaps capitalize on) those rare events.

So … I hope you enjoyed this article. If your computer allowed you to read it.

Filed Under: Articles, on Product Reliability, Reliability in Emerging Technology

About Christopher Jackson

Chris is a reliability engineering teacher ... which means that after working with many organizations to make lasting cultural changes, he is now focusing on developing online, avatar-based courses that will hopefully make the 'complex' art of reliability engineering into a simple, understandable activity that you feel confident of doing (and understanding what you are doing).

Article by Chris Jackson
in the Reliability in Emerging Technology series

© 2025 FMS Reliability · Privacy Policy · Terms of Service · Cookies Policy