Understanding root causes is key to understanding failure.
A Perspective on the February 2021 Texas Winter Storm
Jul 9, 2021
by Brian Y. Webster, CRE, and Frank Michell
Some of the consequences of the February 2021 Texas power failure are well known: interrupted power to millions of people, with cascading consequences of water system failures, loss of heat, loss of essential services like hospitals, and loss of life. However, not every potential consequence manifested during the event. The situation could have been much worse for much longer, including total collapse of the electric grid in Texas. If we do not invest in understanding the multitude of root causes and preconditions that led to the failures, and the complexity of the interconnected systems, the next event will be worse. Failing to resolve the root causes will leave Texas vulnerable to similar events in the future.
Determining all root causes for a catastrophic event is necessary to identify the measures needed to prevent a recurrence. Resources need to be allocated to understand the causes of the failures, not just to clean up after the fact. Reliability requires financial investment, but that investment costs less than unreliability. Unreliable systems continue to increase capital and operating expenses as well as recovery costs, which end up being borne by customers and investors. Public safety must come first. Clear, concise, and informative communication needs to be part of initiating mitigation measures for a successful outcome.
In the book “Normal Accidents: Living with High-Risk Technologies,” Charles Perrow introduced a framework holding that accidents are inevitable in extremely complex systems, depending on the levels of complexity and coupling of the system. Throughout the February 2021 winter storm, cascading failures were observed, indicating that the critical infrastructures for fuel supply, electric power generation, and potable water supply have become more coupled and intertwined over time, resulting in some of the most complex coupled systems ever created by humankind. As the complexity of the individual systems and the interconnections between them have increased, we have reached the point where preventing simple equipment malfunctions and failures is inadequate to avoid significant systemic risk.
As mechanical engineers and associated professionals, we offer you and our communities a perspective in this article to promote learning about, and understanding of, the root causes associated with this event. We urge an extensive effort to understand the multitude of root causes and preconditions, and how the failures cascaded through this complex interconnected system and others. Some of the root causes and preconditions may be environmental conditions that exceeded the current design basis for the power plants, electric grid, fuel supply, and water infrastructure. Some of the root causes stem from engineering design and operations decisions. Other causes involve human and organizational factors. Many of the root causes may lead to common-cause failures in different complex systems. Some of the root causes may lie within the market structures, economic incentives, and optimizations. Some of the root causes may be the complexity of the system itself, where tight coupling carries risk from one complex system to another while operators, designers, and regulators find the events unexpected, incomprehensible, uncontrollable, and unavoidable.
It is likely that the outcome experienced in Texas in February 2021 resulted from how the existing systems were designed to operate. We should expect the system to continue to function as designed until we understand the root causes and follow through with prudent and meaningful changes. The systems functioned as designed and built, which might differ from what was expected because of unexpected system behavior, environmental changes, or changes in the original design assumptions.
Determine Root Causes
Low-Probability High-Consequence (LPHC) events are uncommon, but when they happen, they cause extreme impact. Complex, interconnected systems are more prone to experience LPHC events, in part, because single-point failures have been designed out. Evaluating the possibility of experiencing an LPHC event for systems that have the potential for significant damage and/or loss of life is critical. The objective is to identify and implement measures to mitigate the possibility of the event happening. Determining and prioritizing the root causes and contributing factors behind a catastrophic event are necessary to identify the measures that need to be implemented to prevent a recurrence.
Resources need to be allocated to improve reliability. The financial investment to improve reliability is low relative to the potential financial losses and loss of life when there is a failure. The loss of life, the high cost of system failures, and the loss of heat and other essential services will be shouldered by customers, by investors, and, most of the time, by the broader population. Investing in reliability makes good business sense.
During the aftermath of a catastrophic (or major) infrastructure reliability failure, we will see emotions clouding our understanding of the event. We will observe deflections and assignment of blame. We will observe a significant number of “solution-oriented” conclusions being touted that may suggest a band-aid fix for a few root causes instead of addressing the real root causes or preconditions. We will see insistence that this event was “unprecedented.” We will observe “defensive reasoning” that focuses on things that “did not happen.” While defensive reasoning is a common human response, our thoughtful caution is to remember that anything that did not happen or did not exist in the system cannot be causal to what did happen. A true root cause will be based on items and actions that did occur. A false or incomplete root cause will be populated by “did not” and negative reasoning.
If we develop solutions or make changes without understanding what really happened, then we fail to learn from the past, and we will only increase our risk, add complexity, and create new ways to fail. When we first seek to understand what did occur, we can begin to implicitly address the root causes and preconditions.
We must understand why things fail before rushing to solve the easiest part of the problem. We strongly urge a focus on understanding and learning the root causes by utilizing best practices from reliability engineering, risk assessment, and “Causal Learning” prior to looking for solutions.
The Path Ahead
The path forward is for engineering communities like ASME to engage challenges from disruptive forces so that outcomes of events can be:
- Expert driven.
- Analytically evaluated.
- Able to address causality / risk / economics now and over time.
- Able to be applied across other large complex system problems.
This will not be a one-time fix but needs to be a continuous improvement process to address the ever-changing complexity and coupling in our critical infrastructure.
We believe the future success of the Texas electric energy supply is dependent on understanding the root causes and preconditions of this event. We further believe that this understanding must inform actions and decisions to prevent a tragic recurrence or a worse disaster.
The creation of some of the most intricate and highly coupled systems of complex systems ever created by humankind is a triumph of engineering. Thoroughly investigating the root causes of the failures and fully understanding the interdependencies of these complex, highly coupled systems is essential. The decisions on how we uncover the facts and understand the root causes, along with the competency of the team investigating this disaster, will be critical to Texas’s successful future.
All opinions expressed in the article are those of the authors and do not necessarily reflect the views of ASME.