1. Introduction
The safe and efficient operation of nuclear power plants (NPPs) depends primarily on the actions and decisions of operators in the main control room [1]. These operators are at the core of managing plant safety and ensuring smooth operation, making their role critical not only for the plant’s safety but also for its economy. NPPs are operated with comprehensive operating procedures that prepare operators for a range of anticipated operating conditions, including normal operation as well as abnormal and accident scenarios [2]. When abnormal events occur, these procedures help guide the operators in choosing the appropriate response to bring the plant back to a safe state. Therefore, the effective recognition of operating events is crucial, and it represents one of the most challenging tasks that operators face [3].
Despite the well-established operating procedures, correctly diagnosing and responding to faults in real time remains a significant challenge, especially when multiple faults occur simultaneously [4]. In such complex situations, the failure of the operator to identify the root cause of a fault is one of the leading contributors to nuclear accidents. This challenge is exacerbated in the modern era, where digital control systems have transformed plant operations, providing a vast amount of data. While these systems enhance the capabilities of NPPs, alarms remain the primary tool for operators to detect anomalies and make decisions. The sheer volume of alarm signals, particularly in the case of multiple concurrent faults, can lead to alarm floods that overwhelm operators and impair their ability to process information effectively. This was tragically illustrated by the Three Mile Island accident, where the combination of multiple failures and unclear alarm signals led to a catastrophic breakdown in the decision-making process [5].
Although the likelihood of multiple failures occurring simultaneously is low, operator performance under such conditions remains a significant concern. To mitigate the risk of misdiagnosis during critical situations, advanced NPPs have implemented symptom-based operating procedures (SOPs) [6]. These procedures allow operators to respond to safety-critical conditions based on key system parameters, rather than diagnosing the exact cause of the problem. While this approach can prevent unsafe actions during emergencies, it leads to the plant entering a state of emergency shutdown, resulting in economic losses.
Therefore, accurately diagnosing the operational faults in a power plant while avoiding unnecessary transitions into SOP conditions is of significant importance for ensuring its economic efficiency. Despite their limitations, alarm systems remain an indispensable tool for operators in fault detection and diagnosis.
This paper investigates the application of intelligent alarm technologies to support operators in comprehensive fault diagnosis by alerting them to the potential presence of multiple faults. In addition, the advanced alarm analysis techniques proposed here help operators to better understand the impacts of faults and to evaluate the value of various fault management measures.
The remainder of this paper is organized as follows. Section 2 provides a review of fault diagnosis methods proposed for nuclear power plants, presenting an overview of existing techniques and highlighting the challenges faced in the field. Section 3 outlines the objectives of the research and identifies the key problems that need to be addressed. In Section 4, the fault diagnosis method based on alarm analysis is introduced, accompanied by a case study to demonstrate its application. Section 5 and Section 6 present the discussion and conclusions, respectively, where key findings are summarized, limitations of the study are addressed, and future research directions are proposed.
2. A Review of Fault Diagnosis Methods in Nuclear Power Plants
NPPs are critical infrastructure that generates substantial energy while minimizing environmental pollution. However, the potential consequences of accidents, such as nuclear leaks, can be significantly more severe than those associated with traditional power-generation facilities. Therefore, the nuclear industry has a strong interest in effective fault detection and diagnosis (FDD) to ensure the safety and reliability of NPP operations. The following review categorizes fault diagnosis methods based on the techniques that they use and their application scenarios, focusing on both traditional and artificial intelligence (AI)-based approaches.
2.1. Traditional Fault Diagnosis Methods
The traditional FDD methods primarily rely on mathematical models of the system, which can be difficult to accurately obtain in real-world scenarios. These methods can be broadly classified into the following categories:
2.1.1. Model-Based Methods
Model-based methods utilize mathematical models to represent the behavior of a nuclear power system. These models can be used to detect deviations from the expected performance of the system. Common techniques include:
- (1) Observer-based methods [7,8]: These methods use observers to estimate the state of the system and compare it with the measured state. Any discrepancies indicate potential faults;
- (2) Kalman filtering [9,10]: This statistical approach estimates the state of a dynamic system from a series of incomplete and noisy measurements. It is widely used for real-time monitoring and fault detection.
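As a rough illustration of residual-based detection with a Kalman filter, the sketch below tracks a single, nominally constant sensor value and flags the first sample whose normalized innovation exceeds a threshold. The function names, the random-walk state model, and the noise and threshold values are assumptions chosen for illustration; they are not taken from the cited works.

```python
def kalman_residuals(measurements, q=1e-4, r=0.04, x0=0.0, p0=1.0):
    """Scalar Kalman filter for the assumed model x_k = x_{k-1} + w_k.

    Returns the normalized innovation (residual / sqrt(innovation variance))
    for every measurement. q, r, x0 and p0 are illustrative tuning values.
    """
    x, p = x0, p0
    normalized = []
    for z in measurements:
        # Predict: the state is held constant, uncertainty grows by q.
        p = p + q
        # Innovation (measurement minus prediction) and its variance.
        innovation = z - x
        s = p + r
        normalized.append(innovation / s ** 0.5)
        # Update the estimate with the Kalman gain.
        gain = p / s
        x = x + gain * innovation
        p = (1.0 - gain) * p
    return normalized


def detect_fault(measurements, threshold=3.0, **filter_kwargs):
    """Index of the first sample whose |normalized innovation| > threshold."""
    residuals = kalman_residuals(measurements, **filter_kwargs)
    for index, value in enumerate(residuals):
        if abs(value) > threshold:
            return index
    return None


# Synthetic data: 50 healthy samples around 1.0, then a step change to 3.0.
healthy = [1.0 + 0.1 * (-1) ** i for i in range(50)]
faulty = [3.0 + 0.1 * (-1) ** i for i in range(50)]
print(detect_fault(healthy + faulty, x0=1.0))
```

A large normalized innovation means the measurement is inconsistent with the model's prediction, which is the discrepancy that observer-based and Kalman-filter methods treat as a fault symptom.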
2.1.2. Knowledge-Based Systems
Knowledge-based systems leverage expert knowledge to diagnose faults. These systems often use rule-based approaches or decision trees to infer the presence of faults based on observed data. Examples of these include:
- (1) Expert systems [11,12]: These systems use a set of rules derived from expert knowledge to diagnose faults and provide operational guidance;
- (2) Decision trees [13,14]: Decision tree algorithms classify data points based on feature values, making them useful for diagnosing specific faults in NPPs.
2.1.3. Signal Processing Techniques
Signal processing methods analyze signals from various sensors within the NPP to identify anomalies. The signal processing techniques include:
- (1) Fourier transform [15]: This method transforms time-domain signals into frequency-domain representations, allowing for the identification of frequency components that may indicate faults;
- (2) Wavelet transform [16]: Wavelet analysis provides time-frequency representations of signals, making it effective for detecting transient faults.
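To make the frequency-domain idea concrete, the following sketch computes a plain discrete Fourier transform of a sampled signal and reports its dominant non-DC frequency. The function names and the synthetic test signal are hypothetical; a practical implementation would use an FFT library rather than this direct O(n²) sum.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..n/2 for a real-valued signal."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        magnitudes.append(abs(acc) / n)
    return magnitudes


def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the strongest bin, ignoring the DC component."""
    magnitudes = dft_magnitudes(samples)
    k = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return k * sample_rate_hz / len(samples)


# Synthetic vibration signal: offset + strong 10 Hz + weaker 25 Hz component.
fs = 128.0
signal = [
    1.0
    + math.sin(2 * math.pi * 10 * t / fs)
    + 0.3 * math.sin(2 * math.pi * 25 * t / fs)
    for t in range(128)
]
print(dominant_frequency(signal, fs))
```

A shift in the dominant frequency, or the appearance of a new spectral peak, is the kind of feature such methods monitor.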
2.2. AI-Based Fault Diagnosis Methods
With the advancement of AI technologies, new FDD methods have emerged that enhance detection accuracy and reliability. These methods can be categorized as follows:
2.2.1. Machine Learning Approaches
Machine learning techniques enable systems to learn from data and improve their diagnostic capabilities over time. Key methods include:
- (1) Support vector machines (SVMs) [17]: SVMs are used for classification tasks, making them suitable for fault detection by separating normal operating conditions from faulty ones;
- (2) Random forests [18]: This ensemble learning method combines multiple decision trees to improve diagnostic accuracy and robustness against overfitting.
2.2.2. Deep Learning Techniques
Deep learning, a subset of machine learning, employs neural networks with multiple layers to model complex relationships among data. Notable applications of this technique in NPPs include:
- (1) Convolutional neural networks (CNNs) [19]: CNNs are particularly effective for image and signal processing tasks, making them suitable for diagnosing faults from sensor data;
- (2) Recurrent neural networks (RNNs) [20]: RNNs, including long short-term memory (LSTM) networks, are designed for sequential data analysis, making them ideal for time-series fault diagnosis.
2.2.3. Hybrid Approaches
Hybrid methods combine traditional techniques with AI to enhance their diagnostic performance. Examples include:
- (1) Hybrid fault diagnosis with SVMs and particle swarm optimization [21]: This approach integrates SVMs for classification with particle swarm optimization to fine-tune model parameters, improving the fault detection accuracy;
- (2) Deep transfer learning [22]: This method leverages knowledge from one domain to improve the fault diagnosis in another, and is particularly useful in scenarios with limited labeled data.
2.3. Application Scenarios
The application of these fault diagnosis methods varies across different components and systems within NPPs. Key areas include:
2.3.1. Reactor Systems [23]
Fault diagnosis methods are crucial for monitoring reactor systems, including monitoring the system’s core temperature, pressure, and coolant flow. AI-based methods, such as deep learning, can analyze sensor data in real-time to detect anomalies that are indicative of potential faults.
2.3.2. Turbine and Generator Systems [24]
Turbine and generator systems are critical for energy conversion in NPPs. Traditional signal processing techniques, combined with machine learning algorithms, can monitor vibrations and operational parameters to identify mechanical faults.
2.3.3. Safety Systems [25]
Safety systems are designed to prevent accidents and mitigate their consequences. Knowledge-based systems and expert systems play a vital role in diagnosing faults in safety-related components, ensuring that emergency operating procedures are followed.
2.4. Challenges and Future Directions
Despite the advancements in FDD methods, several challenges remain. The high complexity of NPPs makes it difficult to develop accurate models. In addition, because NPPs are designed and operated for high reliability, relatively little data on faults and accidents has accumulated, which creates a bottleneck for the development of intelligent diagnosis methods using AI technologies. Furthermore, given the nuclear power industry’s strong emphasis on nuclear safety, the understanding and assessment of the potential risks associated with the introduction of AI technologies are still insufficient. Consequently, there is a cautious attitude towards the use of AI-assisted decision-making by operators. Although AI technologies have rapidly developed and been successfully applied in other industries, they have yet to see practical application in the main control rooms of nuclear power plants. One significant reason for this is the lack of verification and validation methods for AI-assisted operational decision-making systems that can undergo safety regulation testing.
Therefore, for the foreseeable future, rule-based expert systems for fault diagnosis represent a practical application option due to their clear logic and ease of verification. Other AI technologies may have more promising applications in field operations and maintenance than in the decision-making domain of main control room operators. This is due to the relatively lower demand for real-time decision-making in field and maintenance scenarios compared to those in the control room.
3. Objectives of Research
The primary objective of this research is to address the ongoing challenges faced by operators in the main control rooms of nuclear power plants, particularly in fault diagnosis and alarm management. This study specifically targets two critical scenarios:
- (1) Alarm flooding: This issue arises when strong interdependencies between systems and equipment within the nuclear power plant cause cascading failures. As a result, a large number of alarm signals are triggered simultaneously, overwhelming operators, who are typically capable of managing only 6–8 alarms at once. The goal of this research is to develop effective methods for filtering and prioritizing alarm signals, allowing operators to respond quickly and efficiently;
- (2) Multiple-fault diagnosis: In multi-alarm situations, it is crucial to identify the underlying root causes and their combinations that lead to abnormal conditions. This research aims to alert operators to the potential simultaneous occurrence of multiple faults. The current alarm cards in nuclear power plants are designed for single-fault scenarios, outlining root causes such as equipment malfunctions or instrument failures. However, in cases involving multiple faults, operators often have to rely on their experience to determine the appropriate alarm card, which may lead to overlooking critical faults or improper responses. This study seeks to develop an intelligent alarm analysis system that can help operators navigate complex alarm situations by uncovering the relationships between multiple alarms and their underlying causes.
The significance of this research lies in its development of fault diagnosis methods that leverage alarm signals to support operators’ situational awareness, system state recognition, and decision-making during nuclear power plant operations. By using functional models to analyze alarm signals, this study will conduct a causal analysis to identify the root causes of alarm events.
Given that nuclear power plants already use alarm cards that detail the fault causes behind alarm signals (e.g., equipment failures or instrument malfunctions), the integration of intelligent alarm analysis can expedite fault localization in scenarios involving numerous interconnected alarms triggered by equipment issues. This would significantly reduce the burden on operators and enhance their ability to respond to alarm floods.
Moreover, functional models offer several advantages, such as robustness, ease of understanding, and ease of verification. These characteristics would provide operators with a framework to validate and confirm analysis results, improving the acceptance and applicability of the methodology within the operational environment of NPPs.
The following sections of this paper will introduce alarm analysis and fault diagnosis support methods based on functional models, demonstrating how this approach can enhance operational efficiency and safety in nuclear power plants.
4. Fault Diagnosis Approach Based on Alarm Analysis
4.1. Method Selection
The fault diagnosis method used in this study is based on the multilevel flow model (MFM), which was proposed by Professor Morten Lind [26]. It was initially designed to provide a comprehensive human–machine function description for complex-process systems such as nuclear power plants. This innovative model allows for a structured representation of system functions, emphasizing their relationships with operational goals.
Over time, the MFM has evolved into a powerful causal reasoning method, enabling users to analyze and understand the intricate dynamics of complex systems. Its versatility has led to extensive research and application across multiple fields, including fault diagnosis [27], alarm analysis [28], reliability analysis [29], hazard and operability analysis (HAZOP) [30], and risk analysis [31].
In the context of fault diagnosis, the MFM facilitates the identification of potential failures by mapping out the functional relationships and dependencies within the system. This approach not only aids in fault identification but also enhances operators’ understanding of how different system functions interact under various operating conditions. By leveraging the MFM, plant personnel can develop more effective diagnostic strategies, ultimately improving the safety and reliability of complex nuclear power plant systems.
4.2. General Principles of MFM
As illustrated in Figure 1, the MFM consists of several key elements: objectives (or goals), functions, structures, and relations. The MFM is particularly well-suited for describing artifacts with specific design purposes, meaning that the modeling and analysis within MFMs are centered around these objectives.
In the design of complex systems such as nuclear power plants, the design goals are initially and explicitly defined; for instance, the primary objectives typically encompass safety and availability. Functional analysis and allocation are then conducted in alignment with these established goals. This process ultimately leads to the identification of the physical components necessary to achieve particular functions. Early MFM modeling reflected a hierarchical relationship among goals, functions, and physical components. However, contemporary MFM modeling primarily focuses on the relationships between goals and functions, with physical components often not being represented in the MFM.
The functions are organized into a functional structure based on the flow of materials or energy, or the transmission of control signals. Typically, a functional structure corresponds to a set of material, energy, or control structures, which is why it is also referred to as a flow structure. The purpose of this modeling approach is to ensure that the functional composition of each flow structure is relatively simple, allowing for functional analysis (causal analysis) based on the principles of conservation of mass or energy, or control theory.
A function can achieve a specific goal, and the flow network in which the function resides provides the contextual framework for this goal’s realization. Goals are categorized into primary goals and sub-goals, where sub-goals can serve as conditions for the realization of other functions, thereby coupling different flow networks together to achieve the primary goal.
4.3. Functions, Patterns, and Causal Relationships of MFM
Functions, patterns, and causal relationships are central elements in the reasoning processes of the multilevel flow model (MFM). As illustrated in Figure 1, the MFM encapsulates several fundamental functions that are pertinent to process industries, including Source, Transport, Storage, Balance, Barrier, and Sink. Each of these functions plays a critical role in the overall operation and management of industrial processes. Additionally, Conversion, Separation, and Distribution can be viewed as specific instances of the Balance function, which allows Balance to be described in finer detail within the modeling framework. This clarification enhances the model’s comprehensibility and facilitates better decision-making in complex systems.
In recent years, Professor Morten Lind has further expanded the MFM framework to encompass descriptions of control functions [32]. This expansion includes the introduction of various Means–End, Control, and Influence relationships, which allow the MFM to effectively model and describe system actions and interactions. These relationships illustrate how different functions can influence one another and how specific actions can lead to desired outcomes within a system.
Despite these advancements, the hierarchical structure and reasoning principles of the MFM remain fundamentally unchanged, which preserves continuity in how the model is applied and understood. Consequently, this paper focuses on the established elements of the MFM without delving into the newer expansions, emphasizing the core functions, patterns, and causal relationships that underpin effective reasoning using the model.
One requirement of MFM modeling is that all functions, except for Transport and Barrier functions, must be connected through Transport or Barrier functions. This means that the Transport and Barrier functions serve as interfaces that facilitate or obstruct the flow of materials, energy, or information. The advantage of this approach is that it enhances the standardization and consistency of the model, thereby reducing the likelihood of overlooking system functions. Consequently, each (non-Transport and non-Barrier) function, when connected to upstream or downstream Transport or Barrier functions, creates a specific pattern that corresponds to different causal relationships.
For example, when the Storage function is linked to a Transport function upstream, it indicates that the Transport function provides input to the Storage function. If the Transport function is in a low flow state, conservation principles suggest a tendency for the Storage function to also be in a low volume state.
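The conservation argument behind this pattern can be written as a simple mass balance. The symbols below ($V$ for the stored volume, $q_{\mathrm{in}}$ and $q_{\mathrm{out}}$ for the upstream and downstream flow rates) are introduced here for illustration only and are not the notation of the paper.

```latex
\frac{dV}{dt} = q_{\mathrm{in}}(t) - q_{\mathrm{out}}(t),
\qquad
q_{\mathrm{in}} < q_{\mathrm{out}} \;\Longrightarrow\; \frac{dV}{dt} < 0 .
```

If the outflow is unchanged, a persistently low inflow drives $V$ downward, which is the inference from a low state of the upstream Transport function to a low state of the Storage function.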
The reasoning rules of the MFM are detailed in Table 1, which lists some functional combinations and their corresponding causal relationships. These patterns are relevant to the case studies presented later in this paper. For a comprehensive description of the causal relationships within the MFM, please refer to the literature [26,32]. The MFM framework emphasizes the interdependencies between different functions, allowing for a structured analysis of system behavior and fault diagnosis based on these foundational principles.
4.4. Case Study of Fault Reasoning with MFM
4.4.1. Functional Analysis of a Dual-Tank System
This paper uses a dual-tank system as a case study to illustrate the principles of modeling and reasoning using an MFM. The design purpose of the dual-tank system is to provide a stable hot water supply with its temperature maintained at a preset value. As shown in Figure 2, the system consists of two interconnected tanks: Tank 1 and Tank 2. These tanks work in coordination to optimize the water heating and management, ensuring the stable operation of the system.
In this dual-tank configuration, Tank 1 functions as a reservoir for cold water sourced from an external supply, referred to as Source. Cold water flows from Tank 1 to Tank 2 via Valve 1. Within Tank 2, the incoming cold water is heated by an Electric Heater, allowing the system to supply hot water at the desired temperature to users.
The operation of the dual-tank system is regulated by two valves, Valve 1 and Valve 2, which control the outlet flow rates F1 and F2 from Tank 1 and Tank 2, respectively. This flow control mechanism is essential for keeping the water level in Tank 2 from falling below a critical level, L2, so that the Electric Heater remains submerged. This precaution prevents the heating element from being exposed and losing the cooling provided by the surrounding water.
The main goal of this dual-tank system is to consistently deliver hot water at a stable temperature to users. To simplify the MFM modeling and reasoning processes, the power regulation unit of the Electric Heater is excluded from this analysis. As a result, the developed MFM does not incorporate a control module for adjusting the heater’s power.
4.4.2. Modeling the Dual-Tank System with MFM
The MFM established for the dual-tank system is shown in Figure 3. The model focuses on analyzing three objectives (O1, O2, O3). Below is an explanation of how these system objectives are achieved.
- (1) Objective O1: maintain the temperature of the water supplied to the users.
This objective is realized through the energy flow (S1) of the dual-tank system. The flow describes the process of heating the cooling water (F1) in Tank 2 (F3) using the energy from the heater (F6). The heated water is then exported from Tank 2 (F4) and supplied to the users (F5), fulfilling the temperature maintenance requirement. F4 (heat energy exported from Tank 2) is the main function for achieving O1, as it delivers the heated water to the users. The producer–product relationship (as seen in Figure 1) connects F4 directly to O1, with O1 being achieved through F4. The main function required for this objective is highlighted in the green background box near the producer–product relationship, indicating that F4 is the key function in realizing O1.
- (2) Objective O2: maintain the water level of Tank 2.
This objective is focused on maintaining the water level in Tank 2 to ensure that the cooling water required for heating is available. In the MFM, O2 is realized through the mass flow (S2), which represents the movement and storage of the cooling water across the system. F12 (cooling water and hot water are mixed in Tank 2) is the main function for achieving O2, as it directly impacts the water level in Tank 2 and ensures that the cooling water is adequately stored and mixed with the hot water. The producer–product relationship between O2 and F3 (Tank 2’s Storage function) indicates that F3 is a necessary condition for the realization of O2. This means that O2 depends on the functionality of F3 (the storage and energy transfer function within Tank 2), which contributes to maintaining the water level. Other functions in S2, such as F9, F10, and F11, support the main function F12 by facilitating the flow and storage of cooling water, providing the necessary context for the water level maintenance.
- (3) Objective O3: drive cooling water.
O3 is concerned with ensuring that the cooling water flows correctly through the system. This objective is realized through the energy flow S3, which focuses on converting electrical energy into mechanical energy to drive the pump that circulates the cooling water. F17 (cooling water is injected into Tank 1) is a key function within S3, where electrical power (F15, F16) is converted into mechanical energy (F17) by the pump to drive the cooling water through the system. S3’s energy conversion process provides the necessary mechanical force to facilitate the movement of cooling water through the system, thus achieving O3.
This multi-layered approach using the MFM effectively maps the system’s objectives to its functional flows and interactions, reflecting the system’s means and ends to stably supply hot water, regulate its water levels, and effectively drive the cooling water circulation.
4.4.3. Fault Diagnosis Method Based on MFM Reasoning
- (1) Overall strategy
The overall strategy used herein for fault reasoning based on the MFM can be summarized as follows:
Hierarchical structure for causal relationships: The MFM represents the causal relationships between functions within a system, organized into multiple layers. In this structure, each subsequent layer serves to support the function of the previous layer, meaning that a failure in a lower layer causes a failure in the upper layer, rather than being the result of it. This hierarchical framework enables the MFM to effectively map functional failures across different causal levels, facilitating the rapid identification of the underlying cause of a failure. This fault diagnosis strategy is especially valuable when diagnosing single failures or in scenarios where intelligent diagnostic support is unavailable. It would provide significant assistance to operators by enabling them to quickly identify the root cause of a failure, reducing the diagnostic time requirement and improving accuracy;
When a fault is localized to a specific level based on the hierarchical structure of the MFM, the fault localization process is then carried out by analyzing the causal relationships between functions within the flow structure. Within a single flow structure, the fault cause is traced by examining the causal relationships between adjacent functions (as shown in Table 1). At this stage, the states of the MFM functions and their causal relationships form a causal dependency graph (CDG), enabling causal analysis to be performed using a directed acyclic graph approach [33].
- (2) CDG based on MFM
The CDG consists of two main components: nodes and directed edges. Nodes represent the states of system functions, while directed edges illustrate the causal relationships between these function states. The construction of the CDG involves the following key elements:
Nodes: Each node in the CDG corresponds to the state of a specific system function. Each node is characterized by a unique identifier and the state of the function it represents. The function’s state can be either high or low, corresponding to various physical parameters such as the flow, liquid level, or capacity. For example, a node (F1, high) indicates that function F1 is in a high state, and a node (F2, low) indicates that function F2 is in a low state. These nodes provide the foundational representation of the system’s operational status;
Directed Edges: Directed edges in the CDG represent the causal relationships between function states, indicating how the state of one function can influence the state of another. Each directed edge has a direction, pointing from the “cause” node to the “effect” node. For instance, an edge (F3, low) → (F7, high) suggests that a low state in function F3 could lead to a high state in function F7. These directed edges capture the interdependencies between system functions and help trace the propagation paths of potential faults.
By constructing the CDG, the relationships between various system functions and their mutual influences are clearly revealed. This enables fault diagnosticians to trace causal chains and pinpoint the root cause of a fault, facilitating effective remedial actions. The structured representation and logical nature of causal chains in the CDG make it a powerful tool for fault diagnosis.
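The node and edge definitions above map naturally onto a small graph data structure. The sketch below is one possible encoding, not the authors' implementation: the class and method names are hypothetical, nodes are stored as (function, state) tuples, and the second edge is added only to make the example slightly richer.

```python
from collections import defaultdict


class CausalDependencyGraph:
    """Nodes are (function, state) tuples; edges point from cause to effect."""

    def __init__(self):
        self.nodes = set()
        self._effects = defaultdict(set)  # cause node -> effect nodes
        self._causes = defaultdict(set)   # effect node -> cause nodes

    def add_edge(self, cause, effect):
        """Record that the state `cause` can lead to the state `effect`."""
        self.nodes.update((cause, effect))
        self._effects[cause].add(effect)
        self._causes[effect].add(cause)

    def effects_of(self, node):
        """States that `node` can directly lead to."""
        return set(self._effects.get(node, ()))

    def causes_of(self, node):
        """States that can directly lead to `node`."""
        return set(self._causes.get(node, ()))


cdg = CausalDependencyGraph()
cdg.add_edge(("F3", "low"), ("F7", "high"))  # edge used as the example in the text
cdg.add_edge(("F3", "low"), ("F4", "low"))   # illustrative additional edge

print(cdg.causes_of(("F7", "high")))
```

Keeping both the forward and the backward index makes it cheap to follow a fault downstream (consequence analysis) or upstream (root cause search).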
- (3) Using CDG for Fault Reasoning
When a fault occurs, the CDG helps trace it back to the root cause. For example, in the case of the dual-tank system MFM shown in Figure 3, a low water temperature in Tank 2 (represented by the low state in function F3) will lead to a decrease in the water temperature at the outlet of Tank 2 (the low state in function F4). Simultaneously, it will trigger an increase in the power of the Electric Heater (the high state in function F7) to compensate for the temperature drop in Tank 2. The CDG can trace this causal chain, identifying that the low water temperature in F3 is the primary cause behind the alarms in F4 (low water temperature) and F7 (high heater power). By following this causal relationship, the CDG aids in more accurately pinpointing the origin of the system fault, providing a solid basis for maintenance decisions and system optimization.
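The backward trace in this example can be sketched in a few lines. The code below is an illustrative simplification: it assumes the CDG is given as a dictionary from cause nodes to effect nodes, and it treats any node without a cause in the graph as a candidate root cause. The function name and this stopping rule are assumptions made for the sketch.

```python
def trace_root_causes(edges, alarms):
    """Map each candidate root cause to the set of alarms it can explain.

    edges:  dict mapping a cause node to an iterable of effect nodes
    alarms: iterable of alarm nodes, each a (function, state) tuple
    """
    # Invert the graph so that it can be walked from effect to cause.
    causes = {}
    for cause, effects in edges.items():
        for effect in effects:
            causes.setdefault(effect, set()).add(cause)

    explained = {}
    for alarm in alarms:
        stack, seen = [alarm], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            parents = causes.get(node, set())
            if not parents:
                # No upstream cause in the graph: candidate root cause.
                explained.setdefault(node, set()).add(alarm)
            stack.extend(parents)
    return explained


edges = {("F3", "low"): [("F4", "low"), ("F7", "high")]}
alarms = [("F4", "low"), ("F7", "high")]
print(trace_root_causes(edges, alarms))
```

Both alarms trace back to the single node (F3, low), matching the causal chain described above.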
As shown in Figure 4, the CDG represents a complete set of functional causal relationships in the system, known as the complete CDG. This model provides a full mapping of how different functions and their states are interconnected. However, in real-time fault detection, the system’s actual operating conditions may differ from what is predicted by the complete CDG. After measuring the system parameters, some functional states may not align with the observed data, indicating inconsistencies between the model and real-world observations. These inconsistent states and their associated causal relationships should be removed to ensure that the CDG remains accurate and reflects the current system state. The resulting CDG, after removing these inconsistencies, is referred to as the inferred CDG, an example of which is shown in Figure 5. The inferred CDG represents the corrected set of functional states and causal relationships that are consistent with the current system status. This refined model allows for more accurate fault diagnosis and provides a clearer basis for troubleshooting and system optimization.
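The step from the complete CDG to the inferred CDG amounts to pruning every node that contradicts an observation, together with its edges. The following sketch shows one way to do this under stated assumptions: the graph is a dictionary of cause-to-effect lists, observations are a dictionary from function name to observed state, and the mirrored (F3, high) edges in the example are hypothetical.

```python
def infer_cdg(complete_edges, observed):
    """Drop nodes (and their edges) that contradict the observed states.

    complete_edges: dict mapping a cause node to a list of effect nodes
    observed:       dict mapping a function name to its observed state
    Unobserved functions are kept in every state.
    """
    def consistent(node):
        function, state = node
        return observed.get(function, state) == state

    inferred = {}
    for cause, effects in complete_edges.items():
        if not consistent(cause):
            continue
        kept = [effect for effect in effects if consistent(effect)]
        if kept:
            inferred[cause] = kept
    return inferred


complete = {
    ("F3", "low"): [("F4", "low"), ("F7", "high")],
    ("F3", "high"): [("F4", "high"), ("F7", "low")],  # hypothetical mirror edges
}
observed = {"F4": "low", "F7": "high"}
print(infer_cdg(complete, observed))
```

With F4 observed low and F7 observed high, only the edges leaving (F3, low) survive, even though F3 itself is not measured.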
The existing fault diagnosis method based on MFM utilizes alarm analysis techniques to perform causal reasoning among a large set of alarm events, aiming to identify the root cause alarm that leads to other related alarms. During the reasoning process, for functions that are not directly observed or measured, unknown states are inferred through upstream and downstream constraints, linking the causal chains between the known states of different functions. This strategy primarily focuses on the causal analysis of alarm events and assumes that unmeasured or unobserved functions are not the cause of the fault.
In the authors’ previous research [33], a root cause identification method was proposed that assumes each observed or measured function can be in either a high or low abnormal state. These assumed abnormal states are then combined with alarm events to generate various interpretations of the current system’s abnormal conditions. This approach helps to reveal potential fault causes, including those for cases of multiple faults occurring simultaneously. By leveraging the causal relationships between the MFM and CDG, this method enhances the accuracy of root cause identification, providing a comprehensive understanding of system failures and enabling more effective troubleshooting.
A key challenge identified in previous research, particularly in practical engineering applications, is that operators often need to prioritize potential root causes to focus their limited attention on the most critical issues. To address this challenge, this paper builds upon prior work and introduces an algorithm based on the analysis of the longest causal path. The rationale behind this approach is that the longest causal path represents the most significant sequence of events leading to a fault, as it involves the most extended and potentially disruptive chain of dependencies within the system. By analyzing these longest causal paths, the algorithm helps operators focus on the root causes that are most likely to have the greatest impact on the system performance, thereby optimizing fault diagnosis and decision-making processes.
Appendix A provides the pseudocode for the fault reasoning algorithm based on the inferred CDG. The find_longest_causal_paths algorithm is a recursive approach designed to identify the longest causal paths in a CDG, starting from a specified root node (which represents a potential root cause of a fault) while avoiding conflict nodes. The algorithm uses a depth-first search (DFS) method to explore all possible causal paths in the graph. The key elements of the algorithm are summarized as follows:
Graph representation: The graph is represented as a dictionary where each node in the system has a list of its causal dependencies. These dependencies define how nodes influence one another within the system. A node’s neighbors (or causal dependencies) are the nodes it causally affects or is affected by;
Conflict Nodes: Conflict nodes refer to two or more functional states that cannot exist simultaneously. A typical example of conflict nodes is different states of the same function. For instance, when the state of an observable function is observed (e.g., a high state), its other states (e.g., a low state) become conflict nodes and cannot occur at the same time. As a result, these conflict nodes and their associated causal relationships need to be removed from the CDG. For functions with unobservable states, their functional states may be uncertain, but conflict nodes must still be avoided in the same causal path. This requires conflict node detection during the causal path search to ensure consistency and feasibility along the path;
Depth-first search (DFS): The algorithm employs a recursive DFS approach to explore all causal paths originating from the root node. This means the algorithm starts at the root node, traverses its neighbors, and continues this process until it reaches a dead-end (i.e., no further causal dependencies are found or it encounters a conflict node). During each DFS call, the algorithm maintains a record of the current path it is following. This path is updated by appending the current node to it as the DFS explores new nodes;
Path length tracking: As the DFS explores different paths, the algorithm compares the length of each path it encounters to those it has already encountered. If a path’s length exceeds the current longest path (max_length), the algorithm updates max_length and resets the list of longest paths to include only this new longest path. If a path’s length is equal to the current longest length (max_length), the algorithm adds this path to the list of longest paths, ensuring that all causal paths of equal length are captured. This process ensures that once the DFS is completed, the algorithm will have identified all longest causal paths from the root node;
Termination condition: When the algorithm encounters a conflict node (from the conflict_nodes set), it terminates the search for that particular path. This avoids the incorporation of paths that might lead to uncertain or invalid causal relationships;
Recursive exploration: The DFS recursively explores all neighboring nodes (i.e., causal dependencies) of the current node. Each recursive call traverses deeper into the graph, following each potential causal chain from the current node to its connected nodes;
Return value: After the DFS is completed, the algorithm returns the longest_paths list, which contains all the longest causal paths found from the root node, ensuring that the most significant sequences that could lead to a fault are identified.
The algorithm begins with initialization, where longest_paths is created to store all the longest causal paths found during the search, and max_length is set to 0 to track the length of the longest path discovered at that point. The core of the algorithm is the recursive DFS function, which takes two parameters: current_node, representing the node currently being explored, and path, which accumulates nodes as the DFS proceeds. At the start of each DFS call, the algorithm checks if the current_node is a conflict node; if it is, the search for that path is terminated immediately. If the length of the current path exceeds the current max_length, the algorithm updates max_length and resets longest_paths to contain only this new path. If the current path’s length matches the max_length, the path is added to longest_paths. The DFS function then recursively explores each neighboring node by calling dfs(neighbor, path + [neighbor]), continuing the search through all causal dependencies. The search starts by invoking dfs(root, [root]), with the root node as the starting point and an initial path containing only the root node. Once the DFS is completed, the algorithm returns the list longest_paths, which contains all the longest causal paths from the root node, excluding any conflict nodes.
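The description above can be sketched in Python as follows. This is an illustrative reading of the text, not the pseudocode of Appendix A: the graph is assumed to map each node to the nodes it causally affects, the example graph is hypothetical, and the check that skips nodes already on the current path is an addition (not stated in the text) so that the sketch terminates if the graph contains a cycle.

```python
def find_longest_causal_paths(graph, root, conflict_nodes):
    """Return every longest causal path starting at root.

    graph:          dict mapping a node to the list of nodes it affects.
    root:           candidate root-cause node where the search starts.
    conflict_nodes: set of nodes that must not appear on any path.
    """
    longest_paths = []
    max_length = 0

    def dfs(current_node, path):
        nonlocal longest_paths, max_length
        # Stop this branch as soon as a conflict node is reached.
        if current_node in conflict_nodes:
            return
        if len(path) > max_length:
            # Strictly longer path: it replaces all earlier candidates.
            max_length = len(path)
            longest_paths = [path]
        elif len(path) == max_length:
            # Tie: keep this path alongside the others.
            longest_paths.append(path)
        for neighbor in graph.get(current_node, []):
            # Assumption (not in the text): skip nodes already on the path
            # so the recursion ends even if the graph has a cycle.
            if neighbor not in path:
                dfs(neighbor, path + [neighbor])

    dfs(root, [root])
    return longest_paths


# Hypothetical graph: two branches of equal length, one ending at a conflict.
example_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": ["X"],
    "X": [],
}
paths = find_longest_causal_paths(example_graph, "A", conflict_nodes={"X"})
# paths == [["A", "B", "D"], ["A", "C", "E"]]
```

Because a new list is built with `path + [neighbor]` on each call, sibling branches never share or mutate the same path object, which is why the stored paths remain valid after the search moves on.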
4.4.4. Case Study
(1) Scenario Setting
In the dual-tank system fault scenario, Tank 1’s water level is high, while Tank 2’s is low, and the temperature in Tank 2 is high, with the pump operating normally. For simplicity, we assume that the cooling water supply (F8) and wastewater discharge (F14) are functioning normally, and that the other system states are unknown.
The observed behavior suggests potential faults. The water level discrepancy between the tanks and the high temperature in Tank 2 could indicate an issue. Since the pump is working, the fault is likely not in the pump itself. The normal pump operation helps narrow the diagnosis down to other components.
The water level difference between the tanks may be caused by issues like valve failure, improper pressure regulation, or water distribution problems. The high temperature and low water level in Tank 2 could signal faults in the heating or flow control systems, such as a malfunctioning heating element or a blocked water inlet. These factors, along with the pump’s normal operation, point to faults in the water level or temperature control components, rather than in the pump itself. This analysis illustrates how MFM-based fault diagnosis can isolate the root cause by examining the system behavior and relationships between components.
Figure 6 highlights the abnormal functional states in this fault scenario. The upward arrow next to the function symbol indicates that the function is in the high state, while the downward arrow indicates that the function is in the low state.
Table 2 provides the element description of the MFM model for the dual-tank system shown in
Figure 6. It can be observed that heat exchange in Tank 2 (corresponding to function F3) has a condition supported by goal O2, which is implemented by function F12. Since F12 (low) corresponds to an abnormal water level in Tank 2, goal O2 is also in an abnormal state. As the condition for F3 is not satisfied, the analysis of the high temperature anomaly in F3 should follow the MFM from top to bottom, with the path S1 → F3 (low) → O2 (failed) → F12 (low), and be located within the flow structure S2.
In this fault analysis, the MFM helps to systematically trace the root cause of the issue. The fault in F3 (high temperature) is directly related to an unmet condition, which stems from the abnormal state of goal O2. The abnormality in goal O2, in turn, originates from the malfunction in function F12, which is responsible for maintaining the water level in Tank 2. By following the MFM hierarchical structure, the fault can be progressively isolated from the top level (S1) to the functional levels (F3 and F12), ensuring accurate fault identification and effective isolation. This method ensures that the root cause of the abnormal temperature in Tank 2 is accurately traced back to the abnormal water level, thus enabling a more efficient and targeted fault diagnosis process.
(2) Reasoning Results
The complete and inferred CDGs of flow structure S2 are shown in
Figure 7 and
Figure 8, respectively. In the MFM, each function’s high, low, and default normal states are treated as conflicting nodes. Once the actual state of a function is observed, conflicting nodes and their associated causal relationships are removed from the inferred CDG. The maximum causal paths for each node considered as the root cause, based on the given functional states, are provided in
Table 3.
Table 3 also includes the proportion of observed abnormal states within the maximum causal path relative to the total number of observed abnormal states.
A higher proportion of abnormal states within the causal path indicates greater value in addressing that specific alarm state to restore the system to its normal condition. As indicated in
Table 3, the proportion for the low state of F11 is 100%, suggesting that, in the case of a single fault, the causal path originating from this state provides a complete explanation for the system anomaly. Therefore, the low state of F11 (caused, for example, by a blocked water inlet of Tank 2 or an insufficient opening of Valve 1) can be considered the root cause of the observed fault under a single-fault scenario.
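The proportion discussed above is the share of observed abnormal states that lie on a candidate root cause’s maximum causal path. A minimal sketch follows; the function name `abnormal_coverage` and the node labels are hypothetical and do not reproduce the entries of Table 3.

```python
def abnormal_coverage(path, observed_abnormal):
    """Fraction of observed abnormal states that appear on a causal path."""
    observed_abnormal = set(observed_abnormal)
    if not observed_abnormal:
        return 0.0  # nothing abnormal was observed, so nothing to explain
    explained = observed_abnormal & set(path)
    return len(explained) / len(observed_abnormal)


# Hypothetical labels: three observed abnormal states and two candidate paths.
observed_abnormal = {"P_high", "Q_low", "R_high"}
full_path = ["root_low", "Q_low", "P_high", "R_high"]
partial_path = ["other_high", "Q_low", "R_high"]

full_ratio = abnormal_coverage(full_path, observed_abnormal)        # 1.0
partial_ratio = abnormal_coverage(partial_path, observed_abnormal)  # 2 of 3
```

A ratio of 1.0 corresponds to the 100% case in the text, where a single root cause accounts for every observed abnormal state. A lower ratio indicates that the candidate must be combined with at least one other cause.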
However, when diagnosing faults, it is crucial to consider other potential causes in order to fully explain the system anomaly.
Table 3 provides various interpretations of case scenarios formed by different combinations of functional states acting as root causes. These interpretations are illustrated in
Figure 9 [
33]. While the low state of F11 may still play a significant role in the system behavior, other faults must be considered alongside it to form a comprehensive explanation of the root cause. Notably, although F13 (high) must be combined with other faults to form a complete scenario explanation, it appears in four of the eight fault scenarios (Cases 1, 3, 5, and 7), i.e., 50% of them. Therefore, this fault mode is highly significant and should not be overlooked.
Restoring F11 from its low state to normal (e.g., by clearing blockages in the pipeline or increasing the valve opening) can help raise the low water level in Tank 2. However, if F13 (high) occurs independently (e.g., due to excessive user consumption) and is not addressed, the low water level (F12) may not be fully resolved, which risks exposing or damaging the electric heater in Tank 2. Thus, a more effective fault management strategy would be to reduce the flow of F13 (i.e., reduce user consumption by reducing the opening of Valve 2) while simultaneously increasing the flow of F11. Additionally, when necessary, lowering the electric heater’s power would help prevent overheating and damage.
5. Discussion
The safety and efficient operation of nuclear power plants (NPPs) heavily depend on the decisions made by operators in the main control room, especially in diagnosing and responding to faults. Traditional fault management systems primarily rely on pre-configured alarm cards that display the direct causes of potential faults. While these traditional alarm management techniques are useful for maintaining system safety, they often fail to reveal the root causes, particularly in cases involving multiple interconnected faults. Moreover, traditional alarm management systems tend to be overly conservative, which can lead to unnecessary unplanned shutdowns of the reactor. Additionally, these methods impose a significant cognitive burden on operators, requiring them to process large volumes of alarm information rapidly and accurately in order to identify the most critical faults. In multi-fault scenarios, this becomes especially challenging and can result in misjudgments or delays in corrective actions, ultimately affecting both safety and operational efficiency.
In contrast, the intelligent fault diagnosis method proposed in this study utilizes the multilevel flow models (MFM) framework. This approach organizes alarms hierarchically and links them to system goals and functions, providing a clearer and more structured representation of causal relationships. This is a significant departure from traditional alarm management techniques. The proposed method uses the longest causal path for fault diagnosis, enabling operators to quickly and efficiently identify single faults. Furthermore, it includes a root cause combination method that helps identify interactions between multiple faults, allowing operators to uncover deeper connections between potential failures. In complex scenarios with multiple alarms, MFM-based multi-level fault diagnosis and causal analysis offer significant advantages over traditional systems. This approach would reduce the cognitive load on operators and support more accurate decision-making, improving response times and the quality of corrective actions.
One key distinction between the fault diagnosis method proposed in this paper and existing MFM-based methods is that the latter mainly focus on the causal relationships between alarm events. For unmeasurable system states, existing MFM approaches rely on reasoning based on known functional states, assuming that unmeasurable functions are only results and not causes. This assumption is reasonable insofar as important system parameters are normally measured. However, in high-risk, complex operational environments like nuclear power plants, many equipment states cannot be directly measured due to constraints such as high temperatures, high pressures, and radioactive environments. This unmeasurability introduces uncertainty, which becomes a key challenge for operators when making decisions. Furthermore, the introduction of digital control systems has significantly increased the number of alarm events, with alarms being distributed across multiple interfaces. Operators are required to check each interface to obtain a complete view of the system state, which contributes to what is known as the “key-hole effect”. This increases the difficulty for operators, as the alarm information they use for decision-making is often incomplete. The method proposed in this paper addresses these limitations in existing alarm management techniques.
The proposed method was validated in a real-world setting at a nuclear power plant in China, demonstrating its effectiveness and feasibility. The results showed that the MFM-based approach enables the faster and more accurate identification of fault sources in the presence of multiple alarms. This capability allows operators to implement more targeted corrective actions, reducing the system’s downtime and enhancing its overall operational efficiency. Additionally, the method integrates well with traditional alarm prioritization systems, balancing safety with system availability. This balance is crucial for ensuring the economic operation of nuclear power plants.
Despite the many advantages of the MFM-based approach, it also has some limitations. The primary challenge lies in the need for accurate system modeling and integrating the MFM framework with existing plant systems. The complexity of the model may require significant computational resources and expertise, which could limit its applicability in some situations. Furthermore, although the method improves diagnostic efficiency, it does not entirely eliminate the possibility of human error, particularly in dynamic fault scenarios where time pressure is a critical factor.
6. Conclusions
This study introduces an intelligent fault diagnosis method based on an MFM, addressing the limitations of traditional alarm card systems in identifying root causes in multi-fault scenarios. By associating alarms with system goals and functions, the MFM provides a hierarchical structure that clearly illustrates the causal relationships between faults, offering operators a clear diagnostic path. This method not only improves fault diagnosis accuracy but also provides efficient decision support for operators during emergency situations.
Looking ahead, ongoing research efforts are focused on developing quantitative evaluation metrics and methods for identifying the root causes of faults within the MFM framework. These metrics aim to provide objective assessments of alarm performance, as well as prioritize faults and their root causes. Such evaluation methods will systematize fault diagnosis, enabling operators to not only identify the most critical faults but also to assess the reliability of alarms and their connections to root causes. A key aspect of future work is the development of algorithms that combine alarm severity, urgency, and root cause quantification metrics to rank and categorize alarms. This will ensure that operators can prioritize the most urgent issues while also identifying effective means to address the system’s availability.
Moreover, with the continued advancement of artificial intelligence (AI) technologies, the intelligent fault diagnosis method proposed in this study has the potential for further optimization and expansion to other types of nuclear power plants and industrial systems. This will help improve overall operational efficiency and enhance safety management practices. By integrating real-time data from various sensors and utilizing deep learning techniques, the method could evolve into a more autonomous and intelligent system. These advancements will significantly reduce the need for human intervention, enabling proactive fault detection and resolution, while further supporting the trend of smart operation and maintenance in nuclear power plants.