2. Anomaly Detection
The discovery of anomalies, i.e., abnormal or unexpected behavior, provides the first warning to operators that something unusual is taking place. If successful, anomaly detection is an effective tool to warn operators about the incipient stages of anomalous behavior by extending the early-detection window, thus providing ample time to act before the anomalous condition worsens to the point of causing equipment damage. To achieve the maximum benefits of anomaly detection, monitoring for abnormal behavior requires two steps: the detection and classification of anomalies. The detection of an anomaly is recognizing that a behavior has shifted from normal to abnormal, whereas the classification of an anomaly is identifying its source.
The value of automated anomaly detection using data from plant sensors has been recognized in many fields, including engineering systems [1] such as fossil [2], oil and gas [3], wind [4], and nuclear [5,6] power, as well as aerospace [7], medical [8,9], finance [10,11], military [12], and cybersecurity intrusion detection [13,14] applications. In the context of nuclear power, automated anomaly detection enhances equipment failure prediction capabilities and reduces the operators' burden, especially because operators at a nuclear power plant (NPP) are responsible for several tasks, which can make Operations the most burdened organization in the plant.
Often the question of how to define anomalies arises. Qualitatively, anomalies, as they are referred to in the machine learning (ML) community, imply that an unexpected or rare event has happened, and the concern is that it may lead to an unpredictable system trajectory with dire safety- or economic-related consequences. This is different from an outlier, a term often used to describe bad data rather than bad behavior. Transitioning from a qualitative definition of an anomaly into a method requires a well-defined mathematical approach for both the detection and classification of anomalies. The most straightforward strategy for anomaly detection is to compare process data with some baseline behavior representing the expected normal response [15,16,17]. The deviation between measured process data and the expected response is used as a basis for flagging anomalous behavior. Considering the large volume of process data, these data are processed via ML techniques, which can be trained to classify the deviations as anomalous or as normal disturbances. This turns the problem into a mathematical exercise with an unavoidable degree of subjectivity, rendering it vulnerable to false positives (i.e., normal behavior being incorrectly declared anomalous) and false negatives (true anomalies going undetected). Several subjective decisions are usually made in this process. For example, it is necessary to determine whether an anomaly is a one-time event with no consequence (an outlier in the statistical community) requiring no further action, or a pattern (i.e., a regularity structure) warranting further analysis. Additionally, the choice of the standard deviation or variance threshold and the size of each data window, employed by the majority of anomaly detection techniques, is an important decision in itself; different values will lead to different classification results. This subjectivity of data-driven anomaly detection methods has been the main driver for better informing the detection process by integrating physics-based knowledge of the monitored system. To understand what type of knowledge exists in the nuclear industry, it is necessary to list the types of tools it currently uses for anomaly detection:
Setpoints and alarms associated with the plant computer system and other instrumentation systems, which set normal operating bands for various data points and alert operators if a parameter falls outside of the designated bands.
Equipment online monitoring solutions that typically include sensors that provide continuous or periodic equipment data not previously available, such as vibration data for rotating equipment.
Predictive maintenance, such as methods used in lubricant sampling and analysis, thermography measurements and trending, and vibration measurement and analysis.
Thermal performance models that form a holistic physics model of the thermodynamic cycle of a power plant. Output from thermal performance models can be compared with actual plant data to determine potential issues with individual pieces of plant equipment important to the power cycle.
Advanced pattern recognition, which uses models designed to establish correlations among multiple custom-fed data points to predict future values of a given parameter; such models are often referred to as data-driven models because they do not incorporate any physics into their formulation.
Data validation and reconciliation that uses physics and data-based models to analyze entire power plant systems. The physics-based modeling takes the form of flow and energy balances.
Digital instrumentation and control systems that add instrumentation and automated decision-making (e.g., [
18]) to support the online monitoring of the plant.
Personnel monitoring is, at the present time in the nuclear industry, the core of the monitoring process. Regardless of which online monitoring tools are used, at some point, an actual subject-matter expert must recognize the meaning of plant data and any input from anomaly detection tools.
A remote monitoring center, which is a centralized repository where plant data from multiple plants are collected and various tools (from the list above) are used to analyze data and provide anomaly detection reports to plant personnel.
At their core, many of those tools deploy various forms of physics-based knowledge into the anomaly detection process. For example, alarm setpoints are assigned by subject-matter experts who have developed process-physics knowledge over decades of operations. This is the simplest form of data and physics integration in a hybrid model.
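To make the windowing and thresholding choices discussed above concrete, the following is a minimal sketch of a windowed-deviation detector, assuming measured data and a baseline (expected) response are already available as arrays; the window length and threshold multiplier are exactly the subjective parameters noted earlier, and the values here are illustrative only:

```python
import numpy as np

def flag_anomalies(measured, expected, window=24, k=3.0):
    """Flag points whose deviation from the expected (baseline) response
    exceeds k standard deviations of a trailing window of residuals.

    Both `window` (in samples) and `k` are the subjective choices discussed
    above; different values lead to different classification results.
    """
    residuals = np.asarray(measured, dtype=float) - np.asarray(expected, dtype=float)
    flags = np.zeros(residuals.shape, dtype=bool)
    for i in range(window, len(residuals)):
        ref = residuals[i - window:i]        # trailing window of residuals
        sigma = ref.std()
        if sigma > 0 and abs(residuals[i] - ref.mean()) > k * sigma:
            flags[i] = True                  # deviation outside the normal band
    return flags
```

Shrinking the window or lowering the threshold trades false negatives for false positives, which is precisely the subjectivity that hybridization aims to reduce.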
It is instructive to note that empirical models can be more accurate than physics-based models in data-rich domains; that is, when the available data are abundant, data-driven models can achieve better state awareness than physics models. Physics models can yield less accurate predictions than data-driven models for several reasons, including:
At a very fundamental level, physics models involve a subjective view of how patterns are established among the process variables.
Physics models rely on several parameters (material properties, geometry, species concentrations, etc.) that are generally uncertain.
Physics models may miss unanticipated and external phenomena, causing their predictions to be inconsistent with real data when such phenomena arise.
The following section introduces many of the common methods underlying purely empirical and hybrid models.
4. Decision-State Diagram for Empirical and Hybrid Methods
With the tool kit of techniques discussed above, the best anomaly detection approach can be selected for a specific monitoring scope on a given process or equipment item. A criterion is needed to determine the best course of action for anomaly detection. This discussion addresses a gap in the recent anomaly detection literature, which has primarily focused on recipe-based demonstrations of the value of various hybridization approaches. Nevertheless, much less focus has been placed on developing criteria to guide the hybridization process and thereby improve the performance of anomaly detection systems, i.e., shifting it from a subjective process controlled by experience to a systematic process that provides metrics on the value of a given physics-based or data-driven model.
To make an informed decision, i.e., to move away from an ad hoc trial-and-error approach, a series of key tests needs to be performed. Most of these tests are trivial and can be answered by simply studying the scope, but some require analysis because the answer depends on the accuracy of the results or the performance of the methods.
Figure 2 presents the decision-making process in a user-friendly manner as a decision-state diagram. This diagram can be easily coded into a tool with a set of YES/NO questions to reach a conclusion on which method from the previous section to use (a minimal sketch of such a tool follows below). While Figure 2 shows a systematic and deterministic process taking the user through the steps required to arrive at the best anomaly detection approach, in reality, multiple approaches may be suitable, and the decision-state diagram can only be effectively used as a guide. Each situation is unique and requires consideration and planning (utilizing the strategies in this paper) to yield successful outcomes in online monitoring efforts.
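Since Figure 2 itself is not reproduced in this text, the sketch below only illustrates the mechanics of encoding such a diagram as a chain of YES/NO prompts; the questions and their ordering are an abbreviated, hypothetical subset of the actual decision points, not the full diagram:

```python
def ask(question: str) -> bool:
    """Prompt the user for a YES/NO answer at one decision point."""
    return input(f"{question} [y/n]: ").strip().lower().startswith("y")

def recommend() -> str:
    """Walk a hypothetical, abbreviated subset of the Figure 2 decision points."""
    if not ask("Direct sensors for the parameter of interest?"):
        if not ask("Inference possible from related sensors?"):
            return "Install Sensors"
    if ask("Small dataset?"):
        return "Simple Modeling (statistical deviation methods)"
    if ask("High number of data points?") and ask("Physics knowledge?"):
        return "Shortlist sensors using physics knowledge, then pattern inference"
    return "Pattern inference (empirical or hybrid, per the remaining decision points)"

if __name__ == "__main__":
    print("Suggested approach:", recommend())
```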
As shown in Figure 2, multiple outcomes of the decision-making tool lead to the point labeled "Install Sensors". In many online monitoring applications, available data are insufficient for adequate anomaly detection, and thus more sensors are required. An important aspect of adding additional sensors is determining which sensors should be added to the system to provide the most benefit in anomaly detection.
The following part includes a summary that is meant to clarify the decision points of Figure 2 and provide simple guidance directing the user to the appropriate answer for a given monitoring scope. Each summary entry starts with a question that is an expanded version of the shortened question asked at the corresponding decision point of Figure 2.
Direct Measurement
Decision Point in Figure 2: Direct Sensors?
Is there at least one sensor available that directly measures the parameter of interest? For example, for anomaly detection in the feedwater flow rate, is there a sensor that directly measures that rate? There may be sensors that directly measure, for example, feedwater heater liquid level, but there are no sensors that directly measure, for example, bearing wear in condensate pump motors.
Simple Modeling
Decision Point in Figure 2: Small Dataset?
Is the number of discrete sensor indications available in the dataset small or correlated (typically one to five sensors giving the same type of data, such as all temperature or all vibration)? For example, one or a few vibration sensors on a pump can be analyzed using statistical methods for deviations, and such a dataset would therefore be considered small.
Data Inference
Decision Point in Figure 2: Inference Possible?
Is the available sensor data sufficiently related to the source of potential failure to allow anomalous indications to propagate to the sensors? That is, would it be possible to analyze the sensor data to extract the conditions of the equipment of interest? Alternatively, would the data uncertainty block the ability to infer the equipment condition? Note that this decision point is subtly different from the question of whether direct sensors exist. For example, temperature sensors in a room do not provide a direct indication of cooling fan function, but through inference, they could provide an indication of a cooling fan functioning properly.
Physics-Modeling Value
Decision Point in Figure 2: Physics Model Return on Investment?
Is the cost to develop a physics model justified by the anticipated value added to the anomaly detection process? Note that there are three locations in Figure 2 with a decision point about the physics model return on investment. Each of these decision points has slightly different considerations, but each will involve some type of cost-benefit analysis to determine whether additional physics modeling would create enough value to be worth the investment. The value can be realized by augmenting missing data, enabling an empirical model to be trained and tested, or reducing uncertainty to improve the anomaly detection process.
Data Dimensionality
Decision Point in Figure 2: High Number of Data Points?
Is the number of data points too large to be analyzed with the available resources, such that there is a need to shortlist the dataset, i.e., to reduce the data dimensionality? For example, if a method is to be trained continuously, the sensor list must be downselected so that the analysis can be completed within the desired time frame.
Physics Knowledge
Decision Point in Figure 2: Physics Knowledge?
Is the basic knowledge about the physics of the process sufficient to make valid decisions on the anomaly detection process without a detailed physics-based model or simulation? Note that there are three locations in Figure 2 with a decision point about physics knowledge. These three decision points can be broken down into more specific questions that pertain to each decision point as follows:
Following “High Number of Data Points?”: Can sensors important to the detection of the anomaly be readily identified based on physics knowledge?
Following “Explainable Validation?”: Can a series of knowledge-based decisions be used to create a rules-encoded process to detect an anomaly?
Following “Cause-Effect Needed?”: Can the cause of an anomaly be known based on physics knowledge and used to automatically classify the cause of an anomaly without more detailed data analysis?
Method of Validation
Decision Point in Figure 2: Explainable Validation?
Does the anomaly detection scheme for a critical piece of equipment require an explainable validation process (that contains the appropriate amount of rigor to meet applicable regulatory needs)?
Performance
Decision Point in Figure 2: Performance Acceptable?
Does the performance of the method (empirical or hybrid) meet the scope requirements? For example, is the method accurate, robust, and capable of providing a sufficient lead time to failure? Note that there are four locations in Figure 2 with a decision point about acceptable performance. The essential questions for each of these points are the same, with the goal of determining whether the anomaly detection system is adequate or whether more work is required to meet anomaly detection needs.
Data Sufficiency
Decision Point in Figure 2: Sufficient Data?
Is the available data sufficient and suitable to train an empirical model and test its performance in the operating conditions of interest?
Cause-Effect
Decision Point in Figure 2: Cause-Effect Needed?
Is knowledge about the cause of an anomaly needed? That is, is the detection of the occurrence of an anomaly not sufficient?
Entropy Inference
Decision Point in Figure 2: Noise Correlation Possible?
Is there enough noise within the data to construct a probability density function (PDF) and evaluate how the PDF changes across sensors as they get closer to the source of the anomaly?
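The entropy-inference method itself is not detailed in this summary; purely as an illustration of what "evaluating PDF changes" can mean in code, the hypothetical sketch below histograms the noise from two sensor windows and scores the shift with a Kullback-Leibler divergence:

```python
import numpy as np

def pdf_shift(noise_a, noise_b, bins=30):
    """Score how much the noise PDF differs between two windows; a larger
    value suggests the window is more affected by the anomaly source."""
    lo = min(noise_a.min(), noise_b.min())
    hi = max(noise_a.max(), noise_b.max())
    p, _ = np.histogram(noise_a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(noise_b, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-12, q + 1e-12              # avoid division by or log of zero
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))  # KL divergence in nats
```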
Model Fitting
Decision Point in Figure 2: Tunable Model?
Does the physics model fail to encompass the full scope of the physics, such that it must be tuned to represent some unknown properties or parameters? Can the model provide an adequate representation of reality across the wide range of conditions expected during operation? Is the physics model expected to change over time and need to be retuned? That is, are the normal operating conditions dynamic over time rather than mostly static? For example, a monitoring method for an aging component might need to be adjusted to reflect the aging process in the physics model through some tunable parameters.
5. Strategy Use Cases—High-Pressure Coolant Injection System
This section aims to leverage the strategy introduced in this paper and summarized in Figure 2 in a pilot use case with an industrial collaborator. The use case aims to detect a minor steam leak in the plant's high-pressure coolant injection (HPCI) room.
The HPCI system consists of safety-related coolant injection equipment that is only operated in emergencies to compensate for the loss of coolant in the reactor coolant system. The HPCI pump room contains temperature instrumentation that provides input to the plant data system for the purpose of detecting steam leaks. However, the normal variability of temperature in the room makes it difficult to actually detect minor steam leaks using temperature measurements alone. The temperature sensors feed alarms that trigger only when the HPCI room’s temperature exceeds certain high-temperature limits.
In 2018, an HPCI valve packing leak at an NPP resulted in a plant outage to repair the valve. It was postulated that the leaking valve may have been identified and corrected earlier with enhanced anomaly detection methods. The goal of this study was to use NPP data to develop methods for detecting leaks from the HPCI system into the HPCI pump room with inference methods that utilize existing temperature instrumentation for anomaly detection.
5.1. Initial Strategy Application: An Empirical Approach
The strategy for HPCI room temperature anomaly detection is shown in Figure 3, with the strategy path shown in blue arrows. While the system has a sensor to measure temperature in the HPCI room, no sensors directly measure the presence or absence of one or more steam leaks, because steam leaks could potentially occur in multiple locations. A large dataset (many data points over a relatively long period of time) was available for the analysis. Over a dozen individual NPP data points were aggregated and downloaded from the plant monitoring computer, called the PI system; physics knowledge was then used to shortlist the variables for use in the anomaly detection method.
Figure 4 shows a simplified schematic of the HPCI room. The reactor is a large thermal bath that transfers heat to its surroundings, including the HPCI room, both through heat transfer and steam movement. The outside air temperature affects room temperature through seasonal and daily temperature changes and semi-random weather effects.
The data were reduced to three influences on room temperature: contributions from reactor power, contributions from the outside atmosphere, and potential heating input from a steam leak in the room. Data from the power plant included the actual HPCI room temperature and reactor power as a function of time. The outside air temperature was acquired from the National Centers for Environmental Information using a weather station 65 miles from the power plant.
As with most data processing efforts, the data used in this use case had multiple out-of-range or missing values. The first preprocessing step was to remove outliers that were statistically far from the mean. These included values outside the range of what could reasonably be expected; such values could be attributed to the sensors being calibrated or turned off for short periods. The second step involved replacing outliers and any other missing points with an average of the nearest values. In comparison to the amount of data collected, the outliers and missing points were a small fraction of the total and did not impact the analysis. The last step in the preprocessing phase was to resample the data so that multiple datasets could be combined and analyzed. Some plant sensors were sampled once per minute, whereas others were sampled once per hour. To account for this mismatch in sampling frequency, all sensors were downsampled to one sample per hour to avoid the use of a priori temperature estimates between samples.
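A minimal pandas sketch of these three preprocessing steps is shown below; the plausible-range bounds and the function name are illustrative assumptions, not the plant's actual processing code:

```python
import pandas as pd

def preprocess(series: pd.Series, low: float, high: float) -> pd.Series:
    """Apply the three preprocessing steps described above to one sensor channel.

    `series` is time-indexed; `low`/`high` bound the physically plausible range.
    """
    # Step 1: drop out-of-range values (e.g., sensor offline or in calibration).
    cleaned = series.where((series >= low) & (series <= high))
    # Step 2: replace outliers/missing points with an average of nearest values.
    cleaned = cleaned.interpolate(method="time", limit_direction="both")
    # Step 3: downsample to one sample per hour so channels can be combined
    # without a priori estimates between samples.
    return cleaned.resample("1h").mean()
```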
Next in the strategy, pattern inference was applied. Because physics knowledge exists, the cause-effect relationship was known (i.e., a steam leak would increase the room temperature). The next step was to develop the pattern inference model and generate results for evaluation at the "Performance Acceptable?" step.
5.1.1. Empirical Model
A neural network was used as an empirical model to generate predicted values for the HPCI room temperature. Two methods were compared to determine the best predictive model: a feedforward neural network and an autoregressive neural network. The methods are similar, but an autoregressive method uses the output of the previous time step as an input to the current time step, as depicted graphically in Figure 5. In both approaches, the outside air temperature and reactor power level were used to predict the HPCI room temperature. Once HPCI room temperature values were predicted by the models, these values were compared to the actual recorded temperatures over a long period of time. Anomaly detection methods then identified significant differences between the predicted and measured values.
In both the feedforward and autoregressive neural networks, the inputs to the prediction model were simply the reactor power and outside air temperature. Both input variables showed relatively high-frequency noise; thus, a low-pass filter with a cutoff period of 96 h was applied, reducing noise for both inputs. Both the feedforward and autoregressive neural networks captured the general trends in the data reasonably well. However, the feedforward method produced predictions that matched less accurately near transient evolutions (such as reactor power shutdown and startup) and contained more noise. Thus, only the autoregressive method was used in the next step of the process: utilizing the predicted values with the K-means clustering method to identify anomalous data points.
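The following is a schematic sketch of the autoregressive arrangement only, assuming a previously trained model `net` with a scikit-learn-style `predict` method; the study's actual network architecture and training details are not reproduced here:

```python
import numpy as np

def autoregressive_predict(net, power, oat, t0):
    """Roll out one-step-ahead predictions, feeding each predicted room
    temperature back as an input to the next step (cf. Figure 5).

    `power` and `oat` are the low-pass-filtered input series; `t0` is the
    initial room temperature.
    """
    preds = [t0]
    for p, t in zip(power, oat):
        x = np.array([[p, t, preds[-1]]])    # current inputs + previous output
        preds.append(float(net.predict(x)[0]))
    return np.array(preds[1:])
```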
A K-means clustering algorithm was used as the anomaly detection method due to the simplicity and low dimensionality of the data. Repeating the K-means process with different numbers of clusters determined the optimal number to be five, based on a balance between a figure of merit representing the average distance from the cluster centroid and the percentage of data points assigned to anomalous clusters. The features used for the K-means cluster map were the squared error (the difference between the predicted and actual values, squared) and the squared derivative of the error (the change in error from one time step to the next, squared).
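A minimal sketch of this clustering step is given below, assuming scikit-learn; the study's exact figure of merit for selecting five clusters is summarized above rather than re-implemented:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_errors(actual, predicted, n_clusters=5, seed=0):
    """Cluster points by squared error and squared error derivative; clusters
    far from the origin are treated as anomalous (cf. Figures 6 and 9)."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    # Align the error with its one-step difference (both of length N-1).
    feats = np.column_stack([err[1:] ** 2, np.diff(err) ** 2])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(feats)
    return km.labels_, km.cluster_centers_
```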
5.1.2. Results
As seen in the cluster plot in Figure 6, the anomalous clusters can be identified as medium error, large error, medium derivative of error, or large derivative of error, which correspond to Clusters 2, 5, 4, and 3, respectively. The percentages shown in the figure for each cluster represent the portion of data points falling within that cluster.
Figure 7 shows the results of the autoregressive method for anomaly detection. The top plot shows the comparison between the actual sensor reading in blue and the temperature prediction from the autoregressive method in orange. The bottom plot shows the labeled data points plotted at their respective times, with each data point colored according to its assigned cluster. The data from anomalous clusters (all but the blue colors) are primarily grouped in time as distinct events. Data from plant outage periods are removed from the bottom plot. Overall, the neural-network empirical anomaly detection method identified 19 distinct anomalous events. These results are compared with results from a hybrid anomaly detection method discussed next.
5.2. Revised Strategy Application: A Hybrid Approach
An alternative strategy for predicting HPCI room temperature was developed. While the typical path shown in Figure 8 does not necessarily require a physics model to create training and testing data, a physics model can be used to validate pattern inference methods. Thus, when it is assumed that sufficient data for training and testing a model are not available, the decision pathway becomes as shown in Figure 8 (with the strategy path shown in blue arrows). The remainder of the decision process matched the initial strategy application.
5.2.1. Hybrid Model
Based on the strategy shown in Figure 8, a physics-based model was also used to predict the HPCI room temperature for anomaly detection. The physics-based model included a linear-regression technique using physics-based analytical equations enhanced with plant data.
Ideally, all thermal contributors to the HPCI room temperature sensor output would be modeled to determine exactly what the sensor should read at any given time based on information available from the surroundings. However, this is not possible due to a scarcity of data and limits on resources available for modeling the system. Thus, a simplified physics model was developed that only incorporated reactor power and outside air temperature as input variables. This simplified physical analysis is shown schematically in Figure 4. An additional simplifying assumption was that these variables influence room temperature only through linear relationships. With these simplifying assumptions, the equation of state for the temperature in the HPCI room reduces to:

$$C\,\frac{dT_{HPCI}}{dt} = UA_1\,(T_{OAT} - T_{HPCI}) + UA_2\,(T_{RX} - T_{HPCI})$$

where $T_{HPCI}$ is the HPCI room temperature; $T_{OAT}$ is the outside air temperature (OAT); $T_{RX}$ is the reactor average temperature, which is assumed to be linearly related to reactor power; $C$ is the thermal capacity of the HPCI room; $UA_1$ is the product of the overall heat transfer coefficient $U$ and surface area $A$ from the outside to the HPCI room (note that even if the HPCI room is not physically located next to the outside atmosphere, the heat transfer equations can still be set up in this way to approximate the overall effect of OAT on the HPCI room through its overall influence on the power plant structure); and $UA_2$ is the product of the overall heat transfer coefficient $U$ and surface area $A$ from the reactor to the HPCI room.
With some manipulation of the equation and application of time filtering, the HPCI room regression equation becomes:

$$T_{HPCI}(t) = k_1\,\tilde{T}_{OAT}(t) + k_2\,\tilde{P}_{RX}(t) + T_0$$

where $T_{HPCI}(t)$ is the HPCI room temperature as a function of time; $k_1$ is a coefficient to convert OAT to HPCI room temperature, with units of °F/°F; $\tilde{T}_{OAT}(t)$ is the OAT as a function of time, filtered (with a characteristic time constant of 96 h) to reduce high-frequency contributions to the signal; $k_2$ is a coefficient to convert reactor power to HPCI room temperature, with units of °F/%power; $\tilde{P}_{RX}(t)$ is the reactor power as a function of time, again filtered to reduce high-frequency contributions to the signal; and $T_0$ is the temperature offset. Regression analysis was performed on this equation for the HPCI room temperature as a function of time to determine the optimum values for $k_1$, $k_2$, and $T_0$.
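A minimal least-squares sketch of this regression is shown below, assuming the filtered OAT and reactor power series are already aligned with the room temperature record; the function name and array handling are illustrative:

```python
import numpy as np

def fit_room_temperature(t_hpci, oat_filtered, power_filtered):
    """Least-squares fit of T_HPCI(t) = k1*T_OAT(t) + k2*P_RX(t) + T0
    on the filtered input series; returns (k1, k2, T0)."""
    A = np.column_stack([oat_filtered, power_filtered,
                         np.ones(len(t_hpci))])   # design matrix [T_OAT, P_RX, 1]
    (k1, k2, t0), *_ = np.linalg.lstsq(A, np.asarray(t_hpci, dtype=float),
                                       rcond=None)
    return k1, k2, t0   # units: °F/°F, °F/%power, °F
```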
This model does not necessarily capture all of the various ways that heat flows into and out of the HPCI room, and the linear first-order behavior of the effect of OAT and reactor power on the HPCI room is not necessarily completely accurate. A more complex and complete physics model could have been developed to attempt to describe all of the heat transfer aspects of the space; however, this would have required significant effort and additional data that were not available. Because a complete physics model was not developed, data were used to determine the necessary coefficients in the physics equation to complete the linear regression and enable predicted values to be generated.
5.2.2. Results
As seen in the cluster plot in Figure 9, the anomalous clusters can be identified as medium error, large error, medium derivative of error, or large derivative of error, which correspond to Clusters 5, 3, 4, and 2, respectively. The percentages shown in the figure for each cluster represent the portion of data points falling within that cluster. Just as with the neural-network model output, most of the data are clustered near the origin and are considered to represent nominal operating conditions.
Figure 10 shows the results of this method for anomaly detection. The top plot shows the comparison between the actual sensor reading in blue and the temperature prediction from the physics-based (linear regression) method in orange. The bottom plot shows the labeled data points plotted at their respective times, with each data point colored according to its assigned cluster. The data from anomalous clusters (all but the blue points) are primarily grouped in time as distinct events. Data from plant outage periods are removed from the bottom plot. The hybrid physics-based (linear regression) anomaly detection method identified 18 distinct anomalous events. These results are compared with results from the empirical anomaly detection method below.
Overall, both the empirical and hybrid models were able to capture anomalous events that occurred in the NPP during the data collection period. Out of 19,775 discrete points in time, a total of 18 distinct event groupings were identified through both the physics model and the autoregressive neural-network analysis, with the neural-network model identifying one extra anomalous event. An anomaly count comparison between the physics-based linear-regression model (i.e., the hybrid model) and the autoregressive neural network (i.e., the empirical model) is shown in Figure 11, with total data points shown on the left and percentages of total data points shown on the right. Both models identified 19,387 of the same data points as "normal" behavior. The hybrid method identified 52 additional points as normal that the empirical model identified as anomalous. The empirical model identified an additional 15 data points as normal that the hybrid model identified as anomalous. Both models identified 321 of the same anomalous data points. Thus, the models agreed on 99.7% of the data points.
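As a quick consistency check on the counts reported above (no new data, just the arithmetic):

```python
# Agreement arithmetic from the reported counts.
both_normal, both_anomalous = 19_387, 321           # points where the models agree
normal_only_hybrid, normal_only_empirical = 52, 15  # points where they disagree
total = both_normal + both_anomalous + normal_only_hybrid + normal_only_empirical
agreement = (both_normal + both_anomalous) / total
print(total, f"{agreement:.1%}")  # 19775 points, 99.7% agreement
```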
Four of the 18 anomalies identified by both models were confirmed by NPP staff as actual anomalous events. Three of the events (in April 2017, 2018, and 2019) were yearly surveillance tests in which steam flowing through the HPCI room is temporarily halted, resulting in a loss of room heating and temperature reduction. The other event identified as anomalous by the models was the March 2018 HPCI valve leak event.
The remaining 14 anomalies identified by both models include multiple events that have similar dynamics. Thus, it is likely that these events are additional surveillance tests involving the HPCI system performed on a routine basis. The anomalous events of most concern are those for which the actual room temperature exceeds the temperatures predicted by the models because these could indicate a potential steam leak in the room, leading to an actual room temperature increase that is not predicted by the model. There were six such events identified in the data. Another possible explanation for these events is a loss of room cooling, possibly due to the routine securing of ventilation in the room.
Because both the empirical and hybrid models performed similarly in identifying anomalous conditions, there can be high confidence in the ability of either model to predict future anomalous conditions if the models were placed into use at the NPP. Given the simplicity of the methods, there is a low barrier to application because these methods do not require detailed system modeling. By incorporating more detailed information about the system, and in some cases additional measured information, such as heating, ventilation, and air conditioning (HVAC) data and HPCI pump and motor operational data, it is expected that the models can be refined to provide better anomaly detection or better identification of certain events as normal, depending on the circumstances.