Anomaly Detection of Power Plant Equipment Using Long Short-Term Memory Based Autoencoder Neural Network

Anomaly detection is of great significance in condition-based maintenance of power plant equipment. The conventional fixed-threshold detection method is not able to perform early detection of equipment abnormalities. In this study, a general anomaly detection framework based on a long short-term memory-based autoencoder (LSTM-AE) network is proposed. A normal behavior model (NBM) is established to learn the normal behavior patterns of the operating variables of the equipment in space and time. Based on the similarity analysis between the NBM output distribution and the corresponding measurement distribution, the Mahalanobis distance (MD) is used to describe the overall residual (OR) of the model. The reasonable range is obtained using kernel density estimation (KDE) with a 99% confidence interval, and the OR is monitored to detect abnormalities in real time. An induced draft fan is chosen as a case study. Results show that the established NBM has excellent accuracy and generalizability, with average root mean square errors of 0.026 and 0.035 for the training and test data, respectively, and an average mean absolute percentage error of 0.027%. Moreover, the abnormal operation case shows that the proposed framework can be effectively used for the early detection of abnormalities.


Introduction
In recent years, attention has been devoted to the improvement of condition monitoring systems (CMSs) for power plants [1,2]. One of the critical tasks is anomaly detection for equipment, which can be used to reduce the high cost of unplanned downtime and has great significance in condition-based maintenance. Anomaly detection refers to the detection of patterns that do not conform to established normal behavior models (NBMs) in a specified dataset. The patterns detected are called anomalies [3]. The CMSs store a large amount of operation data in power plants [4][5][6], and the use of this data represents an important field for future smart power plants (SPPs) [7,8]. Therefore, anomaly detection using data-driven approaches is a popular research topic for SPPs. Power plant equipment can be regarded as a dynamic system with n operating variables, and NBMs describe the normal behavior patterns among these variables. The data-driven modeling approach establishes NBMs based on the historical normal operation dataset, without knowledge of the precise mechanism of the system. The residuals between the NBM outputs and measured variables are monitored in real-time. When the residual exceeds the threshold, the system is considered to be abnormal [9].
Machine learning methods were used to create NBMs as early as 20 years ago [10][11][12], due to their ability to model nonlinear dynamic systems. However, early computing power was limited. The main contributions of this paper are as follows:
1. The proposal of an anomaly detection framework for multivariable time series based on an LSTM-AE neural network.

2. The similarity analysis of the NBM output distribution and the corresponding measurement distribution.
The remainder of this paper is organized as follows: In Section 2, the proposed anomaly detection framework based on the LSTM-AE neural network is presented in detail. In Section 3, the case study demonstrates the established NBM, comparative analysis, and the process of anomaly detection in a power plant. The conclusions and future work are provided in Section 4.

Framework Flow Chart
Diverse pieces of equipment are utilized in a power plant. In this paper, a general anomaly detection framework for power plants is proposed. The framework utilizes a large amount of operating data relating to the equipment and adopts the LSTM-AE network to establish the NBM. The residual is calculated using the NBM outputs and the corresponding measured variables. The residual threshold is obtained through statistical analysis of the residuals on the test dataset. The proposed anomaly detection framework includes two phases, as shown in Figure 1.

Offline Training NBM
The purpose of this phase is to establish the NBM of the target equipment and obtain a reasonable range of model residuals. Specific steps are as follows: Step 1: According to the target equipment, obtain the historical dataset of related operating variables.
Step 2: Perform data cleaning to obtain the normal behavior dataset based on three aspects: (1) Eliminate data recorded during equipment downtime.
(2) Eliminate data at the time of equipment failure according to operation logs.
(3) Eliminate abnormal data based on statistical characteristics. These abnormalities originate from sensors or stored procedures. Boxplots are used in this work. Then, the cleaned dataset is divided into a training dataset and a test dataset.
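The boxplot-based elimination in step (3) can be sketched as follows; the 1.5×IQR fence factor and the toy temperature values are illustrative assumptions, not taken from the paper's dataset:

```python
import numpy as np

def boxplot_outlier_mask(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

# Example: a temperature channel with two spurious sensor readings.
temps = np.array([42.1, 42.3, 41.9, 42.0, 42.2, 120.0, 42.1, -5.0])
mask = boxplot_outlier_mask(temps)
cleaned = temps[~mask]  # the 120.0 and -5.0 readings are dropped
```

The cleaned series would then be split into the training and test datasets as described above.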
Step 3: The NBM is established based on an LSTM-AE with the training dataset.
Step 4: Statistical analysis of the residuals with the test dataset is performed. The MD is proposed to describe the OR of multiple variables, and the reasonable range of residuals is obtained using KDE with a 99% confidence interval.

Online Anomaly Detection by NBM
The purpose of this phase is to perform real-time anomaly detection on the target equipment based on the trained NBM. Specific steps are as follows: Step 1: Based on the trained NBM and measured variable values, the residuals are generated in real-time to reflect the status of the equipment.
Step 2: Based on the real-time OR of the model and the residual of each variable, abnormal patterns are detected by checking whether the residuals fall within the reasonable range. This online procedure constitutes the second phase of the framework.

The Normal Behavior Model
The NBM can represent the dynamic relationships among the variables in a system [9]. When the system is normal, the output value of the NBM is consistent with the measured value; when the system is abnormal, the output value differs from the measured value. In this study, the LSTM-AE neural network is proposed to establish the NBM. Suppose that a certain piece of equipment has n operating variables; x_n represents the measured values at a certain moment, and x̂_n represents the reconstruction of x_n. The residual between x_n and x̂_n is used to judge whether the system is abnormal. The threshold α is determined by KDE of the reconstruction residuals with a 99% confidence interval. The NBM is trained with a normal operation dataset of the equipment. Algorithm 1 shows anomaly detection using the NBM.

Algorithm 1 Anomaly detection using the NBM
INPUT: normal dataset X; the measured values at a certain moment x_n; threshold α
OUTPUT: reconstruction residual ||x_n − x̂_n||
  f(·) is the NBM trained on X
  x̂_n ← f(x_n)
  if ||x_n − x̂_n|| > α then
    x_n is an anomaly
  else
    x_n is not an anomaly
  end if
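Algorithm 1 can be sketched in Python as follows; the toy values and the threshold are illustrative, and in practice x̂_n would be produced by the trained NBM and α by the KDE analysis described later:

```python
import numpy as np

def detect_anomaly(x, x_hat, alpha):
    """Algorithm 1: flag a sample as anomalous if its reconstruction
    residual ||x_n - x_hat_n|| exceeds the threshold alpha."""
    residual = float(np.linalg.norm(x - x_hat))
    return residual > alpha, residual

# Toy example: a near-perfect reconstruction stays below the threshold.
x = np.array([0.52, 0.48, 0.50])
x_hat = np.array([0.50, 0.49, 0.51])
is_anom, r = detect_anomaly(x, x_hat, alpha=0.1)  # is_anom is False here
```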

The AE Neural Network
The AE neural network is an unsupervised ANN with a hidden layer [36]. It has a three-layer symmetrical structure, as shown in Figure 2, including an input layer, a hidden layer (internal representation), and an output layer (reconstruction). The input layer to the hidden layer is the encoding process, and the hidden layer to the output layer is the decoding process. The goal of the AE is to reconstruct the original input as faithfully as possible.
The encoding and decoding processes are:

H = f_1(W_1 X + b_1)    (encoding)
X̂ = f_2(W_2 H + b_2)    (decoding)

where W_1 and b_1 represent the weight and bias from the input layer to the hidden layer, respectively; W_2 and b_2 represent the weight and bias from the hidden layer to the output layer, respectively; X, H, and X̂ represent the original input, the intermediate representation, and the reconstruction of the original data, respectively; f_1(·) and f_2(·) are activation functions. Common activation functions are the sigmoid, tanh, and ReLU functions [37].
The most important feature of the AE is that the decoded X̂ should be as close as possible to the original input X; that is, the residual between X̂ and X must be minimized. Therefore, the reconstruction error can be calculated as:

L(W, b) = (1/M) Σ_{j=1}^{M} Σ_{i=1}^{N} (x̂_i^(j) − x_i^(j))²

where N and M represent the dimension of the original data and the number of samples, respectively. The goal of model training is to find W (W_1, W_2) and b (b_1, b_2) that minimize the loss function.
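A minimal numpy sketch of the AE forward pass and reconstruction error; the random weights stand in for trained parameters and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, H = 8, 100, 4          # input dimension, samples, hidden units

# Random weights stand in for trained parameters (illustrative only).
W1, b1 = rng.normal(size=(H, N)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(N, H)) * 0.1, np.zeros(N)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.random((M, N))
Hrep = sigmoid(X @ W1.T + b1)       # encoding: H = f1(W1 X + b1)
X_hat = sigmoid(Hrep @ W2.T + b2)   # decoding: X_hat = f2(W2 H + b2)

# Mean squared reconstruction error over the M samples.
loss = np.mean(np.sum((X - X_hat) ** 2, axis=1))
```

Training would adjust W and b by gradient descent to minimize `loss`.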


The LSTM Unit
The RNN uses internal state (memory) to process time-series input and capture the sequential relationships among the input variables. The backpropagation through time (BPTT) algorithm is typically employed to train the RNN [38][39][40]. However, the chain rule of derivation in the BPTT training algorithm results in gradient vanishing or explosion for long-term dependency tasks. The LSTM unit combines short-term memory with long-term memory using subtle gate control and solves the gradient vanishing problem to a certain extent [41], as illustrated in Figure 3.
The pre-built memory unit consists of three gate structures: an input gate, a forget gate, and an output gate. The input and output gates control input and output activation to the memory cell, respectively, whereas the forget gate updates the state of the cell. These gate structures constitute a self-connected constant error carousel (CEC), which allows constant error flow through the internal state to solve the gradient vanishing problem. The memory cells are updated and the output is produced by the following equations [41]:

f_t = σ(W_x^f x_t + W_h^f h_{t−1} + b_f)
i_t = σ(W_x^i x_t + W_h^i h_{t−1} + b_i)
C̃_t = tanh(W_x^c x_t + W_h^c h_{t−1} + b_c)
C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t
o_t = σ(W_x^o x_t + W_h^o h_{t−1} + b_o)
h_t = o_t ⊗ tanh(C_t)

where h_{t−1} and C_{t−1} are the output and cell state at the previous moment, respectively; x_t represents the current input; f denotes the forget gate; f_t represents the forgetting control signal for the cell state at the previous moment; f_t ⊗ C_{t−1} represents the information retained from the previous moment; i denotes the input gate; C̃_t represents the candidate cell state at the current moment; i_t represents the control signal for C̃_t; o denotes the output gate; h_t represents the final output; o_t represents the output control signal; W_x represents the input-hidden layer connection matrices; W_h indicates the hidden layer recurrent connection matrices; ⊗ represents elementwise multiplication.
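The gate equations above can be sketched in numpy as a single LSTM step; the per-gate dictionary layout of the weights (`W`, `U`, `b`) and the random initialization are illustrative choices, not the paper's implementation:

```python
import numpy as np

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM update; W maps the input, U maps the previous hidden state."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    f_t = sig(W['f'] @ x_t + U['f'] @ h_prev + b['f'])          # forget gate
    i_t = sig(W['i'] @ x_t + U['i'] @ h_prev + b['i'])          # input gate
    C_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                          # cell update (elementwise)
    o_t = sig(W['o'] @ x_t + U['o'] @ h_prev + b['o'])          # output gate
    h_t = o_t * np.tanh(C_t)                                    # final output
    return h_t, C_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5
W = {g: rng.normal(size=(n_hid, n_in)) * 0.1 for g in 'fico'}
U = {g: rng.normal(size=(n_hid, n_hid)) * 0.1 for g in 'fico'}
b = {g: np.zeros(n_hid) for g in 'fico'}
h, C = np.zeros(n_hid), np.zeros(n_hid)
h, C = lstm_step(rng.random(n_in), h, C, W, U, b)
```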

The LSTM-AE Neural Network
Abnormal detection of power plant equipment requires simultaneous monitoring of multiple variables. An AE-based NBM can reconstruct multiple variables at the same time without manually selecting input variables. These variables in power plant equipment are not only spatially correlated, but also have a strong temporal correlation. The conventional AE neural network cannot mine the time correlation of input layer variables.
For this purpose, the LSTM-AE neural network is proposed, with the hidden layer units replaced by LSTM units. The LSTM-AE neural network can not only learn the correlation between input variables, but also the correlation along the time series. At the same time, the LSTM unit avoids the long-term dependency problem. The structure of the LSTM-AE is depicted in Figure 4.


Figure 4. Architecture of the LSTM-AE (encoding, internal representation, decoding).
In Figure 4, X is the original input; X̂ is the output, the reconstruction of X; I is the encoding result; k represents the look-back time steps in the LSTM unit; h_e and C_e are the output and cell state, respectively, of the LSTM unit in the encoding process; h_d and C_d are the output and cell state, respectively, of the LSTM unit in the decoding process.

Data Preparation
In this study, an induced draft fan is considered as an example, with a historical dataset of the related temperature variables obtained from the CMS. The obtained dataset spans 1 May 2019 to 1 May 2020, with a 1 min sampling interval. Eight temperature variables are contained in the dataset, as follows:
Lubricating oil temperature (LOT): lubricating oil is used to cool the bearings in the induced draft fan.
Outlet temperature of electrostatic precipitator A1 (POT_A1): the temperature of the flue gas from electrostatic precipitator A1 into the induced draft fan.
Outlet temperature of electrostatic precipitator A2 (POT_A2): the temperature of the flue gas from electrostatic precipitator A2 into the induced draft fan.
Bearing temperature at the non-drive end of the motor (MNDT): indicates the health of the bearing.
Bearing temperature at the drive end of the motor (MDT): indicates the health of the bearing.
Main bearing temperature 1 (MBT_1): reflects the health status of the bearing.
Main bearing temperature 2 (MBT_2): reflects the health status of the bearing.
Main bearing temperature 3 (MBT_3): reflects the health status of the bearing.
The correlation coefficient matrix indicates the correlation between the variables; as shown in Figure 5, there is a clear correlation between the variables.


Data Cleaning
In principle, a normal behavior dataset is required to establish an NBM, so it is necessary to clean the acquired historical dataset. Firstly, the dataset is cleaned according to the equipment operating signal. Furthermore, the data recorded during failure periods are eliminated according to the operation log. In this study, the boxplot method is used to eliminate, from a statistical perspective, abnormal data caused by sensors and the storage procedure. A boxplot is a standardized way of displaying a dataset based on a five-number summary: the minimum, the first quartile, the sample median, the third quartile, and the maximum. When a data point falls outside the minimum-maximum range, it is considered an outlier [42]. The principle of outlier detection using a boxplot is shown in Figure 6. The data cleaning results in this study are shown in Figure 7; the points in red circles are considered outliers and are eliminated.


NBM Based on LSTM-AE
In this study, the NBM of an induced draft fan was established using an LSTM-AE neural network with eight temperature variables as model inputs. After data cleaning, 266,538 observations remained in the sample. Fifty percent of the sample comprised the training dataset, and the remaining 50% comprised the test dataset. Min-max normalization was adopted, and the range of the variables was scaled to [0, 1]. For the evaluation of the developed NBM of the induced draft fan, two indices, root mean square error (RMSE) and mean absolute percentage error (MAPE), were used. The two indices are defined as:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )
MAPE = (100%/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i|

where y_i is the measured value, ŷ_i is the reconstructed value, and n is the number of variables. The look-back time step and the number of hidden layer units are important hyperparameters in the LSTM-AE and determine the structure of the NBM. Many different combinations were tested in this work; e.g., "4-6" means that the look-back time step is 4 and the number of hidden layer units is 6. When the look-back time step is 0, the structure is a conventional AE. The performances of the different combinations, as measured by the two evaluation indices, are shown in Figure 8. The comparison between the LSTM-AE and the AE is shown in Table 1, and indicates that the LSTM-AE is significantly better.
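The two evaluation indices can be computed as follows; the toy measurement vectors are illustrative, not values from the induced draft fan dataset:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error between measured and reconstructed values."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

# Toy example with four variables.
y = np.array([50.0, 40.0, 80.0, 60.0])
y_hat = np.array([49.0, 41.0, 78.0, 61.0])
# rmse(y, y_hat) ~= 1.323, mape(y, y_hat) ~= 2.167 %
```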
It can be clearly seen from Figure 8 that when the number of hidden layer units is small, the model performance is more sensitive to the number of hidden layer units than to the look-back time step; however, a small look-back time step can also effectively improve the model performance. Therefore, the temporal relationship between variables is meaningful for improving model performance. However, when the number of hidden layer units is very small, an overly large look-back time step degrades model performance. When the look-back time step exceeds 0 and the number of hidden layer units is greater than six, the RMSE and MAPE of the training and test datasets are highly similar and very low, which indicates that the accuracy and generalizability of the model are very high. However, increasing the look-back time steps and the number of hidden layer units increases the required model training resources. In this study, while ensuring the accuracy of the model, a small-scale model structure should be selected as far as possible; thus, the combination "8-12" was selected as the final model structure.

Comparative Analysis
In previous research, the principal component analysis-nonlinear autoregressive neural network with exogenous input (PCA-NARX) method was proposed to establish the NBM [25]. PCA was used to reduce redundancy and retain the useful information of the dataset. The NARX is a nonlinear autoregressive model with exogenous inputs and can be stated algebraically as:

y_t = F(y_{t−1}, y_{t−2}, ..., y_{t−n_y}, u_{t−1}, u_{t−2}, ..., u_{t−n_u}) + ε_t    (12)

where y is the variable of interest; u is the exogenous variable; n_y is the TDL (number of time lags) of y, which indicates that previous values help to predict y; n_u is the TDL of u; ε is the error term; F(·) is the neural network. In this model, information about u helps predict y, as do previous values of y itself.
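Building the NARX regressor rows from lagged y and u values can be sketched as follows; `narx_inputs` is a hypothetical helper name, and a neural network F would then be fit on the resulting rows:

```python
import numpy as np

def narx_inputs(y, u, n_y, n_u):
    """Build NARX regressor rows [y(t-1..t-n_y), u(t-1..t-n_u)] with target y(t)."""
    start = max(n_y, n_u)
    rows, targets = [], []
    for t in range(start, len(y)):
        lags_y = y[t - n_y:t][::-1]   # y(t-1), ..., y(t-n_y)
        lags_u = u[t - n_u:t][::-1]   # u(t-1), ..., u(t-n_u)
        rows.append(np.concatenate([lags_y, lags_u]))
        targets.append(y[t])
    return np.array(rows), np.array(targets)

# Toy series: 10 samples of y and an exogenous signal u.
y = np.arange(10, dtype=float)
u = np.arange(10, dtype=float) * 0.5
X, T = narx_inputs(y, u, n_y=2, n_u=1)  # 8 rows of 3 lagged features each
```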

In the PCA-NARX method, the exogenous variable is the PCA result of the object variables. In this study, an NBM based on PCA-NARX was established. The TDLs of the object and exogenous variables were hyperparameters, and numerous combinations were trained. The performance of PCA-NARX for the different combinations is shown in Figure 9. The comparison of the best performance of the LSTM-AE and PCA-NARX is shown in Table 2. The PCA-NARX method considers the time-sequence correlation between variables, so it is better than the conventional AE method. However, the LSTM-AE is better than PCA-NARX. From the perspective of model principles, PCA-NARX is a predictive model, whereas LSTM-AE is a reconstruction model.


Statistical Analysis on the Residuals
In this study, a total of 133,405 observations were obtained from the corresponding reconstruction by the trained NBM. The MD was proposed to describe the overall residual (OR) of the NBM. The MD provides a univariate distance between multi-dimensional samples that obey the same distribution, with consideration of the correlation between variables and the dimensional scale, and has been successfully applied to capture different types of outliers [43]. The MD can be calculated as follows:

MD(X_i, X_j) = sqrt( (X_i − X_j)^T C^{−1} (X_i − X_j) )    (13)

where X_i and X_j are different samples from the same distribution, and C is the covariance matrix between the variables of the distribution. The distribution similarity between the measured dataset and the reconstructed dataset was used in this study. Kullback-Leibler (KL) divergence [44] was used to describe the distribution difference between the measurement dataset and the reconstructed dataset. KL divergence is an asymmetric and non-negative measure of the difference between two probability distributions; a smaller KL divergence demonstrates a greater similarity of the two distributions. KL divergence can be calculated as:

D_KL(P ∥ Q) = Σ_{i=1}^{N} p(x_i) log( p(x_i) / q(x_i) )    (14)

where P and Q are two distributions; p(x_i) and q(x_i) represent the probability of the two distributions for sample x_i, respectively; N is the number of samples.
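A minimal numpy sketch of the MD between two samples; the random data used to estimate the covariance matrix are illustrative only:

```python
import numpy as np

def mahalanobis(xi, xj, C):
    """MD between two samples, given the covariance matrix C of the distribution."""
    d = xi - xj
    return float(np.sqrt(d @ np.linalg.inv(C) @ d))

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 3))       # stand-in for same-distribution samples
C = np.cov(data, rowvar=False)          # covariance between the variables
md = mahalanobis(data[0], data[1], C)
```

With an identity covariance, the MD reduces to the Euclidean distance, which is a quick sanity check on the implementation.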
Because the dataset has eight dimensions and each dimension takes 100 points uniformly, a total of 10^16 sample points exists, which causes calculation difficulties. In this work, 2000 observations were randomly selected each time, and 1000 experiments were performed. The KL divergence of the measurement dataset and the reconstructed dataset over the 1000 experiments is shown in Figure 10. The calculated KL divergences are very small and consistent. The red line indicates the trend of the average KL divergence as the number of experiments increases. The final KL divergence was calculated to be 0.0049 using the law of large numbers. The experimental results show that the distributions of the measurement dataset and the reconstructed dataset are approximately equal; therefore, it can be considered that the reconstructed dataset and the measurement dataset belong to the same distribution. The MD was applied to describe the OR of the model, calculated as follows:

OR_i = sqrt( (Y_i − Ŷ_i)^T C^{−1} (Y_i − Ŷ_i) )    (15)

where Y_i and Ŷ_i represent the measured and reconstructed values of the operating variables at time i, respectively, and C is the covariance matrix between the operating variables.
The residuals of each variable and of the overall model were subjected to statistical analysis. The KDE method was used to estimate the probability density function (PDF) of the residual distributions and thereby obtain a reasonable residual range. KDE fits a smooth kernel function to the observed data points to approximate the true probability distribution. Let x_1, x_2, ..., x_n be an independent and identically distributed data sample; then, the kernel density estimate of its PDF is

f̂_h(x) = (1/(nh)) Σ_{i=1}^{n} K((x − x_i)/h)

where h is a smoothing parameter called the bandwidth and K(·) is the kernel function. The larger the bandwidth, the smaller the contribution of each observed data point to the final curve shape and the flatter the overall curve; the smaller the bandwidth, the greater the contribution of each point and the steeper the curve. Commonly used kernel functions include the triangular, rectangular, Epanechnikov, and Gaussian kernels. In this work, a Gaussian kernel was used and an adaptive bandwidth method [45] was adopted. The residual distributions of the overall NBM and of each variable are shown in Figures 11 and 12, respectively. F(·) denotes the cumulative distribution function (CDF), the integral of the PDF. The lower and upper thresholds of the reasonable residual range are denoted R_lower and R_upper, respectively. In this work, the reasonable residual range was obtained from the 99% confidence interval, i.e., F(R_lower) = 0.01 and F(R_upper) = 0.99; the resulting ranges for the overall NBM and each variable are given in Table 3. For the overall residual of the model, only the upper limit is used.
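The threshold extraction can be sketched with SciPy's `gaussian_kde`. Note two assumptions: SciPy applies Scott's rule for the bandwidth rather than the adaptive bandwidth method [45] used in this work, and the synthetic residual sample below is illustrative only:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
residuals = rng.normal(0.0, 0.1, size=5000)  # stand-in for the NBM residuals

kde = gaussian_kde(residuals)  # Gaussian kernel, Scott's-rule bandwidth

def cdf(x):
    """F(x): integral of the KDE density from -inf to x."""
    return kde.integrate_box_1d(-np.inf, x)

def kde_quantile(target, lo=-1.0, hi=1.0, tol=1e-6):
    """Invert the KDE CDF by bisection to locate a residual threshold."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 99% confidence interval: F(R_lower) = 0.01, F(R_upper) = 0.99
r_lower, r_upper = kde_quantile(0.01), kde_quantile(0.99)
```

For the overall model residual, only `r_upper` would be monitored, since the OR is a non-negative distance.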

Abnormality Detection
Based on the trained NBM of the induced draft fan as described in the previous section, this work demonstrates the proposed abnormality detection methods using two cases: normal operation and abnormal operation.

Normal Operation Case
A two-day operation dataset was collected from May 1 to May 2, 2020, at 1-min intervals to monitor abnormalities in real-time. The real-time monitoring of the OR and of each variable is shown in Figures 13 and 14, respectively. The measured values obtained from the CMS and the reconstructed values given by the NBM are monitored in real-time; the reasonable residual ranges of each variable are listed in Table 2. Time lag and noise in actual operation cause occasional brief excursions outside the allowable residual range during real-time monitoring. Therefore, in the actual monitoring process, an averaging sliding window of size 30 is used to smooth the monitored residuals. Figure 13 shows that the OR of the model remains within the reasonable range throughout this period, indicating that the induced draft fan operates normally, consistent with this case. Figure 14 shows that the residual of each variable does not exceed the reasonable range and that the reconstructed values are very close to the measured values, indicating that the trained NBM has good generalization and accuracy.
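The size-30 averaging window can be sketched as a trailing moving average; the warm-up handling for the first 29 points is an implementation choice of this sketch:

```python
import numpy as np

def sliding_mean(residuals, window=30):
    """Trailing moving average of the monitored residuals.
    Before a full window is available, average all samples so far."""
    r = np.asarray(residuals, dtype=float)
    out = np.empty(r.size)
    for i in range(r.size):
        out[i] = r[max(0, i - window + 1): i + 1].mean()
    return out
```

Smoothing this way suppresses the short, noise-driven excursions so that only sustained departures from the reasonable range trigger an alarm.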

Abnormal Operation Case
In this study, because no real failure case of the induced draft fan was available, an abnormal dataset was constructed manually for an early-warning demonstration, according to the variable correlations shown in Figure 5. The correlation coefficients among LOT, MNDT, MDT, MBT_1, MBT_2, and MBT_3 are relatively high, so failure data generated among them would carry high uncertainty. The abnormal dataset was therefore generated using POT_A1, with the other variables left unchanged. Starting from the normal operation case at 12:00 on May 1, a linear cumulative drift with a coefficient of 0.02 °C per minute was added to POT_A1, as shown in Figure 15. The real-time monitoring of the OR is shown in Figure 16: the OR exceeds the upper threshold at 19:50, when the cumulative drift of POT_A1 is 9.4 °C. The real-time monitoring of POT_A1 is shown in Figure 17: the residual of POT_A1 exceeds the upper threshold at 16:15, when the cumulative drift is 5.1 °C. With the conventional fixed threshold, no alarm would be raised until POT_A1 reaches 180 °C. In this case, the OR exceeds its threshold when POT_A1 is 132.23 °C, and the residual of POT_A1 exceeds its threshold when POT_A1 is 127.19 °C. This shows that the anomaly detection method proposed in this study can effectively detect early anomalies of equipment. Using the MD to characterize the overall model residual can reduce the false alarm rate compared with a single-variable threshold. Therefore, in the monitoring process, the OR is used to detect the abnormal state of the equipment, and the single-variable residuals are used to analyze the cause of the abnormality.
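The drift injection can be sketched as follows. The flat 120 °C baseline is a hypothetical stand-in for the real POT_A1 readings from the CMS; with a 1-min sampling interval, a 0.02 °C-per-sample coefficient reproduces the cumulative drifts quoted in the text (255 min → 5.1 °C, 470 min → 9.4 °C):

```python
import numpy as np

def add_linear_drift(series, start_idx, coeff=0.02):
    """Add a linear cumulative drift of `coeff` units per sample,
    starting from start_idx (zero drift at the starting sample)."""
    drifted = np.asarray(series, dtype=float).copy()
    n_after = drifted.size - start_idx
    drifted[start_idx:] += coeff * np.arange(n_after)
    return drifted

# Two days at 1-min resolution (2880 samples); the drift starts at
# 12:00 on day 1, i.e., sample index 720.
pot_a1 = np.full(2880, 120.0)          # hypothetical steady POT_A1 series
drifted = add_linear_drift(pot_a1, 720)
# 16:15 on day 1 is index 975: drift = 0.02 * 255 = 5.1 °C
# 19:50 on day 1 is index 1190: drift = 0.02 * 470 = 9.4 °C
```

Perturbing only POT_A1 keeps the injected fault interpretable: any detected anomaly must stem from that single variable's departure from its learned relationship with the others.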


Conclusions and Future Work
A general anomaly detection framework based on an LSTM-AE neural network is proposed for the early detection of abnormalities in power plants. The LSTM-AE-based NBM considers the spatial and temporal correlations of the related operating variables and realizes the simultaneous monitoring of multiple variables. In the case study, hyperparameters of the LSTM-AE model, including the number of hidden-layer units and the look-back time step, were carefully optimized to obtain a more accurate reconstruction and better generalization. A comparative analysis between the LSTM-AE and PCA-NARX models shows that the LSTM-AE model performs better: the average RMSEs on the training and test datasets are 0.026 and 0.035, respectively, and the corresponding average MAPEs are within 0.027%. Based on the similarity analysis between the NBM output distribution and the corresponding measurement distribution, the MD is used to describe the overall residual of the model, and the reasonable residual ranges of the overall model and of each variable were obtained using KDE with a 99% confidence interval. The abnormal operation case shows that the proposed framework detected the abnormality when the variable value was 132.23 °C, whereas the conventional fixed threshold is 180 °C. The residual of the overall model is used to detect the abnormal state, and the residual of a single variable is used to analyze the cause of the abnormality. In summary, the proposed framework can be used for the early detection of abnormalities in real-time, which is of great significance for condition-based maintenance in power plants. In future work, a real abnormal dataset will be needed to further verify this framework, especially for cases in which multiple variables are simultaneously abnormal. Moreover, the accurate extraction of the normal-behavior operation dataset also requires further research.