A Decision-Making Method for Machinery Abnormalities Based on Neural Network Prediction and Bayesian Hypothesis Testing

Abstract: For anomaly identification of predicted data in machinery condition monitoring, traditional threshold methods have problems during residual testing: it is difficult to make a decision when the residuals are close to the threshold and fluctuate. This paper proposes a Bayesian dynamic thresholding method that combines Bayesian inference with neural network signal prediction. The method makes full use of historical prior data to build an anomaly identification and early warning model applicable to both univariate and multivariate data. A long short-term memory (LSTM) signal prediction model is established, and a Bayesian hypothesis testing-based anomaly identification strategy is then presented to quantify the probability of anomaly occurrence and issue early warnings for anomalies beyond a certain probability. The model was applied to an open dataset from a pumping station and to actual operating data from a nuclear power turbine. The results indicate that the model successfully predicts the failure probability and failure time, verifying the effectiveness of the proposed method.


Introduction
With the development of mechanical equipment fault diagnosis technology, analyzing monitored signals to diagnose faults has become common practice. During the production process, various data on the status of equipment can be collected by sensors and monitored in real time [1][2][3][4]. In recent years, with the rapid development of machine learning, especially deep learning, various data-driven models capable of predicting future data from historical data have been developed and are widely used in the field of signal prediction. Saufi et al. [5] introduced a deep learning architecture for automatic feature learning to solve the problems of traditional fault detection and diagnosis (FDD) systems and discussed the challenges of deep learning and future work related to mechanical fault diagnosis systems. Among the various deep learning time-series prediction methods, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are the most widely used. The input of an RNN at each time step includes both the current input data and the hidden-layer state from the previous step. These hidden-layer units therefore play the role of a memory module, preserving the information of historical data and updating it as new data arrive. Zhao et al. [6] reviewed the emerging research work on deep learning in machine health monitoring systems in terms of autoencoders and their variants, restricted Boltzmann machines and their variants, including deep belief networks (DBNs) and deep Boltzmann machines (DBMs), convolutional neural networks (CNNs), and RNNs. Liu et al. [7] presented the low-delay lightweight RNN (LLRNN) model for mechanical fault diagnosis, which occupies less memory space than previous methods and has low computational latency. The LSTM network is an improvement on the RNN that mainly solves the gradient vanishing or explosion problem of RNNs [8] and makes effective use of long-range sequence information.
Tian et al. [9] proposed a model that combines self-correlated local feature scale decomposition and an improved LSTM neural network as a hybrid prediction modeling strategy to build a reciprocating compressor vibration signal prediction model. Ma et al. [10] compared LSTM with other prediction algorithms using the mean absolute percentage error (MAPE) and mean squared error (MSE) as evaluation metrics and showed that the LSTM model has better accuracy and stability than the other algorithms. Kim and Won [11] combined LSTM models with various generalized autoregressive conditional heteroskedasticity (GARCH)-type models to significantly improve the forecasting performance for time-series data. Di Persio and Honchar [12] proposed a novel approach to energy load time-series forecasting, combining deep learning networks with Bayesian prior properties to predict the data of interest. He et al. [13] used a deep belief network to carry out unsupervised fault diagnosis of gear transmission chains. In this study, LSTM is employed to establish the prediction model.
In anomaly identification, the construction of the prediction model is only one part of the process, and this study focuses on two other aspects: (1) the fault identification decision strategy, which plays a decisive role, i.e., evaluating the residual between the hypothetical healthy value given by the prediction model and the actual monitored value; and (2) the data processing method in the multivariate case. The most common anomaly decision method is the threshold method and its improved algorithms, i.e., the residuals are recognized as anomalous when they exceed a threshold value. Zeng et al. [14] calculated the performance index of a generic delayer to adjust the threshold appropriately and reduce unexpected alarms. Zadakbar et al. [15] developed a quantitative operational risk assessment model to control the activities of an early warning system by setting a threshold value for acceptable operational risk. Yu et al. [16] designed optimal alarm thresholds for univariate simulated process variables based on alarm probability maps. Gao et al. [17] proposed a new multivariate alarm threshold optimization method based on the consistency of the correlation between process and alarm data and applied a particle swarm optimization (PSO) algorithm to obtain optimal thresholds. Zhao et al. [18] proposed an adaptive threshold determined by extreme value theory to prevent false alarms caused by the extreme value distribution of the calculated detection index in the health monitoring of wind turbines. Aslansefat et al. [19] evaluated the performance metrics of an alarm system with variable alarm thresholds and provided a genetic algorithm-based optimization process for tuning the parameter settings to improve the performance metrics. Zhang et al. [20] optimized the objective function with the PSO algorithm to obtain the optimal alarm threshold, which significantly reduced the false alarm rate.
Each of these methods for determining the threshold value has its own merits, but in general, they all still have a certain rate of missed diagnoses and false diagnoses, especially when the residual is close to the threshold value. There is no reference standard for setting the threshold value in practice, and finding the optimal threshold incurs a high computational cost and yields an insufficient accuracy of the results.
In recent years, threshold analysis methods combined with Bayesian theory have become popular [21][22][23][24][25]. Cho et al. [26] used Bayesian networks to evaluate residuals and applied the method to fault detection of three-phase asynchronous motors. Nezhad and Niaki [27] developed a heuristic threshold strategy to detect and classify the state of a multivariate quality control system; the posterior confidence values of out-of-control features were updated using Bayesian rules and compared with decision thresholds to determine the minimum acceptable confidence value. Trachi [28] developed a fault testing strategy for electric motors based on likelihood ratio hypothesis testing, but the Bayesian approach was used only for parameter estimation. Jiang et al. [29] used a Bayesian hypothesis testing approach to develop anomaly identification criteria determined by testing whether the mean of the residuals is zero. However, if the prediction model is not accurate enough, minor anomalies will be masked by the model's own residuals and will be difficult to identify.
As discussed above, although existing thresholding methods combined with Bayesian analysis improve on traditional judgment methods, they do not adequately exploit a priori information and have difficulty making judgments in some cases, such as with multivariate data or data with slight anomalies. In this paper, further research is carried out in three aspects: (1) anomaly identification for time-series data with multiple variables, (2) making full use of a priori information for decision-making, which is more in line with actual application situations, and (3) calculating a confidence level for identified anomalies and providing early warning based on the quantified probability. The fault anomaly decision process is shown in Figure 1.

Data Preprocessing in the Multivariate Case
Before constructing the LSTM prediction model, the time-series signal needs to be processed, generally including noise reduction [30], normalization, and phase space reconstruction. The details of wavelet packet denoising and normalization can be found in our previous study [31]. For the multivariate time-series data, the new composite variables are output by dimensionality reduction in this study.

Dimensionality Reduction
During the operation of mechanical equipment, the number of monitored variables can reach several hundred, and it is necessary to extract the main information by dimensionality reduction to simplify the subsequent calculation process while retaining most of the essential information. Probabilistic principal component analysis (PPCA) [32] is employed to reduce the dimensionality and uncertainty of the a priori data. PPCA is based on a Gaussian latent variable model that introduces a probabilistic framework into principal component analysis to attenuate the influence of noisy variables on the structural characteristics of the data. This ensures full use of the limited information and avoids the "curse of dimensionality". The PPCA model is expressed as

x = Wz + µ + ψ,

where x is a p-dimensional observation vector with n data points in each dimension, i.e., x_i = {x_i1, x_i2, ..., x_in}; W is a weight matrix of dimension p × r (r ≤ p); z is a latent variable of dimension r, z ~ N(0, I_r); µ is the sample mean vector; and ψ is a noise vector, ψ ~ N(0, σ²I_p), where I is the identity matrix. The maximum likelihood estimates of W and σ² are

W_ML = U_r (Λ_r − σ²I_r)^(1/2) R,   σ²_ML = (1/(p − r)) Σ_{j=r+1..p} λ_j,

where the λ_j are the eigenvalues obtained by eigendecomposition of the sample covariance matrix of X, U_r contains the corresponding r leading eigenvectors, Λ_r = diag(λ_1, ..., λ_r), and R is an arbitrary orthogonal rotation matrix. The conditional expectation of the latent variables is

E[z|x] = M⁻¹Wᵀ(x − µ),   M = WᵀW + σ²I_r.

The number of principal components retained is determined by the cumulative variance contribution rate, which is generally required to be no less than 80%.
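The closed-form maximum likelihood solution above can be sketched directly in NumPy (a minimal illustration of the PPCA reduction step; the function name and the 80% default threshold are ours, and the arbitrary rotation R is taken as the identity):

```python
import numpy as np

def ppca_reduce(X, var_threshold=0.80):
    """Closed-form ML PPCA (Tipping & Bishop): latent scores, noise variance, rank."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    # eigendecomposition of the sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # choose r by cumulative variance contribution rate (>= 80% here)
    ratios = np.cumsum(eigvals) / eigvals.sum()
    r = int(np.searchsorted(ratios, var_threshold) + 1)
    # ML noise variance: average of the discarded eigenvalues
    sigma2 = eigvals[r:].mean() if r < p else 0.0
    # ML weight matrix W = U_r (Lambda_r - sigma2 I)^(1/2), R = I
    W = eigvecs[:, :r] * np.sqrt(np.maximum(eigvals[:r] - sigma2, 0.0))
    # posterior mean of latent variables: M^-1 W^T (x - mu); M is symmetric
    M = W.T @ W + sigma2 * np.eye(r)
    Z = Xc @ W @ np.linalg.inv(M)
    return Z, sigma2, r
```

The latent scores `Z` then serve as the virtual signals fed to the prediction model.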

Phase Space Reconstruction
The predicted data of the signal are likely to be chaotic time-series data, and phase space reconstruction is needed to make the data applicable to the neural network model. The C-C algorithm [33] is used to estimate the delay time τ and the embedding dimension m for phase space reconstruction. For a univariate time series X = {x_1, x_2, x_3, ..., x_n}, the reconstructed matrix is

X_i = [x_i, x_{i+τ}, ..., x_{i+(m−1)τ}],   i = 1, 2, ..., N_m,

where N_m = n − (m − 1)τ is the number of vectors after phase space reconstruction. Each vector [x_i, x_{i+τ}, ..., x_{i+(m−1)τ}] is one reconstructed input sample; that is to say, there are m data input nodes.
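The delay embedding above is a simple strided-indexing operation; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def phase_space_reconstruct(x, m, tau):
    """Delay-embed a 1-D series: row i is [x_i, x_{i+tau}, ..., x_{i+(m-1)tau}]."""
    n = len(x)
    n_vec = n - (m - 1) * tau  # N_m vectors after reconstruction
    if n_vec <= 0:
        raise ValueError("series too short for the chosen m and tau")
    # each row spans (m-1)*tau + 1 samples, taken with stride tau
    return np.stack([x[i:i + (m - 1) * tau + 1:tau] for i in range(n_vec)])
```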

LSTM Prediction Model
By using the preprocessed dataset as the input of the LSTM model, the predicted data can be output after determining the number of layers and related parameters. The larger τ is, the longer the time span over which future data are predicted. The predicted data Y are given in Equation (5). The detailed structure of the LSTM network is described in our previous work [31]. In this study, single-step prediction is used: each set of input data contains m values, and a total of N_m groups of values are input. The data input mode of group i is shown in Figure 2, where h_0 and c_0 are random initial values, h_{m−1} is the hidden-state output of the previous step, c_{m−1} is the cell-state (historical information) output of the previous step, and the circle represents merging of the inputs. After training, a new time series is input, and the output is the corresponding predicted value. It is worth noting that, to achieve a contribution rate of 80%, more than one virtual signal often remains after dimensionality reduction, so a separate prediction is made for each reduced virtual signal, and a comprehensive analysis is then performed with Bayesian hypothesis testing.
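The way h and c carry history forward from step to step (as in Figure 2) can be sketched with a minimal NumPy LSTM cell. This is an illustrative single-cell forward pass, not the trained network of the paper; the weight shapes and the stacked-gate layout are our implementation choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step: four gates computed from the current input x_t
    and the previous hidden state h_prev, stacked in a single vector z."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell state
    c = f * c_prev + i * g        # cell state carries long-term history
    h = o * np.tanh(c)            # hidden state is the step's output
    return h, c

def lstm_forward(seq, W, U, b, h0, c0):
    """Feed the m reconstructed inputs through the cell; h and c are handed
    from step to step, playing the roles of h_{m-1} and c_{m-1} in Figure 2."""
    h, c = h0, c0
    for x_t in seq:
        h, c = lstm_step(np.atleast_1d(x_t), h, c, W, U, b)
    return h, c
```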

Real-Time Bayesian Hypothesis Testing
The Bayesian hypothesis testing method is developed to identify anomalies in real time. Unlike the traditional threshold method, Bayesian hypothesis testing can extract useful information from historical prior data, which can then be used for subsequent abnormality judgment. In addition, because the data processing method described above handles both univariate and multivariate data, the model can deal with most data from different industries and can handle situations with insufficient information, uncertain data, and chaotic multivariate data in actual production processes. The specific construction method follows.

Setting the Hypotheses
The threshold approach monitors whether the data residuals exceed a set threshold to determine whether to raise an alarm, as shown in Figure 3. In this study, instead of setting a threshold through complex calculation, Bayesian inference is used to establish two hypotheses on the mean and variance of the residuals. Using a sliding window, the dataset X = {x_1, x_2, ..., x_k} is composed of the current point and the previous k − 1 points, where k is the length of the sliding window; n data points therefore yield n − k + 1 such datasets. Assuming each window obeys a normal distribution N(µ, σ²), the corresponding mean and variance can be obtained. Figure 4 shows the residuals and the corresponding means and variances, which fluctuate when the residuals fluctuate dramatically. The mean and variance can therefore serve as good substitutes for the residuals themselves.
Let the floating values of the mean and variance be ε and δ, respectively. The floating values are obtained from the prior information, and the null and alternative hypotheses for the mean and variance are set up in terms of whether either value exceeds its floating value. Taking the mean µ as an example, the null hypothesis is H_0: µ ∈ [µ̂ − ε, µ̂ + ε] and the alternative hypothesis is H_1: µ ∉ [µ̂ − ε, µ̂ + ε], where µ̂ is the estimate of µ. The hypothesized probabilities are [34]

α_0 = P(H_0|X) = ∫_{µ̂−ε}^{µ̂+ε} g(µ) dµ,   α_1 = P(H_1|X) = 1 − α_0,

where g(µ) is the posterior probability distribution of µ. In the following, g(µ) and the estimate µ̂ are solved.
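The sliding-window statistics are straightforward to compute (a sketch; the function name is ours, and the window length used later in the case study, k = 12, is just one choice):

```python
import numpy as np

def sliding_window_stats(residuals, k):
    """Mean and sample variance of each length-k window
    (n data points yield n - k + 1 windows)."""
    r = np.asarray(residuals, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(r, k)
    return windows.mean(axis=1), windows.var(axis=1, ddof=1)
```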

Posterior Distribution Probability Determination
Assuming that dataset X in the above section obeys a normal distribution, its joint density function is

p(X|µ, σ²) = (2πσ²)^(−k/2) exp(−(1/(2σ²)) Σ_{i=1..k} (x_i − µ)²).

Considering the positions of µ and σ² in the joint density, it is reasonable to choose the normal distribution as the prior distribution of µ and the inverse gamma distribution as the prior distribution of σ², so the independent prior distribution of σ² and the conditional prior distribution of µ are

σ² ~ Inv-Gamma(υ/2, ρ/2),   µ | σ² ~ N(τ, σ²/ϕ).   (8)

This leads directly to the joint prior density function of (µ, σ²):

p(µ, σ²) = (2πσ²/ϕ)^(−1/2) exp(−(ϕ/(2σ²))(µ − τ)²) · ((ρ/2)^(υ/2)/Γ(υ/2)) (σ²)^(−(υ/2+1)) exp(−ρ/(2σ²)),   (9)

where υ, τ, ρ, ϕ are the hyperparameters, assumed for now to be given. Multiplying the likelihood by the joint prior density immediately yields the kernel of the posterior density (Equation (10)).
To simplify Equation (10), first write x̄ = (1/k) Σ_{i=1..k} x_i for the sample mean. Substituting back into Equation (10) yields a posterior density of the same form as the prior (Equation (9)), which shows that the normal-inverse-gamma distribution is the joint conjugate prior of the normal mean µ and normal variance σ²; the joint posterior can therefore also be decomposed into the product of the independent posterior of σ² and the conditional posterior of µ. The posterior hyperparameters are

υ_n = υ + k,   ϕ_n = ϕ + k,   τ_n = (ϕτ + k x̄)/(ϕ + k),   ρ_n = ρ + Σ_{i=1..k} (x_i − x̄)² + (ϕk/(ϕ + k)) (x̄ − τ)²,   (11)

so the independent posterior of σ² is the inverse-gamma distribution Inv-Gamma(υ_n/2, ρ_n/2), and the marginal posterior of the mean µ is a t-distribution with υ_n degrees of freedom, location parameter τ_n, and scale parameter determined by ρ_n and ϕ_n. The hyperparameters υ, τ, ρ, ϕ themselves are solved from a priori moments. To eliminate the effect of randomness in the training process when training the prediction model on historical data, the model is usually trained several times so that the accuracy can be evaluated comprehensively. Assuming the training is performed k times, with the mean and variance of the i-th training residual set denoted µ_i and σ²_i, the k sets can be expressed as (µ_1, σ²_1), (µ_2, σ²_2), ..., (µ_k, σ²_k). The hyperparameter estimates are obtained by matching the moments of the prior distributions (Equation (8)) to the sample moments of these sets; substituting the estimates into Equation (11) gives υ_n, ρ_n, τ_n, ϕ_n.
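The conjugate update has a closed form; the sketch below uses the standard normal-inverse-gamma update formulas under the parameterization stated above (σ² ~ Inv-Gamma(υ/2, ρ/2), µ | σ² ~ N(τ, σ²/ϕ)), which is our reconstruction of the garbled derivation:

```python
import numpy as np

def nig_posterior(x, upsilon, tau, rho, phi):
    """Normal-inverse-gamma conjugate update for N(mu, sigma^2) data.
    Prior: sigma^2 ~ Inv-Gamma(upsilon/2, rho/2), mu | sigma^2 ~ N(tau, sigma^2/phi).
    Returns the posterior hyperparameters (upsilon_n, tau_n, rho_n, phi_n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    upsilon_n = upsilon + n
    phi_n = phi + n
    tau_n = (phi * tau + n * xbar) / phi_n            # posterior location of mu
    rho_n = (rho + np.sum((x - xbar) ** 2)
             + phi * n / phi_n * (xbar - tau) ** 2)   # posterior scale of sigma^2
    return upsilon_n, tau_n, rho_n, phi_n
```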

Mean and Variance Estimation
After obtaining the posterior probability distributions of the mean µ and variance σ², the mode or expectation of each posterior is taken as the parameter estimate. For the t-distribution of the mean, the mode M_o and expectation E coincide at the location parameter:

M_o = E = τ_n.

For the inverse-gamma distribution of the variance, Inv-Gamma(υ_n/2, ρ_n/2),

M_o = ρ_n/(υ_n + 2),   E = ρ_n/(υ_n − 2)   (υ_n > 2).

The estimates of the mean µ and variance σ² are therefore

µ̂_M = µ̂_E = τ_n,   σ̂²_M = ρ_n/(υ_n + 2),   σ̂²_E = ρ_n/(υ_n − 2),

where σ̂²_M and µ̂_M are the mode estimates and σ̂²_E and µ̂_E are the expectation estimates. For the mean, choosing the mode or the expectation makes no difference because they are equal for the t-distribution. For the variance, however, the mode is smaller than the expectation: choosing the mode as the estimate makes the evaluation strategy stricter and reduces the missed-diagnosis rate, but it may also identify healthy signals as abnormal and increase the misdiagnosis rate, while the opposite holds for the expectation. The actual situation of the corresponding data should therefore be considered when selecting the estimate.
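Under the Inv-Gamma(υ_n/2, ρ_n/2) parameterization assumed above, the two candidate variance estimates reduce to one line each (the function name is ours):

```python
def variance_estimates(upsilon_n, rho_n):
    """Mode and mean of Inv-Gamma(upsilon_n/2, rho_n/2): the mode gives the
    stricter (smaller) variance estimate, the mean the more lenient one."""
    mode = rho_n / (upsilon_n + 2)
    mean = rho_n / (upsilon_n - 2)  # defined only for upsilon_n > 2
    return mode, mean
```

As the text notes, the mode is always the smaller of the two, so using it tightens the anomaly criterion.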

Normality Test
In the above solution process, the residuals are assumed to obey a normal distribution. To ensure the rigor of the decision, this paper uses the Anderson-Darling (A-D) goodness-of-fit test to determine whether the residuals obey a normal distribution. For residuals X = {x_1, x_2, ..., x_n}, a hypothesis test is performed with null hypothesis H_0: X obeys a normal distribution and alternative hypothesis H_1: X does not obey a normal distribution. First, the decision statistic A² is calculated [35]:

A² = −n − (1/n) Σ_{i=1..n} (2i − 1)[ln N((x_(i) − x̄)/S) + ln(1 − N((x_(n+1−i) − x̄)/S))],   (18)

where N(·) is the standard normal cumulative distribution function, x_(i) are the order statistics of the residuals, x̄ is the mean of the residuals, and S is the standard deviation of the residuals. After computing A² from Equation (18), the critical value A²_α is obtained from a table according to the given significance level α. If A² < A²_α, it is concluded that X obeys a normal distribution, and vice versa. If the residuals do not obey a normal distribution, the Box-Cox power transformation [36] is used to transform the nonnormal data.
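In practice the tabulated critical values are built into SciPy; a sketch of the normality gate (the function name is ours; `scipy.stats.anderson` returns the A² statistic together with critical values at the 15/10/5/2.5/1% significance levels):

```python
import numpy as np
from scipy import stats

def residuals_normal(residuals, sig_level=5.0):
    """Anderson-Darling normality check: accept H0 (normality) when the
    A^2 statistic stays below the critical value at sig_level percent."""
    result = stats.anderson(np.asarray(residuals, dtype=float), dist='norm')
    idx = list(result.significance_level).index(sig_level)
    return bool(result.statistic < result.critical_values[idx])
```

A `False` result would trigger the Box-Cox transformation step before the Bayesian test proceeds.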

Quantification and Anomaly Identification
After finding the posterior probabilities of the hypotheses, the failure probability can be quantified. Let the overall failure probability be P, and let the failure probabilities obtained from the mean and variance tests be P_µ and P_σ², respectively, where P_µ is the posterior probability of the alternative hypothesis for the mean.
Similarly, the value of P_σ² can be obtained, from which the overall failure probability P is computed. Next, a suitable strategy is developed to determine when anomalies occur. Traditional judgments are usually made with posterior probability ratios; for the mean,

λ_µ = P(H_0|X)/P(H_1|X),

and when λ_µ > 1, H_0 is accepted, i.e., the signal is considered currently healthy. On this basis, the logarithm π_µ = log₁₀ λ_µ is taken as the Bayes factor to compress the data; H_0 is accepted when π_µ > 0. Similarly, the Bayes factor π_σ² for the variance is obtained. Finally, a health value ω is set up to synthesize the judgment conditions:

ω = 1 if π_µ > 0 and π_σ² > 0 and P < 0.05;   ω = 0 otherwise.   (22)

When the health value ω is 0, the signal is considered abnormal; when it is 1, it is healthy. For multivariate data, after the virtual signals are obtained by processing such as dimensionality reduction, the above procedure is applied to each virtual signal i to obtain its health value ω_i. The overall judgment is then modified so that the signal is flagged abnormal whenever any individual health value ω_i indicates an anomaly. The whole anomaly identification and early warning model is divided into three parts: the preprocessed data are predicted by LSTM and combined with real-time Bayesian anomaly recognition to achieve early warning, as shown in Figure 1.
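The composite decision can be sketched as follows (assuming, per the prose, the convention that ω = 1 denotes a healthy signal and that in the multivariate case the signal is flagged abnormal when any virtual signal is abnormal; both function names are ours):

```python
def health_value(pi_mu, pi_var, p_fail, p_thresh=0.05):
    """omega = 1 (healthy) only if both Bayes factors favor H0 and the
    quantified failure probability stays below the threshold."""
    return 1 if (pi_mu > 0 and pi_var > 0 and p_fail < p_thresh) else 0

def overall_health(health_values):
    """Multivariate case: healthy only if every virtual signal is healthy."""
    return int(all(h == 1 for h in health_values))
```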

Experimental Results and Discussion
Machinery operation data from two cases are selected to verify the accuracy and generalization of the model. The two datasets have different sampling frequencies, and both are multidimensional.

Water Pumping Station
The public pump_sensor_data dataset [37] on the Kaggle website is taken as an example to illustrate the realization of the decision strategy. The dataset contains the operation data of a large water pumping station supplying water to a town. The pump is equipped with nearly sixty sensors that monitor different parts and variables. The sampling interval of the data is 1 h, and the selected period runs from 26 May to 31 July, which includes both healthy and fault data. The running state of the entire dataset is shown in Figure 5: the status value is 1 when the pump station is faulty and 0 when operation is normal. Each signal is identified only by a simple code, so this example also demonstrates the universality of the model. The entire processing procedure for the dataset is shown in Figure 1. These data must be analyzed in a multivariate setting. After missing values are filled and outliers are processed in the exported data, the dimensionality is reduced to condense the data information; the contribution of each principal component (PC1, PC2, PC3) is shown in Figure 6. The first two principal components are selected as two independent virtual signals according to the criterion that the cumulative contribution exceeds 80%, and these virtual signals are then individually subjected to wavelet packet noise reduction and normalization to obtain the preprocessed dataset shown in Figure 7. In Figure 7, the data are divided into training, validation, and prediction sets. The training data are used to train the LSTM model, while the validation data are used to validate the prediction model and serve as a priori information for parameter estimation.
From the figure, the two independent virtual signals show significant fluctuations in both time periods of the prediction dataset and are consistent with the real operating state of the pumping station, indicating that the probabilistic principal component analysis in this model retains most of the information of the original data. Moreover, the failure times in the prediction set are marked in detail and will be used to verify the prediction accuracy of the model. Using the first principal component as an example, the training and validation sets are used as the training data of the LSTM neural network. Phase space reconstruction is performed by trying different combinations of m and τ, and the prediction model is found to have the highest accuracy when m = 4 and τ = 2, with an MSE of 0.0054. Figure 8a compares the actual values of the first principal component with the predicted values, and Figure 8b shows the prediction residuals. The residuals of the training and validation sets were extracted and tested for normality, and the results are shown in Figure 9. With the significance level α taken as 0.05, the mean was −0.0248, the variance was 0.0548, and the p-value was 0.092 (greater than 0.05), so the residuals were found to obey a normal distribution and the trained model was valid. The prior information obtained from repeated training runs is listed in Table 1. The final estimated mean µ̂ = 0.006265 and variance σ̂² = 0.002181 were used as the criteria for the subsequent evaluation. The variation in the mean was within ±0.004 of the estimate, and the variation in the variance was within ±0.0005. The hyperparameters were then solved according to Equations (14) and (20), as shown in Table 2. The signal is tested in real time based on the obtained variance and mean estimates. The current signal value and the previous 11 h of data are taken as the sliding window.
The obtained testing result is shown in Figure 10, where (a) and (b) show the Bayes factor and the failure probability for the mean test, respectively, and (c) and (d) show those for the variance test. As seen from the Bayes factor diagrams, the mean plot first turned negative at 4:00 on 28 June, and the variance plot turned negative at 3:00 on the same day. The corresponding failure probabilities of the two parameters also began to rise rapidly to 100% at this moment, indicating that the equipment had failed. In addition, the Bayes factor of the mean plot dropped rapidly at 7:00 on 25 July, the factor of the variance plot dropped rapidly at the same moment, and the corresponding failure probability climbed to 100% just as rapidly. In both cases, the change in the signal before the failure was successfully identified, and the earliest alarm was issued 18 h in advance. The overall failure probability and health value of the first principal component are then obtained using Equations (20) and (22), as shown in Figure 11a,b. Similarly, the overall failure probability and health values of the second principal component are shown in Figure 11c,d. As seen from the figures, the trends of the first and second principal components are basically the same, so a comprehensive judgment using Equation (23) is unnecessary in this example. However, comparing the health value plots, the first principal component predicts the fault signal at an earlier time, and its failure probability remains at 100% throughout an actual fault of the pump station, while the second principal component fluctuates more, indicating that the first principal component contains most of the information in the signal and can predict faults more accurately. Additionally, comparison with the actual health values in Figure 5 shows that the proposed model accurately predicts the subsequent faults and is able to issue alerts in advance.

Nuclear Power Turbine
To verify the developed decision method in a practical application, a motor with faults in a turbine unit of a nuclear power plant is used as the research object, and the operation data from 00:00 on 25 November to 00:00 on 30 November are extracted from the monitoring system. The data consist of vibration signals from both sides of the driving end and the nondriving end of the motor and include both healthy and fault data; they are collected at one-minute intervals. The four vibration signals are shown in Figure 12. According to the accident report from the nuclear power plant, the monitoring system found that the vibration of the motor's driving end was out of limit: the vibration value quickly rose to 2.48 mm/s, exceeding the alarm value of 2.3 mm/s, and triggered the alarm at 11:26 on 28 November. The decision model is applied to the abnormality prediction of the operation data. The data are preprocessed and reduced in dimensionality, and the contribution rate of the first principal component after dimensionality reduction is as high as 98%, indicating a strong correlation among the four vibration signals. Therefore, only the first principal component (PCA1) is taken for further analysis. Its curve is consistent with the trend of the original data, as shown in Figure 13. Subsequently, following the decision flow, PCA1 is imported into the LSTM model for training and prediction to obtain the residuals. After Bayesian hypothesis testing, the failure probability shown in Figure 14 is obtained. The failure probability is essentially 0 in the first half of the period, i.e., during normal operation, and rises slightly to 0.0543 at 16:30 on the 27th; this small rise may be caused by the sensitivity of the algorithm. Then, at 19:51 on the 27th, the failure probability quickly climbed to 100%.
Compared with PCA1 in Figure 13, although none of the signals in Figure 13 exceeded the alarm value before the monitoring system alarm, there was a significant change at 19:59 on 27 November, and the fluctuation range of the subsequent data deviated from the normal region. This anomaly is consistent with the failure probability graph over this period, and the decision model predicted it accurately and in advance. According to this decision model, an alarm is given before the occurrence of the abnormality, and the actual results also prove that machinery failure occurs within 12 h after the abnormality is identified. However, the actual threshold alarm system of the nuclear power plant did not identify this anomaly, and the fault was not found until the vibration value reached the alarm value, resulting in a delayed alarm. At the same time, the failure probability reached 100% again at 9:59 on 28 November, corresponding to the monitoring system alarm at 11:26, which indicates that the decision model still predicted the fault 1.5 h in advance, even without considering the earlier abnormality.

Conclusions
A multivariate abnormality identification and early warning model based on Bayesian hypothesis testing is proposed to address the shortcomings of traditional threshold methods, which suffer from missed diagnoses and late alarms. The model predicts time-series data with an LSTM neural network and makes full use of historical data to dynamically determine future abnormal status in real time, while also providing a processing method for multivariate cases. The proposed method has the following features: (1) based on data-driven methods, it makes full use of a priori information to improve identification accuracy; (2) it can be effectively applied to abnormality identification in multivariate situations; and (3) it uses Bayesian hypothesis testing as the fault judgment method to quantify the failure probability, making the judgment results more rigorous and reliable and enabling earlier detection of potential faults. The proposed model is validated using monitoring data from a pumping station and an actual failure case in a nuclear power plant. The experimental study shows that the model successfully predicts the occurrence of faults in the multivariate case over a long period, which validates its accuracy and reliability.

Conflicts of Interest:
The authors declare no conflict of interest.