The Prediction Method on the Early Failure of Hydropower Units Based on Gaussian Process Regression Driven by Monitoring Data

Hydropower units have a complex structure, complicated and changing working conditions, and complex, diverse faults. Effectively evaluating the healthy operation status and accurately predicting failures of hydropower units using real-time monitoring data is still a difficult problem. To this end, this paper proposes a prediction method for the early failure of hydropower units based on Gaussian process regression (GPR). Firstly, by studying the correlation between different monitoring data, nine state parameters closely related to the operation of hydropower units are mined from the massive data. Secondly, a health evaluation model is established based on GPR using the historical multi-dimensional monitoring information and fault-free monitoring data at the initial stage of unit operation. Finally, a condition monitoring directive (CMD) based on the Mahalanobis distance (MD) is designed. The effectiveness of the proposed method is verified by several typical examples of monitoring data from a hydropower station in Guangxi, China. The results show that, in three cases, the abnormal conditions of the unit are found 2 days, 4 days, and 43 days earlier than by regular maintenance, respectively. Therefore, the method can effectively track the change process of the operation state of hydropower units, detect abnormal operation states in advance, and has engineering application value.


Introduction
Hydropower is a key renewable and clean energy source in the pursuit of economic decarbonization. The authors of [1] reviewed many innovative technologies in the field of hydropower, such as digital implementation technology that improves efficiency by 12%, and discussed the cost of standardization, best practices, and innovative technologies. The hydropower station is a key part of hydropower generation, and hydropower units are its core equipment. Due to their complex internal structure and the influence of mechanical, electrical, and hydraulic factors during operation, hydropower units have a high failure rate and long maintenance times. The performance degradation of hydropower units is gradual [2]. At present, the state monitoring of hydropower units is mainly based on vibration signals, from which fault characteristics that characterize the operation state are extracted. Methods such as EMD [3,4], VMD [5,6], and wavelet analysis [7,8] are commonly used to decompose the vibration signal, remove the high-frequency noise in the original signal, and extract the typical characteristic frequencies corresponding to different faults of the units. A reliability centered maintenance evaluate the operation state of the unit in real-time.
In this paper, the basic process of the framework for hydropower unit health assessment based on GPR is as follows. Firstly, in the off-line phase, to address the problem of monitoring parameter selection, the correlation and redundancy between monitoring parameters are analyzed using the historical monitoring data; the parameter variables with strong correlation are selected, and the redundant variables are then removed. Then, based on the selected monitoring parameters, the health assessment model based on GPR is constructed using the data without fault records from the initial operation of the hydropower units, and this model is used as the benchmark model for online evaluation. Finally, in the on-line stage, an index for evaluating the operation status of hydropower units is designed based on the MD and combined with the online monitoring data, and a reasonable alarm threshold is determined to realize the on-line condition monitoring of hydropower units, as shown in Figure 1.
[Figure 1: flowchart of the health assessment framework. Stages: evaluation model construction and online health assessment. Block labels: monitoring system, real-time monitoring data, correlation analysis, healthy data set, representative data set, health assessment model, predicted value, CMD, status analysis.]

This paper is organized as follows. Section 2 introduces the correlation analysis methods for mining the appropriate input variables by using the correlation among health parameters. Section 3 describes the process of GPR modeling and the method of designing state evaluation indicators. Section 4 presents results from case studies, and Section 5 summarizes the conclusions.

Input Variables Selection
Extracting the state parameters closely related to the operation of hydropower units from their monitoring system is the prerequisite for health assessment. Correlation analysis methods such as the PCC, the GCD, and the MIC can solve this problem well. These three correlation analysis methods are described below.
The PCC is a commonly used correlation analysis method that characterizes the strength of the linear relationship between two variables [30]. In this paper, from the perspective of linear correlation, the PCC is used to identify the monitoring parameters that have a strong linear correlation with the operation status of hydropower units, and to remove redundant monitoring parameters with similar functions.
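As an illustration of the linear screening step, the following minimal sketch computes the PCC between two series with NumPy. The function name `pearson_cc` and the sample values are hypothetical and are not taken from the monitoring data of the station.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

# Hypothetical values: active power and a parameter that follows it linearly.
power = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
flow = 3.0 * power + 2.0
r_linear = pearson_cc(power, flow)
```

A value of the PCC close to 1 or -1 between two candidate parameters suggests that one of them may be redundant for modeling purposes.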

The GCD
Under actual conditions, the operation of hydropower units is complex, and the monitoring data are noisy and strongly random, so that many nonlinear relationships exist among the monitoring parameters in addition to linear ones. In this paper, the GCD is used to identify the nonlinear characteristics among these monitoring parameters. The GCD measures the relationship between each factor and the behavior output or index according to the similarity of the factor curves, and it is suitable for extracting relationships from small samples with poor information [31].
Assume that the time series of active power is $y = \{y_1, y_2, \cdots, y_n\}$ and that the time series of the $i$th monitoring parameter is $x_i = \{x_{i1}, x_{i2}, \cdots, x_{in}\}$. The new sequences are obtained using dimensionless mean processing as in Equations (1) and (2). The GCD of $y$ and $x_i$ is given by Equation (3), and the specific calculation process is described in Reference [31].
In Equation (3), $k = 1, 2, \cdots, n$, and the grey relational grade $R_{x_i, y}$ represents the level of correlation between the time series of active power and the time series of the monitoring parameter.
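The calculation can be sketched as follows. This is a minimal sketch that assumes the common form of the grey relational coefficient with a distinguishing coefficient `rho = 0.5`, and it evaluates the minimum and maximum differences over a single parameter pair only; the exact form of Equations (1) to (3) should be taken from Reference [31]. The function name is hypothetical.

```python
import numpy as np

def grey_correlation_degree(y, x, rho=0.5):
    """Grey relational grade between a reference series y and one series x.

    Assumptions: mean normalisation (both means non-zero) and a
    distinguishing coefficient rho = 0.5; extrema are taken over this
    single pair rather than over all candidate parameters.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    y_norm = y / y.mean()              # dimensionless mean processing
    x_norm = x / x.mean()
    delta = np.abs(y_norm - x_norm)    # pointwise absolute difference
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0.0:                   # identical normalised curves
        return 1.0
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return float(coeff.mean())         # average over k = 1..n

# Hypothetical series: one proportional to the reference, one reversed.
ref = [1.0, 2.0, 3.0, 4.0]
g_same = grey_correlation_degree(ref, [2.0, 4.0, 6.0, 8.0])
g_reversed = grey_correlation_degree(ref, [4.0, 3.0, 2.0, 1.0])
```

When several monitoring parameters are ranked together, the extrema would normally be taken over all parameters so that the grades are comparable.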

The MIC
Although the GCD can find non-linear relationships between the monitoring parameters, many non-linear correlations remain that cannot be expressed by specific formulas. Therefore, this paper uses the MIC to further explore the correlation between the parameters. The MIC is a correlation measure that evaluates functional and statistical relationships between variables without making any assumptions about the data distribution. Given enough samples, it can effectively capture a wide range of relationships without being limited to specific relationship types [32]. The calculation process is as follows.
(1) The maximum mutual information value of two random variables X and Y for any given partition is calculated with Equation (4).
(2) The above value is normalized so that it lies in the interval [0, 1], as in Equation (5).
(3) For a data set D of ordered pairs with known sample size n, the MIC of the two variables x and y in D is defined by Equation (6).
In Equation (6), $xy < B(n)$ with $B(n) = n^{a}$, where n is the data size, $B(n)$ limits the mesh size used to divide the region, and the constant a can be set according to experience or data scale. The MIC can detect a correlation whether two variables are functionally related, hyperfunction related, or not related by any functional relationship. However, some statisticians believe that the MIC can also create the illusion of correlation between two uncorrelated variables [33,34]. Therefore, to avoid introducing irrelevant variables due to this defect of the MIC, this paper combines it with GCD theory to mine the nonlinear relationships between variables.
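For reference, the three steps above correspond to the following definition as it is commonly stated in the MIC literature. This is an assumption about the form of Equations (4) to (6), not a reproduction of them; $G$ denotes a grid with $x$ columns and $y$ rows laid over the data set $D$, and $p(\cdot)$ denotes the empirical cell frequencies.

```latex
\begin{align}
  % mutual information of D on a fixed grid G
  I(D|_G) &= \sum_{i=1}^{x}\sum_{j=1}^{y} p(i,j)\,\log_2\frac{p(i,j)}{p(i)\,p(j)} \\
  % step (1): maximum over all grids with x columns and y rows
  I^{*}(D, x, y) &= \max_{G} I(D|_G) \\
  % step (2): normalisation to the interval [0, 1]
  M(D)_{x,y} &= \frac{I^{*}(D, x, y)}{\log_2 \min\{x, y\}} \\
  % step (3): MIC over all admissible grid resolutions
  \mathrm{MIC}(D) &= \max_{xy < B(n)} M(D)_{x,y}, \qquad B(n) = n^{a}
\end{align}
```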

GPR Theory
The Gaussian process (GP) is a machine learning method based on statistical learning theory and Bayesian theory, which adapts well to complex regression problems involving high dimensionality, small sample sizes, and nonlinearity [35]. The statistical properties of a GP are determined by its mean function $m(x)$ and covariance function $k(x, x')$ [36]. The Gaussian prior is composed of the mean function and the covariance function, and the process can be described as follows.
where $x$ is the input, $y$ is the output, $f(x_n)$ is a hidden function, and $\xi_n$ is noise. Given the data $D = [(x^*, y^*)]$, the joint distribution of the training set $y$ and $y^*$ can be obtained according to the Bayesian principle:

$$y^* \mid X, y, X^* \sim N[\bar{y}^*, \mathrm{cov}(y^*)] \qquad (8)$$

where $\bar{y}^*$ and $\mathrm{cov}(y^*)$ are the fitted value and its variance, respectively. The hyperparameter set of the GPR model is one of the key factors affecting the model prediction. In this paper, the squared exponential covariance function is selected as the covariance function. In Equation (11), $\sigma_f^2$ is the signal variance, $M = \mathrm{diag}(l^{-2})$ is a symmetric matrix of hyperparameters, $l$ is the characteristic length scale, and $\Theta = (l, \sigma_f^2, \sigma_n^2)$, with $\sigma_n^2$ the noise variance, is usually called the hyperparameter set.
After the hyperparameters of the GPR model are obtained, the model is established.
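To make the prediction step concrete, the sketch below implements zero-mean GPR with a squared exponential covariance in NumPy. It assumes the textbook predictive equations and fixes the hyperparameters by hand, whereas in practice they are usually fitted, for example by maximizing the marginal likelihood. The function names and the toy data are hypothetical.

```python
import numpy as np

def se_kernel(a, b, length, sigma_f):
    """Squared exponential covariance; a and b have shape (samples, features)."""
    a = np.asarray(a, dtype=float) / length   # scale each feature by its length
    b = np.asarray(b, dtype=float) / length
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def gpr_predict(x_train, y_train, x_test, length, sigma_f, sigma_n):
    """Predictive mean and variance of the noise-free function at x_test."""
    k = se_kernel(x_train, x_train, length, sigma_f) + sigma_n**2 * np.eye(len(x_train))
    k_s = se_kernel(x_train, x_test, length, sigma_f)
    k_ss = se_kernel(x_test, x_test, length, sigma_f)
    chol = np.linalg.cholesky(k)              # stable alternative to inverting k
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_s.T @ alpha
    v = np.linalg.solve(chol, k_s)
    cov = k_ss - v.T @ v
    return mean, np.diag(cov)

# Toy one-dimensional example with hand-picked hyperparameters (assumed values).
x_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(x_train).ravel()
x_test = np.array([[2.5]])
mean, var = gpr_predict(x_train, y_train, x_test,
                        length=np.array([1.0]), sigma_f=1.0, sigma_n=0.05)
```

The predictive variance is what distinguishes GPR from the neural network baselines: each prediction comes with an uncertainty estimate.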

Online Evaluation Using CMD
The health assessment model of hydropower units reflects the complex relationship between the health status of the units and the many influencing factors in normal operation. When the hydropower units deteriorate, this original relationship inevitably changes, and the operation state deviates from the original healthy state. Therefore, to identify the deterioration or failure state of hydropower units, it is necessary to find a method to quantify the degree of deviation between the two states.
The MD [37] was proposed by Mahalanobis, an Indian statistician. It is a method to calculate the similarity of two sample sets. In other fields, it is often used as an anomaly detection index and has achieved good results [38,39]. This method is well suited to measuring the distance between the characteristic vector and the healthy operation state space. The MD is defined in Equation (12).
Assuming that the characteristic vector of the current operating state of the hydropower units is $X_i$, and considering the influence of the historical state on the hydropower units at the current time, the CMD is defined in Equation (13).
In Equation (13), $n$ is the sliding window length and $d(t)$ is the MD between the current time vector $X_i$ and the health assessment model.
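A minimal sketch of the two indices is given below. The MD follows its standard definition; the CMD is assumed here to be a trailing moving average of the MD over the sliding window, which is only one plausible reading of Equation (13). All names and sample values are hypothetical.

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of vector x from a distribution (mean, covariance)."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

def cmd_series(distances, window):
    """Assumed CMD: trailing moving average of the MD over `window` samples."""
    d = np.asarray(distances, dtype=float)
    out = np.empty_like(d)
    for t in range(len(d)):
        start = max(0, t - window + 1)   # shorter window at the start of the series
        out[t] = d[start:t + 1].mean()
    return out

# Hypothetical healthy-state samples (rows are observations).
healthy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
mu = healthy.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(healthy, rowvar=False))
d_center = mahalanobis(mu, mu, cov_inv)
smoothed = cmd_series([0.2, 0.4, 0.9, 1.2], window=2)
```

Averaging over a window reduces the chance that a single noisy sample triggers an alarm, at the cost of a short detection delay.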

Description of Collection Data Type
To verify the effectiveness of the above methods, this paper takes as the research object a bulb tubular hydropower unit in Guangxi Province, China, with a rated power of 30.77 MW, which was put into use in May 2015. The general structure and some sensor installation positions of the unit are shown in Figure 2. The NARI Open-2000 monitoring system installed in this power station is designed and manufactured by NARI Group Co., Ltd. (Nanjing, China). The Open-2000 system provides centralized monitoring and control, recording, and management of the whole power station, as well as remote monitoring. Its data acquisition system collects real-time data from various field control units and external data links, refreshes the database, analyzes, processes, and calculates the collected data to form the data needed for monitoring and management, and stores these data in the database. The monitoring parameters collected by the acquisition system include electrical parameters such as current, voltage, and power; mechanical parameters such as vibration, travel, displacement, and guide vane opening; hydraulic parameters such as water level, flow, and pressure; and thermal parameters such as tile temperature, oil temperature, and winding temperature. In this paper, 107 parameters are selected from the above monitoring parameters for analysis. Part of the original data of some monitoring parameters is shown in Table 1. It can be seen from Table 1 that the monitoring data are discrete and cannot be analyzed by traditional diagnosis methods based on continuous signals. The main types of sensors used are shown in Table 2. Figure 3 shows the installation position of the RTZ bearing bush temperature sensors that collect the temperature of the water guide bearing bush, and the basic dimension parameters of the turbine model are shown in Table 3.

Table 1. Part of the original data of some of the 107 measuring parameters on 9 October 2015.

Feature Vector Selection
This paper uses the monitoring data without fault records, sampled at an interval of 10 min during the first four months of operation, to complete feature selection. The active power of the hydropower units is used to evaluate their operation state. Therefore, the parameters closely related to the active power are mainly selected. According to the above correlation analysis methods, the correlation between each of the simulated monitoring parameters (107 in total) of the monitoring system and the output active power of the hydropower units is calculated. Some results are shown in Table 4; all vibrations mentioned in this paper are measured vibrations.
Appl. Sci. 2021, 11, x FOR PEER REVIEW

From Table 4, the correlation between each monitoring parameter and the active power differs considerably. To reduce the influence of weakly correlated or unrelated state parameters on the next step of modeling, Equations (14) and (15) are used to select the monitoring parameters closely related to the operation state of the unit, and these monitoring parameters are taken as the initially selected input eigenvectors.

Table 4. Correlation analysis results of the partial condition monitoring parameters using maximum information coefficient (MIC), Pearson correlation (PCC), and grey correlation degree (GCD). Columns: MIC, PCC, GCD.
In Equations (14) and (15), N is the number of monitoring parameters (107 measuring points) collected from the hydropower station monitoring system, M(i) is the MIC value of the ith monitoring parameter, P(i) is its PCC value, and G(i) is its GCD value. When f(i) > g, the ith parameter is considered closely related to the unit operation state and is specified as a preliminary input feature.
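The selection rule f(i) > g can be sketched as follows. Because the exact form of Equations (14) and (15) is not reproduced in this text, the sketch assumes, purely for illustration, that f(i) is the average of M(i), |P(i)|, and G(i), and that the threshold g is the mean of f over all parameters. The function name and the scores are hypothetical.

```python
import numpy as np

def select_features(mic, pcc, gcd, threshold=None):
    """Return indices of parameters whose combined score exceeds the threshold.

    Assumption: the combined score is the mean of MIC, |PCC|, and GCD, and
    the default threshold is the mean combined score over all parameters.
    """
    scores = (np.asarray(mic, dtype=float)
              + np.abs(np.asarray(pcc, dtype=float))
              + np.asarray(gcd, dtype=float)) / 3.0
    g = scores.mean() if threshold is None else threshold
    return np.flatnonzero(scores > g), scores, g

# Hypothetical scores for three candidate parameters.
mic = [0.9, 0.2, 0.7]
pcc = [0.8, -0.1, -0.75]
gcd = [0.85, 0.5, 0.8]
selected, scores, g = select_features(mic, pcc, gcd)
```

The absolute value of the PCC is used so that strong negative correlations are not discarded.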
According to the above methods, input eigenvectors with 41 parameters are preliminarily obtained, i.e., the vibration of the oil receiver in the Y direction, the vibration of the stator frame in the Y direction, the vibration of the combined bearing in the Y direction, the vibration of the water guide bearing in the Y direction, the temperatures of the thrust bearing bushes from No.1 to No.12, the stator coil temperatures from No.1 to No.12, the air cooler hot air temperatures from No.1 to No.6, the stator core temperatures from No.1 to No.12, and the unit flow. To reduce the influence of redundant variables, the above parameters are further analyzed. The time-domain variation curves of some monitoring parameters are shown in Figure 4. From Figure 4, these monitoring parameters are highly correlated, indicating that they play the same role in reflecting the operation state of the unit. Therefore, one parameter from each such group is selected as the input feature.
Through linear correlation analysis, nine parameters are selected as the final input eigenvectors: the vibration of the oil receiver, the stator vibration in the Y direction, the vibration of the combined bearing in the Y direction, the vibration of the water guide bearing in the Y direction, the temperature of the No.1 forward thrust bearing bush, the temperature of the No.1 stator coil, the No.1 air cooler hot air temperature, the temperature of the No.1 stator core, and the unit flow rate.

The Establishment of the Health Assessment Model for Hydropower Units
The health model of the hydropower unit is established by using ten monitoring parameters from the initial three months of operation. Each monitoring parameter has 1200 groups of data, and the data sampling interval is 10 min. The first 1000 sets of data are used for training the GPR model, and the remaining 200 sets are used for model verification. The validation effect of the GPR model is shown in Figure 5a. The curve predicted by the GPR model is in good agreement with the actual monitoring data. To highlight the superiority of the GPR model, the back propagation neural network (BPNN) and the radial basis function neural network (RBFNN) models are used for comparison. BPNN can realize the nonlinear mapping between input samples and output samples and has the characteristics of self-organization and self-learning, which makes it very effective in nonlinear approximation. In addition, the potential relationship between these data can be summarized through the automatic learning process. So far, there is no definite method to determine the number of nodes in the hidden layer of a BPNN. Hornik points out that the number of hidden nodes should be between $\sqrt{2m+1}$ and $2m+n$, Hecht-Nielsen suggests $2m+1$ [40], and some scholars suggest $\sqrt{m+n}+a$, where a is a constant between 1 and 10 [41]. In these formulas, m and n are the numbers of nodes in the input layer and the output layer, respectively. Therefore, according to these empirical formulas, the number of nodes in the hidden layer should be between 4 and 19. Considering the complexity of the monitoring parameters, logsig and purelin are used as the transfer functions of the hidden layer and the output layer, respectively. The Tainglm adaptive learning algorithm is used to complete network training and prediction, and it is programmed and run in the MATLAB 2019a environment.
The main parameters of the network are as follows: the target error is $6.5 \times 10^{-4}$, the learning rate is 0.005, the maximum number of iterations is 5000, and the additional momentum factor is 0.90. The test errors of typical networks with different numbers of hidden layer nodes are shown in Table 5. Accordingly, the best BPNN prediction model is 9 × 19 + 1, and the results of the BPNN model are shown in Figure 5b. The results of the RBFNN model are shown in Figure 5c. The comparison of the results of the GPR model with those of the BPNN and RBFNN models is shown in Table 6.
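The range of 4 to 19 hidden nodes can be checked with a short calculation. The sketch assumes m = 9 input nodes and n = 1 output node, which matches the nine selected input parameters and a single predicted output, and it reads the empirical formulas as sqrt(2m + 1), 2m + n, 2m + 1, and sqrt(m + n) + a. The function name is hypothetical.

```python
import math

def hidden_node_candidates(m, n):
    """Evaluate the empirical hidden-node formulas for m inputs and n outputs."""
    lower = math.sqrt(2 * m + 1)                       # lower bound of the first rule
    upper = 2 * m + n                                  # upper bound of the first rule
    hecht_nielsen = 2 * m + 1                          # single-value rule
    empirical = [math.sqrt(m + n) + a for a in range(1, 11)]  # a = 1..10
    return lower, upper, hecht_nielsen, empirical

lower, upper, hn, empirical = hidden_node_candidates(m=9, n=1)
# Smallest and largest node counts suggested by any of the rules.
search_range = (math.floor(min(lower, empirical[0])), max(upper, hn))
```

Under these assumptions the rules span roughly 4.2 to 19 nodes, which is consistent with the search range stated in the text.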
In this paper, Equation (16) is used to quantitatively test the above prediction results.
In Equation (16), $\sigma_{MAE}$ is the absolute value of the average relative error, $N_1$ is the number of test samples, $\hat{y}(i)$ is the predicted value, and $y(i)$ is the real value of the monitoring data. According to Table 6, the $\sigma_{MAE}$ values of the GPR model, the BPNN model, and the RBFNN model are 2.1158%, 6.6455%, and 5.1104%, respectively. The curve predicted by the BPNN model clearly diverges from the actual monitoring data. For the strongly random and noisy monitoring data of hydropower units, GPR modeling, unlike BPNN and RBFNN, uses the covariance function to describe the random distribution of the monitoring data, so the noise in the data can be effectively identified and separated. Therefore, the GPR model is very suitable as the health assessment model of hydropower units. In the field of condition monitoring, there are many research results on determining the alarm threshold of abnormal equipment operation by analyzing the collected equipment status monitoring data [38,39,42].
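The error index can be sketched as follows, assuming that Equation (16) is the mean of the absolute relative errors expressed as a percentage. The function name and the sample values are hypothetical and unrelated to Table 6.

```python
import numpy as np

def mean_abs_relative_error(y_pred, y_true):
    """Mean absolute relative error in percent (assumed form of the index)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)   # must not contain zeros
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)) * 100.0)

# Hypothetical predictions against hypothetical measurements.
sigma_mae = mean_abs_relative_error([9.0, 22.0, 30.0], [10.0, 20.0, 30.0])
```

Because the error is relative, samples with very small true values can dominate the index, which is worth checking before comparing models.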
To determine the alarm threshold value of the state index, this paper selects another 2400 groups of data from the fault-free operation time to determine the distribution law of the CMD. The results are shown in Figure 6. It can be seen from Figure 6 that the deviation between the normal operation of the hydropower units and the health model is stable at 0.6 or below. Considering that the hydropower station is easily affected by runoff and undertakes the tasks of peak load regulation and frequency regulation, the alarm threshold is set to 0.8. This threshold is considered a universal value in the operation of the unit. When the value of the CMD exceeds 0.8, the current operating state of the hydropower units deviates from the health model, that is, from the normal state, and there may be potential risks or faults. Several typical cases from the historical monitoring data of this hydropower unit are used to verify the effectiveness of the proposed method.
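The alarm rule described above can be sketched as a scan for the first sample whose CMD exceeds the threshold. The threshold of 0.8 and the 10 min sampling interval come from the text; the function name, the CMD values, and the start time are hypothetical.

```python
from datetime import datetime, timedelta

def first_alarm_time(cmd_values, start, threshold=0.8, step_minutes=10):
    """Return the timestamp of the first CMD value above the threshold, or None."""
    for k, value in enumerate(cmd_values):
        if value > threshold:
            return start + timedelta(minutes=step_minutes * k)
    return None

# Hypothetical CMD sequence sampled every 10 min from an arbitrary start time.
cmd = [0.45, 0.52, 0.61, 0.83, 0.79, 0.91]
alarm = first_alarm_time(cmd, datetime(2020, 1, 1, 0, 0))
no_alarm = first_alarm_time([0.1, 0.2], datetime(2020, 1, 1, 0, 0))
```

In this sketch a single exceedance raises the alarm; a practical system might additionally require the exceedance to persist for several samples.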

Case Analysis
Through several planned maintenances, the staff of the hydropower station found that the flange plate of the blade of the motor unit leaked oil, that the floating bearing bush of the oil receiver was seriously worn with a large oil leakage, and that the connecting parts of the guide vane mechanism of the unit were broken. These faults occurred on 29 December 2015, 5 February 2018, and 23 January 2019, respectively. During these periods, the sensors did not fail and did not cause abnormal monitoring data. Therefore, the change rule of the operation state of the hydropower units is studied mainly before and after these faults. To highlight the engineering application of the proposed method, the time-domain amplitude changes of relevant monitoring parameters before and after the faults are studied, including the vibration of the oil receiver in the Y direction, the vibration of the combined bearing in the Y direction, the vibration of the water guide bearing in the Y direction, the temperature of the forward thrust bearing bush, and the temperature of the stator core.
In case 1, when the unit fails, the amplitude of each monitoring parameter of the hydropower unit is far lower than the alarm threshold set by the monitoring system. For example, it can be seen from Figure 8 that the vibration amplitude of the oil receiver in the Y direction is in the range of 20 to 40 μm, and the temperature values of the monitoring parameters are lower than 100 °C. It can be seen from Tables 7 and 8 that the first-level alarm of the monitoring system is triggered only when the vibration amplitude of the oil receiver in the Y direction is greater than 250 μm or the temperature value of a monitoring parameter is greater than 100 °C. The proposed CMD curve exceeded the alarm threshold of 0.8 ahead of the scheduled maintenance, so the fault status of the unit was monitored in advance.
To reduce the influence of accidental factors and verify the universality of the proposed method, case 2 and case 3 are further analyzed. In both cases, when the unit fails, the amplitude of each monitoring parameter of the hydropower unit is again far lower than the alarm threshold of the monitoring system. Although the vibration amplitude of the oil receiver in the Y direction fluctuates over time in the range of 20 to 50 μm, which is a large change, it is still far lower than 250 μm. In case 2, the CMD curve exceeded the alarm threshold of 0.8 at 5:50 on 31 January 2018, and the fault status of the unit was monitored in advance. In case 3, the CMD curve exceeded the alarm threshold of 0.8 at 20:20 on 11 December 2018, and the fault status of the unit was also monitored in advance. These cases show that the fault alarm and early warning of the hydropower station monitoring system rely on relatively few means, and that its alarm settings are mainly fixed values. However, during equipment operation, the monitored state changes continuously with environmental and operating conditions, so the equipment may already have failed although the amplitude of the collected monitoring data does not exceed the set value. Therefore, the role of the hydropower station monitoring system in engineering application is limited.
Although the deterioration or failure information of hydropower units must be reflected in the massive data collected from each monitoring point, the above failures were found in the scheduled maintenance. This also shows that it is difficult for hydropower station staff to master the operation status of units in real-time through the information reflected by these massive data. In view of the problems existing in these engineering applications, this paper studies how to extract the operation status information from the massive monitoring data. According to the method proposed in this paper, the health model of hydropower units in normal operation is established, this model reflects the complex relationship between the operation status of hydropower units and multiple monitoring parameters, and monitoring index based on MD is set to measure the change degree of these complex links. From the above cases, the CMD curve is abnormal ahead of the maintenance time of the staff. In order to explore the change trend of the CMD curve after maintenance, with the first maintenance of the unit on 18 February 2016, we selected the data of the unit from 5:30, 16 February 2016 to 13:40, 19 February 2016, using the method proposed in this paper for analysis, and the results are shown in the Figure 13. It can be concluded from the Figure 13 that after the maintenance of the unit, the CMD curve drops to below 0.8. 
In case 1, when the unit failed, the amplitude of every monitoring parameter of the hydropower unit was far below the alarm threshold set by the monitoring system. For example, Figure 8 shows that the vibration amplitude of the oil receiver in the Y direction stayed in the range of 20 to 40 μm, and the monitored temperatures were below 100 °C. According to Tables 7 and 8, the first-level alarm of the monitoring system is triggered only when the vibration amplitude of the oil receiver in the Y direction exceeds 250 μm or a monitored temperature exceeds 100 °C. The proposed CMD curve, by contrast, exceeded its alarm threshold of 0.8 at 21:30 on 27 December 2015, so the fault state of the unit was detected in advance.
To reduce the influence of accidental factors and verify the generality of the proposed method, case 2 and case 3 are analyzed as well. In both cases, the amplitude of each monitoring parameter was again far below the alarm threshold of the monitoring system when the unit failed. Although the vibration amplitude of the oil receiver in the Y direction fluctuated considerably over time, within the range of 20 to 50 μm, it remained far below 250 μm. In case 2, the CMD curve exceeded the alarm threshold of 0.8 at 5:50 on 31 January 2018, so the fault state of the unit was detected in advance.
In case 3, the CMD curve likewise exceeded the alarm threshold of 0.8 at 20:20 on 11 December 2018, so the fault state of the unit was again detected in advance. These cases show that the monitoring system of the hydropower station relies on a single means of fault alarm and early warning, and that its alarm settings are mainly fixed values.
However, during equipment operation the monitored state changes continuously with environmental and operating conditions, so the amplitude collected at a monitoring point may stay below the set value even though the equipment has already failed. The role of the hydropower station monitoring system in engineering applications is therefore limited. Although the deterioration or failure of a hydropower unit must be reflected somewhere in the massive data collected from the monitoring points, the above failures were only found during scheduled maintenance. This shows that it is difficult for station staff to grasp the operation status of the units in real time from these massive data. To address these engineering problems, this paper studies how to extract operation status information from the massive monitoring data. With the proposed method, a health model of the hydropower units in normal operation is established. This model reflects the complex relationship between the operation status of the units and multiple monitoring parameters, and a monitoring index based on MD is set to measure how much this relationship changes. In the above cases, the CMD curve became abnormal before the staff carried out maintenance. To explore the trend of the CMD curve after maintenance, we took the first maintenance of the unit on 18 February 2016 and analyzed the unit data from 5:30 on 16 February 2016 to 13:40 on 19 February 2016 with the proposed method; the results are shown in Figure 13. Figure 13 shows that after the maintenance of the unit, the CMD curve drops below 0.8.
It can therefore be concluded that when the unit fails, its state inevitably changes the original structural relationship among the monitoring parameters. The GPR model captures these subtle changes, and the monitoring index CMD is more sensitive to abnormal operation of the unit than the existing monitoring and alarm system. The method can thus effectively find potential failures of the unit, send out early warning information, and give decision-makers ample time to plan maintenance in advance.
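The contrast drawn in this discussion, between fixed per-channel alarm limits and an index that reacts to a change in the relationship among parameters, can be illustrated with a small sketch. It is not the paper's implementation: it skips the GPR health model entirely, works on two synthetic channels instead of nine real ones, and reports the raw Mahalanobis distance rather than the normalized CMD index with its 0.8 threshold. The linear vibration-temperature relationship, the sample values and all function names are assumptions made for illustration; only the 250 μm and 100 °C limits come from the text.

```python
import math
import random

# Fixed first-level alarm limits quoted in the text (250 um vibration, 100 degC).
VIB_LIMIT_UM = 250.0
TEMP_LIMIT_C = 100.0


def fixed_threshold_alarm(vib_um, temp_c):
    """Conventional alarm: fires only when a single channel exceeds its limit."""
    return vib_um > VIB_LIMIT_UM or temp_c > TEMP_LIMIT_C


def fit_healthy_model(samples):
    """Estimate the mean and the inverse covariance of 2-D healthy data."""
    n = len(samples)
    mv = sum(v for v, _ in samples) / n
    mt = sum(t for _, t in samples) / n
    svv = sum((v - mv) ** 2 for v, _ in samples) / (n - 1)
    stt = sum((t - mt) ** 2 for _, t in samples) / (n - 1)
    svt = sum((v - mv) * (t - mt) for v, t in samples) / (n - 1)
    det = svv * stt - svt ** 2
    inv_cov = ((stt / det, -svt / det), (-svt / det, svv / det))
    return (mv, mt), inv_cov


def mahalanobis(point, mean, inv_cov):
    """Mahalanobis distance of one (vibration, temperature) point."""
    dv, dt = point[0] - mean[0], point[1] - mean[1]
    q = (dv * (inv_cov[0][0] * dv + inv_cov[0][1] * dt)
         + dt * (inv_cov[1][0] * dv + inv_cov[1][1] * dt))
    return math.sqrt(q)


# Synthetic "healthy" history (assumption): temperature tracks vibration linearly.
rng = random.Random(0)
healthy = []
for _ in range(500):
    vib = rng.gauss(30.0, 3.0)
    temp = 40.0 + 0.5 * vib + rng.gauss(0.0, 0.5)
    healthy.append((vib, temp))

mean, inv_cov = fit_healthy_model(healthy)

consistent = (33.0, 56.5)    # follows the healthy vibration-temperature relationship
inconsistent = (38.0, 52.0)  # both channels in range, but the relationship is broken

md_consistent = mahalanobis(consistent, mean, inv_cov)
md_inconsistent = mahalanobis(inconsistent, mean, inv_cov)

for name, point, md in [("consistent", consistent, md_consistent),
                        ("inconsistent", inconsistent, md_inconsistent)]:
    print(name, "fixed alarm:", fixed_threshold_alarm(*point), "MD:", round(md, 2))
```

In this toy setting neither point triggers the fixed-limit alarm, yet the second point lies far from the healthy distribution because it violates the learned correlation. This is the kind of deviation that a distance-based index is meant to expose.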

Conclusions
In this paper, a state evaluation method for hydropower units based on GPR theory is proposed. Firstly, the PCC, GCD, and MIC analysis methods are used to mine, from 107 monitoring parameters, the 41 that are closely related to the operation of the hydropower units. To reduce the influence of redundant parameters on the modeling, PCC is then used to analyze the correlation among these 41 parameters, and nine monitoring parameters are obtained and used as the input feature vector. Secondly, to verify the performance of the GPR model, it is compared with the classic BPNN, and the results show that the GPR model is superior to the BPNN model and the BPR model. Finally, a monitoring index based on MD is designed to measure the deviation between the abnormal state of the unit and the state of the health model. The effectiveness of the proposed method is verified on several fault examples and compared with the results of the existing monitoring system of the hydropower station. The results show that the proposed method can effectively track the change in the operation state of the hydropower units and identify potential faults of the unit in advance. The method mainly uses the information-rich data already stored in the monitoring system of the hydropower station, which are usually overlooked. Because no additional equipment is needed to acquire these data, the cost of research is reduced. Compared with existing research methods, the method is also simple to apply. It can detect faults of the unit before regular inspection and maintenance, with a minimum lead time of two days.
During this period, the crew members can make the best maintenance decisions, reduce the time of unit operation with faults, bring down the incidence of accidents, and lower the overall maintenance cost.
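The redundancy-removal step summarized above (using PCC to reduce the 41 correlated parameters to nine) can be sketched as a greedy filter. The paper does not state the exact selection rule or cut-off in this section, so the greedy rule, the 0.9 threshold, the channel names and the synthetic data below are all assumptions for illustration only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def drop_redundant(columns, threshold=0.9):
    """Keep a parameter only if its |PCC| with every kept parameter is <= threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic channels (assumption): two nearly identical vibrations, one unrelated temperature.
rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(300)]
columns = {
    "vib_bearing_x": [30.0 + 3.0 * b for b in base],
    "vib_bearing_y": [31.0 + 3.0 * b + rng.gauss(0.0, 0.3) for b in base],
    "temp_stator": [60.0 + rng.gauss(0.0, 2.0) for _ in base],
}

kept = drop_redundant(columns)
print(kept)
```

Here the second vibration channel carries almost the same information as the first and is dropped, while the unrelated temperature channel is retained. A real application would also have to decide which member of a correlated pair to keep, which this sketch settles simply by column order.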
The method in this paper has been applied to the monitoring data of a hydropower station in Nanning, Guangxi, China. If it is to be extended to other power stations or similar equipment, it should be adapted to the research object in order to improve its generalization ability. For example, the model parameters should be updated in a timely manner as the healthy state of the system evolves. In addition, most of the monitoring data studied in this paper are normal operation data of hydropower units; fault data are scarce and few fault types are labeled, so the fault type cannot be accurately located and identified. Only when the data are complete can fault location be realized with existing fault diagnosis methods. These issues are the focus of our follow-up work.