Extended Isolation Forests for Fault Detection in Small Hydroelectric Plants

Maintenance in small hydroelectric plants is fundamental for guaranteeing the expansion of clean energy sources and supplying the energy estimated to be necessary for the coming years. Most fault diagnosis models for hydroelectric generating units proposed so far are based on the distance between the normal operating profile and newly observed values. The extended isolation forest (EIF) is a model based on binary trees that has been gaining prominence in anomaly detection applications; however, no study so far has reported its application in the context of hydroelectric power generation. We compared this model with the PCA and KICA-PCA models, using one year of operating data from a small hydroelectric plant and time-series anomaly detection metrics. The algorithm showed satisfactory results with less variance than the others; therefore, it is a suitable candidate for online fault detection applications in the sector.


Introduction
With energy demand expected to double by 2060, the development of clean energy sources is essential for guaranteeing the energy supply in the coming decades. Renewable energy already represents three quarters of yearly new installed capacity [1], with hydropower the most widely deployed. In this group, the construction of small hydroelectric plants (SHPs) has grown worldwide due to the lower initial investment, low operating costs, and increasing regulation of energy markets. The potential total energy generation capacity of these SHPs is twice the currently installed capacity [2].
Several case studies in the recent literature address the energy potential and the importance of developing SHPs in emerging countries like Brazil [3], Turkey [4], Nigeria [5], and other sub-Saharan African countries [6]. Overall life cycle assessment is applied for quantitative economic evaluation of this type of undertaking in India [7] and Thailand [8]. Economic models of viability sensitivity analysis of SHP stations are presented and applied to the energy context in Spain [9] and Greece [10]. A common factor among all these models of economic viability is the cost of operation and maintenance, which is a determining variable for the development of new stations.
However, maintenance of a hydroelectric generating plant is a complex task. It requires a certain level of expertise to ensure a satisfactory level of reliability of the asset during its useful life. There are three types of maintenance. The first, and most rudimentary, is corrective maintenance, in which a component is expected to break and is then replaced. Preventive maintenance estimates the expected service life of a component, which is replaced when that operating lifetime is reached. The third, condition-based maintenance (CBM), monitors the actual condition of the asset and triggers intervention only when evidence of degradation appears.
• EIF presented reductions of 40.62% and 7.28% in the temporal distance, compared to the PCA and KICA-PCA.
The remainder of the present article is organized as follows. Section 2 defines the study methodology, describing the methods, algorithms, and data set applied. Section 3 presents the results and discussions of the simulations of the models, in addition to the outputs of the forest committee with illustrative examples of imminent failures. Finally, Section 4 presents the conclusions and recommendations for future work.

Dataset
The current study was developed at Ado Popinhak, an SHP situated in the southern region of Brazil. With an installed capacity of 22.6 MW, the plant supplies energy to 50,000 residences. Condition monitoring data from the plant's single main hydroelectric generating unit (HGU) are registered every 5 min, and the study period spans 13 August 2018 to 9 August 2019.
We filtered out from the dataset the periods of maintenance, planned stops, operator intervention, or any other status not associated with normal operation. Event data related to the asset were gathered from status reports and the maintenance management system to identify past failures. Fifty-nine faults were registered in the disclosed period, totaling 123 h of downtime. Six monitored variables are used: generator apparent power; bearing hydraulic lubrication unit (HLU) inflow; and bearing vibration from four different positions: axial, vertical radial, horizontal radial, and coupled. Figure 1 presents the interaction between the variables in the dataset in two different visualizations, with the vibration variables replaced by their average over the different measuring points. Figure 1a indicates a low-apparent-power region where the average vibration is higher than in the rest of the observations. These points correspond to a transient period in which the generating unit operates with imbalanced water inflow inside the runner; in such a state, the wear damage to the system and the fault risks are more serious.

Figure 1. Graphical representation of the data set in three dimensions. In (a), the entire data set is presented, regardless of the temporal relationship between data points. In (b), three excerpts from the series with imminent failures are presented, each in a different color. The darker the marker, the closer the fault. Points connected by lines represent sequential states.
Figure 1b presents three excerpts of the time series before failure. The analysis of this representation indicates that failures generally occur in regions where vibration and apparent power are at their maximum, and that significant fluctuations in power and HLU flow may precede them. The figure presents only a sub-sample, 3 of the 59 faults in the database, to avoid overloading the representation with information, which would make it difficult for the reader to analyze.
A fixed period, from 12 h prior to each failure up to the failure itself, splits the full data set into a training set and a test set. The training set corresponds to the healthy state, or normal operation, while the test set is linked to abnormal operation. In this way, the algorithm focuses its training on the positive class related to normal operating conditions, thus becoming a density estimator of the class of interest [42]. This type of approach is common in problems of unbalanced classes in the context of anomaly detection, in which negative cases (our outliers) are absent or not adequately sampled [43].
After separation, the training and test set sizes were 47,857 and 4897, respectively. The ratio between training and test sets is about 10:1, comparable to reference studies on failure detection in hydroelectric plants, which adopted proportions of 8:1 [44] and 1140:100 ≈ 10:1 [33]. The anomaly detection algorithms, described in the following subsections, are trained using only the training data.
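The 12 h pre-failure split described above can be sketched as follows. This is an illustrative reconstruction only: the column name, fault timestamps, and DataFrame layout are assumptions, not the study's actual data schema.

```python
import pandas as pd

def split_by_fault_window(df, fault_times, window_hours=12):
    """Mark the `window_hours` before each registered fault as abnormal (test set)
    and treat every remaining sample as normal operation (training set)."""
    abnormal = pd.Series(False, index=df.index)
    for t in fault_times:
        # samples strictly after (t - window) up to and including the fault instant
        abnormal |= (df.index > t - pd.Timedelta(hours=window_hours)) & (df.index <= t)
    return df[~abnormal], df[abnormal]

# Tiny worked example: one day of 5 min samples with a single fault at noon.
idx = pd.date_range("2018-08-13 00:00", periods=288, freq="5min")
data = pd.DataFrame({"apparent_power": range(288)}, index=idx)
train, test = split_by_fault_window(data, [pd.Timestamp("2018-08-13 12:00")])
```

With 5 min sampling, each 12 h pre-failure window contributes 144 samples to the test set; overlapping windows from nearby faults are merged naturally by the boolean mask.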

PCA
Principal Component Analysis (PCA) is a linear decomposition technique, effective for data dimensionality reduction, that projects correlated variables onto a smaller set of new, orthogonal variables retaining most of the original variance. PCA is the most widely used data-driven technique for process monitoring, due to its capacity to deal with high-dimensional, noisy, and correlated data [45].
Let X ∈ R^(n×m) be an observation matrix, where n is the number of samples and m is the number of monitored variables. X can be decomposed as

X = T P^T + E,

where E is the residual matrix, T ∈ R^(n×a) is the score matrix, and P ∈ R^(m×a) is the loading matrix. The PCA variance measure can be obtained by Hotelling's T² statistic, representing the sum of the normalized squared scores:

T² = t^T D^(−1) t,

where D is the diagonal matrix of the eigenvalues associated with the retained principal components, and t = P^T x is the PCA score, calculated from the multiplication of each observation x by the loading matrix P.
The T² index is used for process monitoring, detecting a systematic variation of the process whenever an observation exceeds the confidence limit T²_α, given by

T²_α = [a(n − 1)/(n − a)] F_α(a, n − a),

where n is the number of samples, a is the number of retained principal components, and F_α(a, n − a) is the upper 100α% critical point of the F-distribution with a and n − a degrees of freedom. As for classification, the class label c_i is set to 1 if T²_i > T²_α, and 0 otherwise, for i = 1, 2, ..., n.
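The T² monitoring scheme above can be sketched in a few lines. This is a minimal illustration, not the study's code: the synthetic data, the number of retained components, and the 0.05 significance level are assumptions.

```python
import numpy as np
from numpy.random import default_rng
from scipy.stats import f as f_dist
from sklearn.decomposition import PCA

rng = default_rng(0)
X_train = rng.normal(size=(500, 6))   # stand-in for the 6 monitored variables
pca = PCA(n_components=3).fit(X_train)

def t2_scores(X):
    # t = P^T x, then T^2 = t^T D^-1 t with D the retained eigenvalues
    t = pca.transform(X)
    return np.sum(t**2 / pca.explained_variance_, axis=1)

n, a, alpha = len(X_train), pca.n_components_, 0.05
# Confidence limit: T2_alpha = a(n-1)/(n-a) * F_alpha(a, n-a)
t2_limit = a * (n - 1) / (n - a) * f_dist.ppf(1 - alpha, a, n - a)

X_test = rng.normal(size=(100, 6))
alarms = t2_scores(X_test) > t2_limit   # True marks a systematic variation
```

Since the test data here follow the training distribution, roughly 5% of samples should exceed the limit, matching the chosen significance level.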

KICA-PCA
The KICA-PCA method applies a kernel transformation, mapping the data into a higher-dimensional space prior to the decomposition. Thus, the method is capable of handling nonlinear multivariate processes, such as SHP condition monitoring [33].
In this application, we adopted the explicit mapping to a low-dimensional Euclidean inner product space using a randomized feature map z: R^m → R^d, proposed by [46], so that the inner product between a pair of transformed points approximates their kernel evaluation:

k(x, y) ≈ z(x)^T z(y).

Contrary to the kernel lifting Φ, z is low-dimensional. Here, k is the radial basis function k(x, y) = exp(−‖x − y‖²/σ), where σ is the standard deviation.
The z mapping competes favorably in speed and accuracy, as evidenced by [46][47][48], being capable of handling the large training matrix of this study without exceeding computational resources of a standard personal computer.
The transformed matrix is calculated through the kernel approximation, each element given by z(x_i)^T z(x_j), where x_i and x_j are the ith and jth samples of X, respectively. Before the application of ICA, the transformed matrix should be whitened to eliminate the cross-relations among the random variables. One popular whitening method uses the eigenvalue decomposition of the covariance matrix R_x = E{x(k) x(k)^T}, as described in [32]. The association of the kernel transform and ICA is known in the literature as KICA.
ICA is a statistical, computational technique originally proposed to solve the blind source separation problem by revealing patterns hidden in signals, variable sets, or measurements [32]:

X̃ = A S + E,

where X̃ is the whitened transformed matrix, A is the mixing matrix, S is the independent component matrix, and E is the residual matrix. The basic problem is estimating the original components S and the matrix A from X̃. ICA calculates a separating matrix W such that the components of the reconstructed matrix

Ŝ = W X̃

become as independent of each other as possible. The matrix Ŝ of independent components (ICs) recovered from the sensed data is then used as input for the PCA monitoring algorithm in Equation (1), which calculates the T² score and the classification set from the comparison with the T²_α threshold.
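The KICA-PCA chain described above, and implemented in the study as a pipeline of transformers, can be sketched with scikit-learn components: RBFSampler provides the randomized feature map z, FastICA performs whitening plus ICA, and PCA supplies the monitoring stage. The data and all dimensions below are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(1)
X_train = rng.normal(size=(400, 6))   # stand-in for the monitored variables

kica_pca = Pipeline([
    # Random Fourier features approximating the RBF kernel lifting z
    ("z", RBFSampler(gamma=0.5, n_components=50, random_state=0)),
    # Whitening + independent component extraction (the "KICA" step)
    ("ica", FastICA(n_components=10, random_state=0, max_iter=500)),
    # Final decomposition used for T^2 monitoring
    ("pca", PCA(n_components=3)),
])
scores = kica_pca.fit_transform(X_train)   # PCA scores t for each sample
```

The T² statistic and the T²_α threshold would then be computed from these scores exactly as in the plain PCA case.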

iForest and EIF
While most anomaly detection approaches are based on normal instance profiling, iForest is an anomaly detection algorithm that explicitly isolates anomalies. The method exploits two particularities of anomalies: they represent fewer instances in the observed set, and, compared to healthy instances, they have discrepant attribute-values [35].
The method does not apply any distance or density measure, thereby eliminating the major computational cost of distance calculation. In addition, the algorithm scales linearly while keeping memory usage low and constant, and it lends itself to parallel computing, making the model suitable for handling large, high-dimensional data sets.
The anomaly detection procedure using iForest is a two-stage procedure: the training stage constructs the isolation trees (iTree), using sub-samples from the training set; the subsequent evaluation stage calculates the anomaly score for each instance of test set [34,35].
The iForest builds an ensemble of binary trees, each trained using a sub-sample X_s randomly drawn from X, X_s ⊂ X. There are two control parameters in Algorithm 1: (1) the sub-sample size ψ, which sets the number of samples used to train each tree, and (2) the number of trees nt, related to the complexity of the model.
Normal points tend to be isolated at the deeper end of the tree, whereas anomalies, due to their singular nature, end up closer to the tree root. The shorter the average path length, the higher the chance of being an anomaly. Hence, the anomaly score s is defined by

s(x, n) = 2^(−E(h(x))/c(n)),

where n is the number of samples in the dataset, E(h(x)) is the average of the path length h(x) over the ensemble of isolation trees, and c(n) is the average of h(x) given n, used to normalize the path length. If an instance returns an anomaly score s very close to 1, it very likely represents an anomaly; if s is much smaller than 0.5, it is safe to say the instance is normal; if s ≈ 0.5, the sample does not present any distinct anomaly [34].
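The score normalization can be made concrete with a short sketch. The c(n) expression below is the standard normalization from the iForest literature, c(n) = 2H(n − 1) − 2(n − 1)/n, with the harmonic number approximated via the Euler–Mascheroni constant; the example inputs are illustrative.

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search, used to normalize h(x)."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(n-1) approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

# A short average path length yields a score near 1 (likely anomaly);
# a long one pushes the score towards 0 (clearly normal).
s_short = anomaly_score(1.0, 256)
s_long = anomaly_score(15.0, 256)
```

Note the built-in calibration: when the average path length equals c(n), the score is exactly 0.5, the "no distinct anomaly" regime described above.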
Although the standard iForest algorithm is computationally efficient, there is a limitation as to how the anomaly score aggregates tree branches' length. Branch cuts are always horizontal or vertical, which introduces a bias in the anomaly score map.
The EIF algorithm overcomes this limitation by adopting random slopes in the branching process. The selection of a branch cut then requires a random slope and a random intercept chosen from the range of values available in the training data. The random slope is a vector m, of size equal to the number of variables, with each coordinate drawn from a standard normal distribution N(0, 1). The intercept p is obtained from a uniform distribution over the range of values present at the branching point. The splitting criterion for a point x is given by (x − p) · m ≤ 0.
Input: X (input data), e (current tree height), hl (height limit)
Output: an iTree
if e ≥ hl or |X| ≤ 1 then
    return exNode{Size ← |X|}
else
    randomly select a normal vector m, drawing each coordinate from a standard Gaussian distribution N(0, 1)
    randomly select an intercept point p in the range of X
    set coordinates of m to zero according to the extension level

The property of concentration of data in clusters is maintained by the algorithm, as the intercept points p tend to accumulate where the data are, while the score maps are free of the previously observed artifacts. The EIF implementation modifies the lines of the original formulation in Algorithm 2 that describe the choice of the random slope and intercept, and adds an inequality condition test. Algorithm 3 is modified accordingly to receive the regular observation and intercept point of each tree, and to calculate the path depth if the condition test is valid [40]. Contamination is the parameter that estimates the proportion of outliers in a given set; its value is set near the 0.95 confidence level adopted for the Hotelling's-distance-based models. The proposed values for the number of trees nt and the sub-sample size ψ are 100 and 256, respectively [34,35]. However, these parameters may vary according to the size and complexity of the dataset.
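The EIF branch cut can be sketched as follows: a random slope m with standard-normal coordinates, an intercept p drawn uniformly from the data range at the branching point, and the split test (x − p) · m ≤ 0. This is a simplified illustration of the splitting step only, not the full tree-building algorithm; the data and the masking scheme for the extension level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def eif_split(X, extension_level=None):
    """Perform one EIF branch cut on the sample matrix X (rows = points)."""
    d = X.shape[1]
    m = rng.standard_normal(d)                     # random slope, N(0, 1) per coordinate
    p = rng.uniform(X.min(axis=0), X.max(axis=0))  # random intercept within data range
    if extension_level is not None:
        # Keep extension_level + 1 active coordinates, zeroing the rest
        keep = rng.choice(d, size=extension_level + 1, replace=False)
        mask = np.zeros(d, dtype=bool)
        mask[keep] = True
        m[~mask] = 0.0
    left = (X - p) @ m <= 0.0                      # splitting criterion (x - p) . m <= 0
    return X[left], X[~left], (m, p)

X = rng.normal(size=(200, 6))
X_left, X_right, _ = eif_split(X)
```

With extension_level=0 only one coordinate of m is nonzero, so the cut is axis-parallel and the split reduces to the standard iForest behavior.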
We carried out 50 simulations varying one parameter at a time while keeping the other fixed at its standard value. Figure 2 summarizes the results of these simulations; the points represent the average values of the metric, while the bars represent the 0.95 confidence interval. The nt search space is defined as 1, 5, 10, 50, 100, 500, and 1000, while the ψ search space follows the powers of 2 from 2^7 to 2^13. By varying nt, we find that, as the number of trees increases, the variance and average of the TTC and CTT decrease, with the model showing excellent stability at 500 trees. Performing the same analysis for ψ, we found that, for values above 2048, CTT starts to increase gradually. Table 1 presents the parameters adopted for EIF based on these observations.
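The one-at-a-time sweep can be sketched as below. As a simplification, scikit-learn's IsolationForest stands in for EIF, synthetic data replaces the plant data, and the mean score replaces the study's TTC/CTT evaluation; only the sweep structure mirrors the procedure above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
X_train = rng.normal(size=(2000, 6))

nt_grid = [1, 5, 10, 50, 100, 500, 1000]         # number-of-trees search space
psi_grid = [2**k for k in range(7, 14)]          # sub-sample sizes 2^7 .. 2^13

# Vary nt with psi fixed at its standard value of 256
mean_scores = []
for nt in nt_grid:
    model = IsolationForest(n_estimators=nt, max_samples=256, random_state=0)
    model.fit(X_train)
    # score_samples returns the negated anomaly score, so values lie in [-1, 0)
    mean_scores.append(model.score_samples(X_train).mean())
```

The symmetric loop, fixing nt = 100 and iterating over psi_grid, completes the sweep; in the study, each configuration was repeated to estimate the 0.95 confidence intervals shown in Figure 2.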

Temporal Distance Metric
Whereas traditional anomaly detection adopts classification evaluation techniques, such as the confusion matrix or one of its derived metrics [49], these metrics may be deficient in a time series context. Specific metrics, first developed for evaluating time series segmentation, are therefore adopted. In the present work, we use the average detection count and the absolute temporal distance to evaluate the different methods [41].
Let Y be a time series, an ordered set of real values indexed by the natural numbers, Y = {y_0, y_1, y_2, ..., y_n}, y_t ∈ R, and let C be a set of class labels, or classification, C = {c_0, c_1, c_2, ..., c_n}, c_t ∈ {0, 1}, indexed like Y but consisting of binary values. Values labeled 0 represent normal observations, and values labeled 1 represent anomalous ones.
The average detection count l is simply given by the difference between the number of anomalies in the target classification and in the candidate classification [41]:

l = |count(C_0) − count(C_i)|,

where count(C_i) = ∑_{j=1}^{n} c_j, c_j ∈ C_i. The absolute temporal distance (TD) method consists of calculating the sum of all distances between the anomalies of the two classifications [41]:

TD = TTC + CTT,

where TTC stands for Target To Candidate, calculated by f_closest(C_0, C_i), and CTT means Candidate To Target, given by f_closest(C_i, C_0); C_0 and C_i denote the target and candidate classifications, respectively. Note that lower values of each individual metric are better than higher ones. Figure 3 graphically represents the calculation of the metric. In the given example, TTC = ∆t_1 + ∆t_2 and CTT = ∆t_1; thus, the absolute temporal distance is TD = 2∆t_1 + ∆t_2. The best possible value for the metric is zero, corresponding to a perfect detection system.
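A minimal implementation of these metrics follows. f_closest sums, for each anomaly in one classification, the distance in samples to the nearest anomaly in the other; the maximal-penalty convention for an empty detection set is an assumption of this sketch, not taken from [41].

```python
import numpy as np

def f_closest(from_labels, to_labels):
    """Sum of distances from each 1 in from_labels to the nearest 1 in to_labels."""
    src = np.flatnonzero(from_labels)
    dst = np.flatnonzero(to_labels)
    if len(src) == 0:
        return 0
    if len(dst) == 0:
        return len(from_labels) * len(src)  # assumed convention: maximal penalty
    return int(sum(np.abs(dst - i).min() for i in src))

def temporal_distance(target, candidate):
    ttc = f_closest(target, candidate)      # Target To Candidate
    ctt = f_closest(candidate, target)      # Candidate To Target
    return ttc + ctt

# Worked example: target anomalies at t=3 and t=7, detections at t=2 and t=7.
target    = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0])
candidate = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0])
l = abs(int(target.sum()) - int(candidate.sum()))   # average detection count
td = temporal_distance(target, candidate)           # TTC = 1, CTT = 1, TD = 2
```

Here both classifications contain two anomalies, so l = 0 even though one detection is one sample early, which TD = 2 captures; this is exactly why the two metrics are reported together.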

Software and Hardware
The simulation was developed in Python 3.7.6, adopting the common scientific libraries SciPy 1.4.1, Pandas 1.0.1, and NumPy 1.18.1. Scikit-learn 0.22.1 [50] provides efficient and reliable implementations of machine learning algorithms such as PCA and isolation forest. KICA-PCA was implemented by sequentially declaring the kernel approximation, ICA, and PCA in a pipeline of transformers. For EIF, we applied the implementation available in the isotree 0.1.16 package [40].
Hardware specifications adopted to perform the simulation are: CPU Intel Core i7-8550U, 1.80 GHz, 16 GB RAM installed, and Windows 10 v.1909 operating system. The amount of time necessary to perform all 150 simulations is around 2 h. All data and scripts are available in the researcher's public repository (Github repository: https://github.com/rodrigosantis1/shp_anomaly).

Results and Discussion
The temporal distance and anomaly detection count are measured on the test sets for each method, using Equations (9) and (10) shown in Section 2.5. The methods were trained 150 times on the training set, with unique random seeds. Table 2 exhibits the average and standard deviation of the calculated metrics; the methods with the lowest scores are shown in bold. In general, KICA-PCA obtained the smallest difference between detections and real faults, followed by PCA, EIF, and iForest. As a linear model, PCA always converges to the same solution when applied to a single training set; thus, the standard deviation, unaffected by randomness, is not reported for this method. PCA is the method with the lowest CTT distance. This means that, since the distance between the detections and the anomalies is the lowest, it is less likely to raise false alarms than the others. On the other hand, the TTC distance is higher with PCA than with all other methods, which means that the linear approach is less effective in detecting all anomalies. The total TD, obtained from the sum of the two individual components, is higher due to the TTC score.
Combining the PCA with the kernel trick and the ICA increased the accuracy, compared to the PCA alone. KICA-PCA presented intermediate distances when compared to all other methods. When compared to PCA, the nonlinear method improved the TTC, TD, and l. However, the main drawback is that its variance is higher than the others, meaning the method is more susceptible to randomness.
iForest presented the second-lowest TTC distance, which is the distance between the anomalies and the closest detection. Thus, this model was able to detect anomalies closest to their occurrence and, having the lowest TD and standard deviation, presents a suitable method for adoption in an online detection system. The main drawback of this method is the detection count: the number of detections is higher than with the other methods. The EIF algorithm boosted the results obtained by iForest, with a TD reduction of 3.88% and an l reduction of 4.02% compared to the original implementation. Figure 4 shows the anomaly score calculated using the EIF model for the entire test period. The moving average of the anomaly score illustrates that the observations are time-related, with the level of health tending to fluctuate over time. Factors like wear level, time of operation, maintenance, and shutdowns directly impact the health index (HI). Analyzing some sample cases gives a better view of the model outcomes in practice. In the first period, the iForest model detected two anomalies before the first registered system failure: the first detection occurred 10 h before the failure, and the second 2 h before it. In this case, an intervention in the HGU system could have avoided the failure. A late detection was raised just after the first failure, which is considered a near detection. One hour before the second failure, the EIF model detected an anomaly. In total, the model detected four anomalies while two faults were registered. The anomaly score profile calculated in this first excerpt is unsteady, with sudden fluctuations in the HI throughout the plotted period, making the detection task even more difficult.
The anomaly score increased over time in the second observed period, resulting in the first fault around noon. The HGU was rebooted and continued to operate, while iForest detected several anomalies. Six hours later, another fault occurred. In this example, the system could not detect any anomaly before the first failure, although it identified an abnormal state of operation between failures. It is possible to visually identify a positive growth trend in the anomaly score. The number of detections, in this example, associated with iForest results in Table 2, can explain the higher value of l and TTC.
In the third example, the anomaly score calculated between 10:00 p.m. and 11:00 p.m. indicated an unusual operation, identifying two anomalies and presenting a situation in which the system nearly failed. However, the online monitoring system registered no fault until the next day. The model detected an anomaly minutes before the real fault; the damage was severe, and the system was shut down. Throughout the rest of the observed period, the anomaly score remained relatively low, close to 0.40.
Prognosis usually adopts the anomaly score to forecast the HGU behavior. This approach is commonly used to predict vibration trends and, since the score is a nonlinear combination of the monitored variables, it is likely that it also reflects the frequencies and amplitudes of the original signals. Some examples of vibration prediction models for prognosis can be found in [27,51].
The model has the limitation of not allowing the analysis of attribute importance, as other decision-tree-based models do. The choice of the separation attribute in each node is random, not generated from an explicit rule. However, model-interpretation techniques, such as permutation-based and depth-based isolation forest feature importance, can be used to circumvent this limitation [52].
From our simulations, we found that EIF obtained an average TD reduction of 1628 (40.62%) compared to PCA, and of 187 (7.28%) compared to KICA-PCA. These results indicate that the anomaly detection algorithms are efficient and suitable for dealing with the problem of intelligent fault detection in hydroelectric plants, as indicated in the qualitative analysis of imminent failure. In some cases, the anomaly score depicts the trend in the risk of failure. In other cases, the anomaly score identifies regions of at-risk operation, even though no fault is registered.
Continuous improvement of the model can be achieved by associating the detected fault patterns with known failure modes, using fault analysis techniques such as fault trees [53][54][55]. The calculated anomaly score can be used in future work to develop forecasting systems. The adoption of a single-dimension HI simplifies process control and the design of the predictive system: instead of predicting each variable in isolation, one can focus on analyzing a single time series, which carries the individual characteristics of each of the measured variables.

Conclusions
In the present paper, we propose the application of iForest for fault diagnosis, in a condition-based maintenance (CBM) context, in a small hydroelectric plant. The observed period is approximately one year, and the main input variables are vibration, oil inflow, and apparent power. The benchmark models, drawn from the recently reported hydropower fault diagnosis literature, are PCA and KICA-PCA, compared using metrics specific to time series anomaly detection: temporal distance and average detection count. The tree ensembles presented promising results, with lower error levels and variance than KICA-PCA. Another significant advantage of adopting iForest and EIF is their capability for parallel computing, which speeds up model training while keeping memory usage low and fixed to a known limit.
Identifying failures before they occur is vital to allowing better management of asset maintenance, reducing operating costs and, in the case of SHPs, enabling the expansion of renewable energy sources in the energy matrix [56]. With the application of machine learning models such as iForest and EIF, the aim is to improve the health of the equipment and reduce power generation downtime.
Future studies should investigate feature and model selection through exhaustive search, Bayesian, or evolutionary optimization, since the parameters were manually adjusted in the present work. Fine-tuning the models can contribute even further to increasing model accuracy. A step towards a prognostic model can be taken by predicting the anomaly score, decomposing the signal into time- and frequency-domain components, and combining attribute-extraction methods with uni- or multivariate forecasting [28,51].
Another promising direction for the present study is identifying feature importance in an SHP diagnosis system. This knowledge can guide the development of CBM systems by prioritizing the installation of critical sensors in SHP automation projects. EIF, being a generalization of iForest, can be combined with forward selection component analysis [57] for automatic variable selection.
Finally, the present study contributes to the improvement of SHP maintenance, a vital renewable power resource with huge potential for energy supply worldwide. By identifying faults before failure, management can take actions to avoid further damage caused to joint systems and further aggravation of the components, lowering the operating costs of power plants.

Acknowledgments:
The authors would like to thank Ado Popinhak for making available the failure monitoring database used in this study.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Nomenclature
The following nomenclature is adopted in this manuscript: