Improved Ensemble Learning for Wind Turbine Main Bearing Fault Diagnosis

The goal of this paper is to develop, implement, and validate a methodology for wind turbine main bearing fault prediction based on an ensemble of an artificial neural network (a normality model designed at the turbine level) and an isolation forest (an anomaly detection model designed at the wind park level), trained only on SCADA data. The normal behavior and the anomalous samples of the wind turbines are identified, and several interpretable indicators based on the predictions of these algorithms are proposed to provide wind park operators with understandable information and enough time to plan operations ahead and avoid unexpected costs. The stated methodology is validated on a real under-production wind park composed of 18 wind turbines.


Introduction
Global demand for energy has reached an unprecedented level with the growing world population and the rising industrialization of developing countries. However, most energy worldwide is still obtained from fossil fuels, which is closely related to rising levels of greenhouse gas emissions. The transition to renewable energy sources is therefore central to fighting climate change. As stated in [1], only 10% of the world's primary energy supply currently comes from renewable sources, but that share is growing steadily. Among renewable energy sources, wind power has been the fastest growing in recent decades; in 2019 it was the leading source of new capacity in Europe, the US, and Canada, and the second largest in China [2].
It is noteworthy that the levelized cost of energy (LCOE) of wind fell 39% between 2010 and 2019 [3]. However, within this LCOE, operation and maintenance (O&M) accounts for 28.5% in land-based wind projects, and up to 34% in offshore wind projects [4]. Energy production losses due to downtime (caused by O&M of the assets), together with the costs associated with the replacement of components, can scale up to millions of Euros per year in any industrial-size wind park. Thus, it is of paramount importance that the wind industry moves from corrective (repairing components after they break down) and preventive maintenance (scheduled at regular intervals) to predictive maintenance (scheduled as needed based on asset condition). Predictive, or condition-based, maintenance (CBM) provides operators with an advance warning before the actual fault occurs, allowing them to plan ahead and schedule repairs to coincide with weather or production windows to reduce costs and turbine downtime. CBM is an extensive area of research in a wide variety of applications, such as smart manufacturing [5]. However, its application to complex systems, such as wind turbines (WTs) that work under varying operating and environmental conditions, remains an open challenge. Furthermore, the latest CBM developments tend to use expensive, specifically tailored sensors for condition monitoring (CM), which is not economically viable for turbines already under operation, and even less so when they are close to reaching the end of their lifespan.
The life expectancy of the wind parks built during the late-1990s and early-2000s surge of wind power is about to be reached. A decision on lifetime extension is complicated, but it is clear that end-of-life solutions will develop a significant market over the next five years [6]. In particular, CBM data-driven methodologies based on already available supervisory control and data acquisition (SCADA) data are a promising cost-effective solution. These SCADA data are highly variable because of the changing operational conditions; they have a low sampling rate (10-min average values), are not standardized, and were not originally designed for the particular purpose of CM but for control purposes. Therefore, it is a hard challenge to build CBM strategies based solely on these data [7]. However, this is an active research area, as shown by recent publications. On the one hand, many of the references found in the literature work with simulated SCADA data (e.g., [8]) that can be obtained using open-source simulation software and, rarely, with real data, as these are proprietary data from the wind park operators and are not easily available. On the other hand, when dealing with real data, normally only one or two WTs are tested. Examples are [9], where support vector machines are used to predict WT faults, and [10], where different machine learning classifiers are compared to predict generator faults. These references use real WT SCADA data from one and two WTs, respectively, and are based on supervised approaches that require historical fault data to be constructed. This is an important drawback, as obtaining labeled datasets from operational data is typically hard, error-prone, and leads to a highly unbalanced dataset. Additionally, such a methodology cannot be applied straightforwardly to wind parks where the fault of interest did not occur in the past.
Thus, despite the promising performance of supervised methods, unsupervised approaches are preferred for SCADA predictive maintenance [11]. The recent works related to unsupervised SCADA based predictive maintenance can be mainly grouped in one of the following three categories: signal trending, normality models, and anomaly detection models. In the following paragraphs, a brief review of each one of these categories is given.
The approaches in the signal trending category study changes and trends in the SCADA time series. Some relevant works include [12], where monitoring of WT generators using the statistical inertia of a wind park average is proposed; and [13], where the difference between the SCADA data of each turbine and the median of the rest of the turbines in the wind park is used to establish a condition vector, to which vector autoregression Hotelling and vector autoregression multivariate exponentially weighted moving average are later applied to locate the faulty turbine. The main advantages of signal trending methods are their simplicity and interpretability. However, most of these methods cannot capture complex relations among different variables; thus, machine and deep learning models are needed to move beyond signal trending solutions.
Regarding data-based normality models, also called normal behavior models, their objective is to learn the relation between a set of input variables and a target variable under normal operation. Then, the difference between predictions and real measurements (for the target variable) is used to detect abnormal behavior. These normality models are well established in unsupervised SCADA-based predictive maintenance, and the general trend is to use the power curve (the relation between wind speed and extracted power) as the target variable. Noteworthy is the work of [14], where main working parameters (e.g., rotor speed and blade pitch) are used as input variables and the power is employed as the target to construct the normality model. The main advantage of these methods is their capability to learn complex relations between different sensors. On the other hand, their major drawback is a lack of interpretability.
The main bearing is the object of this analysis due to its central role in the drive train assembly. It connects the rotor to the gearbox and has to withstand the large torque generated by the rotor, making it prone to problems. Failure rates of 30% are reported for single main bearings and 15% for double main bearings, calculated over a 20-year lifetime [15]. Typical wear mechanisms affecting the main bearing include micro-pitting, spalling, and smearing [16]. Main bearing failures can be anticipated through vibration analysis, but also using predictive models based on SCADA data and the analysis of temperature signals, as reported in [17].
In this work, a normality model at the WT level is selected, based on [18], where the target variable is chosen to be the sensor closest to the component under study. In particular, as the main bearing is the component to be monitored, the main shaft temperature is used as the target variable. The main bearing is chosen because, as stated by the European Academy of Wind Energy (EAWE) [16], the wind industry has identified main bearing failures as a critical issue in terms of increasing WT reliability and availability, as they lead to major repairs with high replacement costs and long downtime periods.
The methods in the category of anomaly detection models, also called outlier detection, seek to identify rare samples which raise suspicion by differing significantly from the majority of the data. A significant work is [19], where three anomaly detection techniques (one-class support vector machine, isolation forest, and elliptic envelope) are compared for WT CBM using SCADA data. The isolation forest has some advantages over other approaches, as it does not require a normal operation dataset, and it requires less training data than deep learning strategies. In this work, an anomaly detection model at the wind park level is proposed, based on the isolation forest methodology.
Considering all the aforementioned references, in this work a CBM strategy based on SCADA data is stated, with a five-fold contribution: (i) it is an unsupervised approach, so the specific fault under study need not have occurred in the past to train the proposed models; (ii) it is an ensemble that combines the benefits of a WT-level normal behavior model with a wind-park-level anomaly detection model; (iii) it combines different training and prediction window sizes; (iv) fault prediction is accomplished months in advance of the fault, giving operators enough time to plan ahead and schedule repairs; and (v) the validity and performance of the proposed methodology are demonstrated on a dataset covering two and a half years of operation from a real under-production wind park composed of 18 WTs.
The rest of the article is organized as follows. Section 2 presents the available SCADA data and work order logs used in this work. Next, Section 3 states the proposed ensemble methodology for WT main bearing fault diagnosis, including a comprehensive description of the single models and their indicators. Section 4 presents the results and their discussion to interpret and describe the significance of the ensemble method in comparison to the single models by themselves. Finally, in Section 5, conclusions are drawn, and future work is proposed.

Data
Two information sources are used in this research: (i) SCADA operating data and (ii) work order logs. The first is used for modeling the behavior of the turbine, whereas the second is used solely to validate the results of the algorithms. Both SCADA data and work order logs are typically available to wind park owners without the need to install new sensors or change operating routines. However, the format and content of the work order logs, being less structured and not standardized, may vary significantly from one wind park maintainer to another.

SCADA Operating Data
SCADA comprises a vast network of sensors monitoring the state of the main components of a turbine, as well as the environmental conditions. Operating data are recorded at high frequency and successively down-sampled to reduce network usage and storage costs.
The resolution of SCADA data is usually 10 min. To reduce the information loss due to down-sampling, not only the mean value across the aggregation period is stored, but also the standard deviation, minimum, and maximum values. An example of a typical SCADA dataset is provided in Table 1. Operating data from a European onshore wind park, situated in Poland, are used in this research. A total of 18 turbines with a nominal power of 2.3 MW is available. Approximately three and a half years of data are analyzed, from the beginning of 2017 to mid-2020. Data from 2017 are used to train the normality model, and the remaining two and a half years are used to test predictions. Having more than one year of data allows controlling for seasonal variations in ambient temperature and wind speed.
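As an illustration, the aggregation scheme described above can be reproduced with pandas; the sensor values and sampling rate below are hypothetical, not taken from the park's actual SCADA system.

```python
import numpy as np
import pandas as pd

# Hypothetical high-frequency signal (1 s resolution) for a single
# temperature sensor over one hour of operation.
rng = pd.date_range("2017-01-01", periods=3600, freq="s")
raw = pd.Series(np.random.default_rng(0).normal(45.0, 2.0, len(rng)), index=rng)

# Down-sample to the usual 10-min SCADA resolution, keeping the four
# aggregate statistics that limit information loss.
scada = raw.resample("10min").agg(["mean", "std", "min", "max"])
print(scada.shape)  # (6, 4): six 10-min periods, four statistics each
```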

Work Orders
Wind park owners commonly keep track of the ordinary and extraordinary maintenance interventions required by the turbines. This information is typically stored in text files, where a description of the intervention is reported together with a timestamp of the date on which it occurred. Sometimes more detailed information, such as the material required for the intervention, is provided. In this research, work order logs are used to assess the validity of the main bearing status predictions, and they are not fed as input to the algorithms. An example of the work order logs is presented in Table 2.

Normal Behavior Model
In this subsection, the selected normal behavior model based on an artificial neural network (ANN) is comprehensively described. Note that this model is designed at the turbine level; thus, each WT in the park has its own associated normality model trained only on its own historical SCADA data. First, the data preprocessing is detailed, dealing with out-of-range values, missing data, and sensors with different magnitudes. Second, a one-year time window is selected to define the training dataset, including all operating conditions of the WT. This ensures that the detected anomalies are not just a change in seasonality and allows the methodology to be used across all regions of operation of the WT. Third, the ANN set-up is detailed, including a brief explanation of the optimization and regularization methods, as well as the selection of the ANN structure. Finally, a specific purpose-built indicator for the normal behavior model is stated.

Data Preprocess
First, variable selection is needed to determine the set of variables that provides the best fit for the model so that accurate predictions can be made. In this work, the temperatures of the components located close to the main bearing are selected as variables of the normality model, together with the ambient temperature, as it affects the temperatures of all subsystems. Additionally, the generated power and rotor speed provide information about the region of operation of the WT; thus, they are also used as variables. In summary, the selected variables are shown in Table 3, where the ranges of realistic values for each sensor are also listed.
Second, for each variable, out-of-range values are deleted, as they are associated with sensor measurement errors. Furthermore, data imputation of missing values (and of the values deleted in the prior step) is carried out through piecewise cubic Hermite interpolating polynomials. Thanks to its local smoothing property, this strategy produces more stable estimates than other standard approaches used for data imputation [20]. Notice that the nearest valid value after or before the gap is used at the beginning and end of the dataset, respectively.
Finally, as the different variables have different magnitudes, min-max normalization is used to scale the dataset.
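The three preprocessing steps can be sketched as follows; the series, valid range, and gap positions are illustrative, and scipy's PchipInterpolator stands in for the piecewise cubic Hermite interpolation described above.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import PchipInterpolator

# Hypothetical temperature series with one out-of-range value (sensor
# error) and one missing sample; valid ranges would come from Table 3.
t = pd.Series([20.1, 21.0, 250.0, np.nan, 22.4, 23.0])
lo, hi = -40.0, 100.0

# 1) Delete out-of-range values, treated as sensor measurement errors.
t[(t < lo) | (t > hi)] = np.nan

# 2) Impute the gaps with a piecewise cubic Hermite interpolating polynomial.
known = t.notna().to_numpy()
pchip = PchipInterpolator(np.flatnonzero(known), t.to_numpy()[known])
t.loc[~known] = pchip(np.flatnonzero(~known))
t = t.ffill().bfill()  # nearest valid value at the dataset boundaries

# 3) Min-max normalization to scale the variable to [0, 1].
t_scaled = (t - t.min()) / (t.max() - t.min())
```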

Train and Test Sets
The normality model must be able to cope with the various operational and environmental conditions that the WT faces; see [18]. Thus, the train and test datasets include data from all working conditions: different wind speed regions and their associated regions of operation of the WT (start-up, power maximization, and power limitation to avoid exceeding safe electrical and mechanical loads), different seasons, curtailment, etc. It is therefore noteworthy that in this work there is no filtering of the data based on specific regions of operation or seasonality. The available SCADA data for the normality model are divided as follows: data from 2017 are used to train the model (thus, the training dataset contains a whole year of data), and the remaining two and a half years are used to test predictions. Note that there is no validation set, because Bayesian regularization is used to train the ANN.
Finally, recall that a customized normality model is built for each wind turbine in the park. This could raise concerns about computational cost. However, note that after the model is trained on one year of data, it is used for predictions two years ahead, as shown in the results section. Thus, the computational cost is low compared to using a rolling training window of observations preceding the target forecast, which must be retrained at every shift of the window.

ANN Set Up
The ANN model structure is proposed in this section and is based on the eight selected variables shown in Table 3. The output of the ANN is the temperature of the low-speed shaft (the variable of interest) at time t, and the inputs are the following: y_1: generated real power at time t − 1; y_2: generated real power at time t; y_3: ambient temperature at time t − 1; y_4: ambient temperature at time t; y_5: bearing coupling side temperature at time t − 1; y_6: bearing coupling side temperature at time t; y_7: bearing non-coupling side temperature at time t − 1; y_8: bearing non-coupling side temperature at time t; y_9: generator temperature at time t − 1; y_10: generator temperature at time t; y_11: gearbox temperature at time t − 1; y_12: gearbox temperature at time t; y_13: rotor speed at time t − 1; y_14: rotor speed at time t.
Thus, referring to the structure of the ANN, there are 14 inputs, denoted y_i, i = 1, ..., 14, one output, denoted y, and a hidden layer comprising 72 neurons. Figure 1 shows the ANN architecture.
The ANN is trained with the Levenberg-Marquardt optimization method combined with Bayesian regularization (to enhance the generalization capability of the model), as this combination obtains lower mean squared errors than other algorithms for function approximation problems; see [21,22]. Recall that the main purpose of the WT normality model is to precisely approximate a function, namely the temperature of the low-speed shaft of that specific WT under normal (healthy) conditions. The Bayesian selection of the regularization parameters provides an optimally regularized solution, as well as insight into the effective number of parameters actually used by the ANN, which is extremely useful to design the size of the network. In this work, an effective number of parameters of 1058 is obtained from a total of 1153 parameters in the proposed network (number of weights and biases); thus, the complexity of the ANN is appropriate for the training dataset used. A comprehensive description of the Levenberg-Marquardt method with Bayesian regularization can be found, for example, in [18,21,23].
Finally, note that the network uses rectified linear unit (ReLU) activation functions, and weights and biases initialization is performed using the Xavier initializer [24].
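A minimal sketch of the resulting architecture is given below; only the forward pass, ReLU activation, and Xavier initialization are shown, with the Levenberg-Marquardt/Bayesian-regularization training itself omitted. The parameter count confirms the 1153 weights and biases mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 14, 72, 1  # inputs, hidden neurons, output

def xavier(fan_in, fan_out):
    """Xavier (Glorot) uniform initialization."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1, b1 = xavier(n_in, n_hidden), np.zeros(n_hidden)
W2, b2 = xavier(n_hidden, n_out), np.zeros(n_out)

def forward(x):
    """One forward pass: ReLU hidden layer, linear output (regression)."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

# Total trainable parameters: (14 + 1)*72 + (72 + 1)*1 = 1153.
n_params = W1.size + b1.size + W2.size + b2.size
```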

Normal Behavior Model Indicator
A fault indicator is needed that raises an alarm when samples not following the normality model are detected. Initially, such an indicator could be based on establishing a threshold on the residual between the estimated value (given by the ANN) and the real SCADA measurement. However, this would result in an unacceptably high number of false alarms caused by solitary samples crossing the threshold, rendering the method worthless. Therefore, it is critical to establish an indicator that takes into account the persistence of consecutive samples above a certain threshold.
First, a detection threshold is prescribed as threshold = µ + 6σ, based on the mean, µ, and standard deviation, σ, of the residuals over the training data.
Second, a weekly indicator is implemented as follows:

indicator_normality = min(1, n_over / 504), (1)

where n_over denotes the number of samples in that week whose residual exceeded the threshold. Note that this indicator ranges between 0 and 1: when all samples are below the threshold, the indicator value is 0; when half or more of the samples in a week (504 of the 1008 10-min samples) are above the threshold, the indicator value is 1.
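The weekly indicator of Equation (1) reduces to a few lines of code; the synthetic residuals below are illustrative only.

```python
import numpy as np

def normality_indicator(residuals, mu, sigma):
    """Weekly normality indicator, Equation (1): the count of 10-min
    residuals above the mu + 6*sigma threshold, divided by 504 (half a
    week of samples) and saturated at 1."""
    threshold = mu + 6.0 * sigma
    n_over = int(np.sum(np.asarray(residuals) > threshold))
    return min(1.0, n_over / 504.0)

# A quiet week (all residuals below threshold) yields 0; a week with
# half of its 1008 samples above the threshold yields 1.
quiet = np.full(1008, 0.1)
noisy = np.concatenate([np.full(504, 10.0), np.full(504, 0.1)])
print(normality_indicator(quiet, mu=0.0, sigma=0.5))  # 0.0
print(normality_indicator(noisy, mu=0.0, sigma=0.5))  # 1.0
```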

Isolation Forest
A second method is used to analyze the status of the main bearing, namely an anomaly detection algorithm: the isolation forest. The objective is to complement the prediction of the neural network normality model with an algorithm that is able to analyze the wind park as a whole. First, data pre-processing is discussed, highlighting the differences from the processing required by the neural network. Second, the testing scheme for the isolation forest is presented; the main difference with respect to the previous algorithm is the lack of a training period. Third, a brief explanation of the isolation forest is provided, together with a discussion on the selection of the main parameters of the algorithm. Finally, the post-processing of the anomaly detection results is detailed, showing how to construct an indicator that captures differences among turbines.

Data Preprocess
In analogy with the neural network pre-processing, a reduced selection of variables is used. The main bearing temperature is chosen, as monitoring the status of this component is the objective of the research. Additionally, the ambient temperature and the rotating speed of the main shaft are selected, as they provide information on the context and operating status of the main bearing. A summary of the inputs is given in Table 4, completed with the ranges used to filter outliers and sensor measurement errors. No data imputation is performed for missing values; instead, data are down-sampled from ten-minute to hourly resolution, reducing the number of empty timestamps and cutting computation time. As the main bearing is a massive component with large thermal inertia, the reduction in data frequency entails limited information loss.

Testing Scheme
The way the isolation forest algorithm is used in this research requires no selection of a set of normal conditions to adjust the parameters of the model. Instead, data are analyzed through a rolling window of one-month width, and instances lying at the border of the data distribution are marked as anomalies. The rolling window is shifted by one week at a time, effectively sweeping the entire dataset. The choice of the window length is a compromise between computation time and the ability to capture a complete set of operating conditions of the turbines.
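The sweeping scheme can be sketched as a simple generator of window bounds; the one-month width is approximated as 30 days, and the dataset dates are hypothetical.

```python
import pandas as pd

def rolling_windows(index, width="30D", step="7D"):
    """Yield (start, end) bounds of a one-month rolling window shifted
    by one week at a time; the final window may extend slightly past the
    last timestamp so the tail of the data is covered."""
    width, step = pd.Timedelta(width), pd.Timedelta(step)
    start, last = index.min(), index.max()
    while start + width <= last + step:
        yield start, start + width
        start += step

# Two months of hourly data produce six overlapping analysis windows.
idx = pd.date_range("2018-01-01", "2018-03-01", freq="h")
windows = list(rolling_windows(idx))
print(len(windows))  # 6
```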

Isolation Forest Setup
The isolation forest was introduced by Liu et al. [25] as an efficient anomaly detection algorithm based on the widely known decision trees. Unlike most model-based algorithms, no normal-condition training set is required. Points lying at the edges of the data distribution are regarded as anomalies. With this definition of an anomalous point, the isolation forest uses binary search trees to recursively split the data. Several fully developed trees, i.e., trees whose terminal nodes contain a single instance, are trained. Isolation of anomalies is promoted by using random partitions when separating the data, as indicated in [25]. The anomaly score used to determine the likelihood of a point being anomalous is based on the average path length required by the trained trees to isolate that point from the rest of the data. Equation (2) is the canonical definition of the isolation forest anomaly score:

s(x, n) = 2^(−E(h(x)) / c(n)), (2)
where E(h(x)) is the average of the path lengths h(x) required to isolate the given point, and the denominator c(n) is the average path length of an unsuccessful search in a binary search tree built from n points. Three cases are typical: when E(h(x)) tends to 0, s tends to 1 and the point is a clear anomaly; when E(h(x)) tends to n − 1, s tends to 0 and the point can be regarded as normal; and when E(h(x)) tends to c(n), s tends to 0.5, and no distinct anomaly stands out. In this research, the implementation of the isolation forest from scikit-learn, for the Python programming language, is utilized [26]. Three parameters are set to run the algorithm. The number of estimators determines the quantity of decision trees of which the isolation forest is composed; more trees lead to more reliable estimations of the path length, at the cost of higher computation time. The contamination coefficient represents the expected proportion of anomalies in the data: the points in the analyzed distribution are sorted by their anomaly score, and the most anomalous ones, according to the contamination value, are selected. Finally, the maximum number of samples used to train each estimator, i.e., each decision tree, is defined by the homonymous parameter.
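The setup described above can be reproduced with scikit-learn as follows; the feature values and the three parameter settings are illustrative placeholders, not the values tuned in this work.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Hypothetical month of hourly records for the selected variables:
# main bearing temperature, ambient temperature, and rotor speed.
X = np.column_stack([
    rng.normal(50.0, 3.0, 720),  # main bearing temperature [deg C]
    rng.normal(10.0, 5.0, 720),  # ambient temperature [deg C]
    rng.normal(12.0, 1.0, 720),  # rotor speed [rpm]
])

forest = IsolationForest(
    n_estimators=100,    # number of trees; more -> steadier path lengths
    contamination=0.01,  # expected share of anomalies in the window
    max_samples=256,     # sub-sample size used to grow each tree
    random_state=0,
)
labels = forest.fit_predict(X)  # -1 marks anomalies, 1 marks normal points
n_anomalies = int(np.sum(labels == -1))
```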

Anomaly Detection Indicator
An anomaly indicator is defined to translate the output of the isolation forest into concrete information on the WT status. The discovered anomalies are assigned to the corresponding turbines, to determine which turbines have a larger percentage of anomalous points with respect to the rest of the wind park. The indicator is calculated as the ratio between the number of anomalous points (n_anomalies) and the total number of records (n_total) of a turbine for a given period of time:

indicator_anomaly = n_anomalies / n_total. (3)

The idea is that, if no clear differences between the turbines' behaviors are present, the anomaly indicator will be similar across the wind farm. On the other hand, if a turbine is subjected to different operating conditions, its percentage of anomalies over the total number of records will be higher compared with the other turbines.
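Equation (3) is straightforward to compute per turbine; the counts below are hypothetical.

```python
def anomaly_indicator(n_anomalies, n_total):
    """Anomaly indicator, Equation (3): share of a turbine's records
    flagged as anomalous in the considered period."""
    return n_anomalies / n_total

# A turbine with 36 anomalous records out of 720 hourly samples:
print(anomaly_indicator(36, 720))  # 0.05
```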

Ensemble
Ensemble strategies have been widely used in the literature, both for power output prediction and for fault diagnosis [27-29].
The outputs of the previous algorithms are merged into a composite indicator, with the objective of producing better predictions by overcoming the limitations of the two methods. To build a valuable ensemble, it is important to choose base indicators that are complementary and not redundant; thus, a key step is to verify that the correlation between the indicators is limited. This correlation is addressed in Section 4.3. The normality indicator provides an individual analysis of each turbine, comparing a historical model with current conditions, whereas the anomaly indicator is based on a comparison among the different turbines of the wind farm.

Ensemble Indicator
The ensemble strategy used in this research does not require training an additional meta-learner; instead, the normality indicator (1) and the anomaly indicator (3) are combined through a rolling windowed sum. The pseudocode to compute the ensemble indicator is given in Algorithm 1. In particular, the first step is to rank the turbines, assigning the corresponding percentile in which they fall when compared to the indicator distribution of the whole wind farm. Possible ties between turbines are managed using a ranking scheme that assigns to the tying turbines the lowest rank of the group. Since the single-model indicators are calculated on a weekly basis, each turbine is assigned one percentile rank per week. The next step is the application of a rolling window sum; more specifically, a window of p = 4 weeks is chosen. Applying a rolling window sum highlights long-lasting changes, limiting the influence of isolated peaks. Finally, a decision threshold is applied to decide whether an alarm must be triggered.
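The steps above can be sketched with pandas as follows. The percentile ranking with ties taking the lowest rank of the group maps to rank(method="min", pct=True); the normalization of the rolling sum by its maximum attainable value (2p) is an assumption introduced here so that the indicator stays in [0, 1], and the weekly values are random placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
weeks = pd.date_range("2018-01-01", periods=8, freq="7D")
turbines = [f"WT{i:02d}" for i in range(1, 5)]

# Hypothetical weekly single-model indicators (rows: weeks, cols: turbines).
normality = pd.DataFrame(rng.random((8, 4)), index=weeks, columns=turbines)
anomaly = pd.DataFrame(rng.random((8, 4)), index=weeks, columns=turbines)

def ensemble_indicator(normality, anomaly, p=4):
    # Weekly percentile rank of each turbine within the park; tying
    # turbines receive the lowest rank of the group.
    rank_n = normality.rank(axis=1, pct=True, method="min")
    rank_a = anomaly.rank(axis=1, pct=True, method="min")
    # A rolling p-week sum highlights long-lasting changes while limiting
    # the influence of isolated peaks; scaled to [0, 1] by 2p (assumed).
    return (rank_n + rank_a).rolling(p).sum() / (2.0 * p)

ens = ensemble_indicator(normality, anomaly)
alarm = ens >= 0.85  # decision threshold applied last
```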

Test Data
The proposed methodology is validated with real SCADA data from an under-production wind park composed of 18 wind turbines. Recall that approximately three and a half years of data are analyzed, from the beginning of 2017 to mid-2020. Data from 2017 are used to train the normality model, and the remaining two and a half years are used to test predictions against the work order logs. Table 5 shows the main bearing failures during the test period based on the work order logs. Three WTs in the park suffer a main bearing failure during the test period. Note that the work order log comments are different for each case, ranging from lubrication to component replacement.

Performance Metrics
To evaluate the performance of the proposed methodology, the results are analyzed via a confusion matrix based on the criteria described next. When the alarm is triggered within a six-month window before the failure actually occurs, the result is reported as a true positive (TP); in this case, the method provides sufficient advance notice of the failure, so that the wind park operators can organize the maintenance operation while minimizing costs. When the alarm is not triggered and the WT does not have any work order, the result is documented as a true negative (TN); in this case, the method correctly establishes that the WT is under normal operation. When the alarm is not triggered, or it is triggered outside the six-month window before failure, and the WT suffers a main bearing failure, the result is reported as a false negative (FN). Finally, when the alarm is triggered but the WT does not have any work order, the result is reported as a false positive (FP).
Finally, the following performance metrics are used to analyze the results: accuracy, precision, recall, specificity, and F1 score. Their definitions are briefly recalled below.
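These metrics can be written directly from the confusion matrix counts; the counts in the usage example are hypothetical and do not correspond to Table 6.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity, and F1 score computed
    from confusion matrix counts (zero-division guarded)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

print(metrics(tp=3, fp=2, tn=12, fn=1))
```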

Ensemble vs. Single Models
The different models that compose the ensemble should provide complementary information; see [17]. Thus, it is important to analyze the correlation between the indicators provided by each of the individual models: the anomaly detection model vs. the normality model. Figure 2 displays the isolation forest indicator vs. the neural network indicator for each WT in the park separately. The Pearson correlation coefficient and the p-value for testing non-correlation are reported as r and p, respectively. It is shown that, in general, there is no correlation between the individual indicators (low values of the Pearson correlation coefficient are obtained); thus, they are complementary in terms of the information they contain. Note the particular case of WT05, where the Pearson correlation is about 0.78, indicating a moderate positive relationship between the variables, and the p-value is almost zero, which indicates that the correlation coefficient is significant. The explanation for this special behavior is that WT05 suffers the only main bearing fault in the park that is correctly predicted by both indicators; thus, in this case, the indicators are correlated. Figures 3-5 show the normal behavior model indicator, the anomaly detection indicator, and the ensemble indicator, respectively. In these figures, the blue stars indicate the occurrence of a main bearing fault (as recorded in the work order logs). Note that the normal behavior indicator shows high values for WT01 and WT18, which are healthy over the whole test set, but the anomaly detection indicator shows low values for these WTs. On the contrary, the anomaly indicator shows high values for WT16, which is healthy, but the normality indicator shows low values for this WT. The ensemble indicator adequately combines the information from the single indicators, showing low values for WT01, WT16, and WT18, as desired.
To decide whether to trigger an alarm, a decision threshold is needed. Values of the indicator above the threshold activate an alarm, and values below the threshold are considered normal. To better visualize the results, a saturation mask is proposed: all indicator values below the threshold are saturated to 0, and all indicator values at or above the threshold are saturated to 1. Figure 6 shows the ensemble indicator after applying the saturation mask with a decision threshold of 0.85. It can be observed that the three main bearing faults in the park (highlighted with blue stars) are warned of at least five months in advance. On the other hand, there are false alarms at WT01, WT02, and WT09. The value of the decision threshold (DT) has a great impact on the final results. Figure 7 shows the different metrics as a function of the DT value. It is observed that, in general, higher DT values lead to better performance metrics; notably, for the same DT value, the best performance metrics are clearly obtained with the ensemble method (except for low threshold values, which have no practical relevance, as they would lead to an excessive number of false alarms).
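The saturation mask described above amounts to a simple binarization of the indicator:

```python
import numpy as np

def saturation_mask(indicator, dt=0.85):
    """Binarize the ensemble indicator: values at or above the decision
    threshold become 1 (alarm), values below become 0 (normal)."""
    return np.where(np.asarray(indicator) >= dt, 1, 0)

print(saturation_mask([0.2, 0.85, 0.9, 0.5]))  # [0 1 1 0]
```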
To further analyze the impact of the DT on the final results, Table 6 shows the confusion matrix information with respect to the DT value for the ensemble method. Low values of the threshold have no practical application, since every WT is then diagnosed as faulty: all results are positive (either TP or FP). Increasing the DT diminishes the number of FPs (false alarms) and consequently increases the number of TNs (correctly classified healthy WTs). When the DT is equal to 0.95, only 28 instances are wrongly classified as FN (that is, faulty instances wrongly classified as healthy). When the DT is equal to 0.90, there are 58 instances wrongly classified as FP (false alarms); however, all faulty instances are correctly detected. In this case, when DT = 0.90, all faults are detected at the expense of some false alarms. The wind park operator should weigh the cost of unnecessary checks (due to false alarms) against the savings from early warning of the main bearing fault to decide on the best DT. Finally, Figure 8 and Table 7 compare the performance metrics of the different single models and the ensemble model for different DT values. In particular, Figure 8 shows a bar plot of the metrics for three specific values of the DT, namely 0.6, 0.85, and 0.95. The ensemble strategy outperforms the single methods in all cases; the precision result for DT = 0.95 is outstanding, where a value of 1.0 is achieved with the ensemble while values lower than 0.5 are obtained with the normality and anomaly models by themselves. Table 7 gives a further detailed comparison of the performance metrics for DT values from 0 to 0.95 in increments of 0.05. Note that for DT values of 0.7 or greater, the best performance metrics are obtained with the ensemble method, clearly showing its advantage over the single models.
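The sweep over DT values underlying Table 6 and Table 7 can be sketched as follows, with toy indicator values and ground-truth fault labels in place of the real park data (all names and numbers here are illustrative only):

```python
import numpy as np

def confusion_counts(indicator, labels, dt):
    """Confusion-matrix counts (TP, FP, FN, TN) for a decision threshold dt."""
    pred = np.asarray(indicator) >= dt        # True = alarm raised
    labels = np.asarray(labels).astype(bool)  # True = truly faulty instance
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    tn = int(np.sum(~pred & ~labels))
    return tp, fp, fn, tn

# Toy ensemble indicator values and fault labels
ind = np.array([0.10, 0.96, 0.92, 0.30, 0.88])
lab = np.array([0, 1, 1, 0, 0])

for dt in (0.60, 0.85, 0.95):
    tp, fp, fn, tn = confusion_counts(ind, lab, dt)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
```

Raising `dt` trades FPs (false alarms) for FNs (missed faults), which is the trade-off the operator must balance against inspection costs.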

Conclusions
In this work, an ensemble method for main bearing fault diagnosis has been proposed, implemented, and validated on a real under-production wind park composed of 18 WTs. The ensemble combines a normality model at WT level (i.e., a specific model is trained for each WT in the park) with an anomaly detection model at wind park level, using only SCADA data and past work order logs.
The stated ensemble method requires only healthy data for training (it is not a supervised approach), and thus the methodology can be applied to any WT, even when there are no past records of the fault under study. The obtained results show that, for the same value of the DT, the best performance metrics are clearly obtained with the ensemble method in comparison to the single methods. Furthermore, the proposed strategy is able to warn of the three main bearing faults present in the wind park during the test set at least 5 months in advance (giving the wind park operator enough time to organize logistics and minimize downtime), with few false alarms.
The DT has a great influence on the final results; thus, it is strongly advised that the wind park operator weigh the cost of unnecessary checks (due to false alarms) against the savings from early warning of the main bearing fault to decide on the best DT to employ.
Two future work directions are envisioned. First, the extended isolation forest (EIF) presented in [30] will be incorporated to investigate a possible improvement in the results. Second, gearbox faults will be studied, as gearboxes tend to fail prematurely in WTs and their replacement is very costly (the outage can last from a few days to several months, depending on crane and parts availability). The wind park SCADA data and work order logs used in this research also contain gearbox faults, so future work will address this type of fault, taking the ensemble method proposed in this work as a starting point.
Author Contributions: All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data required to reproduce these findings cannot be shared at this time as it is proprietary.