1. Introduction
Each year, the wind sector suffers profit losses due to wind turbine failures that range from around 200 M€ in Spain or 700 M€ in Europe to 2200 M€ in the rest of the world. Additionally, if operation costs are taken into account, these losses can triple. Owing to the volume of these losses and the current economic situation of the sector, with no generation bonuses and with generation selling prices restricted by new energy directives (see for example [1,2]), tasks related to maintenance and operation improvement are key for wind farm operators, maintenance companies, financial institutions, insurance companies and investors.
The operating and environmental conditions of virtually all wind turbines (WT) in use today are recorded by the turbines’ “supervisory control and data acquisition” (SCADA) system in 10-min intervals [3]. The number of signals available to the turbine operator varies considerably between turbines of different manufacturers as well as between generations of turbines by the same manufacturer, a consequence of the complex nature of these systems as indicated in the IEC standard [4]. The minimum data-set typically includes 10-min average values of wind speed, wind direction, active power, reactive power, ambient temperature, pitch angle and rotational speed (rotor and/or generator). Examples of these sensors are depicted in Figure 1.
One of the main tasks of the Operation and Maintenance (O&M) process is to find the possible causes of a fault manifested by a specific alarm, or a set of alarms, that stops the wind turbine production. This process is crucial to reduce downtime and to detect critical faults at earlier stages. Methodologies and tools that support this type of process can help wind farm owners not only to increase availability and production but also to reduce costs.
The earliest O&M processes were corrective, meaning that maintenance was carried out once turbines broke down and faults were detected. This is an expensive strategy because of the lack of planning. By contrast, preventive maintenance tries to repair or replace components before they fail, but is expensive because maintenance tasks are completed more frequently than is strictly necessary. Condition-based maintenance (CBM) is a trade-off between both aforementioned strategies, in which continuous monitoring and inspection techniques are employed to detect incipient faults early and to determine any necessary maintenance tasks ahead of failure [5]. This is achieved using condition monitoring systems (CMS), which involve the acquisition, processing, analysis and interpretation of data from the SCADA systems.
In modern wind turbines, however, the SCADA data often comprise hundreds of signals, including temperature values from a variety of measurement positions in the turbine, pressure data (for example from the gearbox lubrication system), electrical quantities such as line currents and voltages or pitch-motor currents, or tower vibration, among many others [6,7,8,9]. Comprehensive SCADA data-sets often contain not only the 10-min or even 5-min averaged values, but also the minimum, maximum and standard deviation values for each interval. Due to the high number of available variables and the volume of data, analyzing them can be highly time consuming [10,11,12], and when only well-known related variables are analyzed, hidden (or uncommon) causes can be hard to find, or cannot be found at all. As these data are already being collected and are available for condition monitoring, research has been carried out in recent years on fault detection and prediction in a non-invasive manner.
Amongst the state-of-the-art research, some authors focus on methods for signal analysis, mathematical models or an ensemble of statistical methods sequentially connected. Authors such as Shafiee et al. [13] develop methods to calculate the number of redundant converters and to determine the number of failures needed before reporting a maintenance task in the case of offshore turbines located in hard-to-reach places. On the other hand, Hameed et al. [14] apply transformations for spectral analysis with the aim of detecting deviations before failures. Astolfi et al. [15] use statistical methods to extract indicators showing misalignment of the nacelle with respect to the wind direction; these indicators are checked against real SCADA data. The same authors, in Astolfi et al. [16], show different algorithms that generate performance indicators (malfunctioning index, stationarity index and misalignment index) for the analyzed turbine. Unlike other authors, Qiu et al. [17] work with alarm data and also introduce methods of temporal and probabilistic analysis, generating a diagnosis of the current state of the WT and a forecast of its future state. There are also authors who have focused on creating a physical-statistical model to detect faults [18]. A statistical analysis of the duration of each type of alarm can be found in [19].
In the area of artificial intelligence (AI) there is a wide variety of techniques, largely based on support vector machines (SVM) and artificial neural networks (ANN). One of the first examples of a system based on ANN is SIMAP (Intelligent System for Predictive Maintenance) [20], developed for detecting and diagnosing gearbox faults. The system was able to detect a gearbox fault two days before the actual failure, which is an interesting result, but the system is not developed enough to be used for other types of applications. In 2007, Singh et al. [21] also used an ANN approach for wind turbine power generation forecasting, showing that, over a monthly period, the ANN offered a much more accurate estimation of the actual generated power than the traditional method. Zaher et al. [22] propose an ANN-based automated analysis system. The study describes a set of techniques that can be used for early fault identification for the main components of a WT, interpreting the large volume of SCADA data and highlighting the important aspects of interest before presenting them to the operator.
Neural networks are used in [23] for the estimation of the wind farm’s power curve. This curve links the wind speed to the power produced by the whole wind farm, and is non-linear and non-stationary. The authors model the power curve of an entire wind farm using a self-supervised neural network called GMR (Generalized Mapping Regressor). The model allows them to estimate the reference power curve (on-line profile) for monitoring the performance of the wind farm as a whole. Another example related to forecasting wind parameters can be found in [24], where a combination of Wavelet Decomposition (WD), Empirical Mode Decomposition (EMD) and Elman Neural Networks is presented for wind speed forecasting.
An ANN is also used in the work of Bangalore and Tjernberg [25] and Cui et al. [26], with four continuous variables as input and one as output. The objective is to compare the output of the model with the real data. In the training step they obtain the threshold above which a positive output will be generated. This threshold is determined using the error distribution, taking the value corresponding to a p-value of 0.05. In another work, Bangalore and Tjernberg [27] present a further methodology to detect the deviation from the ANN model using the Mahalanobis distance, with the threshold value obtained from a p-value of 0.01.
In Mazidi et al. [28] the authors propose to use an ANN, again with continuous variables, in order to detect anomalies. As in the previous work, the input variables are selected manually, and Pearson correlation is used to eliminate the most correlated ones. They define various error indicators that are compared to an experimentally derived threshold. A post-analysis based on PCA is then performed to identify the variable that exceeds the threshold. In a later study, Mazidi et al. [29] improve this methodology. First they apply PCA to visualize the correlations between variables and to select some of them by means of Pearson correlation, Spearman correlation, Kendall correlation, Mutual Information, RReliefF or decision trees. Then, based on experiments, they choose the variables to be used as inputs for the ANN model, which has the power as output variable. The output error is used to create a stress model that indicates the status of the WT. We refer the reader to [30], where a detailed explanation of these techniques can be found.
Authors such as [31] have used a different type of ANN, the Adaptive Neuro-Fuzzy Inference System (ANFIS), to characterize normal-behaviour models in order to detect abnormal behaviour of the captured signals, using the prediction error to indicate component malfunctions or faults; while [32] use an ANN to perform a regression using two to four input variables and one output variable.
On the SVM side, authors such as Vidal et al. [33] focus on using a multiclass SVM classifier to detect different failures. They use a pre-analysis of the contribution of each variable by means of PCA. It should be noted that these authors work with data simulated by the FAST system [34], which does not have the handicap of noise and the low quality of data found in real data-sets. [35] use an SVM classifier with five output classes. An important contribution of this work is that it carries out the tasks of cleaning and sampling, which are necessary when dealing with real data, although the selection of variables is done manually. Works such as [36] use an ensemble of models based on ANN, the Boosting Tree Algorithm (BTA), Support Vector Machine (SVM), Random Forest (RF) or the standard Classification and Regression Tree (CART), generating an interval of probability of failure. Leahy et al. [37] also use an ensemble of SVM, RF, logistic regression (LR) and ANN to generate a model capable of classifying three classes (Fault, No Fault, Previous Fault) from SCADA data and alarms. The authors achieve a prediction rate of 71%, in some cases up to 35 h in advance.
We can also find works that use models based on clustering, like the SOM (Self-Organizing Map) in Du et al. [38], which sets the target variable (power) and selects the input variables by correlation. A SOM map is then created from a WT in good condition. Using this map, the distribution of distances to the BMU (Best Matching Unit) is generated and the threshold is established as the quartile value. The data of new wind turbines are mapped onto this SOM, obtaining the distance to the BMU and determining the points that are out of normality. To determine the origin, they compute a statistic of which variable has contributed the most to the distance from the BMU. Continuing with SOM techniques, authors such as Blanco-M. et al. [39] propose a process that includes a clustering technique on the result of the turbines after applying SOM, in order to identify the health status of the turbines. Other authors, such as Leahy et al. [40], focus on clustering groups of alarms, detecting particular sequences before a failure. Gonzalez et al. [41] use similarity measurements between turbines, k-NN, RF and Quantile Regression Forests to determine the error and dispersion of the data from each turbine to detect an anomaly; SCADA alarms are then used to find the system that generated it.
In many papers of the state-of-the-art research we can see that the selection of the variables is done manually by an expert, or based on the perception of the author according to the subsystem to analyze. Some authors, such as [29,33,42,43], include some type of reduction stage based on correlations or PCA, but do not compare selection methods, or the comparison does not contain methods that account for the interaction of more than two variables, such as those presented in this paper.
As we have seen in previous studies, choosing the optimal and adequate number of variables related to a failure is a key step when building the model. To address this issue, this paper explores the possibility of using automatic methods for feature selection and studies their performance on real SCADA data. In this work, an exhaustive-search-based quasi-optimal (QO) algorithm is proposed, which is used as a reference for the automatic algorithms. This will allow us to consider the whole set of variables of the subsystem and automatically select the smallest subset of relevant variables, which in turn will simplify the models and permit a graphical representation of their time evolution.
The paper is organized as follows: Section 2 reviews and presents the automatic feature selection algorithms based on Information Theory measures; Section 3 describes a QO algorithm for feature selection in order to define a reference for the experiments; Section 4 details the case study and methodology; Section 5 is devoted to the experimental results and discussion. Finally, Section 6 provides conclusions to the work.
2. Automatic Feature Selection Algorithms
When dealing with classification systems, the selection of optimal features is of great importance because, even if in theory having more features should give us more discriminating power, in real-world scenarios this is not always the case. The reason is that some features can be irrelevant with respect to predicting the class, or can be redundant with other features (highly correlated, sharing mutual information, etc.), which can decrease the performance of the classification system.
To explore all the available features, and given the impossibility of testing all the possible combinations, feature selection algorithms are needed to sort the features according to a balance between their relevance and their redundancy. As the goal is to solve a classification problem from a subset of variables, the employed algorithms should automatically provide the smallest subsets of non-redundant and most-relevant features.
One way to do this is to apply a criterion that scores each feature by employing information theory measures. Naming $J$ the score function, the score of each feature $X_k$ is obtained as $J(X_k)$. This measure must establish a descending-order ranking of the features.
One of the first and simplest heuristic rules to score features employs the Mutual Information (MI) measure $I(X_k; Y)$, where $Y$ is the class label and $X_k$ is the feature under analysis. Then
$$J_{MIM}(X_k) = I(X_k; Y) \qquad (1)$$
provides the scores of all features according to their individual mutual information content [44], and the feature selection is performed by choosing the first $K$ of them, according to the needs of a given application. Note that the term $I(X_k; Y)$ gives a measure of the relevance of a feature, so it is sometimes known as the relevance index (RI). Note also that, in a feature selection stage for a classification problem, the use of the RI is only optimal when the features are mutually independent. When features are interdependent, this criterion is known to be sub-optimal because it can select a set of individually relevant features that are also redundant with each other [45].
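As a concrete illustration of this relevance-index ranking, the following Python sketch (an illustration with a plug-in empirical MI estimate on discrete data, not the authors' R implementation; feature names are invented for the example) scores each feature by $I(X_k; Y)$ and sorts them:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(X;Y), in bits, between two
    equal-length sequences of discrete values."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())

def mim_ranking(features, y):
    """Rank features by individual relevance I(Xk;Y) (the MIM criterion).
    `features` maps feature name -> list of discrete values."""
    scores = {name: mutual_information(col, y) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# A feature identical to the class label is maximally relevant (1 bit),
# while an independent one scores zero:
ranking = mim_ranking({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}, [0, 0, 1, 1])
print(ranking)  # ['a', 'b']
```

With continuous SCADA signals, a discretization (binning) step would be needed before this plug-in estimate applies.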
To overcome that limitation, other criteria have been proposed that also take the possible redundancy of the features into account. One way to do this is not only to consider the RI of a new feature, but also to measure and subtract the mutual information that the new feature shares with the previously selected features (referred to as $S$), so that only its net contribution to the set is aggregated. That is what the Mutual Information Feature Selection (MIFS) criterion implements [46]. Its corresponding score function is shown in Equation (2):
$$J_{MIFS}(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) \qquad (2)$$
Note that its first term is again $I(X_k; Y)$, which takes into consideration the relevance of $X_k$. Its second term, which contributes with negative sign, is $\beta \sum_{X_j \in S} I(X_k; X_j)$ and accumulates the mutual information of $X_k$ with all the features $X_j$ already selected in $S$. This term clearly introduces a penalty that enforces low correlation with the previously selected features. Note that in Equation (2) the term $\sum_{X_j \in S} I(X_k; X_j)$ grows with the number of selected features, whereas $I(X_k; Y)$ stays constant. Therefore, when dealing with a large set of features, the second term can become the predominant one.
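A minimal sketch of greedy selection under the MIFS score (illustrative Python with a plug-in empirical MI estimate, not the authors' R code; the feature values and names below are invented for the example) shows the redundancy penalty at work:

```python
import math
from collections import Counter

def mi(x, y):
    # Empirical mutual information (bits) between discrete sequences.
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())

def mifs_select(features, y, k, beta=1.0):
    """Greedy MIFS: at each step pick the feature maximizing
    J(Xk) = I(Xk;Y) - beta * sum over Xj in S of I(Xk;Xj)."""
    selected = []
    remaining = list(features)
    for _ in range(min(k, len(remaining))):
        def j_mifs(name):
            rel = mi(features[name], y)
            red = sum(mi(features[name], features[s]) for s in selected)
            return rel - beta * red
        best = max(remaining, key=j_mifs)
        selected.append(best)
        remaining.remove(best)
    return selected

# A noisy copy ("x2") of the already selected feature ("x1") is penalized,
# so a weakly relevant but non-redundant feature ("x3") is chosen instead:
feats = {"x1": [0, 0, 0, 0, 1, 1, 1, 1],
         "x2": [0, 0, 0, 0, 1, 1, 1, 0],
         "x3": [0, 0, 1, 1, 0, 0, 1, 1]}
y = [0, 0, 0, 1, 1, 1, 1, 1]
print(mifs_select(feats, y, 2))  # ['x1', 'x3']
```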
A further refinement can be made if each new feature aggregated into $S$ is the one that most increases the complementary information with the features previously selected. That criterion is fulfilled when working with the Joint Mutual Information (JMI) [47,48]. In that case, the JMI score function for $X_k$ is
$$J_{JMI}(X_k) = \sum_{X_j \in S} I(X_k X_j; Y) \qquad (3)$$
which computes the mutual information between the target $Y$ and the joint random variable $X_k X_j$, defined by pairing the candidate $X_k$ with each $X_j \in S$. After some mathematical manipulations, $J_{JMI}(X_k)$ can be written as shown in Equation (4):
$$J_{JMI}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} \left[ I(X_k; X_j) - I(X_k; X_j \mid Y) \right] \qquad (4)$$
in which the RI term appears, followed by the term that penalizes the redundancy (also present in the MIFS approach) and, finally, a new term: $I(X_k; X_j \mid Y)$. This last term contributes with positive sign to $J_{JMI}(X_k)$, increasing it when there is some class-conditional dependence of $X_k$ with the existing features in $S$. This means that the inclusion of some correlated features can improve the feature selection performance thanks to the complementarity of the newly added features with the ones already present in $S$. A similar term can be observed in other criteria presented in the next subsection. The improvement in the feature selection performance that can be observed in some data-sets due to the inclusion of this third term was also reported by [45].
What is interesting at this point is that, according to the framework presented in Brown et al. [45], although many other criteria have been reported in the literature, most of the linear score functions can always be rewritten as a linear combination of the three exposed terms, as follows:
$$J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) + \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)$$
where $\beta$ and $\gamma$ are configurable parameters.
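As an illustration of this unified formulation, the following Python sketch (a stand-in for the authors' R/FEAST code, using plug-in empirical estimates on discrete data) scores a candidate feature for given values of the two parameters:

```python
import math
from collections import Counter

def mi(x, y):
    # Empirical mutual information (bits) between discrete sequences.
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())

def cond_mi(x, y, z):
    # Conditional mutual information I(X;Y|Z) = sum over z of p(z) I(X;Y|Z=z).
    n = len(z)
    return sum(cnt / n * mi([x[i] for i in range(n) if z[i] == val],
                            [y[i] for i in range(n) if z[i] == val])
               for val, cnt in Counter(z).items())

def linear_score(xk, y, selected, beta, gamma):
    """General linear criterion of the framework:
    J(Xk) = I(Xk;Y) - beta*sum_j I(Xk;Xj) + gamma*sum_j I(Xk;Xj|Y).
    beta = gamma = 0 recovers MIM; gamma = 0 recovers MIFS;
    beta = gamma = 1/|S| recovers the JMI expansion."""
    rel = mi(xk, y)
    red = sum(mi(xk, xj) for xj in selected)
    cond = sum(cond_mi(xk, xj, y) for xj in selected)
    return rel - beta * red + gamma * cond
```

With an empty $S$ the score reduces to the relevance index; a fully redundant candidate is driven to zero by the second term.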
Not all the methods found in the literature include all three terms. It is also clear that the performance of the different criteria will depend on the statistical properties of each feature data-set. Consequently, in order to evaluate the best criterion for our data-set, different methods have been employed in the feature selection stage.
In the next subsection, the expressions of the information-theory-based feature selection algorithms used in this work are detailed. For all these algorithms, Table 1 lists the acronyms, names and references, and indicates whether the method employs a second term to avoid redundancy between features, or has some way of capturing the class-conditional correlation that improves the classification performance (as observed in some data-sets). A detailed description of all these algorithms can be found in [45].
Compilation of Used Criteria
The feature selection algorithms used in the experiments are mainly described as a function of the Mutual Information and the Conditional Mutual Information. Given the discrete variables $X$, $Y$ and $Z$, these functions are denoted by $I(X; Y)$ and $I(X; Y \mid Z)$, respectively. Both expressions can be written in terms of Shannon entropies [53]; a joint entropy is also used directly as a normalization term in the Double Input Symmetrical Relevance criterion. In the following expressions, $X_k$ is the feature under analysis and $Y$ is the class label. The group of previously selected features is indicated by $S$, and all sums are performed over the features already included in $S$, denoted as $X_j \in S$. The symbol $|S|$ stands for the cardinality of $S$; where it appears, its inverse reduces the effect of the term it multiplies as the cardinality of $S$ increases. Note that the expressions corresponding to the Conditional Mutual Information Maximization (CMIM) and Interaction Capping (ICAP) criteria are non-linear due to the max and min operations, and therefore their interpretation is not as straightforward as in the linear case.
Mutual Information Feature Selection:
$$J_{MIFS}(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j)$$
Conditional Mutual Information:
$$J_{CMI}(X_k) = I(X_k; Y) - \sum_{X_j \in S} I(X_k; X_j) + \sum_{X_j \in S} I(X_k; X_j \mid Y)$$
Minimum-Redundancy Maximum-Relevance:
$$J_{mRMR}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j)$$
Double Input Symmetrical Relevance:
$$J_{DISR}(X_k) = \sum_{X_j \in S} \frac{I(X_k X_j; Y)}{H(X_k X_j Y)}$$
Conditional Mutual Information Maximization:
$$J_{CMIM}(X_k) = \min_{X_j \in S} I(X_k; Y \mid X_j)$$
or:
$$J_{CMIM}(X_k) = I(X_k; Y) - \max_{X_j \in S} \left[ I(X_k; X_j) - I(X_k; X_j \mid Y) \right]$$
Interaction Capping:
$$J_{ICAP}(X_k) = I(X_k; Y) - \sum_{X_j \in S} \max\left(0,\; I(X_k; X_j) - I(X_k; X_j \mid Y)\right)$$
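For instance, the CMIM score can be sketched as follows (illustrative Python with plug-in empirical estimates of the information measures, not the authors' R library):

```python
import math
from collections import Counter

def mi(x, y):
    # Empirical mutual information (bits) between discrete sequences.
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())

def cond_mi(x, y, z):
    # Conditional mutual information I(X;Y|Z) = sum over z of p(z) I(X;Y|Z=z).
    n = len(z)
    return sum(cnt / n * mi([x[i] for i in range(n) if z[i] == val],
                            [y[i] for i in range(n) if z[i] == val])
               for val, cnt in Counter(z).items())

def cmim_score(xk, y, selected):
    # J_CMIM(Xk) = min over Xj in S of I(Xk; Y | Xj); with S empty the
    # score falls back to the plain relevance I(Xk; Y).
    if not selected:
        return mi(xk, y)
    return min(cond_mi(xk, y, xj) for xj in selected)
```

The min operation is what makes the criterion non-linear: a single already-selected feature that explains $X_k$'s relation to the class suffices to drive the score to zero.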
To perform the experiments, the original code from [45] was adapted to the R language, the speed of the calculations was optimized, and a new functionality was added that allows a set of features to be specified as mandatory for the feature selection functions, letting the algorithm then add further features and rank them according to the optimization process. This functionality was not provided by the original code. The R code of the library (FEASTR) is freely available at http://mon.uvic.cat/data-signal-processing/software/.
3. Exhaustive-Search-Based Quasi-Optimal Algorithm
In this section, a quasi-optimal (QO) algorithm for feature selection is presented in order to establish a reference, or gold standard, for the rest of the experiments performed with automatic feature selection algorithms. Optimal feature selection implies testing all possible combinations and selecting the one that gives the best classification rate. Unfortunately, this is only feasible when the number of features is sufficiently small, due to the exponential growth of the number of combinations as the number of features increases. This effect is known as the curse of dimensionality. Indeed, the number of combinations of $n$ features taken $k$ at a time (without repetition) is equal to the binomial coefficient $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$.
In our specific case each sub-system has 4 variables (minimum value, maximum value, average value, standard deviation) which gives us 36 features (4 variables × 9 sub-systems) coming from the gearbox, transmission and nacelle wind sensors systems of wind turbines (see
Table 2 for the exact list of variables). This implies, for example, that we have 7140 combinations of three features, 58,905 combinations of four features and 376,992 combinations of five features. The worst case, when taking 18 features, gives a total of 9,075,135,300 combinations.
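These counts follow directly from the binomial coefficient and can be checked, for example, in Python:

```python
from math import comb

# Number of distinct feature subsets of size k drawn from the 36
# SCADA features (combinations without repetition).
for k in (3, 4, 5, 18):
    print(k, comb(36, k))
# 3 7140
# 4 58905
# 5 376992
# 18 9075135300
```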
Therefore, all the possible combinations of 1, 2 and 3 features will be evaluated, and a QO strategy will be implemented for 4, 5 and 6 features. In all cases, the criterion for selecting the best combination is the classification rate obtained with the k-NN classifier. The following strategy (see Figure 2 for a block diagram) details how the QO feature selection is implemented. Suppose the best combination of n features is to be determined. Then:
1. Calculate the frequency of selection of the features for the case n-1, using the best 500 results.
2. Sort the features according to their frequency.
3. Select the subset of the S most frequent features.
4. Calculate all possible combinations of these S features taken n at a time (without repetition).
5. Select the best combination based on the classification rate obtained with the k-NN classifier.
For the case n = 4 the best 20 frequent features (S = 20) of the case n = 3 will be used, generating a total of 4845 combinations of 4 characteristics. For the case n = 5 the best 15 features (S = 15) of the case n = 4 will be used, generating a total of 3003 combinations of 5 characteristics. Finally, for the case n = 6 the best 15 features (S = 15) of the case n = 5 will be used, generating a total of 5005 combinations of 6 characteristics.
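The steps above can be sketched as follows (illustrative Python; `evaluate` is a stand-in for training and scoring the k-NN classifier on a candidate subset, and the toy usage below replaces it with a simple sum over invented integer "features"):

```python
from itertools import combinations
from collections import Counter

def qo_step(prev_subsets, n, top_s, evaluate, keep=500):
    """One step of the quasi-optimal search: from the best `keep`
    subsets of size n-1, count how often each feature was selected,
    keep the `top_s` most frequent ones, and evaluate all their
    combinations of size n with the score function `evaluate`
    (the k-NN classification rate in the paper)."""
    best_prev = sorted(prev_subsets, key=evaluate, reverse=True)[:keep]
    freq = Counter(f for subset in best_prev for f in subset)
    candidates = [f for f, _ in freq.most_common(top_s)]
    return max(combinations(candidates, n), key=evaluate)

# Toy usage: a subset's "classification rate" is simply the sum of its
# members, so the frequent high-valued features survive the pruning.
pairs = list(combinations(range(10), 2))
best_triplet = qo_step(pairs, n=3, top_s=4, evaluate=sum, keep=5)
print(sorted(best_triplet))  # [7, 8, 9]
```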
The advantage of optimal feature selection is that all possible combinations (interactions) between features are tested. The disadvantage is that evaluating such a large number of combinations becomes infeasible when the number of features is large and a substantial number of features per group is to be considered. The QO strategy presented above approximates the optimal feature selection, but even so some combinations that could be better are probably ignored and, even though the number of combinations decreases, there are still many cases to evaluate with the classification algorithm. On the other hand, one is usually interested in a fast algorithm for automatic feature selection that can deal with all 36 features and rank them according to their importance for the classification problem. Therefore, the aim is to replace the QO feature selection with an automatic feature selection algorithm without losing performance, while allowing all available features to be exploited.
5. Experimental Results and Discussion
All the experiments (see Figure 3) use the data-set presented in Section 4.1, which contains 36 features, and each target has a label indicating normal state, warning state or alarm state. Warnings and alarms are merged, so the task becomes a binary classification problem. The selection of the best features to be used as input to the classification system is implemented as detailed in Section 2. Several experiments were performed using all the WT, and the best features, from 1 to 6, were obtained through several feature selection algorithms. Panel (a) of Figure 4 shows the CR against the number of features for the quasi-optimal algorithm and all the WT. Results are very good for all the WT, reaching above 85% CR when the number of features is 3 or higher. Adding new features slightly increases the CR, but beyond 4 features the change is almost imperceptible. Numerical results for these experiments (in terms of CR and F1) are detailed in Table 3. All results are obtained with k = 1, and we can see that the F1-score is close to 1 and highly correlated with the CR results.
The specific features selected by the algorithms are included in
Table 3, coded with a letter and a number. The letter indicates the group of the feature, while the number stands for the exact variable code (1: average; 2: min; 3: max; 4: sdv (standard deviation)).
Table 2 contains the translation from the variable code to the variable name. For instance, in
Table 3 and using only one feature, the best result for WT1 is 91.79% with the feature A1.
Table 2 indicates that this feature is “WGDC.TrfGri.PwrAt.cVal.avgVal”, meaning the active power (letter A), averaged value (number 1).
5.1. Quasi-Optimal vs. Automatic Feature Selection
The next step is to look for a feature selection algorithm able to obtain similar results with a small number of features. Results for the feature selection algorithms are presented in panels (b) to (f) of Figure 4. Each panel corresponds to a WT and contains the result obtained with the quasi-optimal method (as a reference, dashed line) and the results obtained with all the other algorithms for this WT. As can be observed, some WT are easy to model (see for example WT4) while others are more challenging (see for example WT5). Numerical results for all the experiments are detailed in Table 4, again showing the CR and the F1. When comparing the results obtained by the quasi-optimal exploratory method and the automatic feature selection methods, the QO results are always the best ones, as expected, but several automatic methods also obtain very good results.
Among all the automatic algorithms, CMI emerges as stable across all the WT, (almost) always obtaining a very good result, comparable to that obtained with the quasi-optimal method for a number of features equal to or higher than 4.
By exploring all possible combinations of features, the optimal number of features is determined. As can be seen, the CR saturates at 6 features, so the system will not increase its performance by adding new features. It is important to keep the number of features as small as possible in order to develop less complex classification systems. Besides, if the systems are less complex, it will be easier to train the models and the risk of overfitting will be lower. Finally, using a small number of features (up to 3) allows the information to be represented graphically. This is of great importance as a tool in the front-end of real applications for the managers of the wind farms. Hence, CMI with 3 or 4 features is a good choice in the experiment, with CR and F1 comparable to the quasi-optimal ones for all WT.
5.2. Effect of the Number of Neighbors Considered
To analyze the effect of the number of neighbors in the k-NN algorithm, experiments were performed exploring all cases from k = 1 to k = 50 for all the algorithms, using the best combination of features in each case.
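A plain k-NN classification-rate computation of the kind used in these experiments can be sketched as follows (illustrative Python, not the authors' implementation; the toy data are invented):

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among the k
    nearest training points (squared Euclidean distance)."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(train_x[i], query)))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def classification_rate(train_x, train_y, test_x, test_y, k):
    # Fraction of test points whose predicted label matches the true one.
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

# Two well-separated classes: with k = 1 every test point is classified
# by its closest neighbour.
tr_x, tr_y = [[0.0], [1.0], [10.0], [11.0]], [0, 0, 1, 1]
te_x, te_y = [[0.4], [10.5]], [0, 1]
print(classification_rate(tr_x, tr_y, te_x, te_y, k=1))  # 1.0
```

Sweeping k in such a loop, as done here from 1 to 50, reproduces the kind of CR-versus-k curves shown in Figure 5.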
When analyzing the quasi-optimal case, k = 1 is the best option for all the WT. When using any of the automatic feature selection algorithms, if the number of features is small then the number of neighbors affects the CR and k = 1 is usually not the best choice. Nevertheless, even when increasing the number of neighbors, the obtained CR is lower than in the QO case for the number of features analyzed. If the number of features increases, and therefore the CR also increases, k = 1 becomes the best option again and the CR tends to the QO case. The advantage of increasing the number of neighbors is compensated by increasing the number of features. This effect can be observed in Figure 5: the left column presents the evolution of the CR as a function of k for the quasi-optimal set of features (1 to 6) for WT1 and WT3. The right column shows the same WT, but now using the features obtained with the best feature selection algorithm among all those analyzed. Note that increasing the number of neighbors is only useful for the CMI algorithm when the number of features used is small (1 or 2), but does not help increase the CR when the number of features is larger. For the quasi-optimal feature selection algorithm, k = 1 is (almost) always the best option regardless of the number of features. Therefore, changing the number of neighbors only has an impact when using 1 or 2 features in the CMI algorithm, and degrades the CR when the number of features is large or when the QO method is used.