Anomaly Detection of Wind Turbines Based on Deep Small-World Neural Network

Abstract: Accurate and efficient condition monitoring is key to enhancing the reliability and security of wind turbines. In recent years, intelligent anomaly detection methods based on deep learning networks have received increasing attention. Since accurately labeled data are usually difficult to obtain in real industrial settings, this paper proposes a novel Deep Small-World Neural Network (DSWNN) based on unsupervised learning to detect early failures of wind turbines. During network construction, a regular auto-encoder network with multiple restricted Boltzmann machines is first constructed and pre-trained using unlabeled wind turbine data. The trained network is then transformed into a DSWNN model by a random add-edge method, and the network parameters are fine-tuned using a minimal amount of labeled data. To guard against changes and disturbances in wind speed and to reduce false alarms, an adaptive threshold based on extreme value theory is presented as the criterion for anomaly judgment. The DSWNN model excels at deep mining of data characteristics and at accurately measuring error. Finally, two failure cases of wind turbine anomaly detection are given to demonstrate the validity and accuracy of the proposed methodology in comparison with the deep belief network and the deep neural network.


Introduction
A changeable and severe environment is the main cause of failures in wind turbines (WT); such frequent malfunctions inevitably lead to low availability and high maintenance costs [1]. In general, various types of condition monitoring sensors are installed in different WT components, and their multi-dimensional state parameters, such as wind speed, pitch angle and hydraulic oil temperature, are recorded and saved by the WT supervisory control and data acquisition (SCADA) system [2]. Once an exception occurs, its fault information is reflected in the multi-dimensional sensor parameters of the SCADA system [3]; these parameters are referred to as "SCADA data". Hence, using the SCADA data for early anomaly detection helps to achieve condition assessment and fault warning for wind turbines.
Generally, fault detection and isolation (FDI) approaches have two critical processes: (1) extract effective features from the complex data, and (2) use prior knowledge or machine learning techniques to classify failures. It is worth noting that both steps need explicit label information (normal or abnormal labels) as an essential element for training the intelligent classification algorithms [4,5]. In actual WT operation, however, it is almost impossible to acquire enough labeled data. This is because the normal operation time is much longer than fault occurrence time, which causes the data sparseness to strategy combining multiple small-world neural networks has been proposed to diagnose and detect the WT pitch failures [21]. Considering the characteristics of wind turbine SCADA data, this paper proposes a deep small-world neural network (DSWNN) for anomaly detection of wind turbines. First, a DSWNN prototype using multiple restricted Boltzmann machines (RBM) is constructed based on the classical deep auto-encoder network. Unlabeled SCADA data from wind turbines are used to pre-train this DSWNN prototype and extract implied features. Then, the regular neural network is transformed into a small-world one through a random add-edge method, and labeled data are used to train the reconstructed DSWNN model in the supervised case to fine-tune the global parameters of the network. Because of the acute changes and disturbances of wind speed in actual operation, fixed thresholds for judging failures are often unreasonable and can cause false alarms. Therefore, an adaptive threshold determined by extreme value theory is presented as the criterion for anomaly judgment. Finally, the effectiveness of the proposed method is verified by two failure cases of wind turbine pitch systems.
The remainder of this paper is organized as follows. Section 2 describes the proposed DSWNN model and its training method; Section 3 gives the estimation method for the adaptive threshold; Section 4 presents and discusses two failure cases of a wind turbine pitch system. Finally, Section 5 presents the conclusions of the study. All abbreviations and symbols used in this paper are listed in Tables A1 and A2 of Appendix A.

Deep Small-World Neural Network (DSWNN)
A deep auto-encoder network is a deep learning network with multiple hidden layers between the input and output layers, which can model complex non-linear relationships among multiple types of variables. The parameters of this network are initialized layer by layer through unsupervised learning on the input data, and supervised learning is then used for fine-tuning. In this network framework, deep learning models yield more complex features at higher layers, and the learned complex features are invariant to changes in the input data [22,23]. In this paper, the proposed DSWNN model is an improved deep auto-encoder network that has small-world characteristics. The improvement adds neuron connections between non-adjacent hidden layers on top of the original network structure. In particular, the DSWNN model is composed of multiple stacked RBMs, and Figure 1 gives an example of the DSWNN structure with four hidden layers. The training process of the DSWNN model includes three phases (pre-training, small-world transformation and fine-tuning), which are applied to obtain the model parameters.

Pre-Training of the DSWNN Prototype
As mentioned above, the DSWNN prototype is a deep auto-encoder network with multiple RBMs. An RBM is a specific energy-based stochastic model [24] with one visible layer and one hidden layer, in which the visible neurons $v = (v_1, v_2, \ldots, v_m)$ are fully connected to the hidden neurons $h = (h_1, h_2, \ldots, h_n)$. Figure 2 shows the mechanisms of an RBM. In the RBM, the energy of the joint configuration $E(v, h)$ is given by Equation (1), and the joint probability $P(v, h)$ based on the energy model is given by Equation (2).
$$E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{ij} h_j \tag{1}$$
$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{v, h} e^{-E(v, h)}} \tag{2}$$
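where $w_{ij}$ is the symmetric weight between visible neuron i and hidden neuron j, $v_i$ and $h_j$ are their binary states, and $a_i$ and $b_j$ are their biases, respectively. Given a visible vector v, an unbiased sample of $h_j$ can be obtained, and given a hidden vector h, an unbiased sample of $v_i$ can be obtained, as follows:
$$P(h_j = 1 \mid v) = f\Big(b_j + \sum_{i=1}^{m} v_i w_{ij}\Big) \tag{3}$$
$$P(v_i = 1 \mid h) = f\Big(a_i + \sum_{j=1}^{n} h_j w_{ij}\Big) \tag{4}$$
where $f(\cdot)$ is an activation function that can be taken as a logistic Sigmoid function or a tanh function, which are defined as Equation (5).
$$f(x) = \frac{1}{1 + e^{-x}} \quad \text{or} \quad f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{5}$$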
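The essence of the activation function is to retain the characteristics of the activated neuron and map them to its output. The derivative of the log-likelihood with respect to the weight $w_{ij}$ is shown in Equation (6).
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \tag{6}$$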
where $\langle v_i h_j \rangle_{data}$ is the expectation with respect to the data distribution, and $\langle v_i h_j \rangle_{model}$ is the expectation with respect to the model distribution. However, computing $\langle v_i h_j \rangle_{model}$ exactly is computationally intractable, so Hinton proposed the Contrastive Divergence (CD) algorithm to crudely approximate the gradient [25]. First, the CD algorithm uses the input data to initialize the visible layer and then calculates the hidden layer based on the conditional distribution rules. Second, the visible layer is recalculated from the hidden layer. Through this repeated calculation, a reconstruction of the input data is obtained, and the weights are updated according to Equation (7).
$$\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right) \tag{7}$$
where $\eta$ is the learning rate and $\langle v_i h_j \rangle_{recon}$ is the expectation over the reconstruction states, which is calculated by a Gibbs sampler starting from the initialized input data. The DSWNN prototype needs to be pre-trained first, which amounts to training multiple RBMs layer by layer. The output of each RBM is used as the input of the next, higher-level RBM, so that the learning results are passed upward. Once an RBM is trained, the weights between its two layers of neurons are determined and locked. This procedure is conducted in an unsupervised environment, and the weights of the whole network are obtained after the layer-by-layer pre-training. These weights are recorded and used as the prior values for the subsequent supervised training (fine-tuning).
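The pre-training step for a single RBM can be illustrated with a minimal sketch of one-step contrastive divergence (CD-1). This is not the authors' implementation: the function names, the logistic activation, the single Gibbs step, the bias updates, and the toy binary data in the usage note are all assumptions chosen for illustration. Here `a` holds the visible biases and `b` the hidden biases.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, w, b, rng):
    """Return P(h_j = 1 | v) and a binary sample for every hidden unit."""
    probs = [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
             for j in range(len(b))]
    states = [1.0 if rng.random() < p else 0.0 for p in probs]
    return probs, states


def reconstruct_visible(h, w, a):
    """Return P(v_i = 1 | h) for every visible unit."""
    return [sigmoid(a[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_update(v0, w, a, b, eta, rng):
    """One CD-1 step, in place: w_ij += eta * (<v_i h_j>_data - <v_i h_j>_recon)."""
    h0_prob, h0_state = sample_hidden(v0, w, b, rng)   # data-driven hidden layer
    v1 = reconstruct_visible(h0_state, w, a)           # reconstruction of the input
    h1_prob, _ = sample_hidden(v1, w, b, rng)          # hidden layer of the reconstruction
    for i in range(len(a)):
        for j in range(len(b)):
            w[i][j] += eta * (v0[i] * h0_prob[j] - v1[i] * h1_prob[j])
        a[i] += eta * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += eta * (h0_prob[j] - h1_prob[j])
    # Squared reconstruction error, used only to monitor training.
    return sum((x - y) ** 2 for x, y in zip(v0, v1))


def train_rbm(data, n_hidden, eta=0.1, epochs=200, seed=0):
    """Train one RBM on binary rows of `data`; returns weights, biases, error history."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    a = [0.0] * n_visible   # visible biases
    b = [0.0] * n_hidden    # hidden biases
    errors = []
    for _ in range(epochs):
        errors.append(sum(cd1_update(v, w, a, b, eta, rng) for v in data))
    return w, a, b, errors
```

For example, `train_rbm([[1, 1, 0, 0], [0, 0, 1, 1]] * 5, n_hidden=2)` trains on two toy patterns. Stacking would then feed the hidden probabilities of one trained RBM as the input data of the next.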

Small-World Transformation of DSWNN
A small-world network is an intermediate network between a completely random and a completely regular one; it was originally proposed by Watts in 1998 to describe the natural distribution of biological, technological and social networks [26]. Since then, various studies have applied the characteristics of small-world networks to the structural improvement of artificial neural networks (ANNs) [21,22,27]. Summarizing the current research, we consider that the small-world transformation can be performed in two ways (see Figure 3): reconstruct-edge transformation and add-edge transformation. Taking a four-layer BP neural network with four neurons in each layer as an example, the reconstruct-edge transformation first randomly disconnects connections between adjacent-layer neurons and then reconstructs new connections between nonadjacent-layer neurons. The add-edge transformation has no disconnecting procedure; it only adds new connections between nonadjacent-layer neurons without changing any original edges. The positions of the newly added connections are randomly distributed among all network neurons, and the degree of network randomization is described by the probability p:
$$p = \frac{n_{ad}}{n_{or}} \tag{8}$$
where $n_{ad}$ and $n_{or}$ are the numbers of newly added connections and original connections, respectively. Figure 4 illustrates the random add-edge procedure from a regular network to a random one. When p = 0 the network is completely regular, and when p = 1 it is completely random; when p lies between 0 and 1, the network has a small-world property. To probe the intermediate region 0 < p < 1, the characteristic path length L(p) and the clustering coefficient C(p) are used to quantify the small-world structural properties. The characteristic path length L(p) is a global property that measures the average path length across the network, and the clustering coefficient C(p) is a local property that describes the density of connected edges in local areas.
Figure 5 shows how the normalized values of L(p) and C(p) change with p for the four-hidden-layer DSWNN described in Figure 1. As p increases, L(p) drops sharply while C(p) descends relatively slowly. When p approaches 0.1, a large C(p) and a small L(p) are obtained, which indicates that the topology of the DSWNN has the best small-world properties. Therefore, p = 0.1 is selected, so that the number of newly added edges is 0.1 times the number of original connections. This paper uses the add-edge transformation to reconstruct the DSWNN model. Suppose that the DSWNN model has H hidden layers, indexed by i = 1, 2, ..., H, with N neurons in the ith hidden layer and M neurons in the (i+1)th hidden layer. When p = 0, the DSWNN model is a regular network, and the weight matrix $W_i$ between the ith and (i+1)th hidden layers is described by Equation (9). Accordingly, the connection matrix W for the entire set of hidden layers can be expressed as in Equation (10).
When p = 0.1, the DSWNN does not disconnect any connections of the original network and only generates connections between neurons in non-adjacent hidden layers. The diagonal weights in matrix W are unchanged because no original connection is removed. The global matrix $W^{a}$ for the entire set of hidden layers can be represented as Equation (11).
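A block layout consistent with this description is sketched here as an assumption; the exact notation of Equations (10) and (11) may differ. Block rows index the source hidden layer and block columns the target hidden layer. The pre-trained matrices $W_i$ occupy the diagonal blocks, and the added-edge blocks, written $W_a^b$ for source layer a and target layer b > a + 1, fill the upper triangle:

```latex
W = \begin{bmatrix}
W_1 & 0 & \cdots & 0 \\
0 & W_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & W_{H-1}
\end{bmatrix},
\qquad
W^{a} = \begin{bmatrix}
W_1 & W_1^{3} & \cdots & W_1^{H} \\
0 & W_2 & \cdots & W_2^{H} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & W_{H-1}
\end{bmatrix}
```

Under this layout the add-edge transformation only writes into blocks that are zero in W, which matches the statement that the diagonal weights are not changed.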
where $W_a^b$ is the weight matrix of the added edges between layer a and layer b, and its structure is shown in Equation (12).
$$W_a^b = \begin{bmatrix} w_{a11}^{b} & w_{a12}^{b} & \cdots & w_{a1M}^{b} \\ w_{a21}^{b} & w_{a22}^{b} & \cdots & w_{a2M}^{b} \\ \vdots & \vdots & \ddots & \vdots \\ w_{aN1}^{b} & w_{aN2}^{b} & \cdots & w_{aNM}^{b} \end{bmatrix} \tag{12}$$
where $w_{axy}^{b}$ is the weight of the added edge between the xth neuron in layer a and the yth neuron in layer b, and N and M are the numbers of neurons in layers a and b, respectively.
Because the edges are added with probability p, $W_a^b$ is a sparse matrix in which only the non-zero weights represent the randomly added edges. Therefore, transforming the DSWNN model into a small-world one is equivalent to randomly selecting positions in the global weight matrix and assigning them non-zero values. The pre-trained weights are not changed in this procedure.
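The add-edge transformation can be sketched as follows. This is an illustrative assumption rather than the authors' code: the function name, the sparse dictionary representation, and the small Gaussian initial weights are hypothetical, and only hidden-to-hidden connections are counted in $n_{or}$.

```python
import random


def add_edge_transform(layer_sizes, p, seed=0, scale=0.01):
    """Pick random positions for added edges between non-adjacent layers.

    Returns (n_or, added), where `added` maps a layer pair (a, b), b > a + 1,
    to a sparse block {(x, y): weight}. Adjacent-layer (pre-trained) weights
    are never touched.
    """
    rng = random.Random(seed)
    n_layers = len(layer_sizes)
    # n_or: number of original connections between adjacent layers.
    n_or = sum(layer_sizes[i] * layer_sizes[i + 1] for i in range(n_layers - 1))
    # Equation (8): p = n_ad / n_or, so n_ad = p * n_or.
    n_ad = round(p * n_or)
    # Every candidate position (a, b, x, y) between non-adjacent layers.
    candidates = [(a, b, x, y)
                  for a in range(n_layers)
                  for b in range(a + 2, n_layers)
                  for x in range(layer_sizes[a])
                  for y in range(layer_sizes[b])]
    n_ad = min(n_ad, len(candidates))
    added = {}
    for a, b, x, y in rng.sample(candidates, n_ad):
        # Small random weight; fall back to `scale` in the unlikely case of an exact zero.
        weight = rng.gauss(0.0, scale) or scale
        added.setdefault((a, b), {})[(x, y)] = weight
    return n_or, added
```

With five hidden layers of 30 neurons each and p = 0.1, `add_edge_transform([30] * 5, 0.1)` counts 4 × 900 = 3600 original hidden-layer connections and adds 360 new edges.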

Fine-Tuning of the DSWNN Parameters
After the layer-wise pre-training and add-edge small-world transformation, the DSWNN model will be fine-tuned by using BP algorithm. BP is a classical method commonly used for supervised learning to improve the representation of data features and optimize the parameters of hidden layers in the fine-tuning. In this process, the initial parameters of the DSWNN model are composed of the pre-training parameters (weights and bias) and the edge-weights obtained by the small-world transformation. Because the fine-tuning only completes the local search based on these superior initial parameters, the convergence time of the optimization is significantly shortened in this process. After fine-tuning, the globally optimized parameters of the DSWNN model are obtained.
In this paper, the tanh function is regarded as the activation function for hidden layers, and the Softmax as the activation function for the top classifier layer. Moreover, the Cross Entropy [28] shown in Equation (13) is chosen as the cost function C to measure error.
where, k is the number of neurons in the output layer, t is the expected output.
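As a hedged illustration of the classifier output and cost, the sketch below assumes the common categorical form of the cross entropy, C = -Σ t_i ln y_i summed over the k output neurons, with y the Softmax output. The paper's Equation (13) may be written differently; the function names and the small `eps` guard are assumptions.

```python
import math


def softmax(z):
    """Numerically stable Softmax over a list of k output activations."""
    m = max(z)
    exps = [math.exp(value - m) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(t, y, eps=1e-12):
    """Categorical cross entropy between expected output t and actual output y."""
    return -sum(ti * math.log(yi + eps) for ti, yi in zip(t, y))
```

For a one-hot target and a uniform prediction over 10 classes, the cost is ln 10, the value expected from an untrained classifier with 10 output neurons.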
To summarize, the training process of the DSWNN model includes three steps.
(1) Pre-training: In the unsupervised case, each RBM of the DSWNN is trained one by one to mine the feature information of the unlabeled input data.
(2) Small-world transformation: Lock the weight values obtained by pre-training, add new edges according to the probability p, and randomly assign weight values to the newly added edges.
(3) Fine-tuning: Add a classification layer after the last layer of the DSWNN to receive the output feature vector from the last RBM. Train the entire DSWNN model with labeled data in the supervised case, adjusting the global network weights by the BP algorithm.
Figure 6 shows the three training processes of the DSWNN model. The process of RBM training can be regarded as the initialization of the weight parameters for the DSWNN model, which overcomes the tendency of randomly initialized networks to fall into local optima and to train slowly. For the detection of WT abnormality, a large amount of unlabeled SCADA data can be used for pre-training, and a small amount of labeled data can be used for fine-tuning. In this way, the trained DSWNN is better optimized than the deep auto-encoder network or the BP network alone.

Anomaly Detection Based on Adaptive Threshold Estimating Method
In anomaly detection, the trained DSWNN model is used to predict the future values of the SCADA signals, and the prediction error (PE) between predicted value and actual value is used to judge the abnormalities. The PE is defined as Equation (14).
where X is the actual value of the SCADA data and $\hat{X}$ is the output of the DSWNN model. Generally, setting a threshold and comparing it with the PE is the most common and effective way to evaluate whether the wind turbine is in failure. When a wind turbine operates normally, the SCADA signals are all within the threshold range. Once an abnormality occurs, the implicit relationship between these monitoring signals is broken, and one or more signal values suddenly exceed the threshold and raise an alarm. The rule for determining the abnormal condition is defined as Equation (15), where $R_{th}$ is the threshold. However, thresholds are often set over a wide range and remain unchanged once set. On the one hand, faults that stay within the threshold will not be diagnosed; on the other hand, occasional fluctuations caused by the random wind speed will be misdiagnosed as faults. Therefore, this paper presents an adaptive threshold estimation method based on extreme value theory to monitor the trend of the PE and detect its anomalous variation.
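The fixed-threshold rule can be illustrated with a short sketch. It assumes the PE is the plain difference between actual and predicted values and that an alarm is raised when its magnitude exceeds the threshold; the paper's Equations (14) and (15) may define these differently, and the function names are hypothetical.

```python
def prediction_error(actual, predicted):
    """Assumed form of the PE: actual value minus model output, per sample."""
    return [x - x_hat for x, x_hat in zip(actual, predicted)]


def flag_anomalies(pe, threshold):
    """Assumed alarm rule: abnormal when |PE| exceeds the threshold R_th."""
    return [abs(e) > threshold for e in pe]
```

With a single fixed `threshold`, a wide setting misses small faults and a narrow one flags wind-induced fluctuations, which is the limitation that motivates the adaptive threshold.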
Suppose $X_1, X_2, \ldots, X_n$ are n sample vectors of independent, identically distributed random variables with distribution function F(x). Each sample vector $X_i$ contains a certain number of values over a period of time, and $M_n = \max(X_1, X_2, \ldots, X_n)$ denotes the maximum of the n sample vectors. The probability distribution function of $M_n$ can be described as Equation (16).
$$P(M_n \le x) = P(X_1 \le x, \ldots, X_n \le x) = F^{n}(x) \tag{16}$$
In general, $F^{n}(x)$ is unknown, so it is replaced by the extreme value distribution function of the maximum or minimum values. When $n \to \infty$, $F^{n}(x) \to 0$ for any x with F(x) < 1, so the extreme distribution function should be normalized to avoid the degeneration of $M_n$ to a point. Assume that there are two normalization parameters $a_n$ and $b_n$ such that the normalized maximum converges to a non-degenerate distribution function H(x) [29].
where $a_n$ and $b_n$ are the scale parameter and the location parameter, respectively, and $\beta$ is the shape parameter. A large amount of normal SCADA data is used to train the DSWNN model and is also used to calculate the PE values. As the data are mostly normal, the mean value of the PE will be stable, but its variance should be non-stationary. Therefore, the scale parameter $a_n$ and the location parameter $b_n$ can be obtained as follows.
$$a_n(t) = \exp(\delta_0 + \delta_t g(t)) \tag{19}$$
$$b_n(t) = \delta_0 \tag{20}$$
where $\delta_0$ and $\delta_t$ are constant coefficients and g(t) is a function describing the variable operating condition, which is affected by the changing SCADA data. The final adaptive threshold for issuing a warning is then calculated by Equation (21), where p is the confidence limit, which is calculated by the cumulative distribution function (CDF) method. The parameters $a_n$, $b_n$ and $\beta$ can be estimated by the maximum likelihood approach.
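A minimal sketch of how such an adaptive threshold could be evaluated is given below. It assumes that H(x) is the generalized extreme value (GEV) distribution in its standard form, H(x) = exp(-(1 + β(x - b)/a)^(-1/β)), and that the threshold is its quantile at confidence level p. The paper's Equations (17), (18) and (21) may be parameterized differently. The function names are hypothetical, and the location is passed as a separate constant `location`.

```python
import math


def gev_cdf(x, a, b, beta):
    """Standard GEV distribution function with scale a, location b, shape beta."""
    if abs(beta) < 1e-12:                      # Gumbel limit as beta -> 0
        return math.exp(-math.exp(-(x - b) / a))
    s = 1.0 + beta * (x - b) / a
    if s <= 0.0:                               # x lies outside the support
        return 0.0 if beta > 0 else 1.0
    return math.exp(-s ** (-1.0 / beta))


def gev_quantile(p, a, b, beta):
    """Value x with gev_cdf(x) = p, for 0 < p < 1."""
    if abs(beta) < 1e-12:
        return b - a * math.log(-math.log(p))
    return b + a / beta * ((-math.log(p)) ** (-beta) - 1.0)


def adaptive_threshold(g_t, delta0, delta_t, location, beta, p=0.99):
    """Threshold at time t: time-varying scale, constant location."""
    a_t = math.exp(delta0 + delta_t * g_t)     # Equation (19)
    return gev_quantile(p, a_t, location, beta)
```

Because the scale grows with g(t) when `delta_t` is positive, the threshold widens under more variable operating conditions and tightens when conditions are calm. In practice the coefficients would come from a maximum likelihood fit to the PE of normal data, which is not shown here.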

Case Studies
The DSWNN model proposed in this paper is a novel complex neural network with small-world characteristics, which was trained using an unsupervised learning technique. In order to test the performance of the proposed DSWNN model, the following case analyses related to fault identification, prediction and classification are given in this section. The experimental data were one year's SCADA data, which were collected from the SCADA system of thirty 2-MW wind turbines in a wind farm. Additionally, to increase the contrast, the classical deep belief network (DBN) model and deep neural network (DNN) model were used as the comparison methods.
Pitch failures are mainly categorized into pitch sensor faults and pitch actuator faults. Pitch sensor faults arise from dust on the encoder disc, mis-adjustment of the blade pitch bearing, temperature beyond the acceptable range, humidity, or improper calibration. These causes can produce sensor bias or outputs fixed at the last measurement, which results in unbalanced rotation of the rotor. Faults of the blade pitch sensor and actuator appear frequently; they cause structural loading of the turbine due to rotor imbalance and affect the stability of the floating platform. These failures are mainly reflected in the signals of pitch angle, pitch torque, pitch motor, and so on. The specific monitored SCADA parameters are used to train the DSWNN, DBN and DNN models, and the parameter information is listed in Table 1. In the following case analyses, the DSWNN, DBN, and DNN all have the same network structure and use the same training data.

Prediction Analysis for the Pitch Abnormalities (Case A)
The three blades of the wind turbine have three groups of pitch driving devices, which are independent and synchronous. When the wind speed changes suddenly and the blades need to change the pitch angles frequently, the failure of synchronous action of the three pitch drives often leads to multiple failures occurring at the same time. Table 2 shows a record of multiple alarms that occurred on 1 January 2017. It is confirmed that the main reason was that the pitch action of the three blades was out of sync after the last maintenance. This kind of failure is caused by typical mechanical wear, and its fault characteristics will be hidden in SCADA data. Therefore, the relevant signals monitored can be expressed as Equation (22). We applied the dataset X A to train the DSWNN, DBN and DNN models, then calculated the prediction errors by using the recorded failure SCADA data, as shown in Figure 7. It can be seen from Figure 7 that the prediction error obtained by DSWNN fluctuated within the adaptive threshold range before time T3. But after T3, the error approached its upper limit and then gradually exceeded its upper limit. According to the alarm record, the first fault named "Pitch angle 1 out of sync" occurred at 07:58:12. This means that the DSWNN can detect an abnormality approximately 3 h ahead of actual downtime, which can provide sufficient time to take some actions for pitch system maintenance. In addition, from Figure 7a,b, the DNN and DBN models can also predict failures 1.1 h and 2.3 h in advance, respectively. The proposed DSWNN can detect incipient faults earlier than that of the DBN and DNN. Moreover, the error calculated by the DSWNN was the smallest in the three models, which shows that the DSWNN model can extract more sufficient dynamic features from normal SCADA data. Hence, on the one hand, the DSWNN model in terms of describing the dynamic behavior of the pitch system is more accurate. 
On the other hand, the adaptive threshold can effectively track the prediction error, which can increase the adaptability of the prediction models and provide a more accurate judgment basis for the system to reduce false alarms and omissions.

Performance Comparison of Pitch Fault Classification (Case B)
To verify the accuracy of the proposed method for multi-fault classification, nine typical or frequent pitch faults were selected as the classification targets according to the real alarm information from the SCADA system. Table 3 lists the specific failures and their descriptions, in which F1-F9 represent the nine fault alarms and F10 stands for the fault-free status. Similarly, DBN and DNN were used as comparison models, and the three models adopted the same network structure: (i) The number of the input neurons was 12, which corresponded to the 12-dimensional parameters described in Table 1; (ii) The number of the output neurons was 10, corresponding to the 10 classifications in Table 3 respectively; (iii) Five hidden layers and 30 neurons in each layer were selected for all three models. Specifically, the DSWNN model had five RBMs in the pre-training process, and the probability in the small-world transformation process was set as p = 0.1. Compared with the DSWNN, the DBN model had no process of small-world transformation, and it only included two processes: pre-training and fine-tuning. The DNN model was a standard multilayer feed-forward neural network, and its training process followed the error back-propagation principle. In terms of data preparation, the experimental data were divided into 20,000 fragments of training data and 6000 fragments of validation data. Moreover, each failure category contained a certain amount of fault data, where the data distribution is shown in Table 4. It is worth noting that all the data fragments provided are labeled on the basis of the corresponding fault types. During the model training, the DSWNN and DBN were first pre-trained by the data fragments with labels removed, and then they were fine-tuned with these labeled data fragments. Simply, the DNN model used all labeled data for training and validating, which were conducted in a supervised environment.  Figure 8 gives the classification accuracy of the three models. 
By comparison with the misclassified conditions shown in the confusion matrices, the DNN model easily misjudged the faults F3, F5, F6, and F8, which is critical because such misclassified faults could cause serious consequences. F3 and F5 were electrical failures, whose occurrence is strongly random and contingent. F6 and F8 were related to the pitch angle, which is supposed to be monitored through wind speed, wind power, pitch encoder, blade root torque, etc. The reason for these failures is that the random wind speed makes a strong impact on the blades. The essential requirement for diagnosing the above faults is that the classification algorithm should be able to mine implied features from multiple operating data. However, this was exactly what the DNN model lacked, because its network parameters were generated by random initialization without any theoretical basis, and the parameters directly affect the classification results. In contrast, the DSWNN and DBN models used pre-training to obtain better network parameters, and their classification accuracy for all failures was better than that of the DNN model (see Figure 8b,c). However, their accuracy in diagnosing faults F4 and F5 also decreased, which may have been caused by the lack of fault data in the training data. In addition, we recorded the global training errors of the three models during the training process, as shown in Figure 9. It can be seen from the figure that the convergence times of DSWNN, DBN and DNN were 255 s, 210 s, and 168 s, respectively. DSWNN took the longest time because it had the small-world transformation process between pre-training and fine-tuning. In the 0-110 s interval, the training errors of DSWNN and DBN nearly reached a stable stage, because the networks were in the pre-training stage and their multiple RBMs had reached a stable energy state. 
In the range of 110-130 s, DSWNN and DBN changed from unsupervised training to supervised training, and the training error suddenly decreased until a new lower convergence value appeared. Note that at 124 s, the training error of DSWNN increased in a short time, which was due to the additional weights of the add-edges in the network structure after the small-world transformation. The DSWNN retrained the random new add-edges, leading to the short-term error increase.
As seen from the above case studies, the advantages of the DSWNN model appeared mainly in three aspects: (1) the learning ability of the DSWNN model was better than that of the DBN method and much better than that of the traditional DNN method; (2) the DSWNN model had very good sensitivity and accuracy in reflecting condition changes of the wind turbine pitch system; (3) although the DSWNN model was not dominant in time cost, it had a stronger ability to mine deeper feature information from the same data source.

Conclusions
This paper presents a novel DSWNN model based on unsupervised learning for early anomaly detection and fault detection of wind turbines. The DSWNN model combines a deep auto-encoder network with a small-world neural network, which makes it more accurate in simulating the dynamic behavior of wind turbines because it more closely mimics the working process of a natural brain. The case study results support the following conclusions:
(1) The adaptive threshold can effectively track the prediction errors and reduce false alarms. Therefore, the proposed adaptive threshold method based on extreme value theory can be used in real-time monitoring of wind turbines to reduce the impact of wind speed fluctuations and external interference on anomaly detection.
(2) Compared with the DBN and DNN algorithms, the proposed DSWNN model performs better in error prediction, fault classification and learning ability, which benefits from its small-world characteristics and unsupervised pre-training. Although network training takes slightly longer, the model still has broad application prospects and research value.
(3) The strategy combining the DSWNN model with the adaptive threshold predicted the pitch system failure approximately 3 h in advance in the studied case, and it can be used for subsequent anomaly detection and fault diagnosis of wind turbines.
Funding: This research was funded by the National Natural Science Foundation of China, grant numbers 50776005 and 51577008.

Conflicts of Interest:
The authors declare no conflict of interest.