Wind Turbine Anomaly Detection Based on SCADA Data Mining

Abstract: In this paper, a wind turbine anomaly detection method based on generalized feature extraction is proposed. First, wind turbine (WT) attributes collected from the Supervisory Control And Data Acquisition (SCADA) system are clustered with k-means, and the Silhouette Coefficient (SC) is adopted to judge the effectiveness of the clustering. Clustering increases the correlation between attributes within a class and decreases the correlation between classes. Then, the dimensions of the attributes within each class are reduced with t-Distributed Stochastic Neighbor Embedding (t-SNE), so that the low-dimensional attributes reflect the WT attributes more fully and more concisely. Finally, the detection model is trained, and the abnormal or normal state is indicated by the classification result 0 or 1, respectively. Experiments comprising three cases with SCADA data demonstrate the effectiveness of the proposed method.


Introduction
With the increasing exhaustion of resources such as minerals and petroleum, wind energy is widely used due to its sustainability and cleanliness. By 2020, wind power will account for 12 percent of global power generation and become the main pillar of clean energy [1,2]. With the continuing growth of global wind power capacity, condition monitoring (CM) of WTs is increasingly important to reduce operation and maintenance cost [3].
CM is the process of monitoring the operating parameters of a physical system, and it has attracted considerable research in the industrial field. CM is applied to anomaly detection [4][5][6] and fault diagnosis [7,8] of wind turbines. In [4], an evaluation index of wind turbine generator operating health based on the relationships within SCADA data was presented. In [5], a framework was developed to monitor the health of a wind turbine using an undercomplete autoencoder. In [6], a method for detecting wind turbine generator slip ring damage through temperature data analysis was presented. In [7], a novel fault diagnosis and forecasting approach based on a support vector regression model was proposed. In [8], a novel parameter-varying model for wind turbine systems was established, which was used for real-time monitoring and fault reconstruction in wind turbine systems.
In recent research, many methods have been proposed for anomaly detection. The existing methods can be divided into three categories: model-based [9], signal-based [10], and data-driven [11,12]. In model-based approaches, the nonlinear relationships among the sub-components of a wind turbine make it difficult to build numerical models [13]. Signal-based methods analyze the mechanical signals emitted during operation. However, signal acquisition requires the installation of sensors, which adds cost [14]. Data-driven methods, and machine learning techniques in particular, model wind turbine behavior with supervisory control and data acquisition (SCADA) data [15,16]. SCADA, which continues to develop as a technology for monitoring and controlling distributed processes, provides hundreds of condition variables such as temperatures, wind parameters, and energy conversion parameters [17]. Recently, SCADA has been widely applied in microgrids [18,19] based on renewable energy such as solar energy [20], wind energy [21], and biological energy [22]. SCADA technology is suitable for data-driven methods and big data analysis. Therefore, the rich data of the SCADA system make anomaly detection of wind turbines more flexible and reliable.
Until now, various data-driven methods using SCADA data, such as the fuzzy inference system (FIS) [23], support vector machine (SVM) [24], and deep neural network (DNN) [25], have been widely used. In [11], a generalized wind turbine anomaly detection model based on fuzzy theory was proposed. In [12], an SVM-based method for fault detection in wind turbines was proposed, and the operating states of the wind turbine were classified. In [13], a framework based on a deep neural network was developed to monitor anomalies of WT gearboxes.
In summary, the existing problems can be listed as follows: (1) It is unreliable to select key attributes based on manual experience and judgment when establishing the anomaly detection model. (2) Most existing methods address only single anomaly detection; there are fewer methods for multi-anomaly detection, and their detection accuracy is lower.
To address the above problems, this paper proposes the following method: first, the attributes collected from SCADA are clustered, and then their dimensions are reduced. Finally, the multi-anomaly detection model is trained to realize the anomaly detection.
The contributions of this paper include: (1) A data preprocessing model is proposed. WT attributes collected from the SCADA system are clustered by k-means, and then a method of dimension reduction within each class based on t-SNE is proposed. (2) A detection model based on the deep neural network is proposed. The WT state is indicated by the classification result 0 (abnormal) or 1 (normal). (3) A multi-anomaly detection method is proposed, and it achieves good performance.
The rest of this article is organized as follows: the architecture of the proposed anomaly detection method is described in Section 2. The data feature extraction is given in Section 3, and the architecture of the detection model is given in Section 4. Experimental cases are given in Section 5. Conclusions are made in Section 6.

Architecture of the Proposed Method
The anomaly detection model of this paper can be divided into two phases, as summarized in Figure 1. Phase 1: Data feature extraction. This process consists of clustering and dimension reduction, which provide valid input for the detection model of Phase 2. Phase 2: Model generation. The deep neural network model is trained to classify the input data. The flowchart of the proposed anomaly detection method is shown in Figure 2.
(1) The attributes collected by the SCADA system are clustered by k-means after the number of clusters is determined, and the SC is adopted to judge the effectiveness of the clustering. (2) The attributes within each class are reduced to a fixed dimension with t-SNE, and the reduced attributes of all categories together are taken as the row input of the deep neural network. (3) The input data are converted into square form to generate many WT attribute images, and the state of the WT is determined by the classification results after training on abundant images.

Data Feature Extraction
To reduce the amount of data and eliminate data redundancy, a method that first clusters the attributes and then reduces the dimension within each class is put forward, which can increase the accuracy of the model. The process of data feature extraction is described in detail below.

K-Means Clustering
The k-means algorithm can be applied to divide the data into k clusters so that data in the same cluster are similar, while data in different clusters are dissimilar.
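The clustering step can be illustrated with a minimal sketch of Lloyd's algorithm, the usual way to run k-means. It is not the authors' implementation: the function name `kmeans`, the random initialization, and the toy points are assumptions made for illustration. In the setting of this paper, each "point" would be one SCADA attribute represented by its normalized time series.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on a list of equal-length vectors.

    Returns (labels, centroids). Initial centroids are k distinct points
    chosen at random (an assumption; other initializations are possible).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids

# Toy data: two well-separated groups of three points each.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

The result depends on the initial centroids, which is one reason a validity index such as the SC is useful for comparing runs.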

Clarify the Maximum Number of Clusters
In [26], the distance cost function is applied as the spatial clustering validity function; the spatial clustering result is optimal when the distance cost function reaches its minimum value, and the maximum number of clusters is determined as:

$$k_{\max} \le \sqrt{n} \tag{1}$$

where $n$ is the number of objects to be clustered (here, the number of attributes).
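As a small worked example, the upper bound on the number of clusters can be computed directly. The sketch assumes the square-root rule k_max ≤ √n that the experiments later apply to n = 64 attributes; the function name `max_clusters` is hypothetical.

```python
import math

def max_clusters(n_objects: int) -> int:
    """Largest integer k that satisfies k <= sqrt(n_objects)."""
    return math.isqrt(n_objects)

# With 64 attributes the candidate cluster numbers run from 2 to 8.
candidate_ks = list(range(2, max_clusters(64) + 1))
```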

0-1 Normalization Processing
It is difficult to compare data from different dimensions, so the data must be normalized. The data are converted to dimensionless values in the range [0, 1] so that different parameters can be compared:

$$V_i' = \frac{V_i - \min(A)}{\max(A) - \min(A)} \tag{2}$$

where $V_i$ represents the value of each attribute, $V_i'$ is the normalized value, $\min(A)$ represents the minimum value in a class of attributes, and $\max(A)$ represents the maximum value in a class of attributes.
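The 0-1 normalization can be sketched in a few lines. The function name `min_max_normalize` and the handling of a constant attribute (mapped to 0.0) are assumptions; the text does not say how that edge case is treated.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: three temperature readings in arbitrary units.
scaled = min_max_normalize([10.0, 15.0, 20.0])
```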

Determine the Number of Clusters
The feature attributes can be divided into 2 to 8 categories, and the silhouette coefficient (SC) is used to estimate the effectiveness of the clustering. The SC, which combines the degree of cohesion and separation, takes values from −1 to 1, and a larger value represents a better clustering effect. The calculation process is as follows: (1) calculate the average distance between $X_i$ and all other elements within the same cluster, denoted by $a_i$; (2) select a cluster $b$ that does not contain $X_i$ and calculate the average distance between $X_i$ and all elements of $b$; after traversing all other clusters, the smallest such average distance is denoted by $b_i$. The formula is:

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{3}$$

where $S_i$ represents the silhouette coefficient of $X_i$. Equation (3) is evaluated for each candidate number of clusters K to relate the SC to K.
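The two-step calculation above can be sketched as follows. This is an illustration rather than the authors' code: it assumes Euclidean distance, takes the SC of a clustering to be the mean of the per-element values, and scores singleton clusters as 0, which is a common convention. At least two clusters are required.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean of S_i = (b_i - a_i) / max(a_i, b_i) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a_i: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 1-D data: a sensible grouping versus a deliberately poor one.
toy = [[0.0], [1.0], [10.0], [11.0]]
sc_good = silhouette_coefficient(toy, [0, 0, 1, 1])
sc_bad = silhouette_coefficient(toy, [0, 1, 0, 1])
```

Repeating the calculation for each candidate K and keeping the K with the largest value mirrors the selection procedure described in the text.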

T-SNE Dimensionality Reduction
The deep neural network requires fixed-dimensional input data. However, the number of attributes after dimension reduction with traditional methods such as Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA) is not fixed. To solve this problem, t-SNE, a method for exploring high-dimensional data, is adopted; it can reduce data with hundreds or thousands of dimensions to two or three dimensions [27]. All classes of attributes are reduced to the same fixed dimension, which generates valid input data for the deep neural network model.
The original data are represented as X = {x1, x2, ..., xn}, and the new data after dimension reduction with t-SNE are represented as Y = {y1, y2, ..., yn}. First, the perplexity is set to Perp, the number of iterations to T, the learning rate to η, and the momentum to α(t). The parameters are then adjusted repeatedly until a relative optimum is reached through Equations (4)-(8). The similarity between high-dimensional data points is calculated by Equation (4). A Gaussian distribution is adopted to transform the distances between data points into a probability distribution in the high-dimensional space, as shown in Equation (5). After the mapping of Equation (6), points separated by moderate distances in the high-dimensional space are placed farther apart in the low-dimensional space. The effectiveness of the dimensionality reduction is measured by Equation (7): the closer its value is to zero, the better the effect.
Here $\sigma_i$ is the bandwidth of the Gaussian centered on data point $x_i$; it differs from point to point and is found by binary search so that the resulting distribution matches the given perplexity Perp.
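The binary search for σ_i can be sketched as below. The sketch assumes the standard t-SNE definitions: the conditional probabilities are Gaussian weights normalized over the other points, and the perplexity is 2 raised to their Shannon entropy. The function names, search bounds, and tolerance are illustrative assumptions.

```python
import math

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point i, given squared distances to every other point."""
    d_min = min(sq_dists)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** H(P), where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perp, lo=1e-3, hi=1e3, tol=1e-5, max_iter=200):
    """Binary search for the sigma whose perplexity matches target_perp."""
    mid = (lo + hi) / 2.0
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        perp = perplexity(conditional_probs(sq_dists, mid))
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = mid  # distribution too flat: shrink sigma
        else:
            lo = mid  # distribution too peaked: grow sigma
    return mid

# Squared distances from one point to its four neighbours (toy values).
toy_sq_dists = [1.0, 4.0, 9.0, 16.0]
sigma_i = find_sigma(toy_sq_dists, target_perp=2.0)
```

The search works because the perplexity grows monotonically with σ: a small σ concentrates the probability on the nearest neighbour, and a large σ spreads it evenly.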
Here C denotes the loss function. The gradient of the loss function is derived, and the mapping Y in the low-dimensional space is optimized by gradient descent, as shown in Equation (8).
After k-means clustering, the WT attributes are divided into K classes, and each class of attributes is reduced to N (2 or 3) dimensions. The new attributes after dimensionality reduction therefore have X = K × N dimensions, which can be used as the row input for the deep neural network.
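For reference, the standard t-SNE formulation is summarized below. It is offered as a sketch of what Equations (4) to (8) most likely express; the correspondence to the paper's own numbering and notation is an assumption. Here n is the number of data points, and the last line is written with a minus sign so that it performs gradient descent with learning rate η and momentum α(t).

```latex
% High-dimensional similarities (Gaussian kernel with per-point bandwidth)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student t-distribution, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}

% Loss: Kullback-Leibler divergence between P and Q
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

% Gradient and update with learning rate and momentum
\frac{\partial C}{\partial y_i}
  = 4 \sum_{j} (p_{ij} - q_{ij}) (y_i - y_j) \left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}

Y^{(t)} = Y^{(t-1)} - \eta \frac{\partial C}{\partial Y}
          + \alpha(t) \left(Y^{(t-1)} - Y^{(t-2)}\right)
```

The heavy tail of the t-distribution in the second line is what allows moderately distant points to be placed farther apart in the low-dimensional map.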

Architecture of Detection Model
The deep neural network is adopted for anomaly detection. In this paper, the input data need to be organized as normalized WT attribute images of the same pixel size before being fed into the model for anomaly detection. The architecture of the detection model is described in detail below.
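One way to organize the reduced attributes into square images is sketched below. The text does not spell out the construction, so the sketch makes two assumptions: each image covers a window of consecutive time samples, and image rows index the reduced attributes while columns index time. With K = 7 classes and N = 3 dimensions, as used later in the experiments, each sample has 21 values and a window of 21 samples gives a 21 × 21 image. The function name `build_images` is hypothetical.

```python
def build_images(feature_rows, size):
    """Cut a sequence of samples into non-overlapping size x size images.

    feature_rows: list of time samples, each a list of `size` reduced attributes.
    Returns a list of images; image[r][c] is attribute r at time step c.
    """
    images = []
    for start in range(0, len(feature_rows) - size + 1, size):
        window = feature_rows[start:start + size]
        # Transpose so that each image row follows one attribute over time.
        images.append([list(row) for row in zip(*window)])
    return images

# Synthetic samples: value = 100 * time index + attribute index.
samples = [[100 * t + f for f in range(21)] for t in range(50)]
images = build_images(samples, size=21)
```

Samples left over at the end of the sequence (here the last 8) are dropped; overlapping windows would be an alternative design if more images are needed.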

Deep Neural Network
The deep neural network has been widely applied in research [28], and it performs well in tasks such as data mining and image classification [29][30][31]. The proposed model consists of a normalization layer, two convolutional layers, two pooling layers, and a fully connected classification layer.

Description of Each Layer
Normalization Layer. The normalization layer is added because the inputs of the deep neural network need to be normalized to the same size. First, the maximum and minimum values of the image input to the normalization layer, together with their positions, are found. Then, the image is normalized to the required size by down-sampling. Finally, the maximum and minimum values are written back at their corresponding positions.
Convolutional Layer. The convolutional layer is the primary component of the deep neural network and is used for feature extraction. A convolutional layer includes two operations: convolution and nonlinearity. Each mapping is a feature representation of the input image. The convolution operation is given by Equation (9), where * denotes the convolution operation, $y_j$ is the j-th output feature map, $k_{ij}$ is the convolutional kernel, and $x_i$ is the i-th input. The convolution algorithm reduces the number of free variables through sparse connections (A) and weight sharing (B), which improves the generalization performance of the network.
A. Sparse connections are crucial in the deep neural network [32]: each neuron is connected to only a small part of the input. Although the direct connections are sparse, deeper units can interact indirectly with a larger part of the input, as illustrated in Figure 3. Layer M − 1 is the input layer, and the input of hidden layer M is the output of layer M − 1. Each neuron in layer M accepts input from three neurons of the previous layer, and each neuron in layer M + 1 receives input from three neurons of layer M. Between adjacent layers the receptive field is therefore 3, while between layers M + 1 and M − 1 it is 5. Sparse connections describe the complex interactions between units effectively, and the risk of overfitting is reduced because there are fewer parameters.
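The receptive field sizes quoted above (3 for adjacent layers, 5 across two layers) follow from a simple recurrence, sketched here for a stack of identical one-dimensional layers. The function name and the stride parameter are illustrative; the text only describes connections to three neighbouring neurons with unit stride.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Number of input units seen by one neuron after `num_layers` layers."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                     # spacing between neighbouring units
    return field

rf_adjacent = receptive_field(1)    # layer M seen from layer M - 1
rf_two_layers = receptive_field(2)  # layer M + 1 seen from layer M - 1
```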
B. Weight sharing refers to using the same convolution kernel to complete the convolution operations over the whole image [33,34]. The convolution process is shown in Figure 4. When the size of the input WT image $x_i$ is $m_1 \times m_1$ and the size of the convolution kernel $k_{ij}$ is $a \times b$, the size of the output feature $y_j$ after the convolution operation is $(m_1 - a + 1) \times (m_1 - b + 1)$. The bias is added to the convolution result, and the sum is then input to the nonlinear activation function. The non-saturating activation function ReLU [32] is adopted in this paper; the operation is:

$$y_{m,n} = \max(0, x_{m,n}) \tag{10}$$

where (m, n) indexes a pixel in the feature map, $x_{m,n}$ is the original value at position (m, n), and $y_{m,n}$ is the output value of the ReLU. The process of the convolutional layer is shown in Figure 5.
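The output size and the bias-plus-ReLU step can be checked with a minimal single-channel sketch. It is an illustration under stated assumptions, not the authors' network: it handles one input map and one kernel, and it slides the kernel without flipping it (cross-correlation), which is how most deep learning libraries implement "convolution". The kernel size 4 × 5 is arbitrary.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over the image, add the bias, apply ReLU.

    For an m x m image and an a x b kernel the output has
    (m - a + 1) rows and (m - b + 1) columns.
    """
    m_rows, m_cols = len(image), len(image[0])
    a, b = len(kernel), len(kernel[0])
    out = []
    for r in range(m_rows - a + 1):
        row = []
        for c in range(m_cols - b + 1):
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(a) for j in range(b))
            row.append(max(0.0, s + bias))  # ReLU: max(0, x)
        out.append(row)
    return out

# A 21 x 21 image of ones with an arbitrary 4 x 5 kernel of ones.
ones_image = [[1.0] * 21 for _ in range(21)]
ones_kernel = [[1.0] * 5 for _ in range(4)]
feature_map = conv2d_valid(ones_image, ones_kernel)
```

Because the same kernel weights are reused at every position, the layer has only a × b weights plus one bias, regardless of the image size.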

Pooling Layer. Maximum pooling is adopted in this paper so that the deep neural network can adapt to small changes in the WT images. First, the input WT images are divided into several non-overlapping rectangular regions of the same size. Then, the maximum value in each rectangular region is obtained by the maximum pooling operation. Figure 6 illustrates a maximum pooling operation.
Classification Layer. The obtained features are converted into one-dimensional vectors and then input to the classification layer. The sigmoid activation function is adopted in this paper. The classification results 0 and 1 are used to determine the status of the WT, where 0 is abnormal and 1 is normal, as shown in Equation (11):

$$\hat{y} = \begin{cases} 0, & y < 0.5 \\ 1, & y \ge 0.5 \end{cases} \tag{11}$$

where $y$ is the output of the classification layer and $\hat{y}$ indicates the resulting state.
The classification accuracy of the prediction is divided into three parts: the abnormal accuracy, the normal accuracy, and the total accuracy, represented by Q1, Q2, and Q and given in Equations (12), (13), and (14), respectively. In these equations, TA is the number of true abnormal results, FA the number of false abnormal results, TH the number of true healthy results, and FH the number of false healthy results.
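The decision rule and the three accuracies can be sketched as follows. Equations (12) to (14) are not reproduced in this text, so the definitions used here are an assumption: Q1 is taken as the share of actually abnormal samples that are detected, TA / (TA + FH); Q2 as the share of actually normal samples that are recognized, TH / (TH + FA); and Q as the share of all samples classified correctly. This reading is consistent with the total accuracy later reported as the mean of the two class accuracies on balanced test sets, but other definitions are possible.

```python
def classify(outputs, threshold=0.5):
    """Decision rule: 1 (normal) if the output reaches the threshold, else 0 (abnormal)."""
    return [1 if y >= threshold else 0 for y in outputs]

def accuracies(actual, predicted):
    """Return (Q1, Q2, Q) under the assumed definitions described above.

    Requires at least one abnormal (0) and one normal (1) sample in `actual`.
    """
    pairs = list(zip(actual, predicted))
    ta = sum(1 for a, p in pairs if a == 0 and p == 0)  # true abnormal
    fa = sum(1 for a, p in pairs if a == 1 and p == 0)  # false abnormal
    th = sum(1 for a, p in pairs if a == 1 and p == 1)  # true healthy
    fh = sum(1 for a, p in pairs if a == 0 and p == 1)  # false healthy
    q1 = ta / (ta + fh)       # abnormal accuracy (assumed definition)
    q2 = th / (th + fa)       # normal accuracy (assumed definition)
    q = (ta + th) / len(pairs)  # total accuracy
    return q1, q2, q

# Toy example: four abnormal and four normal samples with made-up outputs.
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_outputs = [0.1, 0.2, 0.3, 0.7, 0.9, 0.8, 0.6, 0.4]
toy_predicted = classify(toy_outputs)
q1, q2, q = accuracies(toy_actual, toy_predicted)
```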

Training Process of the Model
The proposed model is trained by back propagation (BP) of gradients. Parameters are updated by Equation (15), where i is the iteration index, ∆ω is the dynamic variable, θ is the momentum value, ξ is the weight decay, and η is the learning rate. Weights and biases are initialized to 0.
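Equation (15) is not reproduced in this text. A common momentum update with weight decay that uses exactly the symbols defined above is sketched below; treating it as the form of Equation (15) is an assumption, and L (the training loss) is a symbol introduced here for illustration.

```latex
\Delta\omega_{i+1} = \theta\,\Delta\omega_{i}
                     - \xi\,\eta\,\omega_{i}
                     - \eta\,\left.\frac{\partial L}{\partial \omega}\right|_{\omega_{i}},
\qquad
\omega_{i+1} = \omega_{i} + \Delta\omega_{i+1}
```

In this form the momentum term θ∆ω carries over part of the previous step, the weight decay term ξηω shrinks the weights toward zero, and the last term is the usual gradient step.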

Experiments and Discussion
In this section, three experimental cases are conducted to evaluate the effectiveness of the proposed detection method. Case 1: single anomaly detection of the 1st attribute. Case 2: multi-anomalies detection of the 6th attribute. Case 3: multi-anomalies detection of multi-attributes, for which the 1st and 7th attributes were selected. The experimental environment is as follows. Software: Matlab (2018a) and PyCharm (2017.1); CPU: Intel(R) Core(TM) i7-8750H @ 2.21 GHz; memory: 16 GB; GPU: NVIDIA GeForce GTX 1060; hard disk: 1 TB.

Data Description
The experimental data were collected from a wind farm in the south of China. There are 33 WTs in the wind farm, and WT 8 was selected for this research. Figure 7 shows the sensor structure of the WT. The SCADA data, collected at 10-minute intervals, are used in the experiments. The data of this wind farm are well collected, with a complete record of anomalies. Figure 8 shows an image of the wind farm. Table 1 shows the parameters of the WTs.

Model Parameters Setting
K-means clustering. There are 64 attributes in Table 1, so the attributes are divided into at most eight categories according to the formula $k_{\max} \le \sqrt{64} = 8$. After normalization and clustering, the relation between the cluster number K and the SC is shown in Table 2 and Figure 9. The SC is optimal (0.8814) when K = 7. Therefore, the attributes are divided into seven categories.
t-SNE dimension reduction. The attribute dimension of each class is not less than 3, so each class is reduced to 3 dimensions. After repeated training, the parameter settings of each class are shown in Table 3. Through the data preprocessing, the new attributes have X = 7 × 3 = 21 dimensions and are used as the input of the deep neural network.
Table 3. t-SNE parameter settings of each class.
Deep neural network model. Each input image is normalized to 21 × 21 in this experiment; the other settings are shown in Table 4. The number of second-level convolution kernels is obtained through repeated training. The specific training parameters of the optimal model obtained through multiple experiments are shown in Table 5. Based on Tables 4 and 5, the test example is shown in Figure 10. The normalized WT images are represented by C1, S1, C2, and S2.

Cases Analysis
The experimental data include 20,000 training images and 100 test images. The ratio of normal data to abnormal data is 1:1. The size of the input picture is 21 × 21. The following three cases were conducted: (1) Single anomaly detection of the 1st attribute. (2) Multi-anomalies detection of the 6th attribute. (3) Multi-anomalies detection of multi-attributes.

Case 1: Single Anomaly Detection of 1st Attribute
The anomaly is overheating of the gearbox output shaft. Figure 11 compares the normal and abnormal images. The size of each image is 21 × 21, and every three rows of the image belong to one category. Figure 11a shows normal states at different times, and Figure 11b shows abnormal states at different times. The first three rows of the image represent the 1st attribute. The model based on the deep neural network is used to estimate the state of the WT. The test samples are randomly selected. Figure 12 shows the five test experiments. The experiment includes 48 normal and 52 abnormal data. Gray lines represent actual values, and the five other colored lines represent predicted values. If the output value is less than 0.5, the state is 0 (abnormal); otherwise the state is 1 (normal). The results are shown in Table 6. Table 7 and Figure 13 show the accuracies Q1, Q2, and Q. From the experimental results, it can be concluded that the proposed method is effective in single anomaly detection.
Figure 13. Accuracy of Q1, Q2, Q.

Case 2: Multi-Anomalies Detection of 6th Attribute
The anomalies are a reduced generator speed and a reduced gearbox speed. Figure 14 compares the normal and abnormal images. Figure 14a shows normal states at different times, and Figure 14b shows abnormal states at different times. Rows 16 to 18 indicate the 6th attribute. Figure 15 shows the five test experiments. The experiment includes 50 normal and 50 abnormal data. Table 8 shows the results of the five test experiments. Table 9 and Figure 16 show the accuracies Q1, Q2, and Q. The average accuracy is 95.4%, which is higher than that of Case 1. From the experimental results, it can be concluded that the proposed method is effective in multi-anomaly detection.

Case 3: Multi-Anomalies Detection of Multi-Attributes
The anomalies are increases in the gearbox oil temperature and the gearbox input shaft temperature, which affect the 1st and 7th attributes. The attributes and anomalies are selected randomly. Figure 17 compares the normal and abnormal images. Figure 17a shows normal states at different times, and Figure 17b shows abnormal states at different times. Rows 1 to 3 indicate the 1st attribute, and rows 19 to 21 indicate the 7th attribute. Figure 18 shows the five test experiments. The experiment includes 50 normal and 50 abnormal data. Table 10 shows the five accuracy results. The average accuracy for the normal state is 95.6%, and the average accuracy for the abnormal state is 96%. Therefore, the average accuracy of the five test experiments is 95.8%, which is higher than that of Case 1. From the experimental results, it can be concluded that the proposed method is effective in multi-anomalies detection of multi-attributes.
To further verify the effectiveness of the proposed method, two other methods are used for comparison: (1) the BPNN method and (2) the SVM method. The experimental results are shown in Table 11. The average accuracy of the proposed method is 95.8%, which is the best performance in the experiment.

Conclusions
In this paper, a wind turbine anomaly detection method based on SCADA data mining is proposed. First, WT attributes collected from the SCADA system are clustered by k-means, and then a method of dimension reduction within each class based on t-SNE is proposed. Finally, the detection model is trained, and the abnormal or normal state is indicated by the classification result 0 or 1, respectively. Three cases are conducted in this paper to demonstrate the effectiveness of the proposed method. The results show that the proposed method performs well in all three cases: (1) single anomaly detection of the 1st attribute, (2) multi-anomalies detection of the 6th attribute, and (3) multi-anomalies detection of multi-attributes. In the future, we will continue our research on anomaly detection and develop more effective deep learning methods to predict anomalies before they occur.