Wind Turbine Condition Monitoring Using the SSA-Optimized Self-Attention BiLSTM Network and Changepoint Detection Algorithm

Condition-monitoring and anomaly-detection methods used for the assessment of wind turbines are key to reducing operation and maintenance (O&M) costs and improving reliability. In this study, based on the sparrow search algorithm (SSA), bidirectional long short-term memory networks with a self-attention mechanism (SABiLSTM), and a binary segmentation changepoint detection algorithm (BinSegCPD), a condition-monitoring method (SSA-SABiLSTM-BinSegCPD, SSD) for wind turbines is proposed. Specifically, the self-attention mechanism, which can mine the nonlinear dynamic characteristics and spatial–temporal features inherent in the SCADA time series, was introduced into a two-layer BiLSTM network to establish a normal-behavior model for wind turbine key components. Then, owing to its advantages in search precision and convergence rate, the sparrow search algorithm was employed to optimize the constructed SABiLSTM model. Moreover, the BinSegCPD algorithm was applied to the predicted residual sequence to achieve the automatic identification of deterioration conditions for wind turbines. Case studies conducted on multiple wind turbines located in south China showed that the established SSA-SABiLSTM model was superior to other contrast models, achieving a better prediction precision in terms of RMSE, MAE, MAPE, and R2. The MAE, RMSE, and MAPE of SSA-SABiLSTM were 0.2543 °C, 0.3412 °C, and 0.0069, which were 47.23%, 42.19%, and 53.38% lower than those of SABiLSTM, respectively. The R2 of SSA-SABiLSTM was 0.9731, which was 4.6% higher than that of SABiLSTM. The proposed SSD method can detect deterioration conditions 47–120 h in advance and trigger fault alarm signals approximately 36 h ahead of the actual failure time.


Introduction
In recent years, wind energy, as a rapidly developing and inexhaustible clean renewable energy source, has become an indispensable force in addressing the world energy problems of fossil energy depletion and ecological environment deterioration [1][2][3]. According to the latest statistics [4] from the Global Wind Energy Council (GWEC), by the end of 2022 the newly installed and cumulative capacities of wind turbines had reached 78 GW and 906 GW, respectively. However, most wind turbines are located in remote areas (e.g., mountains, deserts, and offshore) and usually operate for long periods in severe weather conditions and complex geographical environments, which leads to frequent failures and even shutdowns [5]. Moreover, unexpected faults and shutdowns result in higher operation and maintenance (O&M) costs and lower reliability of wind turbines, which may reduce the financial gains from wind farms and have a significant impact on the thriving wind power industry [6]. Statistically, the O&M cost of an onshore wind turbine accounts for approximately 10–15% of the total production cost, while this figure is as high as 20–30% for offshore wind turbines [7]. Therefore, to minimize the O&M cost and increase the safety and reliability of wind turbines, it is essential to investigate wind turbine condition-monitoring (WTCM) technologies for the early detection of potential faults, thus avoiding secondary damage and even catastrophic accidents [8].
To date, extensive research on WTCM methods has been conducted by numerous scholars in the field. These methods can generally be divided into two categories: physical-model-based and data-driven approaches. Physical-model-based approaches, typically including the constrained Kalman filter [9], parity equation [10], T-S fuzzy approach [11], and observer-based approach [12], have been widely applied in the WTCM field. Nevertheless, with the increasing scale and complexity of wind turbines, establishing precise mathematical models for them becomes difficult, which severely restricts the use of physical-model-based methods in practical engineering tasks.
In contrast, with the development of acquisition, transmission, and storage technologies, data-driven methods, which rely only on the measured data instead of accurate physical or mathematical knowledge, have become an attractive choice in the WTCM field. Recently, numerous data-driven methods have been proposed in the literature and widely employed for WTCM, including vibration signal analysis [13][14][15][16], oil signal analysis [17], acoustic emission signal monitoring [18,19], electrical signal analysis [20], and others. However, the above-mentioned methods require the installation of additional signal acquisition equipment, which results in a substantial increase in the investment cost [21].
Practically, for most wind farms, the supervisory control and data acquisition (SCADA) system has been widely deployed to measure and record large high-dimensional operation data. These SCADA data contain continuous parameters (e.g., meteorological conditions, temperature, pressure, power, and electrical measurements) and discrete information (e.g., startups, shutdowns, fault records, etc.), which can reflect the wind turbine's operation or health conditions in a timely manner. Meanwhile, due to the advantages of their powerful feature representation capacity, artificial intelligence (AI) algorithms (e.g., machine learning and deep learning) have achieved a widespread application and development in many fields, presenting technological advances in their storage and processing power. Consequently, based on AI algorithms, large numbers of WTCM methods using SCADA data have been proposed in the literature and proved to be effective and superior, and, at present, they are research hotspots in the field of wind power [22][23][24].
Recently, numerous SCADA-AI-based WTCM methods used for potential fault-detection purposes have been studied extensively in research, including support vector machines (SVMs), XGBoost, back-propagation neural networks (BPNNs), Gaussian process, restricted Boltzmann machine (RBM), auto-encoder (AE), and stacked denoising auto-encoder (SDAE). Dhiman H.S. et al. [25] proposed a data-driven anomaly detection approach for a wind turbine gearbox by using a twin support vector machine (TWSVM). Trizoglou P. et al. [26] utilized the extreme gradient-boosting (XGBoost) algorithm to construct a normal-behavior model of a wind turbine generator to detect deterioration. Sun P. et al. [27] established a generalized method for the detection of potential faults in wind turbines based on backpropagation neural networks (BPNNs) through the use of SCADA data. Infield D. et al. [28] constructed a SCADA-based potential fault-detection approach for wind turbines by using the Gaussian process (GP). Meyer A. et al. [29] explored a multi-target model as a novel method for detecting a wind turbine's normal behavior. Yang W. et al. [30] designed an unsupervised early fault-detection approach for wind turbine anomaly detection purposes by applying a spatiotemporal pattern network (STPN) and stacked restricted Boltzmann machine (RBM). Renström N. et al. [31] designed a condition-monitoring framework based on an auto-encoder (AE) by using SCADA data and investigated the various hyperparameters that affect the model's performance. By utilizing stacked denoising auto-encoders (SDAEs), Chen J. et al. [32] introduced a multivariate analysis method to detect early faults in wind turbines.
It has been shown in the literature that the methods mentioned above are effective and have made great progress in the WTCM field. However, these methods mainly focus on the nonlinear correlation between different monitoring variables without considering the autocorrelation (i.e., temporal correlation) of each monitoring variable, which results in a limited model performance. In fact, SCADA data are essentially time series, and the monitored variables change dynamically over time; therefore, this temporal correlation should be considered in the modeling process.
Sensors 2023, 23, 5873
Recurrent neural networks (RNNs) are a class of neural networks that efficiently process sequential data with short-term memory capability. However, when the input sequence is relatively long, vanishing or exploding gradients occur, which is also known as the long-term dependencies problem. The long short-term memory (LSTM) and gated recurrent unit (GRU) networks, as commonly used RNN variants, solve the problems of gradient explosion and gradient vanishing in traditional RNNs by introducing the gating mechanism. RNNs have been widely applied in the field of wind power generation, such as power prediction, early anomaly detection, and potential fault diagnosis. Zhang J. et al. [33] applied the long short-term memory network (LSTM) to predict the power of wind turbines and employed the Gaussian mixture model (GMM) to explore the distribution characteristics of the predicted residual. Lei J. et al. [34] designed a novel condition-monitoring framework by using an end-to-end LSTM network to mine the temporal features inherent in a multivariate SCADA time series. By combining the auto-encoder neural network and LSTM network, Chen H. et al. [35] proposed a novel early anomaly-detection approach for wind turbine key components. The convolutional neural network (CNN) and gated recurrent unit (GRU) network were combined by Kong Z. et al. [36] to learn the spatial-temporal features inherent in SCADA data. The attention mechanism (AM) [37] can optimize resource allocations and make RNN models focus on more critical and highly relevant input features, thereby further improving the prediction accuracy of RNN models. A novel anomaly detection approach was developed by Xiang L. et al. [38] by using CNN and attention-based LSTM networks to monitor the condition of wind turbines. Case studies demonstrated that the designed method was more effective and reliable in detecting potential anomalies for wind turbines.
However, in the parameter initialization of the model training stage for the above-mentioned condition-monitoring models, the initial weight and bias parameters were randomly selected, which increases their potential to easily become stuck in local optima. In addition, numerous studies propose that Swarm Intelligence Optimization Algorithms (SIOA) [39][40][41][42] provide an effective solution for the iterative optimization process of model training.
To sum up, scholars have conducted numerous studies on the condition-monitoring and potential anomaly-detection methods of wind turbines from the aspects of data source, features extraction, model construction, and attention mechanism introduction; however, the following limitations still need to be resolved in the research.
(1) Previous studies mainly focus on the nonlinear relationships between different input variables, lack consideration of the autocorrelation of each input variable, and cannot fully mine the spatial-temporal features inherent in large high-dimensional SCADA time series.
(2) During the model training process, the existing normal-behavior models for wind turbines mostly adopt random initialization parameters without intelligent optimization and easily fall into local minima.
(3) To date, the fixed or adaptive thresholds applied to the predicted error time series in existing studies may result in missed detections or false alarms when the thresholds are too large or too small. Therefore, the accuracy and reliability of the anomaly detection method can still be enhanced by combining it with other statistical analytic techniques.
Consequently, to address the above-mentioned limitations, a novel WTCM method (SSA-SABiLSTM-BinSegCPD, SSD) for the key components (e.g., main bearings, gearbox, and generator) of wind turbines is presented in this study, based on the sparrow search algorithm (SSA) [41], self-attention (SA) mechanism [37], bidirectional long short-term memory network (BiLSTM), and binary segmentation changepoint detection algorithm (BinSegCPD) [43,44]. The method can conduct real-time condition monitoring to detect potential anomalies, realize predictive maintenance, and reduce the O&M costs. The contributions are summarized below.
(1) A novel normal-behavior model (SABiLSTM) for wind turbine key components is constructed by combining the self-attention mechanism and the BiLSTM network. The self-attention mechanism can make the model focus on input variables that have greater impacts on the output variable. The SABiLSTM can effectively mine the spatial-temporal features hidden in the SCADA time series. Compared with four contrast models (i.e., XGBoost, BPNN, LSTM, and BiLSTM), the SABiLSTM model achieved a superior prediction performance with better evaluation metrics (the lowest MAE, RMSE, and MAPE and the highest R2).
(2) The sparrow search algorithm (SSA) is employed to intelligently optimize the constructed SABiLSTM model for optimal initialization weights or bias parameters, which can considerably enhance the model's overall performance and convergence rate. The comparative results with two other optimization algorithms (i.e., particle swarm and crisscross optimization algorithms, denoted as PSO and CSO, respectively) [39,40] demonstrate that the introduced SSA algorithm performs better than the other two algorithms in terms of MAE, RMSE, MAPE, and R2.
(3) A hybrid anomaly detection strategy, consisting of the binary segmentation changepoint detection algorithm (BinSegCPD) and threshold alarm, is designed to automatically identify deterioration conditions and detect the potential anomalies in a wind turbine in advance. Two actual fault case studies of main bearings illustrated that the designed hybrid strategy can detect deterioration conditions 47–120 h in advance and trigger the fault alarm signals approximately 36 h ahead of the actual failure time.
The remainder of this paper is structured as follows: the framework of the designed SSD condition-monitoring method for wind turbines is briefly introduced in Section 2; a detailed description of the designed SSA-SABiLSTM model structure, as well as the corresponding methodologies, is provided in Section 3; Section 4 presents a hybrid anomaly detection strategy that consists of a binary segmentation changepoint detection algorithm (BinSegCPD) and threshold alarm; the proposed SSD method is validated using SCADA datasets acquired from multiple wind turbines located in southern China in Section 5; finally, a brief conclusion is presented in Section 6.

Overview of the Proposed SSD Method
The main work to be done in this study was to build a wind turbine normal-behavior model, which was employed to conduct the real-time condition monitoring and identify the early anomalies by applying a designed hybrid anomaly detection strategy.
The novelty of this study is twofold. First, an SSA-SABiLSTM normal-behavior model was constructed for key components of wind turbines; it can effectively mine the dynamic temporal characteristics of SCADA data by introducing the self-attention mechanism and can achieve a well-trained model with a smaller training loss by introducing the SSA algorithm. Second, a hybrid anomaly detection strategy consisting of the changepoint detection algorithm and an alarm threshold was designed to decrease the ratio of missed detections and false alarms.
As illustrated in Figure 1, the designed SSD wind turbine condition-monitoring (WTCM) method primarily comprises two stages: offline training and online monitoring. In order to capture the nonlinear and temporal dynamic properties of wind turbines under normal-state conditions, the key component of this framework is a normal-behavior model (NBM) of wind turbines based on the SABiLSTM network optimized by the sparrow search algorithm (SSA). A representative advantage of the designed method is that the modeling datasets rely only on historical health SCADA data, so the method can be easily deployed in actual applications where fault datasets are difficult or even impossible to acquire. Moreover, the primary principle behind this framework is that, according to the statistical analysis of the residuals between the measurements and the predictions generated by the well-trained NBM, the residual change trend can indicate possible deterioration conditions or potential faults. Generally, lower residuals are produced by the well-trained NBM for normal SCADA operation data, whereas higher residuals are produced for abnormal or fault data. In addition to the common threshold alarm, a binary segmentation changepoint detection (BinSeg) algorithm was further introduced into the anomaly detection strategy to identify deterioration conditions of wind turbines, which can improve the timeliness and accuracy of the anomaly detection method. The specific steps performed for each phase of the designed method can be summarized as follows.
Stage 1-Offline modeling. In this stage, based on the historical health SCADA data acquired by the SCADA system from multiple wind turbines that are operating in normal conditions, the proposed SSA-SABiLSTM normal-behavior model for key components of wind turbines was trained and tested.
Specifically, the historical health-modeling dataset was first acquired by conducting data preprocessing (i.e., data cleaning and normalization) and variable selection on the raw SCADA datasets, which were collected from multiple wind turbines operating in normal conditions. The modeling dataset was then split into two sub-datasets: one for model training and the other for model testing. With the training sub-dataset, the designed SSA-SABiLSTM model could be trained to learn the nonlinear relationships between different variables and the temporal features inherent in the SCADA time series. For the test sub-dataset, the prediction residuals could be computed by using the relevant measurements and the predictions generated by the well-trained SSA-SABiLSTM model. Furthermore, for normal SCADA data, the probability density function (PDF) of the prediction residuals could be calculated by using the kernel density estimation algorithm, so as to set the alarm threshold for potential early fault-detection purposes.
Stage 2-Online monitoring. In this stage, based on the well-trained SSA-SABiLSTM normal-behavior model, online SCADA data, and the hybrid strategy consisting of the changepoint detection algorithm and alarm threshold, the SSD method was used to conduct real-time condition monitoring and early fault detection for wind turbine key components. Firstly, online monitoring SCADA data were preprocessed and fed into the well-trained SSA-SABiLSTM model to generate the predictions and corresponding predicted residuals. Then, the deterioration conditions of wind turbines could be automatically identified by applying the BinSeg algorithm to the predicted residual sequence. Finally, an early alarm could be triggered if the predicted residual value exceeded the alarm threshold value calculated in the offline training phase.
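As a rough illustration of the online stage, the following minimal Python sketch applies a simplified binary segmentation to a residual sequence and then checks a fixed alarm threshold. It is not the BinSegCPD implementation used in this study: the least-squares mean-shift cost, the `min_size` and `penalty` parameters, the synthetic residuals, and the hard-coded `alarm_threshold` (which stands in for the threshold derived from the kernel density estimate) are all assumptions made for illustration.

```python
import numpy as np

def best_split(x, min_size):
    """Return (gain, index) of the single split that most reduces the sum of
    squared deviations from the segment mean (an assumed cost function)."""
    n = len(x)
    total = np.sum((x - x.mean()) ** 2)
    best_gain, best_idx = 0.0, None
    for k in range(min_size, n - min_size + 1):
        left, right = x[:k], x[k:]
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        gain = total - cost
        if gain > best_gain:
            best_gain, best_idx = gain, k
    return best_gain, best_idx

def binary_segmentation(x, min_size=10, penalty=5.0, offset=0):
    """Recursively split the sequence while the cost reduction exceeds `penalty`."""
    if len(x) < 2 * min_size:
        return []
    gain, k = best_split(x, min_size)
    if k is None or gain <= penalty:
        return []
    left = binary_segmentation(x[:k], min_size, penalty, offset)
    right = binary_segmentation(x[k:], min_size, penalty, offset + k)
    return left + [offset + k] + right

# Synthetic residual sequence: normal operation followed by a deterioration phase.
rng = np.random.default_rng(0)
residuals = np.concatenate([rng.normal(0.0, 0.1, 200), rng.normal(1.0, 0.1, 100)])

changepoints = binary_segmentation(residuals)

# Hypothetical fixed threshold; the study derives it from the residual PDF instead.
alarm_threshold = 0.5
alarm_idx = np.flatnonzero(np.abs(residuals) > alarm_threshold)
first_alarm = int(alarm_idx[0]) if alarm_idx.size else None
```

In this toy sequence, the changepoint marks where the residual level shifts, while the threshold check reports the first sample whose residual magnitude exceeds the alarm level; the two signals are complementary, which is the idea behind the hybrid strategy.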

Data Preprocessing Method
It should be emphasized that the data preprocessing method (i.e., data cleaning and normalization) is an essential step in the WTCM method prior to constructing and training the normal-behavior model using SCADA data. Due to sensor errors or communication failures, a large number of invalid values or outliers may be generated and recorded by the SCADA system while a wind turbine continuously operates over a long period of time. Typically, these abnormal records severely influence the model's overall performance; therefore, they should first be removed.
In this paper, the quartile algorithm (QA) was employed to perform data cleaning, and the interquartile range $I_{QR}$ can be calculated by Equation (1):
$$I_{QR} = Q_3 - Q_1 \tag{1}$$
where $Q_1$ and $Q_3$ represent the first and third quartiles, respectively. Then, the interquartile range $I_{QR}$ can be used to calculate the lower and upper cleaning thresholds of outliers, $T_L$ and $T_U$, by Equation (2). The data beyond the threshold range $[T_L, T_U]$ should be regarded as anomalies and eliminated from the raw SCADA dataset.
Furthermore, since different monitored variables generally present different value ranges, it is essential to scale the original measurements to [0, 1], according to Equation (3), to eliminate dimensional effects and decrease the model training difficulty.
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{3}$$
where $X_{norm}$ is the normalized data; $X$ is the data prior to normalization; $X_{max}$ is the maximum value of dataset $X$; and $X_{min}$ is the minimum value of dataset $X$.
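The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the multiplier `k=1.5` applied to the interquartile range is the conventional Tukey choice and is assumed here, since the multiplier of Equation (2) is not given above; the function names and the sample temperatures are hypothetical.

```python
import numpy as np

def iqr_clean(x, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is an assumed multiplier (Tukey's convention), not taken from the text.
    """
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (x >= lower) & (x <= upper)
    return x[mask], (lower, upper)

def min_max_normalize(x):
    """Scale values to [0, 1] (min-max normalization); assumes max(x) > min(x)."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

# Hypothetical main bearing temperatures with two obviously invalid records.
temps = np.array([41.0, 42.5, 43.0, 42.0, 41.5, 43.5, 120.0, 42.2, -5.0])
cleaned, (t_low, t_up) = iqr_clean(temps)
scaled = min_max_normalize(cleaned)
```

In practice the minimum and maximum would be taken from the training sub-dataset and reused for the test and online data, so that all inputs share one scale.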

Variable Selection
Generally, the SCADA system records hundreds of monitored variables. However, it is not necessary to use all variables for modeling purposes, because doing so increases the complexity of the wind turbine NBM and decreases its computing efficiency. Therefore, among all the monitored variables, those that present higher correlations with the target modeling variable (e.g., main bearing temperature) should be selected as the model input variables by using a correlation measure.
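In this paper, Pearson's correlation coefficient was adopted to implement the variable selection process, which can be calculated by Equation (4):
$$r = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}} \tag{4}$$
where $\bar{X}$ is the average value of $X$ and $\bar{Y}$ is the average value of $Y$.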
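A minimal sketch of correlation-based variable selection is given below. The cut-off `min_abs_r=0.5`, the variable names, and the synthetic signals are assumptions for illustration only; the selection threshold used in this study is not stated in this section.

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient between two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def select_variables(candidates, target, min_abs_r=0.5):
    """Keep candidates whose |r| with the target reaches `min_abs_r` (assumed cut-off)."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    selected = [name for name, r in scores.items() if abs(r) >= min_abs_r]
    return selected, scores

# Synthetic data: the target depends on active power but not on the second signal.
rng = np.random.default_rng(1)
power = rng.uniform(0.0, 1.0, 500)
bearing_temp = 40.0 + 10.0 * power + rng.normal(0.0, 0.5, 500)
candidates = {
    "active_power": power,
    "unrelated_signal": rng.normal(0.0, 1.0, 500),
}
selected, scores = select_variables(candidates, bearing_temp)
```

Using the absolute value of the coefficient keeps strongly negatively correlated variables as well, which is a design choice of this sketch.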

Sparrow Search Algorithm
The sparrow search algorithm (SSA) proposed by Xue J. et al. [41] in 2020 is a swarm intelligence optimization approach, which is motivated by the group wisdom, foraging, and anti-predation behaviors of sparrows. Compared with other existing optimization algorithms, the SSA algorithm is superior in terms of its convergence speed, searching accuracy, and stability factors.
Specifically, from the perspective of biological characteristics, the SSA is designed based on the following assumptions. Sparrows generally can first be classified into two categories: producers and scroungers. Producers generally have a significant amount of energy reserve resources, which depend on the evaluation of the individual fitness values. They provide foraging directions and areas to scroungers through energetically searching for food sources. However, to improve their predation ratio, scroungers generally acquire food through following, monitoring, and competing with producers.
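As for the mathematics, according to Equations (5) and (6), the positions and fitness values of sparrows can be denoted as matrix $X$ and vector $F(X)$, respectively.
$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{bmatrix} \tag{5}$$
$$F(X) = \begin{bmatrix} f\left([x_{1,1}\; x_{1,2}\; \cdots\; x_{1,d}]\right) \\ f\left([x_{2,1}\; x_{2,2}\; \cdots\; x_{2,d}]\right) \\ \vdots \\ f\left([x_{n,1}\; x_{n,2}\; \cdots\; x_{n,d}]\right) \end{bmatrix} \tag{6}$$
where $n$ represents the number of sparrows and $d$ stands for the dimension of the optimized variables. As mentioned above, producers generally possess higher fitness values and have the responsibility of searching for food and guiding the movement direction of the sparrow population. Therefore, compared with the scroungers, producers have access to a wider range of locations to search for food, and the positions of the producers can be updated by Equation (7):
$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{max}}\right), & R_2 < ST \\ X_{i,j}^{t} + Q \cdot L, & R_2 \geq ST \end{cases} \tag{7}$$
where $iter_{max}$ indicates the maximum iteration number; $\alpha \in (0, 1]$ and $Q \sim N(0, 1)$ are random numbers; $L$ indicates a $1 \times d$ matrix whose elements are all 1; $R_2 \in (0, 1]$ represents the alarm threshold; and $ST \in (0.5, 1]$ represents the safety threshold. The positions of the scroungers can be updated according to Equation (8):
$$X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\ X_{P}^{t} + \left|X_{i,j}^{t} - X_{P}^{t}\right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases} \tag{8}$$
where $X_{P}^{t}$ represents the best positions occupied by the producers; $X_{worst}^{t}$ represents the global worst positions at iteration $t$; and $A$ indicates a $1 \times d$ matrix whose elements are randomly assigned as 1 or −1, with $A^{+} = A^{T}\left(AA^{T}\right)^{-1}$.
We assume that some sparrows, accounting for approximately 10% to 20% of the total population, have the ability to notice danger. Additionally, the initial positions of these sparrows within the population are selected at random, and their positions can be updated according to Equation (9):
$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left|X_{i,j}^{t} - X_{best}^{t}\right|, & f_i > f_g \\ X_{i,j}^{t} + K \cdot \left(\dfrac{\left|X_{i,j}^{t} - X_{worst}^{t}\right|}{\left(f_i - f_w\right) + \varepsilon}\right), & f_i = f_g \end{cases} \tag{9}$$
where $X_{best}^{t}$ represents the global optimal positions at iteration $t$; $K \in [-1, 1]$ and $\beta \sim N(0, 1)$ are random numbers; $\varepsilon$ represents an extremely small constant for avoiding a zero-division error; $f_i$ indicates the fitness value of the $i$th sparrow; $f_g$ indicates the current global best fitness value; and $f_w$ indicates the current worst fitness value.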
As described above, the specific procedures of the sparrow search algorithm can be summarized as the pseudo-code displayed in Algorithm 1.

Algorithm 1: Sparrow search algorithm
Input:
  I: the maximum iterations
  N_P: the number of producers
  N_D: the number of sparrows who perceive danger
  R_2: the alarm value
  n: the number of sparrows
  Initialize a population of n sparrows and define its relevant parameters.
Output: X_best, f_g.
1: while (t < I)
2:   Search for the best and worst individuals of the sparrow population by ranking the fitness values.
3:   R_2 = rand(1)
4:   for i = 1: N_P do
5:     Update the sparrow's position by Equation (7);
6:   end for
7:   for i = (N_P + 1): n do
8:     Update the sparrow's position by Equation (8);
9:   end for
10:  for l = 1: N_D do
11:    Update the sparrow's position by Equation (9);
12:  end for
13:  Obtain the current new position;
14:  If the new position is better than before, update it;
15:  t = t + 1
16: end while
17: return X_best, f_g
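The loop of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in this study: it follows the commonly cited form of the producer, scrounger, and danger-aware updates from the original SSA formulation [41], treats lower fitness as better (minimization), bases all three updates on the positions at the start of the iteration, and clips positions to assumed bounds `lb` and `ub`. The function name, default parameter values, and the sphere test function are hypothetical; in the proposed method the fitness would instead be the training loss of the SABiLSTM model for a candidate set of initial weights and biases.

```python
import numpy as np

def sparrow_search(fitness, dim, n=30, max_iter=100, lb=-5.0, ub=5.0,
                   n_producers=6, n_danger=3, st=0.8, seed=0):
    """Minimise `fitness` with a simplified sparrow search loop (illustrative)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12                                   # avoids division by zero
    pos = rng.uniform(lb, ub, (n, dim))
    fit = np.array([fitness(p) for p in pos])

    for _ in range(max_iter):
        # Rank the population: index 0 is the best sparrow, index -1 the worst.
        order = np.argsort(fit)
        pos, fit = pos[order], fit[order]
        best, worst = pos[0].copy(), pos[-1].copy()
        f_g, f_w = fit[0], fit[-1]
        new_pos = pos.copy()

        # Producers: shrink the position when safe, random step otherwise.
        r2 = rng.random()
        for i in range(n_producers):
            if r2 < st:
                alpha = 1.0 - rng.random()        # random number in (0, 1]
                new_pos[i] = pos[i] * np.exp(-(i + 1) / (alpha * max_iter))
            else:
                new_pos[i] = pos[i] + rng.normal() * np.ones(dim)

        # Scroungers: the worse half relocates, the rest follow the best producer.
        for i in range(n_producers, n):
            if i + 1 > n / 2:
                new_pos[i] = rng.normal() * np.exp((worst - pos[i]) / (i + 1) ** 2)
            else:
                a = rng.choice([-1.0, 1.0], size=dim)
                step = np.abs(pos[i] - best) @ (a / dim)   # |X - X_P| . A^+
                new_pos[i] = best + step * np.ones(dim)

        # Danger-aware sparrows, chosen at random from the population.
        for i in rng.choice(n, size=n_danger, replace=False):
            if fit[i] > f_g:
                new_pos[i] = best + rng.normal() * np.abs(pos[i] - best)
            else:
                k = rng.uniform(-1.0, 1.0)
                new_pos[i] = pos[i] + k * np.abs(pos[i] - worst) / ((fit[i] - f_w) + eps)

        # Keep a new position only if it improves on the previous one.
        new_pos = np.clip(new_pos, lb, ub)
        new_fit = np.array([fitness(p) for p in new_pos])
        improved = new_fit < fit
        pos[improved] = new_pos[improved]
        fit[improved] = new_fit[improved]

    i_best = int(np.argmin(fit))
    return pos[i_best].copy(), float(fit[i_best])

def sphere(p):
    """Stand-in objective; a real use would return the model's training loss."""
    return float(np.sum(p ** 2))

best_pos, best_fit = sparrow_search(sphere, dim=5)
```

Because a position is replaced only when its fitness improves, the best fitness never gets worse from one iteration to the next.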

Structure and Theory of the Constructed SABiLSTM Model
The structure of the constructed SABiLSTM model displayed in Figure 2 mainly contains three parts: the self-attention, BiLSTM, and fully connected networks. Specifically, for minibatches of historical health or online datasets (i.e., $X_1, X_2, \ldots, X_T$) acquired after the data preprocessing stage, the weighted time-series values (i.e., $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_T$) can be calculated through the self-attention network, and the detailed calculation process of the self-attention mechanism can be observed in Section 3.2. Then, the weighted time-series values, as model inputs, are fed into the BiLSTM network to generate the hidden-variable time-series values, which can be used as the inputs for the fully connected network to obtain the final target outputs (e.g., main bearing temperature). The detailed theory of the BiLSTM network is described in Section 3.2.3.

Self-Attention Mechanism
Self-attention (SA), also known as intra-attention [37], is an attention mechanism linking different positions of a time series so as to calculate the attention weights of the time series. Recently, the self-attention mechanism has been widely applied in different tasks, including abstractive summarization, reading comprehension, and textual entailment. Figure 3 presents the calculation process of the self-attention mechanism. Specifically, the self-attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output

Self-Attention Mechanism
Self-attention (SA), also known as intra-attention [37], is an attention mechanism that relates different positions of a single sequence in order to compute attention weights over that sequence. Recently, the self-attention mechanism has been widely applied in different tasks, including abstractive summarization, reading comprehension, and textual entailment. Figure 3 presents the calculation process of the self-attention mechanism. Specifically, the self-attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V, as can be observed in Figure 3. The input matrix X is first transformed into matrices Q, K, and V; a SoftMax function is then applied to Q and K to compute the attention weight matrix A, which is multiplied with V to compute the output.
Mathematically, assume that the input sequence is denoted as $X = [x_1, \ldots, x_N] \in \mathbb{R}^{d_x \times N}$ and the output sequence is denoted as $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_N] \in \mathbb{R}^{d_v \times N}$. Then, for each input $x_i \in \mathbb{R}^{d_x}$, the query $q_i \in \mathbb{R}^{d_k}$, key $k_i \in \mathbb{R}^{d_k}$, and value $v_i \in \mathbb{R}^{d_v}$ vectors can be acquired through the linear mapping process. Moreover, for the whole input sequence X, the three mapping matrices (i.e., Q, K, and V) and the output matrix $\tilde{X}$ can be calculated according to Equations (10)-(13). In this study, we selected a scaled dot-product as the attention scoring function.
$$Q = W_q X \in \mathbb{R}^{d_k \times N} \tag{10}$$
$$K = W_k X \in \mathbb{R}^{d_k \times N} \tag{11}$$
$$V = W_v X \in \mathbb{R}^{d_v \times N} \tag{12}$$
$$\tilde{X} = V A = V\,\mathrm{SoftMax}\left(\frac{K^{\top} Q}{\sqrt{d_k}}\right) \tag{13}$$
where $W_q$, $W_k$, and $W_v$ are the linear mapping parameter matrices; $d_k$ is the dimension of the query and key vectors; $A$ is the attention matrix; and SoftMax is the normalization function.
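The scaled dot-product self-attention referred to by Equations (10)-(13) can be illustrated with a minimal NumPy sketch. It assumes that each column of X is one position of the sequence and that SoftMax normalizes each column of the score matrix; the function names, the dimensions, and the random projection matrices are hypothetical placeholders and are not taken from the trained model.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable SoftMax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention with positions stored as columns of X."""
    Q = W_q @ X                      # queries, shape (d_k, N)
    K = W_k @ X                      # keys,    shape (d_k, N)
    V = W_v @ X                      # values,  shape (d_v, N)
    d_k = Q.shape[0]
    # Column j of A holds the weights that position j assigns to every position.
    A = softmax(K.T @ Q / np.sqrt(d_k), axis=0)
    return V @ A, A                  # output has shape (d_v, N)

# Illustrative sizes only (assumed): 16 features, 5 time steps.
rng = np.random.default_rng(0)
d_x, d_k, d_v, N = 16, 8, 8, 5
X = rng.normal(size=(d_x, N))
W_q = rng.normal(size=(d_k, d_x))
W_k = rng.normal(size=(d_k, d_x))
W_v = rng.normal(size=(d_v, d_x))
X_tilde, A = self_attention(X, W_q, W_k, W_v)
```

Each column of `A` sums to one, so every output vector is a convex combination of the value vectors.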

Long Short-Term Memory Network
Recurrent neural networks (RNNs, including the basic RNN, BRNN, LSTM, and GRU) are a family of artificial neural networks that are well suited to sequential or time-series data because of their short-term memory capability. Among the many RNN variants, long short-term memory (LSTM) networks introduce gate mechanisms that effectively address the vanishing- and exploding-gradient problems of deep recurrent neural networks. Figure 4 displays the internal calculation process of an LSTM recurrent unit. As can be observed in Figure 4, compared with a conventional RNN, LSTM introduces three gates, i.e., forget gate $F_t$, input gate $I_t$, and output gate $O_t$. The forget gate determines the proportion of information that the memory cell needs to discard, the input gate determines the proportion of information that the candidate memory cell needs to retain, and the output gate determines the proportion of information that the memory cell needs to pass to the hidden state.
Mathematically, we assume that the input matrix is a minibatch $X_t \in \mathbb{R}^{n \times d}$ at a specific time-step t, and the memory cell and hidden state of the previous time-step are $C_{t-1} \in \mathbb{R}^{n \times h}$ and $H_{t-1} \in \mathbb{R}^{n \times h}$, respectively. Then, the forget gate $F_t$, input gate $I_t$, output gate $O_t$, candidate memory cell $\tilde{C}_t$, memory cell $C_t$, and hidden state $H_t$ can be computed according to Equations (14)-(19):
$$F_t = \sigma\left(X_t W_{xf} + H_{t-1} W_{hf} + b_f\right) \tag{14}$$
$$I_t = \sigma\left(X_t W_{xi} + H_{t-1} W_{hi} + b_i\right) \tag{15}$$
$$O_t = \sigma\left(X_t W_{xo} + H_{t-1} W_{ho} + b_o\right) \tag{16}$$
$$\tilde{C}_t = \tanh\left(X_t W_{xc} + H_{t-1} W_{hc} + b_c\right) \tag{17}$$
$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \tag{18}$$
$$H_t = O_t \odot \tanh\left(C_t\right) \tag{19}$$
where $W_{xf}$, $W_{xi}$, $W_{xo}$, $W_{xc} \in \mathbb{R}^{d \times h}$ and $W_{hf}$, $W_{hi}$, $W_{ho}$, $W_{hc} \in \mathbb{R}^{h \times h}$ are weight parameter matrices; $b_f$, $b_i$, $b_o$, and $b_c \in \mathbb{R}^{1 \times h}$ are bias parameter matrices; $n$ and $d$ indicate the number of examples and the number of inputs in each example, respectively; $h$ indicates the number of hidden units; $\sigma$ represents the sigmoid function; and the symbol $\odot$ represents the Hadamard product operator.
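As an illustration of the LSTM recurrence referred to by Equations (14)-(19), the following NumPy sketch performs the gate, memory-cell, and hidden-state updates for one time-step and unrolls them over a short sequence. The helper names (`init_params`, `lstm_step`), the random initialization, and the sizes are assumptions made for this example; they do not reproduce the configuration in Table 3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, h, rng):
    """Random weights for the f, i, o gates and the candidate cell c (illustrative)."""
    p = {}
    for g in "fioc":
        p[f"W_x{g}"] = rng.normal(scale=0.1, size=(d, h))
        p[f"W_h{g}"] = rng.normal(scale=0.1, size=(h, h))
        p[f"b_{g}"] = np.zeros((1, h))
    return p

def lstm_step(X_t, H_prev, C_prev, p):
    """One LSTM time-step for a minibatch X_t of shape (n, d)."""
    F_t = sigmoid(X_t @ p["W_xf"] + H_prev @ p["W_hf"] + p["b_f"])      # forget gate
    I_t = sigmoid(X_t @ p["W_xi"] + H_prev @ p["W_hi"] + p["b_i"])      # input gate
    O_t = sigmoid(X_t @ p["W_xo"] + H_prev @ p["W_ho"] + p["b_o"])      # output gate
    C_tilde = np.tanh(X_t @ p["W_xc"] + H_prev @ p["W_hc"] + p["b_c"])  # candidate cell
    C_t = F_t * C_prev + I_t * C_tilde                                  # memory cell
    H_t = O_t * np.tanh(C_t)                                            # hidden state
    return H_t, C_t

# Unroll over 10 time-steps with n = 4 examples, d = 16 inputs, h = 8 hidden units.
rng = np.random.default_rng(0)
n, d, h = 4, 16, 8
params = init_params(d, h, rng)
H, C = np.zeros((n, h)), np.zeros((n, h))
for X_t in rng.normal(size=(10, n, d)):
    H, C = lstm_step(X_t, H, C, params)
```

Because the hidden state is the product of a sigmoid gate and a tanh, every entry of `H` stays strictly inside (-1, 1).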

Bidirectional Long Short-Term Memory Networks
A bidirectional long short-term memory network (BiLSTM) is a neural network that consists of two LSTM networks with identical structures but opposite propagation directions. Figure 5 illustrates the typical structure of a one-hidden-layer BiLSTM network, with $\overrightarrow{H}_t$ standing for the state of the forward-LSTM network that moves forward through time and $\overleftarrow{H}_t$ standing for the state of the backward-LSTM network that moves backward through time. This structure allows the output $O_t$ to learn a feature representation that depends on both the past and the future.
Mathematically, for a given minibatch input $X_t \in \mathbb{R}^{n \times d}$, the forward $\overrightarrow{H}_t$ and backward $\overleftarrow{H}_t$ hidden states can be calculated by Equations (20) and (21):
$$\overrightarrow{H}_t = \phi\left(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)}\right) \tag{20}$$
$$\overleftarrow{H}_t = \phi\left(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)}\right) \tag{21}$$
where $W_{xh}^{(f)}$, $W_{hh}^{(f)}$, $W_{xh}^{(b)}$, and $W_{hh}^{(b)}$ are weight parameter matrices; $b_h^{(f)}$ and $b_h^{(b)}$ are bias parameter matrices; and $\phi$ represents the hidden-layer activation function. Then, the hidden state $H_t$ is obtained by concatenating the forward and backward hidden states (i.e., $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$) and is provided to the output layer to compute the output $O_t \in \mathbb{R}^{n \times q}$ (q indicates the number of outputs):
$$H_t = \left[\overrightarrow{H}_t,\ \overleftarrow{H}_t\right] \tag{22}$$
$$O_t = H_t W_{hq} + b_q \tag{23}$$
where $W_{hq}$ is the weight parameter matrix and $b_q$ is the bias parameter vector. Additionally, in deep BiLSTM networks with multiple hidden layers, the intermediate hidden state $H_t^{(1)}$ is supplied as the input to the subsequent bidirectional layer.
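The bidirectional recurrence of Equations (20) and (21), followed by concatenation and a linear output layer, can be sketched as follows. For brevity, this example assumes a plain tanh recurrence as the hidden-layer activation in place of a full LSTM cell; all function names, sizes, and random weights are hypothetical.

```python
import numpy as np

def run_direction(X, W_xh, W_hh, b_h, reverse=False):
    """Run a tanh recurrence over X of shape (T, n, d), forward or backward in time."""
    T, n, _ = X.shape
    h = W_hh.shape[0]
    H = np.zeros((T, n, h))
    state = np.zeros((n, h))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        state = np.tanh(X[t] @ W_xh + state @ W_hh + b_h)
        H[t] = state
    return H

def bidirectional_layer(X, fwd, bwd, W_hq, b_q):
    """Concatenate forward and backward states, then apply the output layer."""
    H_f = run_direction(X, *fwd)
    H_b = run_direction(X, *bwd, reverse=True)
    H = np.concatenate([H_f, H_b], axis=-1)   # shape (T, n, 2h)
    O = H @ W_hq + b_q                        # shape (T, n, q)
    return H, O

def make_direction_params(rng, d, h):
    return (0.3 * rng.normal(size=(d, h)), 0.3 * rng.normal(size=(h, h)), np.zeros((1, h)))

# Illustrative sizes only (assumed).
rng = np.random.default_rng(0)
T, n, d, h, q = 6, 2, 16, 4, 1
X = rng.normal(size=(T, n, d))
fwd = make_direction_params(rng, d, h)
bwd = make_direction_params(rng, d, h)
W_hq = 0.3 * rng.normal(size=(2 * h, q))
b_q = np.zeros((1, q))
H, O = bidirectional_layer(X, fwd, bwd, W_hq, b_q)
```

In this sketch, perturbing the last input `X[-1]` leaves the forward states at earlier time-steps unchanged but alters the backward states, which is how the concatenated representation comes to depend on both past and future inputs.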

Evaluation Metrics
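Four commonly used evaluation metrics, namely root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the determination coefficient ($R^2$), were used to assess the effectiveness and superiority of the normal-behavior models for wind turbines. They can be calculated according to Equations (24)-(27):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(x_t - \hat{x}_t\right)^2} \tag{24}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|x_t - \hat{x}_t\right| \tag{25}$$
$$\mathrm{MAPE} = \frac{1}{N}\sum_{t=1}^{N}\left|\frac{x_t - \hat{x}_t}{x_t}\right| \tag{26}$$
$$R^2 = 1 - \frac{\sum_{t=1}^{N}\left(x_t - \hat{x}_t\right)^2}{\sum_{t=1}^{N}\left(x_t - \bar{x}\right)^2} \tag{27}$$
where $x_t$ represents the measurement; $\hat{x}_t$ represents the prediction; $\bar{x}$ represents the mean of the measurements; and $N$ is the number of samples.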
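The four metrics can be computed directly from paired measurement and prediction arrays, as in the short NumPy sketch below. It assumes that MAPE is reported as a fraction rather than a percentage (consistent with values such as 0.0069 reported in this paper) and that no measurement equals zero; the function name `evaluate` and the toy data are illustrative only.

```python
import numpy as np

def evaluate(x, x_hat):
    """Return RMSE, MAE, MAPE (as a fraction), and R^2 for measurements x and predictions x_hat."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    err = x - x_hat
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / x))),   # assumes no zero measurements
        "R2": float(1.0 - np.sum(err ** 2) / np.sum((x - x.mean()) ** 2)),
    }

# Toy data (not from the SCADA datasets).
metrics = evaluate([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

For this toy input, the squared errors sum to 0.5 and the total sum of squares is 5, so $R^2$ evaluates to 0.9.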

Wind Turbine Condition Monitoring
Based on the historical health SCADA data and the designed SSA-SABiLSTM network, the normal-behavior model for key components or subsystems can be constructed offline to learn the dynamic spatial-temporal characteristics of wind turbines under normal conditions. Then, the well-trained SSA-SABiLSTM model can be used to implement real-time wind turbine condition monitoring (WTCM) with online SCADA data. Typically, for SCADA datasets collected from wind turbines operating under normal conditions, the residuals generated by the SSA-SABiLSTM model are small with stable fluctuations, while the residuals are large for abnormal conditions. Therefore, through real-time statistical analysis of the predicted residual, deterioration conditions can be automatically identified, and potential faults can be captured in advance.
In this section, a hybrid anomaly detection strategy, consisting of the binary segmentation changepoint detection algorithm (BinSegCPD) and threshold alarm, is introduced to automatically identify deterioration conditions and detect early potential faults of wind turbines in advance.

Binary Segmentation Changepoint Detection Algorithm
Changepoint detection [45], first proposed in 1954 [46], is the process of identifying changes in univariate or multivariate time series. In addition to generating significant activity in the fields of statistics and signal processing, it has also had a significant impact on several application areas, including voice processing, financial analysis, bioinformatics, climatology, network traffic data analysis, and complex system monitoring.
Specifically, for a given time sequence $y = \{y_t\}_{t=1}^{T}$ that is divided into $K + 1$ subsequences by the changepoint set $\mathcal{J} = \{t_1, t_2, \cdots, t_K\}$ with $K \le T$, the goal of a changepoint detection algorithm is to determine the optimal changepoint set $\hat{\mathcal{J}}$ by minimizing the quantitative criterion $V(\mathcal{J}, y)$ according to Equation (28):
$$\hat{\mathcal{J}} = \arg\min_{\mathcal{J}} \left[ V(\mathcal{J}, y) + \mathrm{pen}(\mathcal{J}) \right], \qquad V(\mathcal{J}, y) = \sum_{k=0}^{K} \mathcal{C}\left(y_{t_k:t_{k+1}}\right) \tag{28}$$
where $y_t \in \mathbb{R}^{D}$ and $D$ denotes the dimension of the time series; $K$ denotes the number of changepoints; $T$ denotes the length of the time series; $y_{t_k:t_{k+1}}$ refers to a sub-sequence of the time sequence (with the convention $t_0 = 0$ and $t_{K+1} = T$); $\mathcal{C}(\cdot)$ stands for the cost function; and $\mathrm{pen}(\mathcal{J})$ stands for a constraint penalty term, which is added to balance the number of obtained changepoints when $K$ is undetermined. Consequently, a changepoint detection algorithm mainly contains three components: a search method for finding $\mathcal{J}$, a cost function $\mathcal{C}(\cdot)$, and a penalty term $\mathrm{pen}(\mathcal{J})$ when $K$ is undetermined.
In terms of the search method, the binary segmentation algorithm [43], denoted as BinSeg, was adopted to obtain the optimal changepoint set $\hat{\mathcal{J}}$ in this study.
Based on BinSeg, the first changepoint estimate $\hat{t}^{(1)}$ of time sequence $y$ can be calculated by Equation (29):
$$\hat{t}^{(1)} = \arg\min_{1 \le t < T} \left[ \mathcal{C}\left(y_{0:t}\right) + \mathcal{C}\left(y_{t:T}\right) \right] \tag{29}$$
Sequence $y$ is then split into two sub-sequences at position $\hat{t}^{(1)}$, and the same process is repeated on the resulting sub-sequences until a stopping criterion is satisfied.
As for the cost function, the least squares deviation, which detects mean shifts in a time sequence, was employed in this paper, as presented in Equation (30):
$$\mathcal{C}\left(y_I\right) = \sum_{t \in I} \left\| y_t - \bar{y} \right\|_2^2 \tag{30}$$
where $y_I$ represents the sub-sequence indexed by $I$ and $\bar{y}$ stands for the mean value of sub-sequence $y_I$.
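A greedy binary segmentation with the least squares cost can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used in this study: it assumes a univariate sequence, a fixed number of changepoints `n_bkps` as the stopping criterion, and a hypothetical minimum segment length `min_size`.

```python
import numpy as np

def l2_cost(seg):
    """Least squares deviation of a segment from its own mean."""
    return float(((seg - seg.mean()) ** 2).sum()) if len(seg) else 0.0

def best_split(y, start, end, min_size):
    """Best single split of y[start:end]; returns (cost reduction, split index or None)."""
    full = l2_cost(y[start:end])
    best_gain, best_t = 0.0, None
    for t in range(start + min_size, end - min_size + 1):
        gain = full - l2_cost(y[start:t]) - l2_cost(y[t:end])
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def binseg(y, n_bkps, min_size=2):
    """Repeatedly split the segment whose best split reduces the total cost the most."""
    y = np.asarray(y, dtype=float)
    segments, bkps = [(0, len(y))], []
    for _ in range(n_bkps):
        candidates = [(best_split(y, s, e, min_size), (s, e)) for s, e in segments]
        (gain, t), (s, e) = max(candidates, key=lambda c: c[0][0])
        if t is None:          # no admissible split improves the cost
            break
        bkps.append(t)
        segments.remove((s, e))
        segments += [(s, t), (t, e)]
    return sorted(bkps)

# Synthetic residual with mean shifts at indices 50 and 80 (toy data).
rng = np.random.default_rng(0)
residual = np.concatenate([np.zeros(50), np.ones(30), 8.0 * np.ones(40)])
residual += 0.05 * rng.normal(size=residual.size)
changepoints = binseg(residual, n_bkps=2)
```

On this synthetic sequence, the larger mean shift is found first and the smaller one second; the returned indices are sorted in time order.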

Alarm Threshold
In this study, the probability density function (PDF) of the prediction residual for wind turbines operating in normal conditions was calculated using the kernel density estimation (KDE) method, as presented in Equation (31):
$$\hat{f}_h(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) \tag{31}$$
where $x_i$ denotes the $i$-th residual sample; $h$ denotes the smoothing parameter; $N$ indicates the total number of samples; and $K(\cdot)$ stands for the kernel function, which is subject to $K(x) \ge 0$ and $\int_{-\infty}^{+\infty} K(x)\,dx = 1$. Then, for a given confidence level $\alpha$, the alarm threshold of the wind turbine condition-monitoring (WTCM) method employed for potential fault-detection purposes can be determined by Equation (32).
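One way to turn a KDE of healthy residuals into an alarm threshold is sketched below. It rests on several assumptions that are not confirmed by the text: a Gaussian kernel, a normal-reference rule-of-thumb bandwidth, and a one-sided threshold defined as the point where the KDE cumulative distribution reaches the confidence level $\alpha$. The function names and the synthetic residuals are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def kde_cdf(x, samples, h):
    """CDF of a Gaussian-kernel KDE: the average of the kernel CDFs centred on each sample."""
    return float(np.mean([0.5 * (1.0 + erf((x - s) / (h * sqrt(2.0)))) for s in samples]))

def alarm_threshold(samples, alpha=0.99, h=None):
    """Find the residual value whose KDE-based CDF equals alpha (bisection search)."""
    samples = np.asarray(samples, dtype=float)
    if h is None:
        # Normal-reference rule of thumb for the bandwidth (an assumption).
        h = 1.06 * samples.std(ddof=1) * len(samples) ** (-0.2)
    lo, hi = samples.min() - 10.0 * h, samples.max() + 10.0 * h
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi), h

# Synthetic "healthy" residuals (toy data, not the SCADA residuals).
rng = np.random.default_rng(0)
healthy_residual = rng.normal(loc=0.0, scale=0.8, size=2000)
threshold, bandwidth = alarm_threshold(healthy_residual, alpha=0.99)
```

A two-sided rule or a different kernel would give a different threshold, so the value produced here should not be compared with the threshold reported for the real turbines.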
It should be noted that, prior to being used to establish the main bearing NBM, the training dataset A and test dataset B needed to undergo data cleaning in order to obtain the health datasets. According to the data preprocessing algorithm presented in Section 2.2.1, the result of the data cleaning stage for WT No.11 is presented in Figure 6. As can be observed in Figure 6, through the quartile algorithm (QA), abnormal SCADA data (e.g., shutdowns, sensor errors, and communication failures), displayed as red scatter points, were removed to acquire the historical health data.
To balance model performance and computational efficiency, variable selection was conducted using the Pearson's correlation coefficient described in Section 2.2.2, retaining the variables that have higher correlation coefficients with the main bearing temperature. The result of the variable selection can be observed in Table 2: 16 variables were retained as model inputs, whereas the main bearing temperature was the model output.

The SABiLSTM Model
In this section, based on health training dataset A and health test dataset B, the proposed SABiLSTM main bearing NBM was trained and tested. Four commonly used contrast models for time series, i.e., XGBoost, BPNN, LSTM, and BiLSTM, were established for comparison. The learning rate and number of estimators of XGBoost were set to 0.1 and 100, respectively. The structure of the BPNN was designed as 16-32-16-8-1. As can be observed in Table 3, the hyper-parameters of the RNN models (i.e., LSTM, BiLSTM, and SABiLSTM) were set to identical values for comparison.
For health test dataset B, the quantitative evaluation metrics of the predicted results of the well-trained main bearing NBMs mentioned above are presented in Tables 4 and 5. As shown in Tables 4 and 5, when compared to XGBoost and BPNN, the RNN models presented superior performances, with lower MAE, RMSE, and MAPE values and a higher $R^2$ value. The reason for this was that the RNN models were better at processing time-series information and could extract the temporal features inherent to the wind turbine operating SCADA data. Due to the added backward layer, the BiLSTM model outperformed the LSTM network in terms of all four mean metrics, which are presented in Table 5. The MAE, RMSE, and MAPE values of the BiLSTM were 9.68%, 9.19%, and 4.65% lower than those of LSTM, respectively, while the $R^2$ value was 2.45% higher than that of the LSTM network. Meanwhile, it can be observed in Tables 4 and 5 that the introduction of the self-attention mechanism further improved the prediction performance of the BiLSTM network. The MAE, RMSE, and MAPE values of the SABiLSTM network were 0.4819 °C, 0.5902 °C, and 0.0148, which were 4.82%, 11.02%, and 1.77% lower than those of the BiLSTM network, respectively. The $R^2$ value of SABiLSTM was 0.9271, which was 2.91% higher than that of the BiLSTM network.
The probability density distribution of the predicted residual of the RNN models (i.e., LSTM, BiLSTM, and SABiLSTM) for WT No.20 is presented in Figure 7. As displayed in Figure 7, all PDF curves were sharp around the zero-error interval, indicating a high proportion of small prediction residuals. The PDF curve of the SABiLSTM network compared favorably with those of the LSTM and BiLSTM networks, which was consistent with the quantitative evaluation results in terms of MAE, RMSE, and MAPE values, as shown in Tables 4 and 5.

The SSA-SABiLSTM Model
As described in Section 3.1, the performance and convergence speed of the NBMs for wind turbine main bearings could be significantly improved by using intelligent optimization algorithms (IOAs) to optimize the initial weights and bias parameters. Therefore, in this paper, the sparrow search algorithm (SSA), one of the IOAs, was introduced to search for the optimal initial weights and bias parameters of the SABiLSTM network presented in Section 3.2, in order to increase the training speed and prediction precision.
To validate the feasibility of the early optimization of the model initialization parameters (weights and bias) using IOAs, and to compare the optimization effects of different IOAs on the SABiLSTM model, three commonly used IOAs, particle swarm optimization algorithm (PSO), crisscross optimization algorithm (CSO), and sparrow search algorithm (SSA), were employed to search for the optimal initialization parameters. The maximum iteration and population size of the three IOAs were selected as 50 and 100, respectively.
In addition, for health test dataset B, the quantitative evaluation metrics of the predicted results of the three intelligent optimization algorithm models (IOAMs), i.e., SSA-SABiLSTM, PSO-SABiLSTM, and CSO-SABiLSTM, are presented in Tables 6 and 7, together with the ranked result (denoted as $F_R$) of the Friedman test for the determination coefficient $R^2$ based on different wind turbine datasets. As can be observed in Tables 6 and 7, optimizing the SABiLSTM model with IOAs significantly improved its performance. The average MAE, RMSE, and MAPE values of the IOAMs were 30.78%, 27.23%, and 37.40% lower than those of the SABiLSTM network, respectively, while the $R^2$ value was 3.07% higher than that of the SABiLSTM network. Meanwhile, it can be observed in Tables 6 and 7 that, among the three IOAs, the SSA yielded the largest performance improvement on the SABiLSTM network in terms of all four mean metrics. The MAE, RMSE, and MAPE values of the SSA-SABiLSTM network were 0.2543 °C, 0.3412 °C, and 0.0069, which were 47.23%, 42.19%, and 53.38% lower than those of the SABiLSTM network, respectively. The $R^2$ value of the SSA-SABiLSTM network was 0.9731, which was 4.6% higher than that of the SABiLSTM network. Additionally, the ranked result of the Friedman test for the determination coefficient $R^2$ also illustrated the superiority of the proposed SSA-SABiLSTM model, which ranked first among the three comparative models.
Figure 11 presents the probability density distribution of the predicted residuals of the SABiLSTM and IOAM models (i.e., PSO-SABiLSTM, CSO-SABiLSTM, and SSA-SABiLSTM) for WT No.20. As can be observed in Figure 11, when compared to the SABiLSTM network, the PDF curves of the IOAMs were sharper around the zero-error interval, which indicated higher proportions of small residuals and more centralized residual distributions. Among the three IOAM curves, the curve generated by the SSA-SABiLSTM network was the sharpest and most centralized, which validated the superiority of the SSA in improving the model's overall performance.
As for the effect of the optimization on the model convergence speed, Figure 15 displays the training losses of the SABiLSTM and SSA-SABiLSTM models, from which we can observe the beneficial impact of the SSA on the training speed and training difficulty. It can be observed in Figure 15 that the SSA-SABiLSTM model converged after 300 iterations, while the training loss of the SABiLSTM model was still declining and had not converged, which verifies that the introduction of the SSA can accelerate convergence and reduce the training difficulty.
In summary, based on the modeling datasets (i.e., A and B) collected from multiple wind turbines, the proposed SSA-SABiLSTM model was verified and compared with the contrast models in terms of the evaluation metrics, frequency distribution, and intelligent optimization algorithm. The analysis results illustrate that the constructed SSA-SABiLSTM model has a superior prediction performance and can better learn the normal behaviors of wind turbine main bearings operating under normal conditions.
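The percentage reductions quoted for SSA-SABiLSTM relative to SABiLSTM can be checked from the mean metric values stated in the text. The sketch below is only an arithmetic check; the helper name `relative_reduction` is hypothetical. It also computes the absolute difference in $R^2$, since the reported 4.6% appears to correspond to that difference (0.9731 - 0.9271 = 0.0460) rather than to a relative change.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return round((baseline - improved) / baseline * 100.0, 2)

# Mean metrics quoted in the text: (SABiLSTM, SSA-SABiLSTM).
reported = {
    "MAE": (0.4819, 0.2543),
    "RMSE": (0.5902, 0.3412),
    "MAPE": (0.0148, 0.0069),
}
reductions = {name: relative_reduction(base, new) for name, (base, new) in reported.items()}

# Absolute R^2 gain expressed in percentage points.
r2_gain_points = round((0.9731 - 0.9271) * 100.0, 2)
```

The three computed reductions are 47.23%, 42.19%, and 53.38%, matching the reported figures.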

Wind Turbine Condition Monitoring
Based on the well-trained SSA-SABiLSTM model constructed and presented above, as well as the proposed anomaly detection method (i.e., BinSeg changepoint detection system and threshold alarm) described in Section 4, the main bearing condition monitoring could be performed to successfully identify deterioration conditions and detect potential faults in advance. In this sub-section, a real fault dataset (i.e., fault dataset C), which consisted of two main bearing failure cases acquired from WTs No.14 and No.27, was used to verify the superiority and effectiveness of the designed SSD approach. According to the operation and maintenance (O&M) records of the wind farm, the main bearings of WTs No.14 and No.27 experienced over-temperature faults and the SCADA system issued alarm signals at 6/9/2020 16:32 and 8/17/2020 11:21, respectively.
Further endoscopic inspection (Figure 16) showed that the roller, inner raceway, and outer raceway of the main bearings had suffered serious damage, which may have been caused by excessive instantaneous loads due to extreme wind conditions.
Figure 16. Damage of the roller, inner raceway, and outer raceway.

Identification of Deterioration Conditions
Based on the well-trained SSA-SABiLSTM model and fault dataset C, the changepoint detection results of the predicted residuals of the main bearing temperature for failure WTs No.14 and No.27 are displayed in Figures 17 and 18, respectively, where it can be observed that three changepoints were detected for each of the two WTs. To ensure the accuracy and reliability of the identification of main bearing deterioration conditions, we took the last two changepoints into account in this study. As can be observed in Figure 17, the predicted residual of WT No.14 was relatively low and fluctuated slightly around 0 °C before 6/5/2020 02:50 (changepoint 2). Then, it gradually increased and fluctuated around 1 °C between 6/5/2020 02:50 and 6/7/2020 21:40 (changepoint 3), and finally surged to 7.66 °C at 6/9/2020 16:30, when the SCADA system issued an alarm signal. As for WT No.27, the predicted residual displayed in Figure 18 was relatively small and fluctuated slightly around 0 °C before 8/12/2020 01:20 (changepoint 2). Then, it gradually increased between 8/12/2020 01:20 and 8/15/2020 08:50 (changepoint 3), and finally surged to 7.51 °C at 8/17/2020 11:20, when the SCADA system issued an alarm signal.
In summary, from the residual sequence changepoint detection results obtained for WTs No.14 and No.27, it may be concluded that the main bearings deteriorated prior to the SCADA system issuing alarm signals. In detail, compared with the failure time recorded in the SCADA system, changepoints 2 and 3 detected in the residual sequence could identify the main bearing deterioration conditions 109.7 h and 42.87 h in advance for WT No.14 and 130.02 h and 50.52 h in advance for WT No.27.
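The lead times quoted above follow directly from the changepoint timestamps and the SCADA alarm times. The short sketch below reproduces that arithmetic with the standard library; it assumes that the timestamps are written as month/day/year, and the dictionary layout and the helper name `lead_hours` are illustrative.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M"   # assumed month/day/year timestamp format

def lead_hours(event, alarm):
    """Hours between a detected changepoint and the SCADA alarm time."""
    delta = datetime.strptime(alarm, FMT) - datetime.strptime(event, FMT)
    return round(delta.total_seconds() / 3600.0, 2)

# Timestamps quoted in the text for changepoints 2 and 3 and the SCADA alarms.
cases = {
    "WT No.14": {"alarm": "6/9/2020 16:32", "cp2": "6/5/2020 02:50", "cp3": "6/7/2020 21:40"},
    "WT No.27": {"alarm": "8/17/2020 11:21", "cp2": "8/12/2020 01:20", "cp3": "8/15/2020 08:50"},
}
lead_times = {
    wt: (lead_hours(c["cp2"], c["alarm"]), lead_hours(c["cp3"], c["alarm"]))
    for wt, c in cases.items()
}
```

The resulting pairs are (109.7, 42.87) hours for WT No.14 and (130.02, 50.52) hours for WT No.27.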

Early Fault Warning
In this study, in addition to applying changepoint detection to the residual sequence, we conducted a statistical analysis based on the kernel density estimation (KDE) algorithm described in Section 4.2. Specifically, we first calculated the probability density function (PDF) of the predicted residual for wind turbines operating under normal conditions and then determined the alarm threshold of wind turbine condition monitoring to detect potential faults in advance. Based on the well-trained SSA-SABiLSTM model and health dataset B, the alarm threshold was calculated as 2.13 °C according to Equations (31) and (32). Figures 19 and 20 display the anomaly detection results of failure WTs No.14 and No.27. It can be observed in Figures 19 and 20 that the predicted residuals of the two WTs exceeded the alarm threshold at 6/8/2020 08:20 and 8/15/2020 19:10, respectively. The residuals then surged to their maximum values at 6/9/2020 16:30 and 8/17/2020 11:20, respectively, consistent with the alarm times (i.e., 6/9/2020 16:32 and 8/17/2020 11:21) recorded in the SCADA system.
Consequently, compared with the actual failure times of WTs No.14 and No.27, the alarm threshold calculated in this study detected the potential main bearing over-temperature faults approximately 32.2 h and 40.18 h in advance, which verified the rationality of the determined alarm threshold and the early fault-detection capability of the designed SSD approach.
To sum up, based on fault dataset C collected from two wind turbines, the effectiveness and practicability of the designed SSD approach were validated in terms of the identification of main bearing deterioration conditions and early fault warning. Moreover, the validation results illustrate that the designed SSD WTCM approach can automatically detect the deterioration conditions 47-120 h in advance and the potential faults approximately 36 h ahead of the occurrence of an actual fault, which can provide sufficient time for the O&M technicians to take appropriate measures (e.g., timely repair or replacement) to avoid possible major accidents, unnecessary O&M costs, and excessive downtime.

Conclusions
In this study, a novel wind turbine condition monitoring method (SSA-SABiLSTM-BinSegCPD, SSD) was proposed to automatically identify deterioration conditions and provide an early fault warning. Based on the datasets collected from multiple WTs, the effectiveness and superiority of the designed SSD approach were verified by comparing it with other models. The following conclusions can be drawn from the above experimental results.
(1) A normal-behavior model (SSA-SABiLSTM) for wind turbine critical components or subsystems was constructed by combining the sparrow search algorithm (SSA) and the BiLSTM network with the self-attention mechanism (SA). The SSA-SABiLSTM model can effectively learn the nonlinear temporal dynamic characteristics hidden in the SCADA data. The introduction of the SA and SSA methods significantly improved the prediction performance of the BiLSTM model. The MAE, RMSE, and MAPE values of the SSA-SABiLSTM model were 0.2543 °C, 0.3412 °C, and 0.0069, which were 49.77%, 48.56%, and 54.1% lower than those of the BiLSTM model, respectively. The $R^2$ value of the SSA-SABiLSTM model was 0.9731, which was 7.51% higher than that of the BiLSTM model.
(2) A hybrid anomaly detection strategy consisting of changepoint detection and a threshold alarm was designed, which can improve the accuracy, reliability, and timeliness of the early fault warnings.
(3) A real fault dataset (i.e., fault dataset C) consisting of two actual main bearing failure cases was employed to verify the effectiveness and practicability of the SSA-SABiLSTM model and the hybrid strategy. The results illustrate that, compared with the failure time recorded by the SCADA system, the proposed SSD method can automatically identify the deterioration conditions 47-120 h in advance and detect the potential faults approximately 36 h ahead of the occurrence of the actual fault.
In addition to main bearings, the proposed SSD method can be used for other key components (e.g., generator, gearbox) or even for machinery other than wind turbines. It should also be noted that this study only considered SCADA operating data with a 10 min sampling interval and did not consider other monitoring signals, such as vibration signals. Therefore, in future studies, we aim to focus on the feature integration of 10 min SCADA data and vibration signal data, taking these two types of data as model inputs to improve the overall model performance.