Proactive Critical Energy Infrastructure Protection via Deep Feature Learning

Abstract: Autonomous fault detection plays a major role in the Critical Energy Infrastructure (CEI) domain, since sensor faults can cause irreparable damage and lead to incorrect results in the condition monitoring of Cyber-Physical (CP) systems. This paper focuses on the challenging application of wind turbine (WT) monitoring. Specifically, we propose two architectures based on learning deep features, namely Long Short Term Memory-Stacked Autoencoders (LSTM-SAE) and Convolutional Neural Network-Stacked Autoencoders (CNN-SAE), for semi-supervised fault detection in wind CP systems. The internally learnt features facilitate the classification task by assigning each incoming measurement to its corresponding faulty/normal operation status. To illustrate the quality of our schemes, their performance is evaluated on real-world wind turbine data. The experiments show that both the LSTM-SAE and CNN-SAE schemes provide high classification scores, indicating a high detection rate for wind turbine faults. Additionally, with slight modifications our architectures can be applied to other fault/anomaly detection categories in various Cyber-Physical systems.


Introduction
Nowadays, the demand for autonomous condition assessment and fault detection in cyber-physical systems and critical energy infrastructures (CEIs) has grown tremendously. A major cause is the widely diverse structure of current CEIs, which makes physical monitoring extremely difficult. In this direction, wind turbine (WT) systems are considered among the most complex cyber-physical infrastructures, and their failures can cause large cascading effects on other critical infrastructures, such as Electrical Power and Energy Systems (EPES), communications, transportation, industry and finance. Wind turbine infrastructures produce condition monitoring and operational data (i.e., Supervisory Control and Data Acquisition, SCADA), including air temperature, air pressure, voltage and power, with multiple types of parameters and periodic characteristics. In comparison with legacy SCADA systems, recently developed infrastructures use less expensive and more scalable Internet-based technologies to enable data monitoring in near real time [1].
However, the main limitations of the wind turbine industry still persist. Specifically, the maintenance cost and the urgent replacement of malfunctioning components make autonomous fault detection highly important for the wind industry. The main types of damage that impair the proper functioning of wind turbines are caused by unfavourable weather conditions, which affect several functional instruments [2]. An anomaly in a cyber-physical system, and specifically in wind turbine measurements, is any pattern that behaves differently from the normal state, for instance extremely high or low bearing temperature or pressure values [3]. From a high-level perspective, fault detection techniques can be divided into: (i) model-based techniques, which use specific dynamic formulations whose main goal is to synthesize representative residuals for the fault detection architectures [4][5][6], and (ii) data-driven techniques, which follow standard data mining methods in order to identify discrepancies between the model predictions and the ground truth measurements [2][7][8][9][10][11][12][13][14].
In order to tackle the aforementioned limitations, we design two Deep Learning (DL) schemes, namely Long-Short Term Memory [15]-Stacked Autoencoders [16] (LSTM-SAE) and Convolutional [17,18]-Stacked Autoencoders (CNN-SAE), to address the problem of semi-supervised wind turbine fault detection. In the training phase of our models we use labelled data, including the anomaly types, while in the validation phase we consider only unlabelled data and retrieve the corresponding categories via the proposed DL architectures. By exploiting the structure of the internal representations, we are able to extract significant features that facilitate the subsequent classification task [19]. To validate our claims, we use a real wind turbine dataset [13,20] that includes five monitoring states: the normal state, in which the turbine operates normally, and four fault categories, namely heating fault, excitation fault, feeding fault, and main-turbine fault. The proposed architectures can be extended to detect complex fault patterns or new anomaly types that vary significantly from the current operating status, and they can also be applied to detect abnormal patterns in other cyber-physical system applications. The main contributions of the proposed work to the CEI sector are summarized as follows:

• The development of two schemes for automatic feature learning that tackle the semi-supervised wind turbine fault detection problem. The proposed schemes can also be extended to perform unsupervised anomaly detection.

• The flexibility provided by the proposed formulations, since they can be applied to any cyber-physical system after minor modifications.

• Finally, according to the related state-of-the-art, we claim to be the first to design and develop the LSTM-SAE and CNN-SAE architectures for the problem of wind turbine fault classification.
The remainder of the paper is structured as follows: Section 2 reviews the related state-of-the-art methodologies, covering current trends in anomaly detection and wind turbine fault detection approaches. Section 3 presents the proposed Stacked Sparse Autoencoder architectures, based on Long-Short Term Memory and Convolutional Neural Networks, while Section 4 reports the validation on SCADA data from a real wind turbine. Finally, Section 5 concludes and outlines the future directions of this work.

Anomaly Detection
Anomalies (i.e., faulty measurements) are patterns that appear infrequently and do not comply with the existing behaviour, which is denoted as normal [21]. In our case, abnormal measurements are the wind turbine measurements that do not conform with the already defined classes, either by presenting missing values on several attributes (e.g., Temperatures, Wind Speed) or by exhibiting extremely high or low feature values within specific time intervals. Consequently, anomaly detection systems should provide high sensitivity in discriminating whether the incoming SCADA measurements can be considered normal or not [22]. Generally, anomaly/fault detection techniques may be divided into: (i) supervised, in which the anomaly type is available [23], (ii) unsupervised, in which no labelling information is available for the input data [24,25], and (iii) semi-supervised, in which partial labelling is available [26][27][28].
In the supervised learning case, properly labelled training datasets are created, including all possible anomalous/fault types together with their assignment to the individual categories. The main advantage of these classification approaches lies in their efficiency and flexibility in recognizing immediately whether a new example is anomalous or normal, based on the pre-existing attack types or patterns that are present in the input dataset. Consequently, classification algorithms use a training phase to build models based on the pre-defined normal activity, and a testing phase in which they label each new example as normal or abnormal. In the scenario of multiple fault categories, proper labelling should also be provided for each available class. This scenario is known as multi-class supervised classification [29][30][31][32].
In the unsupervised category, the majority of approaches in the literature learn the internal representations of the input multivariate time-series data and then set empirical thresholds in order to decide whether a new measurement is normal or abnormal [33][34][35]. Another state-of-the-art strategy is to consider clustering-based approaches, which separate the input feature space into its corresponding categories (i.e., normal or abnormal) and also provide empirical thresholds that determine the class to which each new measurement belongs [36][37][38][39][40][41].
Finally, in semi-supervised anomaly detection, the training phase uses labelled multivariate time-series data, while in the validation phase the system predicts the category of each new measurement based on the historical measurements. The majority of semi-supervised anomaly detection techniques adhere to deep feature learning formulations [27,42,43]. Motivated by these examples, in this study we exploit two characteristic schemes for learning deep feature hierarchies [44,45] for semi-supervised fault classification/detection on multivariate wind turbine time-series data.

Wind Turbine Anomaly Detection
Recently, a great amount of research has been carried out on wind turbine anomaly detection. In this section we highlight the most significant techniques in the recent literature. The authors of Reference [46] propose an anomaly detection technique for offshore wind SCADA data that builds an explicit model for each individual sensor; the model predicts the expected value for each time interval using the measurements of a subset of the other sensors at the same time, as well as the most recent past measurements. To solve this time-series problem, the authors exploit the Least Absolute Shrinkage and Selection Operator (LASSO) optimization technique [10,47]. Another approach was presented in Reference [48], where the authors exploit a Support Vector Machine scheme for wind turbine fault detection and set an empirical threshold for determining the abnormal values of the wind signals. Similarly, the authors of Reference [32] combine a residual-based formulation with the mathematical framework of Support Vector Machines (SVM) for the fault detection and isolation problem on wind turbines. This thresholding residual-based approach identifies abrupt changes in several features.
Another interesting approach was presented in Reference [49], where the authors propose a Gaussian-based wind turbine condition monitoring technique. Specifically, they rely on probability distributions and form a real-time power curve in order to identify operational anomalies. Additionally, the authors of Reference [50] propose an anomaly detection approach using wavelet transforms and neural networks for the state monitoring of wind turbines. Specifically, a non-linear Autoregressive (AR) signal processing technique based on artificial neural networks is adopted to estimate temperature features of the gearbox instruments. As a metric, the authors use the Mahalanobis distance, since it models deviations among the different states efficiently, while the wavelet transform removes noise from the signals. Moreover, the authors of Reference [2] propose a deep auto-encoder model that learns the behaviour of wind turbine SCADA measurements.
For this purpose, multiple restricted Boltzmann machines (RBMs) are utilized. In this way, the relationships between the SCADA variables are extracted, while the condition of the components is determined via the obtained reconstruction error. Another efficient approach is illustrated in Reference [14], where the authors combine Random Forests with Gradient Boosting (XGBoost) in order to perform autonomous wind turbine fault classification. The Random Forests classifier ranks the extracted features according to their importance, while the XGBoost algorithm takes the top-ranked features and trains classifiers for the different fault types.

Proposed Methodology: Anomaly Detection in Wind Turbine Time Series Data
In this section we provide the main formulations that were designed towards the problem of anomaly detection in wind turbine time-series data. The first architecture that we developed adheres to a Long Short Term Memory-Stacked Autoencoders (LSTM-SAE) scheme, while the second one follows a Convolutional Neural Network-Stacked Autoencoders (CNN-SAE) formulation.

Stacked Sparse Autoencoders
Traditionally, the deterministic feed-forward architecture of an autoencoder [16] is composed of an input layer, several intermediate layers, and a single output layer that contains the same number of nodes as the input layer. This fully unsupervised structure is trained via a state-of-the-art back-propagation technique [51]. From a high-level perspective, Stacked Sparse Autoencoders learn an approximation of the identity mapping on the input feature space. The input feature space is encoded via $\sigma : \mathbb{R}^N \to \mathbb{R}^M$, the activation function, which is usually selected to be non-linear and maps each input vector $s \in \mathbb{R}^N$ to a new feature space composed of $M$ hidden units, in order to synthesize the equivalent output feature vector.
In the following, we describe the mathematical formulation of a single-layer SAE scheme. Let $x \in \mathbb{R}^N$ be the input vector, $h \in \mathbb{R}^M$ the hidden layer's vector, and $\hat{x} \in \mathbb{R}^N$ the output vector. According to the Sparse Autoencoders framework, the number of output layer units is equal to the number of input layer units. Consequently, the main goal is to define the appropriate weight matrix $W \in \mathbb{R}^{M \times N}$ and the corresponding bias term $b \in \mathbb{R}^M$ in order to generate the hidden layer that provides the internal representation of the system and is able to reconstruct the input vector $x$ efficiently, as follows:

$$h = \sigma(Wx + b).$$

In this formulation we choose the logistic sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, as the non-linear activation function. Additionally, we consider the decoding weight matrix $\hat{W} \in \mathbb{R}^{N \times M}$, which is responsible for the decoding phase and thus connects the hidden representation with the output vector as follows:

$$\hat{x} = \sigma(\hat{W}h + b_2).$$

In the aforementioned equation $b_2$ corresponds to the decoding bias parameter. According to the mathematical background of the SAE, tied weights are considered among the weight matrices: $\hat{W} = W^{T}$. Additionally, in order to further guarantee the consistency between the input and output feature spaces, sparsity constraints [52] are enforced upon the minimization of the non-linear error function. For this purpose, we define $J(X, \hat{X})$ to be the loss function between the input, $X$, and the output, $\hat{X}$, feature spaces:

$$J(X, \hat{X}) = \frac{1}{2} \sum_{i=1}^{K} \left\| x_i - \hat{x}_i \right\|_2^2.$$

Finally, another key issue regards the restriction of the average activation of the hidden units to a small threshold, by adding the widely used Kullback-Leibler (KL) divergence constraint [53] to the loss:

$$J_{\text{sparse}}(X, \hat{X}) = J(X, \hat{X}) + \lambda \sum_{j=1}^{M} \mathrm{KL}(p \,\|\, \hat{p}_j),$$

where the KL term is formulated as:

$$\mathrm{KL}(p \,\|\, \hat{p}_j) = p \log \frac{p}{\hat{p}_j} + (1 - p) \log \frac{1 - p}{1 - \hat{p}_j}.$$

Here $\lambda$ stands for the sparsity balancing parameter, $K$ is the total number of examples, $p$ is the target average activation, and $\hat{p}_j$ corresponds to the average activation of the $j$-th hidden unit over the input data. In this way, the network activates the most representative hidden nodes by learning the appropriate weights.
An extension of the aforementioned analysis is the so-called Stacked Sparse Autoencoders (SSAE) scheme, in which multiple shallow SAE architectures are stacked. The resulting sequence of unsupervised feature layers can be trained efficiently via a greedy layer-wise optimization algorithm.
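To make the single-layer formulation concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the standard sparse-autoencoder forms: a sigmoid encoder, a tied-weight decoder, a squared reconstruction error and a KL sparsity penalty. The function name `sparse_autoencoder_loss`, the default values of `p` and `lam`, and the random toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparse_autoencoder_loss(X, W, b, b2, p=0.05, lam=1e-3):
    """Forward pass and sparse loss of a single-layer autoencoder.

    X  : (K, N) input examples, one per row
    W  : (M, N) encoder weights; the decoder reuses W transposed (tied weights)
    b  : (M,) encoder bias, b2 : (N,) decoder bias
    p  : target average activation (assumed value)
    lam: sparsity balancing parameter (assumed value)
    """
    H = sigmoid(X @ W.T + b)        # hidden representation, shape (K, M)
    X_hat = sigmoid(H @ W + b2)     # reconstruction, shape (K, N)
    recon = 0.5 * np.sum((X - X_hat) ** 2)
    # Average activation of each hidden unit, clipped to keep the logs finite.
    p_hat = np.clip(H.mean(axis=0), 1e-8, 1.0 - 1e-8)
    kl = np.sum(p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))
    return recon + lam * kl, H, X_hat

# Toy data: 10 examples with 29 features (the feature count used later in the paper).
rng = np.random.default_rng(0)
X = rng.random((10, 29))
W = 0.1 * rng.normal(size=(8, 29))
loss, H, X_hat = sparse_autoencoder_loss(X, W, np.zeros(8), np.zeros(29))
```

In a stacked configuration, the hidden representation `H` of one layer would serve as the input `X` of the next.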

Long Short Term Memory-SAE for Wind Turbine Fault Detection
Long Short-Term Memory (LSTM) [54] networks belong to the wide category of Recurrent Neural Networks (RNNs) and, instead of using neurons in the hidden layer, they use memory blocks. Specifically, the traditional structure of an LSTM block considers a memory cell ($C_t$) and an input, output, and forget gate, denoted as $i_t$, $o_t$, and $f_t$, respectively. For a given time record $t$, we declare $x_t \in \mathbb{R}^d$ as the input vector, $z_t \in \mathbb{R}^m$ as the hidden representation, and $\hat{c}_t \in \mathbb{R}^m$ as the state vector, which is the candidate of the memory cell. The equations below provide the basic formulations for each gate and state: Additionally, $\sigma$ corresponds to the sigmoid and $\tanh$ to the hyperbolic tangent activation function, while diag stands for diagonal matrices. The variables can be encoded via the proposed LSTM-SAE architecture as: where $U \in \mathbb{R}^{m \times m}$ and $W \in \mathbb{R}^{m \times k}$ are denoted as the RNN coefficient weight matrices, and $z \in \mathbb{R}^m$ stands for the state vector. Additionally, for each $X_t$, the output is formulated as: where $z_t^i$ corresponds to the output vector of the $i$-th encoder unit, while $\phi$ denotes the parameter values imposed on the encoder. When the whole sequence has been passed to the RNN encoder, we apply a max-pooling operation, formulated as: where the index $j$ runs over the rows of $z_t^i$. The decoder part of the LSTM-SAE architecture reconstructs the input as follows: $\hat{x}$ where: $\hat{z}$ and $\psi$ stands for the parameters of the RNN decoder part. When the reconstructed input is retrieved, is evaluated, and the LSTM encoder and decoder parameters are updated.
The final layer, that is, the fully connected layer, is activated using the softmax function as: Finally, we exploit a standard back-propagation algorithm in order to learn the model's trainable parameters. Specifically, in the back-propagation procedure, the model's parameters are updated via the alternating minimization of the cost function with respect to each parameter: In the aforementioned formulation, we denote $D$ as the training dataset. In order to extract the final prediction, we calculate the maximum value of: In Figure 1 we depict the main diagram of our LSTM-SAE scheme for wind turbine fault classification.
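As an illustration of the encoder and classification steps described in this section, the sketch below implements one step of a textbook LSTM cell in NumPy, runs it over a short sequence, pools the hidden states and applies a softmax layer. It is a sketch under stated assumptions, not the exact variant of the paper: peephole (diagonal) terms are omitted, the max-pooling is assumed to run over time steps, and all names (`lstm_step`, `encode_sequence`, the parameter dictionary, `V`, `c`) and random weights are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

def lstm_step(x_t, z_prev, c_prev, params):
    """One textbook LSTM update (no peephole connections)."""
    W, U, b = params["W"], params["U"], params["b"]  # (4m, d), (4m, m), (4m,)
    m = z_prev.shape[0]
    a = W @ x_t + U @ z_prev + b
    i_t = sigmoid(a[0:m])             # input gate
    f_t = sigmoid(a[m:2 * m])         # forget gate
    o_t = sigmoid(a[2 * m:3 * m])     # output gate
    c_hat = np.tanh(a[3 * m:4 * m])   # candidate memory cell
    c_t = f_t * c_prev + i_t * c_hat  # updated memory cell
    z_t = o_t * np.tanh(c_t)          # hidden representation
    return z_t, c_t

def encode_sequence(X, params, m):
    """Run the cell over a (T, d) sequence and max-pool the hidden states over time."""
    z, c = np.zeros(m), np.zeros(m)
    states = []
    for x_t in X:
        z, c = lstm_step(x_t, z, c, params)
        states.append(z)
    return np.stack(states).max(axis=0)   # assumed pooling axis: time

d, m, n_classes, T = 29, 50, 5, 6
rng = np.random.default_rng(0)
params = {"W": 0.1 * rng.normal(size=(4 * m, d)),
          "U": 0.1 * rng.normal(size=(4 * m, m)),
          "b": np.zeros(4 * m)}
code = encode_sequence(rng.normal(size=(T, d)), params, m)

# Fully connected softmax layer; the predicted class is the most probable one.
V, c = 0.1 * rng.normal(size=(n_classes, m)), np.zeros(n_classes)
probs = softmax(V @ code + c)
predicted_class = int(np.argmax(probs))
```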

Convolutional Neural Networks (CNN) for 1D Signals
In contrast with fully-connected deep-learning architectures, where the activation of each hidden unit is evaluated by multiplying the whole input vector with certain weights, Convolutional Neural Networks (CNNs) compute the activation of each hidden unit from only a small, local portion of the input data. CNNs synthesize a hierarchy of increasingly abstract features by combining several convolutional and sub-sampling (i.e., pooling) layers. This hierarchy is usually followed by a sequence of fully connected layers, which are responsible for the final classification task. At the convolutional layer, the input feature vector is convolved with a learnt kernel and is passed directly through a non-linear activation function in order to synthesize the output feature vector. From a theoretical point of view, we consider a square region of size $k \times k$ extracted from our input data $X \in \mathbb{R}^{N \times M}$, and we denote by $w \in \mathbb{R}^{m \times m}$ the filter operator. The output of the convolutional layer forms a feature map $h \in \mathbb{R}^{(k-m+1) \times (k-m+1)}$ that can be formulated as follows:

$$h_{ij} = \sigma\left( \sum_{u=0}^{m-1} \sum_{v=0}^{m-1} w_{uv}\, x_{(i+u)(j+v)} + b \right).$$

In the aforementioned equation $b$ is the bias parameter and $\sigma(\cdot)$ is the non-linear activation function. Widely used choices for the non-linear activation function are the hyperbolic tangent function, the logistic sigmoid function, and the Rectified Linear Unit (ReLU): $f(x) = \max(0, x)$ [18,21].
Normally, each convolutional layer is followed by a pooling layer that produces a down-sampled (i.e., lower-dimensional) version of its input. Among the many types of pooling operators, the most common are average-pooling and max-pooling. Pooling operators partition the input data into a set of non-overlapping or overlapping sub-regions and output the maximum or average value of each sub-region. Their main advantages are the reduction of the model's computational complexity during training and a degree of translation invariance. Finally, the last layer of the CNN architecture, which is a fully-connected (dense) layer, assigns a probability value to each output unit.
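For a 1D signal, the convolution, activation and pooling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical function names and toy values; as in most deep-learning libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
import numpy as np

def conv1d_valid(x, w, b=0.0):
    """Slide a length-m kernel over a length-n signal; output length is n - m + 1."""
    n, m = len(x), len(w)
    return np.array([np.dot(x[i:i + m], w) + b for i in range(n - m + 1)])

def relu(a):
    return np.maximum(0.0, a)

def max_pool1d(h, size):
    """Maximum over non-overlapping windows; a trailing partial window is dropped."""
    usable = (len(h) // size) * size
    return h[:usable].reshape(-1, size).max(axis=1)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 0.0])   # toy input signal
w = np.array([-1.0, 0.0, 1.0])                      # toy difference kernel
h = relu(conv1d_valid(x, w))      # [1., 2., 2., 1., 0.]
pooled = max_pool1d(h, size=2)    # [2., 2.]
```

The feature map has 7 - 3 + 1 = 5 entries, which mirrors the $(k-m+1)$ output size of the 2D formulation along each axis.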

Proposed CNN-SAE Architecture
The proposed CNN-SAE scheme adheres to the SAE architecture that was posed in Section 3.1, except that the model's weights $W \in \mathbb{R}^{M \times N}$ are shared. Specifically, for a single-layer network with $x \in \mathbb{R}^N$ input units, $h \in \mathbb{R}^M$ hidden units, and $\hat{x} \in \mathbb{R}^N$ output units, the latent representation is synthesized as:

$$h = \sigma(x * W + b),$$

where $\sigma$ denotes the activation function, $b$ stands for the bias term, and $*$ stands for the 1D convolution operation. Since each filter specializes on features of the whole input vector, we use a single bias per latent map [19]. Consequently, the reconstruction (i.e., output) layer can be formulated as:

$$\hat{x} = \sigma(h * V + b_2),$$

where $V \in \mathbb{R}^{N \times M}$ stands for the separate weight matrix that connects the hidden layer with the output layer, $b_2$ stands for the decoding bias, and $*$ is the 1D convolution product. Additionally, we apply a standard back-propagation algorithm in order to compute the gradient of the model's error function with respect to the parameter values. This procedure can be summarized as: where $\delta h$ denotes the delta of the hidden state, and $\delta x$ corresponds to the delta of the reconstruction state.
In this formulation, we update the weights via a Stochastic Gradient Descent (SGD) algorithm [18]. In the proposed CNN-SAE architecture, each 1D convolutional layer is followed by a max-pooling layer. In this way, the latent representation is sub-sampled by a constant factor, by taking the maximum value over certain non-overlapping signal areas. Specifically, a max-pooling layer with sparsity constraints is used in order to eliminate all non-maximal values in the non-overlapping regions. In this way, we avoid trivial solutions. In the reconstruction phase, the sparse representation decreases the average number of filters that contribute to the decoding of each sub-region. This procedure forces the learnt filters to be more generic and representative [55]. Finally, the last layer, which is the classification layer, is chosen to be fully connected and is activated with the non-linear Softmax function. Figure 2 illustrates the block diagram of the CNN-SAE architecture.
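The elimination of non-maximal values within non-overlapping regions can be illustrated as follows. This is a minimal NumPy sketch; the function name `sparsify_by_max_pool` and the handling of a trailing partial window (set to zero) are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def sparsify_by_max_pool(h, size):
    """Keep only the maximum of each non-overlapping window and zero the rest."""
    h = np.asarray(h, dtype=float)
    usable = (len(h) // size) * size
    blocks = h[:usable].reshape(-1, size)
    mask = np.zeros_like(blocks)
    # Mark the position of the maximum inside each window.
    mask[np.arange(blocks.shape[0]), blocks.argmax(axis=1)] = 1.0
    out = np.zeros_like(h)            # a trailing partial window stays at zero
    out[:usable] = (blocks * mask).ravel()
    return out

latent = [0.2, 0.9, 0.4, 0.1, 0.7, 0.3]
sparse_latent = sparsify_by_max_pool(latent, size=3)   # [0., 0.9, 0., 0., 0.7, 0.]
```

Only one activation per window survives, so each region of the reconstruction is decoded from fewer filters.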

Dataset Description
In this study we consider data extracted from the Wind Turbine Fault Detection (wt-fdd) API [20]. Specifically, the data were acquired from a 3 MW direct-drive turbine located near the south coast of Ireland, which supplies power to a large manufacturing facility. The acquired measurements correspond to an 11-month time period, from May 2014 until April 2015. The time-stamped operational SCADA data are provided at 10 min intervals, each representing the average of the sensor readings over that time frame. Out of the dataset's 61 features, we use 29, including measurements of the Wind Energy Converter (WEC) that are related to the operating state of the turbine. Characteristic features include the wind speed, rotation, power and bearing temperatures, among others. Additionally, every time the operating state of the turbine changes, a new time-stamped warning or alarm message is generated. It is assumed that the wind turbine operates in a specific state until the next status message is generated. Multiple messages indicate abnormal or faulty operation of the turbine. Each message is associated with two status categories: the "main status" and the "sub-status". A characteristic example of the fault categories is provided in Table 1. The WEC status data that we utilize in this study include the following categories of abnormal/faulty measurements: feeding fault, generator heating fault, excitation fault, and main failure. A characteristic example of the data distribution is illustrated in Figure 3, where we show the evolution of the anemometer's Wind Speed and the ambient Control Cabin Temperature under three categories: the normal state, the feeding fault, and the generator's heating fault. In all investigated scenarios, we used 147,081 measurements for the training phase of our deep feature learning architectures, and 98,054 for the testing phase. The data were separated for training and testing using an 80-20% ratio, by considering random signal permutations.
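A split based on a random permutation of the measurements can be sketched as below. This is a minimal NumPy illustration on synthetic data; the function name `permutation_split`, the seed and the toy array sizes are assumptions, and only the 29-feature and five-class dimensions are taken from the text.

```python
import numpy as np

def permutation_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle the sample indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Synthetic stand-in for the SCADA measurements: 1000 records, 29 features, 5 states.
X = np.random.default_rng(1).normal(size=(1000, 29))
y = np.random.default_rng(2).integers(0, 5, size=1000)
X_train, y_train, X_test, y_test = permutation_split(X, y)
```

Fixing the seed keeps the split reproducible across runs, which matters when several architectures are compared on the same partition.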

Evaluation Metrics
The most significant indicator that quantitatively evaluates the performance of the proposed architectures is the Accuracy (AC) metric, formulated as:

$$AC = \frac{TP + TN}{TP + TN + FP + FN}.$$

In the aforementioned equation, $TP$ indicates the true positives, which are the anomalous measurements that are classified as anomalous; $TN$ stands for the true negatives, which are the normal records that are declared as normal; $FP$ indicates the false positives, denoting the normal measurements that are classified as anomalous; and finally $FN$ corresponds to the false negatives, denoting the anomalous measurements that were characterized as normal.
In order to further validate the quality of our models, we select the Area Under the Receiver Operating Characteristic (ROC) Curve (ROC-AUC) evaluation metric. The ROC-AUC score determines the degree of discrimination between the different categories, by measuring the classification performance for each category [56]. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), where $TPR = \frac{TP}{TP + FN}$ and $FPR = \frac{FP}{FP + TN}$. This curve evaluates the models' degree of separability among the different anomalous or normal states. Scores close to 100% indicate robust models that can successfully determine the correct classes. Moreover, we exploit the Precision, Recall, and $F_1$-score metrics, defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

A high Precision score indicates fewer false positives, that is, fewer fault-free data incorrectly marked as faulty, and fewer unnecessary checks on the turbine. On the other hand, a high Recall score indicates a low ratio of false negative measurements, and thus fewer missed events. The $F_1$-score combines the Precision and Recall metrics into one metric, namely their harmonic mean. Finally, the selected loss function for our multi-class wind turbine classification scenario is the categorical cross-entropy:

$$L(y_{\text{true}}, y_{\text{prediction}}) = -\sum_{k=1}^{K} y_{\text{true},k} \log\left(y_{\text{prediction},k}\right),$$

where $y_{\text{true}}$ stands for the ground truth and $y_{\text{prediction}}$ for the predicted values, while $K$ denotes the total number of classes.
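For a multi-class problem, these metrics are typically computed per class from a confusion matrix, treating each class in turn as the positive one. The sketch below is a minimal NumPy illustration on toy labels; the function names are hypothetical, and averaging the cross-entropy over samples is an assumption of the example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_scores(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class k but belonging elsewhere
    fn = cm.sum(axis=1) - tp          # belonging to class k but predicted elsewhere
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1

def categorical_cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    """Mean over samples of -sum_k y_true_k * log(y_pred_k)."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1)))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy, precision, recall, f1 = per_class_scores(cm)
# accuracy = 4/6, precision = [0.5, 2/3, 1.0], recall = [0.5, 1.0, 0.5]

loss = categorical_cross_entropy(np.array([[1.0, 0.0], [0.0, 1.0]]),
                                 np.array([[0.5, 0.5], [0.25, 0.75]]))
```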

LSTM-SAE for Wind Turbine Anomaly Detection
This section investigates the classification accuracy and performance of our LSTM-SAE architecture when applied to the problem of semi-supervised anomaly detection on the features of the wt-fdd dataset. Specifically, we examine the performance of our LSTM-SAE scheme on a multi-class (i.e., five-category) classification problem. For this purpose, we deployed the architecture provided in Table 2, which considers 4 LSTM layers composed of 50 hidden units, followed by a Time-Distributed layer and a Dense (i.e., fully-connected) layer. The output layer of the LSTM-SAE network, which is a Time-Distributed layer, is approximately equal to the input layer. The classification layer is a Dense layer activated with the Softmax function; it assigns the corresponding probabilities to each class and provides the classification outcomes. The output shape indicates that each 29-dimensional input vector is mapped to a 5-dimensional vector that contains the probabilities of each class. The maximum of these probabilities indicates the predicted class. The hyper-parameters were selected with a cross-validation approach, in which we chose the best possible parameters for each architecture. The total number of trainable parameters of the LSTM-SAE network is 78,229. In this formulation, we set the batch size to 64 after a cross-validation procedure, while the number of internal iterations was set to 100. Figure 4 illustrates the evolution of the loss function of the LSTM-SAE architecture under different batch sizes. As we may observe, the lowest value of the loss function for the validation set (i.e., 0.039) is achieved when the batch size is set to 64. Figure 5 illustrates the evolution of the loss function and of the classification accuracy on the training and validation sets.
As we may observe, the loss converges to a stationary value within only a few epochs for both the training and validation sets, reaching its lowest value of 0.038 and 0.039 for training and validation, respectively. Additionally, after a small number of epochs, the classification accuracy in both the training and validation phases stabilizes at a stationary value, validating the high performance of the proposed scheme. On the training set, the proposed LSTM-SAE system achieves 83.38% classification accuracy, while on the validation set the best classification accuracy within the interval of 100 epochs is 83.05%. Additionally, Table 3 illustrates the confusion matrix for the five-category classification scenario. We observe that the proposed LSTM-SAE architecture provides a high detection rate between the ground-truth and predicted categories. The Precision metric for our LSTM-SAE scheme reaches 91% for the Normal State, 72% for the Feeding Fault, 65% for the Generator Fault, 97% for the Excitation Fault, and finally 91% for the Main Failure fault.
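The trainable-parameter total reported for the LSTM-SAE network can be checked with a short calculation. The sketch below assumes standard LSTM and dense parameter counts and a layer layout inferred from the description in this section (four LSTM layers of 50 units on 29 input features, a time-distributed dense reconstruction of the 29 features, and a 5-class dense layer fed by that reconstruction); the helper names are hypothetical and the exact layout is given in Table 2 of the original.

```python
def lstm_params(input_dim, units):
    # Four gate blocks, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

n_features, n_hidden, n_classes = 29, 50, 5

total = lstm_params(n_features, n_hidden)        # first LSTM layer: 16,000
total += 3 * lstm_params(n_hidden, n_hidden)     # three further LSTM layers: 60,600
total += dense_params(n_hidden, n_features)      # time-distributed reconstruction: 1,479
total += dense_params(n_features, n_classes)     # softmax classification layer: 150
print(total)  # 78229
```

Under these assumptions the count agrees with the 78,229 trainable parameters stated for the network.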

CNN-SAE for Wind Turbine Anomaly Detection
This section investigates the performance of our second proposed scheme, which adheres to a Stacked Autoencoders framework with Convolutional hidden layers (CNN-SAE). We examine the performance of the proposed CNN-SAE scheme on the multi-class wind turbine classification problem. For this purpose, we consider a sequence of Convolutional, Max-Pooling and Upsampling layers that build the encoder and the decoder, followed by a Dense layer with the Softmax activation function that performs the final discrimination (i.e., classification) by assigning the corresponding class probabilities. The proposed CNN-SAE architecture is summarized in Table 4. In this experimental setup the batch size was again fixed to 64, following a cross-validation process. The number of internal epochs for the algorithmic process to converge was set to 100. The total number of trainable parameters is 30,771. Figure 7 shows the evolution of the loss function of the CNN-SAE architecture under different batch sizes. As we may notice, the lowest value of the loss function for the validation set (i.e., 0.0126) is achieved when the batch size is set to 64. In Figure 8 we depict the convergence behaviour of the loss function (i.e., Mean Squared Error) and of the classification accuracy over the training and validation sets. Specifically, the proposed CNN-SAE deep feature learning architecture achieves a classification accuracy of 94.1% on the training set and 93.4% on the validation set, both within the interval of 100 running epochs, while the loss function reaches its lowest value of 0.1004 and 0.126 for the training and the validation sets, respectively, within the same interval. Figure 9 provides the AUC-ROC curve for the classification among the different states of the wind turbine.
Specifically, the Normal Status class achieves 97.9%, the Feeding Fault has a ROC score of 94.3%, the Generator Heating Fault reaches a 94.20% ROC score, the Excitation Fault achieves a ROC score of 99.9%, and finally the Main Failure class has a 99.8% AUC-ROC score. These values support our assumption that the proposed CNN-SAE architecture accurately classifies measurements into the proper Normal/Faulty state. Consequently, the proposed approach is able to detect with high sensitivity the correct state for each testing measurement.
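Per-class AUC-ROC scores of this kind are commonly obtained in a one-vs-rest fashion from the softmax probabilities. The following is a minimal NumPy sketch on toy data, using the equivalence between the AUC and the fraction of (positive, negative) pairs that are ranked correctly; the function names and the example values are hypothetical, and the pairwise computation is only practical for small samples.

```python
import numpy as np

def roc_auc_binary(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

def one_vs_rest_auc(y_true, probs):
    """One AUC per class, treating that class as positive and all others as negative."""
    y_true = np.asarray(y_true)
    return [roc_auc_binary(y_true == k, probs[:, k]) for k in range(probs.shape[1])]

y_true = np.array([0, 0, 1, 1, 2])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
])
aucs = one_vs_rest_auc(y_true, probs)   # [1.0, 5/6, 1.0]
```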

Comparison of the Developed Techniques
In the following paragraphs we compare our proposed Stacked Autoencoders architectures, with the ultimate goal of determining the best possible scheme for the wind turbine fault detection problem. As a baseline we consider the traditional form of an SAE architecture, and thus we compare the performance of the SAE architecture with dense hidden layers against our proposed LSTM-SAE and CNN-SAE schemes. Additionally, we validate our assumption that the proposed deep feature learning schemes provide improved performance over the simpler dense-layer SAE scenario. The proposed architectures solve the problem of semi-supervised wind turbine fault detection by learning representative features. Consequently, a fair comparison with the recent literature approaches reported in the related work section cannot be achieved, since these techniques either rely on binary classification tasks [2] or use probability distributions and thresholding operators [14,32]. Table 6 provides the evaluation metrics of the Stacked Autoencoders technique with Dense layers and of our proposed LSTM-SAE and CNN-SAE architectures. For the SAE scheme, we chose a deep learning scheme with 4 hidden layers activated with the ReLU function [57]. The final layer was activated with the Softmax function, and thus it assigns the corresponding probabilities to the different states. To perform a fair comparison, we preserve the same parameters as in the two proposed architectures, fixing the batch size to 64 and the number of internal epochs to 100.
We observe that our proposed architectures achieve highly accurate results, in the majority of cases over 90%. Regarding the Precision score, the LSTM-SAE architecture achieves its highest value of 91% for the Normal State, while the CNN-SAE scheme achieves the highest values of 91%, 77%, 99%, and 99% for the Feeding Fault, Generator Fault, Excitation Fault and Main Failure Fault states, respectively. In terms of the Precision metric, both the LSTM-SAE and CNN-SAE architectures outperform the simpler SAE architecture with Dense hidden layers. Consequently, both architectures present high performance in predicting the faulty/normal class of the measurements in the WT dataset.

Conclusions
This paper investigates the performance of two efficient architectures that learn deep and representative features in order to tackle the wind turbine anomaly detection problem. Specifically, Long Short Term Memory-Stacked Autoencoders and Convolutional Neural Network-Stacked Autoencoders were trained on multivariate time-series data to classify between faulty and normal states. The proposed Stacked Autoencoders techniques present high-quality results regarding the classification accuracy, the reduction of the loss function, and the evaluation metrics. One of our main future targets is to extend these methodologies towards unsupervised anomaly detection, and additionally towards the condition monitoring of other Critical Energy Infrastructures.