Ensemble Neuroevolution-Based Approach for Multivariate Time Series Anomaly Detection

Multivariate time series anomaly detection is a widespread problem in the field of failure prevention. Fast prevention means lower repair costs and losses. The number of sensors in novel industry systems makes the anomaly detection process quite difficult for humans. Algorithms that automate the process of detecting anomalies are crucial in modern failure prevention systems. Therefore, many machine learning models have been designed to address this problem. Mostly, they are autoencoder-based architectures with some generative adversarial elements. This work shows a framework that incorporates neuroevolution methods to boost the anomaly detection scores of new and already known models. The presented approach adapts evolution strategies for evolving an ensemble model, in which every single model works on a subgroup of data sensors. The next goal of neuroevolution is to optimise the architecture and hyperparameters such as the window size, the number of layers, and the layer depths. The proposed framework shows that it is possible to boost most anomaly detection deep learning models in a reasonable time and in a fully automated mode. We ran tests on the SWAT and WADI datasets. To the best of our knowledge, this is the first approach in which an ensemble deep learning anomaly detection model is built in a fully automatic way using a neuroevolution strategy.


Introduction
In this paper, we propose a high-level ensemble approach which is fine-tuned by a neuroevolution algorithm. The presented method is model independent. It can be adapted to any deep learning anomaly detection model. The main advantage of the algorithm is its fully automated mode.
In the anomaly detection field, deep learning models are those which achieve the best results on well-known benchmarks. These are mainly deep autoencoders based on LSTM, convolutional or fully connected layers. A wide variety of autoencoders are used, such as variational, denoising or adversarial autoencoders. Research shows that further improvements, such as adding a discriminator as an additional verification module or other GAN-based autoencoder modifications, can boost detection results. Recently, we can also observe promising results from using deep graph neural networks in anomaly detection [1].

arXiv:2108.03585v1 [cs.LG] 8 Aug 2021
Neuroevolution is a form of artificial intelligence that uses evolutionary algorithms to generate artificial neural networks (ANNs), their parameters, topology and rules. The most popular algorithms are NEAT, HyperNEAT, coDeepNEAT, etc. The presented approach is partially based on the NEAT algorithm, which is used for generating an optimal anomaly detection model. The search space and crossover/mutation rules are defined. The novelty of the proposed algorithm is that new search dimensions have been added. These dimensions are the training data distribution, dividing data into subgroups and searching for the optimal composition of the ensemble model.
The proposed neuroevolution search space is based on forming encoders and decoders from single neural layers such as fully connected, convolutional, recurrent or attention layers. There are two main dimensions of optimisation. Therefore, two populations exist inside the algorithm. The first is the models population, from which new single models are evolved by genetic operators. The second is the subgroup population, which is needed to form the ensemble model from the models population. This work concentrates on the data optimisation stage and the setting up of the ensemble model. It shows how this aspect can improve non-ensemble models. The last step in the NAS (neural architecture search) is the fitness definition. In the presented approach, the fitness is the sum of F1 scores from the training dataset and from a randomly reduced validation dataset.
The main advantages of the presented algorithm are that it enables building the ensemble model in automatic mode and creates a wide search space spanning various deep learning autoencoders, GAN architectures and optimal training data subgroups.

Related works
Anomaly detection has recently become quite a popular research subject. The basic unsupervised methods include linear model-based methods [2], distance-based methods [3][4], density-based methods [5], isolation-based methods [6] and many others. The best F1-score for these methods is 23% on the SWAT and 9% on the WADI dataset. However, deep learning-based methods have recently gained significant improvements in anomaly detection over the aforementioned approaches. Among the most popular deep learning models for multivariate anomaly detection are autoencoder models (AE), which use the reconstruction error as an anomaly indicator. Zong et al. propose a deep autoencoding Gaussian mixture model (DAGMM) [7], which jointly optimises the parameters of the deep autoencoder and the mixture model simultaneously. This solution yields an F1-score of 55% for the SWAT and 20% for the WADI dataset. Park et al. introduced the LSTM-VAE model [8], which replaces the feed-forward network in the variational autoencoder (VAE) with an LSTM. With this approach, it was possible to gain an F1-score of 75% for the SWAT and 25% for the WADI dataset. Russo et al. use an autoencoder which consists of 1D convolution layers [9]. This model was tested on the Urban Water Observatory Initiative (www.eawag.ch/uwo) datasets and has an anomaly detection accuracy of 35%. Audibert et al. proposed a fast and stable method called USAD [10], which is based on adversarially trained autoencoders. This model contains only fully connected layers and achieves a 79% anomaly detection F1-score for the SWAT and 23% for the WADI dataset. Generative adversarial networks (GANs) as anomaly detectors were proposed in [11]. The authors used LSTMs as the generator and discriminator models in the GAN framework, and anomalies were detected by using a combination of both models' errors (DR-Score). Through the use of this approach, the anomaly detection accuracy for this model is 77% for the SWAT and 37% for the WADI dataset. Deng et al. [1] achieve an F1-score of 81% for the SWAT and 57% for the WADI dataset through the use of a graph neural network (GNN). The mentioned deep learning models, LSTM, USAD and CNN 1D, are the baselines for the solutions proposed in this paper. Recently, neuroevolution algorithms have been used in many machine learning tasks to improve the accuracy of deep learning models [12]. In [13], a neuroevolution search is used for evolving neural networks for object classification in high-resolution remote sensing images. In [14], the authors present a neuroevolution algorithm for standard image classification. The authors in [15] show a neuroevolution strategy scheme for language modelling, image classification and object detection tasks. It is based on the co-evolutionary NEAT algorithm, which has two levels of optimisation. The first is single deep learning sub-block optimisation. The second is the composition of sub-blocks to form a whole network. The presented results showed that in most cases the optimised models achieved better results than models designed by humans.

Autoencoder architecture
Autoencoders are an unsupervised learning technique in which the neural network is trained to learn a compressed representation of raw data. The model consists of two parts: an encoder E and a decoder D. The encoder learns how to efficiently compress and encode the input data X to represent them in a reduced dimensionality - the latent variables Z. The decoder is taught how to reconstruct the latent variables Z back to the original shape of the input. The model is trained to minimise the reconstruction loss, i.e. the difference between the output of the decoder and the original input data, which can be expressed as:

L(X, X̂) = ‖X − X̂‖², where X̂ = D(Z) and Z = E(X).

The simplest kind of autoencoder is the undercomplete autoencoder (UAE). These models learn the most important and relevant attributes of the input data through the use of a bottleneck with a smaller dimension than the input data. Another type of autoencoder, called the denoising autoencoder (DAE), extracts important features from the data by reconstructing the original input after it has been contaminated by noise. In unsupervised tasks, the most popular type of autoencoder is the variational autoencoder (VAE). These autoencoders replace the bottleneck vector with two vectors: one representing the mean of the distribution and the second representing its standard deviation. For a given input, the encoder of a VAE determines a distribution of the latent variables. By contrast, the decoder determines the distribution of the inputs corresponding to the given latent variables.
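As a minimal illustration of the reconstruction-loss criterion described above, the following sketch scores windows by their mean squared reconstruction error and flags those above a threshold. The all-zero "reconstruction" and the threshold value are illustrative assumptions, not part of any model used in this paper:

```python
import numpy as np

def reconstruction_scores(x, x_hat):
    """Per-window anomaly score: mean squared reconstruction error."""
    return ((x - x_hat) ** 2).mean(axis=tuple(range(1, x.ndim)))

# Toy example with a "decoder" that reconstructs everything as zeros:
x = np.array([[0.1, -0.1],    # close to the all-zero reconstruction
              [5.0, 4.0]])    # far from it -> high reconstruction error
x_hat = np.zeros_like(x)
scores = reconstruction_scores(x, x_hat)   # [0.01, 20.5]
flags = scores > 1.0                       # threshold is an assumption
```

In practice, the reconstruction threshold is tuned on validation data rather than fixed by hand.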
Autoencoders are widely used in many fields, such as online intrusion detection [16], malware detection [17] and anomaly detection in streaming data [18]. This kind of model can consist of various types of layers, e.g. fully connected, CNN, LSTM, etc.

Neuroevolution ensemble approach
The prototype of our framework presented in figure 1 consists of two separate populations. The first is the models population. The second is the data subgroup population. This enables the formation of the ensemble model using an approach that is similar to a bagging-based technique. The framework starts with the generation of initial groups of features with the use of correlation (which is explained in detail in further subsections), then it mutates models and groups via the genetic algorithm. The final effect of those actions is an optimised ensemble model that can be used to detect anomalies. During our experiments with various models, we noticed that almost all models detect a similar set of anomalies despite changes in their hyperparameters. Of course, the results were slightly different depending on the hyperparameters, but none of the changes had a significant impact on detection. Therefore, we decided to apply an ensemble model based on dividing the available features into smaller groups and training each model on a separate subset of features. As a result of this, the models can discover more precise dependencies and relations between features. A simplified schema of our approach is presented in Algorithm 1. We apply a neuroevolution approach for searching for an optimal partition into groups (line 1, algorithm 1). After classifying data points using every model (line 2, algorithm 1), we use a voting mechanism to determine whether a data point should be considered an anomaly by the whole ensemble model (line 3, algorithm 1).
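The voting step (line 3, algorithm 1) can be sketched as follows; the `min_votes` threshold is an assumption, since the paper does not specify how many submodel votes are required for an ensemble-level detection:

```python
import numpy as np

def ensemble_vote(predictions, min_votes=1):
    """predictions: (n_models, n_points) 0/1 anomaly flags from the submodels.

    A point is an ensemble anomaly when at least `min_votes` submodels flag it.
    """
    return np.asarray(predictions).sum(axis=0) >= min_votes

# Three submodels voting on five data points.
preds = [[0, 1, 1, 0, 0],
         [0, 1, 0, 0, 1],
         [0, 1, 0, 0, 0]]
votes = ensemble_vote(preds, min_votes=2)   # only the second point is flagged
```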
Algorithm 1: Simplified general schema of our approach
Result: Classification of the anomalies
1. Find the best partition of features into groups using a genetic algorithm;
2. Train and evaluate a separate model for every group;
3. Evaluate an ensemble model using a voting algorithm.

To find an optimal partition of features into groups, we apply a genetic algorithm. The simplified schema of the genetic algorithm is presented in Algorithm 2.
A single gene provides the information that a feature f is present in a group t. A single solution represents k groups, each containing zero or more features. A sample solution for k = 3 could be: [[0, 1, 5, 12], [2, 3, 4, 9], [6, 7, 9]], where the numbers in the groups indicate which features are present. The population P contains N_P solutions.
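A minimal sketch of this solution encoding, assuming plain Python lists of feature indices (the helper names `random_solution` and `groups_of` are ours, not from the paper):

```python
import random

def random_solution(n_features, k, rng=random):
    """Randomly assign every feature to one of k groups (groups may be empty)."""
    groups = [[] for _ in range(k)]
    for f in range(n_features):
        groups[rng.randrange(k)].append(f)
    return groups

def groups_of(feature, solution):
    """Indices of the groups containing `feature` (duplicates are allowed)."""
    return [i for i, g in enumerate(solution) if feature in g]

# The sample solution from the text: feature 9 appears in two groups,
# while feature 8 is missing from all of them.
solution = [[0, 1, 5, 12], [2, 3, 4, 9], [6, 7, 9]]
```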
The parameters for the neuroevolution approach in this work are:
• k - maximal number of submodels in an ensemble model,
• p_m - probability of the mutation in a single group of features,
• N_g - number of generations in a genetic algorithm,
• N_P - size of the population in a genetic algorithm,
• N_par - number of parents mating,
• N_ep - number of epochs to train while calculating fitness.

Elements of the genetic algorithm
To improve the convergence of the genetic algorithm, instead of using a random initial population, we create it based on the correlation between features. We use hierarchical clustering with a small amount of added randomness to achieve a diverse population.
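A simplified sketch of such a correlation-driven initialisation; it replaces the hierarchical clustering with a greedy correlation-based assignment to randomly chosen seed features, so it only approximates the actual procedure:

```python
import numpy as np

def initial_solution(data, k, rng):
    """Group features by absolute correlation to k randomly chosen seed features.

    data: array of shape (samples, features). Returns a list of k groups.
    """
    corr = np.abs(np.corrcoef(data, rowvar=False))
    n_features = corr.shape[0]
    seeds = rng.choice(n_features, size=k, replace=False)
    groups = [[int(s)] for s in seeds]
    for f in range(n_features):
        if f in seeds:
            continue
        # assign to the group of the most correlated seed, with a little
        # added noise so repeated calls yield a diverse population
        scores = corr[f, seeds] + rng.normal(0.0, 0.01, size=k)
        groups[int(np.argmax(scores))].append(f)
    return groups
```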
The method for calculating the fitness of a single solution is presented in Algorithm 3. For every used dataset (SWAT and WADI), we split the normal part of the data into training and validation datasets. We calculate the fitness for every feature group in the solution. As the first step, we train a chosen model on the selected features from the training data for a given number of epochs (line 3, algorithm 3). After that, we evaluate the trained model on the training and validation data (lines 4 and 5, algorithm 3), calculating the losses. To normalise the loss, we calculate the weighted loss from the training and validation datasets (lines 6 and 7) and we also divide the weighted loss by the number of features in the group (line 8, algorithm 3). The final fitness for every solution is calculated as the negated sum of the losses for the groups in the solution (lines 9 and 10, algorithm 3). The value is negated because we want to minimise the total loss of the ensemble model, while in the genetic algorithm the goal is to maximise the fitness.
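The fitness calculation can be sketched as follows; `train_model` and `evaluate` are stand-ins for the framework's training and evaluation routines, and the equal weighting of training and validation losses is an assumption, since the text only states that a weighted loss is used:

```python
def solution_fitness(solution, X_train, X_val, n_epochs,
                     train_model, evaluate, w_train=0.5, w_val=0.5):
    """Negated sum of per-group normalised losses (sketch of Algorithm 3)."""
    loss_sum = 0.0
    for group in solution:
        if not group:                      # skip empty groups
            continue
        model = train_model(X_train, group, n_epochs)
        loss_t = evaluate(model, X_train, group)
        loss_v = evaluate(model, X_val, group)
        weighted = w_train * loss_t + w_val * loss_v
        loss_sum += weighted / len(group)  # normalise by group size
    return -loss_sum                       # GA maximises fitness

# Tiny demo with stub routines where the loss equals the group size.
fit = solution_fitness([[0, 1], [2]], None, None, 1,
                       train_model=lambda X, g, e: None,
                       evaluate=lambda m, X, g: float(len(g)))
# fit == -2.0
```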
Algorithm 3: Fitness calculation
Result: Fitness value for solution S
Input: S - solution
Input: X_t - train dataset
Input: X_v - validation dataset
Input: N_ep - number of epochs to train while calculating fitness
loss_sum = 0 ;
for g ∈ S do
    model = train_model(X_t, g, N_ep);
    loss_t = evaluate(model, X_t, g);
    loss_v = evaluate(model, X_v, g);
    loss_g = weighted(loss_t, loss_v) / |g|;
    loss_sum = loss_sum + loss_g;
end
return −loss_sum

During the crossover part (line 6, algorithm 2), we create a new solution based on two selected parents. The detailed steps of the method are presented in Algorithm 4. For every pair of groups of the parents, we determine what range of features is present in the groups and choose a random split point (lines 3-5, algorithm 4). The new group for the offspring then contains parts of the groups from both parents (lines 6-13, algorithm 4).
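A sketch of the per-group one-point crossover described above; Algorithm 4 itself is not reproduced in this excerpt, so the details below are our interpretation:

```python
import random

def crossover(parent_a, parent_b, rng=random):
    """Per-group one-point crossover: features up to a random split point come
    from the first parent's group, the rest from the second parent's group."""
    child = []
    for ga, gb in zip(parent_a, parent_b):
        pool = ga + gb
        if not pool:                 # both groups empty -> empty child group
            child.append([])
            continue
        split = rng.randint(min(pool), max(pool))   # random split point
        merged = sorted({f for f in ga if f <= split} |
                        {f for f in gb if f > split})
        child.append(merged)
    return child

a = [[0, 1, 5], [2, 3]]
b = [[1, 4], [3, 6]]
child = crossover(a, b, rng=random.Random(0))
```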
Offspring created via the crossover algorithm can also be affected by mutations (line 7, algorithm 2). In our work, we use three types of mutation:
1. Duplicating a selected feature to another group (presented in Algorithm 5),
2. Vanishing features that exist in more than one group in a single solution (presented in Algorithm 6),
3. Adding features that do not exist in any group in a single solution (presented in Algorithm 7).
The goal of the mutations is to help maintain diversity in the population. Mutation 1 allows the same feature to be available in a few groups. Mutation 2 protects solutions from having a few groups with exactly the same features and from overusing any feature. Mutation 3 makes it possible to restore features lost in other genetic operations.
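Mutations 1 and 3 can be sketched as follows; Algorithms 5 and 7 are not reproduced in this excerpt, so the probability handling is our assumption:

```python
import random

def mutate_duplicate(solution, p, rng=random):
    """Mutation 1: with probability p per group, copy one of its features
    into another randomly chosen group."""
    k = len(solution)
    for i, group in enumerate(solution):
        if group and k > 1 and rng.random() < p:
            feature = rng.choice(group)
            target = rng.choice([j for j in range(k) if j != i])
            if feature not in solution[target]:
                solution[target].append(feature)
    return solution

def mutate_add_missing(solution, n_features, p, rng=random):
    """Mutation 3: with probability p, re-insert a feature that is absent
    from every group into a randomly chosen group."""
    present = {f for g in solution for f in g}
    for f in range(n_features):
        if f not in present and rng.random() < p:
            rng.choice(solution).append(f)
    return solution
```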

Results
In this section, we describe the datasets and models used. We also demonstrate the improvements that were achieved through the use of the proposed solutions. We provide a comparison with methods from state-of-the-art articles.
All of our presented calculations were performed on an Nvidia Tesla V100-SXM2-32GB 1 . In order to reduce both training times, i.e. during the evolution algorithm and during the final training, each subgroup is calculated on a separate GPGPU. The values of the parameters of the genetic algorithm were the same for all experiments and are presented in Table 1 (column Basic value). Moreover, the model which gained the best results (CNN 1D) was also run once again with higher values of the following parameters: population size and parents mating (column Rerun value in Table 1), to check how this would affect the efficiency of the algorithm.

Datasets
As training and testing data, the following datasets were used:
• Secure Water Treatment (SWaT) Dataset [19] - it contains data gathered from a scaled-down version of a real water treatment plant. Data were collected over an 11-day period in two modes: 7 days of normal operation of the plant and 4 days during which cyber and physical attacks were executed. In our experiments, we use the newest version, as recommended by the authors of the dataset.

Models
In this paper, in order to detect anomalies, we use three autoencoder models. The first of these was proposed in [9], where the encoder contains three 1D CNN layers with kernel sizes k_1 = 8, k_2 = 6, k_3 = 4 and filter maps f_1 = 64, f_2 = 128, f_3 = 256. Each CNN layer is followed by a LReLU [21] activation function and batch normalisation. The decoder is a mirror reflection of the encoder, where CNN layers are replaced by transposed CNN layers.
The second model is a variational autoencoder [8] where both the encoder and the decoder contain two LSTM layers with hidden sizes equal to 16, which are followed by the LReLU activation function. During the training phase the batch size is set to 32, and during the test phase it is set to 1. The third model is the USAD model proposed in the literature [10]. It utilises the idea of GAN networks and the architecture of autoencoders. The USAD model consists of two autoencoders built from one shared encoder and two decoders. They are trained with the proposed two-phase training, which includes standard autoencoder training and adversarial training specific to GAN networks.
The architectures of the models were partially explored by us, including searching for the optimal window size, number of layers, and numbers of neurons in layers. We use the found hyperparameters in the neuroevolution approach. A more advanced search is planned for future works (see section 7). In the evolution algorithm (see Section 4), each model is trained over 15 epochs, and during the final training, each model is trained over 70 epochs. We divide the multivariate time series into sub-sequences with a sliding window, and we determine the size of this parameter in an experimental way. Consequently, the sliding windows have sizes of 4 for autoencoders with CNN layers, 8 in the case of LSTM-VAE and 12 in the case of USAD. In order to speed up training, we use down-sampling with a ratio of 5, which reduces the size of the data. As was indicated in the literature [10], this operation does not cause a significant drop in accuracy. Table 3 contains the number of trainable parameters for each model used, the optimal values for some of them and the time necessary to perform the whole process. As it turns out, the USAD model has the most parameters and therefore takes the longest time to train. Table 4 compares the results with those reported in the literature, including the GNN model [1]. Those results were achieved on the SWAT and WADI-2017 datasets. Additionally, this table contains results which were generated for this paper; in the case of the WADI dataset, WADI-2019 was used (marked as *). Moreover, for our baseline models (USAD, LSTM-VAE, CNN 1D), we present outcomes from our experiments on the SWAT dataset, in which for some cases the results are slightly different from the original results, as this required preparing our own implementation (marked as **).
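The sliding-window split and the down-sampling can be sketched as follows; the mean aggregation in `downsample` is an assumption, as the exact aggregation function is not specified here:

```python
import numpy as np

def sliding_windows(series, window):
    """Split a (timesteps, features) series into overlapping windows of
    shape (num_windows, window, features), stride 1."""
    n = series.shape[0] - window + 1
    return np.stack([series[i:i + window] for i in range(n)])

def downsample(series, ratio):
    """Reduce the series length by averaging non-overlapping blocks of
    `ratio` timesteps (aggregation function is an assumption)."""
    n = (series.shape[0] // ratio) * ratio
    return series[:n].reshape(-1, ratio, series.shape[1]).mean(axis=1)

series = np.arange(20.0).reshape(10, 2)   # 10 timesteps, 2 sensors
windows = sliding_windows(series, 4)      # shape (7, 4, 2)
reduced = downsample(series, 5)           # shape (2, 2)
```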
Table 5 contains the results gained after introducing the splitting into groups through the use of the genetic algorithm. We can observe a huge improvement on the WADI dataset in the case of the USAD model and the CNN-based autoencoder. The impact is smallest for the LSTM-VAE model. The best results were gained by the CNN 1D autoencoder. Due to that, for the CNN 1D model, we reran the experiment with higher values of the following parameters of the genetic algorithm: population size and parents mating. As a result, it was possible to improve the F1-score by about 2% on both the SWAT and WADI datasets. These results are marked as (***) in Table 5.

Conclusions
The results show that data distribution, dividing the input signals into subgroups and feeding an ensemble model can significantly improve the efficiency of the anomaly detection process. The neuroevolution process helps to find near-optimal subgroups. The tests were run on the WADI and SWAT benchmarks. In both cases, the best results among non-graph-neural-network models were achieved. The improvements on the WADI dataset are significant. The reason for this is that it has more sensors and time series samples than the SWAT dataset.

Future work
The paper presents a framework for evolving ensemble deep learning autoencoders for anomaly detection. Future work will concentrate on further enhancements of the algorithm. The most important enhancements are an ensemble model based on graph networks and new crossovers to mix different architectures together, e.g. attention with a discriminator, or graph networks with USAD. The main work will concentrate on evolving an optimal autoencoder architecture. The last action would be to run longer simulations, which can give further improvements in the F1-score. Further simulations will be run with bigger populations and more iterations.

Figure 1: The architecture of the framework

Algorithm 2: Genetic algorithm
Result: Final population after N_g generations
Input: N_P - size of the population
Input: N_g - number of generations in the genetic algorithm
Input: k - max number of groups in a single solution
Generate initial population ;
generation = 0 ;
while generation < N_g do
    For every solution in the population, calculate the fitness;
    Choose the best solutions as parents;
    Create offspring using crossover;
    Mutate offspring;
    generation ← generation + 1
end
Return final population

• Water Distribution (WADI) Dataset [20] - this dataset contains data from a scaled-down version of a water distribution network in a city. Collected data contains 14 days of normal operation and 2 days during which attacks were conducted.

Table 1: Parameters of the genetic algorithm

Table 2: Statistics of the used datasets

As presented in Table 2, there are two WADI collections available, from 2017 and 2019.

Table 3: Parameters of the models

Table 4: Anomaly detection accuracy (precision (%), recall (%), F1-score (%)) on two datasets without splitting into groups. Results marked as * were generated using the WADI-2019 dataset. ** means that we had to reimplement a model on our own.