Membrane System-Based Improved Neural Networks for Time-Series Anomaly Detection

Abstract: Anomaly detection in time series has attracted much attention recently and remains a challenging task. In this paper, a novel deep-learning approach (AL-CNN) that classifies time series as normal or abnormal with little domain knowledge is proposed. The proposed algorithm combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) to effectively model the spatial and temporal information contained in time-series data; the techniques of Squeeze-and-Excitation are applied to implement feature recalibration. However, the difficulty of selecting multiple parameters and the long training time of a single model make AL-CNN less effective. To alleviate these challenges, a hybrid dynamic membrane system (HM-AL-CNN) is designed, which is a new distributed and parallel computing model. We have performed a detailed evaluation of the proposed approach on three well-known benchmarks, including the Yahoo S5 datasets. Experiments show that the proposed method possesses robust and superior performance compared with state-of-the-art methods and improves the averages of the three used indicators significantly.


Introduction
Anomaly detection aims to find abnormal behavior in data and is widely studied in many fields, such as fault detection and predictive maintenance in industrial systems [1]. Anomaly detection is important because anomalies usually contain useful and critical information. To cope with the increasing amount of data collected by research institutions and industries through the Internet of Things (IoT), it is important to have automated procedures that separate anomalies from normal data.
However, anomaly detection is considered a hard problem [2]. The extremely unbalanced data distribution is the biggest difficulty: the proportion of the anomalous class is extremely low. A detection algorithm that works very well on a certain benchmark might perform surprisingly badly on another. Moreover, anomaly detection for time series is much more difficult due to the issues inherent in time series. For these reasons, this paper tries to find an effective and robust detection algorithm. Many scholars have studied methods of detecting abnormal patterns by extracting data features. Anomaly-detection methods mainly consist of three types: statistical modeling [3][4][5][6], such as k-means clustering and Random Forest methods; temporal-feature modeling [7][8][9][10], which is mainly based on the LSTM; and spatial-feature modeling [11][12][13], which takes advantage of CNNs. Traditionally, time-series anomaly detection has been tackled using distance-based methods such as the dynamic time warping (DTW) algorithm [14]; meanwhile, artificial neural networks have become powerful tools for time-series anomaly detection due to the large amount of available data.

Long Short-Term Memory and Convolutional Neural Networks
Long short-term memory (LSTM) is a mainstream kind of RNN [27] and is much more complex, being capable of learning long-term dependencies. LSTM relieves the vanishing-gradient problem by replacing the self-connected hidden units with memory blocks [28] and has been adopted widely for machine translation and time-series forecasting. The architecture of an LSTM is shown in Figure 1. The formulas of the LSTM are given below (a standard reconstruction), where the $W_*$, $R_*$ and $b_*$ are the input weights, the recurrent weights and the biases, respectively, $z_t$ and $h_t$ indicate the input and output of the LSTM unit, $s_t$ is the current cell state, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication:

$$i_t = \sigma(W_i z_t + R_i h_{t-1} + b_i), \quad f_t = \sigma(W_f z_t + R_f h_{t-1} + b_f), \quad o_t = \sigma(W_o z_t + R_o h_{t-1} + b_o)$$
$$\tilde{s}_t = \tanh(W_s z_t + R_s h_{t-1} + b_s), \quad s_t = f_t \odot s_{t-1} + i_t \odot \tilde{s}_t, \quad h_t = o_t \odot \tanh(s_t)$$

The convolutional neural network (CNN) is also a type of ANN and was developed for image-classification problems. CNNs can be applied to one-dimensional sequences of data as well, such as human-activity recognition; the model can learn an internal representation of the time-series data and achieve comparable performance. The CNN employs a convolution operation, which in the continuous case is defined as

$$s(t) = (x * w)(t) = \int x(\tau)\, w(t - \tau)\, d\tau$$

This formula can be regarded as a weighted average of $x(\tau)$ at time stamp $t$, where the weight is given by $w(-\tau)$ shifted by amount $t$. The one-dimensional discrete convolution is defined as

$$s(t) = (x * w)(t) = \sum_{\tau} x(\tau)\, w(t - \tau)$$
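To make the discrete 1D convolution concrete, the following NumPy sketch (an illustration, not the paper's code) evaluates $s(t) = \sum_{\tau} x(\tau)\, w(t - \tau)$ over the valid positions of a short sequence; the kernel values are arbitrary:

```python
import numpy as np

def conv1d(x, w):
    """Discrete 1D convolution s(t) = sum_tau x(tau) * w(t - tau)."""
    # np.convolve implements exactly this flip-and-slide definition.
    return np.convolve(x, w, mode="valid")

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])  # a simple smoothing kernel
s = conv1d(x, w)                 # -> [2.0, 3.0, 4.0]
```

Each output is a weighted average of three neighboring inputs, which is why a 1D CNN layer can act as a learnable local feature extractor over a time series.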

Attention Mechanism
The attention mechanism was proposed by Bahdanau [29] and is used in various deep-learning models. As the following equation shows, the context vector $c_i$ for the output is calculated as a weighted sum of the annotations $h_j$, which means that the context vector $c_i$ depends on a sequence of annotations $(h_1, \ldots, h_{T_x})$. Each annotation $h_j$ contains specific information and drops the irrelevant information about the whole input:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

where the weight $\alpha_{ij}$ is the attention score of each annotation. It can be calculated as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(v_{i-1}, h_j)$$

where $e_{ij}$ is the output score of a neural network $a(\cdot)$, $v_{i-1}$ is the hidden state, and $h_j$ indicates the $j$-th annotation; $e_{ij}$ attempts to capture the alignment between the input at position $j$ and the output at position $i$.
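A minimal NumPy sketch of the attention step (illustrative values, not the paper's implementation): the scores $e_{ij}$ are normalized with a softmax to give the weights $\alpha_{ij}$, and the context vector is their weighted sum over the annotations.

```python
import numpy as np

def softmax(e):
    e = e - e.max()              # subtract max for numerical stability
    p = np.exp(e)
    return p / p.sum()

def context_vector(scores, annotations):
    """c_i = sum_j alpha_ij * h_j with alpha_i = softmax(e_i)."""
    alpha = softmax(scores)      # attention weights alpha_ij
    return alpha @ annotations   # weighted sum of annotations

# toy example: 3 annotations of dimension 2 (illustrative values only)
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
e = np.zeros(3)                  # equal scores -> uniform weights of 1/3
c = context_vector(e, h)         # -> [2/3, 2/3]
```

With equal scores every annotation contributes equally; a trained scoring network $a(\cdot)$ would instead concentrate the weights on the most relevant time steps.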

Tissue-Like and Cell-Like Membrane Systems
In this section, we briefly introduce some concepts related to P systems, which are distributed parallel computational models. Membrane computing was inspired by the structure and functions of cells, tissues and organs. In recent years, researchers have turned to the application of membrane-computing models. Generally, there are three main families: cell-like P systems [30], tissue-like P systems [31,32] and neural-like P systems [33]. The structure of a tissue-like P system can be viewed as a net; a tissue-like membrane system of degree $q > 0$ is constructed in the following form [34]:

$$\Pi = (O, \omega_1, \ldots, \omega_q, R_1, \ldots, R_q, R, i_0)$$

where O represents a finite non-empty alphabet of objects; $\omega_i$ $(1 \le i \le q)$ are the initial multisets of objects present in cell $i$; $R_i$ $(1 \le i \le q)$ are finite sets of evolution rules in cell $i$; R is a finite set of communication rules; and $i_0 \in \{0, 1, \ldots, q\}$ indicates the output region where the computation results are placed. A cell-like P system has a hierarchical arrangement of membranes inside a skin membrane. Each membrane delimits a region where multisets of objects and rules are placed, and the evolution rules take the form $[\omega \rightarrow \omega']$ [34].

Squeeze-and-Excitation
Hu [35] proposed the Squeeze-and-Excitation Network (SENet) for CNNs to improve channel interdependencies. SENet is built on a transformation $F_{tr}: X \rightarrow U$, $X \in \mathbb{R}^{W' \times H' \times C'}$, $U \in \mathbb{R}^{W \times H \times C}$. We can then represent the output of $F_{tr}$ as $U = (u_1, u_2, \ldots, u_C)$, where $u_c$ is defined as follows, $*$ represents the convolution operation and $v_c^s$ is the $s$-th channel of the $c$-th kernel:

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s$$
Hu improves the channel interdependencies through the squeeze-and-excitation operation. The squeeze step uses global average pooling to get a global understanding of each channel. In our case, similar to image data, the channel-wise statistic $z \in \mathbb{R}^C$ is generated by shrinking U through the temporal dimension T; the $c$-th element of z is calculated by $F_{sq}(u_c)$, which is defined as follows:

$$z_c = F_{sq}(u_c) = \frac{1}{T} \sum_{t=1}^{T} u_c(t)$$

To use the aggregated information obtained from the squeeze stage, an excitation operation that uses two fully connected layers is employed to capture the channel-wise dependencies:

$$s = F_{ex}(z, W) = \sigma(W_2\, \delta(W_1 z))$$

where $F_{ex}$ is a neural network, $\sigma$ and $\delta$ indicate the sigmoid and ReLU functions, respectively, and $W_1$ and $W_2$ are the learnable parameters of $F_{ex}$. Finally, the output of the block is obtained by rescaling U:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c\, u_c, \qquad \tilde{X} = (\tilde{x}_1, \ldots, \tilde{x}_C)$$

where $F_{scale}(u_c, s_c)$ represents the channel-wise multiplication between the feature map $u_c$ and the scalar $s_c$.

Normally, time-series anomaly detection can be transformed into a binary-classification problem, but it is much more complex because the data are extremely imbalanced. LSTM and CNN have rarely been combined to realize time-series anomaly detection. Similar to the LSTM-FCN proposed by Fazle Karim [36], the model we propose (AL-CNN) combines both and extends the LSTM with the attention mechanism; furthermore, a one-dimensional convolution (1D CNN) is added before the attention LSTM to improve the efficiency of the model. In particular, we extend the Squeeze-and-Excitation block to the case of 1D sequence models to enhance the anomaly-detection accuracy. The model can handle both point anomalies and discords, for univariate as well as multivariate time series. The procedure of the proposed AL-CNN is shown in Figure 2.

Generally, both tissue-like and cell-like P systems are simplified models, and they are rarely applied to hard problems in the real world.
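The 1D squeeze-excite-scale pipeline above can be sketched in a few lines of NumPy (an illustration with untrained, hand-picked weights, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block_1d(U, W1, W2):
    """1D squeeze-and-excitation over a feature map U of shape (T, C).

    W1 (C/r, C) and W2 (C, C/r) are the two FC layers of the excitation
    step; here they are illustrative, untrained parameters.
    """
    z = U.mean(axis=0)                         # squeeze: global average pool over time
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))  # excitation: FC -> ReLU -> FC -> sigmoid
    return U * s                               # scale: channel-wise recalibration

# toy feature map: 4 time steps, 2 channels (illustrative values only)
U = np.ones((4, 2))
W1 = np.zeros((1, 2))   # zero weights make s = sigmoid(0) = 0.5 per channel
W2 = np.zeros((2, 1))
X_tilde = se_block_1d(U, W1, W2)
```

With learned weights, channels carrying discriminative temporal features receive scales near 1 while uninformative channels are suppressed toward 0, which is the feature recalibration the paper relies on.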
In this work, we intend to use the strengths of both tissue-like and cell-like membrane structures to develop a hybrid dynamic membrane structure, as shown in Figure 3. The graph-based and tree-based membrane structures are depicted via rounded rectangles and squares, respectively. A hybrid dynamic membrane system is constructed in the form

$$\Pi = (O, E, \mu, \omega_1, \ldots, \omega_q, R_1, \ldots, R_q, i_0)$$

where O represents a finite set of objects; $E \subseteq O$ is the set of objects in the environment; $\mu$ is a membrane structure that includes $\mu_T$ and $\mu_G$, where $\mu_T$ are tree-based membranes and $\mu_G$ are graph-based membranes; the symbols $\omega_1, \ldots, \omega_q$ are finite sets of strings over O of the q membranes; $i_0$ represents the output membrane of $\Pi$; and $R_1, \ldots, R_q$ are finite sets of rules of the two types described below. The G-rule, written in the standard communication form $(x, a_i / b_i, y)$, is used in the HM-AL-CNN to establish a synchronous communication channel between the computation cells, exchanging the multiset $a_i$ of cell x with the multiset $b_i$ of cell y, where x and y are membrane labels. The C-rule compares the output of each AL-CNN and picks the best one as the final result.
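The division of labor between the two rules can be sketched in plain Python (a schematic illustration, not the authors' implementation; the scoring function and configurations are hypothetical): several cells evaluate candidate AL-CNN configurations concurrently, and the C-rule selects the best pre-output.

```python
from concurrent.futures import ThreadPoolExecutor

def c_rule(pre_outputs):
    """C-rule: pick the best pre-output by its validation F-score."""
    return max(pre_outputs, key=lambda r: r["f_score"])

def train_and_eval(config):
    """Stand-in for training one AL-CNN cell with its own parameters.

    Returns a hypothetical score; the real system would train the model
    and evaluate it on validation data.
    """
    return {"config": config, "f_score": 1.0 / (1.0 + abs(config["lr"] - 1e-3))}

configs = [{"lr": 1e-2}, {"lr": 1e-3}, {"lr": 1e-4}]
with ThreadPoolExecutor() as pool:          # cells evaluated in parallel
    pre_outputs = list(pool.map(train_and_eval, configs))
best = c_rule(pre_outputs)                  # best["config"]["lr"] == 1e-3
```

This mirrors the paper's motivation: because the cells run in parallel, trying several parameter settings costs roughly the wall-clock time of training one model.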

Computation Mechanism
In this paper, the G-rule is introduced to implement several AL-CNNs, and the Pre-Output cells give the result of every AL-CNN. Objects in the P system evolve according to the steps of AL-CNN described in Section 3.2 during the computation phase. Then, the C-rule is applied to choose the best objects among the Pre-Outputs as the final result of the HM-AL-CNN.

Termination and Output
The above computing procedures are processed iteratively, and the maximum computation iteration is used as the halting condition. The membrane system halts when the maximum number of iterations is reached and all the objects in the output cell are considered to be the final results of the P system.

Experiments Settings
To evaluate the proposed method, HM-AL-CNN has been tested on the three benchmarks described in Section 4.4. The model was optimized using Adam with an initial learning rate of 1 × 10⁻⁶, the convolution kernels were initialized with the He initialization scheme [37], and ReLU was used as the activation function for the hidden layers. The number of training epochs was determined based on the length of the input: for the Yahoo Webscope S5, the model was trained for 500 epochs with batches of 128; for the Classic Anomaly Datasets and the Space Shuttle Valve Dataset, the model was trained for 700 epochs with batches of 256.
Time-series data need to be transformed into sequences of overlapping windows of size w before being fed to the model. For $x_t$ at time step t, its condition (normal or abnormal) is used as the label of the former w elements; w is the time-window size, which is also called the history window. Then, we can define the data in the form (N, Q, M), where N is the number of samples in the time series, Q indicates the maximum number of time steps and M represents the number of variables; we set M to 1 if the time series is univariate.
In addition, both the training and test datasets are normalized using Equation (20), where $x$ and $x'$ represent the value of the actual time-series data and the normalized value, respectively. Moreover, we define fixed-size anomaly windows, each centered around an anomaly; points in the anomaly window are labeled abnormal. For instance, if the anomaly-window size is set to 10, the 5 points before and the 5 points after the anomaly are labeled abnormal. Only the training sets are processed in this way; this up-sampling operation can relieve the extreme imbalance of the data and enhance the performance significantly, especially the recall rate.
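The windowing and anomaly-window up-sampling steps can be sketched as follows (a NumPy illustration of the described preprocessing, not the authors' code; the toy series and window sizes are arbitrary):

```python
import numpy as np

def make_windows(x, labels, w):
    """Each length-w history window is labeled with the condition of point t."""
    X = np.array([x[t - w:t] for t in range(w, len(x))])
    y = np.array([labels[t] for t in range(w, len(x))])
    return X, y

def widen_anomalies(labels, window=10):
    """Training-set up-sampling: mark window//2 points on each side of an anomaly."""
    out = labels.copy()
    half = window // 2
    for t in np.flatnonzero(labels):
        out[max(0, t - half):t + half + 1] = 1
    return out

x = np.arange(20, dtype=float)                 # toy univariate series
labels = np.zeros(20, dtype=int); labels[10] = 1
X, y = make_windows(x, widen_anomalies(labels, 4), w=5)
# X has shape (15, 5); 5 of the 15 windows are now labeled abnormal
```

Widening is applied only to training labels, so the evaluation still uses the original point annotations.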

Loss Function and Output
The Cross-Entropy Loss given in Equation (21) has been employed to measure the difference between the actual value $y_j$ and the predicted value $\hat{y}_j$:

$$L = -\sum_{j} y_j \log \hat{y}_j \qquad (21)$$
In our case, the SoftMax layer classifies the output into one of two classes, either normal or abnormal, as described in Equation (22):

$$P(C = c \mid d) = \frac{\exp(w_c^L d)}{\sum_{k=1}^{N_c} \exp(w_k^L d)} \qquad (22)$$

where C indicates the class, d is the output of the fully connected layer, w is the weight, L represents the last layer and $N_c$ is the total number of classes.
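A minimal NumPy sketch of the output layer's math (illustrative logits, not the trained model): softmax turns the two logits into class probabilities, and cross-entropy scores them against a one-hot target.

```python
import numpy as np

def softmax(d):
    e = np.exp(d - d.max())      # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat):
    """L = -sum_j y_j * log(y_hat_j) for a one-hot target y."""
    return -np.sum(y * np.log(y_hat))

d = np.array([2.0, 2.0])                     # equal logits -> 50/50 normal vs. abnormal
y_hat = softmax(d)                           # -> [0.5, 0.5]
loss = cross_entropy(np.array([1.0, 0.0]), y_hat)  # -> ln 2
```

Pushing the correct logit above the other drives its probability toward 1 and the loss toward 0, which is exactly the gradient signal the classifier trains on.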

Evaluation Metrics
The proposed approach is evaluated using Precision, Recall, F-score and AUC. If an abnormal case is classified as normal, this type of error is considered a false negative (FN). True positives (TP), true negatives (TN) and false positives (FP) are defined similarly; each algorithm was evaluated through its TP, TN, FP and FN rates. In addition, AUC is among the most commonly used metrics for evaluating anomaly-detection methods.

Datasets Description
In this section, we describe three well-known benchmarks from different domains, including real-world and synthetic datasets that have been applied in previous works on anomaly detection: the Yahoo Webscope S5, the Classic Anomaly-Detection Datasets and the Space Shuttle Valve Dataset.

Yahoo Webscope S5 Datasets
Yahoo Webscope S5 consists of four classes. Class A1 contains real Yahoo membership login data, and A2, A3 and A4 contain synthetic anomaly data (https://research.yahoo.com). Table 1 shows the characteristics of each sub-benchmark. This dataset contains 367 time series. Each time series consists of almost 1500 data points, including 0.02% abnormal values. Figure 4a,b show the statistical graphs for class A1. We can see from the two figures that the data distribution of each file is significantly different; it is not easy to carry out anomaly detection using statistical-analysis techniques. Figure 5a shows a real-world time series of the A1 class.

Classic Anomaly-Detection Datasets
Six commonly used natural datasets have been adopted in this section, which can be found at the UCI Repository [38] and OpenML; anomaly cases have already been marked as ground truth. The datasets are Pima, Covertype, Ionosphere, Mammography, Shuttle and Kddcup99. We have removed all non-continuous attributes, as done in [39,40]. Properties of each dataset are shown in Table 2.

Space Shuttle Valve Dataset
This dataset collects values that control the flow of fuel on the space shuttle. Some subsequences are normal and a few subsequences are abnormal. Figure 5b shows this time series; the time series is segmented into several subsequences by an orange dotted line, and some subsequences are considered abnormal or, in other words, discord subsequences.

Comparison to State-of-the-Art
Experiments on the Yahoo Webscope S5 are compared with several deep-learning approaches, including CNN, LSTM, CNN + LSTM and DeepAnt [41], and with two popular tools: Yahoo EGADS, which was released by Yahoo Labs to detect anomalies in large-scale time-series data, and the Twitter Anomaly-Detection method, which aims to detect anomalies in social-network data [38]. There are also many different previous works related to the classic anomaly benchmarks mentioned in Section 4.4; for the sake of brevity, we select popular anomaly-detection techniques for comparison, including Isolation Forest (iForest), OCSVM and LOF [42].

• Yahoo Webscope S5 Datasets
The experimental results of the proposed method compared with the other detection algorithms are shown in Table 3, which demonstrates that the proposed method improves the detection performance over the other algorithms, including both deep-learning and classic anomaly-detection algorithms. Figure 6a,b show the experimental results for an example time series in the A1 class; HM-AL-CNN detects five out of six anomalies and has only one false positive. Furthermore, HM-AL-CNN makes the detection before the true anomalies occur, which is vital in real application scenarios, especially in the industrial field. We compared our results with previous methods using the t-test, as shown in Table 4. The p-values for the F-score on A1 are all < 0.05, and the proposed approach achieves a statistically significant improvement over the other methods. Table 5 shows a comparison of the proposed method with other algorithms on the whole Yahoo Webscope S5. This table gives the average F-score of the comparison algorithms along with that of the proposed method on each sub-benchmark. HM-AL-CNN outperforms the other methods on three sub-benchmarks and works slightly worse on sub-benchmark A4. We compared our results with previous methods using a Wilcoxon signed-rank test, as shown in Table 6; the proposed approach achieves a statistically significant improvement over all other methods except DeepAnt. Even though HM-AL-CNN is not always the best, it achieves a better mean than DeepAnt and performs better on the whole dataset.

Classic Anomaly-Detection Datasets
To evaluate different anomaly-detection algorithms along with the proposed method on the Classic Anomaly-Detection Datasets, AUC has been used; AUC is commonly used for evaluating detection approaches on these datasets. We compare the results of three state-of-the-art anomaly-detection methods with HM-AL-CNN, and the results are shown in Table 7. For iForest, OCSVM and HM-AL-CNN, 40% of the actual data are used for training and the rest for testing. We have used the default parameters suggested in [39] for iForest; an RBF kernel is used for OCSVM and k = 10 is applied for LOF. Figure 7 shows, via a critical-difference comparison of the average arithmetic ranks, that HM-AL-CNN has an arithmetic rank of 1.66 and performs better than the existing methods.

Figure 7. Critical difference of the arithmetic means of the ranks on six datasets.

NASA Space Shuttle Valve Dataset
The above experiments have already shown that HM-AL-CNN is able to detect point anomalies in time-series data. In this section, HM-AL-CNN is shown to be suitable for time-series discord detection as well. Discords are subsequences that are different from the rest of a longer time series [43]. In this experiment, the proposed algorithm labels most of the points in an abnormal discord cycle as abnormal and labels the points in normal cycles as normal. If the number of abnormal points exceeds the threshold we set, we classify the subsequence as a discord. Figure 8a,b show the experimental results; there are four normal sequences and one discord in the test set. These experimental results demonstrate that the proposed algorithm works well.

Discussion
The proposed method achieves better results than the other methods in most cases. Due to the different distributions of the time series, the proposed method works slightly worse in some cases. From Table 3, we find that the proposed method performs better than the CNN- and LSTM-based algorithms, which indicates that squeeze-and-excitation and the attention mechanism can improve the detection performance in this case; in addition, the proposed method achieves a recall value of 0.79, which improves significantly on the previous methods. Table 5 shows that the proposed method works slightly worse on A4. Whether additional attributes such as change-points and noise caused the slightly worse performance still needs to be explored; generally, the proposed method works better on the whole Yahoo S5. Table 7 indicates that HM-AL-CNN has an arithmetic rank of 1.66 and can find anomalies in multivariate datasets as well.

Conclusions
In this paper, we propose a novel hybrid dynamic membrane system that takes advantage of both tissue-like and cell-like P systems for the time-series anomaly-detection task. To get more accurate detection results, a CNN and an LSTM with the attention mechanism are combined, and the 1D Squeeze-and-Excitation mechanism is introduced to better learn effective features. Two types of rules are introduced in the designed membrane system; profiting from the parallelism of P systems, the proposed HM-AL-CNN can process several AL-CNN models in parallel, which consumes less time. Experiments show that the proposed method possesses better performance than other time-series anomaly-detection algorithms on different benchmarks. However, there are still many important parameters that need to be chosen manually in our system, which remains to be addressed; evolutionary algorithms such as particle swarm optimization could be used in the future. Moreover, the design of more effective membrane systems to solve complex problems is also meaningful.