A Deep Learning Ensemble for Network Anomaly and Cyber-Attack Detection

Currently, expert systems and applied machine learning algorithms are widely used to automate network intrusion detection. In critical infrastructure applications of communication technologies, the interaction among various industrial control systems and the Internet environment intrinsic to IoT technology makes them susceptible to cyber-attacks. Given the enormous network traffic in critical Cyber-Physical Systems (CPSs), traditional machine learning methods for network anomaly detection are inefficient. Therefore, recently developed machine learning techniques, with an emphasis on deep learning, are being successfully applied to the detection and classification of anomalies at both the network and host levels. This paper presents an ensemble method that leverages deep models such as the Deep Neural Network (DNN) and Long Short-Term Memory (LSTM) together with a meta-classifier (i.e., logistic regression), following the principle of stacked generalization. The proposed approach detects network anomalies in a two-step process. In the first step, data pre-processing, a Deep Sparse AutoEncoder (DSAE) is employed for feature engineering. In the second step, a stacking ensemble learning approach is utilized for classification. The efficiency of the disclosed method is tested on heterogeneous datasets, including data gathered in IoT environments, namely IoT-23, LITNET-2020, and NetML-2020. The evaluation results are discussed, statistical significance is tested, and the method is compared to state-of-the-art approaches in network anomaly detection.


Introduction
Critical structures such as Internet Industrial Control Systems (ICS) and Sensitive Industrial Plants and Sites (SIPS) need to be functional and operate reliably even when subjected to unforeseen threats or external attacks. A number of systems are dedicated to mitigating the cascading impact resulting from the failure of or an attack on critical infrastructure. Those systems, however, can be the subject of a cyber-attack [1].
The attack surface of SIPS is wide and must be protected in both cyber and physical space. It comprises the SIPS data management, control, and communication layers. The main goal of a malicious user is to gain access to one of these layers to steal or tamper with sensitive data, which can be achieved physically, remotely, or through a combined attack vector. On the transmission side, SIPS are exposed to cyber-attacks whose likely targets are remotely controlled devices (e.g., switches, transformer taps, valves). Hijacking them could potentially damage several physical assets and cause widespread losses. Another likely target for a malicious user is the server.

Motivation and Objectives
Intrusion Detection Systems (IDSs), working as remote components or as part of a larger Intrusion Detection Network (IDN), are designed to facilitate attack detection capabilities to protect a network. This paper considers the problem of network anomalies, providing a machine learning-based anomaly detector and optimal solutions to make the detection tools resilient.
The objective is to improve the classification effectiveness of intrusion detection systems and networks operating in adversarial settings by developing an ensemble method, facilitated by stacked generalization techniques, to analyze outliers. Diverse anomaly detection methods may produce different results on the same data; thus, offering a suitable technique for a specific domain is rather challenging. The stacked generalization approach is utilized to enhance model performance [6]. The scope of this work covers the following objectives:
• selection of the latest network traffic datasets featuring labeled network flows and corresponding features to evaluate the proposed approach;
• development of a suitable model that integrates the key features in the feature engineering phase, facilitating architectures that resist, at least to some extent, certain deliberate attempts to evade detection or, more generally, to subvert the protection offered by the proposed method;
• design of the deep models integrated by the stacked generalization method, i.e., a group of selected models (the ensemble) followed by a meta-classifier that learns the best way of exploiting the synergy of the group.

Contributions and Organization
In this work, the problem of efficient anomaly-based network intrusion detection is discussed, involving a careful examination of recent benchmark datasets (IoT-23, LITNET-2020, and NetML-2020) with promising accuracy and minimal computational complexity. The work contained in this text aims at providing an efficient anomaly detection mechanism with the following characteristics:
• flow-level analysis utilizing novel methods of aggregation and class balancing, using the Synthetic Minority Over-sampling Technique (SMOTE) followed by the Edited Nearest Neighbors (ENN) approach, to enhance classification efficiency by extracting only significant features while coping with network data on a large scale;
• a dimensionality reduction approach, a Deep Sparse AutoEncoder (DSAE), that minimizes the size of the input feature vector to the classifier in order to reduce computational complexity;
• a stacked ensemble approach using different Level-0 models (DNN, LSTM) and a Level-1 model (logistic regression), whose performance is evaluated on three benchmark datasets;
• a thorough evaluation of the disclosed set of techniques through a series of experiments, illustrating the improvements in the detection rate on the chosen datasets in comparison to several state-of-the-art techniques.
The paper discusses these contributions in detail. Section 2 reviews related work. Section 3 introduces the anomaly-based intrusion detection platform, its data pre-processing approach, feature engineering techniques, and the proposed stacked ensemble machine learning methodology. Section 4 introduces the recently built IoT-23, LITNET-2020, and NetML-2020 flow-based datasets and their characteristics. Section 5 discusses the experimental results. Finally, Section 6 draws conclusions and elaborates on the scope of future research.

Related Work
The significance of the subject and the focus on network traffic analytics have prompted multiple recently published studies. In this section, various works on the use of machine learning to detect anomalies in computer networks are reviewed.

Deep Learning
When it comes to machine learning techniques, it is especially important to take note of deep learning and Deep Neural Networks (DNNs). With the emergence of highly capable GPUs, these have revolutionized artificial intelligence and many areas of engineering and computer science, and a similar impact can be expected in cybersecurity. Although techniques utilizing DNNs in IDSs are still in their early stages, there are several examples of their usage in cybersecurity. Jihyun et al. [7] utilized Long Short-Term Memory (LSTM) networks on the KDDCup '99 dataset; the applied LSTM-RNN achieved 96.93% accuracy and a recall of 98.88% [8]. Abolhasanzadeh [9] used a deep autoencoder as an attack detector in Big Data. The experiment conducted on the NSL-KDD test set obtained the bottleneck features for dimensionality reduction as part of intrusion detection, and the achieved accuracy was concluded to be promising for real-world intrusion detection [8]. In [10], Fiore considered the utilization of a Restricted Boltzmann Machine (RBM), fitting the network for anomaly detection by letting it learn on a real-world dataset collected from 24 h of workstation traffic [8]. Alom et al. [11] demonstrated the Deep Belief Network's (DBN) ability to perform intrusion detection through a series of experiments, training the DBN on NSL-KDD data to distinguish unforeseen attacks. Further development of hardware resources and the computing capabilities of GPU cards makes DNN methods an interesting alternative to existing traditional methods, and a promising way of overcoming the challenges faced by institutions responsible for CIP.

Ensemble Learning and Stacked Generalization
The literature on machine learning proposes multiple approaches to ensemble design. Ensemble learning is an approach that builds on top of a collection of models and very often performs better than any of those models could on their own. This learning paradigm has found applications in both supervised and unsupervised approaches [12]. An ensemble design involving Level-0 and Level-1 models demonstrates superior outcomes by consolidating different individual models. Govindarajan et al. [13] introduced a hybrid model comparing individual models on a classification task; the authors indicated that building an ensemble improves the accuracy of the model tremendously in a task involving both normal and abnormal traces from a mail application. Hansen et al. [14] found in their research that for an ensemble model to be more accurate than the individual models, (a) every single model has to perform better than random guessing, and (b) the individual models need to have outputs uncorrelated with one another (the ensemble has to be diverse).
Another means of increasing the performance of machine learning models is stacking. In this approach, also known as stacked generalization, the prediction outcomes of the models are combined using a meta-classifier [15]. Stacked Generalization (SG) was initially introduced by Wolpert [16]. Following the three-step formula, stacking is accomplished when: (a) the models are created using at least two different learning methods; (b) the outputs of the models contribute to the creation of a novel dataset in conjunction with the original hold-out set; (c) this novel dataset is used to create an all-encompassing model taking into account the preceding outputs [17]. Cerqueira et al. [18] investigated a method similar to stacking, where several models, e.g., LOF and hierarchical agglomerative clustering, provide the prediction outcomes, which are appended to the original dataset. In [19], Karthick et al. merged hidden Markov and naive Bayesian models in order to facilitate flexible intrusion detection; the research suggested that this stacked generalization provides satisfactory results.
The literature review suggests that flow-based data methods in network intrusion detection are gaining the attention of researchers. There is a constant demand for novel cybersecurity datasets, and the question of an ideal set of network flow features still remains open.

Proposed Methodology
The most significant objective of this research is to obtain a reliable classification of outliers by using a stacked ensemble method. In the following section, a general synopsis of the framework is offered. Then, the data pre-processing, feature engineering, and classifier modeling steps are discussed in detail.

General Overview
This section presents an overview of the proposed deep-stacked ensemble approach applied in anomaly detection on network traffic data. The detailed architectural diagram is depicted in Figure 1.
The individual phases of the proposed framework are (1) dataset selection; (2) data pre-processing, integrating feature engineering, data balancing, and dimensionality reduction using a deep sparse autoencoder; followed by (3) data output; (4) data splitting; and (5) classification into "normal/anomaly" using a stacked ensemble approach that unifies deep models and a meta-learner (i.e., logistic regression). Detailed descriptions of these steps are provided in Sections 3.2-3.5, respectively.

Data Pre-Processing
Fairly frequently, data collection methods produce attributes that are either duplicated or superfluous for network data obtained via the analysis of network traffic [20]. Removing insignificant and unnecessary information produces a more robust representation, which supplies better inputs to the classifier [21]. The pre-processing steps of our approach are provided below.

Feature Selection
NetFlow datasets often include heterogeneous feature attributes that are categorized as flow, basic, content, time, additionally generated, and label features, respectively. However, the information gathered from packet captures also contains a variety of irrelevant or redundant details, and removing unnecessary information supports better, unbiased detection. In addition to the above pre-processing, the selected datasets (i.e., IoT-23, LITNET-2020, NetML-2020) contain a certain number of unusable values. Feature imputation was performed to replace infinities with the maximum values; for missing records, the feature means were used.
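As an illustration, the imputation step described above can be sketched with pandas; the column names and values below are hypothetical, not actual IoT-23 attributes:

```python
import numpy as np
import pandas as pd

def impute_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Replace infinities with the per-column maximum finite value,
    then fill remaining missing records with the per-column mean."""
    df = df.copy()
    num_cols = df.select_dtypes(include=[np.number]).columns
    for col in num_cols:
        finite = df.loc[np.isfinite(df[col]), col]
        df[col] = df[col].replace([np.inf, -np.inf], finite.max())
        df[col] = df[col].fillna(df[col].mean())
    return df

# Illustrative flow features with unusable values
flows = pd.DataFrame({"duration": [1.0, np.inf, np.nan, 3.0],
                      "bytes": [10.0, 20.0, np.inf, np.nan]})
clean = impute_flows(flows)
```

After this step, every numeric attribute is finite, which is a precondition for the scaling and autoencoder stages that follow.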

Feature Normalization
The continuous values in the network traffic datasets span a variety of magnitudes, which causes problems for a variety of classifiers. Thus, scaling is performed as a means of normalizing the attributes, squeezing the values into the range of 0 to 1. The features are scaled according to Equation (1):

x̄_m,n = x_m,n / max_n(x_m,n),   (1)

where x̄_m,n represents the scaled feature, max_n(x_m,n) is the maximum value of the data in the m-th feature, n = 1, ..., i with i the number of samples in both the training and testing data, and m = 1, ..., f with f the number of features obtained from the feature selection process.
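As a sketch, the scaling step can be performed with scikit-learn's MinMaxScaler, which implements the common (x − min)/(max − min) variant of [0, 1] scaling; the toy feature matrix below is illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: rows are flow samples, columns are selected features
X_train = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])
X_test = np.array([[3.0, 200.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # each column squeezed into [0, 1]
X_test_scaled = scaler.transform(X_test)        # reuse training min/max statistics
```

Fitting the scaler on the training data only, and reusing its statistics on the test data, avoids information leakage between the splits.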

Data Balancing
In situations where the learning dataset has an imbalanced class distribution, machine learning algorithms may experience problems. In imbalanced data learning, under-sampling the majority class is a conventional course of action, and oversampling techniques followed by undersampling can be used to balance the datasets. We used two established approaches, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbors (ENN), to balance the IoT-23, LITNET-2020, and NetML-2020 datasets [22,23]. Recently, hybrid approaches have become popular [24]; methods such as SMOTE+ENN, among others, have often been utilized to alleviate class imbalance and boost classifier efficiency.
The capability of the selected method, in which the training datasets are cleaned and reduced, is illustrated in Sections 3.2.1 and 3.2.2. The count of samples in each class of the pre-processed dataset is subjected to the balancing procedure. Following [24], the SMOTE+ENN approach for class balancing is presented in Algorithm 1. After applying the SMOTE+ENN method to the pre-processed data (see Section 3.2), the artificially balanced dataset is utilized as the training set for the neural network learning process. It is worth mentioning that the term tr represents (X_tr, y_tr).
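In practice, the imbalanced-learn library provides a ready-made SMOTEENN combination; the following self-contained numpy/scikit-learn sketch illustrates the underlying idea (synthetic interpolation between minority neighbors, followed by ENN cleaning), under the assumption of binary 0/1 labels:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

rng = np.random.default_rng(0)

def smote_enn(X, y, minority=1, k=3):
    """Oversample the minority class with SMOTE-style interpolation,
    then clean the result with Edited Nearest Neighbors (ENN)."""
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))  # oversample up to parity
    # k nearest minority neighbors of each minority sample (skip the sample itself)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = nbrs.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), n_new)
    neigh = X_min[idx[base, rng.integers(0, k, n_new)]]
    gap = rng.random((n_new, 1))
    X_syn = X_min[base] + gap * (neigh - X_min[base])  # points on connecting segments
    X_bal = np.vstack([X, X_syn])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    # ENN: drop samples whose 3-NN prediction disagrees with their label
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_bal, y_bal)
    keep = knn.predict(X_bal) == y_bal
    return X_bal[keep], y_bal[keep]

# Toy imbalanced data: 50 majority points vs. 5 minority points
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 0.5, (5, 2))])
y = np.array([0] * 50 + [1] * 5)
X_bal, y_bal = smote_enn(X, y)
```

The ENN pass removes borderline or noisy samples, so the final class counts are approximately, rather than exactly, balanced.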

Feature Dimensionality Reduction
The DSAE learning algorithm is employed for dimensionality reduction. The autoencoder can learn a latent representation of the features in a reduced space in an unsupervised setting. The DSAE builds on the concept of an autoencoder by adding a sparsity penalty term, which constrains feature learning so as to achieve a concise representation of the input vector [25,26]. Furthermore, an autoencoder using non-linear activation functions and multiple layers can capture non-linear relationships, unlike Principal Component Analysis (PCA). It is also more efficient, in terms of model parameters, to train several layers of an autoencoder than to train one huge transformation with PCA.
Following [25], the input vector x defines sets of NetFlow data x = {x_1, x_2, ..., x_n}, which are later reconstructed into an N × M dataset; therefore, x(i) ∈ R^M. These NetFlow data are used as an input matrix X, initially condensed to a lower dimension expressed by a set of one or more hidden layers h = {h_1, h_2, ..., h_n}. The hidden representation h is then decoded into the output x̂. Let k be the index of a node in layer l, and j the index of a node in the preceding layer l − 1. The output of the neurons in a computational layer h can then be expressed by Equation (2):

h_k^(l) = σ(Σ_j W_kj h_j^(l−1) + b_k).   (2)
The size of the weight matrix is expressed by W ∈ R m×n , and the bias is b ∈ R m . Instead of ReLU or tanh, sigmoid is utilized for neural activations.
The reconstruction error E_r between the input vector x and the reconstructed input x̂, using the mean squared error function, is shown in Equation (3):

E_r = (1/n) Σ_{i=1..n} ||x(i) − x̂(i)||².   (3)
Here, n represents the count of input examples. The DSAE is chosen to obtain a reconstructed depiction of the input vector; hence, sparsity is included in the DSAE cost function to cap the average activation value of the neural nodes in the computational layers. Let ρ be the target sparsity parameter; a penalty term is employed to prevent the average activation ρ_k from deviating from ρ. The mean activation value of the k-th neuron in the hidden layer is expressed by Equation (4):

ρ_k = (1/n) Σ_{i=1..n} h_k(x(i)),   (4)

where n represents the count of input data points and i indexes the samples. Since the average activation ρ_k should approach ρ (a constant close to 0), the KL divergence is adopted to facilitate sparsity (Equation (5)):

KL(ρ ∥ ρ_k) = ρ log(ρ/ρ_k) + (1 − ρ) log((1 − ρ)/(1 − ρ_k)).   (5)
Next, the cost function of the neural network is set as c(W, b, x, x̂). Adding the sparse penalty term yields Equation (6):

c_sparse(W, b) = c(W, b, x, x̂) + β Σ_k KL(ρ ∥ ρ_k).   (6)
The rate β controls the strength of the sparsity penalty. L2 Regularization (L2R) is employed in the cost function as a means of avoiding overfitting, as shown in Equation (7):

c_L2R(W, b) = c_sparse(W, b) + α ||w||²,   (7)

where α is a coefficient and ||w||² represents the penalty term; an increase in the value of α translates to deeper weight attenuation. The work contained herein considers a three-hidden-layer sparse autoencoder (a so-called deep sparse autoencoder), which utilizes the sigmoid as the neural activation function. The input layer has N nodes depending on the selected dataset (IoT-23, LITNET-2020, or NetML-2020); N varies with the total number of features in each dataset after the data pre-processing phase. The first computational layer reduces the dimensions to 19 (IoT-23), 30 (LITNET-2020), and 20 (NetML-2020) nodes with a favorable error approximation. The following layer decreases the number of features to 15 (IoT-23), 20 (LITNET-2020), and 16 (NetML-2020). Finally, the third layer reduces the features to 10 (IoT-23), 15 (LITNET-2020), and 12 (NetML-2020), respectively. Once the learning stage of the autoencoder is concluded, the network serves as the input vector provider to the classifier in the final stage. The following set of parameters was determined by preliminary trial-and-error experiments: the weight decay λ = 0.0003, the sparsity parameter ρ = 0.5, and the sparsity penalty term β = 6, respectively. The method is delineated in Algorithm 2, and a visualization of the applied dimensionality reduction on each dataset is presented in Figure 2.

Algorithm 2 Deep Sparse AutoEncoder (DSAE).
Input: input vector x
Output: reconstructed input x̂
Initialization:
1: Initialize weights W and thresholds b
2: Obtain the reconstruction error E_r according to Equation (3)
3: Add the sparsity regularizer to the cost function according to Equation (5)
4: Add L2 regularization to the cost function according to Equation (7)
End
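The cost terms of Equations (3)-(7) can be sketched numerically as follows; the layer sizes and weights are toy values, and a single hidden layer stands in for the three-layer DSAE:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsae_cost(X, W, b, W_out, b_out, rho=0.5, beta=6.0, alpha=3e-4):
    """Sketch of the DSAE cost: reconstruction MSE + KL sparsity + L2 weight decay.
    rho and beta follow the values reported in the text; alpha is illustrative."""
    H = sigmoid(X @ W + b)                              # hidden activations, Eq. (2)
    X_hat = sigmoid(H @ W_out + b_out)                  # reconstruction
    e_r = np.mean(np.sum((X - X_hat) ** 2, axis=1))     # reconstruction error, Eq. (3)
    rho_hat = H.mean(axis=0)                            # mean activation per unit, Eq. (4)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (5)
    l2 = alpha * (np.sum(W ** 2) + np.sum(W_out ** 2))  # L2 penalty, Eq. (7)
    return e_r + beta * kl + l2                         # Eq. (6) plus the L2 term

X = sigmoid(rng.normal(size=(8, 5)))                    # toy flows already scaled to (0, 1)
W = rng.normal(scale=0.1, size=(5, 3)); b = np.zeros(3)
W_out = rng.normal(scale=0.1, size=(3, 5)); b_out = np.zeros(5)
cost = dsae_cost(X, W, b, W_out, b_out)
```

In training, this scalar cost would be minimized with respect to W, b, W_out, and b_out by gradient descent.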

Classifier Modeling
The proposed framework utilizes a stacked generalization approach over flow-based data. The results obtained by the Level-0 classifiers are forwarded to the meta-classifier. Thus, the overall approach incorporates two different deep learning algorithms, a Deep Neural Network (DNN) and Long Short-Term Memory (LSTM), and a meta-classifier (logistic regression), illustrated in Sections 3.5.1 and 3.5.2. Stacked generalization is employed to synergize the deep learning methods, cooperatively using several models to potentially eliminate the generalization biases with respect to a particular training set. Figure 3 illustrates the concept, which consists of the base and meta-classifiers. The implementation of stacked generalization involves two kinds of models: (a) base models (Level-0 classifiers) and (b) meta-models (the Level-1 or meta-classifier). The essence of stacked generalization lies in using the Level-1 classifier to perform the task by learning from the Level-0 models. In general, machine learning can be viewed as the minimization problem shown in Equation (8):

min_f Σ_i L(y_i, f(x_i)) + λ Ω(f).   (8)

The first term of Equation (8) is the loss function, which expresses the distance between the predicted and real values. The second term is a regularization term that measures the complexity of the function f. Furthermore, one of the key methods in the proposed approach is to obtain the training set (d_cv) for the Level-1 model (m_mtcf) using cross-validation. Given a pre-processed dataset d randomly split into K equal folds {d_1, d_2, ..., d_K} (in this work, K = 5), let d_k^tr and d_k^te be the training and test sets for the k-th fold of the K-fold cross-validation. Given different deep learning models (M_1, M_2, ..., M_L), where L = 2 in this research, each model M_l is trained on d_k^tr and predicts each instance of d_k^te. Let p_k(x) denote the prediction of the model M_l on x in the k-th fold. Then, for each instance x_n ∈ d_k^te, we have Equation (9):

z_ln = p_k(x_n).   (9)

At the end of the entire cross-validation process, the data d_cv are assembled from the outcomes of the L models (see Equation (10)):

d_cv = {(y_n, z_1n, ..., z_Ln)}, n = 1, 2, ..., N,   (10)

where d_cv is the training set for the Level-1 model (i.e., the meta-classifier) m_mtcf. Hence, to finish the training, the Level-0 models m_l are trained using the pre-processed dataset d, and m_mtcf is trained on d_cv following 5-fold cross-validation. In the testing phase (i.e., on the hold-out test set), given a new instance, the Level-1 model m_mtcf produces the final outcome for that instance. It is important to mention that 5-fold cross-validation was performed, and the obtained average accuracy, standard deviation, and standard error of the mean are presented in Tables 1-3, respectively. The offered framework (Figure 3) works as an end-to-end prototype in which the Level-1 classifier (i.e., the meta-classifier) is trained on the d_cv obtained from the Level-0 classifiers. Following [27], the structure of the DNN utilized in the proposed stacking framework for an anomaly detection system is described below.
We initialize a four layer neural network. Back-propagation is used to train the network with stochastic gradient descent as the optimizer. The input data are propagated through the hidden layers and then transformed to the final output. A loss function is employed to penalize the network by back-propagating the error to adjust the weights. The network parameters (i.e., weights) are updated for each mini-batch at each iteration (i.e., epoch).
The proposed model has four hidden layers with 52, 34, 26, and 16 nodes, respectively. The last part of the model is a fully connected neural network with 2 output nodes. Furthermore, the following DNN parameters are defined: the activation functions (ReLU, sigmoid), optimizer (Adam), batch size (512), epochs (500), and loss function (binary cross-entropy). The model also uses an early stopping algorithm to obtain the best validation accuracy.
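A sketch of this Level-0 DNN in Keras, assuming the hyperparameters listed above, might look as follows; a single sigmoid output unit is used here in place of the two output nodes, and the input dimension of 10 corresponds to the DSAE output for IoT-23:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_dnn(input_dim):
    """Level-0 DNN: four hidden ReLU layers (52/34/26/16) and a sigmoid output."""
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(52, activation="relu"),
        layers.Dense(34, activation="relu"),
        layers.Dense(26, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary normal/anomaly output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping keeps the weights with the best validation accuracy
early_stop = callbacks.EarlyStopping(monitor="val_accuracy", patience=20,
                                     restore_best_weights=True)
model = build_dnn(input_dim=10)  # 10 = DSAE output size for IoT-23
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=500, batch_size=512, callbacks=[early_stop])
```

The patience value is an assumption; the text does not report the early stopping criterion in detail.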

Modeling Long Short-Term Memory
To obtain the time-related non-linear dynamics in the provided data, an LSTM network is utilized. LSTM is an improvement over the RNN architecture, extending the ability to capture long-term dependencies [28].
A four-layer LSTM is built on top of the DSAE to capture the encoded temporal patterns. Finally, a fully connected dense layer is stacked on top to process the outputs of the LSTM. The utilized optimizer was Adam; the activation functions (tanh, sigmoid), batch size (512), epochs (500), and loss function (binary cross-entropy) were also defined as network parameters. As with the DNN, this model uses the early stopping algorithm to obtain the best validation accuracy.
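Putting the pieces together, the cross-validated stacking scheme of this section can be sketched with scikit-learn; here a random forest and a small MLP serve as hypothetical stand-ins for the DNN and LSTM Level-0 models, with logistic regression as the Level-1 meta-classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

# Toy binary "normal/anomaly" data standing in for the pre-processed flows
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Level-0 models (stand-ins for the paper's DNN and LSTM)
level0 = [RandomForestClassifier(n_estimators=50, random_state=0),
          MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)]

# Build d_cv: 5-fold out-of-fold predictions of each Level-0 model (Eqs. (9)-(10))
Z_tr = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in level0])

# Level-1 meta-classifier trained on d_cv
meta = LogisticRegression().fit(Z_tr, y_tr)

# At test time, Level-0 models are refit on the full training set
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
                        for m in level0])
accuracy = meta.score(Z_te, y_te)
```

Using out-of-fold predictions to train the meta-classifier is the key design choice: it prevents the Level-1 model from learning the overfitted in-sample behavior of the Level-0 models.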

Selection of Datasets
Labeled network traffic datasets harness the power of supervised methods by providing the information necessary to train an IDS efficiently, achieving high accuracy and dependability in distinguishing a wide range of network attacks.
Recently, several new flow-based benchmark datasets, such as IoT-23, LITNET-2020, and NetML-2020, have been introduced to the public domain. Due to their recency, they have not yet been widely used by the cybersecurity community. We use these datasets to evaluate anomaly-based network intrusion detection on realistic and up-to-date network traffic data.

IoT-23
IoT-23 (https://www.stratosphereips.org/datasets-iot23) is a network traffic dataset that incorporates 20 malware subsets and three benign subsets. The dataset was initially made available in January 2020 by the Stratosphere Laboratory in Czechia. The objective of the dataset is to present a sizable dataset of labeled malware and benign traffic coming from real captures to develop intrusion detection tools featuring machine learning algorithms.
The number of benign flows is 30,858,735. The dataset has 21 feature attributes, including the class label; therefore, a total of 21 attributes determining the features of connections are present in each data instance. The attributes are mixed in nature: some are nominal, some numeric, and some take time-stamp values. A general overview of the dataset is presented in Table 4, followed by a description of the attack class labels. Due to the large size of the IoT-23 dataset, only seven scenarios (Malware-1-1, Malware-3-1, Honeypot-4-1, Honeypot-5-1, Honeypot-7-1, Malware-34-1, Malware-43-1) were evaluated in this work. The feature attributes of the IoT-23 dataset, with their descriptions, are presented in Table 5.

LITNET-2020
The LITNET-2020 (https://dataset.litnet.lt) NetFlow dataset [29] consists of senders and collectors. The senders are Cisco routers and FortiGate (FG-1500D) firewalls, which were utilized to evaluate NetFlow data passing through to the collectors. The collector incorporates software responsible for receiving, storing, and filtering data. The specific counts of data samples in the classes of the dataset (with a total of 45,492,310 flows) are presented in Table 6. All instances are categorized into ordinary data (45,330,333 flows) and malicious data (5,328,934 flows). The malicious instances are further categorized into nine classes according to the type of network attack. The data preprocessor initially selects 49 features specified by the NetFlow v9 protocol [30] to arrange the dataset. A further 15 feature attributes are supplemented by the data extender, and an additional 19 attributes are offered to recognize attack types. Therefore, the final dataset has a set of 84 feature attributes. The feature attributes of the LITNET-2020 dataset, with their descriptions, are summarized in Table 7. The captures were accumulated in a real-world network.

NetML-2020
The NetML-2020 (https://evalai.cloudcv.org/web/challenges/challenge-page/526/overview) dataset was created for anomaly detection tasks by obtaining 30 traffic captures from Stratosphere IPS (https://www.stratosphereips.org/). The flow features are extracted in JavaScript Object Notation (JSON) format by offering a raw pcap file as input to the feature extraction tool, and each flow sample is listed in the output file. Finally, a unique ID number is assigned to identify every flow, together with the label information derived from the raw traffic packet capture file. The NetML dataset provides 484,056 flows and 48 feature attributes (since we consider "top-level" granularity, only 26 meta-features were selected after the data pre-processing stage). Detailed descriptions of the feature attributes and the selected capture files are listed in Tables 8 and 9.
Every capture offers flows belonging to a different category. Flows derived from the capture-win1.pcap file are considered malicious in the highest granularity labels and Tinba in the bottom-level labels. Three annotations, "top-level", "mid-level", and "fine-grained", are chosen to express the class granularity. At the highest granularity, the dataset is binary (benign or malware). The "mid-level" distinguishes specific software communication, e.g., Facebook, Skype, Hangouts, etc. Lastly, the "fine-grained" level classifies multiple distinct types of attacks, e.g., Ramit, HTBot, etc.

Experiments and Results
To assess the classification performance, metrics such as the accuracy score, precision, recall, F1 score, and Matthews Correlation Coefficient (MCC) were calculated. Additionally, the geometric mean (g-mean), omnibus tests, and the Friedman rank were used as indicators to evaluate the algorithms' results. It is important to mention that this work considers binary classification, normal vs. malware (zero, one), for both the individual algorithms and the proposed stacked ensemble framework.
Evaluation metrics and the experimental settings are discussed in Sections 5.1 and 5.2, while Sections 5.3-5.5 present the results of the experiments (including statistical tests).

Evaluation Metrics
A standard suite of metrics was used in this work for the evaluation of the approach: Accuracy (ACC), Precision (Pr), Recall (Re), F1-score, False Positive Rate (FPR), Matthews Correlation Coefficient (MCC), and g-mean. The formulation of g-mean, as used in this work, is included in the next subsection.

Geometric Mean
This work uses the g-mean definition presented in [24]. The geometric mean (g-mean) is applied to assess the results given by the classifiers for datasets containing imbalanced classes (g-mean ∈ [0, 1]); it is expressed by Equation (11):

g-mean = (Π_{m=1..M} Acc_m)^(1/M),   (11)

where Acc_m signifies the classification accuracy in a given category and M is the count of defined categories (in this case, M = 2). For further analysis, a statistical test was performed: the Friedman test [31], a non-parametric method. The initial phase of this method ranks the algorithms for every selected dataset (i.e., best = 1, second best = 2, etc.).
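A minimal sketch of the g-mean of Equation (11) and of the Friedman test, with hypothetical per-dataset accuracies for three classifiers:

```python
import numpy as np
from scipy.stats import friedmanchisquare

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class accuracies (Eq. (11), here M = 2)."""
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(accs) ** (1.0 / len(classes)))

# Toy predictions: class-0 accuracy 0.75, class-1 accuracy 0.5
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
score = g_mean(y_true, y_pred)

# Friedman test over hypothetical accuracies of three classifiers on three datasets
acc_dnn = [0.91, 0.95, 0.90]
acc_lstm = [0.93, 0.96, 0.92]
acc_stack = [0.97, 0.99, 0.96]
stat, p_value = friedmanchisquare(acc_dnn, acc_lstm, acc_stack)
```

The Friedman test ranks the classifiers within each dataset before aggregating, so it compares relative orderings rather than raw accuracy values.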

Experimental Settings
The experiments were performed on a Linux setup with 16 GB of RAM and a 10th-generation Intel Core i7 processor. All experiments were implemented in the Python programming language using the TensorFlow/Keras/Scikit-learn stack.

Experimental Evaluation on the IoT-23 Dataset
The following section evaluates the results attained by the proposed stacked ensemble classifier against individual classifiers, namely the DNN, LSTM, and random forest [27]. The IoT-23 dataset was used to assess the efficiency of the classifiers. It is one of the most recent network traffic datasets (20 malware scenarios, three benign), and its description is detailed in Section 4.1. Each classifier (i.e., DNN, LSTM, and stacked) underwent the learning procedure on the training set. The results are reported in terms of accuracy, average accuracy, standard deviation, and Standard Error of the Mean (SEM) using five-fold cross-validation, as shown in Table 1. These classifiers were trained and tested on the same data distribution. The performance comparisons are also reported using statistical tests and evaluation metrics such as the overall accuracy, g-mean score, and Friedman average rank. Table 10 presents the overall accuracy and Friedman average rank, followed by the g-mean scores secured by the state-of-the-art classifiers vs. the stacked ensemble classifier. Note that a lower ranking value represents better classification results. Furthermore, the evaluation results for precision, recall, FPR, and MCC are presented in Table 11.

Experimental Evaluation on the LITNET-2020 Dataset
In this section, we performed our experiments on the most recent publicly available benchmark dataset, LITNET-2020 [29], published by Kaunas University of Technology in May 2020. To examine the performance of our framework using the selected features, we first trained the individual classifiers and the stacked ensemble framework using five-fold cross-validation and present the results in terms of accuracy, average accuracy, standard deviation, and SEM in Table 2.
Similar to Section 5.3, the detection performance of the tested methods (DNN, LSTM, and random forest) is also compared against the stacked ensemble algorithm. The results are reported in terms of statistical tests and evaluation metrics such as the overall accuracy, F1-score, g-mean, and Friedman average rank in Table 12. The evaluation of the performance results concerning the precision, recall, false positive rate, and MCC is also presented in Table 13.

Experimental Evaluation on the NetML-2020 Dataset
In this section, we demonstrate the effectiveness and performance of the proposed stacked ensemble framework on the challenging NetML-2020 dataset [32]. This dataset was released during the open challenge Network Traffic Analytics using Machine Learning (NETAML) workshop sponsored by the Intel Corporation.
We selected two baseline classification models, the DNN and LSTM classifiers, since the DNN is frequently utilized across multiple domains and the LSTM represents a strong deep learning approach that can achieve substantial gains in accuracy. The selected baseline methods underwent the training procedure using the same data that were used by our proposed stacked ensemble framework. Five-fold cross-validation was performed, and the results are reported in Table 3.
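The five-fold cross-validation statistics reported in Tables 1–3 (accuracy per fold, average accuracy, standard deviation, and SEM) can be computed as below; logistic regression on synthetic data is a stand-in for the actual classifiers and datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the benchmark datasets.
X, y = make_classification(n_samples=500, n_features=15, random_state=1)
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_acc = scores.mean()
std_acc = scores.std(ddof=1)            # sample standard deviation over folds
sem = std_acc / np.sqrt(len(scores))    # Standard Error of the Mean

print("accuracy per fold:", np.round(scores, 3))
print(f"mean={mean_acc:.3f} std={std_acc:.3f} SEM={sem:.3f}")
```

Reporting SEM alongside the standard deviation indicates how precisely the average accuracy estimates the expected performance, which supports the subsequent significance testing.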
Thereafter, the proposed stacked approach is compared with the baseline algorithms and the methods reported in [32], as shown in Table 14; it showed an increase in performance over both the baseline algorithms and those found in the subject literature. Similar to the previous section, the evaluation of the model performance with respect to precision, recall, FPR, and MCC is presented in Table 15. The reported gains demonstrate a significant improvement in performance of the proposed stacked ensemble method as compared to the baseline. The method shows promise in terms of classification accuracy and other evaluation metrics. It was also shown experimentally that the deep sparse autoencoder improves the accuracy of the proposed algorithms as compared to the case where it was not employed [27,32]. The results also show that enhanced effectiveness can be obtained not only by improving the topology of the neural network or performing hyperparameter tuning, but also by enhancing the preprocessing stage.
Figure 4 depicts the training duration of the proposed framework compared to state-of-the-art methods. The training duration is obtained by measuring the computation time required to build the classification model. During the investigation, the classifier used in the final stage had an appreciably lower training time than more complex classifiers working with the inputs obtained as the output of Level-0 training (d_cv). Note that the baseline classifiers were also trained with the same set of outputs (including the algorithms previously used at Level-0: DNN and LSTM). Lastly, to further demonstrate the effectiveness of the proposed method, we analyze the cumulative number of incorrect detections.

Conclusions and Future Work
This work presented an ensemble approach incorporating deep learning algorithms under the concept of stacked generalization for an effective anomaly-based network intrusion detection system. Various feature engineering methods were applied together with dimensionality reduction to achieve the highest efficiency. A combination of DNN and LSTM followed by a meta-classifier resulted in significant performance in detecting anomalies on the most recent network traffic datasets. Three heterogeneous datasets, IoT-23, LITNET-2020, and NetML-2020, were used to assess the effectiveness of the proposed stacked ensemble framework. Following statistical significance tests, we conclude that the proposed approach outperforms the state-of-the-art individual classifiers and meta-classifiers such as random forest and support vector machine.
From the series of conducted experiments, it is inferred that the proposed approach provides a significant improvement in terms of the evaluation metrics when validated against pre-specified testing sets. In short, the proposed framework generalizes across recent, heterogeneous network traffic datasets and provides acceptable accuracy in detecting anomalous behavior in the target network.
For future work, the implementation strategy can be extended to conduct experiments on more sophisticated datasets, should these become available. Distributed computing frameworks such as Apache Spark will be utilized to boost processing speed and facilitate scalability for massive volumes of network data. Additionally, the approach is to be validated on a multi-class problem. We are also currently working on the second part of the model (i.e., transfer learning). The study will apply a lifelong learning algorithm along with a deep learning one to make the method [33] more robust to both known and unknown attacks. Finally, first steps have already been taken to secure the deep learning component itself against the threat of adversarial attacks [34], and we plan to continue research in that direction.

Data Availability
The datasets used in this work are publicly available for research purposes.