Deep Autoencoder-Based Integrated Model for Anomaly Detection and Efficient Feature Extraction in IoT Networks

Abstract: The intrusion detection system (IDS) is a promising technology for ensuring security against cyber-attacks in internet-of-things (IoT) networks. In conventional IDS, anomaly detection and feature extraction are performed by two different models. In this paper, we propose a new integrated model based on a deep autoencoder (AE) for anomaly detection and feature extraction. Firstly, the AE is trained on normal network traffic and used later to detect anomalies. Then, the trained AE model is employed again to extract useful low-dimensional features for anomalous data without the need for a separate feature extraction training stage, which is required by other methods such as principal component analysis (PCA) and linear discriminant analysis (LDA). After that, the extracted features are used by a machine learning (ML) or deep learning (DL) classifier to determine the type of attack (multi-classification). The performance of the proposed unified approach was evaluated on real IoT datasets called N-BaIoT and MQTTset, which contain normal and malicious network traffic. The proposed AE was compared with other popular anomaly detection techniques such as the one-class support vector machine (OC-SVM) and isolation forest (iForest), in terms of performance metrics (accuracy, precision, recall, and F1-score) and execution time. The AE was found to identify attacks better than OC-SVM and iForest, with a fast detection time. The proposed feature extraction method aims to reduce computational complexity while maintaining the performance metrics of the multi-classifier models as much as possible compared to their counterparts. We tested the model with different ML/DL classifiers such as decision tree, random forest, deep neural network (DNN), convolutional neural network (CNN), and a hybrid CNN with long short-term memory (LSTM). The experiment results showed the capability of the proposed model to simultaneously detect anomalous events and reduce the dimensionality of the data.


Introduction
The proliferation of the Internet of things (IoT), together with the widespread vulnerabilities discovered in IoT devices, has attracted the interest of malicious agents seeking to subvert those devices. Today, IoT devices can be used to build large-scale botnets with significant computational power and network bandwidth. A botnet is a collection of networked consumer devices that malware has converted into bots or zombies in order to launch distributed denial of service (DDoS) attacks [1]. DDoS is a popular threat that is typically launched by botnets and attempts to degrade IoT services by exhausting resources such as CPU, memory, and bandwidth of the devices. This type of attack may also target servers or network equipment [2]. Botnets have command and control (C&C) servers, which are used to communicate with running bots and broadcast commands to them. Bots can carry out a variety of tasks, such as scanning other devices for vulnerabilities, infecting weak systems, sending spam emails, or carrying out various attacks. Bashlite (also called Gafgyt) and its successor Mirai are examples of botnets that have recently caused large-scale DDoS attacks by exploiting IoT devices. Due to the open-source code of these botnets, attackers have managed to create many novel variations [3,4].
Cyber security is the practice of securing internet devices and networks from threats and breaches by detecting and monitoring risks and fixing vulnerabilities. Networks usually implement an intrusion detection system (IDS) as a defense wall against different types of cyber-attacks. Malicious activities in IoT networks must be detected and stopped immediately; hence, the function of the IDS has become critical in securing IoT networks by detecting anomalies. Anomaly detection is an emerging type of IDS that detects known and unknown attacks without human intervention, unlike traditional signature-based methods [5,6]. Anomaly detection based on machine learning (ML) and deep learning (DL) techniques has proven necessary and effective in detecting cyber-attacks in real time [7]. ML and DL methods can be used to determine whether a traffic flow is benign or malicious through binary classification, and also to classify the malicious traffic into groups of different attack types as a multi-classification task [8]. ML classifiers such as decision tree (DT), random forest (RF), and naïve Bayes (NB) can provide acceptable performance when the training and testing data share a similar data distribution; however, when tested on new data with a different distribution, they fail to provide good results [9]. Hence, deep neural networks (DNN), convolutional neural networks (CNN), and long short-term memory (LSTM) have recently been used in different applications, since they have the capability to learn the complex and nonlinear structure of the input data [5,9].
Datasets with many features often contain a lot of redundant or duplicated information. The purpose of dimensionality reduction is to eliminate this redundancy and consequently improve the training and detection time. There are two techniques for dimensionality reduction: feature selection and feature extraction [10]. In feature selection, the most important features are selected based on the employed algorithm, such as mutual information (MI), Chi-square, and analysis of variance (ANOVA). On the other hand, feature extraction produces new low-dimensional features from the original features. Principal component analysis (PCA) and linear discriminant analysis (LDA) are well-known linear dimensionality reduction techniques for feature extraction. PCA is an unsupervised method, while LDA is a supervised method. Early dimensionality reduction systems generally used linear approaches, since they are simpler to compute than nonlinear methods. However, many practical problems involve nonlinear and time-varying systems, and there are currently relatively few studies on nonlinear feature reduction techniques.
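The contrast between the two linear techniques can be illustrated with a short scikit-learn sketch on synthetic data (the 20 features and 3 hypothetical attack classes below are illustrative, not from the paper's datasets):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # 500 samples, 20 features (synthetic)
y = rng.integers(0, 3, size=500)    # 3 hypothetical attack classes

# PCA: unsupervised, ignores the labels entirely
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses the labels; allows at most (n_classes - 1) components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Note that LDA's component count is capped by the number of classes, which is one reason PCA is often preferred when only a binary normal/attack label is available.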
The autoencoder (AE) is a nonlinear unsupervised learning model that is trained to minimize the error between its input and output, where it is usually trained using only a part of the data, such as the normal set. An AE consists of an encoder that transforms the input data into a compressed form, known as the bottleneck features or latent space, and a decoder that transforms the compressed data back into the original data. The AE is used to detect anomalies based on the reconstruction error (RE), calculated from the difference between the original and reconstructed samples; it usually produces a higher RE for anomalous samples than for normal samples, which makes it possible to distinguish between them [6]. In addition, it can be used as a feature extraction technique, where the compressed data is used for other purposes such as classification [11]. Beyond its dimensionality reduction ability, the AE has been shown to detect repetitive structure (i.e., redundant information), for example in the MNIST dataset, which is considered a useful property for many applications [12].
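The RE-based detection idea can be sketched with a tiny autoencoder. As a hedged stand-in for the paper's Keras model, the snippet below abuses scikit-learn's `MLPRegressor` (fitting the input as its own target) on synthetic data; the dimensions and distributions are illustrative only:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X_normal = rng.normal(0.0, 1.0, size=(1000, 10))   # benign traffic (synthetic)
X_anom = rng.normal(5.0, 1.0, size=(50, 10))       # anomalous traffic (synthetic)

# A tiny AE: 10 inputs -> 3-unit bottleneck -> 10 outputs,
# trained to reproduce its own input on normal data only
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=1)
ae.fit(X_normal, X_normal)

def reconstruction_error(model, X):
    # Per-sample mean squared difference between input and reconstruction
    return np.mean((X - model.predict(X)) ** 2, axis=1)

re_normal = reconstruction_error(ae, X_normal)
re_anom = reconstruction_error(ae, X_anom)
# Anomalies lie far from the training distribution, so their RE is much higher
print(re_normal.mean() < re_anom.mean())
```

Because the model never saw anomalous samples during training, it cannot reconstruct them well, which is exactly the property the detection threshold exploits.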
In this work, we focus on implementing an AE for anomaly detection and also on exploiting the trained AE model to reduce the dataset features in order to further classify groups of malicious attacks. The process of projecting high-dimensional data into a low-dimensional space will inevitably result in the loss of some of the original information. Therefore, the main challenge is to obtain a useful reduced representation of the high-dimensional dataset that meets the recognition accuracy and storage requirements while optimally maintaining the essential characteristics of the original data. The proposed integrated AE model aims to provide a new lightweight IDS that is computationally efficient and effective for the IoT environment. The main contributions of this article are summarized as follows:

•	We propose a new lightweight IDS for IoT networks that fully utilizes a deep AE model for both anomaly detection and feature reduction for multi-classification of the detected cyber-attacks, unlike existing methods that used the AE for either anomaly detection or feature reduction;
•	The proposed system is extensively evaluated on real datasets, namely N-BaIoT and MQTTset, which contain normal and malicious network traffic. Classification performance for five IoT devices is evaluated based on accuracy, precision, recall, F1-score, and execution time;
•	The effectiveness of the proposed IDS is compared with state-of-the-art methods.
This paper is organized as follows: Section 2 provides the literature related to AEs, feature reduction methods, popular datasets, and the ML and DL techniques that are used. Section 3 introduces the AE and the proposed integrated system. Section 4 presents our experimental results and discussion, starting with the implemented datasets and experiment setup, followed by our findings for anomaly detection, feature extraction analysis, and multi-classification. Finally, Section 5 concludes the paper.

Related Works
In [13], Meidan et al. proposed an autoencoder (AE) for anomaly detection based on a real dataset called N-BaIoT, developed for detecting botnet attacks on nine IoT devices. The IoT devices were infected by two common botnets, namely BASHLITE (Gafgyt) and Mirai. The performance of the AE model was promising compared to other methods such as the one-class support vector machine (OC-SVM), isolation forest (iForest), and local outlier factor (LOF) in terms of detection time, true positive rate (TPR), and false positive rate (FPR). The dataset consisted of a large number of samples and 115 features extracted from the network traffic based on different statistical measures such as mean, variance, covariance, and correlation coefficient. In [14], a hybrid deep learning model based on a convolutional neural network (CNN) and long short-term memory (LSTM) networks was proposed to detect botnet attacks using the N-BaIoT dataset. The model was used to classify benign traffic and 10 malicious attacks on 9 IoT devices. In [15], a fusion model combining two deep neural networks was proposed to perform binary normal/attack classification and then multi-attack classification. The system was tested on ZYELL's dataset and compared to a baseline model that performed multi-classification of benign and malicious attacks together, as in [14]. Aygun et al. [16] proposed using an AE and a denoising AE (DAE) to detect zero-day attacks with high accuracy. A stochastic approach was also developed to determine the threshold value of the AEs, and it was tested on the NSL-KDD dataset. The results showed that the AE and DAE performed very close to each other in terms of accuracy, precision, F-score, and recall, and that they outperformed other anomaly detection techniques such as the Fuzzy classifier, random tree, and naïve Bayes (NB) tree in terms of accuracy. Zavrak et al. [17] employed AE and variational AE (VAE) techniques to detect unknown attacks using the CICIDS2017 network traffic dataset. Receiver operating characteristic (ROC) curves and the area under the curve (AUC) were evaluated and compared with the one-class support vector machine (OC-SVM) method. The experimental results showed that the VAE outperformed the AE and OC-SVM in most cases. Min et al. [18] proposed a memory-augmented deep AE (MemAE) for a network intrusion system, where the AE was trained to reconstruct the input of anomalous data so that it was close to normal samples, in order to solve the over-generalization problem of AEs. Experiments on the NSL-KDD, UNSW-NB15, and CICIDS 2017 datasets confirmed that the proposed method outperformed OC-SVM models. However, in all of these studies, no feature reduction method was used, even though the datasets had a huge number of features, which most probably increases the computational time.
In [19], feature extraction based on an AE was implemented to transform features to lower dimensions in order to accelerate the training process and improve accuracy. AEs with different activation and loss functions were used to study how accuracy can be improved with the support vector machine (SVM) classifier. The authors found that the ReLU activation function and cross-entropy loss function gave the best accuracy compared to other functions on the KDD-CUP'99 and NSL-KDD datasets. The authors also compared the AE with principal component analysis (PCA) and linear discriminant analysis (LDA), and found that SVM with AE produced much better accuracy, precision, and F1-score than SVM with PCA or LDA; however, feature extraction by the AE took longer. Nevertheless, the authors did not compare the SVM classifier with other ML classifiers. In [7], an anomaly-based IDS for IoT networks was proposed based on 1D, 2D, and 3D CNN models to create a multi-class classification system. A feature selection technique called recursive feature elimination (RFE) was used to select features from different datasets. The proposed model was validated using datasets such as MQTT-IoT-IDS2020, BoT-IoT, IoT-23, and IoT network intrusion. In [11], feature extraction based on an AE was proposed for the CICIDS2017 dataset, where the latent features were extracted to perform multi-malicious classification using the random forest (RF) method. The proposed system was evaluated with different encoder/decoder layer structures in terms of accuracy, precision, recall, F1-score, as well as training and testing time; however, no comparisons with other feature extraction and multi-classifier methods were performed. Similarly, in [20], the latent features of an AE were extracted to perform binary (legitimate/anomaly) classification based on OC-SVM. In [21], an AE-based anomaly detection method was proposed, since the AE has the ability to capture the nonlinear correlation between features. A convolutional AE (CAE) was selected for feature extraction and compared with AE and PCA reduction methods. The CAE was preferred since it has a smaller number of parameters and requires less training time compared to the AE. The system was then evaluated using the NSL-KDD dataset, and the results showed that the CAE outperformed the AE and PCA in terms of detection accuracy and false positive rate (FPR). In [22], a hybrid approach based on an LSTM autoencoder and OC-SVM was proposed to detect anomalies on the intrusion detection system for SDN (InSDN) dataset. The low-dimensional features of the AE were trained with OC-SVM for anomaly classification. The experiment results showed that the proposed model provided a higher detection rate and reduced the processing time significantly. In [23], Rao et al. proposed a two-staged hybrid IDS based on a sparse AE (SAE) and a deep neural network (DNN). In the first stage, the SAE was used to extract features that effectively represent the dataset. Then, the DNN was employed for multi-class classification. The experimental results showed that the proposed model outperformed conventional models in terms of overall detection rate and false positive rate when tested on the KDDCup99, NSL-KDD, and UNSW-NB15 datasets. Conventional IDS requires a long time to detect attacks and is unable to detect zero-day attacks; therefore, an IDS based on a PCA-DNN model was proposed in [24] to detect abnormal network behavior such as DDoS, DoS, brute force, heartbleed, botnet, and insider infiltration attacks using the CSE-CICIDS2018 dataset. The model was used for binary and multi-class classification. Experiment results showed that the accuracy achieved without PCA was 98%, but processing took a long time, whereas the same performance could be achieved with PCA when the number of components was 12.
It is important to note that these studies used the AE either for anomaly detection only, as in [13,16–18], or for feature extraction only, as in [19–23]; hence, they did not fully exploit the advantages of the AE. In our proposed unified model, by contrast, we exploit the AE for both anomaly detection and feature extraction to provide a cost-effective system that does not require an external feature reduction method. The aim of the analysis in this paper is to show that the proposed model provides robust and fast anomaly detection, and to investigate the feature extraction ability of the AE in comparison with state-of-the-art dimensionality reduction techniques such as PCA and LDA.

Methodology
This section provides an overview of the autoencoder (AE) and presents our proposed model.

Autoencoder (AE)
Autoencoder (AE) is a type of neural network model that is used to learn a compact representation of the input data, which can be used for the reconstruction of the original input [25].As illustrated in Figure 1, AE consists of an encoder network, a bottleneck layer called a hidden latent vector, and a decoder network.A stacked AE is a deep learning AE in which the encoder and the decoder have multiple hidden layers.
An AE with a single hidden layer in the encoder and decoder can be represented by the following equations [18]:

z = σ(W_encoder x + b_encoder)    (1)

x̂ = σ(W_decoder z + b_decoder)    (2)

where x represents the input vector (normal/benign data), z represents the bottleneck low-dimensional feature space, σ is an activation function, and W and b are the weight and bias parameters. Equation (1) is used by the encoder network to map the input features into low-dimensional features, while the decoder network uses Equation (2) to recover the original data x̂ from the projected low-dimensional features z.

The main objective of training the AE model is to minimize the reconstruction error (RE) function representing the difference between the input and output of the AE model, which is given as [20]:

RE = ‖x − x̂‖²    (3)

where ‖·‖ represents the Euclidean norm. Once the AE model is obtained from the training dataset, it can be applied to the validation dataset to determine a threshold value. Then, the mean square error (MSE) between the true validation data x_val and the predicted validation data x̂_val is calculated for each sample as:

MSE_i = ‖x_val,i − x̂_val,i‖²,  i = 1, …, N    (4)

where N is the number of validation samples. The threshold value that discriminates between the normal and attack data is calculated from the sample mean and standard deviation of these errors as [13]:

Tr = mean({MSE_i}) + std({MSE_i})    (5)

The RE function is used by the AE model in network anomaly detection tasks to determine whether a network traffic sample is anomalous or not. Since the AE model has been trained on normal network traffic only, it should produce a low RE when it receives normal data and a high RE when it encounters anomalous data.
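The per-sample MSE and the mean-plus-standard-deviation threshold can be computed in a few lines of NumPy. This sketch assumes the AE's validation-set predictions are already available; here they are faked by adding small noise to the true data:

```python
import numpy as np

def mse_per_sample(x, x_hat):
    # Per-sample reconstruction error, as in Equations (3) and (4)
    return np.mean((x - x_hat) ** 2, axis=1)

rng = np.random.default_rng(2)
x_val = rng.normal(size=(200, 5))                          # true validation data
x_val_hat = x_val + rng.normal(scale=0.1, size=(200, 5))   # pretend AE output

errors = mse_per_sample(x_val, x_val_hat)
# Threshold from the sample mean and standard deviation of the validation
# errors, matching Equation (5)
threshold = errors.mean() + errors.std()

# At test time, any sample whose RE exceeds the threshold is flagged
is_anomaly = mse_per_sample(x_val, x_val_hat) > threshold
```

Adding one standard deviation on top of the mean makes the threshold robust to the natural spread of reconstruction errors on clean data.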

Proposed System
Figure 2 shows the structure of the proposed integrated autoencoder (AE) for anomaly detection and feature extraction for multi-classification.The system consists of three stages, namely training, testing, and detection.The upper part can be used to detect an anomaly, which can be considered as a binary classification, whereas the lower part can be used to further classify the type of cyber-attack if an anomaly is detected.

The dataset contains normal data and cyber-attack samples, where the AE uses r1% of the normal data for training purposes; the trained AE model is then saved in storage for later testing and reduction of the attack features. The remaining r2% of the normal data is combined with the cyber-attack data and fed to the saved AE model to detect anomalies. If no anomaly is detected, the system terminates. If an anomaly is detected, the encoder part of the trained/saved AE model is extracted and used to reduce the features of the cyber-attack data. After that, the extracted features with lower dimensions are divided accordingly, with r1% used for training a multi-classifier (MC) such as DT, RF, NB, or DNN, while the remaining r2% is used to test the performance of the trained MC model and finally output the type of attack.
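The three-stage flow (train AE on normal traffic, detect anomalies by RE, then classify the detected attacks on encoder-compressed features) can be sketched end to end. This is a minimal stand-in on synthetic data: scikit-learn's `MLPRegressor` replaces the Keras AE, r1 = 80% is illustrative, and the three attack classes are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
normal = rng.normal(size=(600, 12))                 # benign traffic (synthetic)
attacks = rng.normal(3.0, 1.0, size=(300, 12))      # attack traffic (synthetic)
attack_labels = rng.integers(0, 3, size=300)        # 3 hypothetical attack types

# Stage 1: train the AE on r1% of the normal traffic only
train_normal, test_normal = train_test_split(normal, train_size=0.8, random_state=3)
ae = MLPRegressor(hidden_layer_sizes=(8, 3, 8), max_iter=2000, random_state=3)
ae.fit(train_normal, train_normal)

# Stage 2: detect anomalies via the RE threshold (Equation (5))
def per_sample_mse(X):
    return np.mean((X - ae.predict(X)) ** 2, axis=1)

threshold = per_sample_mse(train_normal).mean() + per_sample_mse(train_normal).std()
detected = attacks[per_sample_mse(attacks) > threshold]

# Stage 3: reuse the trained encoder half of the AE to compress the attack
# samples, then train a multi-classifier (here RF) on the compressed features
def encode(X):
    h = np.maximum(X @ ae.coefs_[0] + ae.intercepts_[0], 0)     # ReLU, 8 units
    return np.maximum(h @ ae.coefs_[1] + ae.intercepts_[1], 0)  # bottleneck, 3 units

Z = encode(attacks)
mc = RandomForestClassifier(random_state=3).fit(Z, attack_labels)
```

The key point mirrored here is that stage 3 adds no new dimensionality-reduction training: `encode` simply reuses the weights already learned in stage 1.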
It is important to note that the proposed system is cost-effective (lightweight) since the trained AE model is used to detect anomalies and is also used to extract features instead of employing new dimensionality reduction techniques such as PCA or LDA.In addition, transforming the samples to a new subspace requires only passing each sample through the trained encoder which contains the weights.On the other hand, PCA, for instance, involves using the whole training dataset to compute the mean, covariance matrix, and principal components, then transforming each sample into a new low-dimensional subspace.


Datasets
We used the N-BaIoT dataset, which contains real data collected from IoT devices connected to the network and infected by botnets such as Gafgyt (BASHLITE) and Mirai, as shown in Figure 3. The experiment setup included scanner and loader components, which were used for scanning and finding vulnerable IoT devices and loading malware onto them. Once an IoT device became infected, it immediately began scanning the network for new victims while awaiting commands from the C&C server. A DHCP server was used to assign dynamic IP addresses to the IoT devices on the network, and Wireshark was used as a packet sniffing tool. When a packet arrived, a behavioral snapshot of the hosts and protocols that communicated with this packet was recorded using port mirroring on the network switch. The N-BaIoT dataset contains both normal traffic and malicious traffic (five Gafgyt attacks and five Mirai attacks) for nine IoT devices. It consists of 115 features collected from the network traffic based on different statistical measures such as mean, variance, number, radius, magnitude, covariance, and correlation coefficient, which are aggregated by source IP, source MAC-IP, and channel. In our analysis, we used the N-BaIoT datasets for 4 IoT devices as described in Table 1, where the first three devices contained big data compared to the last one, which only had a Gafgyt attack. The Gafgyt botnet consists of five attacks, namely Scan, Junk, UDP flooding, TCP flooding, and COMBO (sending spam data and opening a connection to a specified IP address and port). The Mirai botnet also comprises five attacks, namely Scan, Ack flooding, Syn flooding, UDP flooding, and UDPplain. In the Scan attack, the botnet scans all IP addresses on the network for unsecured devices in order to determine their login information. By continuously delivering acknowledgment (Ack) packets and initial connection request (SYN) packets, the Ack flooding and Syn flooding attacks try to make the target inaccessible to legitimate traffic [1]. UDPplain and UDP flooding both send an enormous number of UDP packets to the destination, where the former sets a UDP port as the destination and the latter sends the packets to random ports [1,2]. The dataset for the fifth IoT device (IoT5) was taken from the MQTTset dataset, which was introduced by Vaccari et al. [26] and then improved in [27] by introducing a large number of features (i.e., 76). It is composed of 10 sensors of different natures (i.e., motion, temperature, humidity sensors, etc.) located in two separate rooms. The N-BaIoT dataset focuses on Wi-Fi communication, while the MQTTset dataset includes communications based on the message queue telemetry transport (MQTT) protocol. MQTTset contains both normal network traffic and 5 malicious attacks with 76 features, as described in Table 1.

Experiment Setup
The AE was built using the Keras and TensorFlow libraries, which are freely available in Python, while the ML techniques and classification performance metrics were taken from the Sklearn library. The encoder of the proposed AE consists of 3 hidden layers with 64, 32, and 16 neurons; between each hidden layer, BatchNormalization and LeakyReLU layers were used [13]. The decoder is the symmetrical form of the encoder, with 16, 32, and 64 neurons in its three hidden layers. The input and output layers have 115 neurons, whereas the bottleneck layer has 2, 3, or 8 neurons. The Adam/SGD optimizer and MSE loss function were used to compile the model, and 20 epochs with a batch size of 60 were used to train it.
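A Keras sketch of this architecture might look as follows. The layer sizes, BatchNormalization/LeakyReLU placement, optimizer, and loss follow the description above; the helper name `build_ae` and the separate encoder model are our own conveniences, not the paper's code:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ae(n_features=115, bottleneck=8):
    inp = keras.Input(shape=(n_features,))
    x = inp
    # Encoder: 64 -> 32 -> 16, each Dense followed by BatchNormalization + LeakyReLU
    for units in (64, 32, 16):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
    z = layers.Dense(bottleneck, name="bottleneck")(x)   # 2, 3, or 8 neurons
    # Decoder: symmetric 16 -> 32 -> 64
    x = z
    for units in (16, 32, 64):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
    out = layers.Dense(n_features)(x)
    ae = keras.Model(inp, out)
    encoder = keras.Model(inp, z)    # reused later for feature extraction
    ae.compile(optimizer="adam", loss="mse")
    return ae, encoder

ae, encoder = build_ae(bottleneck=8)
# ae.fit(X_train_normal, X_train_normal, epochs=20, batch_size=60)
```

Keeping a separate `encoder` handle on the same layers is what makes the later feature extraction stage free: it shares the weights learned during anomaly detection training.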
The number of normal and attack samples in the IoT1 dataset is illustrated in Figure 4a, while Figure 4b displays the distribution of the cyber-attack data. The data in Figure 4a was used for anomaly detection, and the cyber-attack data in Figure 4b was used for multi-classification. Figure 4a shows that the number of attack (anomalous) samples is greater than the number of normal (benign) samples.


Anomaly Detection
Table 2 shows the performance of the different anomaly detection techniques, namely OC-SVM, iForest, and AE, in classifying data on the four IoT devices. OC-SVM showed comparable performance with the AE in terms of accuracy, precision, recall, and F1-score, while iForest showed a slight degradation in performance. The AE with eight extracted features (AE-8) performed better than AE-2 and AE-3. In terms of training time, the AE consumed much more time than OC-SVM and iForest; however, the detection time of the AE was much better than that of the other two classifiers. For example, the anomaly detection time for the IoT1 and IoT2 testing datasets based on OC-SVM and iForest took more than 1 and 2 min, respectively, while the AE's detection time was less than 0.5 min. For IoT5, we can see that the models completely separated all anomaly samples, which numbered 100,044; however, they had difficulties in distinguishing the normal data, especially iForest. OC-SVM obtained comparable results with the AE but consumed high training and testing time. The AE based on two, three, and eight extracted features outperformed OC-SVM and iForest in terms of performance metrics and detection time. Figure 5a shows the reconstruction error (RE) (for only the smaller RE values), where the threshold value of 0.0038 was calculated based on Equation (5). The samples of normal network traffic had a low RE, with the highest being ~0.2, while the anomalous samples had an RE greater than the selected threshold, extending to more than 50 k. The threshold value was able to correctly separate the anomalous data; however, 61 samples of the normal data above the threshold line were wrongly classified as anomalies, while 12,326 samples below the threshold line were correctly classified as normal traffic. In the case of normal traffic, the trained AE was able to reconstruct the input samples at the output layer with a low RE; however, in the case of anomalies, the trained AE model failed to reconstruct the samples, since it had not seen them before, consequently giving them a high RE. Figure 5b shows the ROC for the three anomaly detection techniques, where we can see that AE-8 and OC-SVM performed better than iForest, since their ROC was very close to one, which indicates that the models can successfully separate the positive and negative rates.
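Because the RE itself serves as the anomaly score, ROC/AUC and the thresholded precision/recall/F1 metrics reported above can be computed directly with scikit-learn. The error distributions below are synthetic stand-ins shaped to resemble the separation described (low RE for normal traffic, high RE for anomalies, threshold 0.0038):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

rng = np.random.default_rng(4)
# Synthetic reconstruction errors: low for normal traffic, high for anomalies
re_normal = rng.normal(0.002, 0.0005, size=500).clip(min=0)
re_anom = rng.normal(0.5, 0.1, size=500).clip(min=0)

scores = np.concatenate([re_normal, re_anom])
labels = np.concatenate([np.zeros(500), np.ones(500)])   # 1 = anomaly

auc = roc_auc_score(labels, scores)                      # RE is the ranking score
pred = (scores > 0.0038).astype(int)                     # threshold from Eq. (5)
prec, rec, f1, _ = precision_recall_fscore_support(labels, pred, average="binary")
```

Note that the AUC is threshold-free (it ranks the RE scores), while precision, recall, and F1 depend on the chosen threshold value.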
In Figure 6, we plotted the boxplot of the RE for each attack to obtain deeper insight into the statistics of these attacks. The boxplot shows the median (line in the middle of the box), minimum and maximum (whiskers extending from the box, from minimum (lower) to maximum (upper)), and outlier (circle) values. The lowest RE (less than 0.22) was attained by the normal network traffic, which indicates that the AE had been well trained, followed by the G.UDP and G.TCP attacks; both attacks had a similar data distribution and were depicted as outliers by the boxplot, as shown in the magnified subset of Figure 6. It is expected that ML classifiers will confuse these two attacks, since it is difficult to discriminate between them. The other attacks had no outliers, as represented by their clean whisker boxes.

Feature Extraction
Table 3 shows the time taken by the PCA and LDA techniques to reduce the dimensionality of the training and testing datasets. Generally, PCA took a shorter time than LDA to reduce the data to a low dimension for both the training and testing datasets. It is important to note that the proposed AE does not require a training stage on the dataset to reduce the features; instead, each sample passes through the trained/saved encoder, which contains the weights previously obtained from training on the normal traffic data during the anomaly detection stage. PCA, by contrast, requires the whole training dataset to compute the mean, covariance, and eigenvalues needed to obtain the principal components that yield the low-dimensional features. For this reason, our proposed unified AE model represents a lightweight solution compared to existing solutions that employ PCA or LDA.
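The difference can be made concrete with a toy sketch: PCA must be fitted on the whole training set (mean, covariance, eigenvectors) before it can reduce anything, whereas the saved encoder reduces a new sample with nothing but a stored forward pass. The single linear layer below is a hypothetical stand-in for the saved deep encoder, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(size=(500, 115))   # training set (needed by PCA only)
x_new = rng.normal(size=(10, 115))      # fresh samples to reduce to 8 features

# PCA path: a full fitting pass over the training data is required first.
mu = x_train.mean(axis=0)
cov = np.cov(x_train - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalue order
components = eigvecs[:, ::-1][:, :8]            # top-8 principal directions
z_pca = (x_new - mu) @ components               # shape (10, 8)

# Encoder path: weights were already saved during the anomaly detection
# stage, so reduction is just a matrix product per layer.
w_saved = 0.1 * rng.normal(size=(115, 8))       # stand-in for stored weights
z_ae = np.maximum(x_new @ w_saved, 0.0)         # shape (10, 8)
```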

We applied PCA to the training data to select the best number of components. The first 2, 6, and 10 components for the IoT1 device captured 81.6%, 97.34%, and 99.5% of the variability in the original data, as shown in Figure 7a. For LDA, the first component captured about 97% of the variability while the second component captured 99%, noting that the number of LDA components is bounded by the number of classes minus one. The results for the PCA and LDA components for IoT2, shown in Figure 7b, were almost similar to those obtained for IoT1. For IoT5, LDA scored 75.5%, 95.5%, and 100% with 1, 2, and 4 components, respectively, while PCA scored 54%, 72%, 93%, and 95% with 1, 2, 8, and 10 components, respectively.
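The "variability captured" percentages quoted above are cumulative explained variance ratios. A minimal sketch of that computation, on synthetic data rather than the N-BaIoT features:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with ~4 dominant directions plus weak noise features.
base = rng.normal(size=(1000, 4))
x = np.hstack([base @ rng.normal(size=(4, 20)),
               0.01 * rng.normal(size=(1000, 5))])

cov = np.cov(x - x.mean(axis=0), rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]            # descending eigenvalues
cum_var = np.cumsum(eigvals) / eigvals.sum()       # cumulative variability
# Smallest number of components capturing at least 95% of the variance.
n_components = int(np.searchsorted(cum_var, 0.95) + 1)
```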

Multi-Classification
In this section, we analyzed the proposed system using different ML and DL classifiers such as DT, XTree, RF, and DNN on five IoT datasets. We compared the performance of AE with other feature reduction techniques such as PCA and LDA. The dataset was divided into 75% for training and 25% for testing; for the deep learning models, the training set was further split into 75% for training and 25% for validation. The results obtained by the CNN-LSTM model proposed in [14], where the dataset was divided into 70% for training and 30% for testing, are presented in Tables 4-7 to confirm that their results are close to what we obtained. We implemented the hybrid CNN-LSTM model, and our results are presented in Tables 8 and 9. In addition, two DNN architectures were implemented, where DNN1 has two hidden layers (64, 32) and DNN2 has three hidden layers (64, 64, 64).
Tables 4 and 5 present the analysis of different ML and DL multi-classifiers without and with feature extraction based on PCA, LDA, and AE. The comparison was conducted in terms of performance metrics such as accuracy, precision, recall, F1-score, and computational time for the training and testing datasets. The tables first show the results obtained from the classifiers when no feature extraction technique (i.e., the "without" case) was applied to the IoT1 dataset. DT, XTree, RF, and DNN2 showed comparable results, where the best performance metrics were achieved by RF, while the worst performance was attained by the NB classifier. XTree provided a good trade-off between accuracy and computational time, while RF showed good robustness at the cost of increased computational time compared to DT and XTree.
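The two-stage split described above (75/25 train/test, then a further 75/25 train/validation split for the DL models) can be sketched with plain index shuffling; the sample count below is illustrative:

```python
import numpy as np

def split_indices(n, frac, seed=0):
    # Shuffle 0..n-1 and cut at the given fraction.
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * frac)
    return idx[:cut], idx[cut:]

n_samples = 100_000
train, test = split_indices(n_samples, 0.75)        # 75% / 25%
tr, val = split_indices(len(train), 0.75, seed=1)   # 75% / 25% of the training set
```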
Deep learning models utilized the most time for training and detection compared to the previous classifiers. We then applied PCA with 2, 8, and 10 components; we found that the best achievable performance was close to that of the "without" case when using either 8 or 10 components. For the LDA method, the best performance was achieved using four components. We also tested the proposed AE with different numbers of extracted features, namely two, three, and eight, to show the capability of the extracted features for the classification task. We found that the performance improved as the number of extracted features increased, where eight extracted features yielded results close to those achieved by PCA-8 and LDA-4. In terms of computation time, we can observe a clear reduction in the training and detection time for DT, XTree, and NB, and in certain cases for DNN1 and DNN2. The proposed AE showed better computational time consumption than PCA for DNN1 and DNN2. RF showed little reduction in time; indeed, its training and testing time increased for certain numbers of components compared to the case of no reduction. The correlation, or the degree of the relationship, between the features of the cyber-attack samples can be analyzed to determine whether a feature reduction method is needed. The correlation matrices of all 115 features and of the features extracted by the proposed AE are illustrated in Figure 8. A perfect positive correlation is represented by 1 (white color), −1 (black color) indicates a perfect negative correlation, no correlation is represented by 0, and a weak correlation is represented by values near 0 (red color). Figure 8a shows that there was a strong correlation between certain features, especially the first 30 features. When two features are highly correlated, it is possible to drop one of them to reduce the dimensionality of the dataset [15]; the alternative is to apply a feature reduction method such as the proposed AE. Figure 8b shows the correlation matrix after applying AE-2, where there was no correlation between the two extracted features; meanwhile, for AE-3, one extracted feature had a high correlation (0.86) with the first feature, as can be seen in Figure 8c. For AE-8 in Figure 8d, only one extracted feature had a high correlation (0.81) and one a medium correlation (0.53) with the first feature. In Table 6, we compared the performance of classifiers such as DT, XTree, RF, and DNN2. The best result was achieved by AE-8. We can observe that performance metrics such as accuracy, recall, and F1-score were maintained before and after applying AE-8, with only a 2% drop in the precision of XTree and DNN2. PCA-8 and LDA-4 also performed well with respect to the degradation in the performance metrics of the DNN2 classifier. In terms of execution time, RF showed a small decrease in training time without any improvement in detection time. On the other hand, DT, XTree, and DNN2 showed a good reduction in execution time for the training and testing sets. The confusion matrices for RF classification of cyber-attacks on IoT2 with different feature reduction methods are given in Figure 9. We can observe that the classifier had problems classifying the G.TCP and G.UDP attacks. This is due to the similarity of their feature data, which is also evident from the reconstruction error obtained earlier in Figure 6 for IoT1 (the RE for IoT2 was similar to that of IoT1). The confusion matrix without any reduction technique showed the best performance (Figure 9a), where most off-diagonal elements are zero, followed by the proposed AE-8 (Figure 9d) and PCA-8 (Figure 9b). LDA-4 (Figure 9c) showed the worst performance, especially in classifying the G.Combo and G.Junk attacks, where AE-8 managed to classify them with fewer false values.
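The correlation-based reasoning above translates directly into code: compute the correlation matrix, then drop one feature of every highly correlated pair. The 0.9 cut-off and the 3-feature example below are illustrative, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=1000)
x = np.column_stack([a,                                 # feature 0
                     a + 0.05 * rng.normal(size=1000), # near-duplicate of 0
                     rng.normal(size=1000)])           # independent feature

corr = np.corrcoef(x, rowvar=False)                    # values in [-1, 1]
# Examine the upper triangle only, so each pair is checked once.
upper = np.triu(np.abs(corr), k=1)
to_drop = set(np.where(upper > 0.9)[1])                # later feature of each pair
kept = [i for i in range(x.shape[1]) if i not in to_drop]
```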
Table 7 shows the performance of the evaluated techniques on IoT3. We also compared our results with those of the hybrid CNN-LSTM classifier proposed in [14] when no reduction technique was applied. Generally, the three feature reduction techniques showed competitive performance in terms of accuracy and F1-score, especially with the RF classifier, where all achieved the same accuracy and F1-score. With the DT and XTree classifiers, a slight degradation of 1% in accuracy was observed for AE-8, and a 1% reduction in F1-score was observed for PCA-8 and AE-8. After feature reduction, the training and testing time for DT and XTree decreased to about half the time required when no reduction method was applied; however, the RF classifier showed a slight improvement in certain cases only. DNN2 showed good accuracy, with a 2% degradation with LDA-4 and only a 1% degradation with PCA-8 and AE-20; moreover, there were significant improvements in training and detection time for all feature extraction techniques. Figure 10 shows the accuracy and loss curves of DNN2 before and after applying feature extraction methods based on PCA, LDA, and AE. The accuracy and loss of the DNN reached stable performance without feature reduction after a few epochs (i.e., fewer than 20); however, after applying feature reduction, the convergence of PCA-8, AE-8, and AE-20 to stable performance required more epochs. PCA-8 and AE-20 showed almost similar convergence, but AE-8 showed a slightly lower convergence rate, accuracy, and loss performance. LDA-4 showed robust convergence stability within a few epochs, but with degraded accuracy and loss compared to PCA and AE.
In Table 8, we compared the performance of attack classification on the IoT4 dataset with and without feature extraction methods based on the DT, XTree, RF, DNN2, and CNN classifiers, as well as our implementation of the hybrid CNN-LSTM model proposed in [14]. The architecture of the CNN is similar to that of the CNN-LSTM model in [14] but without the two LSTM layers. Performance metrics such as the accuracy, recall, and F1-score of DT, XTree, and RF before feature extraction were maintained after feature extraction based on PCA-8, LDA-4, AE-8, and AE-20. The execution time for DT and XTree was reduced by more than half compared to the case without feature extraction, while RF showed only a slight drop in execution time. DNN2 and CNN performed well with PCA-8 and AE-20, where the accuracy, recall, and F1-score were preserved. AE-8 showed only a 1% drop in accuracy, whereas LDA-4's accuracy dropped by 3%. PCA-8 with the hybrid CNN-LSTM showed better performance, whereas AE-20 had only a 1% drop in accuracy; however, LDA-4 showed sensitivity in the performance metrics with all the deep learning models. In terms of execution time, we can observe a significant reduction in training and testing time for the CNN and hybrid CNN-LSTM models after applying the feature extraction methods. Without feature extraction, both models consumed a huge amount of time for training and testing; therefore, applying feature extraction methods is highly recommended for both models.
Figure 11 shows the accuracy and loss curves for deep neural network classifiers such as DNN2, CNN, and CNN-LSTM after using AE-20. DNN2 without feature reduction converged to maximum accuracy and minimum loss very quickly, i.e., in approximately 20 epochs; meanwhile, with the feature reduction methods, it required extra epochs to converge. CNN showed the best convergence speed compared to DNN2 and CNN-LSTM. In Table 9, we present the results of testing the proposed integrated AE model on the MQTTset dataset, which contains five types of attacks. The number of features in this dataset is 76, which differs from the previously discussed IoT devices. The AE structure used in the analysis was the same as before, with only a change in the optimization model: we used the SGD optimizer because we found that it performed better than the Adam optimizer when the unified AE model was used for feature reduction. The settings for the ML and DL techniques used with the previous IoT devices were also used here. The dataset was divided into 75% (75,033 samples) for training and 25% (25,011 samples) for testing. For the DL models, the training data was further divided into 75% (56,274 samples) for training and 25% (18,759 samples) for validation.
In terms of performance metrics, we can observe that RF, DT, and XTree performed best, followed by DNN2, CNN, CNN-LSTM, and KNN. The worst performance was attained by the linear SVM. After feature reduction, most techniques provided good performance, with only a 1% drop in accuracy and a 2-3% drop in F1-score. Comparing PCA-8 with the proposed AE-8, both techniques provided almost the same results, with the proposed AE-8 performing slightly better with the RF, DNN2, and CNN classifiers. In terms of computational time, all DL models (DNN2, CNN, and CNN-LSTM) showed a significant reduction in training and testing time. The proposed AE-20 showed a significant reduction in training and testing time for DNN2 compared to PCA-8 and LDA-4. Feature extraction for SVM resulted in only slight savings in training and testing time for PCA-8 and AE-8; on the other hand, good time savings were achieved with LDA-4. For KNN, the testing time decreased greatly for all techniques, with a slight increase in training time.

Conclusions
In this paper, we proposed a novel unified model based on a deep AE for anomaly detection and feature extraction for the multi-classification of cyber-attacks. We analyzed the performance of AE for anomaly detection on the N-BaIoT and MQTTset datasets and found that it had robust performance and fast detection time compared to its counterparts, OC-SVM and iForest. We then analyzed the performance of different classifiers such as DT, XTree, RF, DNN, CNN, and CNN-LSTM with and without feature extraction methods. We compared the performance of the proposed AE with two popular feature extraction techniques, namely PCA and LDA. We found that the feature extraction methods can reduce the computation time of the training and testing stages for most of the classifiers with only slight degradation in performance metrics, i.e., an approximately 0-2% drop in accuracy and F1-score. The DT and XTree algorithms demonstrated a good trade-off between performance and computational complexity compared to the other classifiers. The RF classifier showed good performance stability after feature reduction but without any significant improvement in execution time. The DNN model showed sensitivity to the extracted features and required more epochs to converge to the optimum value, while CNN provided better results with improved convergence rates using the same number of epochs. All neural-network-based models delivered significant savings in execution time after feature extraction. To conclude, the proposed integrated AE model has the capability to detect anomalies efficiently and extract useful features, as well as to compete with well-known feature extraction techniques such as PCA and LDA. Future work can focus on implementing other forms of AEs in our proposed unified model, such as the convolutional AE (CAE), variational AE (VAE), sparse AE (SAE), denoising AE (DAE), or other newly emerging AEs, to maintain the performance metrics of the multi-classifiers before and after feature reduction as much as possible.

IoT 2023, 4, 7
Figure 2.
Figure 2. Structure of the proposed integrated system based on deep autoencoder (AE) for anomaly detection and feature extraction for multi-classification of cyber-attacks.

Figure 3 .
Figure 3. Experiment setup of the N-BaIoT dataset considered in [13] for detecting IoT botnet attacks such as Gafgyt (BASHLITE). The N-BaIoT dataset contains both normal traffic and malicious traffic (five Gafgyt attacks and five Mirai attacks) for nine IoT devices. It consists of 115 features collected
The dataset also shows an imbalanced distribution of the attack data, especially for G.Junk, G.Scan, and M.UDP, where G stands for the Gafgyt/BASHLITE attacks and M stands for the Mirai attacks. The encoder of the proposed AE consists of 3 hidden layers with 64, 32, and 16 neurons; between each pair of hidden layers, BatchNormalization and LeakyReLU layers were used [13]. The decoder is the symmetrical form of the encoder, with 16, 32, and 64 neurons in its three hidden layers. The input and output layers have 115 neurons, whereas the bottleneck layer has 2, 3, or 8 neurons. The Adam/SGD optimizer and MSE as the loss function were used to compile the model, which was trained for 20 epochs.
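The layer widths just listed (115 → 64 → 32 → 16 → bottleneck, mirrored in the decoder) can be sketched as a plain forward pass. This numpy stand-in uses random untrained weights and omits the BatchNormalization layers, so it illustrates only the shapes, not the trained model:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Element-wise LeakyReLU; alpha is an assumed slope.
    return np.where(z > 0, z, alpha * z)

def make_ae_weights(k, seed=0):
    # Encoder 115->64->32->16->k, decoder k->16->32->64->115.
    widths = [115, 64, 32, 16, k, 16, 32, 64, 115]
    rng = np.random.default_rng(seed)
    return [0.1 * rng.normal(size=(a, b)) for a, b in zip(widths[:-1], widths[1:])]

def forward(weights, x):
    h = x
    for w in weights[:-1]:
        h = leaky_relu(h @ w)   # hidden layers (BatchNormalization omitted)
    return h @ weights[-1]      # linear output layer

weights = make_ae_weights(k=8)               # bottleneck of 8 neurons
x = np.random.default_rng(1).normal(size=(4, 115))
x_hat = forward(weights, x)                  # reconstruction, shape (4, 115)
```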

Figure 10.
Figure 10. (a) Accuracy and (b) loss curves for DNN2 with different feature extraction methods using the attack dataset of IoT3.


Figure 11.
Figure 11. (a) Accuracy and (b) loss curves for different deep learning models using AE-20 based on the IoT4 dataset.

Author Contributions: K.A.A.: conceptualization, methodology, investigation, and drafting the manuscript. H.-S.L.: supervision, revising technical design, and editing the manuscript. M.H.M.S.: co-supervision, writing-review and editing. Y.S.Y.: review and editing. All authors have read and agreed to the published version of the manuscript.

Table 1 .
IoT devices and corresponding datasets used in our analysis.

Table 2 .
Performance metrics of anomaly detection models.

Table 3 .
Time (in seconds) taken by the feature reduction techniques on the training and testing datasets.

Table 4 .
Comparison of different multi-classifiers of cyber-attacks for IoT1 using different feature extraction techniques.

Table 5 .
Comparison of DNN1 and DNN2 classifiers of cyber-attack for IoT1 using different feature reduction techniques.

Table 6 .
Performance metrics of multi-classification of cyber-attacks for IoT2 using different feature extraction methods.

Table 7 .
Performance metrics of multi-classification of cyber-attacks for IoT3 using different feature extraction methods.

Table 8 .
Performance metrics and execution time based on IoT4 dataset.

Table 9 .
Performance metrics and execution time based on IoT5 dataset.
