An Efficient Deep Learning-Based Detection and Classification System for Cyber-Attacks in IoT Communication Networks

With the rapid expansion of intelligent resource-constrained devices and high-speed communication technologies, the Internet of Things (IoT) has gained wide recognition as the primary standard for low-power lossy networks (LLNs). Nevertheless, IoT infrastructures are vulnerable to cyber-attacks due to the constraints in computation, storage, and communication capacity of their endpoint devices. On one side, the majority of newly developed cyber-attacks are formed by slightly mutating previously established attacks, so that the new attack tends to be treated as normal traffic by the IoT network. On the other side, coupling deep learning techniques with the cybersecurity field has become a recent inclination of many security applications because of their impressive performance. In this paper, we provide a comprehensive development of a new intelligent and autonomous deep learning-based detection and classification system for cyber-attacks in IoT communication networks that leverages the power of convolutional neural networks, abbreviated IoT-IDCS-CNN. The proposed IoT-IDCS-CNN makes use of high-performance computing employing robust CUDA-based Nvidia GPUs and parallel processing employing high-speed i9-core Intel CPUs. In particular, the proposed system is composed of three subsystems: a Feature Engineering subsystem, a Feature Learning subsystem, and a Traffic Classification subsystem. All subsystems were developed, verified, integrated, and validated in this research. To evaluate the developed system, we employed the NSL-KDD dataset, which includes all the key attacks in IoT computing. The simulation results demonstrated cyber-attack classification accuracies of more than 99.3% and 98.2% for the binary-class classifier (normal vs. anomaly) and the multi-class classifier (five categories), respectively.
The proposed system was validated using the k-fold cross-validation method and evaluated using the confusion-matrix parameters (i.e., TN, TP, FN, FP) along with other classification performance metrics, including precision, recall, F1-score, and false alarm rate. In test and evaluation, the IoT-IDCS-CNN outperformed many recent machine-learning-based IDCS systems in the same area of study.


Introduction
The Internet of Things (IoT) is comprised of a collection of heterogeneous resource-constrained objects interconnected via different network architectures such as wireless sensor networks (WSNs) [1]. These objects or "things" are usually composed of sensors, actuators, and processors with the ability to communicate with each other to achieve common goals/applications through unique identifiers with respect to the Internet Protocol (IP) [2,3]. Current IoT applications include smart buildings; telecommunications; medical and pharmaceutical applications; aerospace and aviation; environmental monitoring; agriculture; industrial and manufacturing processes; etc. The basic IoT layered architecture, shown in Figure 1, has three layers: the perception layer (consists of edge devices that interact with the environment to identify certain physical factors or other smart objects in the environment), the network layer (consists of networking devices that discover and connect devices over the IoT network to transmit and receive the sensed data), and the application layer (consists of various IoT applications/services responsible for data processing and storage). Indeed, most cyber-attacks target the application and network layers of the IoT system. IoT is a promising, profound technology with tremendous expansion and effect. IoT infrastructures are vulnerable to cyber-attacks in that, within the network, simple endpoint devices (e.g., thermostats, home appliances) are more constrained in computation, storage, and network capacity than the more complex endpoint devices (e.g., smartphones, laptops) that may reside within the IoT infrastructure. Once the IoT infrastructure is breached, hackers can distribute the IoT data to unauthorized parties and can manipulate the accuracy and consistency of IoT data over its entire life cycle [5]. Therefore, such cyber-attacks need to be addressed for safe IoT utilization.
Consequently, vast efforts to handle the security issues in the IoT model have been made in recent years. Many of the new cybersecurity technologies were developed by coupling the fields of machine learning and cybersecurity. It should be noted that the majority of new IoT attacks are slight deviations (i.e., mutations) of earlier known cyber-attacks [6]. Such slight mutations of IoT attacks have been demonstrated to be difficult to identify/classify using traditional machine learning techniques. Promising state-of-the-art research has been conducted for cybersecurity using deep neural networks [7][8][9][10][11][12]. Table 1 summarizes research on conventional and traditional machine learning approaches to solving cybersecurity issues. In this paper, a new intelligent system that can detect mutations of common IoT cyber-attacks using non-traditional machine learning techniques exploiting the power of Nvidia Quadro GPUs is proposed. The proposed system employs a convolutional neural network (CNN) along with its associated machine learning algorithms to classify the NSL-KDD dataset records (we denote our system by the acronym IoT-IDCS-CNN). The NSL-KDD dataset stores non-redundant records of all the key attacks of IoT computing with different levels of difficulty. Specifically, the main contributions of this paper can be summarized as follows:
• We provide a comprehensive, efficient detection/classification model that can classify the IoT traffic records of the NSL-KDD dataset into two (binary classifier) or five (multi-classifier) classes. Also, we present detailed preprocessing operations for the collected dataset records prior to their use with deep learning algorithms.
• We provide an illustrated description of our system modules and the machine learning algorithms. Furthermore, we demonstrate a comprehensive view of the computation process of our IoT-IDCS-CNN.
• We provide an inclusive development and validation environment and configurations, along with extensive simulation results, to gain insight into the proposed model and the solution approach. This includes simulation results related to the classification accuracy, classification time, and classification error rate for the system validation of both detection (binary classifier) and classification (multi-classifier).
• We provide a comprehensive performance analysis to gain more insight into the system efficiency, such as the confusion matrix to analyze the attack-detection true/false positives and true/false negatives, and other evaluation metrics including precision, recall, F-score, and false alarm rate.
• We conduct robust validation of our system using the k-fold cross-validation method and a benchmarking study of our findings against other related state-of-the-art works employing the same dataset, as well as a comparison with other state-of-the-art machine-learning-based intrusion detection systems (ML-IDS) employing different datasets.
The rest of this paper is organized as follows: Section 2 introduces and justifies the dataset of IoT cyber-attacks employed by our system. Section 3 provides details of the proposed system architecture, development, and detailed design steps. Section 4 presents the simulation environment for system implementation, testing, and validation. Section 5 discusses the details of the experimental evaluation, comparison, and discussion. Finally, Section 6 concludes the findings of the research.

Dataset of Cyberattacks
Data collection involves gathering information on the variables of interest within a dataset in a documented, organized manner that allows one to answer the defined research enquiries, examine the stated hypotheses, and assess the output consequences. In this research, the variables of interest concern the intrusion/attack data records in IoT computing environments. Two global datasets of IoT attacks can be investigated: the KDD'99 dataset and the NSL-KDD dataset. KDD'99 was developed by the DARPA intrusion detection evaluation program to build a network IDS capable of differentiating between "bad" and "good" connections [22]. This dataset includes a standard list of data to be inspected, which contains a broad range of cyber-attacks modeled in a military communication platform. However, one of the most important issues with this dataset is the enormous number of redundant data samples in the training and testing datasets. Such redundancy affects the accuracy of the classifier, which will be biased towards the more frequent records [22].
Subsequently, KDD'99 was re-investigated and updated through a newer version called NSL-KDD [23]. NSL-KDD [24] is a reduced version of the original KDD'99 dataset [25] and consists of the same features; however, it includes more up-to-date and non-redundant attack records with different levels of difficulty. Figure 2 shows sample records of the original NSL-KDD training dataset in CSV format, read by Notepad as TXT (prior to any processing technique). In this research, the NSL-KDD dataset is employed for many reasons, including: (a) it can be efficiently imported, read, preprocessed, encoded, and programmed to produce two-class or multi-class classification of IoT cyber-attacks; (b) it covers all key attacks of IoT computing, including Denial-of-Service (DoS) [26], Probe (side channel) [27], Remote to Local (R2L) [28], and User to Root (U2R) [28]; (c) it is obtainable as a TXT/CSV file consisting of a reasonable number of non-redundant records in the training and test sets, which improves the classification process by avoiding bias towards more frequent records; (d) it correlates to high-level IoT traffic structures and cyber-attacks, and it can be customized, expanded, and regenerated [24]. The NSL-KDD dataset has been thoroughly developed with high-level, diverse interpretations of the training data covering normal and abnormal IoT network traffic. The normal data samples represent the legitimate data packets processed by the IoT network. The abnormal data samples represent mutated data packets (i.e., attacks) achieved by slight mutations of previously developed attacks, such as small changes in the network packet header configurations. The original dataset is available in two classification forms: a two-class traffic dataset with binary labels, and a multi-class traffic dataset including attack-type labels and difficulty level.
In both cases, it comprises 148,517 samples, each with 43 attributes such as duration, protocol, and service [29]. The statistics of the traffic distribution of the NSL-KDD dataset are summarized in Table 2.
Table 2. Statistics of traffic distribution of NSL-KDD dataset [25].


System Modeling
In this research, the proposed system is partitioned into distinct subsystems each of which is implemented with several modules. Specifically, the system is composed of three subsystems including: Feature Engineering (FE), Feature Learning (FL), and Detection and Classification (DC), as illustrated in Figure 3.

Implementation of Feature Engineering (FE) Subsystem
This subsystem is responsible for converting the raw IoT traffic data records of the NSL-KDD dataset into a matrix of labeled features that can be fed to and trained by the neural network part of the FL subsystem. The implementation stages of this subsystem are as follows. Importing the NSL-KDD dataset: In this stage, the collected dataset is imported/read using MATLAB in a tabulated format instead of the raw data in the original dataset text files. The imported dataset includes 43 different features/columns, each assigned a virtual name based on the nature of the data in its cells. Figure 4 shows a sample of the imported NSL-KDD dataset using the table datatype; the illustrated sample shows only the first ten records along with five features.
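As an illustrative sketch only (the paper's implementation uses MATLAB tables), the import stage can be mimicked in Python with the standard csv module; the column names below are placeholders for the 43 virtual names assigned in this stage:

```python
import csv

# Placeholder virtual names for the 43 columns (41 features + class + difficulty).
COL_NAMES = [f"feature_{i}" for i in range(41)] + ["class", "difficulty"]

def load_nsl_kdd(fileobj):
    """Read raw NSL-KDD records into a list of dicts keyed by column name."""
    reader = csv.reader(fileobj)
    return [dict(zip(COL_NAMES, row)) for row in reader]
```

Each record is then addressable by column name rather than by position, mirroring the tabulated MATLAB format.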

Renaming Categorical Features:
Four of the 43 imported features are categorical and need to be renamed prior to the data encoding and sample labeling processes. These features are the target protocol, the required service, the service flag, and the record category (e.g., normal or attack). Therefore, the four categorical columns are renamed in this stage. Figure 5 illustrates the four categorical features (columns) renamed for the binary-class data records (the other columns are omitted for readability). Also, note that the dataset encompasses multi-class data records for different traffic categories.

One Hot Encoding of Categorical Features:
This module is responsible for converting the categorical data records into numerical data records so that they can be employed by the neural network. Therefore, three categorical features undergo the one-hot encoding process (1-N encoding) [30]: the protocol column, the service column, and the flag column. The class feature/column is left for the sample labeling process.
• For the protocol feature, three different protocol types are revealed from the dataset: {TCP, UDP, ICMP}. One-hot encoding replaces the categorical data of the 'Protocol' column with three numerical features, as shown in Table 3.

Table 3. Scheme for Replacement of Categorical Data of Protocols
• For the service feature, 69 different services are revealed from the dataset. One-hot encoding replaces the categorical data of the 'Service' column with 69 numerical features, as shown in Table 4.
• For the flag feature, 11 different flags are revealed from the dataset: {'OTH', 'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1', 'S2', 'S3', 'SF', 'SH'}. One-hot encoding replaces the categorical data of the 'Flag' column with 11 numerical features, as shown in Table 5 and Figure 6.
Converting Tables to a Double Matrix: At the end of the dataset importing, encoding, and labeling processes, the dataset samples and targets must be provided to the neural network inputs of the FL subsystem as a matrix of all numerical input samples. Therefore, the encoded dataset tables are converted to a double matrix of size 148,517 x 124.
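The replacement scheme of Tables 3-5 can be sketched with a minimal Python illustration (not the paper's MATLAB code); the category lists below are only those named in the text:

```python
def one_hot(value, categories):
    """Replace one categorical value with a binary indicator vector (1-N encoding)."""
    return [1 if value == c else 0 for c in categories]

PROTOCOLS = ["tcp", "udp", "icmp"]  # 3 numerical columns replace 'Protocol'
FLAGS = ["OTH", "REJ", "RSTO", "RSTOS0", "RSTR",
         "S0", "S1", "S2", "S3", "SF", "SH"]  # 11 numerical columns replace 'Flag'
```

For example, a 'udp' record yields the indicator vector [0, 1, 0] in place of the single protocol column; the 69-service case works identically with a longer category list.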

Matrix Resizing with Padding Operation: This module is responsible for adjusting the size of the dataset matrix to accommodate the input size of the FL subsystem. This is performed by resizing the engineered dataset matrix from 148,517 x 124 to 148,517 x 784, since every individual sample processed at the FL subsystem has input size 28 x 28 (= 784). Thereafter, the new empty records of this matrix are padded using the zero-padding technique [31]. To avoid any feature biasing in the dataset samples, the padded records are distributed equally around the data values. Figure 7 illustrates an example of the resizing with zero-padding operation used in this research. The new matrix is composed of 148,517 attack samples, each with 784 features. Matrix Normalization with Min-Max Norm: Data normalization is performed to bring all data points into the same range (scale) with equal significance. Otherwise, a feature with large values might completely dominate the others in the dataset. Thus, this module is responsible for normalizing all integer values of the dataset matrix into the range 0-1 using Min-Max Normalization (MX-Norm) [32], a well-known normalization method commonly used in machine learning applications. In this method, we scan all values in every feature; the minimum value is converted to 0, the maximum value is converted to 1, and the other values are normalized to fractional values between 0 and 1. The min-max normalization x'_ij for data record x_ij at the (i,j)-th position of matrix X is defined as follows:

x'_ij = (x_ij - min(X_j)) / (max(X_j) - min(X_j))

where min(X_j) and max(X_j) are the minimum and maximum of feature column j. Figure 8 illustrates an example of integer data features normalized using min-max normalization (0-1). The effect of normalization can be clearly seen, as it ensures all features are on the same scale.
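The centered zero-padding and per-feature min-max steps can be sketched in Python/NumPy (an illustrative reimplementation, not the original MATLAB code):

```python
import numpy as np

def pad_center(row, target_len=784):
    """Zero-pad a 1-D feature vector to target_len, keeping the original
    124 values centered so the padding is distributed equally around them."""
    extra = target_len - row.size
    left = extra // 2
    return np.pad(row, (left, extra - left))

def min_max(col):
    """Min-max normalize one feature column into [0, 1]."""
    lo, hi = col.min(), col.max()
    return np.zeros_like(col, dtype=float) if hi == lo else (col - lo) / (hi - lo)
```

A 124-value sample therefore gains 330 zeros on each side to reach 784 features, and every column is scaled independently of the others.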

Reshaping the Double Matrix:
This module is responsible for creating the attack samples for the FL subsystem by reshaping the one-dimensional vectors of attack records into two-dimensional square matrices to accommodate the input size of the developed network. Accordingly, every one-dimensional vector sample (1 x 784) is reshaped into a two-dimensional matrix (28 x 28). This operation generates a square matrix for each data sample, as illustrated in Figure 9.
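This reshaping step amounts to one call (sketched here in NumPy, which uses row-major order; MATLAB's reshape is column-major, so the element ordering differs but the principle is the same):

```python
import numpy as np

def to_square(sample, side=28):
    """Reshape a flat 1 x 784 record into a 28 x 28 matrix for the CNN input."""
    return np.asarray(sample).reshape(side, side)
```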

Implementation of Feature Learning (FL) Subsystem
So far, the FE subsystem has been developed; the next step is to process the encoded input features using the CNN-based FL subsystem. The deep learning network is to be trained with minimum classification error and thus maximum accuracy. Generally, a CNN involves various layers, including convolution, activation, pooling, flattening, and others. Convolutional layers are the core component of a CNN; they are hierarchically assembled to generate a number of feature maps, which enable CNNs to learn complex features, a vital operation for recognizing patterns in classification and detection tasks. Therefore, the developed FL subsystem is responsible for an appropriate network that can accept the encoded features from the FE subsystem at the input layer, train on them with multiple hidden layers, and update the training parameters before classifying the IoT traffic as normal or anomalous (mutated). The implementation stages of this subsystem are as follows. Feature Mapping with 2D-Convolution Operations Layer: This module is responsible for generating new matrices, called feature maps, that emphasize the unique features of the original matrix [33]. These feature maps are produced by convolving (multiply and accumulate) the original matrix (n x n) with a number N of k x k convolution filters with padding size p and stride size s, which yields feature maps of size o x o. The size of the resultant feature maps can be evaluated as follows:

o = (n - k + 2p) / s + 1

In this research, we applied 20 convolution filters (9 x 9) over the 28 x 28 input samples with p = 0 and s = 1, which resulted in 20 feature maps, each 20 x 20. Figure 10 illustrates our convolutional layer, where the input is a 28 x 28 matrix and the filter is 9 x 9; this defines a space of 20 x 20 neurons in the first hidden layer, because we can only move the window 19 neurons to the right and 19 neurons to the bottom before hitting the right (or bottom) border of the input matrix.
Note that the filter moves forward one position at a time, both horizontally and vertically when a new row starts. Also note that the convolution layer goes through a backpropagation process to determine the most accurate values of its trainable parameters (weights: k x k x N = 9 x 9 x 20). Rectification with ReLU Activation Layer: The rectified linear unit (ReLU) activation function is then applied to the convolved feature maps. The reason for using ReLU is that training a deep network with ReLU tends to converge much more quickly and reliably than training with other non-linear activation functions such as sigmoid or tanh [34]. Figure 11 illustrates the rectification layer of the convolved maps.
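The convolution arithmetic above can be sanity-checked with a naive sketch (illustrative Python, single channel; the absence of a bias term is our simplifying assumption):

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D convolution (p = 0, s = 1): multiply-accumulate
    the k x k filter at every position of x."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    """Element-wise rectification: negative activations become zero."""
    return np.maximum(x, 0.0)
```

A 9 x 9 filter over a 28 x 28 input indeed yields a 20 x 20 map, matching (28 - 9 + 0)/1 + 1 = 20.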

Down-Sampling with Pooling Operations Layer:
This module is responsible for generating new matrices, called pooled feature maps, that reduce the spatial size of the rectified feature maps and thus reduce the number of parameters and the computational complexity of the network [33]. This is done by combining the neighboring points of a particular region of the matrix representation into a single value that represents the selected region. The adjacent points are typically selected from a fixed-size square window (determined according to the application), and among these points one value is nominated as the maximum or the mean of the selected points. In this research, we used the mean pooling technique to develop the pooling layer, since it combines the contributions of neighboring points instead of only selecting the maximum point. To produce the pooled feature maps, a pooling filter of size q x q is independently applied over each rectified feature map of size o x o with stride s, yielding pooled maps of size (o - q)/s + 1. In this research, we applied 20 pooling operations (2 x 2) over the 20 x 20 rectified feature maps with s = 2, which resulted in 20 pooled feature maps, each 10 x 10 (= (20 - 2)/2 + 1). Figure 12 illustrates our pooling layer, where the input from the previous layer is 20 x 20 x 20 and the mean pooling filter is 2 x 2. Note that the stride value is 2, which means that the filter moves forward 2 positions, both horizontally and vertically when a new row starts. Thus, we end up with pooled maps of 10 x 10 x 20.
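The 2 x 2 mean pooling with stride 2 can be sketched as follows (illustrative NumPy; assumes the map dimensions are divisible by the pool size, as they are here):

```python
import numpy as np

def mean_pool(fmap, size=2):
    """Average each non-overlapping size x size block (stride = size)."""
    h, w = fmap.shape
    return fmap.reshape(h // size, size, w // size, size).mean(axis=(1, 3))
```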

Implementation of Detection and Classification (DC) Subsystem
The DC subsystem is responsible for classifying the input traffic data into binary-class classification (2 classes: normal vs. anomaly) or multi-class classification (5 classes: Normal, DoS, Probe, R2L, U2R). This subsystem is composed of three consecutive stages, as follows. Flattening Layer of Pooled Feature Maps: This module is responsible for linearizing the output dimensions of the convolutional/pooling layers to create a single long feature vector [33]. This is achieved by converting the 2-D data of the N pooled feature maps into a 1-D array (vector) to be input to the next layer, which is connected to the final classification model, called a dense or fully connected layer. Since the flatten layer collapses the spatial dimensions of the input into the channel dimension (array), if the input to the flatten layer is N feature maps, each with dimension d x d, then the flattened output F is obtained by multiplying the input dimensions by the number of maps, that is:

F = N x d x d (4)
In this research, since we have 20 pooled feature maps (N = 20), each with dimension 10 x 10 (d = 10), our flatten layer comprises 2000 nodes. Figure 13 illustrates the flattening layer of our CNN. Fully Connected Layer with ReLU Function: Fully connected (FC) layers, as the name implies, are those layers where all the inputs from one layer are connected to every activation unit of the next layer [33]. Commonly, FC layers are located as the last few layers of a CNN. This module is responsible for compiling the high-level features extracted by the previous layers (convolutional and pooling) into a reduced form of low-level features that can be used by the classifier located at the output layer to provide classification probabilities. In this research, we developed the FC layer with 200 neurons connected to the 2000 nodes of the flatten layer, which provides a layer complexity reduction of 10:1. As the inputs pass from the units of the flatten layer through the neurons of the FC layer, their values are multiplied by the weights and then passed into the employed activation function (normally ReLU), just as in a classical (i.e., shallow) NN. Thereafter, they are forwarded to the output classification layer, where each neuron expresses a class label. Note that the FC layer also goes through a backpropagation process [33] to determine the most accurate values of its trainable parameters (weights: 2000 x 200). Figure 14 illustrates the development of the FC layer of our CNN. Output Classification Layer with SoftMax Function: The output layer computes its activations from the FC layer outputs through the transposed weight connections, as illustrated in Figure 15. Note that the output layer also goes through a backpropagation process to determine the most accurate values of its trainable parameters (weights: 200 x 5). The last layer of the neural network is a SoftMax layer, which has the same number of nodes as the output layer and normalizes the output into a probability distribution over the classes [33].
Specifically, SoftMax assigns a numerical probability value to every class at the output layer, where these probabilities sum to 1.0 (forming a probability distribution). Given an input vector z of k real numbers, with i indexing the input values, the SoftMax function σ: ℝ^k ⟼ ℝ^k is defined as follows:

σ(z)_i = e^(z_i) / Σ_{j=1..k} e^(z_j), for i = 1, ..., k

For example, for an attack record, SoftMax produces a vector of five probabilities, one per class, with the predicted class taking the largest value.
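A minimal, numerically stable sketch of the SoftMax computation (the max-shift is a standard implementation trick, not part of the paper's description):

```python
import numpy as np

def softmax(z):
    """Exponentiate (after shifting by the max, for numerical stability)
    and normalize so the class probabilities sum to 1.0."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```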

System Integration
In this section, we integrate all the aforementioned subsystems and modules to come up with the complete system architecture of our IoT-IDCS-CNN. Figure 16 illustrates the top-level architecture of the integrated system as a feedforward-network-based IoT attack detection system. According to the system architecture, after the data preprocessing stages and using the 28 x 28 input matrix, we constructed 784 (= 28 x 28) input nodes. To extract features of the input data, the network encompasses a deep convolutional layer involving a depth of 20 convolution filters of size 9 x 9. Thereafter, the results of the convolutional layer pass through the ReLU activation function, which is followed by the subsampling operation of the pooling layer. The pooling layer utilizes the average pooling method with 2 x 2 submatrices. The pooled features are then flattened to 2000 nodes. The classification/detection neural network comprises a single hidden fully connected (FC) layer and the output classification layer. The FC layer comprises 200 nodes with the ReLU activation function. Since our system requires classification of the data into 5 classes, the output layer is implemented with 5 nodes and the SoftMax activation function. Table 6 recaps the final integrated system for IoT attack detection. Moreover, the life cycle of the packet traffic received at the IoT gateway is provided in Figure 17. The input layer takes the encoded features generated by the FE subsystem to be trained by the CNN, which updates the training parameters and generates the least cost/loss value (error) with optimal accuracy.
The output layer employs the SoftMax classifier, which classifies the data using two classification techniques: the binary classification technique, which provides two categories (normal vs. anomaly), and the multi-classification technique, which provides five categories (normal, DoS attack, Probe attack, R2L attack, U2R attack).
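As a cross-check of the integrated dimensions above, the size arithmetic of the pipeline can be traced with a short helper (a sketch only; the parameter names are ours, not the paper's):

```python
def layer_shapes(n=28, k=9, filters=20, pool=2, fc=200, classes=5):
    """Trace tensor sizes through the integrated IoT-IDCS-CNN pipeline."""
    conv = n - k + 1                  # valid convolution: p = 0, s = 1
    pooled = conv // pool             # mean pooling with stride 2
    flat = filters * pooled * pooled  # flatten: F = N * d * d
    return {"input": n * n, "conv": (filters, conv, conv),
            "pool": (filters, pooled, pooled),
            "flatten": flat, "fc": fc, "output": classes}
```

Running it with the defaults reproduces the sizes in Table 6: 784 inputs, 20 x 20 x 20 convolved maps, 10 x 10 x 20 pooled maps, 2000 flattened nodes, 200 FC neurons, and 5 output classes.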

Simulation Environment
To implement, verify, and validate the proposed IoT attack detection and classification system, the training and testing were performed on the NSL-KDD dataset, which involves the key attacks on IoT communication. The classifier model was configured with either two classes (binary attack detection) or five classes (multi-attack classification). The proposed system was implemented in MATLAB 2019a. To evaluate the system performance, experiments were performed on a high-performance computing platform utilizing the power of a central processing unit (CPU) and a graphics processing unit (GPU) with the multicore structure of an NVIDIA Quadro P2000 graphics card. The specifications of the workstation used in development, validation, and verification are provided in Table 7.

Results and Discussion
Verification and validation (V&V) are essential quality-control activities performed independently to check the system's compliance with requirements and specifications and that it fulfills its intended purpose. Typically, the verification process is defined as a number of activities used to examine the suitability of the system or component (i.e., are we building the product right?). On the other hand, the validation process is defined as a number of activities used to examine the conformity of the system (or any of its elements) with its purpose and functions (i.e., are we building the right product?). Note that while system validation is distinct from verification, the actions of both processes are integral and meant to be performed in tandem [35]. In this section, we provide comprehensive verification and validation to check the system's compliance with its intended objectives and purpose.

System Evaluation and Verification
To verify the effectiveness of the proposed system in compliance with its intended functionalities and missions, we evaluated the system performance on the recommended testing dataset in terms of classification accuracy, classification error percentage, and classification time, as follows. The plots of the overall testing classification accuracy and overall classification loss (classification error), comparing the performance of the binary classifier (2-Class) and the multi-classifier (5-Class) obtained during the validation process on the NSL-KDD dataset, are illustrated in Figure 18. According to the figure, at the beginning, after one complete pass (epoch) of the testing process, both classifiers showed relatively low classification accuracies, with 85% and 79% registered for the 2-Class classifier and the 5-Class classifier, respectively. Thereafter, both classification accuracy curves increase in a stable tendency as the testing epochs proceed, with a faster rise and a higher ceiling obtained for the classification accuracy of the 2-Class classifier. After training the system for 100 epochs, the system recorded overall testing accuracies of 99.3% and 98.2% for the 2-Class classifier and the 5-Class classifier, respectively, on the given testing dataset samples. Conversely, both classifiers showed relatively high classification error proportions at the beginning of the testing process, with 15% and 21% registered for the 2-Class classifier and the 5-Class classifier after one testing epoch, respectively. Thereafter, both classification error rates started to decline systematically, with the binary classifier progressing faster and achieving a 0.7% incorrect-prediction proportion (classification error percentage), while the classification error rate of the multi-classifier saturated at less than 2.0% incorrect predictions.
This classification-error range for both classifiers (0.7%–1.8%) is acceptable relative to the training loss (~0.0%) and training accuracy (~100%), indicating neither underfitting nor overfitting, and thus the system delivers high-accuracy classification performance. Moreover, we analyzed the time required to perform attack detection or classification for one IoT traffic sample. To obtain accurate and precise results, we ran the validation test 500 times and then computed the detection and classification time statistics. Figure 19 shows the detection/classification time performance of the proposed model (for either the 2-Class or the 5-Class classifier). According to the figure, the time required to detect/classify one sample record ranges from ≈ 0.5662 to ≈ 2.099, with an average of ≈ 0.9439 recorded over the 500 simulation runs. This short average time makes the system well suited to dynamic environments such as real-time IDS applications. Furthermore, although classification accuracy is the key factor used to evaluate the efficiency of a classification or detection system, we also evaluated the validation (testing) dataset using a confusion matrix, with explicit True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) analysis, to provide more insight into the performance of the proposed system. Figure 20 shows the general confusion matrix of our system, the confusion matrix results for the 2-Class classifier on the testing dataset, and the confusion matrix results for the 5-Class classifier on the testing dataset. The confusion matrix parameters (i.e., TN, TP, FN, FP) can then be used to compute other performance evaluation metrics (of lesser importance than the accuracy metric), including: (a) the classification precision (detection rate), defined as the percentage of relevant instances (e.g., attacks) among the retrieved instances; (b) the classification recall (sensitivity), defined as the percentage of positive instances that are correctly labeled; (c) the F1-score, defined as the harmonic mean of precision and recall (i.e., it accounts for both false negatives and false positives); and (d) the false alarm rate, defined as the percentage of normal instances misclassified by the system [38]. These metrics are calculated by the following equations, while Table 8 summarizes the overall evaluation metric results for our proposed system.
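In code form, the standard confusion-matrix definitions of these four metrics can be sketched as follows. The counts in the usage example are illustrative placeholders, not results from the paper's experiments:

```python
# Standard evaluation metrics computed from confusion-matrix counts
# (TP, TN, FP, FN). Example counts below are hypothetical.

def precision(tp, fp):
    # Detection rate: relevant instances (attacks) among retrieved instances.
    return tp / (tp + fp)

def recall(tp, fn):
    # Sensitivity: positive instances that are correctly labeled.
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def false_alarm_rate(fp, tn):
    # Fraction of normal instances misclassified as attacks.
    return fp / (fp + tn)

if __name__ == "__main__":
    tp, tn, fp, fn = 9500, 9800, 70, 130  # hypothetical counts
    print(f"precision = {precision(tp, fp):.4f}")
    print(f"recall    = {recall(tp, fn):.4f}")
    print(f"F1-score  = {f1_score(tp, fp, fn):.4f}")
    print(f"FAR       = {false_alarm_rate(fp, tn):.4f}")
```

Note that accuracy, by contrast, is (TP + TN) / (TP + TN + FP + FN), which is why the confusion-matrix counts suffice to derive all of the reported metrics.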

System Validation and Benchmarking
To validate the proficiency of the proposed system in compliance with its purpose and specifications, and to ensure a high level of reliability in the validation stage, we conducted a 5-fold cross-validation process [37] comprising 5 different experiments for each classification model (10 experiments in total). In each experiment, we selected different sets for training (~128,000 samples) and validation (~20,000 samples), as demonstrated in Figure 21, which shows the distribution of the dataset across the folds for each experiment. For each experiment, we evaluated the validation accuracy and validation error of the classification system models (2-Class/5-Class). The results of the five experiments were then averaged to provide overall validation accuracy and validation error values. The proposed system exhibited a high level of stability and reliability across the dataset folds, which confirms its robustness in detecting and classifying attacks on IoT communications. The results of the 5-fold cross-validation are provided in Table 9 below. Additionally, to gain more insight into the advantage of the proposed method, we benchmark the IoT-IDCS-CNN classification system by comparing its performance with other state-of-the-art machine-learning-based intrusion/attack detection systems in terms of the classification accuracy metric. For a fairer and more reasonable evaluation, we selected related studies that employ machine-learning techniques for intrusion/attack detection/classification on the NSL-KDD dataset (the same dataset used by our system). Table 10 summarizes the classification accuracy values of the related state-of-the-art research in chronological order.
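The k-fold procedure described above can be sketched as follows. The `train_and_score` callback is a hypothetical stand-in for training the CNN classifier on the training folds and returning its validation accuracy; the splitting and averaging logic is the standard k-fold scheme:

```python
# Minimal sketch of 5-fold cross-validation: split the dataset into 5
# folds, train on 4 folds and validate on the remaining one in each
# experiment, then average the per-fold validation accuracies.

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

def cross_validate(n_samples, train_and_score, k=5):
    """Average the validation score over the k experiments.

    `train_and_score(train_idx, val_idx)` is a user-supplied callback
    (hypothetical here) that trains a model and returns its accuracy.
    """
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

In practice, shuffling the indices before splitting (or using a library routine such as scikit-learn's `KFold`) is common; the sketch keeps the folds contiguous for clarity.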
Accordingly, it can be clearly observed that the proposed IoT-IDCS-CNN model improves upon the cyber-attack classification accuracy of the other ML-IDS models by an improvement factor (IF) of roughly 1.03–1.25. Finally, although the other existing machine-learning-based intrusion/attack detection/classification studies use different cyber-attack datasets, learning policies, programming techniques, and computing platforms, we can still compare classification system performance in terms of the testing accuracy metric and the complexity of the developed method. Therefore, for better readability, Table 11 summarizes the classification accuracy metrics of these related state-of-the-art studies in chronological order. According to the comparison in the table, the proposed approach produces attractive results in terms of classification accuracy, showing superiority over all the compared methods.

Gradient Descent (GD) Algorithm:
Gradient descent is an iterative optimization algorithm for finding the minimum of a loss (error) function. To achieve this goal, it performs two steps repeatedly: (1) compute the slope (first-order derivative) of the loss at the current point; (2) move in the direction opposite to the slope, stepping away from the current point by an amount proportional to the computed gradient. The idea is illustrated in Figure A.1 below [38]. In machine learning, there are three common variants of GD (ways to go down the slope): Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MGD).
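The two-step iteration above can be sketched on a toy loss function. The loss f(w) = (w − 3)², the learning rate, and the starting point are illustrative choices, not values from the paper:

```python
# Minimal gradient-descent sketch: repeatedly (1) evaluate the gradient
# at the current point, (2) step in the opposite direction, scaled by
# the learning rate.

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Minimize a 1-D function given its gradient `grad`."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)  # move against the slope
    return w

# Toy loss f(w) = (w - 3)^2 has gradient f'(w) = 2 * (w - 3); minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_min, 4))  # prints 3.0
```

The three variants differ only in how the gradient is estimated per update: BGD uses the entire training set, SGD a single sample, and MGD a small batch, trading gradient accuracy against update cost.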