Article

Intrusion Detection on AWS Cloud through Hybrid Deep Learning Algorithm

Balajee R M and Jayanthi Kannan M K
1 Research Scholar, Department of Computer Science and Engineering, Faculty of Engineering and Technology, JAIN (Deemed to be University), Bangalore 562112, India
2 Professor and HOD, Department of Information Science and Engineering, Faculty of Engineering and Technology, JAIN (Deemed to be University), Bangalore 562112, India
* Author to whom correspondence should be addressed.
Electronics 2023, 12(6), 1423; https://doi.org/10.3390/electronics12061423
Submission received: 4 January 2023 / Revised: 28 February 2023 / Accepted: 13 March 2023 / Published: 16 March 2023
(This article belongs to the Special Issue Machine Learning for Service Composition in Cloud Manufacturing)

Abstract

Network security and the cloud environment play vital roles in today's era due to increased network data transmission, the cloud's elasticity, pay-as-you-go pricing and globally distributed resources. A March 2022 survey of 300 North American organizations with 500 or more employees, each of which had spent at least USD 1 million on cloud infrastructure, found that 79% of organizations had experienced at least one cloud data breach. In 2022, the AWS cloud provider led the market with a 34% share of a USD 200 billion cloud market, which motivates improving intrusion detection for network security on the basis of an AWS cloud dataset. The chosen CSE-CIC-IDS-2018 dataset contains network attack details based on real attacks carried out on AWS cloud infrastructure. The proposed method is a hybrid deep learning based approach, which first pre-processes the raw data and then normalizes it. The normalized data are feature-extracted from seventy-six fields to seven bottleneck features using Principal Component Analysis (PCA); those seven extracted features of every packet are then soft-clustered into two groups (attack and non-attack) using the Spider Monkey Optimized Fuzzy C-Means algorithm (SMO-FCM). The attack-cluster data are further provided as inputs to the deep learning based AutoEncoder algorithm, which outputs the attack classifications. Finally, the proposed technique (PCA + SMO-FCM + AE) achieves 95% accuracy for intrusion detection over the CSE-CIC-IDS-2018 dataset, the highest among the 11 existing state-of-the-art techniques compared.

1. Introduction

The data, and the servers which hold and respond to that data across wide, distributed networks, are the most important assets and can yield useful information, analytical results, future predictions, etc., [1,2,3] all on time. They need to be protected with the utmost care to avoid any negative impact on society [4,5,6]. When considering network security, we need to remember that real-world data are transferred over long distances, and cloud technologies adopt the same approach. Cloud service providers such as AWS, Azure, GCP, etc., support going global in minutes with their distributed content delivery network (CDN) features. The CDN delivers content faster through local distribution edge points (in AWS this is CloudFront). Because the origin data are available in only one location on the network, the cloud caches them at globally distributed edge locations through the CloudFront technology. This long-distance data travel over the cloud consumes more network resources and, owing to its distributed nature and wider network, is exposed to a higher possibility of network attacks.
The Ministry of Home Affairs in India released the details of cybercrime cases registered in India [7]: 44,546 cases were recorded in 2019, 63.48% more than the cases registered in 2018. A University of North Georgia study [8] reports that roughly 1.53 hacking attacks happen every minute, and the average cost of a data breach exceeded USD 150 million in 2020. This motivates research on routing-based attacks and on improving defensive mechanisms against them. A survey conducted in North America [9] states that 79% of organizations experienced at least one cloud breach. This survey covered only organizations spending USD 1 million or more on cloud infrastructure, and 300 such organizations were included. When organizations spending such large amounts on cloud infrastructure were still facing data breaches in the cloud as of March 2022, the security of servers in the cloud environment clearly deserved attention. All these issues call for new research to improve cloud-based defensive mechanisms against routing-based attacks. Next, when considering the cloud environment, there are a good number of cloud service providers, so the research scope must be narrowed to formulate the work well. In this respect, an interesting fact emerged: the AWS cloud service provider leads the market with a 34% share of a cloud market business worth USD 200 billion [10]. This survey was conducted during the second quarter of 2022. This fact narrows the research work down and focuses it on the AWS Cloud.
Data and important information are transferred in large amounts in the AWS cloud environment; therefore, its security can be improved for the betterment of the cloud environment [11,12,13,14]. In terms of network security, routing-based attacks happen on a regular basis [15,16,17,18], and so the research focus here is on routing-based attacks, improving the defensive mechanism (which internally has an intrusion detection mechanism) and thereby improving network security.
When speaking about network-based attacks on the cloud environment, a variety of attacks occur in today's cloud environment; a few of the major attacks possible in the cloud network are listed here as follows: blackhole attack, botnet attack, sinkhole attack, greyhole attack, wormhole attack, sybil attack, hello flood attack, acknowledgement spoofing attack, selective forwarding attack, denial of service (DoS) attack, packet mistreating attack, distributed denial of service (DDoS) attack, brute-force attack, routing table position attack, hit and run attack, persistent attack, eavesdropping (sniffing or snooping) attack, homing attack, neglect and greed attack, rushing attack, gratuitous detour attack, node malfunction attack, flooding attack, spoofed or altered or replayed routing attack, impersonation attack, misdirection attack, clone attack, rogue attack, peer to peer attack, encryption cracking, wireless hijacking attack, man in the middle attack, session hijacking attack, SQL injection, zero day exploit, phishing attack and malware attack (malicious software, spyware, ransomware, viruses and worms). In total, 37 major attacks are listed here with consideration of the cloud network environment, and of these 37 attacks, 26 are network-based attacks. Among these 26 routing-based attacks, the research focus is on the DDoS-, DoS-, brute-force- and botnet-based attacks. This choice is motivated by the considered AWS cloud network attack-based dataset, named CSE-CIC-IDS-2018 [19,20]. In this dataset, more than 90% of the attacks fall into the specified four categories, so the research is narrowed down to the DDoS-, DoS-, brute-force- and botnet-based network attacks carried out over the AWS cloud environment.
The existing defensive mechanisms can be improved in two ways: (i) the algorithm can be made efficient for reducing time and space complexity and (ii) the algorithm can be improved to provide better security.
Of these two aspects, the narrowed-down focus is the "improvement of security rather than improvement of time and space complexity". This choice is based on the survey results [7,8,9,21] examined and on the stated motivation of the research. The narrowed-down work is therefore carried out using an algorithm that improves security in relation to intrusion detection within the AWS cloud environment.
The proposed method is a hybrid algorithm built on deep learning concepts. It starts from the raw traffic-based input data, proceeds to a clustering mechanism that separates attack data from non-attack data, then takes the attack-cluster data and processes it with a deep learning algorithm to classify the attack. Finally, the classification is shown to achieve better accuracy (along with other measures such as specificity, precision, sensitivity, FDR, FPR, FNR, MCC, NPV and F-measure) than the 11 existing state-of-the-art techniques.
The remainder of this article is organized as follows: Section 2: Literature Survey, Section 3: Proposed Model, Section 4: Data Initialization Module, Section 5: Cluster Formation Module, Section 6: Attack Classification Module, Section 7: Dataset and Environment, Section 8: Result and Analysis, Section 9: Conclusions.

2. Literature Survey

The trend today is rapidly shifting towards cloud computing, with computing, storage and network resources residing in the cloud [22,23,24]. This has led many multinational companies such as AWS, Azure, Google and Oracle to run their own cloud services and provide IaaS, PaaS and SaaS to their customers [25,26,27,28,29]. Especially during the COVID-19 period, there was drastic growth in cloud provider service usage [30,31]. With the cloud growing so rapidly, cyber security has become a concern due to router-based attacks [32,33]. As per a recent industry report, the movement of organizations toward the cloud environment is huge, but there are still questions which lead other organizations to examine their security [34,35,36].
With the focus on improving cloud security and detecting intrusion-based attacks, a few techniques are surveyed here. These are machine learning- and deep learning-based approaches such as the Support Vector Machine (SVM) classifier [37], Long Short-Term Memory (LSTM) [38], Deep Neural Network (DNN) [30], Deep Recurrent Neural Network (DRNN) [39], Convolution Neural Network (CNN) [3], Deep Belief Network (DBN) [40], Deep Belief Network with Whale Optimization Algorithm (DBN + WOA) [40], Deep Belief Network with Moth Flame Optimization (DBN + MFO) [40], Deep Belief Network with Sea Lion Optimization (DBN + SLO) [40], Deep Belief Network with Spider Monkey Optimization (DBN + SMO) [40] and Deep Belief Network with Spider Monkey Optimization and Sea Lion Optimization (DBN + SMSLO) [40].
Among these, Long Short-Term Memory (LSTM), the Deep Neural Network (DNN), the Deep Recurrent Neural Network (DRNN) and the Convolution Neural Network (CNN) are deep learning-based approaches, while the other mentioned approaches are machine learning-based. The features and issues of all these algorithms are shown in Table 1.
The above-mentioned techniques are also used in the result comparison with the proposed technique, but a few more techniques are also considered in this article, such as the Gated Recurrent Unit with Recurrent Neural Network (GRU-RNN) [41,42], Aleatoric and Epistemic Uncertainty with Deep Neural Network (AE-DNN) [43], Decision Tree—Nearest Neighbor (DT-NN) [44], Artificial Neural Network + Support Vector Machine (ANN-SVM) [45], Classifier System—Distributed Denial of Service (CS_DDoS) [36], Convolution Recursively Enhanced Self-Organizing Map—Software Defined Networking-based Mitigation Scheme (CRESOM-SDNMS) [46], Learning-Driven Detection Mitigation System (LEDEM) [25], Intensive Care Request Processing Unit (ICRPU) [47], Fuzzy Self-Organizing Maps-based DDoS Mitigation (FSOMDM) [48] and the T-Distribution based Flow Confidence Technique [49]. These techniques are mentioned separately because they use different datasets, so a direct performance comparison cannot be made against them to prove the metrics.
In 2022, Hiren, K.M. [40] implemented optimization techniques such as the Whale Optimization Algorithm, Moth Flame Optimization Algorithm, Sea Lion Optimization Algorithm and Spider Monkey Optimization Algorithm over clusters based on the K-Means and KNN techniques and compared the results. The issue there [40] is a slightly slower performance than the Fuzzy C-Means clustering technique which we are proposing [50]. The Spider Monkey-based optimization technique has been taken from the surveyed article from 2020 by Khare, N. [51]. Similarly, the Sea Lion-based optimization technique has been taken from the surveyed article from 2019 by Masadeh, R. [52].
Table 1. Cloud environment-based intrusion detection—a convolution approach.

Categorization | Methodology | Features | Challenges | Common Issues
Machine Learning | DT-NN [44] | Achieved good accuracy while selecting the features | The issue of data overfit on the DT | Used old datasets (KDD-CUP 99 and NSL-KDD)
Machine Learning | ANN + SVM [45] | Time and space complexity for the training dataset is lower | Predicting the specific attack type is not accurate |
Deep Learning | CNN [53] | Good accuracy rate | Only detects DDoS-based attacks |
Deep Learning | GRU-RNN [18] | Precision, F1-score and recall are at a good level | Less accuracy and higher overhead |
Deep Learning | AE + DNN [30] | Good precision value with faster prediction | The accuracy and the F1-score are on the lower side |
Deep Learning | LSTM [38] | Good level of accuracy achieved | Bandwidth is on the lower side |
Flood-based Attack Detection | CRESOM-SDNMS [46] | Metaheuristic approach | Accuracy is on the lower side |
Flood-based Attack Detection | CS_DDoS [54] | — | — |
Flood-based Attack Detection | FSOMDM [48] | Good in controlling malicious data traffic | False positive rate is higher |
Flood-based Attack Detection | LEDEM [25] | Good level of accuracy | When data input speed increases, performance decreases |
Flood-based Attack Detection | ICRPU [47] | Accuracy and intrusion detection are good | FAR is on the higher side |
FRC-based Attack Detection | T-Distribution with Flow Confidence Technique [49] | Precision and recall are on the higher side | Lesser attack detection |
The complete nomenclature used in this article is given in Table 2 for easier reference to the descriptions of the abbreviations used.

3. Proposed Model

The proposed model is a hybrid technique with a deep learning algorithm. It is a combination of the dimensionality reduction technique (PCA), the Fuzzy C-Means (FCM) algorithm for cluster formation, the Spider Monkey Optimization (SMO) algorithm for optimized moving clusters and centroids, and the deep learning-based AutoEncoder (AE) algorithm for classifying the attack (using only the packet data available in the attack cluster). The proposed model has been named PCA + FCM-SMO + AE.
Initially, the raw data are pre-processed for missing values, and the output is then normalized so that it can be handled efficiently during subsequent steps. The normalized data contain a large number of fields, so a clustering algorithm would suffer from a dimensionality problem: with high dimensionality, clustering becomes difficult. The issue of dimensionality can be solved in two ways: first, only the important features can be selected, or second, all the features can be extracted into a smaller number of fields. Here, the proposed model PCA + FCM-SMO + AE takes the second way, in order to consider all the field values.
In general, the deep learning-based AutoEncoder produces good accuracy; at the same time, it takes longer to produce the result in the cloud environment. Considering extreme scenarios in the cloud environment, attack detection should be fast and classification should be accurate. The Fuzzy C-Means algorithm separates the attack traffic quickly, and classification is then performed by the AutoEncoder using only the attack traffic data. Since the number of rows fed to the AutoEncoder is reduced, the implementation results in faster classification with higher accuracy. The architecture diagram of the proposed model (PCA + FCM-SMO + AE) is shown in Figure 1.

4. Data Initialization Module

The data initialization module focuses on three segments: (i) data pre-processing, (ii) feature normalization and (iii) dimensionality reduction.

4.1. Data Pre-Processing

This is the fundamental process for the raw data, since raw data may have missing values. The data cannot be analysed completely without filling the missing values, so the missing data are set to zero. This results in a complete data table for further processing. The collected raw data, D^RAW, are pre-processed to obtain the filled data D^PPD.
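A minimal pre-processing sketch with pandas, assuming the raw traffic is loaded from the dataset's .csv files; treating infinite values like missing ones is an added assumption, not something stated in the text.

```python
# Pre-processing sketch: replace missing (and, as an assumption, infinite) values with zero.
import numpy as np
import pandas as pd

def preprocess(d_raw: pd.DataFrame) -> pd.DataFrame:
    """Return D_PPD: the raw table with missing values replaced by zero."""
    d_ppd = d_raw.replace([np.inf, -np.inf], np.nan)  # flow-statistic columns can contain inf
    return d_ppd.fillna(0)
```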

4.2. Feature Normalization

Looking into the filled data D^PPD, the values in different fields have different min and max ranges, which leads to higher complexity when the data are analysed. So, the pre-processed D^PPD data table needs to be transformed into a fixed min-max range. This process of transforming the data into a fixed range while preserving its original shape is called normalization. Here, the min value is -1 and the max value is 1 for the normalization. The normalized data are denoted D^ND and are given as inputs to the dimensionality reduction.
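A sketch of the feature normalization step, assuming column-wise min-max scaling of the numeric fields of D^PPD into the fixed range [-1, 1]; the zero-span guard is an illustrative assumption for constant columns.

```python
# Normalization sketch: rescale every numeric column of D_PPD to [-1, 1], giving D_ND.
import pandas as pd

def normalize(d_ppd: pd.DataFrame) -> pd.DataFrame:
    """Return D_ND with each numeric column rescaled to the range [-1, 1]."""
    d_nd = d_ppd.copy()
    numeric = d_nd.select_dtypes("number").columns
    col_min = d_nd[numeric].min()
    col_max = d_nd[numeric].max()
    span = (col_max - col_min).replace(0, 1)   # avoid division by zero for constant columns
    d_nd[numeric] = -1 + 2 * (d_nd[numeric] - col_min) / span
    return d_nd
```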

4.3. Dimensionality Reduction

In the dimensionality reduction phase, Principal Component Analysis (PCA) is used to reduce the dimensionality. The dimensionality of the data can be reduced in two different ways: (i) the important features can be filtered out (excluding less important features) or (ii) all the features can be compressed to form a smaller count of features (each new feature can internally combine many original features). In this article, we take the second approach so as to consider all the features that have some effect on the result; for this approach, the PCA technique has been chosen, which internally works with four submodules: (i) the mean, (ii) the standard deviation, (iii) the covariance and (iv) the eigenvalues and eigenvectors of the matrix.

4.4. Mean

When the distributed values are taken, the average value of the distribution can be found, which is the mean. Equation (1) gives the calculation of the mean for the "R" random values over the distribution taken from the normalized input D^ND. Here, $\sum_{R=1}^{n} D_R^{ND} = D_1^{ND} + D_2^{ND} + D_3^{ND} + \dots + D_n^{ND}$ stands for the sum of the segmented random variables from the normalized distribution.

$$\mathrm{Mean}\ \left(\overline{D^{ND}}\right) = \frac{1}{n}\sum_{R=1}^{n} D_R^{ND} \qquad (1)$$

4.5. Standard Deviation

When the mean is calculated, the other variable values in the same segment will have some deviation from the mean; this deviation specifies how far the value is from the average point. Equation (2) gives the mathematical form of the standard deviation.

$$SD = \sqrt{\frac{1}{n}\sum_{R=1}^{n}\left(D_R^{ND} - \overline{D^{ND}}\right)^2} \qquad (2)$$

4.6. Covariance

This specifies the relationship between two variables. If the covariance is higher, then when one variable increases, the other variable will increase by an almost similar percentage. The covariance can range from a negative value to a positive value: a negative covariance indicates that the two variables move in opposite directions, and a positive one indicates that the two variables have some impact on each other through the found relationship. Equation (3) gives the mathematical form of the covariance.

$$\mathrm{Covariance}\left(D_{R1}^{ND}, D_{R2}^{ND}\right) = \frac{\sum_{row=1}^{n}\left(D_{R1(row)}^{ND} - \overline{D_{R1}^{ND}}\right)\left(D_{R2(row)}^{ND} - \overline{D_{R2}^{ND}}\right)}{n-1} \qquad (3)$$

Here, row indexes the rows in the dataset, n is the number of rows and $\overline{D_{R}^{ND}}$ denotes the average of the corresponding feature. R1 and R2 are the two selected features.

4.7. Eigenvalue and Eigenvectors of a Matrix

The normalized data, D^ND, are used to build the matrix for the eigen decomposition; the eigenvectors are based on three values, namely the mean, standard deviation and covariance. When the values are arranged in the matrix A, the scalar parameter $\lambda$ is used to form Equation (4), which relates the eigenvalues and eigenvectors.

$$[A]\,[D^{ND}] = \lambda\,[D^{ND}] \qquad (4)$$

Finally, the dimensionality-reduced features are formed for further processing. These data are denoted D^RD. The seventy-six fields of the dataset are reduced to seven bottleneck features, described as D^RD = {D^RD_1, D^RD_2, ..., D^RD_7}, via feature extraction through the Principal Component Analysis (PCA).
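A sketch of the dimensionality-reduction step, assuming scikit-learn's PCA is an acceptable stand-in for the mean/standard-deviation/covariance/eigen decomposition described above; it compresses the 76 normalized fields into 7 bottleneck features.

```python
# PCA sketch: map the normalized data D_ND (76 fields) to D_RD (7 bottleneck features).
import pandas as pd
from sklearn.decomposition import PCA

def reduce_with_pca(d_nd: pd.DataFrame, n_components: int = 7):
    """Return D_RD (n_samples x 7) and the fitted PCA model."""
    pca = PCA(n_components=n_components)
    d_rd = pca.fit_transform(d_nd.to_numpy())   # projection onto eigenvectors of the covariance matrix
    return d_rd, pca
```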

5. Cluster Formation Module

The dimensionality-reduced data, D^RD, are clustered with Fuzzy C-Means, which is a soft clustering technique that works on the basis of the fuzzy degree of each packet's features; similar packets are placed in the same cluster.
In this article, the proposed technique works with different learning percentages ranging from 60% to 90% in steps of 10%. This learning percentage is simply how much of the entire dataset the cluster has learned from. The number of clusters is fixed at two, CC = 2: one for the attack cluster and one for the non-attack cluster. The first packet to arrive is inserted into one of the clusters.
The cluster has been optimized with the Spider Monkey Optimization technique, so the clusters move in the plane and can take different shapes as well. The Spider Monkey Optimization technique also supports optimizing the centroid-point calculation. The overall centroid point of each of the two major clusters is taken from the centroid calculation of the Fuzzy C-Means clustering technique, as given in Equation (5).

$$C_n = \frac{\sum_{D^{RD}} d_n\left(D^{RD}\right)^f \, D^{RD}}{\sum_{D^{RD}} d_n\left(D^{RD}\right)^f} \qquad (5)$$

Here, every point D^RD is associated with a set of degrees which give its relation to the nth cluster (attack or non-attack cluster). The FCM centroid is calculated as the mean of all points/packets, internally weighted by their degree of belonging to the cluster. The argument f in Equation (5) denotes the fuzzification parameter: the higher the value of f, the higher the fuzzification.
The degree of each point is calculated using Equation (6).

$$d_{ij} = \frac{1}{\sum_{n=1}^{2}\left(\dfrac{\lVert D_i^{RD} - C_j\rVert}{\lVert D_i^{RD} - C_n\rVert}\right)^{\frac{2}{f-1}}} \qquad (6)$$

In Equation (6), $d_{ij} \in [0,1]$, with i = 1, 2, ..., de (the number of data points) and j = 1, ..., ce (the number of clusters), where each element $d_{ij}$ specifies the degree to which the data element $D_i^{RD}$ belongs to the cluster $C_j$.
The FCM minimizes the objective in Equation (7):

$$\arg\min_{C} \sum_{i=1}^{de}\sum_{j=1}^{ce} d_{ij}^{f}\,\lVert D_i^{RD} - C_j\rVert^2 \qquad (7)$$

5.1. Fuzzy C-Means Algorithm

Step 1: set the number of clusters to two, for attack and non-attack packet data.
Step 2: initially place the first data points in one of the clusters.
Step 3: for further data points, calculate the coefficients, which yield the degree of each data point as per Equation (6), so that it can be allocated to the respective cluster.
Step 4: compute the centroids as per Equation (5).
Step 5: repeat steps 3 and 4 until all data points in the plane have been covered.
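A compact NumPy sketch of the two-cluster Fuzzy C-Means loop built from Equations (5)-(7) above; it is written from the update rules rather than taken from the paper's code, and the fuzzifier f, the iteration limit and the tolerance are illustrative assumptions.

```python
# Fuzzy C-Means sketch over the reduced data D_RD (rows = packets, columns = 7 features).
import numpy as np

def fuzzy_c_means(d_rd: np.ndarray, c: int = 2, f: float = 2.0,
                  max_iter: int = 100, tol: float = 1e-5):
    """Return (centroids, degrees); degrees[i, j] is d_ij from Equation (6)."""
    rng = np.random.default_rng(0)
    degrees = rng.random((d_rd.shape[0], c))
    degrees /= degrees.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(max_iter):
        weights = degrees ** f
        centroids = (weights.T @ d_rd) / weights.sum(axis=0)[:, None]   # Equation (5)
        dist = np.linalg.norm(d_rd[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                           # guard against zero distances
        inv = dist ** (-2.0 / (f - 1.0))
        new_degrees = inv / inv.sum(axis=1, keepdims=True)    # Equation (6)
        if np.abs(new_degrees - degrees).max() < tol:         # objective (7) has converged
            degrees = new_degrees
            break
        degrees = new_degrees
    return centroids, degrees
```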

5.2. Spider Monkey Optimization

Now, the inputs to the Spider Monkey Optimization technique are the new data point and the centroids of the FCM clusters. The SMO forms internal clusters with a similarity-index threshold value of 0.84. The internal clusters have a moving nature, which affects the shape of the external cluster as well.

5.3. Algorithm of SMO

Step 1: the initial population for the Spider Monkey Optimization (SMO) is initialized.
Step 2: now, the Spider Monkey Optimization-based subcluster is formed using Equation (8).

$$SM_a^z = SM_{\min}^z + UD(0,n)\times\left(SM_{\max}^z - SM_{\min}^z\right) \qquad (8)$$

Here, the equation forms the a-th spider monkey internal cluster on the z dimension, corresponding to one of the primary clusters (attack or non-attack).
$SM_{\min}^z$ specifies the lower boundary of the z dimension and $SM_{\max}^z$ corresponds to the upper boundary of the spider monkey internal cluster.
UD(0, n) is the uniform distribution of cluster labelling in the primary cluster.
Step 3: repeating step 2 for all the primary clusters and internal spider monkey-based clusters, the global and local boundaries are determined.
Step 4: calculate or update the centroids of all the changed spider monkey internal clusters in all the primary clusters as per the changes made.
Step 5: now, the fit of the new data point is calculated against the internal clusters within one of the primary clusters.
Step 6: now, the calculated best fit is compared with the other spider monkey-based internal clusters to optimize the best fit using Equation (9).

$$\mathrm{probability}\ SM_c = \frac{Fit_c}{\sum_{i=1}^{n} Fit_i} \qquad (9)$$

Here, probability SM_c is the probability of the current data point being present in the current spider monkey cluster.
Fit_c is the degree of fit of the current data point in the current spider monkey cluster.
Fit_i is the degree of fit in the i-th spider monkey cluster.
Step 7: repeat step 6 until all the internal clusters are examined for the best fit.
Step 8: change the primary cluster to the other cluster until all the primary clusters have been iterated once. When this iteration is complete, go to step 10.
Step 9: repeat step 5.
Step 10: place the new data point in the best fit found using the probability calculation from Equation (9).
Step 11: repeat from step 4 for every data point entering the system.
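A minimal sketch of the two Spider Monkey Optimization pieces used above: Equation (8) to initialize an internal-cluster position inside the boundaries of a primary cluster, and Equation (9) to turn fitness values into selection probabilities. The boundary arrays, the inverse-distance fitness and the cluster count are assumptions for illustration only.

```python
# Spider Monkey Optimization sketch for Equations (8) and (9).
import numpy as np

def init_spider_monkey(sm_min: np.ndarray, sm_max: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
    """Equation (8): place a spider monkey uniformly inside [sm_min, sm_max] per dimension."""
    return sm_min + rng.uniform(0.0, 1.0, size=sm_min.shape) * (sm_max - sm_min)

def selection_probabilities(fitness: np.ndarray) -> np.ndarray:
    """Equation (9): probability of each internal cluster being the best fit."""
    return fitness / fitness.sum()

# Example: a new 7-dimensional data point goes to the internal cluster with the
# highest probability; fitness here is an assumed inverse-distance measure.
rng = np.random.default_rng(42)
monkeys = [init_spider_monkey(np.full(7, -1.0), np.full(7, 1.0), rng) for _ in range(5)]
point = rng.uniform(-1.0, 1.0, size=7)
fitness = np.array([1.0 / (1.0 + np.linalg.norm(point - m)) for m in monkeys])
best = int(np.argmax(selection_probabilities(fitness)))
```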
The clustering here is performed using two algorithms, namely the Fuzzy C-Means algorithm for the primary clusters and Spider Monkey Optimization for the internal clusters; however, there must be some algorithm to merge these two. The algorithm for this merging task is named the Cluster Merging Point (CMP) Algorithm.

5.4. Cluster Merging Point (CMP) Algorithm of FCM and SMO

Step 1: set the initial data point with the FCM.
Step 2: pass subsequent data points to the system with FCM and check the similarity index; if the similarity index is less than 0.84, switch to SMO for that movement; if the similarity index is greater than or equal to 0.84, plot the current data point as per the FCM.
Step 3: for every new data point, choose the primary cluster with the help of FCM.
Step 4: check whether the primary cluster is already enabled with SMO; if SMO is enabled, proceed with the SMO Algorithm; if not, proceed with step 2.
Finally, the data points are plotted in the attack cluster and the non-attack cluster. The data points are denoted D^RDAC and D^RDNAC for the attack and non-attack clusters, respectively.
D^RDAC → data points of the reduced-dimensionality attack cluster.
D^RDNAC → data points of the reduced-dimensionality non-attack cluster.

6. Attack Classification Module

The data points of the reduced-dimensionality attack cluster, D^RDAC, are provided as the input to the AutoEncoder, which is a deep learning-based classifier. The AutoEncoder works well with lower-dimensional data and produces accurate results when the data are provided in a clustered manner.
Here, the input is very specific; it contains only the attack packet data, so the AutoEncoder is well suited to classifying the attacks. The AutoEncoder works on the training-dataset knowledge: it learns through back propagation from the result on the training data, which is the decoder phase, while forward propagation is used to find the result, which is the encoder phase. The workflow of the AutoEncoder is shown in Figure 2.
The AutoEncoder is also capable of performing multiple encode and decode processes across the hidden layers. The equations for the encode and decode processes are listed in Equations (10) and (11), respectively.
Considering a z-dimensional code vector, the encoder function (e) is defined as in Equation (10).

$$\overline{E_i} = e\left(\overline{D_i}, \overline{\theta_e}\right) \qquad (10)$$

where $\overline{D_i} \in \mathbb{R}^n$ and $\overline{E_i} \in \mathbb{R}^z$.
Similarly, the parameterized function for the decoder (d) is given in Equation (11).

$$\hat{D_i} = d\left(\overline{E_i}, \overline{\theta_d}\right) \qquad (11)$$

where $\hat{D_i} \in \mathbb{R}^n$ and $\overline{E_i} \in \mathbb{R}^z$.
Whenever the encoded data are taken for the process, they are propagated back through the decoder, and the reconstructed data are taken for the process. Equation (12) represents this composition.

$$\hat{D_i} = d\left(e\left(\overline{D_i}, \overline{\theta_e}\right), \overline{\theta_d}\right) = g\left(\overline{D_i}, \overline{\theta}\right) \qquad (12)$$

The AutoEncoder back-propagates through the encoded data with the help of a minimizer of the mean-square-error cost. The cost function is given in Equation (13).

$$\mathrm{Cost}\left(D, \hat{D}, \theta\right) = \frac{1}{m}\sum_i\left(\overline{D_i} - g\left(\overline{D_i}, \overline{\theta}\right)\right)^2 \qquad (13)$$

The test data are backpropagated with Equation (11) for learning. Finally, the AutoEncoder algorithm provides the output of the classified attack through the process of Equation (12) by minimizing the mean-square-error (MSE) cost of Equation (13).
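A hedged Keras sketch of the AutoEncoder in Figure 2: a 7-dimensional input (the bottleneck features of the attack cluster) is encoded to a z-dimensional code and decoded back, trained by backpropagation on the mean-square-error cost of Equation (13). The layer sizes, activation functions, optimizer and epoch count are assumptions, not values taken from the paper.

```python
# AutoEncoder sketch (encoder e, decoder d, MSE cost as in Equations (10)-(13)).
import tensorflow as tf

def build_autoencoder(n_features: int = 7, z_dim: int = 3) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(n_features,))
    encoded = tf.keras.layers.Dense(z_dim, activation="relu")(inputs)            # encoder e(.)
    decoded = tf.keras.layers.Dense(n_features, activation="linear")(encoded)    # decoder d(.)
    model = tf.keras.Model(inputs, decoded)
    model.compile(optimizer="adam", loss="mse")    # Equation (13): mean-square-error cost
    return model

# Training on the attack-cluster data D_RDAC (the reconstruction target equals the input);
# the per-attack classification could then use the reconstruction error or the learned code.
# autoencoder = build_autoencoder()
# autoencoder.fit(d_rdac, d_rdac, epochs=50, batch_size=256, validation_split=0.1)
```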

7. Dataset and Environment

The execution environment used here is Python 3, and the proposed algorithm (PCA + FCM-SMO + AE) is executed on an AWS cloud EC2 instance of the t2.micro instance family. The execution has been carried out in the different environment setups listed in Table 3.
The learning percentage relates to the cluster formation module and its learning to classify the data into the attack and non-attack clusters. The test data relate to the attack classifier module and give the percentage of the total dataset taken as test data for the AutoEncoder.
The AWS cloud EC2 computing instance setup for executing the proposed algorithm is given in Table 4.
The dataset used is CSE-CIC-IDS-2018 [17,19], created from the network traffic and attacks generated on the AWS Cloud in 2018. The dataset has 10 .csv files covering 10 days of network traffic, with 76 characteristics for each packet and the attacks carried out on each day, together with the date and time of each packet. The details of the attacks carried out across the dataset are provided in Table 5.
The attacks considered for this study are the DDoS, DoS, brute-force and botnet attacks, since more than 90% of the attacks in the said dataset fall into these four categories.
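A sketch of loading the CSE-CIC-IDS-2018 traffic and keeping only the four attack families studied here (plus benign flows); the directory path, the "Label" column name and the exact label substrings are assumptions that may need adjusting per .csv file.

```python
# Dataset-loading sketch for the 10 CSE-CIC-IDS-2018 .csv files.
from pathlib import Path
import pandas as pd

def load_ids2018(csv_dir: str) -> pd.DataFrame:
    frames = [pd.read_csv(p, low_memory=False) for p in Path(csv_dir).glob("*.csv")]
    data = pd.concat(frames, ignore_index=True)
    # Keep benign traffic plus the DDoS, DoS, brute-force and botnet attack families.
    keep = data["Label"].str.contains("DDOS|DoS|Brute|Bot|Benign", case=False, na=False)
    return data[keep]
```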

8. Result and Analysis

The proposed method (PCA + FCM-SMO + AE) has been tested in four different test cases (testing environment conditions) with respect to the learning percentage of the cluster, as specified in Table 3. The results have been compared with 11 existing techniques, listed in Table 6.
The existing techniques and the proposed technique (PCA + FCM-SMO + AE) have been compared with respect to ten evaluation characteristics in each of the four test cases. This results in 40 comparisons, with 12 statistics in each (11 existing + 1 proposed), totalling 40 × 12 = 480 statistics. The attacks taken for the experiments are the DDoS attack, DoS attack, botnet attack and brute-force attack. In each attack category there are 480 statistics, giving 480 × 4 = 1920 statistics in total. The average over the attack categories is then taken to bring the count back down to 480 statistics, so that the results can be discussed here with less complexity. The characteristics are divided into positive measures, negative measures and other measures.

8.1. Positive Measures

The positive measures taken for the comparison are accuracy, specificity, precision and sensitivity. Their equations are given as (14), (15), (16) and (17), respectively.

$$\mathrm{Accuracy\ (Attack\ Classification)} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (14)$$

$$\mathrm{Specificity\ (Attack\ Classification)} = \frac{TN}{TN + FP} \qquad (15)$$

$$\mathrm{Precision\ (Attack\ Classification)} = \frac{TP}{TP + FP} \qquad (16)$$

$$\mathrm{Sensitivity\ (Attack\ Classification)} = \frac{TP}{TP + FN} \qquad (17)$$
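A small helper for the positive measures of Equations (14)-(17), computed from the confusion-matrix counts of the kind reported in Table 7; it is written here purely for illustration.

```python
# Positive measures from confusion-matrix counts.
def positive_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),   # Equation (14)
        "specificity": tn / (tn + fp),                    # Equation (15)
        "precision":   tp / (tp + fp),                    # Equation (16)
        "sensitivity": tp / (tp + fn),                    # Equation (17)
    }

# Example with the DDoS counts for the 60% learning case in Table 7:
# positive_measures(tp=3_115_042, tn=7_110_782, fp=349_412, fn=245_742)
```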
The values obtained for the specificity and precision measures are given in Figure 3, and the values obtained for the sensitivity and accuracy measures are provided in Figure 4. The proposed technique (PCA + FCM-SMO + AE) has been compared with the 11 existing techniques on all metrics with respect to different learning percentages ranging from 60 percent to 90 percent.

8.2. Negative Measures

The negative measures taken for the comparison are the false positive rate (FPR), false discovery rate (FDR) and false negative rate (FNR). Their equations are provided as (18), (19) and (20), respectively. The values obtained in the comparison are plotted in Figure 5 (FPR and FDR) and Figure 6 (FNR).

$$\mathrm{FPR\ (Attack\ Classification)} = \frac{FP}{\mathrm{Actual\ Negative}} \qquad (18)$$

$$\mathrm{FDR\ (Attack\ Classification)} = \frac{FP}{TP + FP} \qquad (19)$$

$$\mathrm{FNR\ (Attack\ Classification)} = \frac{FN}{\mathrm{Actual\ Positive}} \qquad (20)$$

8.3. Other Measures

The supportive measures taken for the comparison are the MCC, F-Measure and NPV (Negative Predictive Value). The F-Measure score depends on the precision and the sensitivity: when these values are jointly higher, the F-Measure is also higher. The MCC is Matthews' Correlation Coefficient, which is less than or equal to one; a value closer to the maximum corresponds to a better prediction by the system. The equations for the MCC, F-Measure and NPV are provided as (21), (22) and (23), respectively.

$$\mathrm{MCC\ (Attack\ Classification)} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TN + FP)(TP + FN)(TN + FN)}} \qquad (21)$$

$$F\text{-}\mathrm{Measure\ (Attack\ Classification)} = \frac{2\,(\mathrm{precision} \times \mathrm{sensitivity})}{\mathrm{precision} + \mathrm{sensitivity}} \qquad (22)$$

$$\mathrm{NPV\ (Attack\ Classification)} = \frac{TN}{FN + TN} \qquad (23)$$
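A companion helper for the remaining metrics: the negative measures of Equations (18)-(20) and the supportive measures of Equations (21)-(23). Taking the actual positives as TP + FN and the actual negatives as TN + FP is an assumption consistent with the usual confusion-matrix definitions.

```python
# Negative and supportive measures from confusion-matrix counts.
import math

def other_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tn + fp) * (tp + fn) * (tn + fn))
    return {
        "fpr": fp / (tn + fp),                                               # Equation (18)
        "fdr": fp / (tp + fp),                                               # Equation (19)
        "fnr": fn / (tp + fn),                                               # Equation (20)
        "mcc": (tp * tn - fp * fn) / mcc_den,                                # Equation (21)
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),  # Equation (22)
        "npv": tn / (fn + tn),                                               # Equation (23)
    }
```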
The MCC and NPV values are shown on a comparison basis in Figure 7, and the F-Measure values are shown on a comparison basis in Figure 8.
The intrusions detected in the AWS cloud network-based dataset for the different learning percentages are provided in Table 7.
The experimental results for the various metrics considered, for the different learning percentages together with the average values, are provided in Table 8.
The experimental results of the proposed technique (PCA + FCM-SMO + AE) show that the attacks are classified with higher specificity, precision and accuracy and lower FPR and FDR values, which is a good sign. The MCC, F-Measure and NPV values are comparatively acceptable. The weakest metrics are the sensitivity and the FNR. The accuracy of the proposed technique is 95.3%, which is 2.3% higher than DBN + SMSLO, 12.3% higher than DBN + SLO, 9.3% higher than DBN + SMO, 10.3% higher than DBN + WOA, 15.3% higher than DBN + MFO, 11.3% higher than DBN, 18.3% higher than SVM, 7.3% higher than DRNN, 35.3% higher than CNN, 19.3% higher than DNN and 10.3% higher than LSTM among the state-of-the-art existing protocols.

9. Conclusions

The proposed technique takes the data in the CSE-CIC-IDS-2018 dataset. It pre-processes the data and fills the missing values. The dimensionality of the data is then reduced to lower the complexity, and the dimensionality-reduced data are provided as inputs to the clustering module, which uses the Fuzzy C-Means clustering technique with Spider Monkey Optimization. The data are split into attack and non-attack clusters. The attack-cluster data values are provided as inputs to the attack classifier module, which uses the AutoEncoder deep learning-based algorithm to classify the attacks. Finally, the attacks are classified into DDoS, DoS, brute-force and botnet attacks.
The values achieved by the proposed technique (PCA + FCM-SMO + AE) in the positive measures, such as specificity (99.0%), precision (94.7%) and accuracy (95.3%), are the highest in the state-of-the-art comparison, but the sensitivity (47.8%) is on the lower side. For the negative measures, where lower is better, the achieved values for the FPR (0.010) and FDR (0.053) are the lowest in the state-of-the-art comparison, but the FNR (1.627) is on the higher side. The metric measures such as the MCC (0.626), NPV (0.957) and F-Measure (0.635) are comparatively acceptable. This leads to the conclusion that, overall, the proposed method beats the existing 11 state-of-the-art techniques over the CSE-CIC-IDS-2018 dataset, with a 95.3% accuracy in the attack classification prediction.

Author Contributions

B.R.M.—Methodology, validation, formal analysis, investigation, writing—original draft, writing—review and editing, conceptualization. J.K.M.K.—Validation, investigation, writing—review and editing, conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xing, K.; Srinivasan, S.S.R.; Rivera, M.J.; Li, J.; Cheng, X. Attacks and Countermeasures in Sensor Networks: A survey. In Network Security; Springer: Boston, MA, USA, 2010; pp. 251–272. [Google Scholar]
  2. Kumar, C.A.; Vimala, R. Load balancing in cloud environment exploiting hybridization of chicken swarm and enhanced raven roosting optimization algorithm. Multimed. Res. 2020, 3, 45–55. [Google Scholar]
  3. Thomas, R.; Rangachar, M. Hybrid optimization based DBN for face recognition using low-resolution images. Multimed. Res. 2018, 1, 33–43. [Google Scholar]
  4. Veeraiah, N.; Krishna, B. Intrusion detection based on piecewise fuzzy c-means clustering and fuzzy naive bayes rule. Multimed. Res. 2018, 1, 27–32. [Google Scholar]
  5. Preetha, N.N.; Brammya, G.; Ramya, R.; Praveena, S.; Binu, D.; Rajakumar, B. Grey wolf optimisation-based feature selection and classification for facial emotion recognition. IET Biom. 2018, 7, 490–499. [Google Scholar] [CrossRef]
  6. Phan, T.; Park, M. Efficient distributed denial-of-service attack defense in SDN-Based cloud. IEEE Access 2019, 7, 18701–18714. [Google Scholar] [CrossRef]
  7. Ministry of Home Affairs. India Released Facts on Cyber Crime Cases Registered. Available online: https://www.pib.gov.in/PressReleasePage.aspx?PRID=1694783 (accessed on 21 May 2021).
  8. A Study Report Published as a News by University of North Georgia. Available online: https://ung.edu/continuing-education/news-and-media/cybersecurity.php (accessed on 21 May 2021).
  9. 50 Cloud Security Stats You Should Know in 2022. Available online: https://expertinsights.com/insights/50-cloud-security-stats-you-should-know/ (accessed on 28 August 2022).
  10. Amazon Leads $200-Billion Cloud Market. Available online: https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/ (accessed on 28 August 2022).
  11. Roy, A.; Razia, S.; Parveen, N.; Rao, A.S.; Nayak, S.R.; Poonia, R.C. Fuzzy rule based intelligent system for user authentication based on user behaviour. J. Discret. Math. Sci. Cryptogr. 2020, 23, 409–417. [Google Scholar] [CrossRef]
  12. Mohan, V.M.; Satyanarayana, K.V.V. The Contemporary Affirmation of Taxonomy and Recent Literature on Workflow Scheduling and Management in Cloud Computing. Glob. J. Comput. Sci. Technol. 2016, 16, 13–21. [Google Scholar]
  13. Zhijun, W.; Wenjing, L.; Liang, L.; Meng, Y. Low-rate DoS attacks, detection, defense, and challenges: A survey. IEEE Access 2020, 8, 43920–43943. [Google Scholar] [CrossRef]
  14. Kumar, R.R.; Shameem, M.; Khanam, R.; Kumar, C. A hybrid evaluation framework for QoS based service selection and ranking in cloud environment. In Proceedings of the 15th IEEE India Council International Conference (INDICON), Coimbatore, India, 16–18 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  15. Sharma, K.; Ghose, M.K. Wireless Sensor Networks: An Overview on Its Security Threats. IJCA Spec. Issue Mob. Ad-Hoc Netw. MANETs 2010, 1495, 42–45. [Google Scholar]
  16. Mohan, V.M.; Satyanarayana, K. Multi-Objective Optimization of Composing Tasks from Distributed Workflows in Cloud Computing Networks, Advances in Intelligent Systems and Computing Volume 1090. In Proceedings of the 3th International Conference on Computational Intelligence and Informatics ICCII (2018), Hyderabad, India, 28–29 December 2018. [Google Scholar]
  17. Lalitha, V.L.; Raju, D.S.H.; Krishna, S.V.; Mohan, V.M. Customized Smart Object Detection: Statistics of Detected Objects Using IoT; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
  18. Kumar, R.R.; Tomar, A.; Shameem, M.; Alam, M.D. Optcloud: An optimal cloud service selection framework using QoS correlation lens. Comput. Intell. Neurosci. 2022, 2022, 2019485. [Google Scholar] [CrossRef]
  19. CSE-CIC-IDS2018 on AWS. Available online: https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 28 August 2022).
  20. IDS 2018 Intrusion CSVs (CSE-CIC-IDS2018). Available online: https://www.kaggle.com/datasets/solarmainframe/ids-intrusion-csv?resource=download (accessed on 28 August 2022).
  21. Somani, G.; Gaur, M.; Sanghi, D.; Conti, M.; Rajarajan, M. Scale inside-out: Rapid mitigation of cloud DDoS attacks. IEEE Trans. Dependable Secur. Comput. 2018, 15, 959–973. [Google Scholar] [CrossRef]
  22. Balajee, R.M.; Mohapatra, H.; Venkatesh, K. A comparative study on efficient cloud security, services, simulators, load balancing, resource scheduling and storage mechanisms. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Tamil Nadu, India, 26–28 March 2021; Volume 1070, p. 012053. [Google Scholar]
  23. Balajee, R.M.; Venkatesh, K. A Survey on Machine Learning Algorithms and finding the best out there for the considered seven Medical Data Sets Scenario. Res. J. Pharm. Technol. 2019, 12, 3059–3062. [Google Scholar] [CrossRef]
  24. Rajeswari, S.; Sharavanan, S.; Vijai, R.; Balajee, R.M. Learning to Rank and Classification of Bug Reports Using SVM and Feature Evaluation. Int. J. Smart Sens. Intell. Syst. 2017, 1, 10. [Google Scholar] [CrossRef] [Green Version]
  25. Ravi, N.; Shalinie, S.M. Learning-driven detection and mitigation of DDoS attack in IoT via SDN-Cloud architecture. IEEE Internet Things J. 2020, 7, 3559–3570. [Google Scholar] [CrossRef]
  26. Virupakshar, K.; Asundi, M.; Narayan, D. Distributed Denial of Service (DDoS) Attacks Detection System for OpenStack-based Private Cloud. Procedia Comput. Sci. 2020, 167, 2297–2307. [Google Scholar] [CrossRef]
  27. Agrawal, N.; Tapaswi, S. Defense mechanisms against DDoS attacks in a cloud computing environment: State-of-the-art and research challenges. IEEE Commun. Surv. Tutor. 2019, 21, 3769–3795. [Google Scholar] [CrossRef]
  28. Khan, A.A.; Shameem, M. Multicriteria decision-making taxonomy for DevOps challenging factors using analytical hierarchy process. J. Softw. Evol. Process. 2020, 32, e2263. [Google Scholar] [CrossRef]
  29. Mohapatra, S.S.; Kumar, R.R.; Alenezi, M.; Zamani, A.T.; Parveen, N. QoS-Aware Cloud Service Recommendation Using Metaheuristic Approach. Electronics 2022, 11, 3469. [Google Scholar] [CrossRef]
  30. Bhardwaj, A.; Mangat, V.; Vig, R. Hyperband tuned deep neural network with well posed stacked sparse autoencoder for detection of DDoS attacks in cloud. IEEE Access 2020, 8, 181916–181929. [Google Scholar] [CrossRef]
  31. Balajee, R.M.; Kannan, M.K.J.; Mohan, V.M. Automatic Content Creation Mechanism and Rearranging Technique to Improve Cloud Storage Space. In Inventive Computation and Information Technologies; Springer: Singapore, 2022; pp. 73–87. [Google Scholar]
  32. Voleti, L.; Balajee, R.M.; Vallepu, S.K.; Bayoju, K.; Srinivas, D. A secure image steganography using improved LSB technique and Vigenere cipher algorithm. In Proceedings of the 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), Coimbatore, India, 25–27 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1005–1010. [Google Scholar]
  33. AlKadi, O.; Moustafa, N.; Turnbull, B.; Choo, K. Mixture localization-based outliers models for securing data migration in cloud centers. IEEE Access 2019, 7, 114607–114618. [Google Scholar] [CrossRef]
  34. Devagnanam, J.; Elango, N. Optimal resource allocation of cluster using hybrid grey wolf and cuckoo search algorithm in cloud computing. J. Netw. Commun. Syst. 2020, 3, 31–40. [Google Scholar]
  35. Mishra, P.; Varadharajan, V.; Pilli, E.; Tupakula, U. VMGuard: A VMI-Based Security Architecture for Intrusion Detection in Cloud Environment. IEEE Trans. Cloud Comput. 2020, 8, 957–971. [Google Scholar] [CrossRef]
  36. Dong, S.; Abbas, K.; Jain, R. A survey on distributed denial of service (DDoS) attacks in SDN and cloud computing environments. IEEE Access 2019, 7, 80813–80828. [Google Scholar] [CrossRef]
  37. Thirumalairaj, A.; Jeyakarthic, M. An intelligent feature selection with optimal neural network based network intrusion detection system for cloud environment. Int. J. Eng. Adv. Technol. 2020, 9, 3560–3569. [Google Scholar] [CrossRef]
  38. Roy, R. Rescheduling based congestion management method using hybrid Grey Wolf optimization-grasshopper optimization algorithm in power system. J. Comput. Mech., Power Syst. Control 2019, 2, 9–18. [Google Scholar]
  39. Anand, S. Intrusion detection system for wireless mesh networks via improved whale optimization. J. Netw. Commun. Syst. (JNACS) 2020, 3, 9–16. [Google Scholar] [CrossRef]
  40. Balajee, R.M.; Hiren, K.M.; Rajakumar, B.R. Hybrid machine learning approach based intrusion detection in cloud: A metaheuristic assisted model. Multiagent Grid Syst. 2022, 18, 21–43. [Google Scholar]
  41. Kumar, R.R.; Shameem, M.; Kumar, C. A computational framework for ranking prediction of cloud services under fuzzy environment. Enterp. Inf. Syst. 2021, 16, 167–187. [Google Scholar] [CrossRef]
  42. Tang, T.; McLernon, D.; Mhamdi, L.; Zaidi, S.; Ghogho, M. Intrusion Detection in Sdn-Based Networks: Deep Recurrent Neural Network Approach. In Deep Learning Applications for Cyber Security; Springer: Cham, Switzerland, 2019; pp. 175–195. [Google Scholar]
  43. Bakshi, A.; Dujodwala, Y.B. Securing cloud from ddos attacks using intrusion detection system in virtual machine. In Proceedings of the 2010 Second International Conference on Communication Software and Networks, Singapore, 26–28 February 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 260–264. [Google Scholar]
  44. Fontaine, J.; Kappler, C.; Shahid, A.; De Poorter, E. Log-based intrusion detection for cloud web applications using machine learning. In Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, Online, 20 October 2019; pp. 197–210. [Google Scholar]
  45. Aboueata, N.; Alrasbi, S.; Erbad, A.; Kassler, A.; Bhamare, D. Supervised machine learning techniques for efficient network intrusion detection. In Proceedings of the 28th International Conference on Computer Communication and Networks (ICCCN), Valencia, Spain, 29 July–1 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar]
  46. Harikrishna, P.; Amuthan, A. SDN-based DDoS attack mitigation scheme using convolution recursively enhanced self organizing maps. Sādhanā 2020, 45, 1–12. [Google Scholar] [CrossRef]
  47. Bharot, N.; Verma, P.; Sharma, S.; Suraparaju, V. Distributed denial-of-service attack detection and mitigation using feature selection and intensive care request processing unit. Arab. J. Sci. Eng. 2018, 43, 959–967. [Google Scholar] [CrossRef]
  48. Pillutla, H.; Arjunan, A. Fuzzy self organizing maps-based DDoS mitigation mechanism for software defined networking in cloud computing. J. Ambient. Intell. Humaniz. Comput. 2019, 10, 1547–1559. [Google Scholar] [CrossRef]
  49. Bhushan, K.; Gupta, B.B. Network flow analysis for detection and mitigation of Fraudulent Resource Consumption (FRC) attacks in multimedia cloud computing. Multimed. Tools Appl. 2019, 78, 4267–4298. [Google Scholar] [CrossRef]
  50. Baid, U.; Talbar, S. Comparative study of k-means, gaussian mixture model, fuzzy c-means algorithms for brain tumor segmentation. In Proceedings of the International Conference on Communication and Signal Processing 2016 (ICCASP 2016), Online, 26–27 December 2016; pp. 583–588. [Google Scholar]
  51. Khare, N.; Devan, P.; Chowdhary, C.L.; Bhattacharya, S.; Singh, G.; Singh, S.; Yoon, B. Smo-dnn: Spider monkey optimization and deep neural network hybrid classifier model for intrusion detection. Electronics 2020, 9, 692. [Google Scholar] [CrossRef]
  52. Masadeh, R.; Mahafzah, B.A.; Sharieh, A. Sea lion optimization algorithm. Int. J. Adv. Comput. Sci. Appl. 2019, 10. [Google Scholar] [CrossRef] [Green Version]
  53. Kim, J.; Kim, J.; Kim, H.; Shim, M.; Choi, E. CNN-based network intrusion detection against denial-of-service attacks. Electronics 2020, 9, 916. [Google Scholar] [CrossRef]
  54. Sahi, A.; Lai, D.; Li, Y.; Diykh, M. An efficient DDoS TCP flood attack detection and prevention system in a cloud environment. IEEE Access 2017, 5, 6036–6048. [Google Scholar] [CrossRef]
Figure 1. Proposed system (PCA + FCM-SMO + AE) architecture diagram.
Figure 2. AutoEncoder workflow.
Figure 3. Specificity and precision comparison graph.
Figure 4. Sensitivity and accuracy comparison graph.
Figure 5. FPR and FDR comparison graph.
Figure 6. FNR comparison graph.
Figure 7. MCC and NPV comparison graph.
Figure 8. F-Measure comparison graph.
Table 2. Nomenclature.

Abbreviation | Description
ANN | Artificial Neural Network
CNN | Convolution Neural Network
DNN | Deep Neural Network
CRESOM | Convolution Recursively Enhanced Self-Organizing Map
AE | AutoEncoder
CS | Classifier System
FCM | Fuzzy C-Means
DBN | Deep Belief Network
SMO | Spider Monkey Optimization
DDoS | Distributed Denial of Service
DoS | Denial of Service
PCA | Principal Component Analysis
DL | Deep Learning
DRNN | Deep Recurrent Neural Network
DT | Decision Tree
SLO | Sea Lion Optimization
FRC | Fraudulent Resource Consumption
FNR | False Negative Rate
FDR | False Discovery Rate
FPR | False Positive Rate
FSOMDM | Fuzzy Self-Organizing Maps-based DDoS Mitigation
SVM | Support Vector Machine
GRU | Gated Recurrent Unit
ICRPU | Intensive Care Request Processing Unit
IDS | Intrusion Detection System
LEDEM | Learning-Driven Detection Mitigation System
LSTM | Long Short-Term Memory
MSE | Mean Square Error
NN | Nearest Neighbor
RBM | Restricted Boltzmann Machine
SD | Standard Deviation
SDNMS | Software Defined Networking-based Mitigation Scheme
FAR | False Alarm Rate
Table 3. Learning percentage and testing data.

Learning Percentage | Testing Data Considered
60% | 40%
70% | 30%
80% | 20%
90% | 10%
Table 4. AWS Cloud EC2 computing instance setup.

Feature | Description
Compute Instance | AWS EC2
Data Storage | .csv files in EBS storage
Instance VPC | Default VPC by AWS
Region | ap-south-1
Subnet | ap-south-1a
Elastic Block Storage Memory | 8 GB
Instance Architecture | 64-bit
OS | Linux
Security Group | All traffic, IPv4 allowed from anywhere
Client Terminal | PuTTY and PuTTYgen for key conversion from .pem to .ppk
FTP Software to Transfer Dataset | FileZilla
FTP Connection | SSH on port 22
Table 5. Attack details of the CSE-CIC-IDS-2018 dataset.

Attacker Environment | Attack Type | Tools Used for Attack | Victim Environment | Duration
Kali Linux | Brute-force attack | FTP-Patator, SSH-Patator | Ubuntu 16.4 (Web Server) | One day
Kali Linux | DoS attack | Hulk, GoldenEye, Slowloris, Slowhttptest | Ubuntu 16.4 (Apache) | One day
Kali Linux | DoS attack | Heartleech | Ubuntu 12.04 (OpenSSL) | One day
Kali Linux | Web attack | Damn Vulnerable Web App (DVWA), in-house Selenium framework (XSS and brute-force) | Ubuntu 16.4 (Web Server) | Two days
Kali Linux | Infiltration attack | First level: Dropbox download on a Windows machine; second level: Nmap and portscan | Windows Vista and Macintosh | Two days
Kali Linux | Botnet attack | Ares (developed in Python): remote shell, file upload/download, capturing screenshots and key logging | Windows Vista, 7, 8.1, 10 (32-bit) and 10 (64-bit) | One day
Kali Linux | DDoS + PortScan | Low Orbit Ion Cannon (LOIC) for UDP, TCP or HTTP requests | Windows Vista, 7, 8.1, 10 (32-bit) and 10 (64-bit) | Two days
Table 6. Existing best techniques for intrusion detection on the CSE-CIC-IDS-2018 dataset.

Technique Short Form | Reference Paper Number | Technique Full Name
SVM classifier | [2] | Support Vector Machine
LSTM | [42] | Long Short-Term Memory
DNN | [6] | Deep Neural Network
DRNN | [43] | Deep Recurrent Neural Network
CNN | [41] | Convolution Neural Network
DBN | [1] | Deep Belief Network
DBN + WOA | [1] | Deep Belief Network with Whale Optimization Algorithm
DBN + MFO | [1] | Deep Belief Network with Moth Flame Optimization
DBN + SLO | [1] | Deep Belief Network with Sea Lion Optimization
DBN + SMO | [1] | Deep Belief Network with Spider Monkey Optimization
DBN + SMSLO | [1] | Deep Belief Network with Spider Monkey Optimization and Sea Lion Optimization
Table 7. Intrusion detection details with respect to learning percentage.

Measure | DDoS Attack | DoS Attack | Brute-Force Attack | Botnet Attack

Learning Percentage: 60% and Test Data: 40%
Predicted Positive | 3,464,454 | 414,564 | 219,911 | 164,846
Predicted Negative | 7,356,524 | 10,406,413 | 10,601,067 | 10,656,132
TP | 3,115,042 | 370,988 | 216,017 | 160,210
TN | 7,110,782 | 9,729,695 | 9,956,390 | 10,031,606
FP | 349,412 | 43,576 | 3894 | 4636
FN | 245,742 | 676,718 | 644,677 | 624,526

Learning Percentage: 70% and Test Data: 30%
Predicted Positive | 4,003,406 | 480,911 | 257,899 | 191,118
Predicted Negative | 8,621,068 | 12,143,564 | 12,366,575 | 12,433,356
TP | 3,691,901 | 435,110 | 250,951 | 188,314
TN | 8,410,036 | 11,472,976 | 11,739,360 | 11,852,630
FP | 311,504 | 45,801 | 6949 | 2805
FN | 211,032 | 670,588 | 627,215 | 580,726

Learning Percentage: 80% and Test Data: 20%
Predicted Positive | 4,570,926 | 547,518 | 295,964 | 219,795
Predicted Negative | 9,857,045 | 13,880,452 | 14,132,007 | 14,208,176
TP | 4,236,896 | 500,409 | 287,717 | 215,674
TN | 9,631,536 | 13,209,304 | 13,472,902 | 13,631,057
FP | 334,029 | 47,110 | 8247 | 4121
FN | 225,509 | 671,148 | 659,105 | 577,119

Learning Percentage: 90% and Test Data: 10%
Predicted Positive | 5,127,458 | 612,425 | 329,867 | 245,981
Predicted Negative | 11,104,009 | 15,619,042 | 15,901,600 | 15,985,486
TP | 4,776,398 | 564,726 | 322,994 | 242,118
TN | 10,891,912 | 15,016,893 | 15,204,678 | 15,494,678
FP | 351,060 | 47,698 | 6872 | 3864
FN | 212,097 | 602,149 | 696,922 | 490,808
Table 8. Experimental results for considered metrics.

Learning Percentage: 60% and Test Data: 40%
Measure | LSTM | DNN | CNN | DRNN | SVM | DBN | DBN + MFO | DBN + WOA | DBN + SMO | DBN + SLO | DBN + SMSLO | PCA + FCM-SMO + AE
Specificity | 0.920 | 0.860 | 0.800 | 0.930 | 0.850 | 0.870 | 0.910 | 0.920 | 0.880 | 0.930 | 0.940 | 0.987
Precision | 0.650 | 0.400 | 0.580 | 0.580 | 0.600 | 0.590 | 0.570 | 0.590 | 0.630 | 0.620 | 0.800 | 0.937
Sensitivity | 0.660 | 0.420 | 0.500 | 0.590 | 0.360 | 0.620 | 0.610 | 0.620 | 0.620 | 0.660 | 0.810 | 0.434
Accuracy | 0.870 | 0.730 | 0.660 | 0.880 | 0.780 | 0.850 | 0.820 | 0.850 | 0.840 | 0.850 | 0.910 | 0.940
MCC | 0.530 | 0.260 | 0.620 | 0.670 | 0.280 | 0.520 | 0.540 | 0.500 | 0.520 | 0.570 | 0.740 | 0.581
F-Measure | 0.655 | 0.410 | 0.537 | 0.585 | 0.450 | 0.605 | 0.589 | 0.605 | 0.625 | 0.639 | 0.805 | 0.593
NPV | 0.930 | 0.830 | 0.760 | 0.900 | 0.860 | 0.870 | 0.850 | 0.940 | 0.800 | 0.800 | 0.970 | 0.946
FPR | 0.090 | 0.130 | 0.500 | 0.080 | 0.150 | 0.110 | 0.090 | 0.110 | 0.080 | 0.090 | 0.060 | 0.013
FDR | 0.380 | 0.590 | 0.410 | 0.350 | 0.560 | 0.390 | 0.340 | 0.410 | 0.360 | 0.420 | 0.180 | 0.063
FNR | 0.370 | 0.580 | 0.380 | 0.330 | 0.600 | 0.390 | 0.300 | 0.320 | 0.400 | 0.360 | 0.180 | 2.062

Learning Percentage: 70% and Test Data: 30%
Measure | LSTM | DNN | CNN | DRNN | SVM | DBN | DBN + MFO | DBN + WOA | DBN + SMO | DBN + SLO | DBN + SMSLO | PCA + FCM-SMO + AE
Specificity | 0.900 | 0.840 | 0.620 | 0.950 | 0.870 | 0.830 | 0.900 | 0.940 | 0.860 | 0.950 | 0.960 | 0.990
Precision | 0.540 | 0.340 | 0.720 | 0.780 | 0.620 | 0.630 | 0.590 | 0.620 | 0.680 | 0.730 | 0.810 | 0.946
Sensitivity | 0.660 | 0.360 | 0.760 | 0.750 | 0.420 | 0.620 | 0.580 | 0.620 | 0.610 | 0.660 | 0.830 | 0.468
Accuracy | 0.850 | 0.800 | 0.520 | 0.920 | 0.740 | 0.820 | 0.780 | 0.850 | 0.850 | 0.840 | 0.930 | 0.951
MCC | 0.550 | 0.220 | 0.700 | 0.730 | 0.290 | 0.560 | 0.520 | 0.530 | 0.550 | 0.500 | 0.770 | 0.618
F-Measure | 0.594 | 0.350 | 0.739 | 0.765 | 0.501 | 0.625 | 0.585 | 0.620 | 0.643 | 0.693 | 0.820 | 0.626
NPV | 0.920 | 0.830 | 0.690 | 0.950 | 0.880 | 0.790 | 0.870 | 0.960 | 0.810 | 0.810 | 0.950 | 0.956
FPR | 0.080 | 0.170 | 0.750 | 0.070 | 0.130 | 0.080 | 0.110 | 0.090 | 0.060 | 0.070 | 0.050 | 0.010
FDR | 0.380 | 0.680 | 0.170 | 0.200 | 0.570 | 0.330 | 0.220 | 0.700 | 0.360 | 0.470 | 0.190 | 0.054
FNR | 0.390 | 0.680 | 0.180 | 0.200 | 0.580 | 0.430 | 0.240 | 0.380 | 0.360 | 0.410 | 0.190 | 1.691

Learning Percentage: 80% and Test Data: 20%
Measure | LSTM | DNN | CNN | DRNN | SVM | DBN | DBN + MFO | DBN + WOA | DBN + SMO | DBN + SLO | DBN + SMSLO | PCA + FCM-SMO + AE
Specificity | 0.850 | 0.870 | 0.780 | 0.940 | 0.840 | 0.810 | 0.850 | 0.900 | 0.920 | 0.960 | 0.950 | 0.991
Precision | 0.600 | 0.540 | 0.590 | 0.680 | 0.650 | 0.580 | 0.630 | 0.630 | 0.680 | 0.620 | 0.810 | 0.949
Sensitivity | 0.520 | 0.480 | 0.500 | 0.700 | 0.440 | 0.610 | 0.590 | 0.610 | 0.630 | 0.630 | 0.800 | 0.488
Accuracy | 0.830 | 0.790 | 0.590 | 0.900 | 0.770 | 0.830 | 0.800 | 0.840 | 0.850 | 0.790 | 0.920 | 0.956
MCC | 0.400 | 0.320 | 0.580 | 0.750 | 0.290 | 0.550 | 0.480 | 0.520 | 0.580 | 0.470 | 0.780 | 0.638
F-Measure | 0.557 | 0.508 | 0.541 | 0.690 | 0.525 | 0.595 | 0.609 | 0.620 | 0.654 | 0.625 | 0.805 | 0.645
NPV | 0.850 | 0.900 | 0.750 | 0.950 | 0.870 | 0.780 | 0.900 | 0.860 | 0.830 | 0.860 | 0.950 | 0.960
FPR | 0.150 | 0.130 | 0.630 | 0.050 | 0.140 | 0.080 | 0.080 | 0.090 | 0.070 | 0.080 | 0.050 | 0.009
FDR | 0.460 | 0.480 | 0.410 | 0.250 | 0.570 | 0.310 | 0.340 | 0.410 | 0.420 | 0.380 | 0.200 | 0.051
FNR | 0.450 | 0.490 | 0.420 | 0.260 | 0.560 | 0.380 | 0.300 | 0.370 | 0.360 | 0.360 | 0.200 | 1.503

Learning Percentage: 90% and Test Data: 10%
Measure | LSTM | DNN | CNN | DRNN | SVM | DBN | DBN + MFO | DBN + WOA | DBN + SMO | DBN + SLO | DBN + SMSLO | PCA + FCM-SMO + AE
Specificity | 0.930 | 0.830 | 0.800 | 0.900 | 0.880 | 0.890 | 0.900 | 0.880 | 0.940 | 0.960 | 0.950 | 0.991
Precision | 0.650 | 0.320 | 0.710 | 0.800 | 0.650 | 0.600 | 0.610 | 0.640 | 0.730 | 0.750 | 0.820 | 0.954
Sensitivity | 0.600 | 0.340 | 0.640 | 0.800 | 0.500 | 0.630 | 0.620 | 0.630 | 0.660 | 0.650 | 0.800 | 0.522
Accuracy | 0.850 | 0.720 | 0.630 | 0.820 | 0.790 | 0.860 | 0.800 | 0.860 | 0.900 | 0.840 | 0.960 | 0.963
MCC | 0.560 | 0.200 | 0.620 | 0.370 | 0.300 | 0.570 | 0.540 | 0.570 | 0.630 | 0.460 | 0.790 | 0.669
F-Measure | 0.624 | 0.330 | 0.673 | 0.800 | 0.565 | 0.615 | 0.615 | 0.635 | 0.693 | 0.696 | 0.810 | 0.675
NPV | 0.900 | 0.840 | 0.800 | 0.920 | 0.830 | 0.800 | 0.940 | 0.880 | 0.880 | 0.930 | 0.970 | 0.967
FPR | 0.090 | 0.180 | 0.640 | 0.090 | 0.140 | 0.090 | 0.120 | 0.080 | 0.070 | 0.080 | 0.040 | 0.009
FDR | 0.340 | 0.650 | 0.210 | 0.400 | 0.580 | 0.410 | 0.380 | 0.640 | 0.380 | 0.490 | 0.190 | 0.046
FNR | 0.350 | 0.650 | 0.220 | 0.410 | 0.540 | 0.560 | 0.440 | 0.450 | 0.360 | 0.430 | 0.190 | 1.250

Average Value Results
Measure | LSTM | DNN | CNN | DRNN | SVM | DBN | DBN + MFO | DBN + WOA | DBN + SMO | DBN + SLO | DBN + SMSLO | PCA + FCM-SMO + AE
Specificity | 0.900 | 0.850 | 0.750 | 0.930 | 0.860 | 0.850 | 0.890 | 0.910 | 0.900 | 0.950 | 0.950 | 0.990
Precision | 0.610 | 0.400 | 0.650 | 0.710 | 0.630 | 0.600 | 0.600 | 0.620 | 0.680 | 0.680 | 0.810 | 0.947
Sensitivity | 0.610 | 0.400 | 0.600 | 0.710 | 0.430 | 0.620 | 0.600 | 0.620 | 0.630 | 0.650 | 0.810 | 0.478
Accuracy | 0.850 | 0.760 | 0.600 | 0.880 | 0.770 | 0.840 | 0.800 | 0.850 | 0.860 | 0.830 | 0.930 | 0.953
MCC | 0.510 | 0.250 | 0.630 | 0.630 | 0.290 | 0.550 | 0.520 | 0.530 | 0.570 | 0.500 | 0.770 | 0.626
F-Measure | 0.608 | 0.399 | 0.623 | 0.710 | 0.510 | 0.610 | 0.600 | 0.620 | 0.654 | 0.664 | 0.810 | 0.635
NPV | 0.900 | 0.850 | 0.750 | 0.930 | 0.860 | 0.810 | 0.890 | 0.910 | 0.830 | 0.850 | 0.960 | 0.957
FPR | 0.103 | 0.153 | 0.630 | 0.073 | 0.140 | 0.090 | 0.100 | 0.093 | 0.070 | 0.080 | 0.050 | 0.010
FDR | 0.390 | 0.600 | 0.300 | 0.300 | 0.570 | 0.360 | 0.320 | 0.540 | 0.380 | 0.440 | 0.190 | 0.053
FNR | 0.390 | 0.600 | 0.300 | 0.300 | 0.570 | 0.440 | 0.320 | 0.380 | 0.370 | 0.390 | 0.190 | 1.627
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
