Toward Developing Efficient Conv-AE-Based Intrusion Detection System Using Heterogeneous Dataset

Abstract: Recently, owing to the rapid development and remarkable results of deep learning (DL) and machine learning (ML) approaches on several long-standing artificial intelligence (AI) tasks in various domains, there has been strong interest in applying them to network security as well. In the era of information and communication technology (ICT), the intrusion detection (ID) system has great potential to be the frontier of security against cyberattacks and plays a vital role in securing network infrastructure and resources. Conventional ID systems are not strong enough to detect advanced malicious threats. Heterogeneity is one of the important features of big data. Thus, designing an efficient ID system using a heterogeneous dataset is a major research problem. Several ID datasets are openly available to the cybersecurity research community. However, no existing research has shown a detailed performance evaluation of several ML methods on various publicly available ID datasets. Due to the dynamic nature of malicious attacks with continuously changing attack detection methods, ID datasets are made publicly available and are updated systematically. In this research, robust classical ML classifiers based on Spark MLlib (machine learning library) are used for anomaly detection, and a state-of-the-art DL approach, the convolutional autoencoder (Conv-AE), is used for misuse attack detection, to develop an efficient and intelligent ID system that detects and classifies unpredictable malicious attacks. To measure the effectiveness of our proposed ID system, we use several important performance metrics, such as FAR, DR, and accuracy, and conduct experiments on a publicly available dataset, specifically the contemporary heterogeneous CSE-CIC-IDS2018 dataset.


Introduction
Nowadays, the use of the Internet and its influence on every aspect of society have increased significantly, especially in the business sector. Connectivity to information and communication technology (ICT)-based systems allows corporations to increase their productivity and activity. Recently, the availability of the Internet has increased greatly; therefore, almost every facet of our daily life is integrated with ICT at a comparatively low price. This means anyone can access any network with ease [1]. Along with this improvement, security problems in the digital world have also increased due to the democratization of the Internet [2], and therefore protecting computer systems from threats has become a more pressing and vital research topic than before. ICT systems thus need integrated and concrete security solutions. Despite the availability of various primary security solutions, such as firewalls, access control mechanisms, and antivirus software, many ICT systems are still exposed to cyber threats that may disrupt their functioning, expose private information, or corrupt data. Although these conventional security mechanisms serve as the first line of defense, they are inadequate for dealing with intrusion skills and techniques. As data is the most important asset of a corporation [3], there is a pressing need to devise an efficient security mechanism that keeps the data secure and makes ICT systems more tolerant of and resistant to malicious attacks. To this end, we propose an efficient ID system, which is a useful security solution for mitigating malicious network attacks.
In the last two decades, numerous studies have been conducted in the ID domain to develop effective NIDS using several approaches, ranging from statistical learning to conventional ML and current DL techniques. These methods achieve decent accuracy in detecting malicious threats and have improved the speed of network traffic processing. However, the rapid growth of heterogeneous data poses a big challenge [4,5]. ID frequently involves the analysis of big data, which is a hot research issue because conventional computing techniques cannot deal with the quantity of data, such as network traffic [6]. Advanced security mechanisms, such as NIDS, must evaluate enormous volumes of network traffic packets in a real-time environment, as the correspondingly rapid growth of malicious threats can have catastrophic effects on the basic security components known as CIA (confidentiality, integrity, availability). Table 1 summarizes the challenges faced by the ID system in a big data heterogeneous environment.
Table 1. Intrusion detection challenges in a big data heterogeneous environment.

| Characteristic of Big Data | Description | Challenges |
|---|---|---|
| Volume | Size of the dataset in terabyte or petabyte, etc. | A huge capacity for network traffic is a massive problem for conventional computing approaches, and it is also a big issue for reducing the processing capability. |
| Velocity | Speed of data generation | Velocity refers to the speed at which traffic is created; handling high-speed network traffic in a real-time environment is also a big issue. |
| Variety | Dataset complexity, such as structured, unstructured format | Variety denotes the complexity of network traffic, especially when data packets arrive in several formats and from various sources, so attribute selection is not a simple task, and some machine learning approaches are not appropriate for this issue. |
| Veracity | Data consistency and trustworthiness | Veracity denotes the correctness of data and data quality issues like noise or missing values. |
| Value | Data statistical, hidden, and unknown values | Value means that if specific data does not afford any meaning (value), it is not considered for big data analysis. |

Every day, we experience extraordinary growth of data, which is the main contributor to big data in relation to volume, veracity, variety, velocity, and value [7,8], and brings its specific problems to the field of ID. There are several traditional approaches for ID, such as firewalls, access control, and encryption mechanisms. These conventional approaches have a few constraints, particularly when facing a huge number of suspicious threats like DDOS and DOS, and an IDS can produce high FN and FP rates. Recently, researchers have used ML and data mining approaches for ID with the aim of improving ID rates compared to traditional security mechanisms. The problem of ID in a heterogeneous data environment has been recognized, and researchers have started to implement efficient big data analytics frameworks that can evaluate and monitor network traces effectively [9,10]. The role of the dataset is crucial; therefore, selecting a suitable big data analytics framework and using an adequate dataset for ID system evaluation are two big challenges in developing an efficient IDS [11].
Numerous studies on the background problem have highlighted several issues and challenges regarding ID that need researchers' solid and immediate attention. These issues and challenges are:
• Intrusion detection algorithms.
• Deficient or inadequate dataset.
• Integration of several formats of data.
• Poor system design.
• Big data processing framework.
• Testing/evaluation of IDS.
However, ID studies face various limitations and problems. To start with, obtaining a large amount of labeled data and handcrafted features is not easy, whereas obtaining unlabeled raw traffic data together with a small amount of labeled data is comparatively easy. Therefore, the training process of several DL techniques, such as SAEs and DBNs, consists of unsupervised pre-training followed by supervised fine-tuning [12]. In this scenario, the large amount of unlabeled data and the small amount of labeled data are used in the two training phases, respectively. The evident drawback of these fully connected networks is their large number of training parameters, caused by the full connections between units in neighboring layers. With so many parameters, the training process can become very slow, which in turn limits the number of NN layers. In contrast, the CNN DL approach decreases the number of training parameters through shared weights and sparse connectivity, but CNN, as a supervised learning process, requires labeled input data. The real motivation behind this high-performance ID research is to propose an effective and suitable DL method that combines unsupervised feature extraction with the benefits of CNN for ID.
With the development of cyber-defense abilities, cyber-attacks have continued to evolve, like living beings, to penetrate security defenses. Given the possibility of several enemy attacks, it is essential to choose a proper course of action by proactively evaluating and predicting the effects of a specific security incident. Cyber-attacks, particularly in large-scale military network environments, have a lethal impact on security; therefore, many tests and much research must be carried out to make the required preparations. Nowadays, cyberspace is identified as the fifth battlespace after air, sea, space, and land. Therefore, beyond simply defending information, cyber warfare can affect the military policies and actions that are specifically related to national security. Although the military is seeking to identify and minimize cyber-attacks to counter them, cyber-attacks are growing more frequent, and new forms of threats continue to emerge [13,14]. It is important to examine the cyber threats that emerge in different ways in order to react to them effectively. The infrastructure should be secured against the consequences of cyber-attacks, and security policies should be developed. Besides, it is important to examine not only current cyber threats but also the possibility of reacting more proactively. Various studies have been carried out on cyber-attack modeling, such as the attack tree, attack graph, and cyber kill chain modeling approaches [15,16]. Note that prior research on cyber-attack modeling faces some challenges, such as scalability in a large-scale network environment. Recently, cyber-attacks do not simply end with a single attack but take a composite form of several types of attacks. Moreover, new forms of cyber-attacks are taking place continuously. To deal with these challenges, a new modeling approach is required that is flexible enough to easily add newly developing attack types and to model complex attack processes systematically [17,18].
In particular, we have developed an improved ID system, which is built on Spark MLlib and deep learning approaches, such as Conv-AE. In this research, we propose a new ID system that combines the benefits of two systems to increase performance compared with the classical system. The central idea of this study is to design an ID system that is based on Spark MLlib and Conv-AE deep learning techniques. It is an innovative approach that joins shallow and deep learning techniques to exploit their strengths and overcome their analytical overheads. The main contributions of this work are as follows:
• We give a comprehensive review of the advanced DL approaches in the ID domain.
• We propose an ID model based on Spark MLlib and state-of-the-art DL approaches, such as Conv-AE, which combines deep and shallow networks to decrease their analytical overheads and exploit their advantages.
• We analyze packet capture (pcap) files directly on Spark, while earlier researchers have not assessed the raw packet dataset.
• We address the class imbalance issue that commonly exists in big data high-speed networks.
• We study the performance of our proposed IDS using contemporary heterogeneous real traffic, CSE-CIC-IDS2018.
• We compare the performance of the proposed Spark MLlib and DL approach-based ID system with other classical ML approaches. The experimental outcomes show that this approach is very efficient for attack detection, detecting misuse intrusions correctly in 98.20% of cases in a 10-fold cross-validation test.
The remainder of the paper is structured as follows: related works to the ID system are described in Section 2. The architecture of the proposed ID system with the framework descriptions are in Section 3. Implementation and experimental outcomes are covered in Section 4 with a comparative analysis with existing methods. Section 5 provides future directions before concluding the paper.

Related Work
In this section, we discuss the current research that is related to our study. As shown in Table 2, various ID approaches have been developed in the last two decades, giving predictive accuracy on various datasets. ID technology is a crucial part of computer security, and the first idea was proposed by James Anderson in 1980 [19], wherein he proposed an ID framework for intrusion classification to establish a security monitoring structure that relies on identifying malicious user behavior. In recent times, comprehensive research has been carried out to develop efficient IDs using various techniques. These ID techniques, ranging from simple statistical algorithms to advanced ML approaches, have been useful in extracting features from network traffic so that abnormal traffic can be distinguished from normal traffic. In previous research, Naseer et al. [20], Bandyopadhyay et al. [21], Tama et al. [22], Albahar et al. [23], Tang et al. [24], Qatf et al. [25], Farahnakian et al. [26], Thi-Thu et al. [27], Pektas et al. [28], Mighan et al. [29], Meira et al. [30], and Wang et al. [31] introduced various models, methods, and techniques based on conventional supervised and unsupervised ML approaches for ID problems to increase the performance of the ID framework. ML approaches, such as k-NN [32], SVM [33], ANN [34], RF [35,36], and many others, have been extensively used for ID. Laskov et al. [37] provided a study of unsupervised and supervised learning approaches according to their detection accuracy and ability to identify unknown malicious threats. Solanas et al. [38] described the clustering approach for anomaly ID. Ghorbani et al. [39] presented a comprehensive review of unsupervised and supervised learning techniques for anomaly IDs. A comprehensive range of anomaly ID systems was described by Kalita et al. [40], and Tavallee et al. [41] analyzed the performance of various classical ML algorithms, including DT, NB, and SVM.
In general, ML techniques have brought accuracy and efficiency to the identification of malicious activities in network traffic. However, a few limitations remain in these ML approaches, such as a data preprocessing phase that requires expert knowledge, a high FAR value, and a low attack DR value. ML approaches also require a huge amount of training data for efficient and reliable results, which is not easy to obtain, particularly in a dynamic and diverse environment.
Due to these deficiencies, DL algorithms have great importance in contemporary research. DL is the cutting-edge field of ML, which can address these deficiencies and solve the problems associated with shallow learning. Earlier researchers have shown that DL performs better than shallow learning due to its layer-wise feature learning structure [42]. DL algorithms evaluate the network traffic deeply and efficiently recognize intrusions in the network data. A taxonomy of previous DL and shallow learning work in the ID domain was summarized by Hodo et al. [43]. Nowadays, the application of DNN to ID problems is a comparatively hot research area. AE, DBN, RBM, LSTM, and CNN have been used for ID. Javaid et al. [44] employed softmax regression with sparse AE on the NSLKDD ID data, which is an upgraded form of the KDD 99 ID data. Fiore et al. [45] used an RBM DL approach to achieve an accuracy of 85% on the KDD 99 ID dataset. Jihyun et al. [46] utilized the LSTM DL method to identify malicious threats on the KDD 99 dataset and claimed that they obtained better accuracy and attack DR compared to conventional classifiers, such as SVM and KNN. Gao et al. [47] developed an ID architecture using DBN and evaluated the performance of the DL-based ID system using the KDD 99 ID dataset. Aygun et al. [48] proposed a denoising AE-based ID system and claimed to achieve attack classification accuracy of up to 88.6% and 88.2% on the NSLKDDTEST+ dataset. Yousefi-Azar et al. [49] proposed an ID architecture based on AE-generated latent features using the NSLKDD ID dataset and obtained an ID accuracy of up to 83.34%.
Hussain et al. [50,51] developed a hybrid NIDS by joining AdaBoost with DT and completed an experiment on the NSLKDD ID dataset, which is an upgraded version of the KDD 99 dataset; the outcomes demonstrate that the hybrid approach is effective in identifying anomalies in the ID system. Nowadays, a substantial amount of research is done in the ID domain. Most researchers focus on improving the ID system's ability to identify malicious threats and on increasing the network speed that can be handled. Ying Chung et al. [52] proposed a hybrid ID using the SSO approach and achieved attack classification accuracy of up to 93% using the KDD99 ID dataset. Ghanem et al. [53] developed another hybrid ID architecture using a metaheuristic technique for an enormous dataset, where detection is based on a genetic algorithm and a metaheuristic approach. Kim et al. [54] proposed a novel hierarchical ID system that joins anomaly and misuse ID models via a decomposition structure. The misuse ID model is developed using DT, while the anomaly ID model is created via a one-class SVM method. They evaluated the proposed hybrid ID system on the NSLKDD ID dataset and claimed that it has better performance in terms of ID accuracy and low FPR for both anomaly and misuse attacks. Wang et al. [31] developed a hierarchical spatial-temporal-based ID called HAST-IDS, where low-level features are learned via CNN, and high-level features are learned through the LSTM deep learning approach. The entire feature learning procedure is accomplished by DNN without a feature engineering technique. This automatic feature-learning ID system was evaluated on the DARPA98 and ISCX2012 ID datasets; it increases the ID accuracy and decreases the FAR compared to traditional ID techniques. Chencheng et al. [55] proposed a hybrid ID system based on distinct flows of features, where CNN is used to evaluate the sequence of features, and DNN is used to learn various characteristics of high-dimensional feature vectors comprising environmental and statistical features. They evaluated the performance of this hybrid ID system using the ISCX2012 ID dataset.
Monshizadeh et al. [9] combined learning and linear algorithms with a protocol analyzer to identify malicious activities in the network. Their linear and learning architecture is known as a hybrid anomaly detection module (HADM), where linear algorithms extract features, while the learning part of HADM uses these features to identify novel types of attacks. The protocol analyzer is used to filter and categorize the susceptible protocols to avoid a needless computational load. They tested the performance of the HADM ID system using the UNSW-NB15, ISCX2012, and ISCX2017 ID datasets.
In the ID domain, most researchers use the KDD99 dataset, but because this kind of dataset is outdated, it does not help in mitigating newer threats. Therefore, it is important that an IDS be evaluated and tested on efficient and superior datasets [56,57]. So, in our study, we address the problems related to the dataset and evaluate a solution to them. Recent research shows that a hybrid approach solves various research problems in different domains, such as sentiment analysis [58], video classification [59], emotion recognition [60], and malicious ID from a video [61]. In a big data environment, heterogeneous data are any data with high variability of data types and formats. They may be of poor quality and ambiguous due to noise and missing values. Using heterogeneous data in ID research is a nontrivial task. Dealing with a large volume of stream data, ranging from unstructured to structured and from text to numeric, is also a big issue, since the data stream is real-time, dynamic, and very heterogeneous. To solve the aforementioned issues and improve the learning capability and accuracy of IDS, we have developed a DL-based IDS. In particular, we have developed an efficient ID system, which is built on Spark MLlib and DL approaches, such as Conv-AE networks. Typical ML techniques based on Spark MLlib are useful for detecting anomalous network traffic, and Conv-AE assists in detecting misuse network traffic. Since ABS and SBS each have some restrictions, we have joined the two systems to reduce their drawbacks. In this research, we propose a new ID system that combines the advantages of the two systems to enhance performance compared to the conventional system on the well-known modern real-time heterogeneous dataset CSE-CIC-IDS2018.
In a nutshell, the current research attempts to respond to the following research problem: how to develop a fast, competent ID system that learns useful features efficiently and automatically from large heterogeneous data by using state-of-the-art Conv-AE DL approaches, identifies malicious attacks in both anomaly-based and misuse-based ID settings, and reduces FP while achieving a better attack detection rate.

The Proposed ID System
The proposed framework of the ID system is given in Figure 1. This proposed efficient ID system contains four main stages. The first stage of the proposed approach is preprocessing from the original ID dataset. The second stage is anomaly detection with conventional ML classifiers using Spark MLlib. In the third stage, Conv-AE deep learning approach is used for misuse detection. The final stage is the alarm module of the proposed approach, which detects whether the incoming network traffic is benign or malicious and evaluates the proposed ID system.

Data Preprocessing
This is the first stage of the proposed ID framework. The CSE-CIC-IDS2018 dataset consists of labeled flows captured over ten days. More than 80 attributes can be extracted from the raw ID dataset by applying CICFlowMeter-V3, and these features are saved in CSV format for evaluating the network traffic data. In CSE-CIC-IDS2018, a few attributes, such as the IP address and timestamp, have only a slight influence on whether network traffic is benign or malicious. As the ID system classifies network traffic according to its behavioral attributes, we removed the IP address column. Likewise, the timestamp does not have a high impact on training the network, so we eliminated this attribute as well. After that, we divided the dataset into training, test, and validation sets of 70%, 20%, and 10%, respectively. The model is trained on the training data; the test set is used for the final assessment, while the validation set is used for fast assessment of the model. CSE-CIC-IDS2018 is real-world heterogeneous ID data, and such data are usually incomplete (missing feature values, missing particular features of interest, or comprising only aggregate data), noisy (containing outliers or errors), and inconsistent (containing discrepancies in names or codes). To handle the class imbalance issue, we employed over-sampling, in which we increase the number of instances in the minority class by randomly duplicating them so that the minority class has a higher representation in the sample. Although over-sampling carries some risk of overfitting the data, no information is lost, and it outperforms the under-sampling technique. The train and test dataset used in this study is given in Table 3. The key idea of this research is to test the reliability of the efficient ID system against unknown malicious threats via the misuse attack detection technique. Table 4 gives the training and testing dataset used by the Conv-AE deep learning approach for misuse classification.
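The split and over-sampling steps described above can be illustrated with a minimal sketch in plain Python. The function names `split_70_20_10` and `random_oversample`, the seeds, and the toy data are assumptions made for illustration and do not come from the paper. The sketch also assumes that over-sampling is applied to the training portion only, which the text does not state.

```python
import random
from collections import Counter

def split_70_20_10(rows, labels, seed=0):
    """Shuffle once, then cut into train (70%), test (20%), validation (10%)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_train = len(idx) * 70 // 100
    n_test = len(idx) * 20 // 100
    parts = (idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:])
    return [([rows[i] for i in p], [labels[i] for i in p]) for p in parts]

def random_oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class is as
    large as the largest class. No row is ever dropped or altered."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data (hypothetical): 90 benign flows and 10 bot flows, one feature each.
rows = [[float(i)] for i in range(100)]
labels = ["Benign"] * 90 + ["Bot"] * 10
train, test, val = split_70_20_10(rows, labels)
balanced_rows, balanced_labels = random_oversample(*train)
print(Counter(train[1]), Counter(balanced_labels))
```

Because duplicated rows are exact copies, a model can memorize them, which is the overfitting risk mentioned in the text. Keeping the test and validation sets free of duplicates is one way to make that risk visible during evaluation.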

The Anomaly Detection Module
In this stage of the proposed ID system, we use the machine learning library of Spark (MLlib) to implement several conventional ML classifiers, such as LR, DT, SVM, and RF, to classify malicious traffic for anomaly detection. In this stage, we divide the dataset into two subsets: 80% for training and 20% for testing. Then, the conventional ML classifiers are trained on the training set to distinguish malicious from normal traffic. This is the overall binary learning stage. The trained conventional ML classifiers are tested on the test dataset. During this stage, the best-performing model is selected through grid search hyperparameter tuning and 10-fold cross-validation.
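The model selection procedure named above, grid search combined with 10-fold cross-validation, can be sketched in plain Python as follows. This is not the authors' Spark code: `k_fold_indices`, `grid_search_cv`, and the toy classifier `fit_threshold` are hypothetical names, and the single-feature threshold rule merely stands in for an MLlib classifier. In Spark, this step would typically be delegated to MLlib's `CrossValidator` and `ParamGridBuilder`, although the text does not say which utilities were used.

```python
import itertools
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def grid_search_cv(fit, X, y, param_grid, k=10):
    """Try every hyperparameter combination and return the one with the
    highest mean accuracy across the k folds."""
    best_params, best_score = None, -1.0
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, test_idx in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
            hits = sum(model(X[i]) == y[i] for i in test_idx)
            scores.append(hits / len(test_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def fit_threshold(X_train, y_train, threshold):
    """Toy binary classifier (assumption): flag a flow as malicious (1)
    when its single feature exceeds the threshold hyperparameter."""
    return lambda x: int(x[0] > threshold)

# Toy data (hypothetical): flows with feature value above 49 are malicious.
X = [[i] for i in range(100)]
y = [int(i > 49) for i in range(100)]
print(grid_search_cv(fit_threshold, X, y, {"threshold": [10, 49, 80]}))
```

Scoring each candidate on held-out folds rather than on the training data is what keeps the selected hyperparameters from simply reflecting how well a model memorizes its own training set.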

Misuse Detection Using Conv-AE Deep Learning Approach
In this stage, Conv-AE is used to identify misuse traffic, with the objective of further classifying the anomalous traffic into the relevant attack classes: DOS attacks, DDOS attacks, bot, and brute force. Conv-AE merges the advantages of CNN and the unsupervised pre-training of AE. The micro overview of Conv-AE is shown in Figure 2. A CNN has two fundamental components: feature extraction and classification. The feature extraction component contains two kinds of layers, known as convolutional and pooling layers. The output of the feature extraction component, commonly known as the feature map, becomes the input to the other component, classification; in this way, CNN learns features efficiently. In Conv-AE, however, instead of fully connected layers, the encoder consists of convolutional layers, and the decoder consists of deconvolutional layers. Once the decoding part is fully connected, a softmax classifier is added at the end to produce a probability distribution over the classes. Here, the trained model is tested, with the test set as input, to determine whether the observed behavior is malicious or normal. Then, we randomly divide the data into training and testing sets of 80% and 20%, respectively; 10% of the training dataset is used for validation. During training, the network is fine-tuned with the AdaMax, Adam, and AdaGrad optimizers with flexible learning rates, and the hyperparameters are tuned by grid search over several combinations with 10-fold cross-validation and a batch size of 128. We also analyze the network performance after adding a Gaussian noise layer after the Conv layers to enhance the model's overall generalization ability and overcome the overfitting problem.
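Two of the building blocks mentioned above, the softmax output over the attack classes and the Gaussian noise regularizer, can be illustrated in isolation with a short sketch. It does not reproduce the Conv-AE itself. The function names, the noise standard deviation, and the example scores are assumptions, and the choice to apply noise only during training follows the common convention for noise layers rather than anything stated in the text.

```python
import math
import random

# Attack classes named in the text for the misuse detection stage.
ATTACK_CLASSES = ["DOS", "DDOS", "Bot", "Brute force"]

def softmax(logits):
    """Turn raw class scores into a probability distribution.
    Subtracting the maximum keeps exp() from overflowing."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the most probable attack class and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ATTACK_CLASSES[best], probs

def add_gaussian_noise(values, stddev, training=True, rng=None):
    """Add zero-mean Gaussian noise to activations. Noise is applied only
    in training mode; at inference the input passes through unchanged."""
    if not training:
        return list(values)
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, stddev) for v in values]

# Hypothetical final-layer scores for one flow.
label, probs = classify([0.1, 3.0, 0.2, -1.0])
print(label, [round(p, 3) for p in probs])
```

Perturbing intermediate activations forces the following layers to tolerate small input variations, which is why such a layer can reduce overfitting on duplicated or noisy traffic records.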

Alarm Module
The last stage of the proposed ID system is the alarm module, which interprets the results of the events from both the anomaly and misuse detection stages. It is the final component of the proposed ID system and notifies the administrator or end user when malicious activity indicates that something has happened in the network.

Implementation Details
To show the efficacy of the proposed ID system on the contemporary heterogeneous dataset CSE-CIC-IDS2018, we conducted various experiments, which are discussed in detail in the sections below.

Datasets
Since choosing suitable data to test an ID system plays a significant role, we describe the ID data before we describe the simulation details of the proposed ID system.
Even though numerous standard ID datasets are publicly available, some of them contain unverified, outdated, irreproducible, and inflexible intrusions. To reduce the deficiencies of ID datasets, we use the most contemporary heterogeneous ID dataset, CSE-CIC-IDS2018, for our proposed high-performance efficient ID system [42]. This dataset was prepared in a collaborative project between the CIC and the CSE. The ID data consist of seven distinct attack scenarios captured over a huge network for 10 days: botnet, DDOS, brute force, web attack, DOS, infiltration, and heart leech (Heartbleed) attacks.

• Botnet attacks: Many Internet-connected devices are used by a botnet owner to accomplish many tasks. A botnet can be utilized to steal data, send spam, and give the attacker access to the device and its connection. These kinds of attacks are collected through keylogging and screenshots.
• DDOS attacks: These typically happen when several systems flood the bandwidth or resources of a victim. Such an attack is often the result of many compromised systems (for example, a botnet) flooding the targeted system with enormous network traffic. These kinds of attacks use LOIC for TCP, UDP, and HTTP.
• Brute force attacks: This is one of the most widespread attacks and can be used not only for password breaking but also to discover hidden content and pages in a web application. It is simply a hit-and-try attack: the attacker keeps attempting until one attempt succeeds. These kinds of attacks are collected through the SSH and FTP Patator tools.
• Web attacks: These kinds of threats emerge every day, and people and organizations now take security seriously. They include SQL injection, in which an attacker crafts a string of SQL commands and uses it to force the database to respond with information; cross-site scripting (XSS), which happens when developers do not test their code properly to identify the possibility of script injection; and brute force over HTTP, which tries a list of passwords to find the administrator's password. The web attacks are collected via DVWA and an in-house Selenium framework (brute force and XSS).
• DOS attacks: The attacker seeks to make a computer network resource temporarily inaccessible. This is usually accomplished by flooding the intended network resource or machine with superfluous requests in an attempt to overload systems and prevent some or all legitimate requests from being fulfilled. GoldenEye, Hulk, SlowHTTPTest, and Slowloris are used to gather these kinds of attacks.
• Infiltration attacks: Infiltration of the network from inside normally takes advantage of a vulnerable application, such as Adobe Acrobat Reader. After successful exploitation, a backdoor is executed on the victim's machine, which can then conduct diverse attacks on the victim's network. Port scans and Nmap are applied to gather these sorts of attacks.
• Heart leech attacks: This attack stems from a bug in the OpenSSL cryptography library, which is a commonly used transport layer security (TLS) protocol implementation. It is usually exploited by sending a malformed heartbeat request with a small payload and a large length field to the vulnerable party (usually a server) to evoke the victim's response. It is a kind of DOS attack.
More than 80 attributes can be extracted from the raw ID dataset by applying CICFlowMeter-V3; these features are saved in CSV format and can be used to evaluate the network traffic data. The original files (logs and pcap) are also accessible and can be used to extract new features. Some of the CSE-CIC-IDS2018 features are given in Table 5.
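As a minimal sketch of how such CSV flow records might be read for a binary (normal versus attack) anomaly detection stage, the following uses only the Python standard library. The sample rows, the column names, and the `Benign` label value are illustrative assumptions rather than an excerpt of the actual dataset, and `load_flows` is a hypothetical helper, not part of CICFlowMeter.

```python
import csv
import io

# Hypothetical excerpt in the shape of a CICFlowMeter CSV export.
# Column names and values are illustrative assumptions, not real dataset rows.
SAMPLE_CSV = """Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,Label
112641,3,4,Benign
5021,2,0,Bot
88,1,1,Benign
"""


def load_flows(csv_text, label_column="Label", benign_value="Benign"):
    """Parse flow records into (numeric feature dict, is_attack flag) pairs."""
    flows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Separate the class label from the numeric flow features.
        label = row.pop(label_column)
        features = {name: float(value) for name, value in row.items()}
        # Any label other than the benign value is treated as an attack.
        flows.append((features, label != benign_value))
    return flows


flows = load_flows(SAMPLE_CSV)
attack_count = sum(1 for _, is_attack in flows if is_attack)
```

Collapsing every non-benign label into a single attack flag suits a first anomaly detection pass; a misuse detection stage would keep the original attack labels instead.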

Performance Parameters
Once the models are trained, we evaluate them on the test data, and the performance metrics are computed from the confusion matrix. The elements of the confusion matrix represent the predicted and expected classifications. The classification outcomes are divided into two classes: correct and incorrect. Four cases determine the elements of the confusion matrix. The confusion matrix for the ID setting is shown in Table 6.

• True positive (TP): Denoted by x; the instance is anomalous in reality, and the model correctly predicts positive.
• False negative (FN): Denoted by y; a wrong prediction in which an instance that is anomalous in reality is classified as normal, so the model mistakenly predicts negative.
• False positive (FP): Denoted by z; the model incorrectly predicts positive, that is, an instance flagged as an attack is normal in reality.
• True negative (TN): Denoted by t; the instance is normal in reality, and the model correctly predicts negative.
Using these elements of the confusion matrix, we can calculate the performance of the proposed ID system. The detection rate (DR), also known as the true-positive rate (TPR), and the false alarm rate (FAR) are the two most significant and widely used parameters for evaluating an IDS. DR is the percentage of anomalous instances recognized by the ID model. FAR is the percentage of normal instances misclassified as attacks.
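The metrics above can be sketched in a few lines of Python, reusing the symbols x (TP), y (FN), z (FP), and t (TN) and taking attack traffic as the positive class. The formulas are the standard definitions, DR = x / (x + y), FAR = z / (z + t), and accuracy = (x + t) / (x + y + z + t); the function name `ids_metrics` and the counts in the usage line are made up for illustration and are not results from this study.

```python
def ids_metrics(x, y, z, t):
    """Compute DR, FAR, and accuracy from confusion-matrix counts.

    x = true positives, y = false negatives,
    z = false positives, t = true negatives.
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class has no instances.
        return numerator / denominator if denominator else 0.0

    return {
        "DR": ratio(x, x + y),                      # attacks that were detected
        "FAR": ratio(z, z + t),                     # normal traffic flagged as attack
        "accuracy": ratio(x + t, x + y + z + t),    # all correct predictions
    }


# Illustrative counts only (not taken from the paper's experiments).
metrics = ids_metrics(x=90, y=10, z=5, t=95)
```

A good ID system pushes DR toward 1 while keeping FAR close to 0; accuracy alone can be misleading when normal traffic heavily outnumbers attacks.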

Experimental Settings
The initial anomaly detection stage is implemented in Scala with Spark using conventional ML classifiers, while the misuse detection stage, based on Conv-AE, is implemented with Keras in Python. The experiments are carried out on a PC running 64-bit Ubuntu 14.04 with a Core i7 processor and 32 GB RAM. The software stack consists of Java 1.8 (JDK), Apache Spark v2.3.0, Keras, and Scala 2.11.8. We use 80% of the data for training with 10-fold cross-validation and assess the performance of the trained network on the remaining 20%, which is held out from the dataset. The deep learning Conv-AE model is trained on an Nvidia TitanX GPU with cuDNN and CUDA in Keras to make the whole training process smooth and fast.
Table 7 illustrates the performance of the proposed ID system for anomaly detection using conventional ML classifiers and for misuse detection using the DL Conv-AE approach. The table reports the best results, which were obtained through random search only. The results show that LR performs poorly, giving low attack detection accuracy, while RF and SVM perform better, giving attack detection accuracy of up to 89%. The largest improvement comes from the Conv-AE approach, which reaches up to 98.20% in misuse attack detection. The main reason for this significant improvement in the performance of the proposed ID system is the superior feature extraction of the CNN and AE deep learning approach.
Table 8 compares the proposed ID system with current solutions on the heterogeneous CSE-CIC-IDS2018 dataset. The CSE-CIC-IDS2018 dataset was released more recently than KDD99 and DARPA; therefore, few experimental results exist for comparison. For the comparison, the best result from each study was chosen in terms of attack detection accuracy. Previous researchers, such as Ferrag et al. [11], Peng et al. [62], Ana et al. [63], Lee et al. [64], and Chadz et al. [65], used various ML and DL approaches, while we used a hybrid approach in the ID domain.
Our proposed approach for anomaly and misuse attack detection achieves better attack detection accuracy than these advanced approaches. This is mainly due to the feature selection approach, which we implement with the Spark machine learning library and the Conv-AE deep learning approach. It is essential to note that this comparison with other approaches is for reference only, because different studies have used different preprocessing methods, traffic proportions, and data distribution techniques.
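The data partitioning described in the experimental settings (80% for training with 10-fold cross-validation, 20% held out for testing) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' Spark or Keras code: the function names `train_test_split_indices` and `k_fold`, the fixed seed, and the sample count of 1000 are all hypothetical.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(indices, k=10):
    """Yield (training, validation) index lists for each of k folds."""
    # Distribute indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical dataset size, for illustration only.
train_idx, test_idx = train_test_split_indices(1000)
splits = list(k_fold(train_idx, k=10))
```

The held-out test indices never appear in any fold, so the final evaluation stays independent of the hyperparameter search performed during cross-validation.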

Overall Analysis
We concentrate on resolving real-life ID problems using big data analytics frameworks (Apache Spark, Apache Hadoop) and AI (ML, DL). Handling this kind of problem is not a simple task because of time and space restrictions. Big data volumes are already very large and still growing, and they require powerful computational resources to support a learning framework that can handle the data efficiently. Therefore, the proposed ID system, which combines the Spark MLlib data analytics framework with the DL Conv-AE approach, can achieve better security against malicious threats.

Conclusions and Outlook
In this research, an ID system is developed using Spark and the Conv-AE deep learning approach, providing a fast, simple, robust, and efficient cybersecurity solution. The proposed ID system based on Conv-AE can automatically and efficiently learn feature representations from the heterogeneous CSE-CIC-IDS2018 dataset. We have implemented the proposed ID system for anomaly detection using various conventional ML classifiers in Spark and for misuse detection using a state-of-the-art deep learning approach, Conv-AE. The proposed ID system outperforms traditional security approaches in terms of attack detection rate and accuracy and also has lower computational complexity. Both the deep learning and machine learning models are assessed with well-known classification metrics, such as attack detection rate, classification accuracy, precision, recall, and F1 score.
We believe that our approach can be extended in the future to numerous fields, such as recognizing anomalies and network misuse on real-time streaming image data, with a focus on exploring deep learning as a feature extraction tool for learning informative data representations in other anomaly detection problems on more modern real-time datasets.
Author Contributions: M.A.K. conceived the research, wrote the paper, designed the framework, and performed the experiments. J.K. assisted with proofreading, revision, and improvements. J.K. supervised the overall research. All authors have read and agreed to the published version of the manuscript.