Performance Comparison and Current Challenges of Using Machine Learning Techniques in Cybersecurity

Abstract: Cyberspace has become an indispensable factor for all areas of the modern world. The world is becoming more and more dependent on the internet for everyday living. The increasing dependency on the internet has also widened the risks of malicious threats. On account of growing cybersecurity risks, cybersecurity has become the most pivotal element in the cyber world to battle against all cyber threats, attacks, and frauds. The expanding cyberspace is highly exposed to the intensifying possibility of being attacked by interminable cyber threats. The objective of this survey is to provide a brief review of different machine learning (ML) techniques and to examine the developments made in detection methods for potential cybersecurity risks. These cybersecurity risk detection methods mainly comprise fraud detection, intrusion detection, spam detection, and malware detection. In this review paper, we build upon the existing literature on applications of ML models in cybersecurity and provide a comprehensive review of ML techniques in cybersecurity. To the best of our knowledge, we have made the first attempt to give a comparison of the time complexity of commonly used ML models in cybersecurity. We have comprehensively compared each classifier's performance based on frequently used datasets and sub-domains of cyber threats. This work also provides a brief introduction to machine learning models and commonly used security datasets. Despite its primary importance, cybersecurity has its constraints, compromises, and challenges. This work also expounds on the current challenges and limitations faced during the application of machine learning techniques in cybersecurity.


Introduction
In this age, cyberspace is growing fast as a primary medium for node-to-node information transfer, with all its charms and challenges. Cyberspace serves as a significant source for accessing an infinite amount of information and resources across the globe. In 2017, the internet usage rate was 48% globally; later it increased to 81% for developing countries [1]. The broad spectrum of cyberspace embraces not just the internet but also the users, the system resources, the technical skills of the participants, and much more. The cyber-world also plays a significant role in causing limitless vulnerabilities to cyber threats and attacks. Cybersecurity is a set of different techniques, devices, and methods used to defend cyberspace against cyber-attacks and cyber threats [2]. In the modern world of computer and information technology, cybercrime is growing faster than current cybersecurity systems can keep up with. Weak system configuration, unskilled staff, and a scanty number of techniques are some of the factors that give rise to vulnerabilities in a computer system [3]. Because of the growing cyber threats, more headway needs to be made in developing cybersecurity methods. Outdated and conventional cybersecurity methods have a substantial downside: they are ineffectual in dealing with unknown and polymorphic security attacks. There is a need for robust and advanced security methods that can learn from experience and detect both previously seen and new, unknown attacks. Cyber threats are increasing significantly, and it is becoming very challenging to cope with the speed of security threats and provide the needed solutions to prevent them [4].
Machine learning: One of the primarily used advanced methods for cybercrime detection is machine learning techniques. Machine learning techniques can be applied to address the limitations and constraints faced by conventional detection methods [5]. Researchers have addressed the advancements, limitations, and constraints of applying machine learning techniques for cyberattack detection and have provided a comparison of conventional methods with machine learning techniques. Machine learning is a sub-field of artificial intelligence. ML techniques are built with the abilities to learn from experiences and data without being programmed explicitly [6]. Applications of ML techniques are expanding in different areas of life, such as education [7,8], medical [9][10][11], business and cybersecurity [12][13][14]. Machine learning techniques are playing their role on both sides of the net, i.e., attacker-side and defender-side. On the attacker side, ML techniques are employed to pass through the defense wall. In contrast, on defense side, ML techniques are applied to create prompt and robust defense strategies.
Cyber threats: Machine learning techniques are playing a vital role in fighting cybersecurity threats and attacks, with applications such as intrusion detection systems [15,16], malware detection [17], phishing detection [18,19], spam detection [20,21], and fraud detection [22], to name a few. We will focus on malware detection, intrusion detection systems, and spam classification for this review. Malware is a set of instructions designed with malicious intent to disrupt the normal flow of computer activities. Malicious code runs on a targeted machine with the intent to harm and compromise the integrity, confidentiality, and availability of computer resources and services [23]. Saad et al. in [24] discussed the main critical problems in applying machine learning techniques for malware detection. They argued that machine learning techniques have the ability to detect polymorphic and new attacks, and that machine learning techniques will take the lead over all other conventional detection methods in the future. The training methods for malware detection should be cost-effective, and malware analysts should be able to understand ML malware detection methods at an expert level. Ambalavanan et al. in [25] described some of the strategies to detect cyber threats efficiently. One of the critical downsides of the security system is that the security reliability level of the computing resources is generally determined by the ordinary user, who does not possess technical knowledge about security.
Another threat to computer resources is a spam message. Spam messages are unwanted and unsolicited messages that consume a lot of network resources along with computer memory and speed. ML techniques are being employed to detect and classify a message as spam or ham. ML techniques have made a significant contribution to detecting spam messages on computers [26,27], SMS messages on mobile devices [28], spam tweets [29], and spam images/videos [30,31]. An intrusion detection system (IDS) protects computer networks from malign intrusions that scan for network vulnerabilities. Signature-based, anomaly-based, and hybrid-based are considered the major classifications of intrusion detection systems for network analysis. ML techniques have made a substantial contribution to detecting different types of intrusions on network and host computers. However, numerous areas, such as the detection of zero-day and new attacks, are considered significant challenges for ML techniques [32].
Threats to validity: For this review, we included the studies that (1) deal with any one of the six machine learning models in cybersecurity, (2) target cyber threats including intrusion detection, spam detection, and malware detection, and (3) discuss the performance evaluation in terms of accuracy, recall, or precision. We used multiple combinations of strings such as 'machine learning and cyber security' and 'machine learning and cybersecurity' to retrieve peer-reviewed journal articles, conference proceedings, book chapters, and reports. We targeted six databases, namely Scopus, ACM Digital Library, IEEE Xplore, ScienceDirect, SpringerLink, and Web of Science. Google Scholar was also used for forward and backward search. We focused on advancements in the last ten years. In total, 2852 documents were retrieved, and 1764 duplicated items were removed. The titles and abstracts were screened to identify potential articles. The full text of 361 studies was assessed for relevance to the inclusion criteria. We excluded the articles that discussed (1) cyber threats other than intrusion detection, spam detection, and malware detection, (2) threats to cyber-physical systems, and (3) threats to cloud security, IoT devices, smart grids, smart cities, and satellite and wireless communication. With backward and forward search, 19 more studies were retrieved. In total, 143 studies were finally selected for data extraction. Figure 1 depicts the process of article selection. Previous survey and review articles were used in addition to these included papers to provide a comprehensive performance evaluation.
Xin et al. in [33] reviewed the critical challenges faced by machine learning techniques and their solutions in a network intrusion detection system. Each ML technique has its pros and cons. No ML technique could be declared the best technique with no limitations. One of the biggest challenges faced by ML techniques is that data collection is a lengthy and laborious procedure. Most of the publicly available datasets are outdated or have missing or redundant values [33]. In contrast, this paper covers other cybersecurity threats and the evaluation of ML models in those areas.
Gandotra et al. in [34] provided a classification of malware analysis into static, dynamic, and hybrid analysis. Moreover, they provided a review of several research papers that applied machine learning techniques to detect malware. However, they targeted only one cyber threat, i.e., malware. Moreover, critical analysis and performance evaluation of machine learning techniques are missing, and there is no description of the state-of-the-art malware datasets. In contrast, our paper targets several cyber threats and provides a description of commonly used datasets. Moreover, the performance evaluation of significant machine learning techniques on frequently used datasets is also presented.
Bhavna et al. in [35] reviewed several papers applying machine learning techniques to detect cyber threats. However, they focused mainly on intrusion detection. Performance evaluation of machine learning techniques and benchmark datasets are also not provided.
Ford et al. in [36] presented a survey on the application of machine learning techniques in cybersecurity. This survey addressed the crucial challenges in applying machine learning technique in cybersecurity. ML techniques are efficiently fighting against the cyberattacks and threats. However, machine learning classifiers themselves are exposed to various cyber and adversarial attacks. There is an immense amount of work needed to improve the safety of ML from adversarial cyberattacks. Jiang et al. in [37] examined the various publications on using machine learning techniques in cybersecurity from 2008 to early 2016. The authors also described that, despite the growing role of machine learning techniques in cybersecurity, the selection of appropriate and suitable machine learning technique for a specific underlying safety problem is still a challenging matter of grave concern.
Hodo et al. in [38] assessed the performance of machine learning techniques in anomaly detection and measured the usefulness of feature selection in ML IDS. They claimed that although convolutional neural network (CNN) classifier could have served as a satisfactory classifier in cybersecurity, still it has not been used to its full potential. Moreover, machine learning models are unable to adequately detect the attacks because of the missing and incorrect signatures in the signature list of an intrusion detection system. Besides, further work is needed to explore the knowledge-based and behavioral-based approaches.
Apruzzese et al. in [39] presented an analysis of machine learning techniques in cybersecurity to detect spam, malware, and intrusions. They asserted that machine learning techniques are vulnerable to cyber threats and that all the methods are still struggling to overcome their limitations and obstacles. The biggest challenge is that the same classifier is mostly used for different kinds of security problems; it is highly desirable to find a suitable classifier for each particular security issue. They also emphasized that all the shortcomings of machine learning techniques should be handled as a matter of deep concern, as cyber attackers are leveraging all their resources.
The communication technologies used by smart grids are leading to cybersecurity deficits. Yin et al. in [40] developed a method to gauge the vulnerable area of distributed network protocol 3 (DNP3) protocol based on IoT-based smart grid and SCADA system. The obtained vulnerability measures were used to develop an attack model for the data link layer, transport layer, and application layer. Furthermore, they developed two algorithms by applying machine learning techniques to transform the data. Authors showed by experimental results that the proposed system classified intrusive fields with detailed information about DNP3 protocol. Peter et al. in [41] discussed three types of malware and central measures that are crucially needed to overcome the security threats. They suggested that cybercrimes can be reduced by continuously updating the cybersecurity policy, decreasing the reaction time and robust segmentation. Ndibanje et al. in [42] presented a classification method for obscure malware detection by using API call as malicious code. They applied similarity-based machine algorithms for feature extraction and claimed to have effective results for obscure detection methods.
Torres et al. in [43] discussed the utilization of machine learning classification techniques applied in cybersecurity. They provided a review of different alternatives for using machine learning models to reduce the error rate in intrusion and attack detection. In contrast, this paper describes the significant challenges and considers several other cyber threats to cybersecurity. Ucci et al. [44] focused on achieving malware detection using key machine learning techniques. They analyzed malware detection using a feature extraction process. They also emphasized that there is an urgent need to update the currently used datasets, as most of the publicly available datasets are outdated. In contrast, this paper provides an overview of commonly used ML models, their complexities, and their evaluations.
Table 1 presents a comparison of this paper with the existing survey and review papers. It can be observed that most of the review papers have not presented a comprehensive review of significant cyber threats. Moreover, none of the papers provided a performance evaluation of well-known machine learning techniques. We have not only provided the performance evaluation but also compared the techniques based on benchmark datasets. Our comparisons in Tables 4-9 depict the performance of each machine learning technique on the detection of significant cyber threats based on frequently used datasets. We have also described the current challenges of using machine learning techniques in cybersecurity that open new horizons for future research in this direction.
Contributions: In this review paper, we build upon the existing literature of applications of ML models in cybersecurity and provide a comprehensive review of ML techniques in cybersecurity. The following are significant contributions to this study: (1) To the best of our knowledge, we have made the first attempt to provide a comparison of the time complexity of commonly used ML models in cybersecurity. We have also described the critical limitations of each ML model. (2) Unlike other review papers, we have reviewed applications of ML models to common cyber threats that are intrusion detection, spam detection and malware detection. This review paper is organized as follows: Section 2 describes an overview of cybersecurity threats, commonly used security datasets, basics of machine learning, and evaluation criteria to evaluate the performance of any classifier. Section 3 provides a comprehensive comparison of frequently used ML classifiers based on different cyber threats and datasets. Section 4 concludes this study and points out the critical challenges of ML models in cybersecurity.

Cybersecurity and Machine Learning
This section is divided into four parts. The first part provides the basics of cyber threats and attacks. The second part describes the commonly used security datasets for computer networks and mobile devices. The third part presents the fundamentals of machine learning and various machine learning algorithms. The fourth part describes different metrics to evaluate a classifier.

Basics of Attacks and Threats
Malicious attacking technologies are progressing faster than defending techniques. Cybersecurity aims to maintain data protection, resource protection, data privacy, and data integrity [53,54]. There are various threats and attacks on cyberspace. Common threats to cyberspace include fraud, malware, spam, phishing, disabling of firewalls and antivirus software, logging of keystrokes, malicious URLs, and probing, to name a few.
Phishing and malware are considered critical threats to cyberspace. Phishing is a method of gaining unauthorized access to data by pretending to be a legitimate user. Sending a link to a web page that poses as a legitimate page and navigates to other links for entering personal information is an example of phishing. In contrast, malware is malicious software that is developed intentionally to gain unauthorized access to the target computer and disrupt the normal flow of activities [55]. Malware detection has three sub-classes, namely static, dynamic, and hybrid detection. In static malware detection, applications are examined for malicious patterns without executing them, whereas dynamic detection is performed while the applications are running. Hybrid detection is a mixture of both detection techniques. Viruses, Trojan horses, and worms are sub-categories of malware. A virus is a piece of malicious code that destroys data on the system without the user's knowledge. A worm is malicious software that illegally consumes system resources by replicating itself. A Trojan horse obtains unauthorized access to data by presenting itself as legitimate software. A Trojan horse does not replicate itself [56,57].
A spam message via email or SMS is another critical threat to the computer and network resources. Spam messages consume a lot of computer memory and network resources. Spam message affects both mobile and computer networks. Spam can be found in the form of email, images, videos, tweets, and spam blogs on mobile and computer networks.
Several defense mechanisms have been installed on network systems to detect unauthorized intrusion and probing. Cybercriminals can scan computer networks for vulnerabilities. There are three categories of intrusion detection based on network analysis: signature-based, anomaly-based, and hybrid-based. Signature-based techniques are used to detect known attacks, whereas anomaly-based detection detects any unusual behavior within the network. Hybrid-based detection is a combination of both detection techniques. There are four categories of cyber-attacks, namely user to root (U2R), remote to local (R2L), probing, and denial of service (DoS). If a user tries to obtain the access rights of a root/admin user, the attack is called U2R. In contrast, if a remote user tries to gain access as a local user, the attack is classified as R2L. If a legitimate user is denied access to the system because the network resources are kept busy, the phenomenon is called DoS. In the case of probing, cybercriminals only scan the network to find weak areas for future attacks.
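The difference between the three detection approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the signature strings, the z-score rule used as the anomaly test, the threshold of 3, and the requests-per-minute baseline are hypothetical and are not taken from any of the surveyed systems.

```python
from statistics import mean, stdev

# Hypothetical signature list of known attack patterns (illustrative only).
KNOWN_SIGNATURES = ["' OR 1=1", "../../etc/passwd"]


def signature_alert(payload: str) -> bool:
    """Signature-based: flag a payload that contains a known attack pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: list[float], z_threshold: float = 3.0) -> bool:
    """Anomaly-based: flag a value that deviates strongly from normal behavior."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


def hybrid_alert(payload: str, value: float, baseline: list[float]) -> bool:
    """Hybrid: raise an alert if either detector fires."""
    return signature_alert(payload) or anomaly_alert(value, baseline)


# Hypothetical baseline of requests per minute observed during normal operation.
baseline_requests_per_min = [48.0, 52.0, 50.0, 47.0, 53.0]

print(signature_alert("GET /index.php?id=1' OR 1=1"))
print(anomaly_alert(400.0, baseline_requests_per_min))
print(hybrid_alert("GET /home", 51.0, baseline_requests_per_min))
```

The signature detector can only recognize patterns already on its list, while the anomaly detector can flag unseen behavior at the cost of depending on how well the baseline represents normal traffic.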

Commonly Used Security Datasets
Machine learning techniques produce better results if the datasets are diverse and consist of data collected in real time. In this sub-section, we discuss the most frequently used security datasets.
Frequently used security datasets are the Defense Advanced Research Project Agency (DARPA) datasets, URL dataset, KDD Cup 99 dataset, Australian Defense Force Academy (ADFA) dataset, HTTP CSIC-2010, Android malware dataset, Android validation dataset, Spambase, and NSL-KDD. The primary purpose of the DARPA dataset is the detection of attacks [58]. The DARPA dataset is based on network traffic and audit logs. It is limited in handling new system variations, and it does not represent real-world network traffic [59]. The ADFA dataset was developed to improve on the DARPA dataset; it overcame the limitations in handling new system variations [60]. The KDD Cup 99 dataset was formed using a subset of the DARPA dataset. A later advancement of the KDD Cup 99 dataset is the NSL-KDD dataset [61], which was proposed to overcome data redundancy and duplicate records. Thus, the NSL-KDD dataset has a reasonable number of records as compared to KDD Cup 99, and it performs better than KDD Cup 99 [61,62]. The primary purpose of the HTTP CSIC-2010 dataset is to detect web attacks. The HTTP CSIC-2010 dataset handles a massive number of web queries and is known as the most long-established and efficient dataset for attack detection in web queries [63]. The Enron dataset is composed of a massive number of emails produced by the Enron Corporation's staff. This dataset is used to classify spam emails. The Spambase dataset is another commonly used dataset to ascertain and refine spam emails. It contains different attributes computed from the collected observations and is publicly available on the UCI ML repository [64]. The URL dataset, an internet traffic-based dataset, was proposed to blacklist malicious URLs [65]. The URL dataset consists of five different types of URLs: phishing URLs, spam URLs, malware URLs, benign URLs, and defacement URLs. The Android malware dataset is an Android apps-based dataset that was proposed to blacklist malicious Android applications [66]. The Android validation dataset was generated to find various relations between 72 real apps by extracting two types of features: metadata and N-grams. The Android validation dataset shows that there are different relationships between apps, for example, siblings, false siblings, step-siblings, and cousins [67].

Basics of Machine Learning
Artificial intelligence (AI) is a branch of computer science that works to find the best possible way to achieve a specific goal by simulating the human brain. Machine learning is a sub-branch of AI that takes the result from experience and uses them as future instructions without being programmed explicitly [68]. ML can be further classified into three major subtypes, namely, supervised learning, unsupervised learning and semi-supervised learning.
In supervised learning, we have prior knowledge of the target classes and labels for the data. In unsupervised learning, there is no previous knowledge of the target classes, and learning is based on identifying patterns in the data. The combination of supervised learning and unsupervised learning is called semi-supervised learning. Deep learning (DL) is another sub-branch of ML with more capabilities. Both ML and DL methods learn from their experience. The difference is that DL executes an action in repeated iterations to achieve the best possible outcome and can repeatedly perform a task without any human interaction. DL solves problems end-to-end, whereas ML techniques follow the concept of divide and conquer, dividing a problem into smaller pieces to generate the outcome. In the last decade, an abundant amount of work has been done to use both these techniques to enrich cybersecurity [69]. The training time is longer for deep learning and shorter for machine learning. In contrast, the testing time is shorter for deep learning and longer for machine learning. Deep learning requires a powerful hardware system to perform, whereas machine learning performs well on a low-end hardware system. Machine learning techniques learn from prior knowledge of labels, whereas deep learning techniques learn from their past mistakes.
There are two main models for deep learning approaches: generative models and deep discriminative models [70]. A deep discriminative model can be further classified into three main classes, namely, recurrent neural networks, deep neural networks, and convolutional neural networks. As the name suggests, in a recurrent neural network, data is stored in nodes, and the nodes are connected to each other in the form of loops [71]. A deep neural network is a widely used approach. There are manifold layers in deep neural networks, and the number of layers always exceeds three. A convolutional neural network is a multilayer network which processes unstructured data to generate output in the form of complex features [72]. Generative/unsupervised models are further divided into four classes, namely, deep belief networks, deep autoencoders, deep Boltzmann machines, and restricted Boltzmann machines. A restricted Boltzmann machine contains two layers. One layer is called the hidden layer, and the second layer is called the visible layer. The hidden and visible layers are completely connected using a set of weights, but there is no connection within the same layer [73]. A deep belief network contains more than one layer, where each layer performs as a restricted Boltzmann machine. In a deep belief network, data is depicted by a visible layer, and the features are represented by a hidden layer [74]. A deep autoencoder achieves less data loss by regenerating the input neurons at the output layer such that the number of input and output neurons remains the same in both layers [75]. The deep Boltzmann machine is a multilayer network which contains multiple hidden and visible layers. Each layer is connected with its neighbor layer, but the connection is entirely undirected, without any connection within a layer [76]. Table 2 presents a summary of frequently used machine learning models for cybersecurity.
In this table, we have mentioned the time complexity of each classification model along with its brief description and limitations. The reference number column gives the reference for the time complexity value of a particular model. The computational cost, i.e., the time complexity of each model, was obtained after a rigorous literature review and web search. In order to have a better detection rate, there is a need to use models with lower time complexity. Generally, models with linear, i.e., O(n), and log-logarithmic complexities are considered best. However, quadratic, i.e., O(n^2), and cubic, i.e., O(n^3), complexities are also acceptable for most practices. O(n^3) is considered slower, and exponential and factorial time complexities are undesirable. It is crucial to use a suitable model as per the situation. There are applications, such as in the military, where it is critical to have a model with a higher detection rate. However, there are medical problems, such as surgical robots, where there is a need for higher trustworthiness instead of a quick response.
• Works on an if-then rule to find the best immediate node, and the process continues until the predicted class is obtained.
• Difficult to change the data without affecting the overall structure. Complex, expensive, and time-consuming.

Evaluation Criteria
A confusion matrix, or error matrix, is used to formally gauge the performance of a classification model. Table 3 presents an error matrix that divides the classification results into four categories, namely true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Other evaluation metrics are formed based on these four categories.

Precision
Precision is the ratio of the number of correctly classified positive samples (TP) to the total number of samples classified as positive (TP + FP), i.e., precision = TP / (TP + FP).

Recall
Recall is the ratio, usually expressed as a percentage, of the number of correctly classified positive samples (TP) to the total number of actual positive samples (TP + FN), i.e., recall = TP / (TP + FN).

Accuracy
Accuracy is the ratio of the total number of correctly classified samples (TP + TN) to all the samples in the dataset, i.e., accuracy = (TP + TN) / (TP + TN + FP + FN).

ROC Curve
The receiver operating characteristic (ROC) curve summarizes a classifier's performance across all decision thresholds, with the true positive rate on the y-axis and the false positive rate on the x-axis.
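As a minimal sketch of how such a curve can be traced, the snippet below sweeps a threshold over classifier scores and records one (false positive rate, true positive rate) point per threshold. The scores and labels are hypothetical, and the trapezoidal area computation is an addition for illustration, not something described in the text.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per candidate threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more samples as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores (higher means "more likely an attack") and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A curve that hugs the top-left corner, and hence an area close to 1, indicates a classifier that separates the two classes well at most thresholds.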

Error Rate
The ratio of the total number of misclassified samples to all the samples in the dataset is called the error rate.
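The metrics in this section all follow from the four confusion matrix counts. The sketch below computes them using the standard definitions; the counts passed in are hypothetical and serve only to show the arithmetic.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp),      # correct positives among predicted positives
        "recall": tp / (tp + fn),         # correct positives among actual positives
        "accuracy": (tp + tn) / total,    # all correct predictions among all samples
        "error_rate": (fp + fn) / total,  # all misclassifications among all samples
    }


# Hypothetical counts for a detector evaluated on 1000 samples.
metrics = classification_metrics(tp=90, tn=880, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to one, which is why a highly imbalanced dataset can show high accuracy while precision or recall remains poor.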

Performance Comparison of Machine Learning Models Applied in Cybersecurity
Researchers are investigating machine learning techniques to detect different cybercrimes in cybersecurity. We have provided a detailed discussion of various cyber threats in Section 2. Furthermore, we have briefly presented an overview of frequently used security datasets in Section 2. This section provides a comprehensive survey of each ML model applied to deal with different cyber threats. Subsequent lines will explain the description of each column in Tables 4-9. The ML technique columns describe the considered machine learning model. We have considered six ML models for this study: random forest, support vector machine, naïve Bayes, decision tree, artificial neural network, and deep belief network.
We focus on three critical cyber threats, namely intrusion detection, spam detection and malware detection. The domain columns state the significant cybersecurity threats considered for this review. The reference number and year columns depict the citation number of each article and published year, respectively. The values of approach or sub-domain columns are different for each cyber threat. IDS domain has three values that are anomaly-based, signature/misuse-based and hybrid-based. Malware has three further sub-classifications that are static, dynamic and hybrid. In the case of spam, sub-domains correspond to the medium in which the authors tried to identify the spam such as image, video, email, SMS and tweets. A description of each sub-domain/approach has been provided in Section 2. Finally, the result attribute presents the evaluation of each classifier applied in a particular sub-domain of cyber threat on a specific dataset and provided in the cited paper mentioned in the reference column.

Support Vector Machine
The principal advantage of the support vector machine (SVM) is that it produces the most successful results for cybersecurity tasks. SVM places the data classes on opposite sides of a hyperplane and separates the classes based on the notion of the margin. Support vectors are those points that lie on the boundary of the margin around the hyperplane. The major drawback of the support vector machine is that it consumes an immense amount of space and time. SVM requires data trained on different time intervals to produce better results for a dynamic dataset [83].
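The roles of the hyperplane, the margin, and the support vectors can be made concrete with a minimal sketch of a linear SVM at prediction time. The weight vector `w` and bias `b` below are hypothetical values standing in for an already-trained model; training itself is not shown.

```python
import math

# Hypothetical parameters of an already-trained linear SVM (illustrative only).
w = [2.0, -1.0]
b = -1.0


def decision_value(x):
    """Signed score w.x + b; the hyperplane is where this value equals zero."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x):
    """Assign the class according to the side of the hyperplane."""
    return 1 if decision_value(x) >= 0 else -1


def margin_width():
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def is_support_vector(x, tol=1e-9):
    """A point on a margin boundary has a decision value of +1 or -1."""
    return abs(abs(decision_value(x)) - 1.0) <= tol


print(predict([3.0, 1.0]), predict([0.0, 2.0]))
print(is_support_vector([1.0, 0.0]), is_support_vector([3.0, 1.0]))
print(margin_width())
```

Only the support vectors determine where the hyperplane lies, so removing any other training point would leave the decision boundary unchanged.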
SVM showed an accuracy of 99.30% with the KDD Cup 99 dataset for IDS [84]. The best reported accuracy is 96.92% for malware detection using the Enron dataset [85] and 96.90% for classifying spam emails with Spambase [86]. The best reported recall for SVM is 82% for intrusion detection [87], 100% for malware detection [88], and 98.60% for spam detection [89]. The best reported precision for SVM is 74% for intrusion detection [90], 96.16% for malware detection [91], and 98.60% for spam detection [89]. A detailed performance comparison of SVM on various cyber threats with frequently used datasets is presented in Table 4.

Decision Tree
Decision tree (DT) belongs to the category of supervised machine learning. A DT consists of paths and two kinds of nodes: root/intermediate nodes and leaf nodes. A root or intermediate node represents an attribute and is followed by paths that correspond to the possible values of that attribute. A leaf node represents the final decision/classification class. A decision tree finds the best immediate node by following if-then rules [106]. DT achieved a reported accuracy of 99.96% for anomaly-based IDS with the KDD dataset [107]. With the standard SMOTE dataset, DT shows an outstanding accuracy of 96.62% for malware detection [108]. With the Enron dataset, DT correctly classified ham emails with an accuracy of 96% [88]. The best reported recall for DT is 98.10% for intrusion detection [90], 96.70% for malware detection [109], and 96.60% for spam detection [89]. The best reported precision for DT is 99.70% for intrusion detection [90], 99.40% for malware detection [110], and 98% for spam detection [88]. A detailed performance comparison of the decision tree on various cyber threats with frequently used datasets is presented in Table 5.
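The if-then traversal from the root to a leaf can be sketched as follows. The tree is hand-built for illustration: the feature names, thresholds, and class labels are hypothetical and were not learned from any of the datasets discussed here.

```python
# Hypothetical hand-built tree; features and thresholds are illustrative, not learned.
TREE = {
    "feature": "failed_logins",
    "threshold": 3,
    "left": {"label": "normal"},
    "right": {
        "feature": "bytes_sent",
        "threshold": 10_000,
        "left": {"label": "normal"},
        "right": {"label": "intrusion"},
    },
}


def classify(node, sample):
    """Follow if-then rules from the root until a leaf node is reached."""
    while "label" not in node:
        # If the attribute value is at most the threshold, take the left path.
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(classify(TREE, {"failed_logins": 1, "bytes_sent": 50_000}))
print(classify(TREE, {"failed_logins": 5, "bytes_sent": 50_000}))
```

Each root or intermediate node tests one attribute, and the leaf that is finally reached carries the predicted class.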

Deep Belief Network
A deep belief network (DBN) consists of several middle layers of restricted Boltzmann machines (RBMs) organized greedily. Every layer communicates with the layer behind it and the layer ahead of it. There is no lateral communication between the nodes within a layer. Every layer serves as both an input layer and an output layer, except the first and the last layers. The last layer functions as a classifier. Deep belief networks are primarily used for image clustering and image recognition, and they also deal with motion capture data. The deep belief network has shown an accuracy of 97.50% for IDS [125], 91.40% for malware detection [126], and 97.43% for spam detection [127] with the KDD, KDD CUP99, and Spambase datasets, respectively. The best reported recall for DBN is 99.70% for intrusion detection [128], 98.80% for malware detection [129], and 98.02% for spam detection [130]. The best reported precision for DBN is 99.20% for intrusion detection [128], 95.77% for malware detection [131], and 98.39% for spam detection [130]. A detailed performance comparison of DBN across various cyber threats on frequently used datasets is presented in Table 6.

Artificial Neural Network
An artificial neural network (ANN) classifier consists of input, hidden, and output layers of neurons and operates in two stages. The first stage is called feedforward. In this stage, each hidden layer receives values from the input nodes, and the error is calculated based on the input layer and the activation function. In the second stage, namely the feedback stage, the error is sent back to the input layer, and the process continues in iterations until the correct result is obtained [136]. The artificial neural network showed an accuracy of 97.53% for IDS [137], 92.19% for malware detection [138], and 92.41% for spam detection with the NSL-KDD, VX Heavens, and Spambase datasets, respectively. The best reported recall for ANN is 98.94% for intrusion detection [139] and 94% for spam detection [140]. The best reported precision for ANN is 97.89% for intrusion detection [139], 88.89% for malware detection [141], and 95% for spam detection [142]. A detailed performance comparison of ANN across various cyber threats on frequently used datasets is presented in Table 7.
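The two-stage loop (feedforward, then feedback of the error) can be sketched in a deliberately reduced form. As a simplifying assumption, the example uses a single sigmoid neuron with no hidden layer, trained by plain gradient descent on a made-up logical-AND dataset; a real ANN would propagate the error through its hidden layers in the same spirit. The learning rate and epoch count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: logical AND. Inputs and targets are illustrative only.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0.0, 0.0, 0.0, 1.0]

def forward(w, b, x):
    """Feedforward stage: weighted sum followed by the activation function."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def mean_loss(w, b):
    """Average cross-entropy error over the toy data."""
    total = 0.0
    for x, t in zip(X, y):
        p = forward(w, b, x)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(X)

def train(epochs=2000, lr=0.5):
    """Repeat feedforward and feedback until the epoch budget is used up."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, t in zip(X, y):
            err = forward(w, b, x) - t   # output error for this sample
            grad_w[0] += err * x[0]      # feedback stage: the error is sent
            grad_w[1] += err * x[1]      # back to adjust each weight
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

initial_loss = mean_loss([0.0, 0.0], 0.0)
w, b = train()
final_loss = mean_loss(w, b)
predictions = [round(forward(w, b, x)) for x in X]
```

Comparing `initial_loss` with `final_loss` shows how the iterations reduce the error, which is the behavior the feedback stage is meant to produce.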

Random Forest
Random forest (RF) carries out its task by combining the predictions generated by different decision trees. RF forms a hypothesis to obtain a result [127]. RF falls under the category of ensemble learning and is also termed random decision forest. RF is considered an improved version of CART, which is a sub-type of decision tree.
RF has shown an accuracy of 99.95% for IDS [149], 95.60% for malware detection [150], and 99.54% for spam detection [151] with the KDD, VirusShare, and Spambase datasets, respectively. The best reported recall for RF is 99.95% for intrusion detection [149], 97.30% for malware detection [109], and 97.20% for spam detection [89]. The best reported precision for RF is 99.80% for intrusion detection [152], 98.58% for malware detection [98], and 98.60% for spam detection [153]. A detailed performance comparison of RF across various cyber threats on frequently used datasets is presented in Table 8.
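The combination step can be sketched with a majority vote. This is only the aggregation part of the idea: the three "trees" below are hand-written one-rule stumps with hypothetical feature names, whereas an actual random forest would train each tree on a bootstrap sample with randomly chosen features.

```python
from collections import Counter

# Three hand-written "trees" (one-rule stumps). Feature names and
# thresholds are hypothetical and chosen only for illustration.
def tree_links(record):
    return "spam" if record["num_links"] > 2 else "ham"

def tree_exclamations(record):
    return "spam" if record["exclamation_marks"] > 3 else "ham"

def tree_sender(record):
    return "ham" if record["sender_known"] else "spam"

TREES = [tree_links, tree_exclamations, tree_sender]

def forest_predict(record):
    """Collect one vote per tree and return the majority class."""
    votes = Counter(tree(record) for tree in TREES)
    return votes.most_common(1)[0][0]
```

With an odd number of trees and two classes a tie cannot occur, so a message flagged by two of the three stumps is labeled "spam" even if the third disagrees.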

Naïve Bayes
The major limitation of the Naïve Bayes (NB) classifier is that it assumes that every attribute is independent and that none of the attributes has a relationship with any other. This state of independence is technically impossible in cyberspace. Hidden NB is an advanced form of Naïve Bayes, and it gives 99.6% accuracy [162]. Naïve Bayes showed an accuracy of 99.90% with the DARPA dataset for IDS [163]. The best reported accuracy for malware detection is 99.50% using the NSL-KDD dataset [164]. With the Spambase dataset, Naïve Bayes showed a considerable accuracy of 96.46% in classifying spam and ham email [86]. The best reported recall for NB is 100% for intrusion detection [33], 95.90% for malware detection [164], and 98.46% for spam detection [86]. The best reported precision for NB is 99.04% for intrusion detection [163], 97.50% for malware detection [109], and 99.66% for spam detection [86]. A detailed performance comparison of NB across various cyber threats on frequently used datasets is presented in Table 9.
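The independence assumption is easiest to see in code: the score of a class is the class prior multiplied by one per-attribute probability at a time, with no term linking attributes to each other. The sketch below assumes a word-count (multinomial) variant with add-one smoothing and a tiny made-up corpus; it is not the configuration used in any of the cited studies.

```python
import math
from collections import Counter

# Tiny illustrative corpus; the messages are made up.
training_data = [
    ("win free prize now", "spam"),
    ("free money win", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project meeting notes", "ham"),
]

class_counts = Counter(label for _, label in training_data)
word_counts = {label: Counter() for label in class_counts}
for text, label in training_data:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label):
    """log P(label) plus the sum of log P(word | label) over the words.

    Summing per-word terms is exactly the independence assumption:
    each word contributes on its own, regardless of the other words.
    """
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(training_data))
    for word in text.split():
        # Add-one (Laplace) smoothing avoids zero probability for unseen words.
        score += math.log((word_counts[label][word] + 1)
                          / (total_words + len(vocab)))
    return score

def classify(text):
    """Return the label with the highest log-posterior score."""
    return max(class_counts, key=lambda label: log_posterior(text, label))
```

Because the words are scored separately, a phrase whose meaning depends on word combinations is treated no differently from the same words in any other order, which is the limitation noted above.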

Discussion and Conclusions
Machine learning techniques have become an integral part of the modern cyber world, particularly for cybersecurity. Machine learning techniques are being applied on both sides, i.e., the attacker side and the defender side. On the attacker side, machine learning techniques are being used to find new ways to pass through and evade security systems and firewalls. On the defender side, these techniques are helping security professionals to protect security systems from illegal penetration and unauthorized access. This paper presents a comparative analysis of machine learning techniques applied to detect cybersecurity threats. We have considered three significant threat areas in cyberspace: intrusion detection, spam detection, and malware detection. We have compared six machine learning models, namely, random forest, support vector machine, naïve Bayes, decision tree, artificial neural network, and deep belief network. We have further compared these models on the sub-domains of each cyber threat, which differ from threat to threat. Anomaly-based, signature-based, and hybrid-based detection are the sub-domains considered for intrusion detection. For malware detection, the sub-domains are static detection, dynamic detection, and hybrid detection. The sub-domains for spam are the media on which the models are applied to classify spam, such as images, videos, emails, SMS, or calls. Section 2 described each sub-domain of threat in detail. This section is divided into two parts. The first part discusses the performance of various ML models applied in cybersecurity. The second part presents the challenges of using machine learning models in cybersecurity and concludes the study. Figure 2 shows the performance comparison of six machine learning techniques on frequently used datasets for intrusion detection. We have picked the values from the given tables that show the maximum value for accuracy, precision, and recall for each dataset.
SVM has shown an outstanding accuracy of nearly 98% on the KDD dataset, whereas the highest accuracy reported for SVM on the NSL-KDD dataset was 83%. DBN performed outstandingly on nearly all datasets and showed an accuracy above 95% for intrusion detection. On the DARPA dataset, NB and ANN achieved better accuracy than the other models, but ANN gave a worse precision value on the DARPA dataset. On the NSL-KDD dataset, DBN performed best among the models with respect to accuracy, precision, and recall. SVM and DBN achieved excellent precision on the KDD Cup 99 dataset compared with all other models. On the KDD dataset, decision tree and random forest showed an excellent precision rate among all the models. Random forest on KDD, NB on DARPA, DBN on NSL-KDD, and NB on KDD Cup 99 showed the best recall rates, respectively.
Figure 3 shows the performance evaluation of six machine learning techniques on frequently used datasets to detect malware. We have observed that there are not many benchmark datasets available for malware detection. Mostly, researchers collected their own customized datasets and applied machine learning techniques to evaluate the models. We have also noticed that machine learning techniques often show outstanding accuracy, precision, and recall values on the customized dataset. These proposed techniques do not show similarly strong performance when applied to other datasets. Classical machine learning techniques, e.g., decision tree, performed better on several datasets. DBN showed an outstanding recall value on almost all datasets. DT and RF achieved a better precision rate on the VirusShare dataset. RF showed excellent recall, precision, and accuracy values on the Enron dataset. ANN showed the worst performance on the Enron dataset for accuracy, recall, and precision.
With respect to accuracy, NB and DT performed excellently on the VirusShare dataset compared to other models.
Figure 4 shows the performance evaluation of machine learning techniques on frequently used datasets for spam classification. Spambase is a well-known spam dataset, and NB performed better than the other machine learning models on it with respect to accuracy, precision, and recall. Researchers have also collected a dataset from Twitter containing millions of tweets. RF outperformed the other models, with a reported precision of more than 97%. All evaluated machine learning models have shown more than 90% accuracy in detecting and classifying spam. SVM and DBN have shown better accuracy, recall, and precision than the other models when applied to the SMS collection to classify spam text messages. It can also be observed that the SMS collection is a customized dataset collected by the researcher. Every machine learning model gave more than 95% accuracy, precision, and recall on it, whereas the same machine learning models have different performance values on standard datasets, i.e., Enron and Spambase.
Figure 5 shows the comparative analysis of accuracy, precision, and recall values for the detection of intrusion, spam, and malware. We have taken the maximum value obtained by the six machine learning models regardless of the dataset. The figure shows that SVM, DT, and RF gave the maximum accuracy and precision values for intrusion detection. However, DBN and ANN reported the best recall value for intrusion detection. SVM, DT, NB, and RF are recommended for intrusion detection if accuracy is the priority. DBN and ANN performed comparatively worse than the other models in detecting malware. However, ANN has shown an exceptional recall value for the detection of malware. RF and NB have shown better accuracy for the classification of spam, yet DBN is again recommended when precision and recall are the priority. Keeping in view the metrics collected from the reviewed papers, RF and NB are recommended for the classification of spam, and SVM and DT for the detection of malware, for better accuracy.

Challenges of Using ML Models in Cybersecurity and Future Directions
The application of machine learning techniques to detect several cyber threats has shown better results than conventional methods. Despite these improvements, machine learning techniques still face many challenges in the cybersecurity domain. We have presented a comparison of machine learning techniques based on frequently used datasets. The unavailability of benchmark and up-to-date datasets for training machine learning models is a big challenge. Another inconsistency is that the same dataset yields different results with the same technique for the same sub-domain. In Table 4, SVM is applied to detect anomaly-based intrusion by [92,93], but with a difference of 10% in accuracy. The same can be seen in the case of [98,99] in Table 4, [116,117] in Table 5, [151,160] in Table 8, and [86,103] in Table 9, to name a few.
The imbalance in accuracy may be due to the selection of different feature extraction methods or of different data for testing and training. However, this discrepancy in results creates confusion when selecting a suitable classifier for a particular problem. In addition to these problems, the available datasets need to be rationalized by removing redundant, noisy, missing, and unbalanced data. These datasets should be kept up to date with modern and sophisticated attacks, and missing values should be detected and removed from the required data.
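The cleaning steps mentioned above (removing redundant and missing records, and checking for unbalanced classes) can be sketched in a few lines. The record layout `(duration, bytes_sent, label)` and all values are made up for illustration, and dropping incomplete records is only one possible policy for missing data.

```python
from collections import Counter

# Made-up records: (duration, bytes_sent, label); None marks a missing value.
raw = [
    (1.2, 300, "normal"),
    (1.2, 300, "normal"),     # exact duplicate (redundant)
    (0.4, None, "attack"),    # missing value
    (5.0, 9000, "attack"),
    (0.9, 120, "normal"),
    (2.1, 450, "normal"),
]

def clean(records):
    """Drop records with missing fields, then exact duplicates (order kept)."""
    seen, cleaned = set(), []
    for rec in records:
        if any(field is None for field in rec):
            continue
        if rec in seen:
            continue
        seen.add(rec)
        cleaned.append(rec)
    return cleaned

def class_balance(records):
    """Fraction of records per label, to expose unbalanced classes."""
    counts = Counter(rec[-1] for rec in records)
    return {label: n / len(records) for label, n in counts.items()}

cleaned = clean(raw)
balance = class_balance(cleaned)
```

In this toy case the cleaned data is three-quarters "normal", which is the kind of skew that would need attention before training.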
The detection speed for a particular cyber threat and prompt action are critical challenges for machine learning models. The time complexity of machine learning models matters in this case. Faster and more robust models will detect a cyber threat early and stop it from creating problems for the network and system. Therefore, to achieve a better detection rate and prompt action, models with lower time complexity are recommended. Generally, models with linear complexity, such as O(n), and log-logarithmic complexity are considered best. The choice of machine learning model will vary from case to case. The time complexity of each model was obtained after a rigorous literature review and web search. We have provided the time complexity of significant machine learning models used in cybersecurity in Table 2. Efficient models are essential to produce the best detection rate and a quick response in some scenarios, such as the military. If a model's detection is slow, attackers can evade the model and harm the system before any preventive measures are taken.
Deep learning can handle data without human interaction; however, it still has several limitations. In comparison to machine learning techniques, deep learning techniques need an enormous amount of data and costly hardware components to produce better results. A substantial amount of time and power is required to process larger datasets. To train the model, high-performance computing hardware such as GPUs and parallel processing is needed to expedite the learning and classification processes. The growing rate of unlabeled, sparse, and missing data also affects the training process of the models. High computational efficiency is needed where maximum throughput must be achieved with limited resources.
In some scenarios, a high level of correctness matters more than the speed and response of prediction. Trustworthiness is essential when machine learning techniques are applied in life-critical or mission-critical applications such as self-driving cars. For a self-driving car, correct image classification is critical to reading a traffic sign. Predictions cannot be accepted with blind faith where robots perform the treatment and surgery of cancer patients.
We have also observed that, even in 2020, researchers are applying and testing the latest machine learning and deep learning techniques on outdated datasets. It is apparent from the year column of Tables 4-9 that the latest machine learning techniques are being tested on the DARPA and KDD Cup datasets, which are more than 15 years old. There is a need for the latest, benchmark, and real-time datasets to evaluate the latest machine learning and deep learning models. There are benchmark datasets for intrusion detection, such as DARPA and KDD Cup. However, we have found a deficiency of state-of-the-art datasets for spam and malware detection problems. Researchers are applying the latest machine learning techniques to their customized datasets. They claim better accuracy for their models without disclosing the datasets and code needed to reproduce the results. Customized datasets are often collected in a particular fashion and lack diversity, and the proposed models perform well on those datasets. However, when the same models are tested on a different dataset from a similar problem domain, they do not show the same results as claimed by the authors on their customized datasets. Furthermore, researchers are publishing the performance evaluation of their proposed models using different metrics. Some publish the recall while others focus only on accuracy. There should be standardized metrics to compare the performance of models. It can be observed from Tables 4-9 that most researchers have published the accuracy of their model and left out other metrics. A false negative describes the case in which an illegitimate user has been granted access to a system or network. Such a case has a drastically worse effect on the system than the accuracy figure alone would suggest. Standard metrics would make it possible to compare models that are currently evaluated with different measurements. This would then be a milestone for future research to improve the performance of models.
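The point about accuracy hiding false negatives can be made concrete with a small worked example. The labels below are made up: 95 benign connections and 5 attacks, with a hypothetical detector that misses 4 of the 5 attacks. The formulas are the standard definitions of accuracy, precision, and recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1   # an attack that slipped through undetected
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(y_true, y_pred):
    """Accuracy, precision, and recall from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_negatives": fn,
    }

# Made-up labels: 95 benign connections (0) and 5 attacks (1).
y_true = [0] * 95 + [1] * 5
# Hypothetical detector output: only the first attack is flagged.
y_pred = [0] * 95 + [1] + [0] * 4

result = metrics(y_true, y_pred)
```

Here accuracy is 96% and precision is 100%, yet recall is only 20% because four attacks go undetected, which is why reporting accuracy alone can be misleading.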
Further, there is a need for robust machine learning models that can handle adversarial inputs. There should be an emphasis on training models in adversarial settings to make them robust against such inputs. We have reviewed six machine learning models on several datasets for detecting cyber threats, and we encourage beginners in this domain to delve into the extensive bibliography presented in this review paper. In future work, we will analyze more ML and DL techniques against several other cybersecurity threats. We will evaluate the ML models in other areas of cybersecurity, such as IoT, smart cities, methods based on API calls, cellular networks, and smart grids.