A Survey of the Recent Trends in Deep Learning Based Malware Detection

Monitoring Indicators of Compromise (IOCs) supports malware detection by identifying malicious activity that can lead to a system breach or data compromise. Various tools and anti-malware products detect malware and cyberattacks using IOCs, but all have shortcomings. For instance, anti-malware systems rely on malware signatures, which requires a signature database that is constantly updated. This technique also fails against zero-day attacks and variants of existing malware. In the quest to fight zero-day attacks, research shifted from primitive methods, which cope poorly with the anti-analysis techniques used in zero-day attacks, to classical machine learning-based methods. Machine learning methods, however, come with their own limitations. These include, but are not limited to, the latency introduced by the feature-engineering phase over the entire training dataset, which conflicts with the requirement for real-time analysis. Likewise, the additional layers of data engineering needed to handle the increasing volume of data introduce further delays. These limitations led to the use of deep learning-based methods for malware detection. With zero-day malware appearing rapidly, researchers have also experimented with few shot learning so that reliable malware detection solutions can be produced even when only a small amount of training data is available. In this paper, we survey several strategies that support the real-time detection of malware and propose a hierarchical model for discovering security events or threats in real time. A key focus of this survey is the use of Deep Learning-based methods. 
Deep Learning based methods dominate this research area by providing automatic feature engineering, the capability of dealing with large datasets, enabling the mining of features from limited data samples, and supporting one-shot learning. We compare Deep Learning-based approaches with conventional machine learning based approaches and primitive (statistical analysis based) methods commonly reported in the literature.


Introduction
According to the Panda Security report [1], hackers create around 230,000 malware samples daily, a number expected to grow in the coming years. According to an FBI report [2], ransomware is one of the fastest-growing threats, with over 4000 ransomware attacks occurring every day since 2016. Ransomware can target home users as well as small and large businesses, and can cause the temporary or permanent loss of sensitive information [3]. Critical infrastructure is the most alluring target for attackers who are well versed in the damage that ransomware can cause. Ransomware is a type of malware that uses an encryption module to encrypt data and make it unusable for the user [4]. Over the past few decades, ransomware has affected not only small businesses but also large organizations such as FedEx, Nissan, the Russian and German railways, and NHS organizations in the UK [5]. According to a report by Kaspersky [6], spam emails are a constant feature of phishing, and this trend is unlikely to change soon. Symantec's Internet Security Threat Report of 2019 [7] stated that supply chains remained a soft target, with attacks increasing by 78% in 2019 compared to the previous year. The same report mentions blocking 69 million cryptojacking events in 2018, a fourfold increase compared to 2017. Small businesses are severely affected by cyber-attacks: according to statistics from 2019, 40% of small companies were attacked, of which only 13% could detect and mitigate the attacks [8]. Due to the economic losses caused by cyber-attacks, 60% of small companies collapsed. Accenture reports that companies spend US $2.4 M on malware detection and defense against web-based attacks. Cyber-attacks have also caused considerable chaos in critical infrastructure. State-sponsored attackers have recently been found launching attacks on industrial control systems. One of the most prominent examples of such malware is Stuxnet, which was designed to choke the operation of the centrifuges of the Iranian nuclear power plant [9,10]. Cyber-physical systems are applied in almost all critically important areas, such as traffic lights, health care, power generation, the water industry, and transportation systems [11]. The communication of these cyber-physical systems with networks makes them vulnerable, and the statistics suggest that many stealthy attacks delivering different malicious payloads can be expected [12]. Malfunctioning of such important systems can cause severe accidents and damage. To protect the cyber-physical systems working in these crucial areas, researchers have been trying to devise anti-malware systems. Many tools and anti-virus products are available on the market for detecting malware and cyberattacks; however, they have inherent shortcomings. Anti-virus products rely on malware signatures, and the signature database needs to be constantly updated. This technique also does not work for zero-day attacks or for new variants of existing malware (which can have a different signature).
Various strategies have been implemented to speed up the real-time detection of different types of malware, as explained in Appendix A.1, so that the effect of the malware can be mitigated. A taxonomy of malware analysis is explained in Appendix A.2 and illustrated in Figure 1: static analysis focuses on detecting a malicious file without executing it, whereas dynamic analysis works by first executing the file. A hybrid strategy combines static and dynamic analysis. Various approaches have been reported in the literature to detect malicious behavior and files, involving: (i) statistical data analysis-based research for malware classification; (ii) machine learning methods (including Deep Learning) for malware detection and identification.
The key motivation has been to develop the capability of detecting and identifying malware cost-effectively and in real time so that its effects can be mitigated. Several survey papers in the cyber security domain review work on malware detection. Unlike those surveys, this paper does not focus on a single strategy; instead, we gather the research trends in malware detection from various application areas of data science and AI. Table 1 shows the comparison between our work and other survey papers.

Coverage | Other Papers | Our Survey Paper
Survey of statistical based methods for malware detection | [13,14] | Yes
Survey of machine learning based algorithms for malware detection | [15] | Yes
Survey of deep learning based techniques to detect malware | [13,16,17] | Yes
Analysis of problems associated with statistical based approaches of detecting malware | [18] | Yes
Analysis of shortcomings of machine learning based solutions for detecting malware | [15] | Yes
Analysis of disadvantages of using deep learning based methods to detect malware | [13,16] | Yes
Survey of FSL methods in the domain of malware detection | - | Yes

The contributions of this work are as follows:
• Description of malware classification and identification strategies
• Mechanisms for classifying and detecting malware and a comparative analysis between these methods
• Potential issues and challenges in the different categories of proposed solutions
• The future direction of research in this domain

This paper is organized in the following order (shown in Figure 2): Section 2 describes the methods used in the different trends in malware detection. Section 3 presents the comparative analysis of these trends. It also discusses the issues and challenges faced in each trend. Section 4 highlights future trends in the domain of malware identification and classification.

Trends in Malware Detection
Information is today one of the most valued but vulnerable assets. Evolving, sophisticated malware poses a constant threat of serious damage to infrastructure. Various techniques, trends, and strategies have been proposed to alleviate the threats posed by malicious code. These methods range from primitive malware detection based on statistical analysis to machine learning-based methodologies, specifically deep neural networks. Because this paper is concerned with malware detection methodologies, it is important to review the evolution of malware identification and detection. In this section, a hierarchy is built to represent this development of malware detection according to the methodology used.

Malware Detection with Primitive Methods (Statistical Analysis Based Methods)
Malware detection has been performed with a variety of techniques, and many researchers have explored different practices for malware discovery and recognition. Ref. [19] focused on detecting malicious patterns in executables. The authors of [19] argued that malware detection has become an obfuscation-deobfuscation game, and therefore examined obfuscation techniques to check whether existing anti-virus products can cope with the variability that obfuscation introduces. They implemented SAFE (Static Analyzer for Executables), which is claimed to detect malicious patterns in executables. They also developed an obfuscator for executables that applies four different obfuscation techniques, and tested anti-virus scanners on obfuscated variants of existing malicious executables. Ref. [19] presented a general architecture for detecting malicious patterns in executables with two main components: a program annotator and a malicious code detector. The obfuscation transformations supported by the obfuscator in [19] are register reassignment, dead-code insertion, code transposition, and instruction substitution.
Ref. [20] used a heuristic approach for detecting malware by analyzing the Windows binary files of obfuscated executables. They proposed a framework that generates a risk score by statically analyzing the Windows PE file (see Appendix B) for 8 characteristics (abnormal ordinals, Nonstd_name, In_code, TLSection, DLL_no_export, Flagged Section Name, Low function Call, Other_badPEformat). The framework assigns a weight and a risk score to each characteristic. The risk score is assigned based on experience and on a comparison between malware and benign files. A total of 2014 Windows files were used in the experiments.
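The weighted scoring idea can be illustrated with a minimal sketch. The numeric weights, the decision threshold, and the function names `risk_score` and `is_suspicious` are assumptions made for illustration; they are not the values or interfaces used in [20].

```python
# Hypothetical weights for the eight PE characteristics listed in [20].
# The numeric values are illustrative placeholders, not taken from the paper.
WEIGHTS = {
    "abnormal_ordinals": 2.0,
    "Nonstd_name": 1.5,
    "In_code": 1.0,
    "TLSection": 1.0,
    "DLL_no_export": 1.5,
    "Flagged_Section_Name": 2.0,
    "Low_function_Call": 1.0,
    "Other_badPEformat": 2.5,
}


def risk_score(flags):
    """Sum the weights of every characteristic flagged as present."""
    unknown = set(flags) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown characteristics: {sorted(unknown)}")
    return sum(WEIGHTS[name] for name, present in flags.items() if present)


def is_suspicious(flags, threshold=4.0):
    """Flag a file when its score reaches an assumed threshold."""
    return risk_score(flags) >= threshold


# Made-up static analysis result for one PE file.
sample = {"TLSection": True, "Other_badPEformat": True, "In_code": False}
print(risk_score(sample), is_suspicious(sample))
```

In practice the weights would be tuned by comparing malware and benign files, as the framework description suggests.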
Ref. [21] focused primarily on malware detection through the statistical use of opcodes. In their methodology, the frequency of opcodes appearing in malware and benign files is calculated first, and a statistics-based discrimination ratio is then computed from which weights for opcode sequences are obtained. The similarity between two executables is then computed using the weights of the opcode sequences. A total of 13,189 malware executables were collected from the VxHeavens website, and 13,000 benign files were collected from the authors' own computer. A basic assembler is used to disassemble the executables. After obtaining the assembly files, a profile of opcode frequencies is maintained, containing the unnormalized frequency of opcodes appearing in both datasets. Next, the relevance of each opcode is calculated as the mutual information between the opcode and the classification class. Finally, malware opcode sequences are extracted and their frequency of appearance is calculated to detect maliciousness. After calculating the weighted term frequency, a vector of weighted opcode sequence frequencies is obtained. In the experiments, opcode sequences of lengths 1 and 2 were first extracted separately and their similarity across malware and benign executables was calculated; because they appeared with almost the same frequency in both datasets, sequences of lengths 1 and 2 were then combined to compare their appearance in the two datasets. Malware variants show high similarity in terms of opcode sequence frequency, whereas the similarity between the malware and benign datasets is low.
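The pipeline described above (opcode sequences, weighted term frequency, similarity between executables) can be sketched as follows. This is an illustrative approximation under stated assumptions: the per-opcode weights are made up, the weight of a sequence is taken as the product of its opcode weights, and cosine similarity is used as the similarity measure; [21] may differ on each of these points.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the contiguous opcode sequences of length n."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def weighted_profile(opcodes, weights, max_n=2):
    """Weighted term frequency of opcode sequences of length 1..max_n.

    The weight of a sequence is the product of its opcode weights
    (an assumption of this sketch); unknown opcodes get weight 1.0.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(opcode_ngrams(opcodes, n))
    total = sum(counts.values()) or 1
    profile = {}
    for seq, count in counts.items():
        weight = math.prod(weights.get(op, 1.0) for op in seq)
        profile[seq] = weight * count / total
    return profile


def cosine_similarity(a, b):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up opcode weights and traces, for illustration only.
weights = {"mov": 0.2, "xor": 0.9, "jmp": 0.7}
variant_a = weighted_profile(["mov", "xor", "xor", "jmp", "mov"], weights)
variant_b = weighted_profile(["mov", "xor", "xor", "jmp", "jmp"], weights)
benign = weighted_profile(["push", "call", "pop", "ret", "push"], weights)

print(cosine_similarity(variant_a, variant_b))
print(cosine_similarity(variant_a, benign))
```

With these toy traces the two variants score higher against each other than against the benign trace, which mirrors the qualitative finding reported above.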
One kind of malware is the botnet, which scans the internet to find vulnerable hosts on which to perform various malicious activities. Botnets are normally coordinated through a Command-and-Control (C&C) channel; most control protocols are IRC-based, although other protocols such as HTTP can also be used. Ref. [22] focused on detecting and confining DDoS and port scans. The authors of [22] proposed a platform that detects malicious activity by monitoring the communication between a botnet and its C&C, and by monitoring traffic to detect and confine DDoS and to detect zombie computers on the network. As a result, they managed to filter botnet-related traffic, confine infected parts of the network, and find methods for disabling botnets. To collect malware, high- and low-interaction honeypots were used. The low-interaction honeypots used in the experiment were (1) Nepenthes and (2) Honeyd. After the malware was captured, it was analyzed manually: samples were identified using various anti-virus tools and sandboxed to collect useful information. A victim PC was then connected to the analysis workstation, Wireshark was started on the workstation, and the traffic generated by the victim PC in a clean state was monitored. Afterward, the victim PC was rebooted with malware installed on it, and events related to DNS requests, attempts to connect to unknown ports, and scanning of unknown ports were recorded. Dnsmasq, fakemta relay-Http, relay, and Wireshark were used as tools for different purposes. Because this methodology was cumbersome, MWNA (Malware Network Analyzer) was developed. It is based on the Linux Packet Filter mechanism. The published method for detecting DDoS analyzes packets during normal traffic, first to establish a baseline and then to derive thresholds. Some attack features are then extracted. Finally, the method is combined with a rate-limiting scheme so that the amount of monitored traffic can be reduced.
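The baseline-then-threshold step can be illustrated with a short sketch. It assumes that traffic is summarized as packets per interval and that the threshold is the baseline mean plus k standard deviations; both choices, and the function names, are assumptions for illustration rather than the rule used in [22].

```python
from statistics import mean, pstdev


def derive_threshold(baseline_rates, k=3.0):
    """Threshold = baseline mean + k standard deviations (k is assumed)."""
    return mean(baseline_rates) + k * pstdev(baseline_rates)


def flag_intervals(rates, threshold):
    """Return the indices of intervals whose packet rate exceeds the threshold."""
    return [i for i, rate in enumerate(rates) if rate > threshold]


# Made-up packets-per-interval counts observed during normal traffic.
baseline = [100, 110, 90, 105, 95]
threshold = derive_threshold(baseline)

# Made-up counts observed later; two intervals carry a traffic spike.
observed = [102, 98, 450, 101, 900]
print(flag_intervals(observed, threshold))
```

Only the flagged intervals would then need closer inspection, which is in the spirit of reducing the amount of monitored traffic.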
A hybrid approach is also used to benefit from a combination of malware detection methods. Ref. [23] aimed to exploit the advantages of multiple malware detection techniques, which is why their framework is hybrid. The framework's detection methodology relies on API calls extracted from the suspected file by running it in a VM environment. A graph is then built from the API calls and the operating system resources being used. Graph nodes represent API calls and operating system resources, and edges represent references between nodes. The constructed graph is then minimized. Finally, the Graph Edit Distance algorithm, which relies on a cost matrix, is used to find a match between two graphs.
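A simplified sketch of the graph comparison is given below. It assumes that every node carries a unique label (the API call or resource name), so nodes can be matched by label and the distance reduces to counting node and edge insertions and deletions. This avoids the cost-matrix-based search a general Graph Edit Distance needs and is therefore a simplification, not the algorithm of [23]; the traces are made up.

```python
def build_graph(trace):
    """Build a directed graph from (api_call, resource) pairs.

    Nodes are API calls and operating system resources; an edge records
    that the API call referenced the resource.
    """
    nodes, edges = set(), set()
    for api_call, resource in trace:
        nodes.update((api_call, resource))
        edges.add((api_call, resource))
    return nodes, edges


def edit_distance(graph_a, graph_b, node_cost=1.0, edge_cost=1.0):
    """Cost of the node and edge insertions/deletions that turn A into B.

    Nodes are matched by their unique labels, so no search over node
    mappings is needed; substitutions are not modelled in this sketch.
    """
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    return (node_cost * len(nodes_a ^ nodes_b)
            + edge_cost * len(edges_a ^ edges_b))


# Made-up behavior traces for a known sample and a suspected file.
known = build_graph([
    ("CreateFileW", "C:\\temp\\a.exe"),
    ("RegSetValueExW", "HKCU\\Run"),
    ("WriteFile", "C:\\temp\\a.exe"),
])
suspect = build_graph([
    ("CreateFileW", "C:\\temp\\a.exe"),
    ("RegSetValueExW", "HKCU\\Run"),
    ("connect", "10.0.0.5:443"),
])
print(edit_distance(known, suspect))
```

A small distance to a known malicious graph would suggest similar behavior; the cut-off for declaring a match is left open here.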
Ref. [24] developed a tool, PyTrigger, which provides the user actions required to trigger, collect, and distill malware behavior profiles. Their paper makes three major contributions: an algorithm that extracts malware behavior, including user-triggered malware behavior, from among similar events; an event recording and playback system; and the full implementation of the PyTrigger system. PyTrigger has two major subsystems: (1) the recording and playback system and (2) the behavior analysis system. The recording and playback subsystem records the values of all objects' data states, such as window titles, mutable text field values, and drop-down menu choices; during replay these values are forcibly entered into the GUI to recreate the scenario that triggers the malware behavior. PyTrigger executes the malware sample several times in a VM and uses Event Tracing for Windows to trace the events. The system was evaluated on 4100 malware samples from 35 different malware families. The recorded user activity was typical and related to Gmail, Facebook, Google, HSBC, text editing, and file browsing and execution (Windows Explorer). An added advantage of this system is its ability to extract delegated events, that is, events that the malicious process delegates to other, legitimate processes lying outside the malware process chain.
Ref. [25] concentrated on a low-cost solution for detecting malicious activity that does not rely on third-party software, so that detection can be done in less time and on a low budget. Moreover, because some malware can evade virtual environments, running malware in a virtual machine for dynamic analysis can miss some triggering scenarios. The authors transformed Windows audit logs into interpretable features and presented a linear classification model that detects malicious behavior with high accuracy using the Windows audit log as the feature set. This approach also uncovered some new malware behaviors. For validation, six different experiment sets were designed. One validation experiment involved a dataset whose malware was a year or two older than the malware in the training set. A second validation experiment was based on malware families. In addition, the same trained classifier was run in a virtual environment as well as in an enterprise environment to account for the effect of the environment. The experimental dataset consisted of 32,078 samples, of which 17,399 were benign and 14,679 were malicious. A total of 6,898,593 unique features were extracted, and 20,362 audit logs were collected from binaries executed in a Cuckoo sandbox.
Figure 3 shows the performance metrics used by the surveyed papers that fall in the category of statistical based methods.

Malware Detection with Conventional Machine Learning Based Methods
Machine learning plays an important role in capturing useful properties of malware to advance security measures. This process of knowledge extraction and pattern learning led researchers towards machine learning-based malware analysis and detection. Machine learning has been used extensively not only for malware detection but also for detecting malicious activity in network traffic [26].
Ref. [27] worked on belief propagation with file system features, but the approach did not perform well on new samples. Ref. [28] conducted malicious graph matching on extracted API/system calls but used a small dataset. Ref. [29] used a rule-based classifier and SVM and performed detection based on byte sequences, but evaluated the model on specific malware classes only. They built datasets from Windows system files and the Anti-Virus Platform. Ref. [30] also used a rule-based classifier on extracted API/system calls, but the categorization of these API/system calls was not satisfactory. They conducted their tests on features of the Windows XP system and Program Files folders. The authors of [31,32] used Random Forest with network, API/system call, registry, and file system features, but the dataset was small. Ref. [33] used Decision Trees, and [34] used Naïve Bayes, Random Forest, and SVM on byte sequences, API/system calls, the file system, and the Windows registry. Ref. [35] used KNN for detecting malicious PEs. Malware code damages resources, and with a small code change malware developers can easily defeat the protection layer; much research has therefore addressed the detection of such variants. Ref. [36] explored Decision Tree and Random Forest using opcodes. They used small datasets from the Windows XP system and Program Files folders and generated malware code to include in the dataset. Ref. [37] performed clustering with locality-sensitive hashing on byte sequences, but the dataset was very small. Ref. [38] applied a rule-based classifier to API/system calls and the Windows registry. Ref. [39] used clustering, which earlier researchers had also used for variant detection. The authors chose DBSCAN, but their approach did not cope with malware evasion techniques. Ref. [40] applied Logistic Regression and Neural Networks to byte sequences and API/system calls.
Table 2 shows the datasets and performance metrics used by the researchers in the surveyed papers that apply conventional machine learning algorithms.

Malware Detection with Deep Learning Based Methods
Deep Learning is a specialized form of machine learning within Artificial Intelligence (AI) that applies deep artificial neural networks, also known as deep neural networks. These machine learning techniques simulate the learning process of the human brain. The cells of the human brain correspond to the neurons of a neural network. In the human brain, the cells are connected through axons and dendrites, and the connection regions are known as synapses. In an Artificial Neural Network (ANN), the corresponding connections carry weights so that they behave like the connections between nerve cells in the human brain. Figure A2 (Appendix B) shows the human brain and its simulated version as an artificial neural network.
The major difference between conventional neural networks and deep neural networks is the number of layers. Deep neural networks use many hidden layers for high-level abstraction of data, which allows them to learn the features of the data. This feature learning is driven by a large number of examples fed to the deep learning algorithm; after learning the most suitable features, the algorithm produces results in the form of classification, identification, or generation of data. The major motivation for using deep learning in various fields has been to organize and analyze large amounts of data. Areas where deep networks are preferred include image processing, speech processing, healthcare, and, with the growth of cyberspace, cybersecurity.
Depending upon its features, this domain can be further categorized into different subdomains, as shown in Figure 4. All features of PE files hold some significance in defining the degree of maliciousness of a particular file. Features from the header and the imports all play a significant role in determining whether a PE file is malicious or benign. Ref. [41] used LSTM to select optimal PE features, which were then used to train a deep learning-based model for detecting malicious PE files. Refs. [42,43] used sequential dynamic data and claimed that an ensemble of recurrent neural networks can detect whether an executable is malicious within the first 4 s of execution with almost 93% accuracy. GRUs (Gated Recurrent Units) were used with the RNN to reduce training time. The features were user CPU usage, system CPU usage, sent packet count, received byte count, total bytes sent, the number of processes being executed, the maximum number of processes being carried out, the number of milliseconds elapsed since the file started to run, and the maximum process ID assigned.
Ref. [44] combined two types of neural network layers, convolutional and recurrent, to model system call sequences for malware classification. The two layer types take different approaches to modeling sequential data: convolutional networks treat a sequence as a set of n-grams, whereas recurrent networks train a stateful model using the full sequential information. The input to the system consisted of 60 distinct system calls.
Ref. [45] performed malware detection using stacked autoencoders (SAEs) with Windows API calls mined from PE files as input. The SAE model used greedy layer-wise training for unsupervised feature learning, followed by supervised parameter fine-tuning. Results showed that the model with 3 hidden layers and 100 neurons in each layer gave the best training and testing accuracy compared with ANN, SVM, Naïve Bayes, and Decision Tree.
Ref. [46] implemented a method that works on raw inputs to detect maliciousness. The model, called eXpose, takes generic short strings from security inputs, such as malicious URLs, mutexes, and registry keys, and learns to identify their maliciousness. eXpose uses neural network convolutional kernels for feature extraction. The architecture is composed of notional components: character embedding, feature detection, and a classifier. Results showed that eXpose outperformed manual feature extraction approaches, attaining a 5-10% detection rate gain at a 0.1% false-positive rate compared to these baselines.
The model proposed in Ref. [47] comprises the phases of OpCode-Sequence Graph Generation, Deep Eigenspace Learning, and Feature Selection for detecting Internet of Battlefield Things (IoBT) malware. Ref. [47] used a convolutional network for the deep learning module because it can give more accurate classification results when the data patterns are complex and nonlinear. This approach achieved 99% accuracy and 98% recall.
Ref. [48] addressed the detection of malware variants with deep learning methods. The authors published a method in which they transformed the malicious code into a grayscale image. The images were then recognized and classified by a Convolutional Neural Network (CNN), which could extract the features of the malware images automatically. The implemented CNN was composed of an input layer, convolutional layers, and subsampling layers. This model also classified malware into related malware families.
Ref. [49] converted the disassembled malware code into a greyscale image using SimHash and then used a Convolutional Neural Network to identify the malware family. The presented methodology comprises three phases: feature extraction, malware image generation, and CNN training. The authors obtained an accuracy of approximately 99% with 10,805 samples.
Ref. [50] focused on the description of state-targeted APT using a Deep Neural Network (DNN). The researchers exploited the ability of DNNs to take raw features as input while learning higher-level features during training. In this progression, every hidden layer extracted higher-level features from the preceding layer, building a hierarchy of higher-level features.
Ref. [51] devised an approach that uses a neural network comprising convolutional and feed-forward neural constructs for malware classification. This approach used three feature categories: PE file metadata, import features, and assembly opcode features.
Ref. [52] used a dynamic analysis approach based on Windows API call graphs and SAE models. The paper developed a Behavior-based Deep Learning Framework (BDLF), which uses SAEs for feature reduction from behavior graphs and then performs classification through Decision Tree, KNN, Naïve Bayes, and SVM.
Ref. [53] focused on malware detection based on process behavior in possibly infected terminals. The published solution applies DNNs in 2 stages: in the first stage, an RNN extracts process activities and converts them into feature vectors; in the second, the feature vectors are treated as images and classified by a CNN.
Ref. [54] worked on a new image processing technique with optimized parameters for machine learning algorithms and deep learning architectures to produce an efficient zero-day malware detection system. First, malware detection was performed using deep learning based on static analysis of the EMBER dataset and privately collected samples, and it was deduced that detection performance can be marginally enhanced by a hybrid pipeline, proposed as Windows-Static-Brain-Droid (WSBD), composed of both classical machine learning algorithms and deep learning models. In the next stage of the research, malware detection was performed using deep learning based on dynamic analysis. A comparison between classical machine learning algorithms and deep learning architectures based on dynamic analysis was conducted, and the deep learning architectures performed best in all experiments. Experiments were then conducted on categorizing malware into malware families using deep learning based on image processing. A novel technique, DeepImageMalDetect (DIMD), was proposed, which is based on image processing and uses CNN and LSTM. The proposed method can work on malware from different operating systems. Finally, an architecture named ScaleMalNet was developed. It collects data from different sources and uses self-learning techniques such as classical machine learning algorithms, deep learning architectures, and image processing techniques to detect and classify malware and to categorize it efficiently into the corresponding malware family.
The authors of [55] proposed a new technique for generating a malware signature that does not depend on any specific behavior of the malware, so that it can also be used for malware variants. To achieve this, the researchers first recorded the behavior of the malware in a sandbox and then converted the output text file into a binary vector. After the binary vector was created, a Deep Belief Network was trained using a deep stack of denoising autoencoders.
Ref. [56] focused on a technique that uses a Deep Neural Network for malware detection with statically extracted features, aiming for higher accuracy and a minimal FPR. The framework defined in this paper has three main components: (1) the first component extracts four features from benign and malicious binaries; (2) the second component is a Deep Neural Network consisting of an input layer, two hidden layers, and one output layer; (3) the third component is the score calibrator.
The research in [57] focused on one shot learning, which applies when there are very few samples to learn from. It implements a model, LRUA-MANN, which modifies the memory access capability of a Neural Turing Machine to adapt it to a one shot learning task. LRUA-MANN uses an LSTM as the controller and uses the LSTM state and a memory bank as memory.
Ref. [58] focused on carrying out malware detection without in-depth knowledge of malware and its analysis. Two neural networks were used: one was fully connected, and the other was a Recurrent Neural Network. The model had 3 LSTM layers with attention mechanisms before classification. Sax et al. used neural networks on extracted strings and PE file characteristics, but the approach did not cope with obfuscation and did not produce good accuracy in such situations.
Ref. [59] implemented a multi-task learning model that was trained on seven classification tasks for malware image classification. The model implemented in [59] consisted of 5 CNN layers with the PReLU activation function.
Ref. [60] explored the advantages of transfer learning in the domain of malware identification. Their research focused on using transfer learning to extract the features of a malware dataset. They used an already trained deep learning model (trained on ImageNet) and finally classified the malware into their respective families.
Figure 5 summarizes the types of deep learning algorithms used by researchers over the years, and Table 3 summarizes the performance metrics reported for deep learning-based malware detection. Critical analysis of the surveyed papers that implemented deep learning algorithms emphasizes the need for a large dataset to produce reliable results. Deep learning architectures rely heavily on supervised learning, which requires a large number of labeled examples to train the model, as mentioned by [61]. A small dataset does not allow the model to learn the features properly during training, which leads to unreliable results. The survey also revealed that this large dataset must contain a large number of examples for each class that the trained model is expected to identify. Moreover, processing this bulk of data requires powerful hardware, high computational power, and long training times, which diminishes the chance of applying the trained models to real-time data. Because of these unavoidable characteristics of deep learning models, the market has not succeeded in replacing signature-based anti-malware systems with artificially intelligent systems.

Therefore, researchers shifted their focus from developing deep models for feature learning to exploring models that can work with small datasets. In pursuit of this objective, researchers explored Few Shot Learning (FSL), which is based on meta learning and focuses on learning the strategy of how to learn the meaningful properties of data. Meta learning utilizes transfer learning (multi-task learning) and semi-supervised or unsupervised learning approaches, which need only a few examples for training. Thus, according to [62], a meta learning model can be trained with the help of prior knowledge. Meta learning-based algorithms used in malware analysis include Few Shot Learning (FSL), One Shot Learning (OSL), and Zero Shot Learning (ZSL). Figure 6 shows the relationship between machine learning and meta learning models. The major advantages of meta learning-based algorithms are listed in Figure 7.

Ref. [63] explored the Siamese network for malware image classification. The Siamese network architecture is an application of one shot learning. The basic approach used by [63] was to transform the features into malware images that were input to the Siamese Convolutional Neural Networks shown in Figure 8. The Siamese CNNs used by [63] produce two feature vectors. Finally, the Manhattan distance between those feature vectors was calculated and passed to a sigmoid function to generate the similarity score.
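The final scoring step described for [63] can be illustrated with a minimal sketch. It assumes the two feature vectors have already been produced by the twin networks; the function names and the `weight` and `bias` values are hypothetical stand-ins for parameters that a trained model would learn, and this is not the implementation used in [63].

```python
import math


def manhattan_distance(u, v):
    """Return the L1 (Manhattan) distance between two feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def similarity_score(u, v, weight=-1.0, bias=2.0):
    """Map the Manhattan distance to a similarity score in (0, 1).

    weight and bias are hypothetical; a negative weight makes a smaller
    distance yield a higher score.
    """
    return sigmoid(weight * manhattan_distance(u, v) + bias)


# Hypothetical embeddings of three malware images.
sample_a = [0.2, 0.9, 0.4]
sample_b = [0.25, 0.85, 0.4]   # close to sample_a
sample_c = [0.9, 0.1, 0.7]     # far from sample_a

print(similarity_score(sample_a, sample_b))
print(similarity_score(sample_a, sample_c))
```

With these assumed parameters, a pair of nearby embeddings scores higher than a distant pair, which is the property a one shot classifier relies on when comparing a new sample against a single reference per family.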
Another surveyed paper [57] used a one shot learning approach with a memory augmented neural network over API call sequences. Ref. [57] adopted an approach with two learning domains. The first domain is used to train the model with known malware, and the second domain is used to train or test with a dataset of an unknown type of malware. Domain 2 makes use of the model trained in domain 1. The workflow of the approach implemented in [57] is shown in Figure 9.
Ref. [64] explored a one shot learning approach with matching and prototypical networks. The model developed by [64] is shown in Figure 10. Ref. [64] takes advantage of the visual dissimilarity between the images of different malware families (shown in Figure 11) and converts the malware binaries into 8-bit greyscale images that are given as input to the few shot learning models.
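The conversion of a binary into an 8-bit greyscale image can be sketched as follows. Each byte is treated as one pixel intensity in the range 0 to 255. The fixed row width and the zero padding of the last row are assumptions made for illustration; the source does not state how [64] chose the image dimensions.

```python
def bytes_to_grayscale(data, width=16):
    """Arrange raw bytes as rows of 8-bit greyscale pixels.

    data:  a bytes-like object (for example, the contents of an executable)
    width: number of pixels per row (a hypothetical choice)
    The last row is padded with zeros (black pixels) if it is incomplete.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)                      # every byte is already in 0..255
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Ten synthetic bytes arranged into rows of four pixels.
image = bytes_to_grayscale(bytes(range(10)), width=4)
for row in image:
    print(row)
```

In practice the bytes would be read from a file opened in binary mode, and the resulting rows would be handed to an image or tensor library before being fed to the network.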
Ref. [65] presents ConvProtoNet, a few shot learning-based neural network. ConvProtoNet in [65] uses stacked convolutional layers, rather than only computing means, to generate the features of malware classes. ConvProtoNet is capable of being trained on one dataset and tested on another.
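For contrast with ConvProtoNet, the baseline of "only computing means" can be sketched as a plain prototypical classifier: each class prototype is the mean of the few available support embeddings, and a query is assigned to the nearest prototype. The embeddings, family names, and the squared Euclidean distance used here are illustrative assumptions, not details taken from [64] or [65].

```python
def class_prototypes(support):
    """Compute one prototype per class as the mean of its support embeddings.

    support: dict mapping a class label to a list of embedding vectors
    """
    prototypes = {}
    for label, vectors in support.items():
        if not vectors:
            raise ValueError(f"class {label!r} has no support samples")
        dim = len(vectors[0])
        prototypes[label] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes


def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query, prototypes):
    """Return the label of the prototype closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_euclidean(query, prototypes[label]))


# Two hypothetical malware families with two support samples each.
support_set = {
    "family_a": [[0.0, 0.0], [0.2, 0.0]],
    "family_b": [[1.0, 1.0], [1.0, 0.8]],
}
protos = class_prototypes(support_set)
print(classify([0.9, 1.0], protos))
```

ConvProtoNet, as described above, replaces the mean in `class_prototypes` with stacked convolutional layers that generate the class features.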
Ref. [66] composed a dataset of splash screen images showing the message displayed when a system is attacked by ransomware. They trained their one shot learning model on a dataset of splash screen images from 50 ransomware families. Different augmentation techniques are used by [66] to adapt the images for one shot learning.

Issues and Challenges
Every trend in malware detection and analysis has shown shortcomings, which is why research has shifted to other technologies for detecting malware in real-time with a minimum false positive rate and maximum accuracy. This section highlights the challenges faced by each trend and the disadvantages of the different techniques adopted for malware detection and analysis. Tables 4-6 summarize the issues of the surveyed papers based on different analysis methods.

Shortcomings of Primitive Methods (Statistical Analysis Based Methods) for Detecting Malware
Primitive methods of malware analysis depend on statistical analysis of changes in the system or on a probabilistic assessment of whether an executable is malware based on the appearance of literals. However, this probabilistic or statistical approach approximates over only a few features of malware and fails on obfuscated malware.
Packed executables were ignored by [21], and the dataset was small, which makes the results uncertain if the implemented solution is deployed in real-time. Ref. [19] used a detection algorithm that is context insensitive and unable to track the calling context of the executable.
Another framework, presented by [25], made use of Windows audit logs; since Windows audit logs can be obfuscated, the presented solution is of no use in such cases. Secondly, the researchers in [25] ran the experiments for only 4 min, which could easily have missed slow-executing malware.
The solution given by [24] did not consider all the features that play an important role in the detection of malware.
The solution modeled by [22] used low interaction honeypots, which allow only limited interaction with malware; thus, malware that becomes active only when certain conditions occur can go undetected.
In the work done by [20], the FPR is too high to deploy the system in a real environment. The solution given by [67] tried to handle metamorphism but dealt with only 3 obfuscation techniques, whereas many more exist; as a result, the claimed results cannot be reproduced in a real environment.
Hence, the surveyed papers proposing heuristic and statistical solutions for malware detection show that other techniques need to be adopted. Those techniques should be capable of improving the FPR to yield a robust and reliable solution that can be implemented in real-time.
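Since the false positive rate (FPR) is the deciding metric in this discussion, a short sketch of how it is computed alongside accuracy may be useful. The label convention (1 for malware, 0 for benign) and the sample predictions are assumptions made for illustration and do not come from any surveyed paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = malware, 0 = benign)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN): the share of benign files flagged as malware."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / all samples."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical ground truth and detector output for six files.
truth = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0]
print(false_positive_rate(truth, predicted))  # 0.25
print(accuracy(truth, predicted))
```

A detector can reach high accuracy on an imbalanced dataset while still flagging an unacceptable share of benign files, which is why the FPR is reported separately.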

Shortcomings of Conventional Machine Learning Based Methods for Detecting Malware
When static analysis is used, the foremost problems hindering the analysis process are obfuscation, encryption, and packing. Refs. [34,35,68–71] implemented their solutions without addressing obfuscation, packing, and encryption. One of the major problems seen in many papers during the survey is anti-analysis techniques, also called evasion techniques. Sophisticated malware developers take into account that the target machine may be an analysis machine or may run a virtual environment, so they purposefully use evasion techniques: the malware first checks for the presence of a virtual environment and, if one is present, hibernates. This is called environmental awareness and is clearly stated in [58]. Malware can easily identify whether it is being run in a virtual or debugging environment. Another evasion approach is timing-based, meaning that the malware becomes active only at a particular date or time, or only upon user interaction. In the solutions applied by [32,39,72–74], detection accuracy is noticeably reduced when facing evasion techniques, encrypted malware, or malware that needs user interaction to become active. Another problem identified during the survey was the use of small or insufficient datasets, due to which the results produced might not be reliable. Researchers in [28,30,32,37,38,68–72,74–79] used small datasets.
Since conventional machine learning algorithms require feature engineering, a prominent problem seen in the surveyed machine learning-based solutions was the use of only a few of the features that can play a distinctive role in the detection of malware. The solutions in [30,36,38,80,81] considered only a subset of useful features. Another shortcoming was the inability to detect variants of malware.

Shortcomings of Deep Learning Based Methods for Malware Detection
Deep learning has taken over the field of malware analysis because of its capability of automatic feature engineering, but since it is still evolving, certain issues remain to be addressed. One of the issues faced by deep learning-based methods is small data. The solution published by [43] indicates that the system was tested against small data and that malware was executed for a very short time, which malware writers can easily exploit through evasion techniques. Similarly, the research work of [46] used small data for training to avoid computational constraints, but this affected generalization. The same problem was seen in the work of [47], due to which the results cannot be relied upon when the presented framework is deployed in a real-time environment. The solutions given by [25,53,57] suffer from the same problem.
Another problem is the size of the input. CNNs work over images, and most of the proposed solutions work only with images of a fixed size. The solutions presented by [48,54] could perform better by handling variable-size input data. Ref. [52] did not mention the execution time of samples for extracting API calls; if the samples were not run for long enough, the claimed results would be unreliable. Some of the proposed solutions have not addressed obfuscated samples, so if they are deployed in real-time, their results will suffer on encountering packed or obfuscated samples.
The solutions presented by [44,49,51] did not address circumstances in which evasion techniques could have been applied. In the research work of [45], the sparsity constraint was not considered. Most of the solutions adopting dynamic analysis did not address the multipath execution problem. The comparison between approaches carried out by [35] is not reliable because, in one of the approaches, the features were not normalized even though their values had a large range. The research work of [57] used only malware samples for malware classification, although a real-time system receives benign as well as malicious files, so the system should have been trained on both types of files. Secondly, the malware families considered for training were too few.
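The normalization issue raised for [35] can be made concrete with a small sketch. Min-max scaling is used here as one common option; the source does not say which normalization would have been appropriate, and the feature names in the comments are hypothetical.

```python
def min_max_scale(rows):
    """Rescale every feature column to the range [0, 1].

    rows: list of samples, each a list of numeric feature values.
    A constant column is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [0.0 if span == 0 else (value - low) / span for value in column]
        )
    return [list(sample) for sample in zip(*scaled_columns)]


# Hypothetical samples: [number of imported functions, file size in bytes].
raw = [
    [10, 200000],
    [20, 400000],
    [30, 600000],
]
print(min_max_scale(raw))
```

Without such a step, the second feature would dominate any distance-based or gradient-based learner simply because its values are several orders of magnitude larger.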
The solution proposed by [82] is a stacked approach consisting of two stages. In the first stage, multiple baseline machine learning classifiers were trained using static features only. In the second stage, a final classifier worked over the dataset created from the predictions of the base classifiers used in the first stage. Similarly, the methodology proposed in [83] follows the ensemble method. The first stage of ensemble classification in [83] uses multiple machine learning algorithms that are trained using static features only.
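The two-stage idea can be sketched with deliberately simple components. The base classifiers below are one-feature threshold rules and the final classifier is a lookup table learned from the base predictions; both are stand-ins chosen for brevity and are not the classifiers used in [82] or [83]. The feature values are synthetic, and for simplicity the second stage is trained on in-sample base predictions, whereas a careful implementation would use held-out predictions.

```python
from collections import Counter, defaultdict


def fit_stump(X, y, feature):
    """Stage 1: learn a threshold rule on a single static feature."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({row[feature] for row in X}):
        correct = sum(
            (row[feature] > threshold) == bool(label) for row, label in zip(X, y)
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda row: int(row[feature] > best_threshold)


def fit_meta(base_predictions, y):
    """Stage 2: map each pattern of base predictions to its majority label."""
    votes = defaultdict(Counter)
    for pattern, label in zip(base_predictions, y):
        votes[tuple(pattern)][label] += 1
    table = {p: counts.most_common(1)[0][0] for p, counts in votes.items()}
    default = Counter(y).most_common(1)[0][0]   # fallback for unseen patterns
    return lambda pattern: table.get(tuple(pattern), default)


def fit_stack(X, y):
    """Train the base classifiers, then the final classifier on their outputs."""
    base = [fit_stump(X, y, feature) for feature in range(len(X[0]))]
    stage_one_output = [[clf(row) for clf in base] for row in X]
    meta = fit_meta(stage_one_output, y)
    return lambda row: meta([clf(row) for clf in base])


# Synthetic static feature vectors; label 1 = malware, 0 = benign.
X_train = [[1, 5], [2, 6], [8, 1], [9, 2], [7, 7], [3, 1]]
y_train = [0, 0, 1, 1, 1, 0]
predict = fit_stack(X_train, y_train)
print(predict([10, 0]), predict([0, 5]))
```

The design point is that the second stage never sees the raw features: it learns only from how the base classifiers agree or disagree.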

Direction for Future Work
We have outlined several problems in this paper that need to be addressed to produce a viable product capable of detecting malware in real-time. This section highlights the issues that need attention in future work.

Moderate Sized and Updated Dataset
As highlighted in the previous section, most of the surveyed papers used a small dataset, which is not enough to produce reliable results. This problem is mainly due to the constraints of handling big data or the unavailability of labeled datasets. Small datasets produce biased results that cannot be reproduced in a real environment. The problems of unavailable labeled data, imbalanced data, or too few samples for a particular class of malware can be addressed through Few Shot Learning (FSL) and its variants, so that improved or state-of-the-art results can be achieved without the problems of handling and processing large datasets. Secondly, some of the datasets used for research purposes were quite old. Since new malware with new characteristics is produced daily, research carried out on outdated data might not be helpful in real-time. It is recommended that up-to-date data be collected that includes the latest variants of malware. Another issue that needs attention is that the datasets used for training and validating proposed frameworks should reflect the real data distribution.
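One narrow, practical aspect of the last point is keeping the class proportions of the collected data identical in the training and validation splits. The following sketch shows a stratified split under that assumption; the function name, the 80/20 class ratio, and the validation fraction are hypothetical, and the sketch does not address whether the collected data itself matches real-world traffic.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, validation_fraction=0.2, seed=0):
    """Split data so each class keeps the same share in both subsets."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)          # fixed seed for a repeatable split
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * validation_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train = [(samples[i], labels[i]) for i in train_idx]
    validation = [(samples[i], labels[i]) for i in val_idx]
    return train, validation


# Hypothetical corpus: 80 benign files and 20 malware files.
file_ids = list(range(100))
file_labels = ["benign"] * 80 + ["malware"] * 20
train_set, validation_set = stratified_split(file_ids, file_labels)
print(len(train_set), len(validation_set))
```

With a purely random split, a rare malware family could end up entirely in one subset, which would make the validation results misleading.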

Using Significant Features
The selection of appropriate features plays a vital role in training a model to produce effective results. Both statically and dynamically extracted features contribute to the detection of malicious behavior. Most of the surveyed papers used a subset of features, or used only statically extracted or only dynamically extracted features, which raises the concern that some features that might be decisive in detecting malicious behavior were ignored. Many papers indicated that non-optimal features were used and that this should be addressed in future work. Combining static and dynamic features can train a model with better learned capabilities. When deploying an anti-malware system in a real-time environment, extracting dynamic features can pose a problem given the limitations of what is available in the market. In such a case, the application of neural networks can be helpful. Neural networks can work with images of samples from both the malicious and benign classes. In this way, rather than focusing on any single feature, all the static semantic features of the samples can be considered. The use of neural networks automates the process of feature engineering. Rather than selecting features by trial and error, the embedding layers of the neural network can be used to automatically select the most contributing features.

Handling Evasion Techniques
As described earlier, evasion techniques can be categorized as environment awareness-based or timing-based. A framework that is claimed to be deployable in a real-time environment should be accurate and effective enough not to be affected by evasion techniques. This should be addressed in future work because new malware can detect the virtual environment. Another sophisticated capability of malware is to activate at a particular date and time; until then, it does not exhibit malicious behavior. Some malware is triggered only by certain input and otherwise behaves as a benign file. This kind of behavior can be traced by multipath execution, which should be the focus of future work.

Combating Anti-Analysis Techniques
Malware developers use various anti-analysis techniques to hinder the detection and analysis of their released malware. Obfuscating a sample, compressing/packing the binary/exe, and encrypting the file are all tools that make it difficult to detect and analyze the malware. Future work can aim to mitigate these anti-analysis tools to reduce the destructive threat posed by malware.
In short, the aggressively dynamic nature of the cyber world requires researchers to take care of the following points while conducting research in the domain of malware detection.

• Since malware easily changes its shape due to the sophisticated techniques used by malware writers, future research should aim to deal with metamorphic, polymorphic, and obfuscated malware.

• The day-by-day increase in malware is the prime reason for the increasing number of malware families, and over time new forms of malware keep surfacing in the cyber world. Future research should focus on developing a generic model capable of detecting zero-day malware.

• To implement a real-time solution, a model should be reliable enough to handle any kind of unseen malware as well.

Deep learning-based research has proved fruitful, producing notable results in the detection of malware. To further improve the solution, meta learning-based algorithms can be exploited in conjunction with deep learning. Meta learning-based algorithms help in producing generic models. These generic models are trained for self-learning. Through self-learning, the strategies for learning the properties of even unseen types of malware can be learned easily. More specifically, few shot learning has proved itself worthy of further exploration due to its effectiveness, efficiency, and robustness.

Conclusions
In this survey paper, we investigated the research gaps in building a real-time anti-malware system. This literature survey covers the different techniques adopted to detect and analyze malware. The work is organized so that three different trends in malware detection and analysis techniques are highlighted: primitive methods, which include statistical measures only; machine learning-based methods; and methods that involve the emerging technology of deep learning. The contributions of the presented work include the categorization of techniques into these three trends, the issues and challenges faced by each method, and directions for future work that mitigate the issues faced by existing methods. The different statistical strategies used in the literature for detecting malware are highlighted by category. Additionally, we shed light on the machine learning algorithms and features that are used to detect malware. Finally, we discuss the different deep learning models used in detecting and analyzing malware. This work identifies issues related to datasets, the use of feature subsets, the effects of evasion techniques, and the hindrance caused by anti-analysis techniques.
Finally, a future direction leading towards meta learning-based algorithms has been suggested for producing a viable product capable of detecting and analyzing malware in real-time with improved accuracy.

• The scale of devastation that malware can pose

Normally, in the case of malware, what we obtain for analysis are binary files or executables, which are not easily understandable by humans. Therefore, different analysis techniques have been proposed to gain full insight into malware. The broad categories of these techniques are shown in Figure A1.
Static Analysis: It refers to analyzing a file without executing it, which keeps the analysis process safe. This approach includes the extraction of low-level information such as CFGs (Control Flow Graphs), DFGs (Data Flow Graphs), and system call analysis. Different tools can aid in static analysis, such as IDA Pro for disassembling the file. Static analysis fails when malware is obfuscated, as it cannot penetrate packed samples, as explained by [18].
Basic Static Analysis: It can confirm the maliciousness of the file. It can provide information about the functionality of malware, but it cannot handle carefully programmed malware because it lacks an understanding of sophisticated malware's behavior.
Advanced Static Analysis: It refers to reverse engineering, which can be performed with a disassembler to understand the instruction code of the malware.
Dynamic Analysis: When the file is executed in a safe/virtual environment for the sake of analysis, it is called dynamic analysis. It should be conducted by hiding the virtual environment from the malware; otherwise, the malware can hibernate. This approach fails when the particular triggering condition on which the malware executes in its malicious state does not occur.
Basic Dynamic Analysis: It executes the malware in a safe environment to observe its behavior and find any signature. It provides low-level information, so it cannot handle sophisticatedly programmed malware.
Advanced Dynamic Analysis: It uses a debugger to investigate the internal state of running malicious executables. It extracts detailed information, which helps in understanding the code, as shown by [85].
Hybrid Analysis: This approach is a combination of both static and dynamic approaches. Researchers are trying to make use of the beneficial features of both approaches.
Table A1 summarizes the advantages and disadvantages of static and dynamic approaches in malware analysis.

Appendix B. Glossary of All Terms
This section is intended to familiarize the reader with the technical terms that appear frequently in this paper.
Obfuscation: Ref. [86] explains it as the process of hiding code using different techniques so that malware can bypass security devices/software.
Polymorphism: Ref. [87] describes it as the strategy through which malware keeps changing its appearance to evade detection. It is achieved through encryption using a different set of keys every time the malware executes.
Metamorphism: Using metamorphism, malware changes its code and signature pattern, but without using encryption.

PE (Portable Executable): It is a file format for executables used in versions of Windows.
Opcode: In machine language, the opcode is the part of an instruction that specifies the operation.
DDoS: It is an acronym for Distributed Denial of Service, and it is categorized as a network attack.
Honeypot: It is a system attached to the network to attract cyber attackers, as mentioned by [88] in their work. It works by luring attackers away from the systems holding critical information. Furthermore, it helps in observing the attacker's behavior and collecting information about the attacker's activity. Honeypots are systems that pretend to contain data of value to the attacker, but they are not accessed by legitimate users.
Ref. [89] further categorized honeypots into low interaction honeypots and high interaction honeypots. Low interaction honeypots contain software that emulates the real service, whereas high interaction honeypots contain a complete operating system, services, and applications to give the attacker the full impression of a valuable system.
Machine Learning: It is a specialized field under the umbrella of Artificial Intelligence. It makes use of AI to make decisions by mining information from data, as described by [90].
Supervised Learning: It is a learning technique used by AI-based algorithms to find the mapping function between input (x) and output (y), given inputs and their corresponding outputs.
Unsupervised Learning: It is a learning technique used by AI-based algorithms to find the underlying structure in data when only the input is given.
Classification: It is a supervised learning technique that is applied when the output variable is a category and there is no relationship among the values of the output.
Regression: It is a supervised learning technique that is used when the output variable is a real value and the values of the output variable have a relationship (greater than or less than).
Clustering: It is an unsupervised learning technique in which data is divided into groups based on some similarity measure.
SVM: Support Vector Machine. It is a machine learning algorithm based on supervised learning and can be used for both classification and regression.
KNN: K Nearest Neighbour. It is a machine learning algorithm that works by measuring similarity.
Random Forest: It is a machine learning algorithm that can be used for both classification and regression.
Naïve Bayes: It is a supervised machine learning algorithm that applies Bayes' theorem.
LSH: Locality-Sensitive Hashing. It is a clustering-based machine learning algorithm.
Neural Networks: Neural networks, also known as artificial neural networks (ANN), are machine learning techniques that simulate the learning process of the human brain. The human brain consists of cells, which are referred to as neurons in neural networks. In the human brain, the cells are connected through axons and dendrites, with the connection regions known as synapses. In an ANN, the corresponding connections carry weights so that they behave like the connections between nerve cells in the human brain. Figure A2 shows the human brain and its simulated version in the form of an artificial neural network.
Deep Learning: Ref. [13] explained it as a specialized form of machine learning in the domain of Artificial Intelligence (AI) that applies deep artificial neural networks, also known as deep neural networks. The major difference between conventional neural networks and deep neural networks is the number of layers. Deep neural networks make use of many hidden layers. Deep learning networks can be further categorized into different types of models, such as deep neural networks (DNN), recurrent neural networks (RNN), and long short-term memory (LSTM). Unlike conventional machine learning, it is capable of dealing with unstructured data as well.
RNN: A Recurrent Neural Network is a generalized form of feed-forward network that can handle sequential data by processing the current input together with the previously received input stored in its internal memory (hidden units). The internal memory of an RNN refers to the hidden units in the intermediate or hidden layers, which can retain and process previous inputs over time and are interdependent. The standard and unfolded architectures of an RNN are shown in Figure A3. It is used where sequence and time series are important.
Autoencoder: According to [91], it is a type of feed-forward neural network that uses an encoder and a decoder to first compress the input and then decompress it. The purpose of this compression and decompression is to learn the features of the input so that the same input can be reconstructed at the output. It is a type of neural network that uses the most important learned features of the data to reconstruct it.
Stacked AutoEncoder: It is a neural network that consists of many AutoEncoder layers, with the output of each layer connected to the input of the successive layer, as explained by [89].

Figure 3. Performance Metrics Used in Literature Proposing Primitive Methods for Malware Detection.

Figure 4. Types of Deep Learning.

Figure 5. Deep Learning Techniques Used for Malware Detection.

Figure 6. Relationship Between Machine Learning and Meta Learning.

Figure 11. Visual Samples Showing Dissimilarity Between the Images of Different Families [51].

Table 1. Related Survey Papers on Malware Detection Approaches.

Table 2. Datasets and Performance Metrics Used in Literature Proposing Machine Learning Methods for Malware Detection.

Table 3. Datasets and Performance Metrics Used in Literature Proposing Deep Learning Methods for Malware Detection.

Table 4. Limitations of Surveyed Papers Proposing Primitive Methods for Malware Detection.

Table 5. Limitations of Surveyed Papers Proposing Machine Learning Based Solutions.
S. Pai, F. Di Troia, C. A. Visaggio, T. H. Austin, M. Stamp (2015): Researchers did not consider obfuscation while training the models; therefore, the models would not perform well on real-time data. Secondly, a small dataset was used for training the models, which is not recommended.

Table 6. Limitations of Surveyed Papers Proposing Deep Learning Based Solutions for Malware Detection.

Table A1. Comparative Analysis of Static and Dynamic Approaches to Malware Analysis.