MANNWARE: A Malware Classification Approach with a Few Samples Using a Memory Augmented Neural Network

Abstract: The ability to stop malware as soon as it starts spreading will always play an important role in defending computer systems. Organizations, and society as a whole, would benefit greatly if intelligent defense systems could detect and prevent new types of malware when only a small number of samples have been revealed. The approach introduced in this paper takes advantage of one-shot/few-shot learning algorithms to solve malware classification problems, using a Memory Augmented Neural Network in combination with Natural Language Processing techniques such as word2vec and n-gram. We embed the malware's API calls, which are very valuable sources of information for identifying malware behavior, in different feature spaces, and then feed them to the one-shot/few-shot learning models. Evaluating the model on two datasets (FFRI 2017 and APIMDS) shows that models with different parameters can yield high accuracy on malware classification with only a few samples. For example, on the APIMDS dataset, the model classified 78.85% of samples correctly after seeing only nine malware samples, and 89.59% after fine-tuning with a few other samples. The results show very good accuracy compared to other traditional methods and point to a new area of malware research.


Introduction
Cyber-attacks threaten security systems across the world. Malware, or malicious software, plays a critical role in cyber-security because it is intentionally developed to perform destructive tasks on the victim's system without the victim's knowledge. According to a report from Kaspersky Lab, at least 360,000 new malicious files were detected every day in 2017, an 11.5% increase from the previous year [1]. In May 2017, WannaCry, a new type of ransomware at that time, and its variants spread quickly through the computers of many companies around the world, encrypted files on the PCs, and caused substantial financial damage to these companies. A report published by Symantec Corporation in February 2019 observed a decrease in ransomware activity during 2018 for the first time since 2013, with the overall number of ransomware infections on the client side dropping by 20%. However, ransomware such as WannaCry continued to inflate infection figures, and ransomware infections shifted toward enterprises, which accounted for 81% of all infections [2]. These harmful programs are more devastating than others because of their spreading speed and their functionality. This trend creates a need for methods that are powerful enough to detect and stop these kinds of malware as soon as they start spreading widely. Currently, automatic defense systems can respond to these malware threats by keeping up with the speed of malware development, but real success depends on strategic insight as well as the speed of the response. Therefore, the most effective defense strategy requires both intelligent (machine learning-led) programs and human expertise.
Thanks to rapid progress in AI technology, a growing body of research provides fast and reliable methods to detect and classify new types of malware. In the machine learning field, one of the key challenges is the number of collected samples. Generally, the more data we collect, the better the accuracy we get. However, there is not always enough data for every task. In this case, the idea of learning an object class from only a few examples, called one-shot/few-shot learning, is widely used. Many one-shot/few-shot learning algorithms have been proposed to deal with "data-hungry" problems. Some are based on the Bayesian approach, such as the work of Li Fei-Fei et al. on learning object categories with one-shot learning [3], probabilistic program induction by Brenden M. Lake et al. [4], and R. Salakhutdinov et al.'s work on the Hierarchical Nonparametric Bayesian Model [5]. Other approaches take advantage of meta-learning to solve this challenge, such as a variation of the Neural Turing Machine for one-shot learning tasks introduced by A. Santoro et al. [6], the Matching Network by O. Vinyals et al. [7], and the Siamese Network by G. Koch et al. [8]. These meta-learning models are capable of adapting or generalizing to new tasks with unknown data by using the meta-knowledge learned during training.
In this paper, we address malware detection problems, which usually require several samples for analysis, by using one-shot learning algorithms. The proposed approach uses the Memory Augmented Neural Network in combination with Natural Language Processing (NLP) methods such as n-gram and word2vec to classify new malware species with very little information about them. More specifically, we use meta-learning to train the model on how to learn from available resources so that it can adapt by itself to new tasks and new environments that were never encountered during training. This approach is an extended version of the work published at the 5th Asian Conference on Defense Technology (ACDT 2018) [9]. We extend our previous work by providing more background information, adding more options for the model, and verifying the performance with experiments on two datasets, FFRI 2017 [10] and APIMDS [11].
Using one-shot/few-shot learning approaches in malware analysis could help systems detect and classify new types of malware from only a few samples that have just been revealed. It could be adapted to classify rare malware that has never been seen before, such as the WannaCry ransomware mentioned above, when it starts to spread. To the best of our knowledge, this kind of malware research has not been introduced before, so our approach could open a new research field in cyber-security analysis. In addition, some traditional methods such as Recurrent Neural Networks (i.e., LSTM and GRU), SVM, and Random Forest are used as baselines to evaluate the results.

Meta-Learning
Meta-learning has been applied in many fields of machine learning and data mining. Its primary goal is to understand the interaction between a learning mechanism and the contexts in which that mechanism is applicable. It assists machine learning systems in the process of model selection by using the meta-knowledge acquired from the learning algorithms. In other words, via meta-learning, a network can "learn how to learn" from prior experience or learned knowledge. Such a network learns to deal with tasks in two stages: one acquires meta-knowledge from machine learning systems, and the other adapts that knowledge to new problems (domains) with the objective of identifying a suitable learning algorithm or technique for them [12]. In the first stage, the learner accumulates knowledge about performance on multiple applications (datasets), which captures how task structure varies across tasks. Then, whenever there is a new task to learn, the model fine-tunes its weight parameters using a small amount of new training data to select an applicable algorithm. Meta-learning can be categorized in several ways, such as recurrent models (MANN), metric learning (Matching Network for one-shot learning), and meta-optimization (Model-Agnostic Meta-Learning, MAML [13]). In this paper, a meta-learning system for one-shot learning tasks using a recurrent architecture (Recurrent Neural Network, RNN) is applied. It follows the task set-up proposed by Hochreiter et al. [14], as pictured in Figure 1.

Figure 1.
Overview of the Meta-learning system using Recurrent Neural Network. The subordinate system plays a role as a self-adjustable system.

Meta-Learning with Memory Augmented Neural Network
Although many meta-learning based approaches exist for one-shot/few-shot learning, in this paper the Memory Augmented Neural Network proposed by A. Santoro et al. [6] is used as the model to classify unknown malware with only a few trained samples. A. Santoro et al. modify the memory access capabilities of the Neural Turing Machine (NTM) introduced by A. Graves et al. [15] to adapt this model, LRUA-MANN, to one-shot learning tasks. Although the Neural Turing Machine still suffers from some problems seen in neural network architectures (e.g., the fixed size of the network), it is considered a promising architecture for the future. In a talk at the Machine Learning Conference 2016 in San Francisco, Daniel Shank, a Senior Data Scientist at Talla, stated that "Neural Turing Machines are a landmark architecture in the field of machine learning" [16]. Many research works and applications developed from NTM, such as the Differentiable Neural Computer by A. Graves et al. [17] and the Kanerva Machine by Y. Wu et al. [18], support this view. Hence, we believe that building on NTM will make our approach easier to extend in the future.
This network consists of the same four components as the NTM and is illustrated in Figure 2. A neural network called the Controller Network receives and processes inputs, sends its output vector to a Write Head, receives processed data from the Read Heads, and forwards them to the output layers of the network. A simple matrix (the Memory Bank) stores processed data from the controller and serves as the memory of the whole model. Data are written into the memory from the controller via the Write Head and read by the Read Heads using a special addressing mechanism called Least Recently Used Access (LRUA). In contrast to the two memory addressing mechanisms of NTM, namely content-based and location-based, LRUA allows data to be written either to the least used location (a rarely-used location) or to the most recently used location (the last used location) of the memory. This mechanism makes the model well suited to one-shot/few-shot, sequence-based prediction tasks. More specifically, it calculates the write-weight vector $w^w_t$ as follows:
$$w^w_t \leftarrow \sigma(\alpha)\, w^r_{t-1} + (1 - \sigma(\alpha))\, w^{lu}_{t-1}$$
where $\sigma(\alpha)$ is a sigmoid function of a scalar parameter $\alpha$, $w^r_{t-1}$ is the read-weight vector of the previous step, and $w^{lu}_{t-1}$ is the least-used weight vector, generated from the usage weight vector $w^u_{t-1}$, which is updated at every step with a decay parameter $\gamma$ as
$$w^u_t \leftarrow \gamma\, w^u_{t-1} + w^r_t + w^w_t$$
From this vector, an important weight called the least-used weight $w^{lu}_t$ is defined accordingly:
$$w^{lu}_t(i) = \begin{cases} 0 & \text{if } w^u_t(i) > m(w^u_t, n) \\ 1 & \text{if } w^u_t(i) \le m(w^u_t, n) \end{cases}$$
The notation $m(w^u_t, n)$ represents the $n$th smallest element of $w^u_t$. The memory is written in accordance with this write-weight vector:
$$M_t(i) \leftarrow M_{t-1}(i) + w^w_t(i)\, k_t$$
The Read Heads are used to read data out of the memory. First, a Read Head computes the cosine similarity between a query key vector $k_t$, generated from the controller output, and each memory cell as
$$K(k_t, M_t(i)) = \frac{k_t \cdot M_t(i)}{\lVert k_t \rVert\, \lVert M_t(i) \rVert}$$
Then, this measure is used to create a read-weight vector $w^r_t$ by applying a softmax function:
$$w^r_t(i) \leftarrow \frac{\exp(K(k_t, M_t(i)))}{\sum_j \exp(K(k_t, M_t(j)))}$$
Finally, a read vector $r_t$ is generated:
$$r_t \leftarrow \sum_i w^r_t(i)\, M_t(i)$$
This read vector is used in conjunction with the hidden state of the controller to produce the output of the network.
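The write and read steps described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the experiments: the function names (`lrua_step`, `least_used`), the default values of `gamma` and `n`, and the small constant `eps` are hypothetical, and the controller network that produces the key is omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(a, b, eps=1e-8):
    # Cosine similarity; eps (an assumption) avoids division by zero for empty slots.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def least_used(w_u, n):
    # 1 for slots whose usage is <= the n-th smallest usage value, else 0.
    threshold = sorted(w_u)[n - 1]
    return [1.0 if u <= threshold else 0.0 for u in w_u]

def lrua_step(memory, key, w_r_prev, w_lu_prev, w_u_prev, alpha, gamma=0.95, n=1):
    gate = 1.0 / (1.0 + math.exp(-alpha))  # sigma(alpha)
    # Write weights: blend of last-read slots and least-used slots.
    w_w = [gate * r + (1.0 - gate) * lu for r, lu in zip(w_r_prev, w_lu_prev)]
    # Write the key into memory according to the write weights.
    memory = [[m + w * k for m, k in zip(row, key)] for row, w in zip(memory, w_w)]
    # Read weights: softmax over cosine similarities between key and memory rows.
    w_r = softmax([cosine(key, row) for row in memory])
    # Read vector: weighted sum of memory rows.
    r = [sum(w * row[d] for w, row in zip(w_r, memory)) for d in range(len(key))]
    # Usage weights decay and accumulate the current read and write weights.
    w_u = [gamma * u + a + b for u, a, b in zip(w_u_prev, w_r, w_w)]
    return memory, r, w_r, w_u, least_used(w_u, n)
```

With `alpha = 0` the gate equals 0.5, so the key is written in equal parts to the previously read slot and to the least-used slot.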
In general, LRUA-MANN is well suited to meta-learning and one-shot/few-shot learning tasks because it can look back at learned knowledge through both long-term memory, via the network's updated weights, and short-term memory, via its external memory. This model overcomes a limitation of other RNN-based models, which do not memorize well.
Moreover, in the original paper, A. Santoro et al. experimented with two different neural networks as controllers: a Recurrent Neural Network (LSTM) and a Feed-Forward Network (FFN). The LRUA-MANN with the LSTM controller can use two types of memory to save information, namely the LSTM's hidden cell states and the Memory Bank (NTM memory cells), whereas the LRUA-MANN with the FFN controller uses only the external memory of the NTM. The LRUA-MANN using the LSTM controller seems to provide better accuracy than the one using the FFN controller, since it can effectively remember previously processed inputs.

Related Works
In terms of classifying malware with a few samples, there are two other relevant methods: one is Anomaly Detection, and the other is Domain Adaptation.

Anomaly Detection
Anomaly detection refers to the identification of unusual objects, or outliers, which are markedly different from the other known objects in the same dataset. These algorithms have broad applications in a variety of domains, such as detecting network attacks in cyber-security, detecting failures in a system, and removing anomalous data in the data preprocessing step of machine learning.
In the malware analysis field, many anomaly detection algorithms have been applied [19][20][21]. Based on all valid behaviors of benign programs, anomaly detection techniques help malware detectors detect previously unknown zero-day attacks. These methods alone are not sufficient for malware detection and are usually combined with other approaches to overcome their limitations, e.g., high false alarm rates and complexity [22].
To some extent, these features are related to the zero-shot/one-shot/few-shot learning tasks applied in the proposed approach. Both are used to identify never-before-seen malware classes based on prior experience. However, anomaly detection differs from the proposed approach in two aspects: (i) anomaly detection techniques may identify malware regardless of whether it comes from the same manifold, whereas the few-shot learning model (MANN) used in this approach classifies malware based on previously learned samples belonging to the corresponding families; and (ii) anomaly detection normally aims to distinguish between two contrasting objects, such as "normal" and "anomalous" process behaviors, or seen and unseen malware classes. In other words, it is a binary classifier, whereas our approach is a multi-class classifier that classifies malware into two or more classes. Although multi-class classification based anomaly detection techniques have been proposed (e.g., Stefano et al. [23] or Barbara et al. [24]), such methodologies help classifiers distinguish between multiple normal classes and one anomalous class; hence, they are not suitable for the multi-class classification tasks presented in this paper. Therefore, they are not used as comparative baselines for our approach. They would, however, be strong competitors in malware-versus-benign classification tasks.

Domain Adaptation
Another field associated with machine learning and transfer learning is Domain Adaptation. This field refers to the ability of a learning mechanism to improve performance on target tasks after being trained on a different but related concept in a previous source task [25]. This is also the purpose of the meta-learning approaches used in the proposed approach. In our approach, domain adaptation is applied because we want to adapt the network trained in the assumed known malware domain to the assumed unknown ransomware domain, which makes it flexible enough to work well on other new malware domains in the future.

Proposed Method
Even though many applications of machine learning to the malware classification task achieve very high performance (e.g., the recent works of M. Kruczkowski et al. [26] and M. Ahmadi et al. [27]), such methods require hundreds of samples to be effective. In this paper, we introduce a different method that classifies malware into the proper families using only a few known samples while maintaining an acceptable level of performance. This approach could solve the "data-hungry" problem, which is a major drawback of most current machine learning algorithms.
An overview of this approach is depicted in Figure 3. The proposed approach contains two different domains of learning, called Domain 1 and Domain 2. The first domain is used to train the model with already known malware types, and the second is used for either training or testing with a dataset of unknown malware types. The second domain is also known as the fine-tuning process, in which the model is made familiar with a new dataset using a few support samples.
In the second domain, the unknown malware samples are classified using the adapted model that was optimized via gradient descent during training in Domain 1. The model in this domain can be either trained for a few episodes with a few unknown malware samples or tested directly with the new types of malware.
Generally, the proposed method is implemented in the following five steps:

Collect Malware's Behavior Characteristics
Depending on the malware analysis method, e.g., static analysis or dynamic analysis, different features can be collected from malware. On the one hand, some useful resources of a program, such as lists of API (Application Programming Interface) calls and sequences of opcodes, are usually used in static malware analysis. This method is often straightforward, but it is vulnerable to self-encryption (packing), obfuscation, and self-morphing (e.g., polymorphic malware species). On the other hand, despite some limitations, such as being time-consuming and requiring specific environments to record malware behavior, dynamic analysis can overcome the hindrances of static analysis because it analyzes the behavior of malware on the fly. Since malware and benign programs use API calls, such as file I/O and registry read/write, to interact with the OS (Operating System), analysts using dynamic analysis methods can collect these calls with process hooking libraries, e.g., the Detours library [28] or EasyHook [29]. Moreover, some methods that use API calls as input features to classify malware achieve good results; hence, API call sequences can be considered good resources for identifying malware.

Extract Feature
Some kinds of malware inject fake API calls into the regular API call sequence to cover up temporal information. To reduce the effect of these fake API calls, we use the n-gram method to split API call sequences into sub-sequences of API calls. This idea is inspired by the work of S. Guo et al. [30] on splitting the API call sequence into sub-sequences.
The n-gram concept is widely used in NLP (Natural Language Processing), where an n-gram is simply an n-character slice of a long sentence. By sliding an n-sized window along the sequence of API calls, series of n consecutive API calls are obtained. After being generated, these new n-gram APIs are converted into a numerical representation in the vectorization process that follows.
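The sliding-window extraction can be illustrated with a short Python sketch. The function name `api_ngrams`, the `|` separator used to join calls into a single token, and the example call sequence are assumptions made for illustration only.

```python
def api_ngrams(api_calls, n):
    """Slide an n-sized window over an API call sequence and return n-gram tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each token joins n consecutive API calls; the '|' separator is an assumption.
    return ["|".join(api_calls[i:i + n]) for i in range(len(api_calls) - n + 1)]

# Illustrative sequence (not taken from either dataset).
calls = ["CreateFileW", "WriteFile", "RegSetValueExW", "CloseHandle"]
print(api_ngrams(calls, 2))
```

With n = 1 the tokens are simply the individual API calls, and a sequence shorter than n yields no tokens.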

Vectorize Feature
This step finds the significant relationships between n-gram APIs in order to generate the most appropriate feature vector for each malware sample. A vectorization model is trained by converting all n-gram APIs of the malware in Domain 1 into vectors via word2vec, introduced by T. Mikolov et al. [31]. The feature vector of a malware sample is then obtained by averaging the vectors of all its converted n-gram APIs.
In NLP, word2vec comes in two variants, Continuous Bag-of-Words (CBOW) and Continuous Skip-gram, which are simple two-layer neural networks that produce word embeddings. These methods have been shown to be more efficient than other NLP models at representing words. In the CBOW architecture, the model predicts the current n-gram API from the surrounding APIs without regard to their order. In contrast, the continuous skip-gram architecture uses an n-gram API to predict the surrounding n-gram APIs. In both cases, the output vector of each API is taken from the learned weights of the hidden layer. In this paper, our approach uses both models, which achieve different results depending on the dataset.
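The averaging step can be sketched as follows. The sketch assumes the trained word2vec model has already been exported to a plain dictionary `embeddings` that maps each n-gram token to its vector; the function name, the dictionary, and the zero-vector fallback for a sample with no known tokens are all assumptions for illustration.

```python
def malware_feature_vector(ngrams, embeddings, dim):
    """Average the embedding vectors of all n-gram tokens found in `embeddings`."""
    # Tokens without a learned vector are skipped.
    known = [embeddings[g] for g in ngrams if g in embeddings]
    if not known:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# Toy 2-dimensional embeddings (hypothetical values).
embeddings = {"CreateFileW": [1.0, 3.0], "WriteFile": [3.0, 5.0]}
print(malware_feature_vector(["CreateFileW", "WriteFile", "UnknownCall"], embeddings, 2))
```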

Train the Memory Augmented Neural Network
We train the Neural Turing Machine with the LRUA addressing mechanism to learn the meta-data of known malware samples in Domain 1. In contrast to traditional machine learning training methods, for the meta-learning task in this training process, we ensure that the model satisfies the rule of a self-learning system, a theory of Hochreiter et al. [14]. We provide a combination of the previous output of the network and the current sample as the input to the network so that it can optimize its sub-systems by itself. The input of the network is declared as a sequence $\{(x_t, y_{t-1})\}_{t=1,\dots,n}$, where $x_t$ is the current input malware sample and $y_{t-1}$ is the malware class of the previous step of the sequence. Hence, the input sequence of the network is represented as $\{(x_0, 0), (x_1, y_0), \dots, (x_n, y_{n-1})\}$.
In each training episode, the network is taught to recognize malware gradually from a sequence of samples belonging to five random malware families. This is called the 5-way classification task. These five classes are randomly drawn from the training dataset, numbered from 0 to 4, and encoded as one-hot vector labels. The classes are different in every episode. For each class, an equal number of malware samples is randomly sampled.
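A minimal sketch of how such an episode could be assembled is shown below. The function name `build_episode`, the dictionary `samples_by_family` (family name mapped to a list of feature vectors), and the use of an all-zero vector as the first "previous label" are assumptions for illustration, not a description of the original code.

```python
import random

def build_episode(samples_by_family, n_way=5, per_class=10, rng=None):
    """Build one episode: inputs are (x_t, one-hot y_{t-1}) pairs, targets are y_t."""
    rng = rng or random.Random()
    # Pick n_way families at random; their position gives the episode-specific label.
    families = rng.sample(sorted(samples_by_family), n_way)
    pairs = []
    for label, family in enumerate(families):
        for x in rng.sample(samples_by_family[family], per_class):
            pairs.append((x, label))
    rng.shuffle(pairs)

    def one_hot(label):
        return [1.0 if i == label else 0.0 for i in range(n_way)]

    inputs, prev = [], [0.0] * n_way  # no label is available at the first step
    for x, label in pairs:
        inputs.append((x, prev))      # current sample with the previous label
        prev = one_hot(label)
    targets = [label for _, label in pairs]
    return inputs, targets
```

Because the labels 0 to 4 are reassigned in every episode, the network cannot memorize a fixed family-to-label mapping and has to rely on the samples it has stored in memory during the episode.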
The output at time step $t$ of the Memory Augmented Neural Network is defined according to the definition in Section 2.2. In general, it can be written as a function of the whole MANN network:
$$o_t = \mathrm{MANN}([x_t, y_{t-1}], h_{t-1}, c_{t-1}, r_{t-1})$$
where $[x_t, y_{t-1}]$ is the combination of the vectorized malware sample and the class of the previous malware sample in the sequence at time $t$, $h_{t-1}$ and $c_{t-1}$ are the previous states of the MANN's controller (LSTM), and $r_{t-1}$ is the value read from the memory at the previous step. Then, the classification result is a categorical distribution $p_t$ computed from the output $o_t$ using the weights $W^{op}$ from the MANN's output to the linear output layer:
$$p_t = \mathrm{softmax}(W^{op} o_t)$$
This network is trained for thousands of episodes until the output of the classification task reaches an acceptable threshold. Then, this optimized model is transferred to Domain 2 to be tested with unknown data.

Adapt Trained Network to the New Domain
In this step, we adapt the network trained in the previous step to classify new kinds of malware. The word2vec model trained in Domain 1 is used to generate the feature vectors of the new samples. Note that the new malware samples may contain n-gram APIs that were not collected and vectorized by the word2vec model trained in Domain 1. Such n-gram APIs are excluded from the calculation of the new feature vector. The number of missing n-gram APIs can be reduced by collecting enough malware samples in Domain 1.
For some tasks, if there are enough samples of the new malware types, we can use some of them to fine-tune the model before classifying the unknown samples, which increases the final accuracy. Otherwise, the unknown malware can be classified directly with the pre-trained model.
The training and testing procedures in this domain are the same as in Domain 1.

Experiments
In this paper, experiments with two models on two different datasets are described. One model uses an LSTM as the controller, as in the original paper, and the other uses a Gated Recurrent Unit (GRU) as the controller. LSTM and GRU are two closely related recurrent units of the recurrent neural network (RNN), which is an extension of the conventional feed-forward neural network. While the LSTM, introduced by Hochreiter and Schmidhuber, has had a long history of application since 1997, the GRU was first introduced in 2014 by Cho et al. [32]. It is slightly less complex but approximately as good as an LSTM performance-wise. The key difference between them is that the GRU has two gates (reset and update gates), while the LSTM has three gates (input, output, and forget gates). According to empirical evaluations of these RNN units [33,34], the GRU can outperform the LSTM in terms of convergence in CPU time, the number of parameters, and, in some cases, performance. In our experiments, they provide different performance depending on the dataset.
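The difference in the number of parameters can be made concrete with a small calculation. The sketch below uses the textbook formulations, in which an LSTM layer has four weight blocks (three gates plus the cell candidate) and a GRU layer has three (two gates plus the candidate state), each with a single bias vector; some library implementations add extra bias terms, so actual counts may differ slightly. The hidden size of 200 is a hypothetical value, and the input size of 55 assumes a 50-dimensional feature vector concatenated with a 5-dimensional one-hot label.

```python
def lstm_params(input_dim, hidden_dim):
    # Four blocks: input, forget, and output gates plus the cell candidate.
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def gru_params(input_dim, hidden_dim):
    # Three blocks: reset and update gates plus the candidate state.
    return 3 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

input_dim, hidden_dim = 55, 200  # hidden_dim is an assumed, illustrative value
print(lstm_params(input_dim, hidden_dim))  # 204800
print(gru_params(input_dim, hidden_dim))   # 153600
```

Under these assumptions the GRU controller has three quarters of the recurrent parameters of the LSTM controller.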

Hyper Parameters
For the vectorization process, we use the word2vec model and vary its parameters to generate different feature vectors. Specifically, we try our model with different input features according to the following factors of the word2vec model:
• N-gram: The number of API calls used to create a new n-gram API.
• Min-count: A threshold such that all n-gram APIs whose total frequency in a malware's API sequence is lower than it are ignored.
• Window-size: The maximum distance between the current and predicted word within a sentence.

Case 1: FFRI 2017 Dataset
The FFRI dataset is a part of the anti-Malware engineering WorkShop (MWS) dataset, which is designed for use in anti-malware research. The data have been collected and created every year since 2013 by a private company, FFRI, using the dynamic malware analysis systems Cuckoo Sandbox and Yarai Analyzer Professional. In this paper, we use the FFRI 2017 dataset, collected from March 2017 to April 2017 [10]. This dataset contains logs of 6251 malware samples in total, most of which were selected randomly from a massive collection of malware gathered by crawling web sites and online malware services that reflected the malware trends at that time, such as VirusTotal; all samples are in Portable Executable (PE) format. Each malware sample's activities are recorded by Cuckoo Sandbox for 90 s and are provided in JSON format.
In this experiment, we assume that we already know many types of malware except ransomware, and the task is to classify ransomware into the assigned classes with very little knowledge of them. For that purpose, the dataset is divided into two subsets: one contains only the ransomware types, and the other contains the rest of the FFRI 2017 dataset. Splitting the dataset in this way lets us verify the ability of the model to deal with never-before-seen ransomware samples. All classes in this dataset that have more than ten samples are collected. The known malware dataset has 45 classes with a total of 4154 malware samples. The ransomware dataset contains nine ransomware classes with 1574 ransomware samples. These classes are listed in Table 1. We train our model in Domain 1, which consists of the known types of malware, and use this trained model to classify ransomware in Domain 2.

Training of the Model in Domain 1
First, we split the API sequence of every record (malware sample) in the first dataset into n-grams. In our experiments, splitting the API sequences into 1-grams gave the best results. Second, we train the word2vec network on those new samples with a variety of hyperparameters, such as the window size and the model type (i.e., skip-gram or CBOW). Then, the malware's feature vector is calculated as the average of all converted n-gram API vectors of the corresponding malware. Next, we train the model on 5-way classification tasks (classifying malware into five categories). In this task, for each episode, ten samples of each malware class are randomly chosen to create an input sequence of fifty samples (five malware families with ten samples each), which is fed to the LRUA-MANN. The model is trained to classify the samples in this sequence one by one into five different categories numbered from 0 to 4 (these categories are different from the malware families). In particular, after randomly guessing the first sample of the sequence, the model classifies the next instance based on its knowledge of the previous samples stored in the memory, and so forth.
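The accuracies reported per instance in the experiments (for example, at the 6th or 10th instance) can be computed from the predictions of an episode roughly as follows. This is a minimal sketch that assumes the "k-th instance" refers to the k-th time a class label appears within an episode; the function name `accuracy_by_instance` is hypothetical.

```python
from collections import defaultdict

def accuracy_by_instance(targets, predictions, max_instance=10):
    """Accuracy on the 1st, 2nd, ..., max_instance-th occurrence of each class."""
    seen = defaultdict(int)          # how often each class has appeared so far
    correct = [0] * max_instance
    total = [0] * max_instance
    for y, p in zip(targets, predictions):
        seen[y] += 1
        k = seen[y]                  # this is the k-th instance of class y
        if k <= max_instance:
            total[k - 1] += 1
            correct[k - 1] += int(p == y)
    # None marks positions that never occurred in the episode.
    return [c / t if t else None for c, t in zip(correct, total)]

# Two classes: the first guess for class 0 is wrong, later predictions are right.
print(accuracy_by_instance([0, 1, 0, 1, 0], [1, 1, 0, 1, 0], max_instance=3))
```

The example prints `[0.5, 1.0, 1.0]`: one of the two first instances is misclassified, and all later instances are correct.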
The training process is terminated after 200,000 episodes. To calculate the loss of the training, we use categorical cross-entropy as a loss function for multi-class classification. This loss value gradually reduces after some thousands of episodes of training, and the trained model with the lowest loss value is selected for the unknown ransomware classification tasks in Domain 2.
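For reference, the categorical cross-entropy over one episode can be written as below. The notation is an assumption chosen for illustration: $y_t$ denotes the one-hot target label at step $t$, and $p_t$ the categorical distribution predicted by the network at that step.

```latex
\mathcal{L}(\theta) = -\sum_{t} y_t^{\top} \log p_t
```

Since $y_t$ is one-hot, each term reduces to the negative log-probability that the network assigns to the correct class at step $t$.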

Classification of the Unknown Ransomware in Domain 2
The trained models with the lowest loss value in Domain 1 are selected to obtain the best accuracy on the unknown ransomware dataset. Two experiments are conducted in this domain: (i) use the pre-trained model without any modification in Domain 2 to predict the unknown ransomware, and (ii) fine-tune the pre-trained models with 11 samples of each ransomware family, and then test them with the rest of the ransomware dataset in Domain 2. The first experiment is conducted with many parameter-dependent models. Among them, the best result of 71.99% accuracy after examining 9 samples is obtained when we use the skip-gram model with the window-size parameter of 50 and the min-count parameter of 2 to convert 1-gram APIs into a vector of 50 dimensions. Table 2 shows the results of the model using LSTM as a controller with five different word2vec parameter settings.
Table 2. Accuracies of the first ten instances without training in Domain 2 of the model using LSTM as a controller. The best result is achieved by using feature vectors generated from word2vec in the skip-gram model with the following parameters: n-gram of 1, vector-size of 50, window-size of 5, min-count of 2.

Columns: Model, Vector Size, Window Size, Min-Count, n-Gram, 6th, 10th
We also compare the performance of the model when using different controllers. According to Table 3, the MANN model using LSTM outperforms the one using GRU from the second instance onward. Regarding the baselines, to the best of our knowledge, there are no other one-shot/few-shot learning methods for classifying malware using API call sequences so far. Therefore, to verify our results, we use traditional classifiers such as Support Vector Machine (SVM), Random Forest, Feed-Forward Network, and Nearest Neighbors. We also use other RNNs (i.e., LSTM and GRU) to classify ransomware families as baselines under the same test conditions as our approach.
According to Table 1, the categories in the ransomware dataset are unbalanced. For example, the family trojan-ransom.win32.shade has only 21 samples, while the trojan-ransom.win32.blocker family has 555 samples. Although this imbalance does not influence our model, because we randomly pick five samples from each class each time, it could affect the final results of the baselines. Its effect can be reduced to a great extent by under-sampling the majority classes so that their sizes match that of the trojan-ransom.win32.shade class. Hence, we randomly select 21 samples from each class to create a subset of 105 samples for each experiment.
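The under-sampling and the per-class train/test split used for the baselines can be sketched as follows. The function names `undersample` and `split_per_class` are hypothetical, and the sketch only illustrates the sampling procedure, not the baseline classifiers themselves.

```python
import random

def undersample(samples_by_class, rng):
    """Randomly reduce every class to the size of the smallest class."""
    n_min = min(len(items) for items in samples_by_class.values())
    return {c: rng.sample(items, n_min) for c, items in samples_by_class.items()}

def split_per_class(balanced, n_train, rng):
    """Use n_train random samples of each class for training and the rest for testing."""
    train, test = [], []
    for label, items in balanced.items():
        items = items[:]          # copy so the caller's lists are not reordered
        rng.shuffle(items)
        train += [(x, label) for x in items[:n_train]]
        test += [(x, label) for x in items[n_train:]]
    return train, test
```

With five classes reduced to 21 samples each, a split with `n_train=5` gives 25 training and 80 test samples; repeating the sampling with a fresh random state for each run and averaging the accuracies corresponds to the repeated baseline evaluation.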
Except for the experiments conducted with RNN, which is the same with the MANN model, we run the experiments of the baselines (traditional machine learning methods) with five and nine random samples from each class for training and the rest for testing. These experiments correspond to the 6th and 10th instance of the MANN model's experiments, respectively. In both experiments, the baselines are reset to train and test 1000 times, and the final accuracies of these baselines are the averages of their accuracies. These results are then compared with the results of the 6th instance and 10th instance of the MANN model, respectively. In both cases, the proposed models overcome all six baseline models. Regarding the baselines' results, except for the K-NN model with very low accuracies (25.63% and 32.76% for the 6th instance and the 10th instance, respectively), other baselines could reach up to around 50% in these experiments. However, these results are not higher than ours in both cases. The results show that the MANN model using LSTM as the controller is better than using GRU in these experiments (69.58% and 71.99% with LSTM as controller compared to 66.38% and 68.44% of the MANN model with GRU as controller). The detailed results are listed in Table 4. Table 4. Accuracies of the first ten instances without training in Domain 2 of the FFRI 2017 dataset. The best result is archived by using feature vectors generated from word2vec in the skip-gram model with the parameters: n-gram of 1, vector-size of 50, window-size of 5, min-count of 2. If we spend a few data in Domain 2 for fine-tuning the pre-trained models, we could improve their overall performance. Table 5 shows the results of the 5-class classification tasks after the fine-tuning process. In these experiments, both baseline models and MANN are pre-trained with a total of 55 samples, in which 11 samples of each class are randomly selected. 
The test results of the pre-trained baselines without any additional training correspond to the first instance in the experiments of the MANN model. In the other cases, we train the baselines with one, five, or nine random samples from each feeding class and compare the results with the proposed approach at the 2nd, 6th, and 10th instances, respectively. Note that, even when fine-tuning, this setup still follows the same procedure as the previous experiments, since the assigned class is a random number. The results are very competitive: the MANN model using GRU as a controller guesses correctly 84.59% of the time at the 2nd instance, 90.8% at the 6th, and 91.77% at the 10th, ahead of or on par with other baselines such as LSTM (81.63%, 90.88%, and 87.08%) and SVM (83.32%, 83.51%, and 84.22%).
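The baseline protocol used in this section (k random samples per class for training, the rest for testing, averaged over repeated resets) can be sketched as below. This is an illustrative outline under stated assumptions, not the original evaluation code: `k_shot_split`, `mean_accuracy`, and the `fit_and_score` callback are hypothetical names, and the callback stands in for training a freshly reset classifier and returning its test accuracy.

```python
import random
from statistics import mean


def k_shot_split(by_class, k, rng):
    """Pick k random samples per class for training; the rest go to testing."""
    train, test = [], []
    for label, items in by_class.items():
        shuffled = rng.sample(items, len(items))  # shuffled copy, input untouched
        train += [(x, label) for x in shuffled[:k]]
        test += [(x, label) for x in shuffled[k:]]
    return train, test


def mean_accuracy(by_class, k, fit_and_score, runs=1000, seed=0):
    """Average the accuracy over `runs` independent random splits.

    `fit_and_score(train, test)` is a hypothetical callback that trains a
    fresh (reset) baseline on `train` and returns its accuracy on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = k_shot_split(by_class, k, rng)
        scores.append(fit_and_score(train, test))
    return mean(scores)
```

With five balanced classes of 21 samples each, `k=5` yields 25 training and 80 test samples per run, and `k=9` yields 45 and 60.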

Case 2: API-Based Malware Detection System Dataset
In this experiment, we use a dataset collected and shared by Huy Kang Kim et al. [11]. This API-based Malware Detection System (APIMDS) dataset contains 23,080 malware samples picked randomly from the Malicia project [35] and VirusTotal [36]. It was shared online via the Hacking and Countermeasure Research Lab in the Graduate School of Information Security of Korea University. This dataset is summarized in Table 6.
We conduct the same experiments as on the FFRI 2017 dataset in the previous section. We split this dataset into two smaller subsets. One contains all malware assumed to be known, which is every type of malware in this dataset except the ransomware types, and the other contains only ransomware samples. Table 7 lists all ransomware classes and their number of samples.
As in the FFRI 2017 dataset, the ransomware dataset is imbalanced. For example, the Trojan-ransom.win32.blocker family has only 18 samples, while the others have 84 or more. For the baselines, we balance the classes by reducing each of the other classes to the number of samples in the Trojan-ransom.win32.blocker family. The selected samples are re-drawn at random for each experiment with the traditional classifiers.
In the first domain, the malware samples in the APIMDS dataset are converted into feature vectors via the word2vec model using various parameter settings. The final results of the model in Domain 2 show that the models produce the best accuracy when the 250-sized feature vectors are generated using the CBOW model with a window-size of 5, 1-gram, and min-count of 1. A comparison of this setting with the others is detailed in Table 8. In contrast to the FFRI 2017 dataset, where 50-sized vectors from the skip-gram model worked best, on this dataset 250-sized feature vectors generated by the CBOW model give the best performance.
The models using different controllers are also examined on this dataset. According to Table 9, the MANN model using the GRU controller outperforms the one using LSTM only in the first few instances.
Our models also show superior results over the baselines. Apart from the KNN classifier, which has a very low rate of 28.65%, the other traditional methods categorize ransomware samples into the correct classes with an accuracy of only around 50%. These results are shown in Table 10.
In the case of fine-tuning, our pre-trained models and the baselines are again fed 40 samples for fine-tuning, with eight samples randomly taken from each class, leaving the remaining ten samples in each class for testing. In general, fine-tuning improves the performance of both our models and the baselines. As indicated in Table 11, although the classification rates of the first instance of the MANN models are lower than those of the baselines (e.g., 43.02% with the GRU controller versus over 71% for the baselines), the second instance is classified at a higher rate: above 75% with the model using the LSTM controller and 80.55% with the one using the GRU controller. From the 6th sample onward, the model using LSTM classifies over 89% correctly, while SVM, Nearest Neighbors (with K = 4), MLP, and RF stay almost the same, at 78.09%, 56.02%, 78.58%, and 70.98%, respectively. The RNN models could not outperform the MANN models, although they were better than the traditional machine learning methods.

Table 7. Ransomware families of Domain 2 dataset.

Families                          Samples
trojan-ransom.win32.blocker       18
trojan-ransom.win32.mbro          84
trojan-ransom.win32.agent         93
trojan-ransom.win32.pornoasset    90
trojan-ransom.win32.foreign       145

Table 8. Accuracies of the first ten instances without training in Domain 2 of the model using LSTM as a controller. On the APIMDS dataset, the best model is achieved by using feature vectors generated from the word2vec in the CBOW model with the parameters: n-gram of 1, vector-size of 250, window-size of 5, min-count of 1.

Discussion
In this section, we have evaluated the proposed approach on the two datasets. The experimental results vary depending on several factors.
Firstly, the best results, achieved with the Skip-gram model on the FFRI 2017 dataset and with the CBOW model on the APIMDS dataset, show that the most important influence on this approach is the choice of the word embedding model that best fits the dataset. Both the Skip-gram model and the CBOW model, under various settings, have advantages and disadvantages. According to Mikolov et al. [31], the Skip-gram model works well with a small amount of training data, whereas CBOW is several times faster to train and gives slightly better accuracy for frequent words (here, frequent APIs). These observations hold in our cases: only 4154 samples from the FFRI 2017 dataset were used to train the model, whereas nearly 22,650 samples extracted from the APIMDS dataset were used. Hence, depending on the size of the dataset, we can choose the most suitable word embedding model for our approach.
Secondly, we consider the MANN model's parameters to be another vital factor. Since the MANN model is one of the two important components of this approach, its network parameters, such as the controller type and memory size, determine which model is suitable for a specific dataset and number of available samples. For example, in our experiments, the model equipped with LSTM seems to adapt better to a few samples, but it is better to use GRU as the controller in some other cases when there are more data to learn from. However, more experiments with different datasets are necessary before deciding on the best controller for this approach.
Thirdly, the way the API calls of malware are recorded is another factor that influences the final performance. While the APIMDS dataset provides the API calls of each malware sample as a sorted sequence following the sampling methods of its authors, the FFRI 2017 dataset stores the API calls of the malware separately, according to their recording times, in JSON format. Consequently, the structure of the API call sequences may differ between the two datasets.
Finally, the way API calls are merged into new tokens using n-grams also contributes to the final results of the model. On these datasets, the unigram (n = 1) gives the best results among the n-gram settings and improves the quality of the models' inputs. As mentioned in the previous sections, it seems that the unigram is useful not only for extracting important features but also for eliminating the effect of fake injected APIs in malware samples. Therefore, to conclude, depending on the malware data, many aspects need to be considered to increase the efficiency of this approach.
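To make the n-gram merging concrete, the following minimal sketch turns an API call sequence into n-gram tokens before they are passed to a word embedding model. It is an illustration only: the function name `merge_ngrams`, the `+` separator, and the example call sequence are assumptions and are not taken from the datasets.

```python
def merge_ngrams(api_calls, n=1, sep="+"):
    """Merge every run of n consecutive API calls into one token.

    With n=1 (unigram) the sequence is returned unchanged, which is the
    setting reported as best on both datasets.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sep.join(api_calls[i:i + n]) for i in range(len(api_calls) - n + 1)]


# Hypothetical API call sequence of one sample.
calls = ["CreateFileW", "WriteFile", "CloseHandle"]

unigrams = merge_ngrams(calls, n=1)  # same three tokens as the input
bigrams = merge_ngrams(calls, n=2)   # two merged tokens
```

A single injected call changes only one unigram token but alters up to n of the merged tokens for larger n, which is one possible reading of why the unigram setting is less sensitive to fake APIs.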

Conclusions
The early detection and prevention of threats such as spreading malware and zero-day exploits, in the defense systems of small networks, organizational networks, and the whole internet, is an essential area of cyber-security research. Almost all recently proposed malware classification methods require large numbers of malware samples to achieve acceptable accuracies. To overcome this problem, the proposed approach takes advantage of AI developments to introduce a novel way to help malware analysts quickly classify malware into the correct groups, even with only a few known samples.
In this paper, we have shown the effectiveness of malware classification based on Natural Language Processing in combination with a Memory Augmented Neural Network for the one-shot learning task. The accuracies of the proposed approach are quite good, even when only one sample has been seen. Furthermore, these results could be improved by adjusting the hyper-parameters of the Neural Turing Machine and the parameters of the word2vec model; however, this makes the approach parameter-dependent. Therefore, deeper research into this disadvantage is necessary before applying the approach in practice.
As future work, on the one hand, we will take a deeper look at other one-shot learning algorithms to find more suitable methods for malware analysis and improve the accuracy of our approach. The evaluation methods will also be reconsidered and conducted more precisely so that the proposed approaches can be accurately evaluated. On the other hand, we will try to collect more benign programs as well as API sequences of different kinds of malware, so that we can re-evaluate how well our methods detect malware and distinguish it from benign software.
Finally, because this is only the beginning of research that applies one-shot learning algorithms from the machine learning field to malware analysis and cyber-security, we hope our contribution can create a paradigm for future studies in malware research and provide malware analysts with a new way to classify malware even with only a few samples.