Contextual Identification of Windows Malware through Semantic Interpretation of API Call Sequence

Abstract: The proper interpretation of the malware API call sequence plays a crucial role in identifying its malicious intent. Moreover, there is a need to characterize the mimicry activities of smart malware that resembles goodware programs. Such malware poses further challenges in recognizing its malicious activities. In this paper, we propose standard and straightforward contextual behavioral models that characterize Windows malware and goodware. We rely on word embedding to capture the contextual associations that may occur between API functions in malware sequences. Our empirical results show that there is a considerable distinction between malware and goodware call sequences. Based on that distinction, we propose a new malware detection method that relies on the Markov chain. We also propose a heuristic method that identifies malware mimicry activities by tracking the likelihood behavior of a given API call sequence. Experimental results show that our proposed model outperforms peer models that rely on API call sequences. Our model achieves an average malware detection accuracy of 0.990, with a false positive rate of 0.010. Regarding malware mimicry, our model achieves an average accuracy of 0.993 in detecting false positives.


Introduction
With the rapid development of computer and Internet technology, malicious programs (malware) have also grown significantly in both variety and quantity. Researchers have focused on devising diverse malware detection methods to curb the rapidly growing malware rate. Generally, malware detection methods are categorized as either static or dynamic [1]. In static malware detection, researchers check and analyze the contents of portable executable (PE) files without executing the malware samples.
During static analysis, analysts investigate PE files by extracting specific features such as string patterns, operation code (op-code) sequences, and byte sequences. These features are generally treated as discriminating features used to decide whether a given sample is malicious [2]. Nevertheless, static malware detection methods have been shown to be inadequate against the techniques that malware authors use to bypass detection [3][4][5].
In contrast to static analysis, dynamic analysis tools are used to monitor the malware during execution. Through observing malware in real-time, we can extract valuable behavior features such as network behavior, system calls, registry change, and memory usage [6].
Application Programming Interface (API) call sequences are regarded as distinguishing, representative features in behavior-based malware analysis [7]. Their prominence stems from the fact that API call analysis can uncover and capture actual malware behavior, which is not attainable through static analysis. Therefore, dynamic analysis research has relied on run-time features such as the API call sequence and control flow to reveal malicious behavior [8]. However, dynamic analysis approaches are also insufficient: it was reported in [9] that sophisticated malware can detect whether it is running in a virtual or a real environment.
One of the most effective ways for malware to avoid exposure is to behave like a normal, benign executable. This kind of mimicry behavior has become a real challenge for malware detection tools. It is natural to assume that the most common malware attacks (especially on Windows operating systems) are delivered as executable files; however, security reports [10] showed that the most serious attacks in the wild are those carried out through mimicry infections. Such infections allow attackers to exploit vulnerabilities in third-party applications to trigger executable payloads. A further problem is that vulnerabilities in third-party applications are often not patched promptly. Late or missing security updates therefore considerably extend the lifespan of attacks carried out through mimicry infections.
Machine learning-based techniques have been used to detect malicious parts that are embedded in infected user applications such as PDF files. Research work demonstrated the effectiveness of learning-based systems at detecting obfuscated attacks that are capable of circumventing plain heuristics [11][12][13]; however, the problem still requires significant work to resolve.
Malware analysis tools should also pay attention to non-executable files that appear to behave benignly yet conceal malicious code, which makes their detection significantly harder. Despite its imperfections, dynamic analysis can capture measurable indicators that are determined while the malware interacts with the underlying operating system. These indicators can be used to detect a possible attack [14].
In this work, we exploit the contextual embedding features of the API call sequence. By modeling the transitions in the calling sequence, we generate behavioral models for malware and goodware. Although malicious and non-malicious applications use the same API functions, we show that the two types differ in how they use them. We also propose a solution to detect Windows malware as well as malware mimicry, that is, fake goodware programs.
We organized the rest of the paper as follows: Section 2 discusses the related work and other research backgrounds. In Section 3, we present our proposed malware detection model. The datasets, along with the empirical evaluations of our model, are presented in Section 4. Section 5 concludes this paper.

Related Work
Many studies have aimed to analyze malware characteristics. The predominant way to analyze malware is to monitor its behavior, and one of the leading approaches to observing program behavior is to track its API calls [15,16]. API functions are standard by themselves; there are no groups of malicious or non-malicious functions. Malicious applications use the same regular API functions to perform their harmful activities, so the mere act of calling an API function does not distinguish malicious from normal programs. The order of API calls, however, may reveal the contextual behavioral characteristics of the calling process [17]. Still, because of the vast number of API functions, it is laborious to describe the behavioral attributes of running processes by monitoring and tracing all APIs simultaneously.
The API calling sequence that takes place among the processes and the operating system is considered influential. Hence, it is viewed as a fundamental distinction between the behavior of malicious and normal processes [3]. Therefore, most research work in malware analysis tried to understand the process behavior through analyzing API calls [18]. The order of functions in the calling sequences could lead to meaningful expressions that provide reliable malware recognition. The API calls encode sufficient information regarding the possible malware functionalities that happen throughout malware execution.
Popular machine learning algorithms such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), and Naive Bayes (NB) are widely used in malware detection [19][20][21][22]. Conventional machine learning algorithms are potentially able to learn behavioral features from malware samples. However, the performance of any machine learning algorithm is determined by the accuracy of the extracted features, and it is difficult to extract behavioral features that are significant enough to improve detection performance. Therefore, common machine learning algorithms appear unpromising for malware detection [23,24].
Lu et al. [25] and Wu et al. [26] converted API calls into regular expression (RE) rules to identify and extract malicious sequence patterns. They recognized any malicious sequence as malware when any match exists between the observed API call sequence and predefined RE rules. Taejin et al. [27] transformed API calls into some code arrangements and grouped the APIs using n-gram. Tran et al. [28] used natural language processing to analyze the API call sequence. They divided the long sequence calls into small chunks using approaches like n-gram. The resultant n-grams were assigned values by using the term frequency-inverse document frequency (TF-IDF).
The main objective of using TF-IDF is to transform the textual n-grams into numerical features to enable the application of machine learning algorithms. However, statistical approaches like TF-IDF do not conserve any contextual association that exists among words [29,30]. Consequently, in our work, we employed the word embedding on the API calling sequences to infer the contextual association among the API calls.
Despite the accuracy of machine learning-based models for malware detection, researchers have grown increasingly suspicious about the reliability of learning algorithms against malware mimicry attacks [31][32][33][34]. These attacks became quite popular, as shown in [31], and were subsequently discussed in [35][36][37][38]. These studies showed that mimicry attacks deceive malware detection models, resulting in misclassification.
In this paper, we propose a malware detection mechanism that relies on the contextual relations among APIs within the calling sequence. We also address the mimicry behaviors that malware can exhibit. The proposed work provides a reliable technique that detects both malware and mimicry malware (fake goodware) calling sequences with high accuracy.

Proposed Model
As mentioned above, earlier studies were mainly concerned with finding and extracting behavioral feature patterns in API calling sequences. These behavioral patterns are used as features for identifying and detecting malware. However, previous studies did not investigate the associations that may exist among the different API functions across the entire calling sequence. In our proposed model, we aim to discover any relations that occur in benign or malicious calls. As shown in Figure 1, our model consists of three phases: initialization, learning, and testing. We briefly discuss each phase in the following sections.

Initialization Phase
The main purpose of the initialization phase is to convert the API call sequence into a cluster sequence. A major obstacle we faced during malware analysis is the large number of distinct API functions, which makes the analysis process extremely hard. The analysis becomes feasible, however, if that massive number of APIs can be reduced to a manageable set.
We believe that the arrangement of API functions in malware calling sequences is not random. It conceals notable contextual patterns that carry out the malware's malicious activities. These contextual malicious patterns are relatively similar across different malware sequences. By extracting the contextual patterns from a large number of malware API call sequences, we improve our ability to characterize the contextual relations that exist within malicious API call sequences.
Therefore, our model relies on word embedding [39] to find contextually related API functions. As with word embedding of natural-language text, the distribution of API function vectors in the embedding space depends solely on the contextual similarity among APIs in the input corpus of API call sequences. When two API functions are contextually similar, they are positioned close to each other in the space; when they are contextually dissimilar, they are placed far apart. In our experiments, we set the embedding vector dimension to 300, the window size to 8, and the number of workers to 6. As shown in Figure 1, word embedding produces two outputs for each set of training sequences: the APIs and the embedding model.
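The idea that APIs used in similar contexts end up close together can be illustrated with a small sketch. This is not the trained embedding described above: it substitutes plain windowed co-occurrence counts for a learned 300-dimensional model, and the function names (`context_vectors`, `cosine`) and the toy corpus are assumptions made only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sequences, window=8):
    """For every API, count which APIs occur within `window` positions of it."""
    vectors = defaultdict(Counter)
    for seq in sequences:
        for i, api in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[api][seq[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus; real input would be full API call traces.
corpus = [
    ["globalalloc", "globallock", "globalunlock", "getversion"],
    ["globalalloc", "globallock", "globalunlock", "getcommandlinea"],
    ["regopenkeyexa", "regqueryvalueexa", "regclosekey", "getversion"],
]
vecs = context_vectors(corpus, window=2)
sim_memory = cosine(vecs["globallock"], vecs["globalunlock"])
sim_mixed = cosine(vecs["globallock"], vecs["regqueryvalueexa"])
print(sim_memory, sim_mixed)
```

In this toy corpus the two memory-management calls share most of their neighbors and therefore score higher than the memory/registry pair. A trained embedding generalizes the same intuition to dense vectors.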
We used the embedding model produced for each training category to calculate the similarity between its API functions. The similarity computation produces two outputs: a goodware and a malware API similarity matrix. Each similarity matrix describes the similarity among the individual API functions in the sequences of its category.
By clustering the goodware/malware similarity matrix, we grouped API functions with contextually similar traits into a finite number of clusters. We used the k-means algorithm [40] to cluster the similarity matrix and relied on the elbow method [41] to determine the number of clusters k supplied to k-means. In our experiments, we obtained k = 10 as the optimal number of clusters for both malware and goodware API calls.
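A minimal sketch of this clustering step follows, assuming each row of the similarity matrix is treated as the feature vector of one API. The 4 × 4 matrix is hypothetical. The deterministic farthest-point initialization is a choice made here for reproducibility and is not taken from the paper. The elbow method is represented only by the list of within-cluster sums of squares (inertias), from which the bend would be read.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(rows, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [list(rows[0])]
    while len(centers) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(rows, k, max_iters=100):
    """Lloyd's k-means; returns (labels, inertia)."""
    centers = init_centers(rows, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each row goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centers[c])) for row in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(row, centers[lab]) for row, lab in zip(rows, labels))
    return labels, inertia

# Hypothetical similarity matrix for four APIs forming two obvious groups.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels, inertia = kmeans(similarity, 2)
# Elbow method: inspect how inertia falls as k grows and pick the bend.
inertias = [kmeans(similarity, k)[1] for k in (1, 2, 3)]
print(labels, inertias)
```

On this toy matrix the inertia drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the kind of bend the elbow method looks for.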
As shown in Figure 1, for any API call sequence, we search the resulting clusters for every API function in the sequence. When it is found, the function is replaced by the number of the cluster that contains it. For example, in the dataset introduced in [17], the following is an API sub-sequence of the malware sample Worm.Win32.Vobfus.agac: lstrcpyw, getthreadlocale, lstrcmpiw, globalalloc, globallock, globalunlock, globalrealloc, registerclipboardformatw, registerclipboardformata, getsystemdirectorya, isdbcsleadbyte, getversion, virtualallocex, getcommandlinea, getstartupinfoa.
According to our model, each API in the previous API sequence is searched against the clusters. The following representation denotes the cluster sequence that replaces the above API sequence: 1 1 1 1 4 4 1 6 4 7 8 1 1 1 1. The conversion of the original calling sequence into a cluster sequence is the most pivotal step. With a limited number of clusters, we can restrict the possible sequence combinations that malware can exhibit, and malware analysis therefore becomes feasible.
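The replacement step can be sketched as a dictionary lookup. The helper name `to_cluster_sequence` and the handling of unknown APIs are assumptions, and the cluster ids in the lookup table are illustrative placeholders rather than values taken from the paper's clustering output.

```python
def to_cluster_sequence(api_sequence, api_to_cluster, unknown=None):
    """Replace each API name by the id of the cluster that contains it."""
    return [api_to_cluster.get(api.lower(), unknown) for api in api_sequence]

# Illustrative lookup table (placeholder ids, not the paper's clusters).
api_to_cluster = {
    "lstrcpyw": 1,
    "getthreadlocale": 1,
    "lstrcmpiw": 1,
    "globalalloc": 1,
    "globallock": 4,
    "globalunlock": 4,
}
apis = ["lstrcpyw", "getthreadlocale", "lstrcmpiw",
        "globalalloc", "globallock", "globalunlock"]
cluster_seq = to_cluster_sequence(apis, api_to_cluster)
print(cluster_seq)
```

An API that was never seen during training maps to `unknown` here. How such calls should be treated is not specified in the text, so a real pipeline would have to decide whether to skip them or assign a fallback cluster.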

Learning Phase
We can view the clusters generated in the initialization phase (Section 3.1) as a finite collection of states $S = \{S_1, S_2, S_3, S_4, \ldots, S_n\}$. Under this new representation of the calling sequence, a process, whether malware or goodware, is expressed using a limited number of states called Markov states. A process begins at some state $S_i$ and may then move to a different state $S_j$ as a subsequent action. Depending on the input sequence, the process can change its state or remain in the same state. The process is therefore described by the generated series of states $S_{i,1}, S_{i,2}, S_{i,3}, S_{i,4}, \ldots, S_{i,k}$. The movements across states are described as transitions between states. Our model relies on a first-order Markov chain to model transition sequences, in which each state depends only on the state immediately preceding it. A Markov model with $n$ states therefore has $n^2$ transition probabilities, which can be represented as an $n \times n$ matrix.
We used maximum likelihood estimation (MLE) [42] to generate the transition probabilities, which describe the order of state transitions. Tables 1 and 2 are examples of actual cluster transition matrices that resulted from our experiments on the dataset in [17]. Table 1 gives the transition probabilities among the malware cluster states, whereas Table 2 gives the transition probabilities among the goodware cluster states. The malware and goodware cluster transition matrices form the core of our model.
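For a first-order Markov chain, the maximum likelihood estimate of a transition probability is the number of observed transitions from state i to state j divided by the number of all observed transitions leaving state i. A minimal sketch follows; the function name and the two toy cluster sequences are assumptions, and rows with no observations are left at zero, which is a choice of this sketch.

```python
from collections import Counter

def mle_transition_matrix(sequences, states):
    """Estimate P[i][j] = count(i -> j) / count(i -> anything)."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive state pairs
    matrix = {}
    for i in states:
        row_total = sum(counts[(i, j)] for j in states)
        matrix[i] = {
            j: (counts[(i, j)] / row_total if row_total else 0.0)
            for j in states
        }
    return matrix

# Hypothetical cluster sequences over two states.
training = [[1, 1, 4, 4, 1], [1, 4, 1, 1]]
P = mle_transition_matrix(training, states=[1, 4])
print(P)
```

Running the same estimator once on the malware cluster sequences and once on the goodware cluster sequences would give the two matrices that play the role of Tables 1 and 2.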
The transition sequence of a process becomes easier to interpret once the ambiguous cluster sequence is transformed into a more meaningful form. The main motivation behind this reformulation is to unveil the behavioral transitions of a given process. In other words, we require an explicit form to monitor and describe the malicious and non-malicious likelihood behaviors of given malware and goodware sequences, respectively. Throughout our model, we rely on Equation (1) to perform the required reformulation. According to Equation (1), the transition (i, j) receives a value of one if its malicious probability is greater than its non-malicious probability in the cluster transition matrices; otherwise, it receives a value of zero.

$$p(i,j) = \begin{cases} 1, & \text{if } p(\text{Malware} \mid (i,j)) > p(\text{Goodware} \mid (i,j)) \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where (i, j) refers to the move of the process from state i to state j, and p(Malware|(i, j)) and p(Goodware|(i, j)) refer to the probability of that transition in the malware and goodware cluster transition matrices, respectively. The final classification of a transition as malicious or not thus depends on which of the two matrices assigns it the larger probability. Accordingly, the transition is mapped to one when it is malicious and to zero otherwise, and the whole calling sequence is transformed into a new series of ones and zeros. For example, recall the cluster sequence generated in Section 3.1. Let us examine how our model determines whether it is malicious or not. Our model needs to determine the following transition probabilities that characterize the sequence: p(1,1), p(1,1), p(1,1), p(1,4), p(4,4), p(4,1), p(1,6), p(6,4), p(4,7), p(7,8), p(8,1), p(1,1), p(1,1), p(1,1). The probability of each transition is fetched from the malware and goodware cluster transition matrices in Tables 1 and 2, respectively. Table 3 traces the transition probabilities for the preceding sequence. Consequently, the reformulated sequence is: 1 1 1 1 1 1 1 1 1 0 1 1 1 1.
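The rule of Equation (1) can be sketched in a few lines. The two transition matrices below are hypothetical stand-ins for Tables 1 and 2, restricted to two states for brevity, and the function name `binarize` is an assumption.

```python
def binarize(cluster_seq, p_malware, p_goodware):
    """Map each transition (i, j) to 1 if it is more probable under the
    malware matrix than under the goodware matrix, else 0."""
    return [
        1 if p_malware[i][j] > p_goodware[i][j] else 0
        for i, j in zip(cluster_seq, cluster_seq[1:])
    ]

# Hypothetical cluster transition matrices (rows sum to 1).
p_malware = {1: {1: 0.7, 4: 0.3}, 4: {1: 0.4, 4: 0.6}}
p_goodware = {1: {1: 0.6, 4: 0.4}, 4: {1: 0.5, 4: 0.5}}

binary_seq = binarize([1, 1, 4, 4, 1], p_malware, p_goodware)
print(binary_seq)
```

A cluster sequence with k states yields k − 1 binary values, one per transition, which matches the 14 values listed for the 15-call example above.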
Our proposed model used the newly formulated sequences to generate generic behavioral models that characterize malicious and non-malicious sequences. Once more, we used the maximum likelihood estimation to generate transition models for malware and goodware. The learning phase finishes its work by producing two behavioral models: the malware and goodware models (Figure 2a,b, respectively).

Testing Phase
The purpose of the testing phase is to investigate the performance of our model in distinguishing new sequences. Therefore, we provided our model with an unseen test set of malware and goodware sequences. As shown in Figure 1, the testing phase first reformulates the input sequences, as in the demonstration shown in Table 3.
We examined each sequence against both malware and goodware transition matrices. The model stores the transition probability when the sequence progresses from one state into another one. Our model relies on maximum cumulative likelihood of transition probabilities to determine whether a sequence is malicious or not. Accordingly, the formulated sequence that was generated for the example in Section 3.2 will be tested against malware and goodware models as shown in Table 4.
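The decision rule can be sketched as follows, reading "cumulative likelihood" as a running sum of transition probabilities, which is how the cumulative terms are described later for the behavior monitoring heuristic. The two 2 × 2 behavioral models are hypothetical stand-ins for the models of Figure 2, and resolving a tie in favor of goodware is an assumption of this sketch.

```python
def cumulative_likelihood(binary_seq, model):
    """Sum the model's transition probabilities along the sequence."""
    return sum(model[a][b] for a, b in zip(binary_seq, binary_seq[1:]))

def classify(binary_seq, malware_model, goodware_model):
    """Label the sequence by the larger cumulative likelihood."""
    m = cumulative_likelihood(binary_seq, malware_model)
    g = cumulative_likelihood(binary_seq, goodware_model)
    return ("malware" if m > g else "goodware"), m, g

# Hypothetical behavioral models over the binary states 0 and 1.
malware_model = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.1, 1: 0.9}}
goodware_model = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.6, 1: 0.4}}

# The reformulated example sequence from the learning phase.
example = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
label, m, g = classify(example, malware_model, goodware_model)
print(label, m, g)
```

With these placeholder models the example accumulates roughly 10.7 under the malware model against 5.1 under the goodware model, so it is labeled malware. The actual values depend on the learned models.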

Results and Discussion
Throughout this section, we evaluate our model through various datasets using standard evaluation metrics. We show that our model could efficiently recognize whether a sequence of API calls leads to malicious activities or not.

Datasets
To verify our model, we gathered varieties of API call sequences from [17,43,44]. We carried out our experiments with various datasets of different sizes to observe the efficiency of our model against the size of data.

Evaluation Metrics
Our model evaluation used well-known evaluation metrics such as precision, recall, F-measure, and accuracy. We also used other evaluation metrics inspired by the confusion matrix, such as false-positive rate (FPR) and false-negative rate (FNR). These measures assess the performance quality of the classification methods.
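These metrics follow from the four cells of the confusion matrix in the usual way. The sketch below uses the standard definitions and treats malware as the positive class, which is an assumption about the paper's convention. The counts in the example are made up.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "fpr": safe_div(fp, fp + tn),  # goodware flagged as malware
        "fnr": safe_div(fn, fn + tp),  # malware missed
    }

# Made-up counts for 100 malware and 100 goodware test samples.
scores = classification_metrics(tp=99, fp=1, tn=99, fn=1)
print(scores)
```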

Malware Detection Evaluation
In our experiments, we split the data into 50% for training and 50% for testing. Throughout the training process, we implemented a modified version of the k-fold strategy called random subsampling (with replacement). It differs from k-fold in that, during each iteration, the training and testing samples are selected at random. The advantage of random subsampling (with replacement) is its flexibility in choosing the number of iterations and the sizes of the training and testing sets. Our training samples were drawn at random under the condition that no sample is duplicated within the training or testing sets.
To avoid training bias, we ran our experiments 10 times for each dataset and took the average of the results over all runs as the final evaluation measure for that dataset. The experimental results demonstrated high proficiency in detecting and discriminating unseen samples.
Our model achieves a high detection accuracy with very few false positives. Table 5 shows that our method provides an average precision, recall, F-measure, and accuracy of 0.990, an average false-positive rate of 0.010, and an average false-negative rate of 0.010.
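The evaluation protocol can be sketched as repeated random 50/50 splits followed by averaging. The sketch reflects one reading of the description: every round draws a fresh split, so a sample may reappear across rounds, while the training and testing sets within a round are disjoint. The function names, the seed, and the placeholder evaluator are assumptions.

```python
import random
from statistics import mean

def random_subsample_splits(samples, rounds=10, train_fraction=0.5, seed=0):
    """Draw `rounds` independent random train/test splits."""
    rng = random.Random(seed)
    samples = list(samples)
    n_train = int(len(samples) * train_fraction)
    splits = []
    for _ in range(rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        # Slicing one shuffled copy keeps train and test disjoint.
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

def averaged_score(samples, evaluate, rounds=10):
    """Average a per-split score over all rounds."""
    splits = random_subsample_splits(samples, rounds)
    return mean(evaluate(train, test) for train, test in splits)

splits = random_subsample_splits(range(100))
# Placeholder evaluator; a real one would train and test the model.
dummy = averaged_score(range(100), lambda train, test: len(test) / 100)
print(len(splits), dummy)
```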
We also tested our model against new, unseen test samples to confirm its validity and efficiency. The new samples contain 701 malware samples from https://github.com/duj12/cnnlstm-based-malware-document-classification and 300 goodware samples from https://github.com/leocsato/detector_mw. As described in Table 6, our proposed model showed a detection accuracy of 0.983, along with a false-positive rate of 0.034. In terms of malware detection accuracy, Table 7 shows that our proposed work outperformed peer dynamic analysis approaches that use the API call sequence. We compared our results with different approaches to demonstrate its competitiveness. Our model showed an average accuracy of 0.999, the highest among the compared approaches.

Fake Goodware Detection
Despite our model's considerable accuracy in recognizing malware, certain types of malware samples were falsely identified as goodware. When we investigated these samples, we discovered that their malicious transitions are surrounded by many non-malicious ones, so that they contain many more non-malicious transitions than malicious ones. In other words, these malware samples falsely act as goodware. Our model identifies such mimicry malware, or fake goodware, sequences by tracking their likelihood behavior.
Our experiments showed that most malware samples contain a majority of malicious transitions. Malware sequences may also include some non-malicious transitions, as long as these do not alter the overall malicious likelihood behavior. In malware mimicry, however, we noticed that the API call sequence contains a significant number of non-malicious transitions compared to malicious ones. In addition, we observed that the behavior of those fake goodware samples changes continually during progressive transitions. Therefore, our model uses behavioral inconsistency as a sign that a sequence is performing malicious activities.
In Figure 4b, we observe that, although both malicious and non-malicious behaviors are growing, they do not scale at the same rate. In other words, there is a persistent gap between the two behaviors during the progressive transitions. In contrast, the behaviors in Figure 5b converge and intersect at some point during the progressive transitions.
In our model, we use behavior monitoring (BM) as a heuristic that identifies whether a sequence retains or changes its behavior while being examined by our model. Equation (8) describes the behavioral intersection ratio, where:
• S denotes the input sequence,
• n is the total number of transitions of a given sequence,
• ∑ p(T(1 : i)) refers to the cumulative transition probabilities for the sequence up to the i-th transition in the malware and goodware models,
• the exterior summation counts the events in which the internal comparison between the two inner sums is judged to be true.
The behavior monitoring equation initially assumes that any given sequence is non-malicious until its behavior shows otherwise. It therefore continually tracks the likelihood of the sequence's transitions in the malware and goodware models simultaneously. When the sequence is malicious, as in the transition sequence in Figure 4a, the accumulated malicious likelihood is greater than the accumulated non-malicious likelihood. In other words, the difference between the two accumulated behavioral likelihoods in real malware is always positive. In the case of fake goodware, however, as in the transition sequence in Figure 5a, the difference between the two accumulated likelihoods is inconsistent and tends to be negative during progressive transitions.
Our analysis concluded that a sequence is recognized as malicious if it has a cumulative behavior-change ratio of 10% among its transitions. We tested this conclusion on the malware false positives that emerged in our experiments in Table 5. As shown in Table 8, our heuristic is capable of identifying malware mimicry sequences and recognizing them as possibly malicious, with an average detection accuracy of 0.993. The high accuracy in detecting mimicry malware adds another dimension of reliability to our model in identifying malware.
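The behavior monitoring heuristic can be sketched as follows. Since Equation (8) is only described in words here, the sketch makes explicit assumptions: the inner comparison is taken to be "cumulative malware likelihood exceeds cumulative goodware likelihood", the ratio is the fraction of transitions at which that holds, and a sequence is flagged when the ratio reaches the 10% threshold. The two behavioral models and the function names are hypothetical.

```python
from itertools import accumulate

def behavior_monitoring_ratio(binary_seq, malware_model, goodware_model):
    """Fraction of transitions at which the running malware likelihood
    exceeds the running goodware likelihood (assumed comparison)."""
    pairs = list(zip(binary_seq, binary_seq[1:]))
    if not pairs:
        return 0.0
    mal_cum = accumulate(malware_model[a][b] for a, b in pairs)
    good_cum = accumulate(goodware_model[a][b] for a, b in pairs)
    crossings = sum(1 for m, g in zip(mal_cum, good_cum) if m > g)
    return crossings / len(pairs)

def is_suspicious(binary_seq, malware_model, goodware_model, threshold=0.10):
    """Flag a sequence whose behavior-change ratio reaches the threshold."""
    ratio = behavior_monitoring_ratio(binary_seq, malware_model, goodware_model)
    return ratio >= threshold

# Hypothetical behavioral models over the binary states 0 and 1.
malware_model = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.1, 1: 0.9}}
goodware_model = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.6, 1: 0.4}}

# Starts with malicious transitions, then behaves like goodware.
mimicry = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
benign = [0] * 11
bm_ratio = behavior_monitoring_ratio(mimicry, malware_model, goodware_model)
print(bm_ratio, is_suspicious(benign, malware_model, goodware_model))
```

In this toy case the mimicry sequence ends with a larger goodware total, so a pure cumulative-likelihood comparison would call it goodware, yet its malware likelihood leads during the first three of ten transitions. That 30% ratio is what the heuristic picks up.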
Along with monitoring the sequence behavior, it is necessary to estimate how malicious the sequence is. Therefore, we made a minor change to the heuristic that monitors the sequence behavior so that it also serves as a behavior confidence factor (BCF). In this way, the sequence is given a behavioral evaluation while its behavior is assessed. Equation (9) describes how we evaluate the malicious ratio of a sequence, where:
• the numerator denotes the number of times the inner comparison evaluates to false,
• T(S) denotes the total number of transitions in the sequence S.
To clarify how a behavioral sequence is assessed, we again use the transition sequences in Figures 4a and 5a. The sequence in Figure 4a contains 469 transitions (423 malicious and 46 non-malicious). According to Equation (9), the behavior confidence factor is 423/469 ≈ 0.90, which means that the sequence is malicious with a confidence factor of about 90%. The transition sequence in Figure 5a, however, contains 562 transitions (232 malicious and 330 non-malicious). Accordingly, that sequence is malicious with a confidence factor of 41%.
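As a quick check of the arithmetic, the confidence factor can be computed as the share of transitions marked malicious in the binary sequence. This reading of Equation (9) and the function name are assumptions. The sequences below only reproduce the transition counts quoted for Figures 4a and 5a, not their actual order.

```python
def behavior_confidence_factor(binary_seq):
    """Share of transitions marked malicious (1) in the binary sequence."""
    return sum(binary_seq) / len(binary_seq) if binary_seq else 0.0

# Counts quoted in the text; the ordering of transitions is irrelevant here.
figure_4a = [1] * 423 + [0] * 46    # 469 transitions
figure_5a = [1] * 232 + [0] * 330   # 562 transitions

bcf_4a = behavior_confidence_factor(figure_4a)
bcf_5a = behavior_confidence_factor(figure_5a)
print(round(bcf_4a, 3), round(bcf_5a, 3))
```

With the quoted counts the two ratios evaluate to roughly 0.90 and 0.41.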
Even though the confidence factor for the sequence in Figure 5a is indecisive, the behavior monitoring value compensates for the shortfall that may arise when relying only on the behavior confidence factor. Therefore, any sequence can be classified through the behavior monitoring heuristic and assigned a confidence score through the confidence factor equation.
Table 8. Malware false positive detection evaluation.

Conclusions
In this paper, we showed that a contextual understanding of the API call sequence enhances malware detection accuracy. Our proposed model employs word embedding to capture the latent contextual relations among individual APIs, and we created an API embedding model for Windows APIs. By clustering contextually related APIs, our model overcomes the infeasibility of tracking the huge number of individual API functions. Consequently, the API call sequence of any process can be represented using a finite number of clusters. Our paper proposed generic behavioral models for malware and goodware, and our experiments demonstrated the high accuracy that the model achieves. We also proposed a heuristic that detects mimicry malware sequences. The comparisons with peer approaches indicate that our empirical model is promising.