Attention-Based Automated Feature Extraction for Malware Analysis

Every day, hundreds of thousands of malicious files are created to exploit zero-day vulnerabilities. Existing pattern-based antivirus solutions face difficulties in coping with such a large number of new malicious files. To solve this problem, artificial intelligence (AI)-based malicious file detection methods have been proposed. However, even if we can detect malicious files with high accuracy using deep learning, it is difficult to identify why files are malicious. In this study, we propose a malicious file feature extraction method based on attention mechanism. First, by adapting the attention mechanism, we can identify application program interface (API) system calls that are more important than others for determining whether a file is malicious. Second, we confirm that this approach yields an accuracy that is approximately 12% and 5% higher than a conventional AI-based detection model using convolutional neural networks and skip-connected long short-term memory-based detection model, respectively.


Introduction
Every day, hundreds of thousands of malicious files are created to exploit zero-day vulnerabilities [1,2] For this reason, the existing pattern-based antivirus solutions face difficulties in responding to new malicious files [3]. Traditional pattern-based antivirus solutions determine whether files are malicious by evaluating their hash values, their string content, or their behavior. However, new malicious files are being designed to avoid detection by existing antivirus solutions. Therefore, malware detection requries an alternative solution.
Methods that use artifical intelligence (AI) for detecting malicious files have been recently proposed [3][4][5][6][7][8][9][10]. AI has been widely employed for image recognition and machine translation [11,12]. Machine learning and deep learning techniques are widely utilized in AI applications. An advantage of AI is that it can effectively identify data that are similar to the training data. That is, even if there are no patterns for a new malicious file, it is possible to determine whether it is malicious [13]. Kaspersky Internet Security that uses machine learning has improved accuracy.
If we add the AI-based malware detection method to the existing anti-virus software, we can increase the malware detection rate because we can detect malware that is not found by the existing anti-virus software. It is known that, when using machine learning, the detection rate is higher and the false positive rate is less than that of pattern-based detection [13]. When a new malware is exposed to pattern-based antivirus software, it cannot establish whether it is malicious because the pattern is not recognized. However, the AI-based malware detector determines whether it is malicious with high probability. To extract feature data from malicious files, we can use static [4,5,[14][15][16][17] and dynamic analyses [6][7][8] Static analysis involves the extraction of features such as strings and opcodes, and dynamic analysis extracts features such as application program interface (API) system calls by executing the malware.
Thereafter, by using extracted features with an AI model such as a convolutional neural network (CNN)-based model or long short-term memory (LSTM)-based model, we can determine whether the file is malicious or not. However, in the AI model, we do not know why the file is malicious. Therefore, we must determine which features are more significant than others in a malicious file.
In this study, we propose a malicious file feature extraction method based on attention mechanisms [18]. An attention mechanism calculates the weight of each input value of the deep learning model, in order to determine which has a greater impact on the result. Thus, using this mechanism, we can determine which features are more significant than others. To the best of our knowledge, this study is the first to apply the attention mechanism to malicious file feature extraction. Furthermore, our experimental results indicate that this method increases the accuracy of the AI-based malicious file detection system and determines in reasonable time whether the file is malicious.
The remainder of this paper is organized as follows: Related research is discussed in Section 2. Basic malicious file feature extraction method and basic deep learning techniques for malicious file detection are presented in Section 3. A new method for malicious file feature extraction is detailed in Section 4. In Section 5, we demonstrate that the proposed method outperforms existing methods. Finally, we provide a discussion and conclusions in Sections 6 and 7, respectively.

Related Work
Methods of analyzing malicious files can be categorized into static and dynamic analysis methods. Static analysis methods judge whether a file is malicious by analyzing strings [14], import tables [15], byte n-grams [16], and opcodes [16]. Static analysis methods can analyze malicious files relatively quickly, although they face difficulties analyzing files when they are obfuscated or packed [17]. With dynamic analysis methods, malicious files are analyzed by running them. However, these methods have the disadvantage that malicious files may detect the virtual environment and not operate in it [4]. It may also be difficult to collect complete malicious behavior because malicious files may only run in certain circumstances.
Recently, there have been many studies on malicious file analysis using AI. The methods in [4,5] utilize a static analysis to analyze malicious files. The method in [4] extracts byte histogram features, import features, string histogram features, and metadata features from a file and trains deep learning models based on deep neural networks. The method in [5] learns a convolutional neural network (CNN)-based deep learning model by extracting opcode sequences.
The approaches in [6][7][8] utilize a dynamic analysis method to analyze a malicious file, and execute it to collect the API system call and extract a subsequence of a certain length from the system call sequence using a random projection. The method in [6] utilizes a deep neural network-based deep learning model, and that in [7] employs a deep learning model based on a recurrent neural network (RNN). The authors of [8] proposed a deep learning model that simultaneously performs malicious file detection and malicious file classification. In the present study, we employ both static and dynamic analyses. Furthermore, a novel property is that we propose automatic feature extraction based on the attention mechanism.
Even if malicious files can be detected with high accuracy using deep learning, we do not know which API system calls are more important than others. Recently, the attention mechanism was proposed in the neural machine translation community to provide sentence summarization [18]. To our best knowledge, this study represents the first attempt to analyze and detect malicious files by adapting the attention mechanism. By utilizing the attention mechanism, we can determine which API system calls are more important in malicious files, and our approach yields a higher accuracy than existing malware detection models.
Here, we focus on malicious Portable Executable (PE) files, but other types of malicious files also exist. These include malicious PDF files [19] and PowerShell scripts [20,21]. Some researchers have attempted to determine whether PDF files and PowerShell scripts are malicious using AI. However, such approaches are outside the scope of the present study.

Deep Learning Based Malware Detection
Deep learning-based malware detection requires two modules. The first is feature extraction and the second is a deep learning model. In Section 3.1, we introduce two feature extraction methods. In Section 3.2, we present three deep learning models.

Feature Extraction for Malware Detection
In this section, we provide two feature extraction methods. The first is static analysis-based feature extraction and the second is dynamic analysis-based feature extraction.

Feature Extraction Using Static Analysis
Malicious file feature data are required for malicious file detection using deep learning. To this end, we can extract malicious file feature data using static or dynamic analyses. First, for a static analysis, we extract the assembly code using objdump [22], as shown in Figure 2. Next, we extract the opcode sequences such as add, inc, and add. We can then construct a trigram sequence [23] for three consecutive opcodes. The trigram sequence is created as follows: (add, inc, add), (inc, add, xchg), (add, xchg, inc), . . .
The reason for creating trigram sequences is that there are approximately 100 opcodes, and when these are placed into trigrams, the size of the trigram domain is approximately 100 3 , such that the trigram sequence of each file can easily be distinguished from those of other files.

Feature Extraction Using Dynamic Analysis
Next, we employ the Cuckoo Sandbox [24] to extract the API system calls of the portable executable (PE) file. An example of an API system call list is presented in Table 1. We first run the Cuckoo Sandbox, and then extract the API system call sequence. In addition, we extract the trigrams for three consecutive API system calls to create a trigram sequence.

DL-Based Malware Detection Model
In this section, we introduce three deep learning models for malware detection. The first is recurrent neural network, the second is LSTM, and the third is skip-connected LSTM (SC-LSTM).

Recurrent Neural Network (RNN)
Here, we introduce a basic deep learning model used for malicious file detection. RNNs are mainly employed to process sequential information, such as language translation [25]. Figure 3 illustrates the structure of a typical RNN used to translate German sentences into English. However, RNNs have the disadvantage that the longer the input sentence, the smaller the influence of the preceding words. This is known as the vanishing gradient problem [26].

Long Short Term Momory (LSTM)
LSTM [27] was proposed to solve the aforementioned vanishing gradient problem [26]. LSTM has three gates, as shown in Figure 4. The gates are a way to optionally let information through. They are composed of a sigmoid neural net layer σ and a pointwise multiplication operaion ×. The first is the forget gate f t , which determines whether the state C t−1 of the previous cell is reflected as follows: where x t is an input, h t−1 is the output of the previous cell, W f is a weight, and b f is a bias. The second is the input gate i t , which determines how much the state of the previous cell will be updated as follows: whereC t is a new candidate value. Then, the state C t of the next cell is calculated as follows: The third is the output gate o t . The output h t of the current cell is calculated as follows: In LSTM, the state C t−1 of the previous cell has the possibility to be changed less than in an RNN. Therefore, LSTM has the advantage that the initial state of a cell can be better reflected.

Skip-Connected LSTM
A residual network is a technology that solves the problem that learning and evaluation errors do not decrease in the image recognition field using a CNN model, even when the depth of the network is increased. This technique adds a hidden input layer followed by some steps to present inputs without weight calculations [11].
To improve the information flow of the LSTM network while maintaining the advantages of the gate mechanism, it has been proposed to introduce skip connections between two LSTM hidden states [12]. The result is called SC-LSTM. The structure of this model is illustrated in Figure 5. SC-LSTM has the advantage that the dependencies of long sequence information are better captured by skip connections. The next hidden state is calculated as follows:

Automated Feature Extraction Based on Attention
In this section, we propose an automated feature extraction method based on the attention mechanism. First, in Section 4.1, we introduce the attention mechanism. Subsequently, in Section 4.2, we present an attention-based feature extraction method for malware detection.

Attention Mechanism
Attention is a deep learning mechanism that looks for the parts of sequence data with greater impacts on the results. A typical example of attention is text summarization [18] that involves summarizing a given text. Using the attention mechanism, we can identify some of the main words to summarize an article when it is given as a sequence of words. For example, the text shown in Figure 6 is summarized as follows: russia calls f or joint f ront against terrorism (8) Figure 6. Alignment of text summarization [18].
The RNN model is utilized in neural machine translation (NMT). For example, German sentences can be translated into English. NMT encodes the source sentence into a vector and decodes the sentence based on that vector.
The attention mechanism allows the decoder to refer to a portion of the source sentence [25], as shown in Figure 7 [25]. Here, X is the source sentence and y is the translated sentence generated by the decoder. Figure 7 depicts a bidirectional RNN. The important point is that the output word y t depends on the weight combination of all input states. Here, α is a weight that defines how strongly each input state influences each output state. If α (3,2) is large, it means that the third word in the output sentence refers to the second word in the input sentence.
In the attention model, the hidden state S t is defined as follows: Furthermore, the context vector c t is defined as follows: Weight α i,j is computed as In addition, e i,j is computed as follows: Here, e i,j is an alignment model indicating how well the input j and output i are matched. The advantage of attention is that it provides the ability to interpret and visualize what the model does. For example, by visualizing the attention weight matrix when a sentence is translated, we can understand how the model performs translation. As shown in the text summarization above, we can observe how strongly each word in the output summary statement refers to each word in the input sentence.

Feature Extraction Based on Attention Mechanism
In this section, we propose an automatic feature extraction method based on the attention mechanism. The idea behind this method is as follows: When we utilize the attention model to identify the weight of each API system call in a sequence of length n, we find that the malicious file detection rate increases when we consider subsequences of length k by extracting weighted words for malicious file detection.
For example, the API sequence for a malicious file might may appear as follows:

{LoadLibrary, strcmp, VirtualProtect}
This is the API sequence pattern that appears in import address table-hooking malicious files [9]. We expect the detection rate of malicious files to increase using subsequence data with the top three weights through the attention model, rather than using the full sequence of length 10.
The purpose of the automatic feature extraction method is as follows: First, significant subsequences of length k are extracted from a data sequence of length n extracted from a malicious file, to enhance malicious file detection performance through deep learning.
After extracting significant data subsequences, malicious file analysts can analyze malicious files more easily. That is, malicious file analysts can analyze malicious files by analyzing significant subsequences of data, rather than examining the entire data sequence.
The automatic feature extraction method based on the attention mechanism is described as follows: First, when files are provided, we extract features such as API system call sequences using Cuckoo Sandbox. Note that we can extract features such as opcode sequences by static analysis instead of dynamic analyses such as Cuckoo Sandbox. Second, we construct a trigram sequence {S 1 , S 2 , · · · , S n } for three consecutive data. Third, we train the LSTM model based on the attention mechanism, and calculate the weight W i of each data point S i in the data sequence {S 1 , S 2 , · · · , S n }, as depicted in Figure 8: {(S 1 , W 1 ), (S 2 , W 2 ), · · · , (S n , W n )} (13) Next, we extract the data subsequence with the k highest weights, as follows: Finally, this subsequence is utilized as data for the deep learning model for malicious file detection. As depicted in Figure 8, we obtained the trained deep learning model using the subsequence. When a test file is given, we can determine whether the file is malicious in the testing module using the trained deep learning model. We demonstrate our proposed method in the next section.

Setup
All experiments were performed on the Ubuntu operating system, and the models used in the experiments were implemented using Theano [28]. The detailed experimental conditions are described in Table 2.

Data
The files utilized in the experiments were PE files collected from HAURI [29]. We used 1000 normal and 1000 malicious files. Table 3 presents the types of malware: trojan, win32, backdoor, worm, dropper, malware, and virus. These were classified using Ahnlab V3 [30]. Note that the purpose of this study is only to determine whether the files are malicious and not to classify the type of malware. Different malware families may produce different detection results. There are other studies on malware classification [31], but this is beyond the scope of this research. However, we will consider the topic in future work. We extracted the API system call sequences from these PE files using the Cuckoo Sandbox. These consisted of API unigram sequences. We then converted the three consecutive API system calls into one trigram to create API trigram sequences, as depicted in Figures 9 and 10. Note that some PE files did not execute well in the Cuckoo Sandbox; therefore, we removed the API sequences of length less than 5. Finally, we used 690 normal files and 785 malicious files in the experiments. Each line in Figures 9 and 10 provide the following information: {File Name, PID (dummy), PPID (dummy), sequence length, Normal (0) or Malicious (1), API Trigram Sequence} Note that the length of the API unigram sequence was set to 800. Because we converted the three consecutive API system calls into one API trigram, the length of the API trigram sequence is 798. When the length is less than 798, it is padded with zeros.  Subsequently, we used five-fold cross validation [32] as depicted in Figure 11. Cross-validation is a statistical method used to estimate the skill of deep learning models. We randomly divided the dataset into five subsets. In the first experiment, the first four subsets were used for the training and the fifth subset was used for the test. In the fifth experiment, the first subset was used for the test and the other four subsets were used for the training. The first experiment was performed to demonstrate the effect of attention. To this end, we compared the accuracy of the attention-based model with those of the CNN-based [5] and SC-LSTMbased models [10]. Note that we implemented the CNN-based model using Keras [33] instead of Theano [28]. The second experiment was conducted to measure the time required to train and test the attention-based model which was compared with the time required to train and test the CNN-based and SC-LSTM-based models.

Performance Metric
In this section, performance evaluation indicators are described, before presenting the experimental results. The indicators utilized in this study are accuracy, true positive rate (TPR), and false positive rate (FPR). The confusion matrix used to calculate these values is shown in Table 4. TP indicates that a file has been correctly evaluated by the system to be malicious, and TN indicates that the system correctly determined that a benign file is normal. Furthermore, FP indicates that a normal file has been incorrectly judged by the system as malicious, and FN indicates that the system has incorrectly identified a malicious file to be normal. Each indicator is calculated as follows: Accuracy refers to the rate at which the system correctly determines both malicious and normal files. The higher the value, the better the performance. TPR refers to the rate at which the system correctly identifies malicious files as malicious. The higher the value, the better the detection performance. FPR refers to the rate at which the system identifies normal files as malicious, and the lower the value, the better the detection performance. All experiments were performed using 5-fold cross validation.

Accuracy
The first experiment was conducted to demonstrate the effect of the attention mechanism for malware detection. The accuracy of the attention-based detection model was compared with those of the CNN-based and SC-LSTM-based models, as shown in Figure 12. The attention-based model yielded an accuracy that was approximately 12% and 5% higher than those of the CNN-based and SC-LSTM-based models, respectively. Table 5 shows the numbers of true positives, false positives, false negatives, and true negatives for each model.  Note that, when the length of the sequence was 800, the accuracy of the attention-based model was the same as that of the SC-LSTM-based model because the length of the original sequence was limited to 800. This indicated that, when the length was 800, the input sequence of the attention-based model was the same as that of the SC-LSTM-based model. In this case, the attention-based model did not have any effect. When the length of the sequence was less than 800, the attention-based model extracted subsequences whose API system calls were more important. Thus, the attention-based model yielded a higher accuracy than the others.

Training Time
The second experiment measured the computational time of the attention-based detection model. This was compared with the computational times of the CNN-based and SC-LSTM-based models, as shown in Figure 13. The computational time of the attention-based model consisted of the training time of the attention-based model, whose input sequence length was 800, time required to extract the top-k API system calls, and time needed to train the SC-LSTM-based model whose input sequence length was k.
Note that, in the attention-based detection model, we first computed the weights of each API system call in the input sequence. Then, we extracted the top-k API system calls and created a subsequence. Next, we trained the SC-LSTM-model using the extracted subsequence.
Because we implemented the CNN-based model using Keras [33] instead of Theano [28], the computational time for the CNN-based model was longer than that for the SC-LSTM-based model. Training the attention-based model using an input sequence of length 800 required approximately 400 s. Therefore, the computational time of the attention-based detection model was longer than that of the SC-LSTM-based model.
However, the attention-based detection model yielded a higher accuracy. In addition, by using the attention-based model, we can identify which API system calls are more important to determine whether a file is malicious.

Test Time
The third experiment measured the time required to test the attention-based detection model using a GPU. Once the system call information is available, the time taken to establish whether the 280 test files are malicious in the deep learning model can be determined. Note that the test time does not include the time required to extract the API system call sequence using Cuckoo Sandbox. To extract the API system call sequence using Cuckoo Sandbox, we executed each file for 3 min. However, modern anti-virus software contains a module to extract API system calls in real time [30]. Therefore, we believe that malware can be detected using the attention-based detection model in real time.
The time to test the attention-based detection model was compared with what is required to test the CNN-based and SC-LSTM-based models, as depicted in Figure 14. The time required to test the attention-based model is the time required to determine whether the file is malicious. The number of test files was 280. Because we implemented the CNN-based model using Keras [33] instead of Theano [28], the test time for the CNN-based model was longer than that for the SC-LSTM-based model. Because the attention-based model uses the same sized neural network as the SC-LSTM-based model, the test time for the attention-based model was the same as that for the SC-LSTM-based model, and it was proportional to the length of the API trigram sequence. When there were 280 test files and the length of the API trigram sequence was 800, it required approximately 180 ms. We believe that this is reasonable.
In addition, we measured the time required to test the attention-based detection model using only the CPU. This was compared with the time required to test the CNN-based and SC-LSTM-based models using only the CPU, as presented in Figure 15. The time required to test the attention-based model is the time required to determine whether the file is malicious using only the CPU.
When only the CPU was used, the time required to test the attention-based detection model was significantly longer than when the GPU was used. This implies that we can reduce the time required to test by using the GPU because it supports parallel computing in the deep learning model. Moreover, when using only the CPU, the time required to test the CNN-based model is proportional to the length of the sequence. It appears that, because the CPU does not support parallel computing in the deep learning model, the test takes a significantly longer time. Furthermore, we give the memory overhead of each deep learning model as depicted in Figure 16. Because the CNN-based model was implemented using Keras [33] and it is more complex than the the SC-LSTM based and the attention-based models, its memory overhead is much larger than the others. We believe that the attention-based model is suitable for devices with less memory.

Discussion
In this study, we proposed an attention-based detection model. It has approximately 12% and 5% higher accuracy than the CNN-based and SC-LSTM-based models. However, the attention-based model requires a longer training time than the CNN-based and SC-LSTM-based models since it must extract a data subsequece of length k as shown in Figure 13. However, when we detect malware with anti-virus software, the training time is not important because we can use the pre-trained neural network by the attention-based model. Because the test time of the attention-based model is less than 1 ms per file as shown in Figure 14, it is suitable for use in real anti-virus software.

Conclusions
In this study, we proposed an attention-based feature extraction method, and verified the performance of the deep learning-based malicious file detection model utilizing this method. The deep learning-based malicious file detection model using an attention-based feature extraction method yields an accuracy that is approximately 12% and 5% higher than those of a CNN-based model and SC-LSTM model, respectively.
In addition, the attention-based feature extraction method allows malicious code analysts to only analyze parts of malicious code based on the features extracted by the attention-based feature extraction method, rather than analyzing the entire malicious code. This is expected to considerably reduce the efforts required by malicious code analysts.

Conflicts of Interest:
The authors declare no conflict of interest.