1. Introduction
Every day, hundreds of thousands of new malicious files are created [1]. Existing pattern-based antivirus solutions have difficulty detecting these new malicious files because no detection patterns for them exist yet [2]. To solve this problem, artificial intelligence (AI)–based malware detection methods have been investigated [2,3,4,5,6,7,8,9].
AI-based malware detection consists of two phases: in the first phase, features are extracted from malicious files; in the second phase, a deep-learning model is trained using training data and tested using test data. Two methods are used to extract features from malicious files. The first is a static method that extracts opcodes from the files [3,4]. The second is a dynamic method that obtains application programming interface (API) system-call sequences by running the files in a sandbox [5,6,7].
Deep-learning models for malware detection include convolutional neural network (CNN)-based models, long short-term memory (LSTM)-based models, and models that use a CNN and an LSTM together [9,10]. Deep-learning models achieve a high detection rate; however, they take a long time to determine whether or not a file is malicious.
Other studies have investigated detecting malicious files using a similarity hash [7,11,12]. Trend Micro proposed Trend Micro locality-sensitive hashing (TLSH), which is a type of similarity hash. If a single bit in a file is changed, its cryptographic hash, e.g., MD5, is completely different from the original hash, whereas its similarity hash remains quite similar to the original. Thus, when two similarity hashes are similar, the two files are similar. Similarity hash–based malware detection is faster than deep learning–based malware detection, but its detection rate is lower.
In this paper, we propose two methods to solve these problems, as shown in Figure 1: k-nearest-neighbor (kNN) classification for malware detection and a vantage point (VP) tree using a similarity hash. The kNN classification is used to increase the detection rate of similarity hash–based malware detection and to increase the speed of deep learning–based malware detection. To this end, we classify the training data into three groups. In the normal group, each file and its kNN are normal files. In the malicious group, each file and its kNN are malicious files. The remaining files form the undecided group: if a file in the undecided group is a normal file, its kNN is a malicious file, and if a file in the undecided group is a malicious file, its kNN is a normal file.
If the kNN of a new file is in the normal group, we determine that the file is normal; if its kNN is in the malicious group, we determine that it is malicious. In both cases, the decision uses only the similarity hash, so the detection time is reduced compared with using only the deep learning–based method. However, if its kNN is in the undecided group, we determine whether the new file is malicious by using deep learning–based malware detection, which increases the detection rate compared with using only the similarity hash.
The second proposed method is a VP tree using a similarity hash. Even though similarity hash–based detection is faster than the deep learning–based method, it still takes a long time when there are many malicious files; at present, there are about one billion malicious files [1]. Therefore, we need to increase the speed of similarity-hash searches, which a VP tree can accomplish.
The proposed system works as follows. Given the training files, we compute their similarity hashes, e.g., TLSH. Then, we generate a VP tree from the similarity hashes and conduct kNN classification, which classifies the training data into a normal group, a malicious group, and an undecided group. In addition, we extract features and train the deep-learning model for malware detection.
When a test file is provided, we compute its similarity hash and search for its kNN using the VP tree. If the kNN is in the normal group or the malicious group, the file is determined to be normal or malicious, respectively. Otherwise, we extract the features from the test file and determine whether it is malicious using the deep-learning model. Thus, the kNN classification reduces the detection time compared with using only the deep-learning model, and the deep-learning fallback increases the detection rate compared with using only the similarity hash.
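A minimal sketch of this decision flow is shown below; the functions find_nn_group and deep_model_predict are hypothetical placeholders for the similarity hash–based kNN search and the deep-learning model described in Section 4.

```python
# Hypothetical sketch of the hybrid decision flow described above.
# find_nn_group() and deep_model_predict() are placeholders for the
# similarity-hash kNN search and the deep-learning model of Section 4.

def detect(test_file, find_nn_group, deep_model_predict):
    """Return 'normal' or 'malicious' for test_file."""
    group = find_nn_group(test_file)      # group of the file's nearest neighbor
    if group == "normal":
        return "normal"                   # fast path: similarity hash only
    if group == "malicious":
        return "malicious"                # fast path: similarity hash only
    # the nearest neighbor is in the undecided group: fall back to deep learning
    return deep_model_predict(test_file)
```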
The contributions of this paper are as follows. First, by proposing kNN classification for malware detection, we increase the malware detection rate compared with similarity hash–based malware detection and reduce the detection time compared with deep learning–based malware detection. Second, by providing a VP tree, we reduce the search time of the similarity hash. Third, through experiments on the kNN classification and the VP tree, we show that the malware detection rate is increased by 25%, the detection time is reduced by 67%, and the similarity-hash search time is decreased by 20%.
This paper is organized as follows. In Section 2, we review related work. In Section 3, we introduce TLSH and the VP tree. In Section 4, we present kNN classification for malware detection and a VP tree using a similarity hash. In Section 5, we provide the experimental results. Finally, in Section 6, we conclude the paper.
2. Related Work
The methods used for analyzing malicious files can be categorized into static and dynamic analysis methods. Static analysis methods judge whether a file is malicious by analyzing strings, import tables, byte n-grams, and opcodes [13]. These methods can analyze malicious files relatively quickly, although they have difficulty when the files are obfuscated or packed [14]. In contrast, dynamic analysis methods analyze malicious files by running them. However, the disadvantage of these methods is that a malicious file may detect the virtual environment and not operate in it [3]. It may also be difficult to observe the complete malicious behavior because malicious files may only run under specific circumstances.
Recently, there have been many studies on malicious file analysis using AI. The methods in [3,4] use static analysis to analyze malicious files. The method in [3] extracts byte histogram features, import features, string histogram features, and metadata features from a file and trains a deep neural network–based model. The method in [4] trains a CNN-based deep learning model [15] on extracted opcode sequences.
The approaches in [5,6,7] use dynamic analysis: they execute the file to collect API system calls and extract subsequences of a certain length from the system-call sequence using a random projection. The method in [5] uses a deep neural network–based deep learning model, and that in [6] uses a deep learning model based on a recurrent neural network (RNN) [16]. The authors of [7] proposed a deep learning model that simultaneously performs malicious file detection and malicious file classification. The disadvantage of using deep learning for malware detection is that, even though it achieves a high detection rate, it is time-consuming. To solve this problem, we propose a kNN classification method for malware detection.
We focus on malicious portable executable (PE) files, but other types of malicious files also exist, such as malicious PDF files [17] and PowerShell scripts [18,19]. Some researchers have attempted to use AI to determine whether PDF files and PowerShell scripts are malicious; however, such approaches are outside the scope of the present study.
AI-based malware detection methods show good detection rates when a large volume of labeled data is available for supervised learning. However, a lack of data can cause severe overfitting, and collecting large quantities of malware is not easy. To solve this problem, few-shot learning approaches have been proposed [20,21]. With few-shot learning, a model can learn general, high-level knowledge of malware from a few samples and adapt to classes unseen during training. Few-shot learning is useful in the absence of a large volume of data; however, in this study, we assume that there are enough data to train the deep learning model for malware detection.
On the other hand, there has been research on evading malware detection using generative adversarial networks (GANs) [22,23,24,25]. These approaches use a generator to modify malware so that it is classified as normal, and the discriminator then cannot determine whether it is malicious. To counter this problem, a variety of malware detection methods is needed in addition to AI-based malware detection.
In addition to AI-based malware detection, there has been research on AI-based intrusion detection [26,27,28]. In [26], the proposed intrusion detection system (IDS) combines data analytics and statistical techniques with machine learning techniques to extract more optimized features. In [27], a resampling method is proposed to address the overfitting that can occur when the number of abnormal connections is small. In [28], IDSGAN is proposed to evade intrusion detection. However, intrusion detection is outside the scope of the present study.
Research on similarity hash–based malware detection has also been conducted [7,11,12]. Trend Micro proposed the Trend Micro locality-sensitive hash (TLSH) [11], which can be used to determine whether two files are similar. Unfortunately, fuzzy hash algorithms produce different similarity hashes when the input order differs. To solve this problem, Li et al. [12] proposed malware clustering based on a new distance-computation algorithm. Huang et al. [7] extracted API system calls by running malicious files in the Cuckoo sandbox, computed similarity hashes using TLSH, and then classified the type of each malicious file by generating a distance matrix. However, the disadvantage of a similarity hash is its low detection rate. In this paper, we propose a method that increases the detection rate using kNN classification.
3. Preliminaries
In this section, we introduce two basic methods. The first is TLSH, which is a type of similarity hash. The second is the VP tree, which rapidly finds the kNN.
3.1. TLSH
The purpose of a similarity hash is to find variants of a file. Hundreds of thousands of variant malicious files are created every day, which makes it difficult to detect them using existing pattern-based antivirus solutions. A similarity hash can help to solve this problem: when existing files are known to be normal or malicious, a new file whose similarity-hash value is close to that of an existing file can be judged to be similar to that file.
The properties of the similarity hash are as follows. A cryptographic hash, e.g., MD5, produces a completely different hash value if the file is only slightly different. However, for similarity hashes, similar files have similar hash values and different files have different similarity-hash values.
TLSH is a typical example of a similarity hash [11]. A TLSH digest is created as follows. First, the byte string of the file is processed using a five-byte sliding window to compute the bucket-count array. Second, the quartile points q1, q2, and q3 are calculated. Third, the digest header values are constructed. Fourth, the digest body is constructed by processing the bucket array.
First, as shown in Figure 2, the byte string of the file is counted into a 128-dimensional bucket array using a Pearson hash [29]. That is, the value computed by the Pearson hash over each five-byte window is mapped to one of the 128 buckets, and that bucket's count is incremented by one.
Second, the quartile points are calculated: 75% of the bucket counts are greater than or equal to q1, 50% of the bucket counts are greater than or equal to q2, and 25% of the bucket counts are greater than or equal to q3.
Third, as shown in Figure 2, the first of the three header bytes of the similarity hash is a checksum, the second byte encodes the length of the file, and the third byte is calculated from the quartile points q1, q2, and q3 as the two quartile ratios, (q1 × 100 / q3) mod 16 and (q2 × 100 / q3) mod 16.
Finally, if the value of a bucket is less than or equal to q1, the 2-bit code 00 is entered into the digest body; if the value is less than or equal to q2, 01 is entered; if it is less than or equal to q3, 10 is entered; otherwise, 11 is entered. The TLSH is created through these four steps.
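The following Python sketch illustrates these four steps in simplified form. It is an assumption-laden illustration, not the official TLSH implementation: the real algorithm hashes several byte triplets selected from each window and uses a fixed Pearson permutation table and its own checksum and length encoding, whereas the constants and choices below are placeholders.

```python
# Simplified, illustrative sketch of how a TLSH-like digest is built.
# The real TLSH selects several byte triplets from each 5-byte window and
# uses its own Pearson table, checksum, and length encoding; the constants
# below are simplified assumptions, not the official specification.
import random

random.seed(0)
PEARSON_TABLE = list(range(256))
random.shuffle(PEARSON_TABLE)               # stand-in for TLSH's fixed table

def pearson_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = PEARSON_TABLE[h ^ b]
    return h

def build_digest(data: bytes):
    # Step 1: slide a 5-byte window and count into 128 buckets.
    buckets = [0] * 128
    for i in range(len(data) - 4):
        buckets[pearson_hash(data[i:i + 5]) % 128] += 1

    # Step 2: quartile points q1, q2, q3 of the bucket counts
    # (75%, 50%, and 25% of the counts are >= q1, q2, q3, respectively).
    s = sorted(buckets)
    q1, q2, q3 = s[len(s) // 4], s[len(s) // 2], s[3 * len(s) // 4]

    # Step 3: header = checksum, encoded length, and quartile ratios.
    header = (pearson_hash(data[:1]),            # toy checksum
              len(data) % 256,                   # toy length byte
              ((q1 * 100 // max(q3, 1)) % 16,    # q1 ratio
               (q2 * 100 // max(q3, 1)) % 16))   # q2 ratio

    # Step 4: body = one 2-bit code per bucket, derived from the quartiles.
    body = []
    for b in buckets:
        if b <= q1:
            body.append(0b00)
        elif b <= q2:
            body.append(0b01)
        elif b <= q3:
            body.append(0b10)
        else:
            body.append(0b11)
    return header, body
```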
The similarity difference between two files is calculated from their two similarity hashes in two parts. First, the difference between the similarity-hash header values is calculated by comparing the checksum, length, and quartile-ratio bytes. Second, the difference between the similarity-hash body values is calculated by comparing the 2-bit codes of the corresponding buckets. Finally, the difference between the two similarity-hash values is calculated as the sum of the difference between the similarity-hash header values and the difference between the similarity-hash body values.
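A simplified sketch of this comparison, operating on the digests produced by build_digest() above, is shown below. The weights are illustrative assumptions; the real TLSH difference uses modular distances and extra penalties for large per-bucket differences.

```python
# Simplified sketch of comparing two TLSH-like digests produced by
# build_digest() above. The real TLSH difference uses modular distances and
# extra penalties; the equal weights here are illustrative assumptions.

def header_diff(h1, h2):
    """Difference of the checksum, length, and quartile-ratio parts."""
    checksum1, len1, (q1r1, q2r1) = h1
    checksum2, len2, (q1r2, q2r2) = h2
    d = 0 if checksum1 == checksum2 else 1
    d += abs(len1 - len2)
    d += abs(q1r1 - q1r2) + abs(q2r1 - q2r2)
    return d

def body_diff(b1, b2):
    """Sum of the differences between corresponding 2-bit bucket codes."""
    return sum(abs(x - y) for x, y in zip(b1, b2))

def digest_diff(d1, d2):
    """Total difference: header difference plus body difference."""
    return header_diff(d1[0], d2[0]) + body_diff(d1[1], d2[1])
```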
3.2. VP Tree
The purpose of a VP tree [30] is to quickly find the kNN. The first step is the VP-tree generation and the second is the VP-tree search.
The VP tree is generated as follows. When there are many points, we select a vantage point (VP) randomly. Then, we compute the distances between the vantage point and the other points. We set the radius of the vantage point to the median of the distances. Then, we classify the points into two groups: inner and outer. The distance between the vantage point and a point in the inner group is less than the radius of the vantage point. The distance between the vantage point and a point in the outer group is greater than the radius of the vantage point. Then, the points in the inner group are assigned to the left subtree of the vantage point and the points in the outer group are assigned to the right subtree. Then, we recursively repeat this process in the subtree.
For example, as shown in Figure 3, we select one point as the root vantage point, compute the distances between it and the other points, and set its radius to the median of those distances. We then classify the remaining points into the inner and outer groups. As shown in Figure 4, the points in the inner group are assigned to the left subtree of the root vantage point and the points in the outer group are assigned to its right subtree.
Next, we repeat this process recursively in each subtree. In the left subtree, we select a new vantage point; a point whose distance to this vantage point is less than its radius is assigned to its left subtree, and a point whose distance is greater than its radius is assigned to its right subtree. The right subtree of the root vantage point is partitioned in the same way.
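A minimal Python sketch of this construction, assuming a user-supplied distance function dist(a, b), is as follows.

```python
# Minimal sketch of VP-tree construction over a list of points, given a
# distance function dist(a, b). The median split follows the description
# above; tie handling is a simplifying assumption.
import random
import statistics

class VPNode:
    def __init__(self, point, radius, left=None, right=None):
        self.point = point      # the vantage point
        self.radius = radius    # median distance to the remaining points
        self.left = left        # inner group (distance <= radius)
        self.right = right      # outer group (distance > radius)

def build_vp_tree(points, dist):
    if not points:
        return None
    points = list(points)
    vp = points.pop(random.randrange(len(points)))   # random vantage point
    if not points:
        return VPNode(vp, 0.0)
    distances = [dist(vp, p) for p in points]
    radius = statistics.median(distances)
    inner = [p for p, d in zip(points, distances) if d <= radius]
    outer = [p for p, d in zip(points, distances) if d > radius]
    return VPNode(vp, radius,
                  build_vp_tree(inner, dist),
                  build_vp_tree(outer, dist))
```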
Second, the VP tree is searched as follows. Let Q be the query point, V a vantage point with radius r(V), and τ the current kNN distance. First, if d(Q, V) > r(V) + τ, the left subtree is pruned; this condition means that the points in the inner group of the vantage point cannot be within the kNN distance of Q. Second, if d(Q, V) < r(V) - τ, the right subtree is pruned; this condition means that the points in the outer group of the vantage point cannot be within the kNN distance of Q. Thus, by using the VP tree, we can quickly find the kNN.
For example, when a query point Q is given, as shown in Figure 3, the distance between Q and the root vantage point is less than the difference between the radius of the root vantage point and the kNN distance, so the right subtree of the root vantage point is pruned. This means that the points in the right subtree cannot be kNNs of Q; thus, the three points of the outer group are pruned.
Next, in the left subtree of the root vantage point, the distance between Q and the vantage point of that subtree is greater than the sum of that vantage point's radius and the kNN distance, so its left subtree is pruned. This means that the point in its left subtree cannot be a kNN of Q; thus, it is pruned. Finally, the remaining point is found to be the kNN of Q.
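The following sketch implements this search over the tree built by build_vp_tree() above; the two pruning tests correspond to the conditions stated earlier, with τ denoting the current kNN distance.

```python
# Sketch of kNN search in the VP tree built above. The inner (left) subtree
# is skipped when d(Q, vp) > radius + tau, and the outer (right) subtree is
# skipped when d(Q, vp) < radius - tau, where tau is the current kNN distance.
import heapq
import itertools

def knn_search(root, query, k, dist):
    best = []                     # max-heap of (-distance, tiebreak, point)
    counter = itertools.count()

    def tau():                    # current kNN distance (infinite until k found)
        return -best[0][0] if len(best) == k else float("inf")

    def visit(node):
        if node is None:
            return
        d = dist(query, node.point)
        heapq.heappush(best, (-d, next(counter), node.point))
        if len(best) > k:
            heapq.heappop(best)
        if d - tau() <= node.radius:   # inner subtree may still contain kNNs
            visit(node.left)
        if d + tau() >= node.radius:   # outer subtree may still contain kNNs
            visit(node.right)

    visit(root)
    return [p for _, _, p in sorted(best, reverse=True)]   # nearest first
```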
4. Proposed Method
The disadvantage of deep learning–based malware detection is that it takes a long time, and the disadvantage of similarity hash–based malware detection is its low detection rate. In Section 4.1, we propose the kNN classification method for fast malware detection. In Section 4.2, we provide a fast search method for the similarity hash. In Section 4.3, we describe the deep learning–based malware detection method used when the kNN classification cannot decide.
4.1. kNN Classification Method for Fast Malware Detection
The disadvantage of similarity hash–based malware detection is its low detection rate. This means that the first nearest neighbor (1NN) of a normal file can be a malicious file or the 1NN of a malicious file can be a normal file. Therefore, we classify the training data into three groups, as shown in Figure 5: normal, malicious, and undecided. In the normal group, each file is normal and its 1NN is also normal. In the malicious group, each file is malicious and its 1NN is also malicious. In the undecided group, if a file is normal, its 1NN is malicious, and if a file is malicious, its 1NN is normal.
The purpose of 1NN classification is to improve the malware detection speed and increase the detection rate. Given a test file, if its 1NN is in the normal group, we determine that the file is normal; if its 1NN is in the malicious group, we determine that it is malicious. However, if its 1NN is in the undecided group, we cannot determine from the 1NN alone whether it is malicious. Therefore, to determine whether it is malicious, we use a deep learning–based malware detection method.
When the 1NN of a test file is in the undecided group, we can increase the detection rate by using a deep learning–based method. When the 1NN of a test file is in the normal group or malicious group, we can speed up the malware detection because we can determine whether or not the file is malicious without using a deep learning–based method.
The kNN classification method is composed of a training phase and a test phase. The training phase proceeds as follows.
- 1. We find the 1NN of each file in the training data.
- 2. If a file is normal and its 1NN is also normal, it is assigned to the normal group. If a file is malicious and its 1NN is also malicious, it is assigned to the malicious group. Otherwise, it is assigned to the undecided group.
The test phase proceeds as follows.
- 3. We find the 1NN of each test file.
- 4. If the 1NN is in the normal group, we determine that the test file is normal, and if the 1NN is in the malicious group, we determine that the test file is malicious.
- 5. Otherwise, we determine whether or not the test file is malicious by using the deep-learning method introduced in Section 4.3.
For example, in Figure 5, the training data include three normal files and three malicious files. In the training phase, we find the 1NN of each training file and classify the training data into the normal, malicious, and undecided groups shown in Figure 5. Next, the test phase is performed: for each test file, we find its 1NN and determine which group that neighbor belongs to. A test file whose 1NN is in the normal group is determined to be normal, and a test file whose 1NN is in the malicious group is determined to be malicious. However, for a test file whose 1NN is in the undecided group, we must determine whether it is malicious by using a deep learning–based detection method.
Therefore, if the 1NN of a test file is in the normal group or the malicious group, we can quickly determine whether or not it is malicious using only the kNN classification. In addition, when the 1NN of a test file is in the undecided group, we can increase the detection rate because we determine whether it is malicious by using the deep learning–based method. In Section 5, we will present the detection rates and detection times of kNN classification for malware detection.
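A minimal sketch of the training and test phases is given below. Here each training file is represented by a hashable similarity-hash value (e.g., a TLSH hex string) with a label, dist is a distance function on these values, and deep_model_predict stands in for the deep-learning model of Section 4.3; the brute-force nearest_neighbor() would in practice be replaced by the VP-tree search of Section 4.2.

```python
# Sketch of the 1NN classification above. Each training file is a
# (hash_value, label) pair, where hash_value is a hashable similarity hash
# (e.g., a TLSH hex string) and label is "normal" or "malicious".
# nearest_neighbor() is brute force here; in practice it would be replaced
# by the VP-tree search of Section 4.2. deep_model_predict() is a
# placeholder for the deep-learning model of Section 4.3.

def nearest_neighbor(hash_value, candidates, dist):
    """Return the (hash_value, label) pair closest to hash_value."""
    return min(candidates, key=lambda item: dist(hash_value, item[0]))

def classify_training_data(train, dist):
    """Assign every training file to the normal, malicious, or undecided group."""
    groups = {}
    for i, (h, label) in enumerate(train):
        others = train[:i] + train[i + 1:]
        _, nn_label = nearest_neighbor(h, others, dist)
        groups[h] = label if label == nn_label else "undecided"
    return groups

def detect_file(test_hash, train, groups, dist, deep_model_predict):
    """Fast path via the 1NN's group; deep-learning fallback otherwise."""
    nn_hash, _ = nearest_neighbor(test_hash, train, dist)
    group = groups[nn_hash]
    if group in ("normal", "malicious"):
        return group
    return deep_model_predict(test_hash)
```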
4.2. Rapid Search Method for a Similarity Hash
Even if similarity hash–based malware detection is faster than the deep learning–based method, it still takes a long time when there are many data; for example, there are about one billion malicious files [1]. Therefore, when we find the kNN of a new file, we need to improve the speed of the kNN search over the similarity hashes.
Therefore, in this section, we propose a kNN search method for the similarity hash using a VP tree. The purpose of the VP tree is to reduce, as much as possible, the number of distance comparisons needed to find the kNN. When there are n existing files and a new file is given, we can find the kNN of the new file using a brute-force search, whose complexity is O(n). By using a VP tree, we can reduce the number of distance comparisons, so that the complexity becomes O(log n).
As described in Section 3.1, the body of the TLSH is the sequence of 2-bit codes obtained from the bucket array. We define the distance between two files as the difference between the bodies of their TLSH values. Note that the TLSH difference is originally computed as the sum of the difference of the header and the difference of the body of the TLSH. However, because the TLSH difference does not guarantee the triangle inequality [31], we modify the distance between two files to be the difference of only the body of the TLSH.
Then, we generate a VP tree using the training data, as follows. First, we randomly select a vantage point and compute the distances between the vantage point and the other points. Then, we compute a vantage-point radius that classifies the other points into two groups: inner and outer. Second, the points of the inner group are assigned to the left subtree of the vantage point and the points of the outer group are assigned to the right subtree of the vantage point. Third, we repeat the first and second steps recursively.
Next, we search for the kNN of a new file using the VP tree, as follows. Let Q be the new file, V a vantage point with radius r(V), and τ the current kNN distance. First, if d(Q, V) > r(V) + τ, the left subtree is pruned. In this case, the points of the vantage point's left subtree are guaranteed not to be contained in the kNN of the query point, as shown in Figure 6. Otherwise, we must search the left subtree. Second, if d(Q, V) < r(V) - τ, the right subtree is pruned. In this case, the points of the vantage point's right subtree are guaranteed not to be contained in the kNN of the query point, as shown in Figure 7. Otherwise, we must search the right subtree.
Therefore, we can reduce the kNN search time for the similarity hash using a VP tree. In Section 5, we will show the kNN search time using a VP tree.
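The sketch below ties these pieces together, reusing build_vp_tree() and knn_search() from the sketches in Section 3.2: the TLSH bodies of the training files are indexed in a VP tree under a body-only distance, and the kNN of a new file is then retrieved from the tree. Representing a body as a list of 2-bit codes follows the earlier sketches and is an assumption.

```python
# Sketch of Section 4.2, reusing build_vp_tree() and knn_search() from the
# Section 3.2 sketches. Each file is represented by the body of its TLSH,
# assumed here to be a list of 2-bit bucket codes; the distance keeps only
# the body difference, as discussed above.

def body_distance(body_a, body_b):
    """Distance between two files = difference of their TLSH bodies only."""
    return sum(abs(x - y) for x, y in zip(body_a, body_b))

def index_training_files(training_bodies):
    """Build a VP tree over the TLSH bodies of the training files."""
    return build_vp_tree(training_bodies, body_distance)

def find_knn(tree, new_body, k=1):
    """Search the kNN of a new file's TLSH body in the VP tree."""
    return knn_search(tree, new_body, k, body_distance)
```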
4.3. Deep Learning–Based Malware Detection
The deep learning–based malware-detection method consists of two modules: feature extraction and deep learning. First, the feature-extraction module extracts the feature data from the portable executable (PE) files. Second, the deep-learning module trains a deep-learning model using training data and tests it using test data.
Figure 8 shows the complete deep learning–based malware-detection system.
In the feature-extraction module, we convert the PE files into assembly-language (ASM) files using Objdump [13]. We then extract opcode sequences from the ASM files and build trigram sequences from three consecutive opcodes.
For example, Figure 9 shows an ASM file. From it, we extract the opcode sequence and then construct trigram sequences from every three consecutive opcodes.
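A sketch of this step is shown below. It assumes an objdump build that can disassemble PE files (e.g., MinGW binutils) and a simplified disassembly line format; the regular expression and helper names are illustrative, not the exact parser used in the paper.

```python
# Sketch of feature extraction: parse objdump output, keep the mnemonic of
# each instruction, and build trigrams of three consecutive opcodes.
# The assumed line format "address:<tab>bytes<tab>mnemonic operands" is a
# simplification of real objdump output.
import re
import subprocess

def extract_opcodes(pe_path: str):
    asm = subprocess.run(["objdump", "-d", pe_path],
                         capture_output=True, text=True).stdout
    opcodes = []
    for line in asm.splitlines():
        # e.g. "  401000:\t55\t\tpush   %ebp"  ->  mnemonic "push"
        m = re.match(r"\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s+)+([a-z][\w.]*)", line)
        if m:
            opcodes.append(m.group(1))
    return opcodes

def opcode_trigrams(opcodes):
    """Trigrams of three consecutive opcodes."""
    return [tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)]
```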
In the deep-learning module, we train a deep-learning model that combines a CNN layer [15] and an LSTM layer [16] using the training data, as shown in Figure 8. Then, we test it with the test data. The CNN layer helps to obtain more features from the files, and the LSTM layer helps to distinguish one sequence from another. In Section 5, we will show the detection rate of the deep-learning model.
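A Keras sketch of such a CNN+LSTM detector over trigram index sequences is shown below. The vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than the hyperparameters used in the paper.

```python
# Sketch of a CNN+LSTM detector over opcode-trigram index sequences, using
# Keras. Vocabulary size, sequence length, and layer sizes are illustrative
# assumptions, not the hyperparameters used in the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 50_000      # number of distinct opcode trigrams (assumed)
SEQ_LEN = 2_000          # trigrams kept per file (assumed)

def build_model():
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(VOCAB_SIZE, 64),          # trigram index -> vector
        layers.Conv1D(128, 5, activation="relu"),  # CNN layer: local patterns
        layers.MaxPooling1D(4),
        layers.LSTM(64),                           # LSTM layer: sequence context
        layers.Dense(1, activation="sigmoid"),     # P(malicious)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```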
6. Discussion
In this paper, we proposed two methods to improve malware detection. First, we proposed a kNN classification to increase the detection rate and reduce the detection time. Second, we provided a VP tree using a similarity hash to improve the similarity-hash search speed. By using kNN classification, we increased the detection rate by 25% and decreased the detection time by 67%. By using the VP tree with a similarity hash, we reduced the search time by 20%.
In this study, we determined whether a file was malicious. Thus, we classified the files into a normal group, a malicious group, and an undecided group. However, there are different types of malware, such as trojans, backdoors, worms, droppers, and viruses. Therefore, the kNN classification is also needed for specifying the type of malware. We can apply the kNN classification twice: in the first pass, we classify files into a normal group, a malicious group, and an undecided group; in the second pass, we classify the malicious files into a trojan group, a backdoor group, a worm group, a dropper group, a virus group, and an undecided group.
Moreover, we propose improving the VP tree to further reduce the search time. Originally, we expected that with n training files the search time would be O(log n). However, in the experiments, we found that the query circle often overlaps both the inner space and the outer space of a vantage point, so we frequently had to search both the left subtree and the right subtree of the vantage point. We expect that using an m-tree [36] having m subtrees, instead of a VP tree having two subtrees, may reduce the search time.
On the other hand, there has been recent research on evading malware detection using GANs [22,23,24,25]. A GAN can be used to modify malware with its generator so that the discriminator determines it to be normal, and it has been reported that such attacks significantly reduce the detection rate. We need a method to prevent this evasion of malware detection. We expect that this problem may be solved, to some extent, by using deep learning–based and similarity hash–based detection methods together.