A Simhash-Based Integrative Features Extraction Algorithm for Malware Detection

In the malware detection process, obfuscated malicious codes cannot be efficiently and accurately detected solely in the dynamic or static feature space. Aiming at this problem, an integrative feature extraction algorithm based on simhash was proposed, which combines the static information e.g., API (Application Programming Interface) calls and dynamic information (such as file, registry and network behaviors) of malicious samples to form integrative features. The experiment extracts the integrative features of some static information and dynamic information, and then compares the classification, time and obfuscated-detection performance of the static, dynamic and integrated features, respectively, by using several common machine learning algorithms. The results show that the integrative features have better time performance than the static features, and better classification performance than the dynamic features, and almost the same obfuscated-detection performance as the dynamic features. This algorithm can provide some support for feature extraction of malware detection.


Introduction
In recent years, the quantity of malware has increased significantly, and new types of malware [1][2][3] have steadily emerged, creating severe challenges for cyberspace security.Therefore, it is critical in the field of malware detection to quickly analyze malicious samples and extract their real and effective features to form the detection model [4].The existing sample analysis technology mainly includes static analysis and dynamic analysis.The static and dynamic features of malicious samples can be extracted separately by analysis [5,6].
The static feature is formed by analyzing the structure and format of the sample and then extracting the hash value, string information, function information, header file information, and resource description information.The technology obtains most of the malware information from the malware itself, thus the analysis results are relatively comprehensive.However, static features cannot correctly discriminate malware when the static information is packed or obfuscated or compressed [7], making it difficult for static features to express the true purpose of malware, thus affecting the accuracy of detection.
Dynamic features are the behavior of the sample execution and the features of the debug record, such as file operations, the creation and deletion of processes, and other dynamic behaviors.Since the malicious behaviors of malware at dynamic runtime can't be concealed, the extracted dynamic features provide a more realistic description than the static features.However, the extraction of dynamic features needs to be run in a virtual environment [8], which will be reset and restored to the previous state after each malicious sample has be analyzed to ensure that the virtual environment is a real user environment.As a result, features extraction efficiency is much lower than for static features.
It can be seen from the above that it is difficult for one aspect of training and testing to make up for the other defects.Therefore, the research of integrative features aims to integrate the advantages of various types of features; this is a current trend in the development of malware detection technology [9][10][11].Integrating the efficiency of static features and the authenticity of dynamic features can optimize both the performance of features and the spatio-temporal performance of features for classification training.To this end, this paper extracts integrative features of static and dynamic features based on simhash.On the basis of the simhash algorithm, the API function information of the static analysis is merged with the file, process, registry and network behavior information of the dynamic analysis to form integrative features.This algorithm can further improve the training speed of the classification model, as well as achieve better classification and obfuscated-detection results.

Related Work
In recent years, many researchers have proposed a variety of solutions.For instance, some detection methods are based on content signatures [12,13], which compare each sample with known malware signatures.However, the signature-based warning mechanism cannot solve the metamorphic or unidentified instances of malware [14].Ni [15] proposed a malware classification algorithm that uses static features called MCSC (Malware Classification using Simhash and CNN), which convert the disassembled malware codes into gray images based on simhash and then identifies their families by a convolutional neural network.In addition, Idrees [16] combines permissions and intents, and extracts API calls to detect malware [17].These methods all belong to static analysis, which means that it is the process of detecting malware without executing.Thus, it cannot reflect the behavior of malware very well, and cannot easily detect malicious code that is packed and obfuscated.
In addition to the above methods, observing dynamic behaviors and features is also usually used to detect malware [18].For example, a sandbox is used to monitor the sample in real time and dynamically analyze the behavior by extracting the IP address, system calls, memory usage, returned values and times between consecutive system calls [19].In addition, Shibahara [20] proposed a method for determining whether dynamic analysis should be suspended, based on network behavior, to collect malware communications efficiently and exhaustively.While dynamic analysis can offer a more comprehensive view of malware detection, the cost of building the environment and of the manual endeavors, in the process of investigation, is high.More importantly, dynamic analysis is often more complex than static analysis and requires more resources to reduce the number of false positives reported, and usually suffers from low detection rates due to lack of adequate training [21].
Therefore, we propose a method to integrate static and dynamic features to obtain new features.The remainder of this paper is organized as follows.In Section 3, we present our materials and method.Experiments and comparisons are reported in Section 4. Finally, we conclude our work in Section 5.

Methods
The purpose of the algorithm is to use simhash to integrate the results of static analysis and dynamic analysis of malicious samples, to form a uniquely identified hash value (i.e., the integrative feature).The process of the algorithm is shown in Figure 1.As shown in Figure 1, the algorithm includes three sub-portions, which are a static analysis, a dynamic analysis, and a dynamic and integrative feature extraction.
In the static analysis, the first step is to judge whether the sample has a shell and, if so, use the relevant unpack tools to remove it.Next, the API call functions are obtained through IDA Pro disassembly.In the dynamic analysis, the process is to configure and debug the sample analysis environment, run the samples in the environment, and then use Process Monitor, Wireshark and other tools to capture the file, registry, process and network behaviors.In the integrative feature extraction, we use the simhash algorithm to correlate the API functions with the file, process, registry and network behaviors respectively, and finally integrate a hash binary value.The three processes are detailed below.

Static Analysis
Static analysis means obtaining the static information (e.g., structural information, format information, etc.) directly by disassembling the malicious samples.Before disassembly, it would generally detect whether the malicious samples have shells (i.e., unpacking using unpack tools) [22].In addition, malware must interact with the system by means of the API provided by the operating system to express malicious behaviors.If the attacker does not directly use the API for system calls instead of a large amount of program code, the payload in the malicious code will be longer, which will further increase the size of the malware, in turn highlighting its malicious features and making it more easy to detect by the intrusion detection system.Although the API itself is not malicious, malware can be malicious by combining certain API functions, and these are generally uncommon in normal files such as process injection, key files alteration and deletion, etc. [23].Therefore, this paper mainly extracts the API function information.The APIs provided by Windows systems are not only extremely large but also have different functions.From the literature, we have selected 9 API modules that may cause system security problems: advapi32.dll,kernel32.dll,ntdll.dll,Psapi.dll,rasapi32.dll,shell32.dll,user32.dll,wininet.dlland ws2_32.dll[24].

Dynamic Analysis
Dynamic analysis refers to running malicious samples in a simulated sample analysis environment (e.g., a sandbox), and then capturing the various behaviors of the samples in the run.The dynamic analysis process is more complicated than static analysis and requires longer running hours.However, the dynamic analysis process can accurately capture the behaviors [25].Therefore, dynamic analysis is also the key to malware detection.In this paper, we used Cuckoo Sandbox [26] to monitor and capture information on four sensitive behaviors i.e. file behavior, registry behavior, process behavior and network behavior from a .jsonformat analysis report from Cuckoo Sandbox.

Integrative Features Extraction
In dynamic analysis, the API functions are called during file read and write, memory allocation and release, and process creation and destruction.Therefore, this paper improves the simhash algorithm to mark the behavior information associated with the API with different weights, and thus As shown in Figure 1, the algorithm includes three sub-portions, which are a static analysis, a dynamic analysis, and a dynamic and integrative feature extraction.
In the static analysis, the first step is to judge whether the sample has a shell and, if so, use the relevant unpack tools to remove it.Next, the API call functions are obtained through IDA Pro disassembly.In the dynamic analysis, the process is to configure and debug the sample analysis environment, run the samples in the environment, and then use Process Monitor, Wireshark and other tools to capture the file, registry, process and network behaviors.In the integrative feature extraction, we use the simhash algorithm to correlate the API functions with the file, process, registry and network behaviors respectively, and finally integrate a hash binary value.The three processes are detailed below.

Static Analysis
Static analysis means obtaining the static information (e.g., structural information, format information, etc.) directly by disassembling the malicious samples.Before disassembly, it would generally detect whether the malicious samples have shells (i.e., unpacking using unpack tools) [22].In addition, malware must interact with the system by means of the API provided by the operating system to express malicious behaviors.If the attacker does not directly use the API for system calls instead of a large amount of program code, the payload in the malicious code will be longer, which will further increase the size of the malware, in turn highlighting its malicious features and making it more easy to detect by the intrusion detection system.Although the API itself is not malicious, malware can be malicious by combining certain API functions, and these are generally uncommon in normal files such as process injection, key files alteration and deletion, etc. [23].Therefore, this paper mainly extracts the API function information.The APIs provided by Windows systems are not only extremely large but also have different functions.From the literature, we have selected 9 API modules that may cause system security problems: advapi32.dll,kernel32.dll,ntdll.dll,Psapi.dll,rasapi32.dll,shell32.dll,user32.dll,wininet.dlland ws2_32.dll[24].

Dynamic Analysis
Dynamic analysis refers to running malicious samples in a simulated sample analysis environment (e.g., a sandbox), and then capturing the various behaviors of the samples in the run.The dynamic analysis process is more complicated than static analysis and requires longer running hours.However, the dynamic analysis process can accurately capture the behaviors [25].Therefore, dynamic analysis is also the key to malware detection.In this paper, we used Cuckoo Sandbox [26] to monitor and capture information on four sensitive behaviors i.e. file behavior, registry behavior, process behavior and network behavior from a .jsonformat analysis report from Cuckoo Sandbox.

Integrative Features Extraction
In dynamic analysis, the API functions are called during file read and write, memory allocation and release, and process creation and destruction.Therefore, this paper improves the simhash algorithm to mark the behavior information associated with the API with different weights, and thus extracts the integrative features.The following mainly describes the simhash algorithm and integrative process.

Simhash Algorithm
The simhash algorithm comes from Moses Charikar's paper and is the core of feature extraction [27], which was originally designed to solve the deduplication tasks of hundreds of millions of web pages.The main idea is to map high-dimensional feature vectors into low-dimensional feature vectors.The implementation process is mainly divided into five steps, namely segmenting, hash, weighting, merging, and descending dimension (Figure 2).extracts the integrative features.The following mainly describes the simhash algorithm and integrative process.

Simhash Algorithm
The simhash algorithm comes from Moses Charikar's paper and is the core of feature extraction [27], which was originally designed to solve the deduplication tasks of hundreds of millions of web pages.The main idea is to map high-dimensional feature vectors into low-dimensional feature vectors.The implementation process is mainly divided into five steps, namely segmenting, hash, weighting, merging, and descending dimension (Figure 2).

Integrative Process
The simhash algorithm is used to integrate the dynamic and static information to form features in this paper, and some improvements are made in the process of segmenting and weighting.The process is shown in Figure 3.The process is mainly divided into five steps: data preprocessing, weight calculating, hash, weighting and merging.
A. Data Preprocessing

Integrative Process
The simhash algorithm is used to integrate the dynamic and static information to form features in this paper, and some improvements are made in the process of segmenting and weighting.The process is shown in Figure 3. extracts the integrative features.The following mainly describes the simhash algorithm and integrative process.

Simhash Algorithm
The simhash algorithm comes from Moses Charikar's paper and is the core of feature extraction [27], which was originally designed to solve the deduplication tasks of hundreds of millions of web pages.The main idea is to map high-dimensional feature vectors into low-dimensional feature vectors.The implementation process is mainly divided into five steps, namely segmenting, hash, weighting, merging, and descending dimension (Figure 2).

Integrative Process
The simhash algorithm is used to integrate the dynamic and static information to form features in this paper, and some improvements are made in the process of segmenting and weighting.The process is shown in Figure 3.The process is mainly divided into five steps: data preprocessing, weight calculating, hash, weighting and merging.
A. Data Preprocessing The process is mainly divided into five steps: data preprocessing, weight calculating, hash, weighting and merging.

A. Data Preprocessing
For a certain sample S, the API function information captured by static analysis is expressed as the API call sequence, X s = {x 1 , x 2 , . . . ,x t , . . . ,x m }, according to the results of monitoring by the .dllmodules, where x t represents the t-th function called, and m represents the total number of call functions.Then, dynamic analysis is used to extract the file behavior information, the process behavior information, the registry behavior information, and the network behavior information, which are respectively represented as corresponding sequences according to the execution sequence of each behavior; these are the file behavior sequence, F S = { f 1 , f 2 , . . . ,f i , . . . ,f a }, the process behavior sequence, P s = p 1 , p 2 , . . ., p j , . . ., p b , the registry behavior sequence, R s = {r 1 , r 2 , . . . ,r k , . . . ,r c }, and the network behavior sequence, N s = {n 1 , n 2 , . . . ,n l , . . . ,n d }, where: f i is the i-th file behavior of the execution; p j is the execution of the j-th process behavior; r k is the execution of the k-th registry behavior; n l is the execution of the l-th network behavior; a, b, c, and d, respectively indicate the length of each sequence of behavior; and, A S = {F S , P S , R S , N S }.

B. Weight Calculating
Since the sequence of the API function is the same as the integrative process of each behavior sequence, the API function sequence, X S , and the file behavior sequence, F S , are selected as examples for explanation of weight calculating.The weights are defined as follows: Definition 1. API function sequence weight w t .The weight level is defined as 2. For each function, x t , in the sequence, if x t is related to the integrative behavior information, the weight is 2; otherwise the weight is 1.Definition 2. File behavior sequence weight w i , whose weight is the number of times each behavior information, f i , in F S , is repeated in the sequence.

D. Weighting
H X S ∪F S is weighted by every bit.If a bit in hash(x t ) or hash( f i ) is 1, then +w t or +w q ; if it is 0, then −w t or −w q .For example, if the hash value is 101110, the weight is 2, and the result of the weighting calculation is 2−2222−2.Therefore, each element in H X S ∪F S is represented as a sequence of numbers of b-bits, resulting in H X S ∪F S ,b−bits .

E. Merging
Each b-bits sequence in H X S ∪F S ,b−bits is accumulated and merged to obtain a final b-bits sequence.Then, the b-bits sequence is normalized with a negative value taking 0 and a positive value taking 1.
Ultimately, a b-bits sequence of numbers consisting of 0 and 1 is formed, which is the integrative feature, Z X S ∪F S , of X S and F S .
Output: four integrative features Z X S ∪F S , Z X S ∪R S , Z X S ∪P S , Z X S ∪N S .
Step 1.Let n = 1, read each line of .jsonfile and disassembly file, capture API call sequence, X S , file behavior sequence, F S , registry behavior sequence, R S , process behavior sequence, P s , and network behavior sequence, N S .
Step 2. Calculate the weight, w, of each of the four behavior sequences corresponding to the API call sequence, and obtain W X S ∪F S , W X S ∪R S , W X S ∪P S , W X S ∪N S .
Step 3. Calculate each hash value of X S , F S , R S , P s , N S , and the result is represented as H X S ∪F S , H X S ∪R S , H X S ∪P S , W X S ∪N S .
Step 4. Calculate each weight value of H X S ∪F S , H X S ∪R S , H X S ∪P S , H X S ∪N S and the result is represented as Step 5. Accumulate the sequence of each b-bit in H X S ∪F S ,b−bits , H X S ∪R S ,b−bits , H X S ∪P S ,,b−bits , H X S ∪N S ,,b−bits , and merge to a final b-bits sequence, and then normalize it to obtain the integrative features Z X S ∪F S , Z X S ∪R S , Z X S ∪P S , Z X S ∪N S .

Experimental Configuration
The experimental data come from VXHeavens and Malshare malware sharing websites.We collect a total of 5949 malicious samples in the form of PE (Portable Executable) files under the win32 platform in four main types: Backdoor, Trojan, Virus, and Worm (Table 1).The normal samples in the experiment are system files or various application files in .exeformat, with no malicious behaviors.We use Python to implement the algorithm in this paper.The hardware and software configuration of the experiment are shown in Table 2.

Experimental Design
In the experiments, correctly classified (CC), true positive (TP; the ratio of malicious samples predicted to be malicious), true negative (TN; the ratio of non-malicious samples predicted to be non-malicious), false positive (FP; the ratio of non-malicious samples predicted to be malicious), false negative (FN; the ratio of malicious samples predicted to be non-malicious) and time (T; training time) were used as evaluation metrics.The experimental process consists of the following three parts:

Classification Effect Evaluation
The extracted four integrative features are represented as four-dimensional features and compared with the dynamic and static features which are not integrated.First, according to the literature [28], the N-gram algorithm is used to extract the static features from the API function information, where N = 3.Then, four kinds of dynamic behavior information are filtered to extract dynamic features.Finally, we use different classification algorithms to train and test the integrative features, static features and dynamic features, and evaluate the impact of different features and different classification algorithms on the classification effect.Among these, the classification algorithm is trained and tested by five common classification algorithms: NB (Naive Bayes, NB), SGD (Stochastic Gradient Descent, SGD), SVM (Support Vector Machine, SVM), Ada (Adaboost, Ada) and RT (Random Trees, RT).The description of the selected classification algorithms is as follows.
NB is a simple technique for constructing classifiers; i.e., models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable [29].It has the advantages of simple calculation, ability to handle multi-classification tasks, and insensitivity to missing data.However, the algorithm needs to calculate the prior probability and has a higher requirement on the form of the input data.SGD, also known as an incremental gradient descent method, is used to optimize a differentiable objective function and a stochastic approximation of gradient descent optimization [30].The accuracy of the algorithm is higher, but the training speed for large samples is slower.
SVMs are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.It can be used for the processing of high-dimensional features and nonlinear features, but it is less efficient for large samples and is sensitive to missing data [31].
AdaBoost is often referred to as the best out-of-the-box classifier.When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder-to-classify examples [32].The algorithm can use different classification algorithms as weak classifiers, and it fully considers the weight of each classifier relative to the random forest algorithm.However, the number of weak classifiers is difficult to set, and the training takes a long time.
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.Random decision forests correct for decision trees' habit of overfitting to their training set [33].It can handle very high-dimensional features without having to make feature selections.In addition, it is fast in training and insensitive to missing features.
At last, we use the method of K-fold cross validation to train and test, where K = 10.

Obfuscated-Detection Evaluation
Obfuscated-detection refers to the ability that can obfuscate samples.First, we randomly select 500 obfuscated samples from the dataset and extract their features.Then, the 500 obfuscated samples and 500 normal samples are used as the verification set, and the three feature-trained detection models are used to detect the set.

Time Performance Evaluation
We compare the time it takes for different features to train the detection model by different classification algorithms.

Classification Effect Evaluation
The training and test results of various types of features under different algorithms are shown in Tables 3-7, and the ROC curve of the algorithms is shown in Figure 4.It can be seen that the classification effect of the integrative features is almost the same as for the other two types of features, and the CC values of the three types of features are all above 80%.The experimental results are analyzed from the perspective of algorithms and features, respectively.As shown in Figure 5, the CC value of the static features is higher than the other two types of features, the CC value of the dynamic features is relatively low, and the CC value of the integrative features is centered.The reason for this may be that the static information is rich and that there are more static features.Therefore, the generalization ability of the training model of the static feature is relatively high, and the CC value is increased.Some samples, e.g., samples in the .dllfile format, do not have behavior information, so there are fewer dynamic features than static features and the CC values of the dynamic features is lower.The integrative features integrate the above two types of features.Although the number of features trained is large enough, the simhash algorithm is affected by the missing dynamic information during the integrative process, so that the classification effect is between the other two types of features.As shown in Figure 5, the CC value of the static features is higher than the other two types of features, the CC value of the dynamic features is relatively low, and the CC value of the integrative features is centered.The reason for this may be that the static information is rich and that there are more static features.Therefore, the generalization ability of the training model of the static feature is relatively high, and the CC value is increased.Some samples, e.g., samples in the .dllfile format, do not have behavior information, so there are fewer dynamic features than static features and the CC values of the dynamic features is lower.The integrative features integrate the above two types of features.Although the number of features trained is large enough, the simhash algorithm is affected by the missing dynamic information during the integrative process, so that the classification effect is between the other two types of features.As shown in Figure 6, the model trained by the SGD algorithm has the highest CC value, followed by the RT, the SVM and Ada algorithms.The model trained by the dynamic features using the Ada algorithm has the lowest CC value.From the perspective of training time, although the SGD algorithm has the highest CC value, the training model takes a relatively long time, especially in the experiment using K-fold cross validation (i.e., approximately 40 min for 5000 samples detection), thus the algorithm is not suitable for actual detection.In summary, using the RT algorithm to train the integration integrative features is comparatively better.

Obfuscated-Detection Evaluation
Table 8 and Figure 7 show the results of the obfuscated detection of the three types of features, where the DR (Detection Rate) represents the ratio of the obfuscated-detection samples to the total samples.It can be seen that the detection rate of dynamic features and integrative features is very close and better than that of static features.The reason is that it is difficult to capture the true and complete static information of obfuscated samples in static analysis.In terms of algorithms, the detection rate using the NB, SGD or RT algorithms is higher than other algorithms.As shown in Figure 6, the model trained by the SGD algorithm has the highest CC value, followed by the RT, the SVM and Ada algorithms.The model trained by the dynamic features using the Ada algorithm has the lowest CC value.From the perspective of training time, although the SGD algorithm has the highest CC value, the training model takes a relatively long time, especially in the experiment using K-fold cross validation (i.e., approximately 40 min for 5000 samples detection), thus the algorithm is not suitable for actual detection.In summary, using the RT algorithm to train the integration integrative features is comparatively better.As shown in Figure 6, the model trained by the SGD algorithm has the highest CC value, followed by the RT, the SVM and Ada algorithms.The model trained by the dynamic features using the Ada algorithm has the lowest CC value.From the perspective of training time, although the SGD algorithm has the highest CC value, the training model takes a relatively long time, especially in the experiment using K-fold cross validation (i.e., approximately 40 min for 5000 samples detection), thus the algorithm is not suitable for actual detection.In summary, using the RT algorithm to train the integration integrative features is comparatively better.

Obfuscated-Detection Evaluation
Table 8 and Figure 7 show the results of the obfuscated detection of the three types of features, where the DR (Detection Rate) represents the ratio of the obfuscated-detection samples to the total samples.It can be seen that the detection rate of dynamic features and integrative features is very close and better than that of static features.The reason is that it is difficult to capture the true and complete static information of obfuscated samples in static analysis.In terms of algorithms, the detection rate using the NB, SGD or RT algorithms is higher than other algorithms.

Obfuscated-Detection Evaluation
Table 8 and Figure 7 show the results of the obfuscated detection of the three types of features, where the DR (Detection Rate) represents the ratio of the obfuscated-detection samples to the total samples.It can be seen that the detection rate of dynamic features and integrative features is very close and better than that of static features.The reason is that it is difficult to capture the true and complete

Time Performance Evaluation
As shown in Tables 3-7, the model trained with the integrative features takes the least time.The reason is that the feature uses the simhash algorithm to integrate the features into binary values, greatly reducing the storage space requirements of the training data and making training faster.However, the number of static and dynamic features is large, the storage space requirements are increased and the training speed is slow.It can be seen that the extracted integrative features in this paper have great advantages in terms of time performance.

Conclusions
In the field of malware detection, the features of the samples play an important role.This paper improves the simhash algorithm to integrate static and dynamic information, and proposes an extraction algorithm for integrative features.Through experimental analysis and verification, the integrative features combine the advantages of static features and dynamic features, while greatly reducing the training time of the detection model and further improving detection efficiency.However, it is also found that the amount of dynamic and static information is significant and that there is no further processing optimization, which makes it more expensive to extract integrative features using the simhash algorithm.Therefore, future research will improve the method of analyzing and capturing filtering information to reduce the time needed for feature extraction, and integrate other static information and dynamic information.

Time Performance Evaluation
As shown in Tables 3-7, the model trained with the integrative features takes the least time.The reason is that the feature uses the simhash algorithm to integrate the features into binary values, greatly reducing the storage space requirements of the training data and making training faster.However, the number of static and dynamic features is large, the storage space requirements are increased and the training speed is slow.It can be seen that the extracted integrative features in this paper have great advantages in terms of time performance.

Conclusions
In the field of malware detection, the features of the samples play an important role.This paper improves the simhash algorithm to integrate static and dynamic information, and proposes an extraction algorithm for integrative features.Through experimental analysis and verification, the integrative features combine the advantages of static features and dynamic features, while greatly reducing the training time of the detection model and further improving detection efficiency.However, it is also found that the amount of dynamic and static information is significant and that there is no further processing optimization, which makes it more expensive to extract integrative features using the simhash algorithm.Therefore, future research will improve the method of analyzing and

Figure 2 .
Figure 2. The process of the simhash algorithm.

Figure 2 .
Figure 2. The process of the simhash algorithm.

Figure 2 .
Figure 2. The process of the simhash algorithm.