Malware Variant Identification Using Incremental Clustering

Abstract: Dynamic analysis and pattern matching techniques are widely used in industry, and they provide a straightforward method for the identification of malware samples. Yara is a pattern matching technique that can use sandbox memory dumps for the identification of malware families. However, pattern matching techniques fail silently due to minor code variations, leading to unidentified malware samples. This paper presents a two-layered Malware Variant Identification using Incremental Clustering (MVIIC) process and proposes clustering of unidentified malware samples to enable the identification of malware variants and new malware families. A novel incremental clustering algorithm is used in the identification of new malware variants from the unidentified malware samples. This research shows that clustering can provide a higher level of performance than Yara rules, and that clustering is resistant to small changes introduced by malware variants. This paper proposes a hybrid approach, using Yara scanning to eliminate known malware, followed by clustering, acting in concert, to allow the identification of new malware variants. F1 score and V-measure clustering metrics are used to evaluate our results.


Introduction
This paper presents a technique called Malware Variant Identification using Incremental Clustering (MVIIC). A sandbox is an instrumented virtual machine that executes malware samples and other programs, and gathers dynamic analysis features resulting from program execution. These features include filesystem activity, registry activity, network traffic, program execution, and code injection. Two common sandboxes are Cuckoo [1] and CWSandbox [2]. Yara is a pattern matching technique that can use sandbox memory dumps for the identification of malware families. Yara rules contain regular expressions and strings [3]. Yara rules may fail to identify new malware variants when software development modifies the code corresponding to the Yara regular expressions, or when program strings are changed. These unidentified malware samples provide a valuable source of unknown malware families and new malware variants.
While the clustering of dynamic analysis features has previously been used for malware detection [4,5], this paper proposes a hybrid scheme using Yara rules, and a novel incremental clustering algorithm [6] to enable the identification of new malware families and malware variants.
In this research, malware identification is performed in two layers. In the first layer, previously developed Yara rules are used to reject samples of known malware families. A novel incremental clustering algorithm is applied in the second layer. Unlike traditional clustering algorithms that are sensitive to the choice of starting cluster centers, incremental clustering is able to find or approximate the best cluster distribution and detect small clusters. This is important in malware clustering when new data (unidentified malware samples) become available, as such data generate separate and small clusters. Therefore, this incremental clustering algorithm is well suited for use in a malware analysis system.

• Yara rules are readily created.
• Testing allows the elimination of obvious false positives.
• They are excellent for the matching of machine code, program headers, metadata, or strings in programs.
• Detection may fail, due to code changes resulting from malware development, or recompilation.
• When Yara rules fail, no warning is given.
MVIIC provides a key component of the intelligent malware analysis process shown in Figure 1. In this scenario, dynamic analysis of samples from a malware feed (1) is performed, using a Cuckoo sandbox. Yara rules are used to scan the memory dump (2) from the sandbox. Clustering is performed using the dynamic analysis features (4) of each unidentified malware sample (3,5). New malware families (6) and malware variants (6) contained in a random selection of malware samples from each cluster are identified, and new Yara rules (8) are created, using the memory dumps (7) of the clustered malware samples.
This paper makes the following contributions:
• Provides a use case for a novel incremental clustering algorithm.
• Provides feature engineering of features derived from the dynamic analysis logs of malware samples captured from the live feed.
• Develops a two-layered dynamic analysis system that uses Yara rules to reject known malware, and uses a novel clustering algorithm and feature engineering to perform clustering of unidentified samples to enable the identification of new malware families, variants, and deficiencies in the Yara rules.
The structure of this paper is as follows: Section 2 presents related work; Section 3 provides the problem definition and overview; Section 4 describes our approach; Section 5 provides an empirical evaluation of the new approach; Section 6 provides discussion and limitations; and Section 7 presents the conclusion.

Malware Classification Using Machine Learning
Machine learning is widely used in malware detection, classification, and clustering. Common feature extraction techniques are used in each of these activities [7]. Malware feature extraction can be performed using dynamic or static analysis. Feature extraction using dynamic analysis is often performed by running a malware sample in a sandbox. The sandbox extracts behavioral features of the malware execution. These behavioral features include API call traces, filesystem and registry activity, network traffic, memory, instruction, and register traces. A further type of feature may be created by representing malware bytes as gray-scale images. The main limitations of behavioral analysis are malware detection of the analysis environment and the limited program path that a malware sample executes without command and control inputs. Static analysis provides features that are extracted from the malware sample, and includes the function call graph, control flow graphs, API function calls, instruction statistics, graphical representations, strings, byte representations, entropy measurements, and hashes. Malware obfuscation, including packing (compression or encryption of the original malware binary), is used to hinder static analysis [7].
Traditional machine learning techniques, such as support vector machine (SVM) or random forest, require time-consuming feature engineering for the design of the machine learning model. Deep learning architectures do not require feature engineering, and provide a trainable system that automatically identifies features from the provided input data. Deep learning algorithms include the convolutional neural network, residual network, autoencoder, recurrent neural network, long short-term memory (LSTM) network, gated recurrent unit network, and neural network [7].

Malware Clustering
A malware clustering and classification system using dynamic analysis was developed by Rieck et al. [8]. In this research, API call names and call parameters are extracted using CWSandbox. The API call names and parameters are encoded into a multi-level representation called the Malware Instruction Set (MIST) [8]. Malware samples of a specific family frequently contain variants with a high degree of behavioral similarity, providing a dense vector space representation. This research study used prototype clustering to represent the dense groups. The use of a prototype representation accelerated clustering times by reducing the machine learning computation. The clustering experiments used hierarchical clustering with a Euclidean distance measure; features were taken from 33,698 malware samples, and provided a maximum F1 score of 0.95.
A study of the clustering of malware samples using dynamic analysis is provided by Faridi et al. [9]. This research used a data set of 5673 Windows malware samples. The ground truth was determined by using the Suricata Intrusion Detection System (IDS) alerts [10] to classify the malware samples based on the network traffic generated from the execution of each malware sample. This led to a manually verified identification of 94 malware clusters. The clustering features were extracted from the Cuckoo sandbox file, registry keys, services, commands, API names, and mutex reports. Features present in fewer than two samples were dropped to reduce the outliers. The cluster number was estimated, using the gap statistic [11]. Clustering was performed, using the following algorithms: DBSCAN, K-Means++, spectral clustering, and affinity propagation. The features were derived from the Term Frequency Inverse Document Frequency (TF/IDF) matrix of the behavioral data extracted from the Cuckoo sandbox reports. The following pairwise distance functions were used in calculating the TF/IDF matrix: Cosine, Cityblock, Euclidean, Hamming, Bray-Curtis, Canberra, Minkowski, and Chebyshev. The performance of density-based, hierarchical, and prototype clustering was evaluated. The metrics that were used to evaluate the clustering performance are as follows: adjusted mutual information score, adjusted Rand score, homogeneity, completeness, V-measure, G-means (geometric mean of precision and recall), and silhouette coefficient. This paper notes that precision and recall are, in general, sensitive to the number of clusters calculated, while homogeneity and completeness are not. Hierarchical clustering with a Bray-Curtis distance measure provided the best performance with 107 clusters and a V-measure of 0.921 in 195.71 s [9].

Incremental Clustering
Internet security organizations often maintain a knowledge base of malware family behaviors. When a new malware sample is received, the malware knowledge base is used for malware classification, which identifies whether the malware sample is a variant of an existing family, or if it is a new malware family that requires reverse engineering to understand its behaviors. Clustering of malware features can be used in the identification of malware families. However, samples in a malware feed are subject to an ongoing concept drift, due to the software development in existing malware families and the creation of new malware families. Traditional clustering algorithms are designed to operate with static ground truth. To deal with concept drift, traditional clustering algorithms require that the whole data set be periodically re-clustered to incorporate the updated ground truth. The execution time of traditional clustering algorithms increases in proportion to the data set size. The problems of concept drift and data set size can be addressed by a two-layered solution, where clustering is used to identify new malware families and a classifier is then trained with the updated malware families. Difficulties with this approach lie in determining an optimal batch size and the retraining interval. MalFamAware [12] performs both the identification of new malware families and malware family classification through the use of online clustering, and removes the cost of periodic classifier retraining. MalFamAware makes use of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) incremental clustering algorithm. MalFamAware extracts static and dynamic features, using the Cuckoo sandbox. MalFamAware uses the following two phases: tuning, and family identification and classification. In the first phase, BIRCH is tuned using the existing ground truth; in the second phase, BIRCH is used to perform malware classification and new family identification.
MalFamAware testing was performed, using a data set of 18 malware families with a total of 5351 malware samples. In the evaluation of MalFamAware, the BIRCH algorithm provided the best performance with an F1 score of 0.923 [12].

Yara
Yara rules were developed by Victor Alvarez of VirusTotal [13] for the identification and classification of malware samples [14]. They consist of strings and regular expressions [15] that are used to match the program data. They are relatively simple and can be easily created. Yara rules represent patterns and other data in programs, including malware.
These patterns are dependent on the toolchain used to build the program, and can change whenever the program is rebuilt or when software development is performed. Yara rules will fail to identify new malware variants where software development or recompilation have resulted in modifications to machine code corresponding to previous Yara rules. This may result in a failure to identify new malware variants or new malware families.
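The fragility described above can be illustrated with a minimal sketch. This is not Yara itself: it uses Python's `re` module on raw bytes as a stand-in for a byte-pattern rule, and the signature, the marker string `BOTCFG`, and both byte sequences are hypothetical values chosen only for illustration.

```python
import re

# Hypothetical signature: a fixed instruction byte sequence, up to 8 arbitrary
# bytes, then a marker string. This stands in for a byte-pattern rule.
SIGNATURE = re.compile(rb"\x55\x8b\xec\x83\xec\x10.{0,8}BOTCFG", re.DOTALL)


def matches_signature(memory_dump: bytes) -> bool:
    """Return True when the signature occurs anywhere in the dump."""
    return SIGNATURE.search(memory_dump) is not None


# Hypothetical memory contents of the original sample.
original = b"\x90\x90\x55\x8b\xec\x83\xec\x10\x33\xc0BOTCFG\x00"
# Hypothetical variant: one operand byte changed (0x10 -> 0x18) after a rebuild.
variant = b"\x90\x90\x55\x8b\xec\x83\xec\x18\x33\xc0BOTCFG\x00"

print(matches_signature(original))
print(matches_signature(variant))
```

The second call simply reports no match. Nothing signals that a near-identical sample was missed, which is the silent failure mode that motivates clustering the unidentified samples.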

Malware Variant Identification Using Incremental Clustering
There are two classes of incremental algorithms in clustering. The first class contains algorithms that add data points incrementally and update clusters accordingly (e.g., BIRCH [16]). Algorithms from the second class construct clusters incrementally. In order to compute the k > 1 clusters, these algorithms start by calculating one cluster, which is the whole data set, and gradually add one cluster at each iteration. Our algorithm belongs to the second class of incremental algorithms. The following section provides a summary of this incremental clustering algorithm.

Incremental Non-Smooth Clustering Algorithm
Clustering is an unsupervised partitioning of a collection of patterns into clusters based on similarity. Most clustering algorithms can be classified as hierarchical or partitional. We consider partitional clustering algorithms. These algorithms find the partition that optimizes a clustering criterion [6]. Next, we briefly describe the non-smooth optimization formulation of the clustering problem used in this paper.
Assume that $A$ is a finite set of points in the $n$-dimensional space $\mathbb{R}^n$, that is, $A = \{a^1, \ldots, a^m\}$, where $a^i \in \mathbb{R}^n$, $i = 1, \ldots, m$. The hard clustering problem is the distribution of the points of the set $A$ into a given number $k$ of pairwise disjoint and nonempty subsets $A_j$, $j = 1, \ldots, k$, such that

$$A = \bigcup_{j=1}^{k} A_j. \tag{1}$$

The sets $A_j$, $j = 1, \ldots, k$, are called clusters, and each cluster $A_j$ can be identified by its center $x^j \in \mathbb{R}^n$, $j = 1, \ldots, k$. The problem of finding these centers is called the $k$-clustering (or $k$-partition) problem. The similarity measure is defined using the squared $L_2$ norm:

$$d_2(x, a) = \sum_{i=1}^{n} (x_i - a_i)^2. \tag{2}$$

In this case, the clustering problem is also known as the minimum sum-of-squares clustering problem. The non-smooth optimization formulation of this problem is [6,17] as follows:

$$\text{minimize } f_k(x) \text{ subject to } x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk}, \quad \text{where } f_k(x^1, \ldots, x^k) = \frac{1}{m} \sum_{i=1}^{m} \min_{j=1,\ldots,k} d_2(x^j, a^i). \tag{3}$$

One of the well-known algorithms for solving the clustering problem is the k-means algorithm. However, it is sensitive to the choice of starting cluster centers and can find only local solutions to the clustering problem. In large data sets, such solutions may differ significantly from the global solution to the clustering problem.
To enable an effective clustering process, we apply the incremental non-smooth optimization clustering algorithm (INCA), introduced in [18] (see also [6]). This algorithm computes clusters gradually, starting from one cluster. It includes a procedure for generating starting cluster centers, which is described in [6]. The algorithm is accurate and efficient on data sets with a large number of features, and can find either global or near-global solutions to the clustering problem (3). In INCA, this problem is solved using a non-smooth optimization method, the discrete gradient method [19]. In the implementation of the INCA algorithm, we apply the following stopping criteria:
1. A tolerance test on the clustering function values, applied for $i \ge 2$, where $f_i$ is the value of the clustering function obtained at the $i$-th iteration of the incremental algorithm and $\varepsilon$ is a user-defined tolerance.
2. $n_c > C$, where $n_c$ is the number of clusters and $C$ stands for the maximum number of clusters.
The tolerance $\varepsilon$ is set using the cluster function value at the first iteration and the number of records: $\varepsilon = f_1/m$. The incremental non-smooth clustering algorithm is shown in Algorithm 1.

Algorithm 1 Incremental non-smooth clustering algorithm
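Require: Cluster count $k > 0$
Require: Finite point set $A \subset \mathbb{R}^n$
Compute the center $x^1$ of the whole set $A$ and set $L = 1$
while $L < k$ and the stopping criteria are not met do
    Set $L = L + 1$
    Find starting point $\bar{y} \in \mathbb{R}^n$ by solving the auxiliary clustering problem for the $L$-th cluster
    Find $\tilde{y} \in \mathbb{R}^{nL}$ by solving the clustering problem starting from $(x^1, \ldots, x^{L-1}, \bar{y})$
    Set the solution of the $L$-partition problem $x^j = \tilde{y}^j$, $j = 1, \ldots, L$, and find the value $f_L$ of the cluster function $f$ at this solution
end while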
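The incremental structure of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration, not the INCA implementation: plain k-means (Lloyd) iterations replace the discrete gradient method, the auxiliary problem is approximated by searching only over the data points, and only the cluster-count stopping rule is applied. All function names and the toy data are assumptions made for this sketch.

```python
def sq_dist(x, a):
    """Squared Euclidean distance between two points."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))


def cluster_function(centers, points):
    """Average squared distance from each point to its nearest center."""
    return sum(min(sq_dist(c, a) for c in centers) for a in points) / len(points)


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def refine(centers, points, max_iter=100):
    """Lloyd iterations from the given centers (stand-in for the solver in INCA)."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for a in points:
            j = min(range(len(centers)), key=lambda idx: sq_dist(centers[idx], a))
            groups[j].append(a)
        # Keep the old center when a cluster receives no points.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def starting_point(centers, points):
    """Simplified auxiliary problem: choose the data point that, added as a new
    center, gives the largest decrease of the cluster function."""
    nearest = [min(sq_dist(c, a) for c in centers) for a in points]

    def decrease(y):
        return sum(max(0.0, d - sq_dist(y, a)) for d, a in zip(nearest, points))

    return max(points, key=decrease)


def incremental_clustering(points, k):
    """Start from one cluster (the whole data set) and add one cluster per step."""
    centers = [mean(points)]
    history = [cluster_function(centers, points)]
    while len(centers) < k:
        y = starting_point(centers, points)
        centers = refine(centers + [y], points)
        history.append(cluster_function(centers, points))
    return centers, history


# Toy data: two groups of three points and one isolated point (a small cluster).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0)]
centers, history = incremental_clustering(points, k=3)
print(centers)
print(history)
```

Because each step starts from the previous solution plus one deliberately chosen new center, the isolated point ends up with a cluster of its own rather than being absorbed into a larger group. The `history` list holds the cluster function value after each step, which is the quantity a tolerance-based stopping rule would inspect.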

Experimental Methodology
The incremental clustering algorithm allows accurate solutions to the clustering problem to be computed, allows cluster distributions for different numbers of clusters to be considered at once, and allows unidentified malware samples to be clustered.

Feature Engineering
The following features were extracted from the analysis reports following the dynamic analysis in a Cuckoo sandbox [1] running on a Windows 7 virtual machine. Previous machine learning research achieved good performance, using API call frequency histograms [20,21], while other research indicated good performance associated with network and HTTP features [9], including hostnames. As a result, it was decided to use API call frequency and DNS hostname features. Two types of feature encoding were performed: histogram encoding, and TF/IDF encoding. The histogram was built, using call counts for each unique API call, and DNS lookup requests for each unique hostname. The TF/IDF encoding created vectors from the unique API call names and unique DNS hostnames.
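The two encodings described above can be sketched as follows. This is a minimal illustration under stated assumptions: each sample is represented as a list of tokens (API call names and DNS hostnames), the token values are hypothetical, and the TF/IDF variant uses binary term presence with a smoothed IDF and L2 row normalization, which may differ from the exact configuration used in the experiments.

```python
import math
from collections import Counter


def histogram_features(reports, vocabulary):
    """One count vector per sample: a column per unique API name or hostname."""
    rows = []
    for events in reports:
        counts = Counter(events)
        rows.append([counts[token] for token in vocabulary])
    return rows


def tfidf_features(reports, vocabulary):
    """TF/IDF vectors over de-duplicated tokens (assumed smoothed IDF, L2 norm)."""
    n = len(reports)
    token_sets = [set(events) for events in reports]
    doc_freq = {t: sum(t in s for s in token_sets) for t in vocabulary}
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    rows = []
    for s in token_sets:
        row = [idf[t] if t in s else 0.0 for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows


# Hypothetical tokens extracted from two sandbox reports.
reports = [
    ["NtCreateFile", "NtCreateFile", "RegOpenKeyExW"],
    ["NtCreateFile", "dns:example.com"],
]
vocabulary = sorted({token for events in reports for token in events})

print(histogram_features(reports, vocabulary))
print(tfidf_features(reports, vocabulary))
```

The histogram keeps call frequency, so a sample that calls one API many times is distinguished from one that calls it once. The TF/IDF vectors discard frequency and instead down-weight tokens that occur in most samples.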

Clustering
Clustering was performed, using INCA [6]. The evaluation of feature encoding was performed, using both histogram features and TF/IDF encoding. Feature engineering experiments were performed to identify the configuration that provided the highest accuracy.

Clustering Metrics
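The metrics applied to the clustering results presented in this research are the F1 score, V-measure, and cluster purity. The F1 score is calculated using precision and recall, which are calculated from the true positive (TP), false positive (FP), and false negative (FN) counts. Precision and recall [22] are defined in Equations (5) and (6), respectively.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

The F1 score ($F_1$) [22], defined in Equation (7), is used to assess the quality of the results in terms of precision and recall.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$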
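The V-measure is an entropy-based external cluster evaluation measure. This measure is based on two criteria: homogeneity and completeness. It determines the degree of similarity between a clustering distribution and a class distribution. Homogeneity of a set of clusters means that each cluster from this set contains data points only from a single class. Completeness of a set of clusters means that data points from a given class are elements of the same cluster. Given a set of classes $C = \{C_i : i = 1, \ldots, p\}$ and a set of clusters $K = \{A_j : j = 1, \ldots, k\}$, the V-measure is defined as follows:

$$V_\beta = \frac{(1 + \beta)\, h\, c}{\beta h + c}. \tag{8}$$

Here, homogeneity $h$ is defined as follows:

$$h = 1 - \frac{H(C \mid K)}{H(C)}, \tag{9}$$

and,

$$H(C \mid K) = -\sum_{j=1}^{k} \sum_{i=1}^{p} \frac{n_{ij}}{m} \log \frac{n_{ij}}{|A_j|}, \tag{10}$$

$$H(C) = -\sum_{i=1}^{p} \frac{|C_i|}{m} \log \frac{|C_i|}{m}. \tag{11}$$

Completeness $c$ is defined as follows:

$$c = 1 - \frac{H(K \mid C)}{H(K)}, \tag{12}$$

and,

$$H(K \mid C) = -\sum_{i=1}^{p} \sum_{j=1}^{k} \frac{n_{ij}}{m} \log \frac{n_{ij}}{|C_i|}, \tag{13}$$

$$H(K) = -\sum_{j=1}^{k} \frac{|A_j|}{m} \log \frac{|A_j|}{m}. \tag{14}$$

In the above formulas, $n_{ij}$ is the number of data points belonging to the $i$-th class and the $j$-th cluster, $m$ is the total number of data points, and $|\cdot|$ is the cardinality of a set. In this paper, we take $\beta = 1$.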
Purity shows how well the cluster distribution obtained by a clustering algorithm reflects the existing class structure of a data set [6]. Let $\bar{A} = \{A_1, \ldots, A_k\}$, $k \ge 2$, be the cluster distribution of the set $A$ and $C_1, \ldots, C_l$ be the true classes of $A$. Denote by $n_{tj}$ the number of points from the $t$-th class belonging to the $j$-th cluster. Compute the following:

$$m_j = \max_{t=1,\ldots,l} n_{tj}, \quad j = 1, \ldots, k. \tag{15}$$

The purity for the cluster distribution $\bar{A}$ [6] is given in Equation (16).

$$P(\bar{A}) = \frac{1}{m} \sum_{j=1}^{k} m_j \times 100\%. \tag{16}$$
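The metrics in this section can be computed directly from class labels and cluster assignments. The sketch below is illustrative only: the function names and the toy labels are assumptions, natural logarithms are used, and homogeneity or completeness is taken to be 1 when the corresponding entropy in the denominator is zero (an assumed convention for the degenerate case).

```python
import math
from collections import Counter


def f1_score(tp, fp, fn):
    """F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def entropy(sizes, total):
    """Entropy of a partition given the sizes of its parts."""
    return -sum(s / total * math.log(s / total) for s in sizes if s)


def conditional_entropy(joint, given_sizes, total):
    """H(X|Y) from joint counts keyed by (x, y) and the sizes of the Y parts."""
    return -sum(n / total * math.log(n / given_sizes[y])
                for (_, y), n in joint.items())


def v_measure(classes, clusters, beta=1.0):
    """V-measure of a clustering against the true classes."""
    m = len(classes)
    class_sizes, cluster_sizes = Counter(classes), Counter(clusters)
    h_c = entropy(class_sizes.values(), m)
    h_k = entropy(cluster_sizes.values(), m)
    h_c_k = conditional_entropy(Counter(zip(classes, clusters)), cluster_sizes, m)
    h_k_c = conditional_entropy(Counter(zip(clusters, classes)), class_sizes, m)
    h = 1.0 if h_c == 0 else 1.0 - h_c_k / h_c  # homogeneity
    c = 1.0 if h_k == 0 else 1.0 - h_k_c / h_k  # completeness
    denom = beta * h + c
    return 0.0 if denom == 0 else (1 + beta) * h * c / denom


def purity(classes, clusters):
    """Percentage of points that belong to the majority class of their cluster."""
    best = Counter()
    for (_, cluster), n in Counter(zip(classes, clusters)).items():
        best[cluster] = max(best[cluster], n)
    return 100.0 * sum(best.values()) / len(classes)


# Toy example: two families, first clustered perfectly, then merged into one cluster.
families = ["a", "a", "b", "b"]
print(v_measure(families, [0, 0, 1, 1]), purity(families, [0, 0, 1, 1]))
print(v_measure(families, [0, 0, 0, 0]), purity(families, [0, 0, 0, 0]))
```

Merging both families into a single cluster keeps completeness at its maximum but drives homogeneity, and therefore the V-measure, to zero, while purity falls to the share of the majority family.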

MVIIC Algorithm
Algorithm 2 illustrates the process of building a malware data set from existing samples and then submitting the malware data set for processing in a Cuckoo sandbox. When the sandbox processing is completed, a script is used to extract the raw features from the Cuckoo logfiles. The raw features are encoded as either an API histogram or as API name and hostname data that are encoded using TF/IDF. The incremental clustering program is then run to cluster the features; when the clustering is finished, the clustering metrics are extracted.

Algorithm 2 MVIIC Algorithm
Require: Malware data set
Submit malware data set to Cuckoo sandbox
while Sandbox processing samples do
    Wait
end while
Extract features from Cuckoo report
Encode features to vector format
Run incremental clustering program
while Clustering in progress do
    Wait
end while
Read clustering metrics

Malware Feed
The malware feed used in this work was provided by the abuse.ch research project [23]. A common feature of machine learning research is the need for accurate identification of the feature labels. In the case of malware clustering, the feature labels identify the malware families. The automatic identification of malware families is an open research problem. While prior research used anti-virus labels [24] or network traffic [9] to estimate the ground truth, the filenames of the samples used in this research contain the malware family name and the sample hash. The malware family name was used to provide the ground truth for this research. The malware families in this feed are a mixture of banking malware and Remote Access Trojans (RATs).

Research Environment
Feature extraction was run on a laptop with an Intel i5-5300U CPU 2.30 GHz, with 8 GB of RAM, using Cuckoo 2.0.7, Virtualbox 5.2.42, and a Windows 7 64-bit VM. Clustering was performed on a workstation with an Intel i7-3770 CPU 3.40 GHz, and 32 GB of RAM. The Cuckoo sandbox [1] was used for dynamic analysis in this research. The malware sample data sets were submitted to Cuckoo, and a script was written to extract the features from the Cuckoo reports.

Data Sets
A prior study [25] examined research that clustered malware samples selected based on their anti-virus classification. This research reported a high degree of clustering accuracy. However, reduced accuracy was observed using the clustering algorithm on a different malware data set. This study concluded that the method used for the selection of malware samples may bias sample selection toward easily classified malware samples.
Data Set 1 contains samples of 6 malware families; the details of this data set are provided in Table 1. Prolonging the sandbox execution timeout allows the malware samples to execute more API calls, and this may improve clustering performance. To investigate whether clustering performance is influenced by the sandbox timeout, three sets of feature extraction were performed with Data Set 1, using sandbox timeouts of 60, 120, and 180 s. These features are referred to as Data Set 1A, 1B, and 1C, respectively.

Experiments
The experiments listed below were performed to examine the effectiveness of the clustering algorithm. These experiments were designed to investigate feature engineering for dynamic analysis features, to investigate the potential of sandbox hardening measures, to demonstrate the effectiveness of the incremental clustering algorithm in this application, and to provide a comparison of our clustering approach against Yara rule-based malware detection.
Experiment 1: API Feature Frequency Thresholds. Our research uses an approach similar to that of [26] to remove infrequent or high-frequency API features, using minimum and maximum thresholds.
The effects of using API feature frequency thresholds are shown in Tables 2 and 3. Referring to Table 2, it can be seen that the clustering performance improves when features with fewer than 1000 nonzero API counts are excluded, giving a maximum F1 score of 0.68. Referring to Table 3, it can be seen that the clustering performance is improved by excluding features with an API count of more than 2120, giving an F1 score of 0.47.
Experiment 2: API Histogram Compression. Some malware samples make a large number (multiples of 100,000) of calls to specific APIs. This may be an attempt to hinder analysis by causing sandbox timeouts. CBM [27] improves performance, using a logarithmic method to compress the value of API histogram counts.
In this experiment, we use a simpler method to compress the API histogram. Our method truncates each API histogram value to a specified maximum value called the maximum API count. The results in Table 4 use a minimum API count of 1000 and a maximum API count of 2120, and vary the API count truncation value. The best clustering performance occurs with an API count truncation value of between 5 and 25.
Experiment 3: TF/IDF Encoding. TF/IDF was used to encode the API names and DNS addresses. Duplicates were removed from the API names and DNS addresses prior to TF/IDF encoding. The results of clustering using TF/IDF feature encoding are shown in Table 5. The columns labeled Min DF and Max DF contain the minimum and maximum document frequency values passed to the scikit-learn vectorizer, which are used to ignore terms that occur below a low-frequency threshold or above a high-frequency threshold. These results show that the performance of the histogram and TF/IDF encoding in these tests is equivalent.
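The feature thresholds of Experiment 1 and the histogram truncation of Experiment 2 can be sketched together. This is a minimal illustration with assumptions stated plainly: the thresholds are applied here to the total call count of each API across the data set, which is one possible reading of the description above, and the function name, API names, and threshold values in the toy example are hypothetical.

```python
def filter_and_truncate(histograms, min_count, max_count, truncate_at):
    """Drop infrequent and high-frequency API features, then cap the counts.

    histograms: one dict per sample, mapping API name to call count.
    Returns the kept API names and one truncated count vector per sample.
    """
    # Assumed reading: thresholds apply to the total count over all samples.
    totals = {}
    for hist in histograms:
        for api, n in hist.items():
            totals[api] = totals.get(api, 0) + n
    kept = sorted(api for api, t in totals.items() if min_count <= t <= max_count)
    # Truncation: any count above truncate_at is replaced by truncate_at.
    rows = [[min(hist.get(api, 0), truncate_at) for api in kept]
            for hist in histograms]
    return kept, rows


# Hypothetical per-sample API histograms.
histograms = [
    {"NtClose": 5, "Sleep": 300000, "connect": 1, "send": 3},
    {"NtClose": 7, "Sleep": 2, "send": 40},
]
kept, rows = filter_and_truncate(histograms, min_count=2, max_count=100, truncate_at=5)
print(kept)
print(rows)
```

In this toy example the rarely used `connect` and the extremely frequent `Sleep` are both removed, and the remaining counts are capped so that a single API called tens of times cannot dominate the squared distance used for clustering.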

Experiment 4: Sandbox Hardening. Some malware families contain anti-analysis code that terminates the malware or performs decoy actions when execution in an analysis environment is detected [28,29]. In this research, the following anti-analysis techniques were mitigated:
• Detection of VM integration software.
• Processor core count checking.
• Malware slow startup.
VirtualBox and other VM environments provide optional software to improve the integration between the host computer and the VM. While this software improves the integration of the VM, it is readily detected by malware [29]. The first VM hardening technique used in this research was the removal of the VirtualBox Guest Additions software.
Some malware, e.g., Dyre [30], counts the number of processor cores. If only one processor core is present, the malware terminates. This test provides a mechanism for detecting virtualized analysis environments. The VM used in this research was configured with two processor cores.
Some malware programs start slowly, and this may cause a sandbox timeout before malware behaviors are revealed. Setting a sandbox analysis timeout to a low value allows a higher malware sample throughput but may not allow sufficient time to capture malware behaviors. Setting a longer analysis timeout may allow identification of malware behaviors, but does so at the cost of reduced throughput. In this research, an experiment was performed to determine if clustering performance varies as a result of setting different analysis timeout values. The results of this experiment, shown in Table 6, indicate that increasing sandbox execution times results in improved clustering metrics (cluster purity, F1 score, and V-measure). This indicates that longer sandbox timeouts can be used to capture more details of malware behaviors.
In addition, the following changes were made to harden the VM [31]: disable Windows Defender, disable Windows Update, and deactivate Address Space Layout Randomization (ASLR) [32].
The run times measured in this research are shown in Table 7. The experiments in this research were conducted on a workstation using an Intel i7-3770 3.40 GHz CPU with 32 GB of RAM. Feature extraction was performed, using a Python script. The clustering program was written in Fortran and was compiled with gfortran.
These results show that the clustering operation accounts for the majority of the run time, the malware clustering times are suitable for research purposes, and that the clustering time is proportional to the feature count.

Yara Rules
Yara rules for the six malware families in Data Set 1 were obtained from Malpedia [33], and these rules are summarized in Table 8. While these Yara rules were not written specifically for this research, they were used to conduct a study that highlights the shortcomings of Yara rule-based malware detection. The best performing Yara rules were the Agent Tesla, Njrat, and Tinba rules. These rules were used in experiments that compared the performance of Yara rules and clustering.
The Yara rules from Malpedia [33] were run against process memory dumps from Data Set 1. The malware identification results for the Yara rules are shown in Table 9. These Yara rules were not optimized for this data set. However, it can be seen that the detection rate using these Yara rules varied from 18.90 percent to 94.12 percent. Yara rules are known to be sensitive to minor malware changes resulting from software development. Referring to Table 6, it can be seen that the cluster purity was approximately 90%. If the dynamic analysis features from the malware samples that were not identified by the Yara rules had been clustered, then clusters with a purity of approximately 90% would have allowed identification of the unidentified variants, providing an opportunity to update the Yara rules.
MVIIC uses a Cuckoo sandbox-based dynamic analysis platform to provide the first step in the identification of new malware variants. The runtime performance of the Cuckoo sandbox is heavily dependent on the hardware platform, sandbox configuration, and the configured Yara rules. For this reason, while the clustering performance is an important parameter of this research, the Cuckoo sandbox execution time (including Yara scanning) is not considered relevant.

Discussion and Limitations
This research is set in the context of a two-layered malware analysis system that uses Yara rules for the identification of malware families and attempts to address the problem of minor changes in the malware code causing a detection failure. In this use case, this paper proposes a hybrid approach in which Yara scanning and clustering act in concert to identify new malware variants. In this scenario, those malware samples that are not detected by Yara rules are collected for clustering, and the results of this clustering process can be used for the identification of new malware variants. While previous malware clustering research focused on the identification of the feature combinations and clustering algorithms that provide the highest performance, this research investigates the application of a novel incremental clustering method.
The results of the experiments in this paper show that this novel clustering algorithm can be used for the task of clustering malware, using features obtained by dynamic analysis. In the data set used in this research, the filename of each malware sample contains the malware family name. This family name was used to provide the clustering ground truth. Any errors in the malware labeling will reduce the reported clustering accuracy. This highlights a need for the accurate labeling of malware data sets. The experiments in this paper showed that clustering performance was improved by filtering infrequent or high-frequency features. API frequency histogram encoding and TF/IDF encoding of features gave similar results. Increasing the sandbox malware sample execution timeout improved the clustering performance. The performance of malware family detection using Yara rules was compared with clustering performance. This experiment illustrated that the clustering approach has a more consistent and higher level of performance that is resistant to small changes introduced by malware variants.
While there is a need to compare MVIIC with other related algorithms, care needs to be taken to ensure that these comparisons are valid and to evaluate MVIIC using data sets from previous research. MVIIC research has not yet progressed to the point of being evaluated against data sets from other research.

Conclusions
This paper proposes a two-layered MVIIC technique that makes use of a novel incremental clustering algorithm to support Yara-based malware family identification by the clustering of unidentified malware variants. These unidentified malware samples represent a valuable source of unknown malware families and new malware variants.
While some previous malware clustering research has focused on the identification of the feature combinations and clustering algorithms that provide the highest performance, this research investigates malware clustering using a novel incremental clustering method. Using a data set of in-the-wild malware, this research has shown that this clustering algorithm can be used to cluster malware features obtained by dynamic analysis. The clustering performance was improved by excluding infrequent and high-frequency features. A comparison of feature encoding using an API frequency histogram and TF-IDF encoding of API names was performed. These feature encoding methods were shown to provide equivalent performance. Increasing the malware sample sandbox execution time was shown to improve clustering performance.
An investigation of the detection of the malware samples by a set of existing Yara rules was performed. Clustering provided cluster purity of 90%, while Yara detection varied from 18.90% to 94.12%. This illustrates that the pattern matching approach used in Yara rules may fail, due to minor code variations, while the performance of a clustering approach is not impacted by minor code variations.
This research has shown that this novel incremental clustering algorithm is able to cluster malware dynamic analysis features.

Future Work
The work presented in this paper can be extended as follows: • Investigate techniques to improve the clustering performance of dynamic malware analysis features.

Data Availability Statement:
The authors can be contacted regarding the availability of the data sets, and requests will be processed on a case-by-case basis.