Android Malware Family Classification and Analysis: Current Status and Future Directions

Abstract: Android receives major attention from security practitioners and researchers due to the influx of malicious applications. For the past twelve years, Android malicious applications have been grouped into families. In the research community, detecting new malware families is a challenge. As we investigated, we found that most literature reviews focus on surveying malware detection. Characterizing malware families can improve the detection process and help in understanding malware patterns. For this reason, we conduct a comprehensive survey of state-of-the-art Android malware familial detection, identification, and categorization techniques. We categorize the literature along three dimensions: type of analysis, features, and methodologies and techniques. Furthermore, we report the datasets that are commonly used. Finally, we highlight the limitations that we identify in the literature, the challenges, and future research directions regarding Android malware families.


Introduction
The Android operating system has become the dominant mobile OS, capturing 86% of the market in 2017 according to Gartner [1]. Regarding Android malware, McAfee reported that the number of malware apps increased to 22 million in Q3 of 2017 [2]. Symantec also reported that one in every five Android apps is malware [3]. This puts Android application security at risk and encourages researchers to increase their efforts to defend users from malicious developers.
As we investigated the scientific databases on Android malware, we found that most current techniques focus on malware detection, that is, labeling an application with one of two labels: benign or malware. For example, the authors of [26] study the dangerous permissions in malware. However, the number of malware samples and their variants is rapidly increasing. Although malware detection is essential to anti-virus (AV) software, studying malware families and assigning a malware sample to its family is even more important. Identifying malware families helps AV companies and security researchers focus on the family (group level) rather than on individual malware (member level). For example, in order to identify a malware family, researchers analyze the common static and dynamic characteristics of a large number of malware samples. Once malware families are identified, researchers can focus on widespread and highly threatening families rather than on individual samples or less risky families. As a result, identifying the risky families can help detection systems identify more malware by recognizing its associated family and limiting its effect on users.
In our survey, we focus on reviewing the literature of the past ten years, based on what has been published in the scientific databases regarding Android malware families. Our contributions are stated below:
• We conduct a comprehensive survey of the state of the art in Android malware families, which is one of the first surveys on this topic.
• We introduce a novel taxonomy that categorizes all the related work in familial classification in terms of the types of analysis, features, and techniques that have been used. The complete taxonomy is shown in Figure 1 and Table 1.
• We highlight the limitations of the related works as well as future trends.
The rest of this paper is organized as follows: the taxonomy and related work are discussed in Section 2. The types of analysis that are applied are presented in Section 3. In Section 4, we discuss the techniques that are implemented. In Section 5, we discuss the features that are used in the literature. Furthermore, we discuss limitations and future directions in Section 6. Finally, conclusions are presented in Section 7.

Taxonomy and Related Work
In this section, we discuss the Android operating system, Android malware as well as the related work.

Android and Malware
In this section, we discuss Android in general as an operating system. We address the main components inside the app and define some technologies and fundamentals. Then, we discuss Android malware and attacks they use to harm the user.

Android Operating System
Android is a Google product that is designed for smartphones and is mostly written in Java. Android uses a Linux kernel to communicate with the hardware. The updated overall architecture of Android from [68] is shown in Figure 2. An Android app is built from four main components [69]: activities, services, broadcast receivers, and content providers. An activity is a Java class (a single screen) and an entry point that the user interacts with. For example, in a phone app, the contacts screen is an instance of an activity that shows a list of contacts. Services are background components that handle long-running jobs. An example of a service is running updates for the application. A broadcast receiver is a component that responds to system announcements or to broadcasts delivered from another app or from within the same app. An example is the notification the user receives when the battery is low. Finally, a content provider manages data stored in a database (e.g., SQLite) or in the file system and allows other apps to query such data if they have the required permissions. An example is a content provider responding when a user clicks on a person's contact in a contacts app. Another important concept is the intent. An intent is a message object that is used to perform operations such as starting an activity or a service, or delivering a broadcast message to broadcast receivers. The intent object contains information such as the component name, the action to be performed, the data type, the category, extras, and flags.
Android applications, whether system or third-party apps, communicate with the Android platform via defined Application Programming Interfaces (APIs). The Android framework provides a list of APIs that a developer can call to use the functionality of the device without direct use of the lower layers of the architecture. Such functionality includes managing user interface (UI) elements, accessing shared data storage, and passing messages between application components. As in Linux, each Android app is assigned a unique user ID (UID) and group ID (GID). Each app runs in a separate process to identify and isolate its resources from those of other apps. Using the UID, Android creates a kernel-level application sandbox to enforce security.
An Android application is packaged in a compressed archive file, similar to other known formats such as ZIP and JAR, called an Android Application Package (APK). An APK contains seven main entries: assets, lib, META-INF, res, AndroidManifest.xml, classes.dex, and resources.arsc. In this section, we limit our discussion to two main files: the manifest file (AndroidManifest.xml) and the code file (classes.dex).
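Because an APK is a ZIP-compatible archive, its top-level layout can be inspected with standard archive tooling. The following is a minimal sketch in Python, assuming only the standard `zipfile` module; the helper names `top_level_entries` and `build_demo_apk` are hypothetical, and the demo archive is an empty stand-in rather than a real app.

```python
import io
import zipfile


def top_level_entries(apk_file):
    """Return the sorted top-level names found inside an APK (ZIP) archive."""
    with zipfile.ZipFile(apk_file) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})


def build_demo_apk():
    """Build a tiny in-memory archive that mimics the APK layout (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("AndroidManifest.xml", "classes.dex", "resources.arsc",
                     "res/layout/main.xml", "META-INF/MANIFEST.MF",
                     "assets/data.bin", "lib/armeabi/libdemo.so"):
            archive.writestr(name, b"")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # For a real app, pass the path of the .apk file instead of the demo buffer.
    print(top_level_entries(build_demo_apk()))
```

Running the same function on a real APK path would list whichever of the seven entries that particular app ships.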
Android manifest. The manifest file is an XML file that declares, ahead of execution, information about the app and its components, such as the app's package name and version number, the permissions required by the application, the app entry points, and the registered intents.
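To illustrate the kind of information the manifest exposes, the sketch below reads a manifest with Python's standard XML parser. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary XML encoding, so a decoding tool is needed first); the sample manifest, the package name `com.example.demo`, and the function `summarize_manifest` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix carried by android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical, already-decoded manifest used only for demonstration.
DEMO_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo" android:versionName="1.0">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <application>
    <activity android:name=".MainActivity"/>
    <service android:name=".UpdateService"/>
  </application>
</manifest>"""

COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def summarize_manifest(xml_text):
    """Collect the package name, requested permissions, and declared components."""
    root = ET.fromstring(xml_text)
    permissions = [p.get(ANDROID_NS + "name") for p in root.iter("uses-permission")]
    components = {
        tag: [c.get(ANDROID_NS + "name") for c in root.iter(tag)]
        for tag in COMPONENT_TAGS
    }
    return {
        "package": root.get("package"),
        "version": root.get(ANDROID_NS + "versionName"),
        "permissions": permissions,
        "components": components,
    }


if __name__ == "__main__":
    print(summarize_manifest(DEMO_MANIFEST))
```

The permission list and component names returned here are typical inputs for the static features discussed later in this survey.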
Dalvik executable (DEX). The classes.dex file contains the app's compiled code in a special bytecode format called Dalvik bytecode, which is compiled from normal Java classes. In Figure 3, we show the steps of converting Java classes and generating a DEX file [70].

Android Malware
In this section, we discuss the most recognized types of malware attacks in the literature, such as repackaging, update attacks, and drive-by downloads, as listed in [71]. Furthermore, we discuss the way the malicious payload is executed. Finally, we end by discussing malware families and their characteristics.
Attack techniques. One of the most common techniques is to piggyback a malicious payload on a known app. This technique is known as repackaging, as the malicious developer disassembles an app, inserts the malicious code (payload), and repacks the app. Examples of such malware families are ADRD, AnserverBot, and BgServ [71]. A variation of the same technique is the update attack, in which the payload is delivered when the application is updated. A victim installs the modified app, which does not yet carry the payload, so that it avoids detection. When it is time for the update, the payload is installed with the new version. BaseBridge, DroidKungFuUpdate, and Plankton are examples of families adopting this technique.
Another technique is called drive-by download. In this technique, the victim installs an app that advertises another app that is either standalone or repackaged malware. Alternatively, instead of advertising, the download can be requested without notifying the user. This can happen when the user grants the app the permissions needed for downloading at the time the main application is first installed. GGTracker, Jifake, Spitmo, and ZitMo are some of the families using this type of attack.
Obfuscation techniques. Some malware encrypts strings in the code. Strings such as method names, class names, or URLs are encrypted by obfuscators to avoid detection by static analysis. The obfuscated strings are hard to reverse engineer and therefore hard to read. ProGuard [72] and DexGuard [73] are widely used Java obfuscators.
Activation techniques. These techniques are associated with Android events. The BOOT_COMPLETED event, for example, is triggered when the device finishes the booting process. Malware uses this event to be notified when the device is up and running in order to activate the malicious process. Other events are also used: SMS_RECEIVED, which is triggered when an SMS is received, is utilized by the zSone family, and ACTION_MAIN, which is triggered when an app's icon is clicked, is adopted by the DroidDream family.
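As a rough illustration of how such activation hooks can be spotted statically, the sketch below lists the components of a decoded manifest whose intent filters register for the boot or SMS events. It is a simplified example, not a detection method from the surveyed papers: the manifest is hypothetical, the function name `activation_triggers` is an assumption, and the MAIN action is deliberately left out of the watch list because virtually every launchable app declares it.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Full action strings for the boot and SMS events named in the text.
ACTIVATION_EVENTS = {
    "android.intent.action.BOOT_COMPLETED",
    "android.provider.Telephony.SMS_RECEIVED",
}

# Hypothetical decoded manifest used only for demonstration.
DEMO_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <application>
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
      </intent-filter>
    </activity>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""


def activation_triggers(xml_text):
    """Return (component name, event) pairs for watched activation events."""
    root = ET.fromstring(xml_text)
    application = root.find("application")
    components = list(application) if application is not None else []
    triggers = []
    for component in components:
        for action in component.iter("action"):
            event = action.get(ANDROID_NS + "name")
            if event in ACTIVATION_EVENTS:
                triggers.append((component.get(ANDROID_NS + "name"), event))
    return triggers


if __name__ == "__main__":
    print(activation_triggers(DEMO_MANIFEST))
```

Registering for these events is common in benign apps as well, so the output is only a hint about where a payload might be activated.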
Many papers contribute to detecting such techniques, such as [74–78]. For example, Tian et al. [78] designed a repackaging detection technique. Their technique is based on partitioning the code at two levels: class-level dependency graphs (regions) and method-level call graphs. They utilize machine learning to recognize internal behavior using three types of features: permissions, sensitive API calls, and user interaction.
Malware families. A family of malware is a group of malware samples that share common characteristics and behavior. Adopting an attack or malicious behavior by inserting a payload (or more than one payload) typically involves reusing the same package names. Through frequent use, package names (or other common characteristics) become an identity (signature) of a group of malware (family). For example, the AnserverBot family, a popular malware family, uses the package name com.sec.android.provider.drm in its code. Another example is that malware in the DroidKungFu family contains a package named com.google.ssearch [71]. Other common malware families are listed in Table 2 [79].
How anti-virus works. Malware signatures, once they have been manually analyzed or detected, are saved in an AV database to be compared against the files being scanned. When a match is found in a file, the file (or app) is considered malicious and is quarantined.
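The package-name signatures described above can be sketched as a simple lookup. This is a toy illustration under stated assumptions: the two signatures are the package names quoted in the text, while the function `match_family`, the prefix-matching rule, and the sample package lists are hypothetical and much simpler than a real AV signature database.

```python
# Package-name signatures taken from the examples in the text [71].
FAMILY_SIGNATURES = {
    "com.sec.android.provider.drm": "AnserverBot",
    "com.google.ssearch": "DroidKungFu",
}


def match_family(package_names, signatures=FAMILY_SIGNATURES):
    """Return the first family whose signature package appears in the app, else None."""
    for name in package_names:
        for signature, family in signatures.items():
            # Match the package itself or any of its sub-packages.
            if name == signature or name.startswith(signature + "."):
                return family
    return None


if __name__ == "__main__":
    # Hypothetical package lists extracted from two apps.
    print(match_family(["com.example.game", "com.google.ssearch.util"]))
    print(match_family(["com.example.notes"]))
```

A single renamed package defeats this kind of exact match, which is one reason the surveyed works combine many features rather than relying on one signature.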

Android Malware Related Work
In this section, we review the survey papers on Android malware. Most of the surveys focus on malware detection, including [80–88]. The most recent survey reviews papers on malware detection while focusing on their approaches; it discusses the advantages and disadvantages of each detection approach and method [81].
The survey in [83] proposes a taxonomy to categorize Android malware detection techniques and highlights the trends and the challenges. Two survey papers [82,87] provide an outline of the methodologies used in classifying malware based on the work surveyed. The authors in [84] focus on the state-of-the-art papers in identifying malware behaviors based on a diverse set of features; they highlight the effective features in detecting malware. Yan and Yan survey the related work in dynamic malware detection; they focus on the performance evaluation criteria for malware detection [85].
Souri and Hosseini conduct a systematic survey of the state-of-the-art papers that utilize data mining techniques in malware detection; they categorize the techniques into signature-based and behavior-based. Furthermore, they discuss the importance of data mining techniques in malware detection [86]. Riasat et al. provide a comprehensive survey of the tools and methods used in malware detection; they highlight the various types of tools used in the research field [88]. Arshad et al. categorize the anti-malware and penetration techniques proposed by state-of-the-art research to protect the Android system; they highlight their limitations and benefits [80].
The previous surveys have focused on malware detection. In this survey, our focus is on malware familial classification, detection, and analysis, which will introduce a baseline for future work in this domain.
In order to conduct our review, we followed an exploratory research approach. We looked into more than a thousand papers published in journals and conferences. To select the relevant papers, we filtered them by keywords. The following reputable scientific databases were explored: IEEE Xplore [89], ACM Digital Library [90], MDPI [91], ScienceDirect [92], Hindawi [93], Springer [94], and arXiv [95]. We also used reputable literature search engines such as Microsoft Academic [96], Semantic Scholar [97], and Google Scholar [98]. The keyword criteria for selecting a paper contain main and optional keywords. The main keywords are Android malware and malware family. The optional keywords are malware detection, familial classification, malware family identification, and malware family categorization. We have classified the related work according to the type of analysis, techniques, and features.

Analysis
In this section, we discuss the types of analysis followed by the state of the art: static, dynamic, and hybrid analysis.

Static Analysis
Static analysis is applied while the app is in a static state, that is, without executing it. It collects information about the app such as the app's name, size, permissions, code, and programming patterns. Some of this information requires reverse engineering the app from machine code to a readable format in order to analyze the code. The advantage of such analysis is that it is the fastest and cheapest, since it requires neither executing the application nor monitoring its activities. A drawback is that many malware samples launch their attack at runtime. In addition, other malware uses obfuscation or encrypted methods that cannot be read or decrypted unless the app is executed. A set of papers [28–39,42,46–48,50,52,53,55–57,59,62,63,65–67] used static analysis. Details on the static features used by these papers are discussed in Section 5, Features.

Dynamic Analysis
This type of analysis (also known as behavioral analysis) is performed during the execution of an app. It monitors the internal and external actions, connections, calls, and clicks that happen while the app is being executed. Such analysis has the advantage of detecting a wide range of sophisticated malware. Malware families that are bound to an event, as mentioned earlier, can only be detected while the app is running. The disadvantage of such analysis is that it is time-consuming. In addition, it requires a priori knowledge of the malware technique to monitor. Several papers have applied dynamic analysis, such as [43,44,49,54,58,61]. Details on the dynamic features used by these papers are discussed in Section 5, Features.

Hybrid Analysis
Hybrid analysis is a combination of both static and dynamic analysis. Although hybrid analysis has the advantage of covering both types of analysis, it has a major drawback: it is a time-consuming process considering the huge number of malware samples to be detected and analyzed. Papers such as [40,41,45,51,60] have used hybrid analysis, and the details on the features used are discussed in Section 5, Features.

Techniques
In this section, we discuss the techniques used by the state of the art to address the familial malware problem. There are two main techniques used: model-based and analysis-based.

Model-Based
In a model-based technique, a model is created to classify malware into families. There are four main categories of techniques used: machine learning, evading detection, similarity analysis, and image representation.
Machine learning. The literature uses machine learning to classify malware samples into families.
In [31], the authors classify the malware using Deep Learning (DL) techniques. In [44], the authors classify malware into families using classical machine learning algorithms such as Support Vector Machine (SVM) and DL algorithms such as CNN and RNN. In [66], the authors use a Nearest Neighbor (NN) classifier to classify malware into families. In [35], the authors preprocess the data and extract the sensitive opcode sequence. For the minor families, they use oversampling to compensate for the small number of samples. To represent the semantic features of the sensitive opcode sequence, they use text mining (i.e., the Doc2Vec algorithm [99]). Finally, they train their model using nine machine learning algorithms such as SVM and Random Forest. In [49], the authors feed the fingerprint to an SVM algorithm to classify malware into families. In [63], the authors construct the feature vector and feed it to several machine learning algorithms such as Random Forest. In [38], the authors use SVM to classify the samples into families. In [67], the authors feed the features to several machine learning classifiers such as Decision Tree and association rules. In [47], the authors build a framework that trains the classifier with a set of samples to drive a heuristic search using a genetic algorithm. In [42,56], the authors use frequency graphs (FreGraph) as their features to be fed into several machine learning algorithms such as SVM, Decision Tree, and Random Forest to classify the malware into families. In [37], the authors feed the Android-oriented metrics to several machine learning algorithms such as SVM, KNN, and Decision Tree. In [46], the authors apply machine learning algorithms to extract complex features and use them to classify malware into families. In [39], the authors use three types of machine learning techniques to classify malware into families: standard classifiers such as SVM, ensemble classifiers, and neural networks. In [48], Alswaina et al. use two models to perform familial classification. The authors use the binary representation of the features and their weighted importance. Then, they use six machine learning algorithms to predict malware families. In [45], the authors apply three filters to select the features. The dynamic and static features are combined and fed to machine learning algorithms such as Random Forest and KNN for classification. In [29], the authors apply linear SVM, Decision Tree (DT), and DL algorithms. Feng et al. [60] utilize the SVM algorithm.
In [51], the authors use supervised algorithms such as Random Forest. Moreover, the authors use unsupervised learning such as K-means and mean-shift because of the unbalanced number of samples in each family. They also propose ensemble clustering and classification techniques, which integrate the results generated by the supervised and unsupervised algorithms. In [41], the authors optimize the weights of the features using community detection algorithms. They further classify the malware into families using machine learning. In [30], the authors use the fingerprints to classify malware into families using online passive-aggressive (PA) classifiers. Further details of PA can be found in [100]. In [36], the authors extract features from the apps and create code metrics. Then, they first perform a binary (coarse-grained) classification of the samples. The malware is then further classified into families (fine-grained).
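To make the familial classification setting concrete, the following is a minimal nearest-neighbor sketch over binary features, in the spirit of the NN classifier mentioned above. It is not the method of any surveyed paper: the Jaccard distance, the function names, the family labels `family_a` and `family_b`, and the toy permission sets are all assumptions chosen for illustration.

```python
def jaccard_distance(a, b):
    """Distance between two feature sets: 0.0 if identical, 1.0 if disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_family(sample, training):
    """Assign the family label of the closest labeled training sample (1-NN)."""
    family, _ = min(
        ((label, jaccard_distance(sample, features)) for features, label in training),
        key=lambda pair: pair[1],
    )
    return family


if __name__ == "__main__":
    # Toy training set: each app is the set of permissions it requests.
    training = [
        ({"SEND_SMS", "READ_CONTACTS"}, "family_a"),
        ({"INTERNET", "RECEIVE_BOOT_COMPLETED"}, "family_b"),
    ]
    unknown = {"SEND_SMS", "INTERNET", "READ_CONTACTS"}
    print(nearest_family(unknown, training))
```

Representing each app as a set is equivalent to the binary feature vectors used in several of the works above; the distance and classifier can be swapped for any of the algorithms they report.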
Evading detection. In this technique, the goal is to evade detection or to mislead classifiers into misclassification. In [47], the authors build a framework that alters the malware to perform an attack and cause the classifier to misclassify it.
Similarity analysis. The literature computes the distance between a malware sample and a family.
In [61], the authors use the token-subsequence algorithm to extract and generate signatures for each family based on network traffic analysis. In [57], the authors represent opcodes as binary and frequency vectors to compute the similarity between the malware and the families. In [50], the authors evaluate their approach by performing similarity analysis. In [65], the authors perform two tests. The first test is used for binary classification of malware. In the second test, they apply the agglomerative clustering algorithm to cluster the apps into families. To evaluate their model, they compute the distance between the malware and the clusters' centroids to determine which family the sample belongs to. In [62], the authors cluster the families based on the most frequent key terms used by each family. Then, they use the dictionary search method for classification. In [29], the authors use TF-IDF to represent the frequency of the features.
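The idea of comparing a sample against family centroids can be sketched as follows. This is a simplified example under stated assumptions: opcode frequency vectors and cosine similarity are used for illustration, the function names and family labels are hypothetical, and the short opcode sequences are toy data rather than traces from real malware.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse frequency vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(opcode_sequences):
    """Average opcode frequency vector of the samples in one family."""
    total = Counter()
    for sequence in opcode_sequences:
        total.update(sequence)
    return {opcode: count / len(opcode_sequences) for opcode, count in total.items()}


def closest_family(sample_opcodes, families):
    """Return (family, similarity) for the centroid most similar to the sample."""
    sample_vector = Counter(sample_opcodes)
    scores = {
        name: cosine_similarity(sample_vector, centroid(sequences))
        for name, sequences in families.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    families = {
        "family_a": [["const-string", "invoke-virtual", "invoke-virtual"],
                     ["invoke-virtual", "move-result"]],
        "family_b": [["if-eqz", "goto", "if-eqz"]],
    }
    print(closest_family(["invoke-virtual", "const-string"], families))
```

A threshold on the returned similarity could be used to reject samples that are far from every known family, which is one way to flag candidates for a new family.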
Image representation. Some of the literature classifies malware into families based on an image representation. In [28], the authors convert the DEX file into an image and plain text. Then, they extract the color and texture features from the image. The three features (color, texture, and text) are fed into a feature fusion algorithm to classify malware into families.
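One common way to obtain an image from a binary file is to treat each byte as a grayscale pixel and wrap the byte stream into fixed-width rows. The sketch below shows only this generic conversion; the exact procedure used in [28] is not described here, and the function name `bytes_to_gray_rows`, the row width, and the zero padding are assumptions.

```python
def bytes_to_gray_rows(data, width=16):
    """Wrap a byte string into rows of grayscale pixel values (0-255).

    The last row is padded with zeros so that every row has the same width.
    """
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Toy input standing in for the raw bytes of a classes.dex file.
    demo = bytes(range(6))
    for row in bytes_to_gray_rows(demo, width=4):
        print(row)
```

The resulting matrix can be saved as an image or passed directly to texture or color descriptors.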

Analysis-Based
In the analysis-based technique, an analysis is carried out to construct features and observe the families' characteristics. There are three sub-techniques under this approach: signature-based, statistical analysis, and visualization analysis.
Signature-based. These works construct a signature for each family in order to identify it. In [61], the authors use a multi-step clustering approach: first, they apply coarse-grained clustering, and then they apply fine-grained clustering. In [30], the authors construct the fingerprint of the malware families using n-gram analysis and feature hashing. In [49], the authors generate a fingerprint for each family. In [62], the authors construct a signature of each malware family based on the collected features. Feng et al. [60] propose an approximate signature matching algorithm to generate signatures for malware families.
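The combination of n-gram analysis and feature hashing can be illustrated with a short sketch. It assumes byte-level n-grams and uses CRC32 from the standard library as the hash function; these choices, the function name `hashed_ngram_fingerprint`, and the parameter values are illustrative assumptions rather than the configuration used in [30].

```python
import zlib


def hashed_ngram_fingerprint(data, n=4, buckets=64):
    """Count byte n-grams into a fixed-size vector using the hashing trick.

    Each n-gram is hashed with CRC32 and counted in bucket (hash mod buckets),
    so the fingerprint length does not depend on the number of distinct n-grams.
    """
    vector = [0] * buckets
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        vector[zlib.crc32(gram) % buckets] += 1
    return vector


if __name__ == "__main__":
    # Toy input standing in for the bytes of a DEX file.
    fingerprint = hashed_ngram_fingerprint(b"abcdefgh", n=4, buckets=8)
    print(fingerprint)
```

Because the vector has a fixed length, fingerprints of different apps can be compared or fed to a classifier directly, at the cost of occasional hash collisions.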
Statistical analysis. These works apply statistical tools to analyze and identify a family's characteristics and its important features. In [66], the authors use statistical analysis and text mining to extract the features. In [44], the authors use a Markov chain to represent the features. In [38], the authors eliminate unimportant features using a frequency-based approach. In [67], the authors compute the bytecode frequency. In [39], the authors apply a feature ranking algorithm to identify the most important features.
Visualization analysis. These works visualize the characteristics of families using graph mining and PCA. In [31], the authors extract the DFG and CFG. Then, they encode the graphs into a matrix. In [41], the authors represent the features using a network graph. In [50], the authors collect the sensitive API calls and construct graphs based on them. Then, they characterize malware families based on subgraph isomorphism. In [63], the authors construct short and long API dependency paths to perform context and constant analysis. In [65], the authors disassemble the app into Smali files. Then, they create a class dependency graph (CDG) to group the classes into modules and identify which module contains malicious code. In [42,56], the authors use community detection, subgraph matching, and subgraph clustering to generate the FreGraph. Feng et al. [60] utilize an inter-component call graph (ICCG) to represent the communication in the app and construct the features.
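As an illustration of the Markov chain representation mentioned under statistical analysis, the sketch below estimates first-order transition probabilities from a sequence of observed operations. It is a generic construction, not the exact model of [44]; the function name `transition_probabilities` and the toy operation names are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate first-order Markov transition probabilities from a sequence.

    Returns a nested dict: probabilities[state][next_state] = P(next_state | state).
    """
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: count / sum(row.values()) for nxt, count in row.items()}
        for state, row in counts.items()
    }


if __name__ == "__main__":
    # Toy trace of operations observed while an app runs.
    trace = ["open", "read", "read", "send", "open", "read"]
    for state, row in transition_probabilities(trace).items():
        print(state, row)
```

Flattening the resulting matrix gives a fixed-length feature vector once the set of possible operations is fixed.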

Features
In this section, we discuss the types of features used in the literature to classify malware into families. They are classified into static and dynamic features.

Static Features
Static features are any features that can be recognized or utilized without the execution of the application. Some examples of static features are package name, application size, permissions, and list of APIs.
A set of papers [33,34,48,52,55] uses features that are related to malware installation such as repackaging and update, payload activation such as booting and receiving calls, and privilege escalation attacks such as the asroot and exploid families [71]. Moreover, the authors of [33,34,52] include other features related to financial charges such as SMS and phone calls. Vega et al. [33,34] also include features related to stealing personal information such as the phone number.
Permissions used in the app are included as features in [29,38,39,48,59]. Moreover, in [35], the sensitive opcode sequence, actions, and strings are utilized as features. Garcia et al. [46,64] added native code-based features to their feature set.
Fasano et al. [36] and Blanc et al. [37] use a set of metrics generated from Smali files that measure the code quality of the app as features. In contrast, in [35,53,57,62,67], code-based features such as Java bytecode, bytecode frequency, opcodes, or opcode sequences are used.
Other papers such as [31,66] use the data-flow graph (DFG) and control-flow graph (CFG) as features. In [28], the authors extract the texture, color, and text features from the DEX file. Zhang et al. [30] use n-gram and hash code features extracted from the DEX file.
Finally, some works have applied a set of static features in addition to dynamic features. In [51], the authors use 190 static features such as permissions. In [45], the authors use static features such as the number of services and receivers. In [41], static features such as permissions, file names, and activity names are utilized. In [40], the authors use a set of static features from the Android manifest and the APK file, extracted with Androguard [101], a Python tool for reverse engineering Android files.

Dynamic Features
Features that require execution of the application are considered dynamic. Examples are network traffic, sent and received SMS messages, resource consumption, system logs, and I/O operations.
In [58], the author traced the system calls during the execution of the application. Aresu et al. [61] utilize network traffic (HTTP) in their classification. Martin et al. [44] depend on the features generated by DroidBox [102], an Android sandbox for dynamic analysis, which are represented as operations as a function of time. In [54], the authors record the API calls that are performed during application execution. In [49], resource consumption is utilized as a feature for classification. In [43], the authors use sensitive and permission-related API calls.
Finally, a group of works has applied a set of dynamic features in addition to static features. In [51], the authors use around 2048 dynamic features from logs, such as file I/O, network usage, and cryptographic usage. In [45], the authors use dynamic features generated using DroidBox [102], such as the number of open/closed connections and the number of sent/received network packets. In [41], dynamic features such as the API call sequence are utilized. In [40], a set of dynamic features is collected using DroidBox [102] and CuckooDroid [103]. Feng et al. [60] use suspicious API call behaviors such as the sendSMS API and data leakage.

Discussion
In this section, we highlight the datasets that have been used, the limitations of the literature, and the general challenges related to malware families, and we also report future directions.

Experimental Datasets
Many datasets used in the literature contain a collection of Android malware grouped into families, such as the Android Malware Genome Project (Malgenome) [71], Drebin [104], the AMD [105] Project, and AndroZoo [106]. Some papers collected the malware samples from an Android market such as Anzhi, or from a repository such as VirusTotal [107] and VirusShare [108].
The datasets differ in the number of samples and the number of families. For example, AMD [105] contains 4354 malware samples grouped in 42 families, while Drebin [104] has 5560 samples grouped in 179 families. Other datasets such as AndroZoo [106] contain many more samples and families, with 10.7 million samples grouped into more than 3000 families.
In Figure 4, we show the number of publications that use each dataset found in the literature. Furthermore, Table 3 shows detailed information on which publications are included. The repository category includes sites like VirusTotal, VirusShare, and Koodous, for which there is no fixed set to be used as a benchmark. The collection category refers to either an unknown collection performed by the author or sites such as HelDroid, FalDroid, and the Anzhi app market. As we see from the figure, the most used datasets are Drebin [104] and Genome [71]. More details on the commonly used datasets are reported in Table 4.

Limitations
Having surveyed forty research papers, we summarize the limitations as follows. First, most of the literature uses small datasets, with only a few families or a few malware samples for studying families. Moreover, they use outdated or discontinued datasets such as Contagiodump [109] and the Malgenome Project [71]. In addition, several papers build their experiments on manually collected data without testing their model on benchmark data. Finally, several papers do not disclose the list of features applied, which is needed to reproduce the work.

Challenges
Family naming. One of the challenges that we observe is that there is no naming scheme (convention) for malware families. The name of a family varies depending on the AV company. BaseBridge (or adSMS), Smssend (or fakeplayer), and DroidDream (or DORDRAE) are some of the families that have multiple names. One of the reasons is that each company names a family based on different shared characteristics. Characteristics such as installation methods, activation, or the name of the malicious file are discussed in [71,110,111]. Attempts have been made by [106,112,113] to establish naming standards. Sebastian et al. [114] address the issue of inconsistent labeling (naming) of malware families and contribute AVclass, an automatic labeling tool, as an effort to unify labeling. In addition, Euphony is a system proposed by [106] to unify the labels of different AV companies.
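The effect of inconsistent naming can be reduced by mapping known aliases to one canonical name and then taking a plurality vote over the labels reported by different engines. The sketch below illustrates this idea only; it is not the AVclass or Euphony algorithm. The alias pairs come from the examples above, while the choice of which name is canonical, the engine names, and the function `unify_label` are assumptions.

```python
from collections import Counter

# Alias pairs from the text; the canonical direction is an arbitrary choice.
ALIASES = {
    "adsms": "basebridge",
    "fakeplayer": "smssend",
    "dordrae": "droiddream",
}


def unify_label(av_labels):
    """Return the most common canonical family name among per-engine labels.

    av_labels maps an engine name to the family label it reported (or None).
    """
    normalized = [
        ALIASES.get(label.lower(), label.lower())
        for label in av_labels.values()
        if label
    ]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical labels reported by three engines for the same sample.
    report = {"engine_a": "BaseBridge", "engine_b": "adSMS", "engine_c": "DroidDream"}
    print(unify_label(report))
```

Real AV labels also carry platform and type tokens, so a practical tool needs tokenization and generic-token filtering before any vote is taken.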
Imbalanced dataset. Some malware families contain hundreds of samples, while others contain as few as one sample, e.g., the DroidKungFuUpdate family in the Malgenome dataset [71]. The whole list is shown in Table 5. This makes identifying the characteristics of a family challenging. In the case of standalone malware (not repackaged), the identification is almost impossible.
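One simple mitigation, in line with the oversampling mentioned in Section 4, is to duplicate samples of minor families at random until every family matches the size of the largest one. The sketch below shows this basic random oversampling; the surveyed works do not specify this exact variant, and the function name `oversample`, the fixed seed, and the toy data are assumptions.

```python
import random
from collections import Counter


def oversample(samples, labels, seed=0):
    """Randomly duplicate samples of minor families up to the largest family size."""
    rng = random.Random(seed)
    by_family = {}
    for sample, label in zip(samples, labels):
        by_family.setdefault(label, []).append(sample)

    target = max(len(members) for members in by_family.values())
    out_samples, out_labels = [], []
    for family, members in by_family.items():
        # Draw extra copies with replacement from the family's own samples.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for sample in members + extra:
            out_samples.append(sample)
            out_labels.append(family)
    return out_samples, out_labels


if __name__ == "__main__":
    # Toy dataset: three samples in one family, a single sample in another.
    samples = ["a1", "a2", "a3", "b1"]
    labels = ["family_a", "family_a", "family_a", "family_b"]
    balanced_samples, balanced_labels = oversample(samples, labels)
    print(Counter(balanced_labels))
```

For a family with a single sample, the duplicates add no new information, which is why such families remain hard to characterize even after balancing.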

Future Directions
Advanced machine learning. Malware families should be deeply analyzed and identified. Deep learning has been adopted to address various research problems including voice recognition, image processing, and text analysis. One advanced technique related to deep learning is reinforcement learning, which can be utilized to better understand the families' characteristics. Reinforcement learning has shown very promising results, especially in dynamic analysis. Another technique that should be adopted is transfer learning, which can be utilized to address the lack of samples in families.
Big data handling. Since the amount of malware is increasing exponentially, with GDATA reporting that almost 9K new malware programs appear daily [115], a scalable solution should be considered. For example, the AndroZoo [116] dataset has millions of samples that can be handled using big data technologies. Two of the most important tools are Hadoop [117] and Spark [118], which can process huge amounts of malware data quickly.
Crowdsourcing. Besides big data technologies, a group of malware family analyzers can be utilized to better identify and characterize the families. For example, one source can use a subset of features, while other sources investigate other feature sets. Malware repositories such as VirusTotal [107] and VirusShare [108] are some examples.
Automated detection. The huge number of generated malware samples calls for automated analysis and classification of malware families rather than performing such tasks manually [41,119].

Conclusions
Malware family detection and analysis have been a problem for many years. With the escalation in the amount of malware, especially on Android devices, researchers have studied malware deeply using various tools, such as machine learning, graph mining, image processing, and statistical analysis. Most of the literature is focused on detecting malware rather than detecting families. Detecting malware families can help us to better understand the characteristics of the malware family.
In this paper, we surveyed a total of forty research papers on Android malware familial detection, classification, and categorization from various scientific databases. We classified the literature according to their type of analysis, type of features, and the techniques applied. We further report the datasets that have been used and include details about each of them. Moreover, we discussed the limitations of the literature approaches, challenges faced by the researchers, and future trends for the research community.
Our findings show that most of the limitations revolve around the availability and the size of benchmark datasets. In addition, some common challenges are the lack of samples and the lack of standardization in family naming. As for future directions, investment in advanced artificial intelligence techniques such as machine learning, as well as in big data technologies, should be considered. Moreover, crowdsourcing and automated detection should be utilized to better address malware family identification problems.
Author Contributions: The work has been primarily conducted by F.A. under the supervision of K.E. Extensive discussions about the algorithms and techniques presented in this paper took place among the authors over the past year. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Acknowledgments:
The authors acknowledge the University of Bridgeport for providing the necessary resources to carry out this research, which was conducted under the supervision of Khaled Elleithy.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: