An Android Malicious Code Detection Method Based on Improved DCA Algorithm



Introduction
In the second quarter of 2016, Statista's figures on the global market share held by the leading smartphone operating systems in sales to end users, covering the first quarter of 2009 to the second quarter of 2016, show that the market share of Android devices reached 87.8% [1]. According to a 2015 Android malicious code development report from the mobile security team of the Wuhan Antiy Mobile Security Laboratory (AVL Team) [2], Android malicious code has grown explosively since 2014. In addition, the outbreak of a Short Messaging Service (SMS) interception Trojan has resulted in a significant increase in the number of privacy theft malicious applications.
In short, the current Android security situation is grim [3]: malicious code types continue to increase, and users' privacy and personal property remain under threat [4]. Due to the influence of code obfuscation and polymorphic deformation technology, most existing detection methods can hardly extract features with a strong classification effect from an Android installation package. As a result, it is difficult for current Android malicious code static detection algorithms to extract code features from the Android Package (APK) file.
For these reasons, this paper presents an Android malicious code detection method based on the Dendritic Cell Algorithm (DCA). The method is divided into two parts: training and detection. In the training part, the main task is to extract features from the APK through the Dalvik reduced instruction sequence (DRIS) and N-perm, and to mark the features with different levels of danger using danger theory. DRIS and N-perm are introduced in Sections 3.1 and 3.2, respectively. The training part yields a feature database composed of a risk feature set (RFS), a common feature set (CFS) and a safe feature set (SFS). In the feature database, each feature has a danger level vd and a security level vs. In the detection part, the main task is to classify suspicious APKs by using an improved DCA and the feature database. Feature extraction in the detection part is the same as in the training part.
In the experiment, all malware samples are obtained from the AndroMalShare and VirusShare projects. In the malware data set, 15 major malicious code families are selected as specimens, such as FakeInst, Fakesysui, Jxt, ADRD, and Geinimi. Ultimately, the experiment achieves a high detection precision rate (97.4%) in the best cases and a low false positive rate (2%) in the worst cases.
The remainder of this paper is organized as follows. Section 2 introduces related work, including an analysis of the characteristics of existing research. Section 3 briefly introduces the technology this paper relies on. Sections 4 and 5 introduce the algorithm implementation and the design of experiments, respectively, and Section 6 presents conclusions.

Related Works
With the rapid popularization of Android, Android system security has become an important research direction around the world.
Shabtai and Qing Sihan [5,6] analyzed the Android system structure and security mechanism. Enck, Ongtang, Nauman and others [7][8][9] put forward optimizations of the Android authorization mechanism. Wognsen et al. [10] conducted in-depth research and analysis on Dalvik bytecode. Sarma, Sato, Wu and Weichselbaum [11][12][13][14] each focus on the AndroidManifest file for Android malicious code static detection. Grace et al. [15] developed an automated system called RiskRanker to scalably analyze whether a particular Android application exhibits dangerous behavior. This research has laid a solid foundation for the study of Android malicious code detection.
In 2014, Li et al. [16] presented a static detection scheme for Android malicious code. Li's method extracts features using a streamlined Dalvik assembly sequence and retains the structure of the application source code. However, Li used only the original Dalvik assembly sequence characteristics without further processing, and the classification used only the text edit distance metric. Therefore, Li's method is still ineffective against polymorphic malicious code and new malware variants. At the same time, Suarez-Tangil [17] put forward a detection system based on the code structure of Android malicious code families, whose analysis and classification method are based on text mining. Suarez-Tangil uses hierarchical clustering to classify feature vectors obtained from a Dalvik reduced instruction sequence (DRIS), and uses phylogenetic trees to study the evolution of malicious code families. Compared with Li [16], the system proposed by Suarez-Tangil abstracts the malicious application code structure more completely and can identify existing malicious code families quickly, but it cannot achieve the same classification results for polymorphic malicious code. Kong Deguang, Abdurrahman et al. [18,19], in their research on malicious code on the X86 platform, use the N-perm algorithm and the N-gram algorithm, respectively, to extract the statistical features of malicious code assembly sequences. Kong's research shows that the N-perm algorithm can extract the statistical features of the code sequence. Seung-Hyun [4] studies the Android application programming interface (API) and selects 17 kinds of dangerous API calls. Deniel et al. [20] use the dendritic cell algorithm (DCA) with principal component analysis (PCA) to classify each Android application as benign or as malware. However, because it runs in a sandbox, this approach still has some shortcomings, such as poor real-time performance and low detection efficiency.
Based on the dendritic cell algorithm, this paper proposes a new Android malicious code detection method. Compared with other Android malicious code detection algorithms, this algorithm has a faster detection speed while maintaining detection accuracy, and is suitable for lightweight Android malicious code detection. In addition, the APK code feature extraction method of this algorithm can reduce the effect of code obfuscation to a certain degree.

Dalvik Reduced Instruction Set
The Dalvik Reduced Instruction Set (DRIS) is based on a grammar proposed in [21] and shown in Figure 1. DRIS processes a Dalvik assembly sequence and keeps the instructions that reflect program semantics, such as goto, if, move and invoke. It removes irrelevant instructions and interfering operands, such as jump destination addresses and register parameters. Finally, a single letter is substituted for each extracted instruction to obtain a sequence that represents the semantics of the code. The use of DRIS will be introduced in Section 4.
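As a rough illustration of the reduction described above, the sketch below keeps only the opcode family of each smali instruction and replaces it with one letter. The letter table `OPCODE_LETTERS` and the function name `reduce_instructions` are assumptions made for this example; the actual grammar is the one given in Figure 1 and may differ.

```python
# Hypothetical letter table: the real DRIS grammar is the one in Figure 1.
OPCODE_LETTERS = {
    "move": "M",
    "invoke": "V",
    "goto": "G",
    "if": "I",
    "return": "R",
}


def reduce_instructions(smali_lines):
    """Map each instruction to one letter, dropping registers and jump targets."""
    letters = []
    for line in smali_lines:
        tokens = line.strip().split()
        if not tokens:
            continue
        # "invoke-direct" -> "invoke", "move/from16" -> "move", "if-eqz" -> "if"
        family = tokens[0].split("-")[0].split("/")[0]
        letter = OPCODE_LETTERS.get(family)
        if letter is not None:  # instructions outside the table are discarded
            letters.append(letter)
    return "".join(letters)


example = [
    "move-object v0, p0",
    "invoke-direct {p0}, Ljava/lang/Object;-><init>()V",
    "if-eqz v0, :cond_0",
    "const/4 v2, 0x0",
    "goto :goto_0",
    "return-void",
]
print(reduce_instructions(example))
```

Operands are never inspected, so renaming registers or moving jump targets leaves the resulting letter sequence unchanged.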

N-Perm Algorithm
The N-gram algorithm has long been used in text classification and information retrieval [22]; its core idea is to calculate the frequency of N consecutive words. The N-perm algorithm is a variant of the N-gram algorithm [18]. The difference between the two is that N-perm ignores the order of the elements when computing the frequency of N consecutive words, which can reduce the influence of polymorphism in the experiment. As shown in Figure 2, after extracting the specified operation codes from the Dalvik bytecode, the experiment counts the frequency of each operation code (Opcode) combination and uses the two-tuple (Opcode combination, frequency) as the feature vector. Assume that the Opcode sequence is CVCPCP. Using the 1-perm algorithm, the two-tuple ({C, V, P}, {3,1,2}) is obtained. Using the 2-perm algorithm, the two-tuple ({CV, VC, CP, PC}, {1,1,2,1}) is obtained. Using the 3-perm algorithm, the two-tuple ({CVC, VCP, CPC, PCP}, {1,1,1,1}) is obtained. The same procedure applies for n = 4, 5, 6.
Dalvik instruction excerpt from Figure 2: const/4 v2, 0x0; invoke-direct {p0}, …; const-string v0, ""; iput-object v0, p0, …; const-wide/16 v0, 0x0; iput-wide v0, p0, Lcom/a/a/j;->b:J
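The counting step can be sketched in a few lines of Python. The function names `ngram_counts` and `nperm_counts` are chosen for this example only. The ordered count reproduces the frequencies listed above for the sequence CVCPCP. The order-insensitive count is one possible reading of "ignores the order": it merges windows that contain the same letters, such as CV with VC and CP with PC, so its totals can differ from an ordered listing.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Frequency of each ordered window of n consecutive opcodes."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def nperm_counts(seq, n):
    """Like ngram_counts, but windows holding the same letters in any order are merged."""
    return Counter("".join(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))


seq = "CVCPCP"
print(dict(ngram_counts(seq, 1)))
print(dict(ngram_counts(seq, 2)))
print(dict(nperm_counts(seq, 2)))
```

Sorting the letters inside each window gives every permutation of the same opcodes one canonical key, which is what makes the feature insensitive to instruction reordering.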

DCA Algorithm
Danger theory is a well-known and controversial immune theory, proposed by the immunologist Matzinger [23] in 1994, which challenged the traditional Self-NonSelf (SNS) theory. Based on danger theory, Greensmith et al. [24], in their study of natural dendritic cells (DCs), proposed the dendritic cell algorithm (DCA). Four kinds of signals can be transmitted: pathogen-associated molecular patterns (PAMP), safe, danger and inflammatory. A PAMP signal is transmitted if any bacteria exist. A safe signal is transmitted if a cell dies from aging. A danger signal is transmitted if a cell is damaged and dies. An inflammatory signal is transmitted if there is any inflammation in the body.
DCA is a population-based algorithm whose population consists of dendritic cells (DCs). A DC is a special kind of antigen-presenting cell that is sensitive to changes in its environment. The main functions of a DC are processing signals and presenting antigens. A DC has three states: immature DC (iDC), semi-mature DC (smDC), and mature DC (mDC). When the Costimulatory Molecules (CSM) signal concentration reaches the threshold, the iDC changes to smDC or mDC, depending on the concentrations of the semi-mature output signal and the mature output signal. The transitions between the three DC states are shown in Figure 3. Compared with other algorithms, DCA has a faster detection speed and does not reduce the detection accuracy. These features make DCA suitable for lightweight Android malicious code detection.

Code Obfuscation
To protect their applications, Android application developers use code obfuscation to counter reverse analysis. On the Android platform, code obfuscation is mainly performed with the ProGuard tool. Android application code obfuscation replaces the variable names, function names, class names and package names in the source code with meaningless characters [25], which prevents the analyst from reverse analyzing the code logic. Code obfuscation therefore makes static analysis of Android malicious code more difficult [26]. However, to ensure correct execution, some code segments in the Android application source code cannot be obfuscated, such as Android system components, native methods, methods that use Android serialization, enums, APIs, content referenced by resource files, places where reflection is used, and references to third-party libraries. In addition, code obfuscation does not cause large changes to the application code structure [27], so the statistical characteristics of the application code will not be greatly affected.

System Overview
Based on the Dendritic Cell Algorithm, this paper proposes an Android malicious code detection method; the detection process is shown in Figure 4. The proposed method consists of two parts, training and detection. The symbols that appear in this section are shown in Table 1. In the training part, the experiment extracts Dalvik code features from the training set APKs by DRIS and N-perm. Then, the experiment uses danger theory to mark Dalvik code features as risk features (RFs), common features (CFs) and safe features (SFs). Finally, RFs, CFs and SFs are stored in the risk feature set (RFS), common feature set (CFS) and safe feature set (SFS), respectively.
In the detection part, the experiment uses the improved DCA to classify the APK under test after feature extraction and processing.

Training Part
The training part of this experiment consists of two steps: input data processing and feature marking. The training flow is shown in Figure 5.

Input Data Processing Method
The algorithm input includes DRIS and Android static code APIs. DRIS is obtained as follows. First, for the batch of APK files, the experiment uses ApkTool to decompile each file and extract the .smali files. A .smali file is the Android application source code file disassembled from the classes.dex file. The classes.dex file is part of the Android package (APK) file and contains the code that implements the logic of the Android application. Smali source code and Dalvik code are based on similar grammar rules, so DRIS can be used to process the .smali files. Then, the content of each method is extracted from the .smali file and replaced by the reduced instruction set to obtain the reduced instruction sequence. Next, the 4-perm algorithm is applied to the reduced instruction sequence to obtain a set of two-tuples organized by method. Finally, a two-dimensional array is used to store all two-tuples.
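The method extraction step can be sketched as follows. It assumes the usual smali layout in which each method body sits between a `.method` line and an `.end method` line; the function name `split_methods` and the sample text are made up for this example and are not the scripts used in the experiment.

```python
def split_methods(smali_text):
    """Return {method header: [instruction lines]} for the text of one .smali file."""
    methods = {}
    current = None
    for raw in smali_text.splitlines():
        line = raw.strip()
        if line.startswith(".method"):
            current = line
            methods[current] = []
        elif line.startswith(".end method"):
            current = None
        elif current is not None and line and not line.startswith((".", "#", ":")):
            # Skip directives (.locals, .line), comments and jump labels.
            methods[current].append(line)
    return methods


sample = """
.class public Lcom/example/A;
.method public constructor <init>()V
    .locals 1
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V
    return-void
.end method
"""
for header, body in split_methods(sample).items():
    print(header, body)
```

Each instruction list can then be reduced to a letter sequence and passed to the 4-perm counting step, which gives one set of two-tuples per method.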
Android static code APIs are also obtained from the smali files; the list of dangerous APIs is taken from Seo's paper [4].
API feature extraction uses the two-tuple (API list, frequency) to record the statistics. Finally, the API feature two-tuple is appended to the tail of the two-dimensional array of Dalvik sequence features.
In the N-perm algorithm, selecting a suitable N is essential. The smaller the N value, the less sufficient the extracted characteristic information. The bigger the N value, the less representative the extracted features. In this paper, the N value is set to 4. The experiments with different N values are shown in Figure 6.
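A minimal sketch of the API two-tuple is given below. The two entries in `DANGEROUS_APIS` are illustrative Android framework calls chosen for this example; they are not claimed to be among the 17 APIs taken from [4]. The regular expression and the function name `api_frequencies` are likewise assumptions.

```python
import re

# Illustrative entries only; the 17 APIs actually used come from reference [4].
DANGEROUS_APIS = [
    "Landroid/telephony/SmsManager;->sendTextMessage",
    "Landroid/telephony/TelephonyManager;->getDeviceId",
]

# Captures "Lpackage/Class;->method" from a smali invoke instruction.
INVOKE_TARGET = re.compile(r"^invoke-\S+\s+\{[^}]*\},\s*(\S+?)\(")


def api_frequencies(instructions):
    """Return the two-tuple (API list, frequency list) for one application."""
    counts = {api: 0 for api in DANGEROUS_APIS}
    for line in instructions:
        match = INVOKE_TARGET.match(line.strip())
        if match and match.group(1) in counts:
            counts[match.group(1)] += 1
    return DANGEROUS_APIS, [counts[api] for api in DANGEROUS_APIS]


code = [
    "invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;",
    "invoke-virtual {v1}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;",
    "invoke-direct {p0}, Ljava/lang/Object;-><init>()V",
]
print(api_frequencies(code))
```

Because the frequency list always follows the order of the API list, its length is fixed and it can be appended directly to the Dalvik sequence features.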

Mark Characteristic
The 4-perm algorithm produces a large number of features, so the information gain value is used to measure the classification performance of each feature and to screen the features [28]. Random variable Y represents the classification result, and random variable $X_i$ is the random variable corresponding to feature i. The information gain of feature i is determined by formula (1):
$$IG(X_i) = H(Y) - H(Y \mid X_i) \qquad (1)$$
In formula (1), $H(Y)$ is the entropy of the random variable Y, and $H(Y \mid X_i)$ is the conditional entropy of Y given the random variable $X_i$. The information gain value measures the difference in information entropy before and after the occurrence of the feature [29], and is applied in multiple domains, such as intrusion detection systems [30]. The larger the information gain, the better the classification performance of the corresponding feature.
For each APK used for training, the 400 features with the largest information gain values are selected and combined with the 17 suspicious APIs to compose the characteristic set S. In DCA, the algorithm's input is the antigen and the signal. Here, the antigen is the Android malicious code sample, and the signal is the set S after information fusion.
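To make the information gain computation concrete, the following sketch evaluates it for one discrete feature column against the class labels. The function names and the toy data are assumptions for illustration; entropy is measured in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(Y), in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Information gain H(Y) - H(Y | X_i) of one discrete feature column."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]            # 1 = malicious, 0 = normal
print(information_gain([1, 1, 0, 0], labels))  # feature separates the classes
print(information_gain([1, 0, 1, 0], labels))  # feature carries no class information
```

Ranking the features by this value and keeping the highest-scoring ones is the screening step described above.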
Let U represent a set of programs with known classification. For ∀u ∈ U, $u_j$ is the feature set of method j in u, where $u_j = \langle x_1, x_2, x_3, \ldots, x_n, y \rangle$ (j = 1, 2, ..., m) and $x_i, y \in \{0, 1, 2, \ldots\}$ (i = 1, 2, ..., n). $x_i = 0$ indicates that the feature $x_i$ does not appear in $u_j$; conversely, $x_i = 1$ indicates that the feature $x_i$ appears in $u_j$. $y = 0$ indicates that u is a normal sample; otherwise, u is a malicious sample. Set M represents the set of malicious code, and set N represents the set of normal code. Set V represents an unknown malicious code set. For ∀v ∈ V, $v_j$ is the set of features in method j, where $v_j = \langle x_1, x_2, x_3, \ldots, x_n, y \rangle$ and y = 2 indicates that v is unclassified.
In the algorithm, set V is the antigen, and the features $x_1, x_2, x_3, \ldots, x_n$ are integrated into the danger signal (DS) and the security signal (SS). The specific steps are as follows.
(1) S is the set of features, which is composed of 1233 features extracted from DRIS and 17 features extracted from suspicious APIs. The feature set S is divided into three subsets: the Risk Feature Set (RFS), the Security Feature Set (SFS), and the Common Feature Set (CFS). RFS, defined by Definition 4, includes features that only appear in malicious applications. SFS, defined by Definition 5, includes features that only appear in normal applications. CFS, defined by Definition 6, includes features that appear in both malicious applications and normal applications.
(2) Denote $nm_i$ as the number of elements for which $x_i \neq 0$ and $x_i \in M$. Denote $nb_i$ as the number of elements for which $x_i \neq 0$ and $x_i \in N$. The weight of each signal component formed by the features is calculated by formula (2), in which $wd_i$ represents the risk weight and $ws_i$ represents the security weight.
(3) Finally, the danger signal $vd_j$ and the security signal $vs_j$ are calculated from the weighted feature values, where j = 1, 2, ..., m, and m is the total number of methods that belong to u. Let $z_i$ be the value of $x_i$.
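As a hedged illustration of steps (1) to (3), the sketch below partitions features by where they occur and then fuses them into two signals. The partition follows the prose definitions of RFS, SFS and CFS. The weighting (share of occurrences in each class) and the fusion (weighted sums of the feature values) are assumptions made for this example; they are not taken from formula (2) or from the signal formulas of this paper.

```python
def partition_features(nm, nb):
    """Split feature indices into RFS, SFS and CFS from occurrence counts.

    nm[i] / nb[i]: number of malicious / normal elements in which feature i appears.
    """
    rfs = [i for i in range(len(nm)) if nm[i] > 0 and nb[i] == 0]
    sfs = [i for i in range(len(nm)) if nm[i] == 0 and nb[i] > 0]
    cfs = [i for i in range(len(nm)) if nm[i] > 0 and nb[i] > 0]
    return rfs, sfs, cfs


def weights(nm, nb):
    """ASSUMED weighting: share of a feature's occurrences in each class."""
    wd, ws = [], []
    for m, b in zip(nm, nb):
        total = m + b
        wd.append(m / total if total else 0.0)
        ws.append(b / total if total else 0.0)
    return wd, ws


def signals(z, wd, ws):
    """ASSUMED fusion: weighted sums of the feature values z of one method."""
    vd = sum(w * v for w, v in zip(wd, z))
    vs = sum(w * v for w, v in zip(ws, z))
    return vd, vs


nm, nb = [5, 0, 3], [0, 4, 1]
wd, ws = weights(nm, nb)
print(partition_features(nm, nb))
print(signals([1, 0, 1], wd, ws))
```

Under these assumptions a feature in RFS contributes only to the danger signal, a feature in SFS only to the security signal, and a feature in CFS to both in proportion to where it was seen.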

The Improved Dendritic Cell Algorithm
The process of iDCA can be briefly described as follows. First, the experiment extracts DRIS and suspicious APIs from the APK file and generates the feature vector by feature fusion. Each application corresponds to one feature vector. Then, the experiment initializes the iDC with the i-th feature vector, and calculates the CSM concentration in the DC environment and the k value of each method in the i-th feature vector. If the CSM concentration is higher than the threshold value, the expected value of k, denoted $\bar{k}$, is calculated. If $\bar{k}$ is larger than zero, the iDC migrates to mDC. If $\bar{k}$ is smaller than zero, the iDC migrates to smDC. The final state of the DC determines whether the feature vector belongs to malware or not. The iDCA detection flow is shown in Figure 7. Because of its large number of uncertain parameters, the traditional Dendritic Cell Algorithm (DCA) has excessive uncertainty and lacks a formal definition [31], which can lead to ambiguity in understanding and using the algorithm. Based on [31], the experiment streamlines and optimizes the traditional dendritic cell algorithm to make it suitable for Android malware detection. This method is referred to as iDCA.
After streamlining, the algorithm's input signals are reduced to the danger signal $vd_j$ and the security signal $vs_j$. The concentration threshold of CSM is the number of methods. The life cycle of the DCs is uniformly distributed, and each DC is used alone to process a single sample. The input of a DC is the two kinds of signals and the antigens, the output of a DC is k, and the result of classification is the context. In the k calculation formula of iDCA, $k_i$ is the risk degree of the i-th method of the current sample.
The pseudocode for the iDCA is shown in Algorithm 1.
Initialize DC()
while Antigen i from 1 to n do
    while method j from 1 to m do
        initialize vd_j, vs_j, k_j
        calculate vd_j, vs_j, k_j
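The loop structure of the pseudocode, together with the state rule described in the text, can be sketched in Python as follows. Several details are assumptions made for this example and are marked as such in the comments: the risk degree is taken as the difference between the danger and security signals, and every processed method is assumed to add one unit of CSM so that the threshold (the number of methods) is reached once all methods have been seen.

```python
def idca_classify(samples):
    """Classify applications from per-method signals.

    samples: one list per application holding a (vd_j, vs_j) pair for each method.
    Returns result[], where result[i] = 1 marks the i-th sample as malware.
    """
    result = []
    for methods in samples:                  # antigen i from 1 to n
        csm, ks = 0, []                      # a fresh immature DC for each sample
        for vd, vs in methods:               # method j from 1 to m
            ks.append(vd - vs)               # ASSUMED k_j: danger minus security
            csm += 1                         # ASSUMED: one unit of CSM per method
        if methods and csm >= len(methods):  # CSM threshold = number of methods
            k_mean = sum(ks) / len(ks)
            # mean k > 0: iDC -> mDC (malware); otherwise iDC -> smDC (normal)
            result.append(1 if k_mean > 0 else 0)
        else:
            result.append(0)                 # no methods to judge; kept as normal here
    return result


print(idca_classify([[(1.75, 0.25), (0.5, 0.1)], [(0.1, 0.9)]]))
```

The two nested loops visit each method of each sample once, which matches the O(n × m) time complexity stated in the analysis of iDCA.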

Analysis of iDCA
As a simulation of the working principle of natural dendritic cells, iDCA has several advantages. First, the training workload of iDCA is very small: it only needs to classify the features in S into RFS, CFS, and SFS, and to calculate the danger degree vd and the safe degree vs of each feature separately. Second, due to its low complexity, iDCA is more efficient than DCA and some other algorithms. From Algorithm 1, it can be seen that the time complexity of iDCA is O(n × m). Third, iDCA is well suited to data feature fusion, as it can fuse a large number of features into the danger signal and the security signal. Fourth, because it is not very sensitive to system parameters within a certain range (such as changes in the weights used in feature data fusion and changes in the antigen storage space), iDCA is robust. Finally, over the whole detection process, calculating the information gain value of the features removes a large number of features with poor classification performance, which improves the overall classification efficiency of the algorithm.
At the same time, the detection algorithm still has some shortcomings. First, because Android applications differ in size, the number of methods they contain varies greatly, which affects the computational efficiency of the weight calculation. Second, the algorithm is sensitive to the amount of antigen. When the amount of antigen is insufficient, feature classification becomes less effective, which in turn degrades the classification performance of the overall algorithm.

Experimental Data Processing
The experimental environment is as follows: Intel Core i7-4710MQ CPU, 16 GB DDR3 1600 MHz memory and the Windows 10 Professional operating system. The experiment uses Python as the programming language.
The data sets are divided into normal samples and malware samples. In this experiment, 1000 samples from Google Play have been selected as normal samples, and 15 mainstream malicious code families have been selected from AndroMalShare and VirusShare, two Android malicious code sharing projects. From each family, 50 cases (750 in total) are selected as malicious samples. In the malicious sample data set, half of the samples have been processed by code obfuscation. Because 10-fold cross-validation is used, the ratio of the training set to the test set in the experiment is 9:1. The division is as follows: all samples are divided into 10 parts, and each part keeps the ratio of normal samples to malicious samples constant at 4:3, that is, 100 normal samples and 75 malicious samples. Each of the 10 parts is used in turn as the test set, and the remaining 9 parts are used as the training set. Thus, in each experiment, the training set contains 1575 cases and the test set contains 175 cases.
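The fold arithmetic above can be checked with a small sketch. The function name `stratified_folds` and the use of integer placeholders for samples are assumptions for illustration; the experiment's actual partitioning script is not shown in this paper.

```python
def stratified_folds(normal, malicious, k=10):
    """Spread both classes evenly over k folds so each fold keeps the class ratio."""
    folds = [[] for _ in range(k)]
    for samples, label in ((normal, 0), (malicious, 1)):
        for index, sample in enumerate(samples):
            folds[index % k].append((sample, label))
    return folds


folds = stratified_folds(list(range(1000)), list(range(750)))
test_set = folds[0]
train_set = [item for fold in folds[1:] for item in fold]
print(len(test_set), len(train_set))
print(sum(1 for _, label in test_set if label == 0),
      sum(1 for _, label in test_set if label == 1))
```

Rotating which fold serves as `test_set` gives the ten train and test splits of the cross-validation.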
Because the method is based on static feature detection, the experiment does not need to run the Android application. The proposed method consists of two parts: training and detection.
In the training part, the main task is to extract features from the APK and to mark the features with different levels of danger. First, each sample is unpacked with the APK unpacking tool (apktool), and the smali files are extracted from each unpacked APK. Second, for all smali source files, methods and Android static code APIs are extracted by a Python script. Third, for the extracted methods, DRIS and the 4-perm algorithm are used to complete statistical feature extraction and obtain two-tuples. According to a well-defined API call sequence table, the experiment processes the extracted API calls and generates a feature vector. Because the reference table of dangerous APIs contains 17 APIs, the size of the API feature vector is 17.
The selection of the parameter N in N-perm is important. When N is small, the N-perm algorithm produces a large number of features, and these features do not have a good classification effect. When N is large, the N-perm algorithm produces fewer features, which may only exist in the current APK file and are not universal. Figure 6 shows that the detection result is best when N is equal to 4.
Next, the algorithm calculates the information gain value of the features from the Dalvik sequence two-tuples, and, for each APK used for training, the experiment selects the 400 features with the largest information gain values as one part of the final feature vector. The other part of the final feature vector is generated from the API calls.
As shown in Figure 8, although RFS and SFS features are not dominant in number, the algorithm can reduce the impact of CFS through its weight calculation and feature fusion, and thereby enhance the classification accuracy of the detection algorithm.
In the detection part, the main task is to classify suspicious APKs by using the feature database obtained during the training part. Feature extraction in the detection part is the same as in the training part.

Results Contrast
In addition to the detection algorithm of this paper, the experiment selects support vector machines (SVM), Naive Bayes (NB), the J48 Decision Tree (DT) and the k-Nearest Neighbor (KNN) algorithm for comparison, to verify the effect of the classification algorithm. These algorithms are applied in many fields of machine learning because they are simple and efficient [32,33]. The proposed method is also compared with the methods proposed by Wang [34] and Sato [12]. Wang [34] detects malware by building different information models from Android permissions. Sato [12] extracts the permission, intent filter (action), intent filter (category) and process name from the AndroidManifest.xml file as the basis of classification, and uses the J48 decision tree to detect Android malware.
The test results use the true positive rate (TPR), false positive rate (FPR) and accuracy as evaluation indices. Accuracy is the proportion of samples that are correctly classified. The TPR, FPR and accuracy are calculated by the following formulas:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}, \qquad Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
The malware samples are regarded as positive in the classification model. TP is the number of correctly classified malware samples; FN is the number of malware samples wrongly identified as benign; FP is the number of Google Play app samples misclassified as malware; and TN is the number of correctly identified benign applications. The 10-fold cross-validation test results for 750 malicious samples are shown in Table 2. As shown in Table 2, compared with SVM, NB, DT, KNN, Wang [34], and Sato [12], iDCA has a higher detection accuracy when the number of malicious samples is 750. However, Figure 9 shows that the performance of iDCA is poor when the number of samples is small. The FPR of iDCA is mainly caused by the features in CFS. There are three reasons for the FPR of iDCA: (1) the number of features in CFS is larger than the sum of the RFS and SFS features; (2) normal code still contains many sensitive operations, which increase the weight of the risk signal in the feature vector of a normal sample; (3) analysis of the false positive sample feature vectors shows that the number of feature items with a value of 1 is smaller than the number of items with a value of 0, which means that these feature vectors lack identifying information because some samples contain only a small amount of code. These problems also cause the other four classifiers to generate false positives.
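The three evaluation indices can be computed from the true and predicted labels as in the following sketch, where label 1 denotes malware (the positive class). The function name `evaluate` and the toy labels are assumptions for illustration and do not correspond to the results in Table 2.

```python
def evaluate(y_true, y_pred):
    """Return (TPR, FPR, accuracy), treating label 1 (malware) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return tpr, fpr, accuracy


y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
print(evaluate(y_true, y_pred))
```

In a 10-fold setting the same function would be applied to each test fold and the per-fold values averaged.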
Table 2 shows that the detection result of iDCA is much better than those of Wang [34] and Sato [12], which can be attributed to the fact that some developers declare more permissions than they require. Compared with permissions, DRIS and the suspicious APIs perform better in identifying malicious applications. In addition, Table 3 shows that, due to its low time complexity, the training time and testing time of iDCA, 0.85 s and 0.55 s respectively, are much smaller than those of the other methods.
Rows of Table 3 (method, training time in s, testing time in s):
Wang [34]    1.88    0.69
Sato [12]    2.13    0.63
iDCA         0.85    0.55

Conclusions
Compared with other machine learning based detection methods, the proposed method has the following improvements. It takes DRIS and suspicious APIs as features. The N-perm algorithm and the information gain value are used in feature processing, which reduces data redundancy and improves the classification ability of the data. Danger theory is used to complete feature marking and feature fusion. Finally, iDCA is used for classification.
Through these improvements, the method achieves higher classification precision and higher time efficiency than the other machine learning based detection algorithms it was compared with. In addition, because code obfuscation mainly aims to hinder reverse analysis by substituting variable names, function names, class names and package names, the code statistical characteristics and system API calls used by this algorithm do not change greatly. The algorithm can therefore reduce the influence of code obfuscation to a certain degree.
However, the algorithm still has some shortcomings: (1) it is not suitable for small training data sets; (2) it cannot detect malicious code that uses dynamic loading, and it is also vulnerable to polymorphic deformation. Because the algorithm requires a large number of samples to learn from and the number of samples used in the experiments is small, its ability to detect unknown malware still needs to be verified.
In the future, the main focus of the research is to reduce the sensitivity to polymorphic code and to build a more effective feature library by expanding the training sets, which may improve the practicality of the algorithm. Meanwhile, to detect malware more accurately, the dynamic loading technology used by malicious code also needs to be studied.

Figure 6. Effects of N-perm parameter N on detection accuracy.

Figure 8. Comparison of feature classification results.

Figure 9. Comparison of different numbers of malicious samples.
Rows of Table 1 (symbol, meaning):
SFS    safe feature set, a set that contains safe characteristics
vd     represents the danger degree of the current characteristic or method
vs     represents the safe degree of the current characteristic or method
1. Classification results are stored in array result[], in which result[i] = 1 indicates that the i-th sample is malware. Conversely, result[i] = 0 indicates that the i-th sample is normal.

Table 2. The results contrast of different algorithms.

Table 3. The time efficiency contrast of different algorithms.