A Hierarchical Approach for Android Malware Detection Using Authorization-Sensitive Features

Android’s openness has made it a favorite for consumers and developers alike, driving strong app consumption growth. Meanwhile, its popularity also attracts attackers’ attention. Android malware is continually raising issues for the user’s privacy and security. Hence, it is of great practical value to develop a scientific and versatile system for Android malware detection. This paper presents a hierarchical approach to design a malware detection system for Android. It extracts four authorization-sensitive features: basic blocks, permissions, Application Programming Interfaces (APIs), and key functions, and detects malware layer by layer based on the similar module and the proposed deep learning model Convolutional Neural Network and eXtreme Gradient Boosting (CNNXGB). This detection approach focuses not only on classification but also on the details of the similarities between malware samples. We serialize the key function in light of the sequence of API calls and pick up a similar module that captures the global semantics of malware. We propose a new method to convert the basic block into a multichannel picture and use Convolutional Neural Network (CNN) to learn features. We extract permissions and API calls based on their called frequency and train the classification model by XGBoost. A dynamic similar module feature library is created based on the extracted features to assess the sample’s behavior. The model is trained by utilizing 11,327 Android samples collected from Github, Google Play, Fdroid, and VirusShare. Promising experimental results demonstrate a higher accuracy of the proposed approach and its potential to detect Android malware attacks and reduce Android users’ security risks.


Introduction
With the popularity of mobile Internet, smartphones have been integrated into everyone's life. According to the China Internet Information Center statistics, the proportion of mobile Internet users among China's total Internet users increased year by year from 2016 to 2019 [1]. By June 2019, the number of mobile Internet users in China had reached 847 million, accounting for 99.1% of all Internet users in China. This shows that smartphones have become the primary way for users to access the Internet. Smartphones store more and more private personal information; consequently, more and more attackers develop mobile malware to attack smartphones, bringing substantial security risks to mobile users.
By February 2020, the iOS operating system's global market share had exceeded 20%, while that of Android had surpassed 74%. The two mobile operating systems occupy almost the entire mobile market [2]. Due to the closeness of the iOS platform and the strict review harmful programs. Therefore, we focused on the similarities between malware samples and proposed a hierarchical approach that combines machine learning technology with deep learning to deal with the unpredictable variety of malware. The hierarchical approach extracts authorization-sensitive features that can be effective in distinguishing between malicious and benign applications. According to the different extracted features, we adopt the hierarchical classification method for Android malware detection. The significant contributions of this paper include the following aspects:
1. Instead of extracting and analyzing all Android static and dynamic features separately, we hierarchically extracted four authorization-sensitive features: basic blocks, permissions, API calls, and key functions.
2. We extract basic block features based on the proposed multichannel transforming method. The Mapping Table and Finding Adjacent Free Pixels methods are put forward to deal with pixel conflicts. In addition to macro features, we extract permissions and API calls to build a feature library. We also pay close attention to the key functions called by the application. A key function call graph is generated to study the key function call relationships.
3. The novelty of our proposed hierarchical malware detection approach is as follows: firstly, for the system functions, we use traditional techniques to hash the key functions and calculate the similarity of similar modules for testing; secondly, for the permissions and API calls, eXtreme Gradient Boosting (XGBoost) is used for classification; thirdly, for the given basic block features, a CNN classifier is used for detection; finally, the CNNXGB model, which integrates the XGBoost and CNN models, is built to improve the classification accuracy.
4. Apart from the novelty, another contribution is the collection of Android samples (67,577) between 2014 and 2020 to initialize a similar module feature library for our experiments. Secondly, we adopt 11,327 Android samples to train the deep learning model. Then we conduct an extensive evaluation on our dataset to compare the detection results with widely used detection methods.
The rest of this paper is organized as follows. Section 2 reviews the related work concerning this paper. Section 3 presents the proposed method, including feature extraction and malware detection methods. Section 4 describes the experimental setup, results, and evaluation. Finally, we conclude the paper and outline the main directions for future research in Section 5.

Related Work
This section reviews the literature on malware detection methods for Android applications.

Malware Detection Methods
Scholars at home and abroad conducted various detection schemes in the face of the increasingly severe Android malware trend. The detection methods of mobile malware mainly include the signature, dynamic analysis, static analysis, and deep learning. The malware detection methods based on signature focus on signature codes [22][23][24], such as semantics [25], threat behavior sequence [26], similarity [27][28][29][30][31], etc. Many manufacturers widely use these methods, which have a great advantage in detection efficiency, but they depend entirely on the signature database's size. In addition, mobile devices' storage and computing capacity are limited, which further limits the application of the detection method based on signature in mobile devices.
Dynamic analysis methods [22,23] monitor a program's network behaviors, process calls, and interprocess communication to analyze whether the program has harmful behaviors. These methods can effectively detect malicious programs with encrypted code. However, the Android system's fragmentation is severe, and each mobile phone manufacturer has added a customized part to the Android system. Static analysis methods extract features that represent the program's behavior without executing the program and then detect malware according to these data. Common static features include API calls, bytecode, permission data, Dalvik, etc. [32,33]. Nevertheless, static analysis methods cannot detect malicious programs that download and execute malicious code from within an otherwise regular program.
Recently, machine learning has shown state-of-the-art performance for malware detection. This approach is based on learning the characteristics of the malware. This detection process can be generally split into two steps: feature extraction and classification. In the first step, kinds of features are extracted from samples including malware and benign, to represent the program, and then a classifier is trained to automatically recognize the malware. Li et al. [34] used the API calls and permissions in danger level as features and then used Deep Belief Network (DBN) model to train. The training accuracy on the data set Drebin was 90%. Luo et al., directly transformed APK (Android application package) files into images and then extracted image textures with the DBN model as a part of the features, API calls, permissions, and activities as another part of the features. The training accuracy on the Drebin dataset was 95.6% [35]. The machine learning method is dependent on data sets and extracted features.

Supportive Features for Malware Detection
There are several features for detecting malicious applications on Android. Generally, they mainly revolve around permissions requested, API calls, and system calls extracted with static analysis or dynamic analysis techniques. There are other features for malware detection, such as native layer code, the whole application, Dalvik, etc.
Permission is a security mechanism proposed by Google for component access between applications and the restriction of some security-sensitive items within applications. Android is a permission-separated operating system whose permissions are easy to extract [36], so permission features have become the most widely used Android malware detection features. However, there are some problems: (1) the Android system has a large number of permissions, and using all of them consumes substantial computing resources; (2) abuse of permissions may cause a high false positive rate; and (3) some programs may bypass permission checking using special techniques, which invalidates the permission-based method.
API is a call interface left by the operating system to the application, making the operating system execute the application commands (actions). API called by an application program is the embodiment of its behavior. Therefore, some researchers [15] propose to detect malware by finding features with API calling in the system, but (1) the number of APIs is relatively large, and if all of them are used, it is easy to cause excessive resource consumption, (2) Android applications tend to integrate third-party libraries, which also call many APIs, and (3) no consideration is given to the difference in the frequency of using API by malicious and regular programs.
The function interfaces provided to applications by the Java framework layer are called Android system functions. System functions provide useful functionality to applications, such as window, network, string, and other related operations. Therefore, analyzing the system functions can yield accurate information about applications' behaviors. Li and Qiao [37] proposed a method based on simhash to detect function reuse in high-volume code. Similar code blocks are extracted, and whether applications are similar is determined from the calling relationships between function codes. Ruttenberg et al. [38] proposed a method for identifying shared components to find functional relationships in malware code. These methods focus on code reuse, and the complexity of determining code similarity is high, which makes them inefficient and unable to keep up with the rapid growth of malware.
The detection methods based on permissions, APIs, and system functions usually focus on local features of the program. Some researchers instead transform malicious programs into images and combine them with deep learning to detect malware. Qiao and Jiang [39] proposed a multichannel visualization method for malware detection with deep learning on Windows. Three 256 × 256 matrices were extracted from the original Windows malicious program, like the three channels of an RGB image, and combined to generate an RGB image. LeNet5 was trained on the images to obtain the detection model. Nataraj [40] and Xue [41] proposed converting the whole application into an image and then feeding the image as a feature to a CNN. A CNN requires input images of the same size, so converting applications of different sizes into same-size images is a difficult problem. Nataraj [40] addressed that problem by outputting programs of different sizes into images of various sizes for training, which is difficult to apply to a CNN. Xue [41] used simhash to map applications to same-size images. Still, the simple summation used could not effectively solve the problem of pixel conflicts at the same coordinate, and it loses some original information. Luo [35] converted the whole program as a binary stream into an image without removing the non-code files, such as pictures, audio, and videos, which causes considerable irrelevant noise in the generated picture. We found few related studies on hierarchical approaches to Android malware detection; one example, [42], proposed a two-level hierarchical denoise network utilizing LSTM, which detects malware by decompiling the Android files. However, this hierarchical approach is not flexible, because its two-level structure can encounter accuracy issues with different features. Our proposed hierarchical approach has multiple levels, which accommodate the various features used to detect Android malware. The literature discussed above encouraged us to propose a novel method for Android malware detection.

Proposed Method
This section presents the overall workflow of our approach. Figure 1 illustrates the system architecture of the hierarchical approach for Android malware detection using authorization-sensitive features. It consists of five main steps: Data Collection, Decompilation, Feature Extraction, Classification Algorithms, and Malware Detection Model. The proposed method is outlined as follows:
1. Data Collection: We collected 67,577 Android samples (.apk) between 2014 and 2020, containing both benign and malicious applications, to initialize a similar module feature dataset.
2. Decompilation: To analyze an Android application, we convert the unreadable program code into a readable file: we unzip the Android application, obtain its .dex file, and decompile the .dex file into smali files.
3. Feature Extraction: First, we extract the binary code stream features, the basic blocks, using the RGBA (multichannel picture) method; next, we extract the local features, permissions and API calls; then we extract the system functions to get the key function call graph. Moreover, we build a similar module feature library.
4. Classification Algorithms: Based on the extracted features, we use the hierarchical classification method. For the key functions, we use the sequence of API calls to serialize them and calculate the similarity of similar modules. For the permissions and API calls, the XGBoost classifier is used. Similarly, for the extracted basic block features, the CNN classifier is used.
5. Malware Detection Model: When an unknown sample arrives for detection, we first check whether it matches a record in the similar module database. If it does, the sample is malicious, and it is added to the similar module feature library, which is dynamically expanded. Otherwise, we use the combined deep learning model CNNXGB: if the probability p > 0.5, the program is malicious; otherwise it is benign. If it is malicious, it is added to the similar module feature library.
This section details the feature extraction process and the malware detection models. The other steps are elaborated in the experimental section.

Feature Extraction
In this paper, we extracted four different types of features. The comprehensive process of these feature extraction is given below.

Basic Block Features
The application's binary code stream harbors important information for malware detection. We take the basic block as a research unit to process the whole application to a multichannel 1024 × 1024 PNG picture. That is taking images as the characteristics of the program. As mentioned earlier [39][40][41], there are still the following problems with converting the whole application into a picture representation:

• How to change the different sizes of applications into the same size pictures?
• How to effectively solve the problem of pixels burst under the same coordinate?
• How to reduce the irrelevant noise of the generated picture?
This subsection proposes solutions to the problems mentioned above. For the first question, we map the basic blocks of each application to a 1024 × 1024 picture with 1,048,576 pixels (about 1 million), which is enough to hold the basic blocks of most applications. This keeps all pictures the same size. For the second question, we add an A channel to the RGB method to deal with conflicts. The value of the A channel is obtained by the Mapping Table and Finding Adjacent Free Pixels methods. For the third question, the standard approach is to open the program as a binary stream and read the program data in 8-bit units [40]. Assuming that a program's size is S bytes, the program can then be represented by an S-dimensional vector. However, a program consists not only of code but also of many resource files, such as pictures and audio, so the generated picture contains a lot of noise. Our method is to unpack the Android application, discard all resource files such as pictures, audio, and videos used in the program, and keep only the files storing the program code. The detailed process is presented in the following paragraphs.
A program is composed of some algorithms which contain many conditional judgments in the specific implementation, and different results of conditional decisions will lead to executing different code branches. Therefore, we use conditional judgments as a division point; a program is divided into many basic blocks. Figure 2 shows many basic blocks separated by a program and the relationship among them. After extracting all the basic block instructions, a sequence is mapped into a 44-bit binary sequence using the simhash method [43]. This binary sequence is divided into 10, 10, 8, 8, and 8 binary sequences, from the most significant to the least significant. The values and meanings of each sub-sequence are shown in Table 1.
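The split of the 44-bit hash into five fields can be sketched as follows. This is a minimal illustration only: it assumes that the two 10-bit fields are the x and y coordinates (10 bits address 1024 positions) and that the three 8-bit fields are the R, G, and B values. The actual field meanings are those defined in Table 1, and the function name `split_hash44` is hypothetical.

```python
from typing import Tuple


def split_hash44(h: int) -> Tuple[int, int, int, int, int]:
    """Split a 44-bit value into 10-, 10-, 8-, 8-, and 8-bit fields,
    from the most significant to the least significant bits."""
    if not 0 <= h < (1 << 44):
        raise ValueError("expected a non-negative 44-bit integer")
    b = h & 0xFF            # lowest 8 bits (assumed: blue)
    g = (h >> 8) & 0xFF     # next 8 bits (assumed: green)
    r = (h >> 16) & 0xFF    # next 8 bits (assumed: red)
    y = (h >> 24) & 0x3FF   # next 10 bits (assumed: y coordinate)
    x = (h >> 34) & 0x3FF   # highest 10 bits (assumed: x coordinate)
    return x, y, r, g, b
```

Under these assumptions, every basic block lands on a coordinate inside the 1024 × 1024 picture regardless of the application's size.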
The picture is composed of pixels. This paper takes the upper left corner of the picture as the coordinate system's origin, stretches to the right as the x-axis, drawn down as the y-axis, respectively. The whole picture is divided into grids with unit 1 as the length. Each grid represents a pixel. The default initialization color value of the pixel is (0, 0, 0, 255).
Mapping conflicts are of two types: conflicts between mappings with the same color and conflicts between mappings with different colors at the same coordinate. This study offers a different solution for each. For the first conflict, if the basic blocks' mapping coordinates and colors are both the same, then the value of channel A, in the range [0, 255], is used to represent the conflict frequency. The paper defines the mapping table between the value of channel A and the conflict frequency, shown in Table 2. For example, suppose that a basic block after conversion is mapped to (245, 418) and its RGB color is (50, 56, 168). For the first mapping, the default value of channel A is 255, so the corresponding RGBA color is (50, 56, 168, 255). If the pixel has 1500 conflicts, the corresponding value of channel A is 150 according to Table 2, so its RGBA color is (50, 56, 168, 150), as shown in Figure 3.
For the second conflict, the paper proposes a new algorithm, the Finding Adjacent Free Pixels method, which places the conflicting pixel in a free pixel found nearby. That is, if the coordinate of the conflicting pixel is (x, y), then (x, y) is taken as the center, and the pixels at radius r around (x, y) are searched. The importance of the pixels with the same radius is regarded as equivalent. The search for free pixels proceeds in turn from the top left corner and ends when a free pixel is found, which is then used as the filling point. Each Android application eventually becomes a 1024 × 1024 RGBA image through Finding Adjacent Free Pixels. The images that represent the features of the application are stored in the Android feature library. Figure 4 shows the pixels at radius r = 1: the orange pixel in the center is the conflicting pixel, while the free pixels used for filling are blue.
Discarding the mapping or fusing the mapping value with the existing pixel points will lose the original and current information. The pixel space of a 1024 × 1024 picture is about 1 million. For most programs, the space is sufficient, and there must be some empty unfilled pixels. The problem of image size inconsistency and mapping conflict is solved through Finding Adjacent Free Pixels. At the same time, the original information of the application program is effectively preserved. The malicious and benign sample image features of Android are shown in Figures 5 and 6, respectively.
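The search for a free pixel can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of the paper: the pixels at radius r are taken to be the square ring around (x, y) (Chebyshev distance r), each ring is scanned row by row starting from its top left corner, and the occupied pixels are kept in a Python set. The name `find_adjacent_free_pixel` is hypothetical.

```python
from typing import Optional, Set, Tuple

SIZE = 1024  # picture width and height in pixels


def find_adjacent_free_pixel(
    occupied: Set[Tuple[int, int]], x: int, y: int, size: int = SIZE
) -> Optional[Tuple[int, int]]:
    """Return the first free pixel around (x, y), growing the radius ring by ring."""
    for r in range(1, size):
        # Scan the bounding square of the ring from its top left corner.
        for yy in range(y - r, y + r + 1):
            for xx in range(x - r, x + r + 1):
                if max(abs(xx - x), abs(yy - y)) != r:
                    continue  # keep only pixels exactly at radius r
                inside = 0 <= xx < size and 0 <= yy < size
                if inside and (xx, yy) not in occupied:
                    return (xx, yy)
    return None  # no free pixel left in the picture
```

Because pixels at the same radius are treated as equivalent, the first free pixel met in scan order is used, and the radius grows only when a whole ring is occupied.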

Permission and API Calls Features
In addition to the basic block features, we also focus on each system function called in the basic blocks, as shown in Figure 7, where the red boxes represent basic blocks and the underlines indicate the functions called. Calling different functions requires the system's permission, and access to operating system functionality and system resources requires API calls by the Android application. Therefore, the permissions and API calls represent the local features of an application.
Permission Extracting: If an application wants to use a system function in the Android operating system, it needs to apply to the system for the corresponding permission. Therefore, permissions are an essential characteristic of application behavior. As the Android system has developed, it has provided more and more permissions. The number of native permissions in each version of the Android system, obtained by analyzing the source code of Android 4.0 to 10.0, is shown in Figure 8. It shows that the latest version, Android 10.0, has more than 500 permissions. If all permissions were extracted as features, the feature dimension would increase dramatically. We select 22 necessary permissions [36] as research objects. The names and corresponding meanings of each permission are shown in Table 3. The vector corresponding to the permission feature is $FP = (x_1, x_2, \dots, x_{22})$, where $x_i$ corresponds to the $i$th permission in Table 3. We traverse all permissions requested by the application; if the $i$th permission in Table 3 is requested, $x_i$ is set to 1, otherwise to 0.
Although permission features reflect a program's behavior to a certain extent, permissions are widely used, and some applications request a particular permission without necessarily using it at runtime, so it is not reliable to detect malicious programs with permissions alone.
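The construction of the binary permission vector can be sketched as follows. The three permissions listed are placeholders standing in for the 22 permissions of Table 3, which are not reproduced here, and the function name `permission_vector` is hypothetical.

```python
from typing import Iterable, List, Sequence

# Placeholder list; the paper uses the 22 permissions of Table 3 instead.
SELECTED_PERMISSIONS = [
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.INTERNET",
]


def permission_vector(
    requested: Iterable[str], selected: Sequence[str] = SELECTED_PERMISSIONS
) -> List[int]:
    """Set component i to 1 if the i-th selected permission is requested, else 0."""
    requested_set = set(requested)
    return [1 if perm in requested_set else 0 for perm in selected]
```

Permissions outside the selected list are ignored, which keeps the feature dimension fixed no matter how many permissions an application requests.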
A program that wants to interact with the system must invoke the system's API, so the system APIs called in the program also reflect program behavior. The frequency of some Android system API calls differs between malicious programs and benign programs [15]. Therefore, we propose the API Calls Frequency Difference method to collect statistics on the system API calls of benign and malicious programs in the sample set. The detailed steps are as follows:
1. Read the smali file and extract the code between ".method" and ".end method" to obtain the function body, which reflects the structural information among API calls.
2. Extract the Android system APIs that are called in the function body.
3. Traverse the entire application, repeating steps 1 and 2.
4. Count the number of times that the benign applications in the dataset call each API, and calculate each API's frequency in the benign applications.
5. Count the number of times that the malware calls each API, and calculate each API's frequency in the malware.
6. Compare the frequency with which the same API appears in benign and malicious applications.
Based on the proposed API Calls Frequency Difference method, we extract the 40 system APIs with the largest difference in call frequency; the results are shown in Table 4. When counting the system API call frequency, this paper excludes the third-party libraries integrated into the application so that they do not distort the system API statistics.
The vector corresponding to the API features is recorded as $FA = (x_1, x_2, \dots, x_{40})$, where $x_i$ is set to the number of calls to the $i$th API in the application.
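The frequency comparison and the resulting count vector can be sketched as follows. The sketch assumes that an API's frequency within a class is its call count divided by the total number of API calls in that class; the paper does not spell out the normalization, so this is only one plausible reading. All function names are hypothetical, and each application is represented simply as a list of called API names.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def api_frequencies(apps: Iterable[Iterable[str]]) -> Dict[str, float]:
    """Relative call frequency of each API over all applications of one class."""
    counts: Counter = Counter()
    for calls in apps:
        counts.update(calls)
    total = sum(counts.values())
    return {api: n / total for api, n in counts.items()} if total else {}


def top_k_by_frequency_difference(
    benign_apps: Iterable[Iterable[str]],
    malicious_apps: Iterable[Iterable[str]],
    k: int = 40,
) -> List[str]:
    """APIs whose frequency differs most between benign and malicious samples."""
    fb = api_frequencies(benign_apps)
    fm = api_frequencies(malicious_apps)
    apis = set(fb) | set(fm)
    # Sort by descending absolute difference; ties are broken by name.
    ranked = sorted(apis, key=lambda a: (-abs(fm.get(a, 0.0) - fb.get(a, 0.0)), a))
    return ranked[:k]


def api_vector(calls: Iterable[str], selected: Sequence[str]) -> List[int]:
    """Component i is the number of calls to the i-th selected API."""
    counts = Counter(calls)
    return [counts[api] for api in selected]
```

Unlike the permission vector, the API vector keeps call counts rather than 0/1 flags, so heavily used APIs carry more weight in the classifier input.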

Key Function Call Graph (KFCG)
Some fundamental terms and definitions are used for the description of the key function call graph, which can be defined as: An application contains many functions, but the primary way that an Android application interacts with the system is through the system functions. After research and analysis, we find that all system function call times are different, and non-key functions account for more than key functions. If all functions are processed, non-key functions will consume a tremendous amount of system resources. This paper then extracts the key functions and digitizes them through the sequence of API calls, which improves the application's analysis performance and reflects the original function of the program.
The detailed steps for constructing the key function call graph are as follows:
1. Traverse the function body, find each called function in order, and store it in a key-value pair. The key is the globally unique identifier of the function, and the value is a list, with 1 indicating that the function is a key function and 0 indicating that it is a non-key function.
2. Process all smali files using step 1 to get the function call graph (FCG).
3. Use an adjacency matrix to represent the function call graph, in which 1 means that there is a calling relationship between two functions and 0 means that there is none.
4. Remove the non-key functions from the FCG to get the KFCG, and then obtain the key function call table.
How do we transform the FCG into the KFCG? The function call graph (FCG) represents the calling relationships between function blocks. Let KFCG = (V, E), where V and E represent the vertices and edges of the graph KFCG, respectively. The KFCG is a directed acyclic graph and should not contain self-loops or recursive functions. If a function FA calls the function FB, then the number of hops between these two functions is called the distance from FA to FB, written as DISTANCE(FA, FB). For all u, v ∈ V, DISTANCE(u, v) satisfies the following: if there are multiple paths from u to v, the shortest route is chosen; and, generally, DISTANCE(u, v) equals the number of non-key functions between u and v plus 1.
For example, all functions of the application and the called relationships of each function are shown in Table 5 (uppercase letters indicate key functions, lowercase characters indicate non-key functions, and fancy letters represent system call functions). For the function A, it is a key function, and four functions (the non-key function a, the key function B, and the system call function S 1 and S 2 ) are called successively in its function body. According to Table 5, we can initialize the function call graph, as shown in Figure 9, and then remove the non-key functions one by one updating the call distance between functions. For non-key function a, since A calls a and a calls C, the hop value A to C should be updated to 2 after removing a; A calls B directly, the hop value of A to B is less than the one of A to a to B. Therefore, the hop value of A to B is not updated, as in Figure 10a. For non-key function b, since B calls b and b calls C, the hop value B to C should be updated to 2 after removing b. The resulting key function call graph (KFCG) is shown in Figure 10b. Then we can get key function call table, as shown in Table 6. Table 5. The list of the functions in application and the called relationship by each function.
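The removal of non-key functions can be sketched as a breadth-first search from each key function that passes only through non-key functions. This is an illustrative reading of the procedure above rather than the authors' implementation. The small call graph in the usage note is hypothetical, chosen to be consistent with the example in the text (Table 5 is not reproduced here), and system call functions are left out for brevity.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def build_kfcg(
    fcg: Dict[str, List[str]], key_functions: Set[str]
) -> Dict[Tuple[str, str], int]:
    """Map each (caller, callee) pair of key functions to its call distance.

    The distance is the number of non-key functions on the shortest
    connecting path plus 1.
    """
    distances: Dict[Tuple[str, str], int] = {}
    for u in key_functions:
        seen = {u}
        queue = deque([(u, 0)])
        while queue:
            node, d = queue.popleft()
            for callee in fcg.get(node, []):
                if callee in seen:
                    continue
                seen.add(callee)
                if callee in key_functions:
                    # First arrival in a breadth-first search is the shortest.
                    distances[(u, callee)] = d + 1
                else:
                    # Continue the search through the non-key function.
                    queue.append((callee, d + 1))
    return distances
```

For instance, with the hypothetical graph `{"A": ["a", "B"], "a": ["B", "C"], "B": ["b"], "b": ["C"]}` and key functions `{"A", "B", "C"}`, the sketch yields distance 1 for A to B and distance 2 for both A to C and B to C, which matches the hop values described in the example.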

Malware Detection Approach
In the previous subsection, we extracted different features from Android applications. In this subsection, we use those features to detect malware. For the key functions, we consider the details of the similarities between malware samples. If the similar module cannot determine whether an unknown sample is malware, we turn to the other features: XGBoost is used to classify the permissions and API calls, and the CNN classifier is used on the basic block features. The CNNXGB model is then built from the two to improve the classification accuracy.

Similar Module Detection
In contrast to [37,38], our method is based on the Android system function call sequence and can be effectively used to extract similar modules between malware. A similar module can be used to determine whether the two Android applications are identical. For instance, for the sample α to be detected, we first extract a known malicious sample β from the similar module feature library, then calculate their similarity. If the two values are identical, it can be judged that the sample α is a malicious program; otherwise, it is a non-malicious program.
When selecting a sample β, traversing the malicious sample database one by one would take too long. To solve this problem, this paper uses an inverted index to choose a comparison subset from the malicious sample database; the samples in the subset are then the only samples compared with sample α. The comparison subset is generated as follows. Let the $k$th application in the sample library be $APP_k$, and obtain the hash values $F^k_1, F^k_2, \dots, F^k_{N(k)}$ of all functions in $APP_k$, where $N(k)$ is the number of functions in $APP_k$. The same function may appear in multiple applications. By reversing this mapping, we obtain the mapping from each function to the applications that contain it.
We use the hash value of the sequence of API calls as the function's flag. Suppose a function $f$ in the application has the API call sequence $F_1, F_2, \dots, F_n$. We join the sequence with colons (:) to get the string "$F_1 : F_2 : F_3 : \cdots : F_n$", and take the MD5 value of the string as the unique flag of the function $f$. Finally, we get the similar module graph (SMG), as in Equation (1), whose corresponding matrix is the similar module (SM). After extracting the SMs of all collected samples, we build a similar module feature library.
In Equation (1), $C_{ij}$ denotes the distance from $F_i$ to $F_j$. To compare two similar modules, their dimensions must first be unified, which takes two steps. First, we extract the functions common to the two similar modules to form a common similar module matrix. Then we compute the similarity value, as in Equation (2), which lies between 0 and 1; the larger the value, the more similar the two modules.
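The function flag and the inverted index used to select the comparison subset can be sketched as follows. The sketch assumes the API names are joined with a bare colon and encoded as UTF-8 before hashing; the helper names and the dictionary layout of the sample library are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set


def function_flag(api_sequence: Sequence[str]) -> str:
    """MD5 hex digest of the colon-joined API call sequence of one function."""
    joined = ":".join(api_sequence)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def build_inverted_index(apps: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Reverse the app -> function flags mapping into flag -> apps."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for app_id, flags in apps.items():
        for flag in flags:
            index[flag].add(app_id)
    return index


def comparison_subset(index: Dict[str, Set[str]], sample_flags: Iterable[str]) -> Set[str]:
    """Known samples that share at least one function flag with the new sample."""
    subset: Set[str] = set()
    for flag in sample_flags:
        subset |= index.get(flag, set())
    return subset
```

Only the samples returned by `comparison_subset` need a full similarity computation, which avoids traversing the whole malicious sample database for every unknown sample.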

Detection with CNNXGB
Due to the limited number of samples in a similar module feature database, some malicious samples are not similar to any modules in a similar module database. This section builds a deep learning model CNNXGB based on XGBoost and CNN by extracting the permission, frequency of API calls, and basic blocks of the Android application program.
From the above processing, we obtain the permission features, the API frequency features, and the RGBA picture features transformed from the basic blocks. The paper then proposes a new CNNXGB detection algorithm to improve the detection accuracy. The CNN algorithm can realize end-to-end learning, and the intermediate features are obtained by automatic learning. The XGBoost algorithm is a combination of a series of classification and regression trees; its advantages are resistance to overfitting, fast training, and strong interpretability [44]. The CNNXGB detection algorithm combines the strengths of CNN and XGBoost. One part of the model is a linear stack of CNN convolutional layers that processes the RGBA image features, and the other part is the XGBoost model that deals with the permission and API features. The flow chart of the CNNXGB detection model is shown in Figure 11. In a multi-class classification problem, the CNN outputs several probability values in the fully connected layer, indicating the probability that the target belongs to each category. In this study, the classification of Android malicious programs is a binary classification problem. The CNN outputs the probability values of normal and malicious programs, respectively, and the prediction results of XGBoost are similar to those of the CNN. Suppose the CNN and XGBoost respectively obtain the probabilities $p_1$ and $p_2$ that the program to be detected is malicious, and their weights are $w_1$ and $w_2$. Then the probability that the program is detected as malicious is
$$P = w_1 p_1 + w_2 p_2.$$
When $P \geq 0.5$, the program to be detected is malicious; otherwise, it is a normal program. In this paper, the CNN deals with only one feature, whereas XGBoost handles two features: permissions and APIs. Thus, the weight of the CNN detection result $w_1$ is set to 1/3, and the weight of the XGBoost detection result $w_2$ is set to 2/3.
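The fusion of the two classifier outputs can be sketched as follows, assuming the combined probability is the weighted sum of the CNN and XGBoost probabilities, which is what the weights w1 = 1/3 and w2 = 2/3 suggest. The function names are hypothetical, and the sketch takes the two probabilities as given rather than training any model.

```python
def cnnxgb_probability(p1: float, p2: float, w1: float = 1 / 3, w2: float = 2 / 3) -> float:
    """Weighted sum of the CNN probability p1 and the XGBoost probability p2."""
    return w1 * p1 + w2 * p2


def is_malicious(p1: float, p2: float, threshold: float = 0.5) -> bool:
    """Flag the program as malicious when the fused probability reaches the threshold."""
    return cnnxgb_probability(p1, p2) >= threshold
```

With these weights the XGBoost output dominates: a CNN probability of 0.2 combined with an XGBoost probability of 0.8 gives a fused value of about 0.6 (malicious), whereas 0.9 and 0.1 give about 0.37 (normal).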

Experimental Results and Analysis
In this paper, two sets of experiments are conducted to evaluate the performance of our proposed malware detection approach. First, we evaluate the detection performance obtained with each of the extracted authorization-sensitive features separately. Second, we compare the hierarchical Android malware detection system we developed with other commonly used classification methods.

Data Collection and De-Compilation
First, we collected 67,577 Android samples dated between 2014 and 2020, as shown in Table 7, of which 17,564 are normal samples and 50,013 are malicious samples. An initial database of similar modules for Android malware detection is created based on the sequences of API calls from these raw samples. Second, we downloaded the experimental data, comprising 6116 malicious samples and 5211 normal samples, mainly from Github, Google Play, Fdroid, and VirusShare [45]. The SHA256 list of samples can be obtained from Archive [46]. Before extracting the features of an Android application, we need to decompile the application dataset. To obtain the similar modules based on the sequence of API calls, we use Apktool to decompile each application into readable smali assembly code. In addition, we decompile each application with Androguard [48] to obtain its Dalvik code. The preprocessing steps are shown in Figure 12.

1. Prepare the hash value list of all samples;
2. input the hash list into the scheduler;
3. the scheduler queries the sample storage path in the data management system according to the hash value of each application;
4. after the data management system returns the application path, the scheduler groups the applications and starts multiple processes for processing;
5. when the scheduler obtains the processing results of the processes, the results are stored in the Android feature library.

Each process handles one program at a time; thus, multiple processes running in parallel can handle large quantities of data quickly and efficiently. Within each process, the study uses Androguard to get the basic information of the application and uses LibScout to analyze the program's third-party Java libraries [43,49]. Because a third-party library is not the program's own implementation code, our method records the third-party package names to eliminate this interference. In the subsequent analysis, the third-party library code is excluded based on the package name. The tools used by each process to handle Android applications and the information they extract are listed in Table 8.
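The exclusion of third-party code by package name can be sketched as follows. This is an illustrative assumption about how such a filter might look, not the authors' implementation: it presumes the recorded package names are dotted Java names (for example `com.google.gson`) and that classes are identified by smali type descriptors of the form `Lcom/example/Foo;`. Both function names are hypothetical.

```python
def to_smali_prefix(package_name):
    """Convert a dotted package name to a smali descriptor prefix.

    'com.google.gson' becomes 'Lcom/google/gson/'. The trailing slash keeps
    'com.google.gson' from also matching 'com.google.gsonx'.
    """
    return "L" + package_name.replace(".", "/") + "/"


def exclude_third_party(class_descriptors, third_party_packages):
    """Return only the classes that are not inside a recorded third-party package."""
    prefixes = tuple(to_smali_prefix(p) for p in third_party_packages)
    return [c for c in class_descriptors if not c.startswith(prefixes)]
```

Matching on the package prefix also removes classes in sub-packages of a recorded library, which is usually the desired behaviour when a library ships several internal packages.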

Experiment Setup
Different types of machine learning classifiers [11,50,51], such as support vector machine (SVM), decision tree (DT), and random forest (RF), and deep learning classifiers [14,40,41,52] are used to produce models that can detect mobile malware. SVM finds a hyperplane that separates two classes with maximal margin and is widely used in malware classification. DT learns decision rules from the given features to build a rule-based model. There are several DT variants, e.g., C4.5, ID3, C5.0, and CART. A deep tree may cause overfitting. RF is an ensemble learning method in which many decision trees are combined into a forest and used together to predict the outcome. It can also overfit on some noisy classification or regression problems. XGBoost is a boosting algorithm that combines weak classifiers to form a strong classifier [44]. The basic idea of boosting is to train weak classifiers one after another, so that each new classifier concentrates on the samples that the previous ones predicted poorly; in XGBoost, each new tree is fitted to the gradient of the loss of the current ensemble. Repeating this step produces a strong classifier model consisting of several simple weak classifiers. XGBoost is not easily overfitted and trains quickly. CNN is a feedforward neural network built from four types of layers: convolutional layers, pooling layers, fully connected layers, and an output layer. After the input data pass through multiple convolutional and pooling layers, the salient features obtained are passed to the fully connected layers for higher-level inference. Finally, the output layer produces the corresponding class probabilities [53]. CNN performs very well in image processing and has been applied to various fields in recent years, such as face recognition, medical diagnosis, voice recognition, and malware detection.
The configuration of the experiment running environment and the main packages adopted in this study are presented in Table 9. We use 70% of the dataset samples as the training dataset and 30% as the test dataset. To assess our algorithm, we use the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), from which the evaluation metrics are computed. DT, RF, and SVM [54][55][56] are chosen as classifiers to compare with our model. For the CNN algorithm, the convolutional layer parameter sets are given in Table 10, and ReLU is used as the activation function. The first dense layer of the fully connected part has an output dimension of 512 and uses ReLU as its activation function. The second dense layer has an output dimension of 2 and uses softmax as its activation function, and the dropout rate is set to 0.5. For the XGBoost algorithm, the parameter sets are given in Table 11.

Table 11 (partially recovered). XGBoost parameters: ratio of columns sampled when creating each tree, 0.9; learning rate, 0.01; one further entry with the value 10 whose label is missing.
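The counts TP, FP, TN, and FN can be turned into accuracy, precision, and recall with the standard definitions, as in the sketch below. The formulas are the usual textbook ones and are assumed here, since this section does not spell them out; the label convention (1 for malicious, 0 for normal) and the function names are also assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the malicious label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def basic_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall; a ratio with a zero denominator is reported as 0.0."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Precision penalizes normal programs flagged as malicious, whereas recall penalizes malware that goes undetected, so the two are reported together when classifiers are compared.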

Features Analysis
In this subsection, two experiments are set up to evaluate the detection performance based on the extracted authorization-sensitive features: (1) we evaluate the detection rates based on KFCG; (2) we compare the detection performance using the extracted features.

Detection Results Based on KFCG
Samples are categorized using the NANO antivirus engine, and a category is used in the experiment only if it contains more than 450 malicious samples. The threshold for similarity is set to 0.7. The detection results using the sequence of API calls are shown in Table 12.
To verify the classification results, we select six commercial antivirus products, F-Secure, BitDefender, AhnLab-V3, TrendMicro, Kaspersky, and Avast, to analyze the classification results. The more samples of a family that an antivirus engine also assigns to that family, the more effective the proposed similar module extraction method is. Therefore, the larger the ratio R (Equation (6)) in Table 13, the better the detection rate of the proposed similar module extraction method, that is, the higher the classification accuracy of similar modules. The classification accuracy is over 91% on average.

$$R = \frac{\text{the number of similar samples detected from the family}}{\text{total number of family samples}} \times 100\% \qquad (6)$$
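As a small worked example of Equation (6), the sketch below computes R for one family. The function name and the sample counts used in the usage note are hypothetical and are not taken from Table 13.

```python
def family_ratio(detected_from_family, family_total):
    """Ratio R of Equation (6), in percent.

    detected_from_family: similar samples that the antivirus engine also
        assigns to the family.
    family_total: total number of samples in the family.
    """
    if family_total <= 0:
        raise ValueError("family_total must be positive")
    if not 0 <= detected_from_family <= family_total:
        raise ValueError("detected_from_family must lie between 0 and family_total")
    return 100.0 * detected_from_family / family_total
```

For instance, a hypothetical family of 500 samples in which 460 are confirmed by the engine gives `family_ratio(460, 500)`, that is, an R of 92%.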

Detection Performance Evaluation Using Extracted Features
We evaluate the performance of the selected permissions and API calls by using XGBoost, and we use CNN to assess the performance of the extracted basic block features. The classification results are shown in Table 14. We found that the hierarchical authorization-sensitive features (permissions, API calls, basic blocks) used together achieved better classification accuracy than the features used separately.

Classifiers Analysis
The paper chooses DT, RF, and SVM [54][55][56] as classifiers to compare with CNNXGB. The results of the experiments are shown in Figure 13. From the figure, we can see that the recall of SVM is significantly higher than that of the other methods, but its precision, accuracy, and AUC are substantially lower. DT achieves the best precision, and its recall is the same as that of CNNXGB, but it is weaker than CNNXGB in accuracy and AUC. RF is weaker than CNNXGB on all metrics. The experimental analysis therefore indicates that the CNNXGB model proposed in this paper performs best overall among the compared classifiers. The results show that the classification accuracy of the CNNXGB model reaches 98%.

Conclusions
To detect Android malware efficiently and effectively, we build a hierarchical Android malware detection system using authorization-sensitive features. We transform the basic blocks that represent binary code into a multichannel picture, in which the A channel is used to deal with mapping conflicts. To represent the application's local features, we extract 22 permissions and 40 API calls selected by the API Calls Frequency Difference method. Key functions reflect the primary interaction between the application and the Android system. We order the key functions according to the sequence of API calls to serialize the key function call graph (KFCG). Based on the extracted features, we present a hierarchical Android malware detection framework that combines similar module feature detection with a deep learning model. In the first layer, we select a comparison subset from the similar module feature library using an inverted index, which avoids the cost of traversing the library entry by entry. In the second layer, CNNXGB integrates XGBoost and CNN to improve the detection accuracy. At the same time, we update the similar module feature library of Android malware according to the detection results, so that the database grows dynamically. We then conduct an extensive evaluation on our dataset to compare the detection results, which demonstrates that our proposed approach is practical. The classification accuracy is over 91% on average through the similarity comparison of similar modules, and it is increased to 98% by the CNNXGB model.
In the future, we plan to extend our work in the following aspects: (1) increase the diversity of Android sample features, such as native layer code features, to improve the model's detection ability; (2) research the decompiling technology of Android programs to enhance the decompiling ability; (3) optimize the deep learning model that integrates XGBoost and CNN to reduce the training time.

Data Availability Statement: The data and codes used in this work are available at https://github.com/Joyce-hui/CNNXGB (accessed on 7 February 2021).

Conflicts of Interest:
The authors declare no conflict of interest.