Applied Sciences
  • Article
  • Open Access

17 February 2023

GA-StackingMD: Android Malware Detection Method Based on Genetic Algorithm Optimized Stacking

1
School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
2
Jilin Province Key Laboratory of Network and Information Security, Changchun 130022, China
3
Information Center, Changchun University of Science and Technology, Changchun 130022, China
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Information Security and Privacy

Abstract

With the rapid development of network and mobile communication, intelligent terminals such as smartphones and tablet computers have changed people’s daily life and work. However, malware such as viruses, Trojans, and ransomware has introduced threats to personal privacy and social security. Android malware is highly varied and evolves rapidly. Android malware detection faces two problems: high feature dimensionality and the unsatisfactory detection accuracy of single classification algorithms. In this work, an Android malware detection framework, GA-StackingMD, is presented. It employs Stacking to combine five different base classifiers and applies a Genetic Algorithm (GA) to optimize the hyperparameters of the framework. Experiments show that Stacking effectively improves malware detection accuracy compared with single classifiers. GA-StackingMD achieves accuracies of 98.43% and 98.66% on the CIC-AndMal2017 and CICMalDroid2020 data sets, respectively, which shows the effectiveness and feasibility of the proposed method.

1. Introduction

The rapid development of the Internet has promoted intelligent mobile terminals such as smartphones and tablet computers to play a more and more important role in people’s life and work. Up to October 2022, the Android operating system accounts for 70.96% of the global smartphone market share [1]. At the same time, malware has led to a huge impact on personal privacy and social security. Kaspersky [2] found a total of 3,464,756 malicious installation packages, 97,661 new mobile banking Trojans, and 17,372 new mobile ransomware Trojans in 2021. The security of mobile intelligent terminals is facing severe challenges.
Malware mainly includes software that infringes upon legitimate rights and runs on users’ computers or mobile terminals without permission, such as computer viruses, Trojans, worms, ransomware, spyware, and so on. Rapidly growing malware not only threatens users’ privacy but also plays an important role in most computer intrusions. In present-day software development, reusing open source code is common practice. However, code reuse also allows bad actors to reach a wide range of developer communities and to obtain different kinds of malware [3]. With the development of code obfuscation and anti-tracking technology, traditional malware such as viruses and worms evolves into more threatening variants through polymorphism and metamorphism, and can even evade security scanning, which reflects the limitations of existing detection methods [4].
Research on Android malware detection mainly focuses on two aspects. The first is malware analysis, which extracts different types of features that are then used to detect malware. The second is how to select or build an appropriate classification model that performs well and stably on different forms of software [5,6]; the stability, robustness, and effectiveness of classification models have therefore attracted the attention of researchers. Existing malware detection methods usually extract features from the data set, train learning models, and then separate malicious samples from normal ones. In practice, the detection accuracy of a single feature or a single detection algorithm is unsatisfactory, while combinations of multi-dimensional features or multiple classifiers incur high computational cost.
Considering the above problems, we try to propose solutions. The contributions of this paper are as follows.
  • An Android malware detection framework based on Stacking is proposed. The framework mainly includes three parts: data set construction, feature dimension reduction, and optimization method GA-StackingMD. By Stacking technology, five base classifiers are integrated to combine their advantages in order to improve malware detection performance;
  • A two-step feature dimension reduction method is realized. The extracted multiple-category features have high dimensions, which may lead to redundancy, excessive computational consumption, and even over-fitting. The proposed method uses InfoGain for an initial feature selection and then applies the Chi-square Test to reduce redundancy; finally, a key feature subset with better distinguishing ability is constructed;
  • A Stacking hyperparameter optimization method, GA-StackingMD, is presented to improve the performance of the classifier combination. Experiments show that it achieves better accuracy than the original combination of base classifiers. The proposed method is applicable not only to the optimization of Stacking hyperparameters but can also be extended to the optimization of other ensemble classifiers, so that it can be used with different base classifier algorithms.

3. Android Malware Detection Method

3.1. The GA-StackingMD Framework

In consideration of the high feature dimension and low detection efficiency of single classifiers, an Android malware detection method GA-StackingMD based on Stacking is presented, and GA is used to optimize the hyperparameters of the base models. The GA-StackingMD Framework is shown in Figure 2.
Figure 2. The GA-StackingMD Framework.
The proposed malware detection framework consists of three parts.
  • Data set construction. The Android application is decompiled and static features are extracted, including Permission, API, Dalvik opcode, Intent, and Hardware. Then the extracted features are digitized to construct the original high-dimensional feature set;
  • Feature dimension reduction. A two-step feature selection method is proposed to reduce the original feature dimension. InfoGain is used for the initial selection, then the Chi-square Test is applied for further reduction to remove redundant and irrelevant features. The final key feature subset is low-dimensional and discriminates well between malicious and benign samples;
  • GA-StackingMD. An optimization algorithm based on GA is presented to select the hyperparameters of base classifiers of Stacking. After selecting the better combination of base classifiers, GA is used to adaptively adjust the hyperparameters in a given scope, and finally, improve the malware detection performance.

3.2. Feature Processing

3.2.1. Feature Extraction

The two data sets used in this work are published by the Canadian Institute for Cybersecurity. CIC-AndMal2017 [33] includes 426 malicious applications and 1700 benign ones, and CICMalDroid2020 [34] includes 13,204 malicious samples and 4039 benign ones. The malware includes adware, SMS malware, ransomware, and so on. The two data sets differ greatly in the number of samples, which also tests the robustness of the proposed method.
Androguard is used to extract Permissions, Dalvik opcode, API, Intent, and Hardware features. The extracted features are firstly converted to numerical values. The features from all samples are sorted by category and arranged in dictionary order in the same category. If the feature appears in a sample, then it is marked “1”, otherwise marked “0”. A sample is represented by a vector composed of “1” and “0” so that all the samples compose a feature matrix.
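The binary encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_feature_matrix`, the input layout (one dictionary per application, mapping a feature category to a set of feature names), and the sample feature names are all hypothetical, and the Androguard extraction step is not shown.

```python
def build_feature_matrix(samples):
    """Encode each sample as a 0/1 vector over the shared feature vocabulary.

    `samples` is a list of dicts mapping a category name (e.g. "Permission")
    to the set of feature names observed in that application.
    """
    # Vocabulary sorted by category first, then lexicographically within a category.
    vocab = sorted({(cat, feat)
                    for sample in samples
                    for cat, feats in sample.items()
                    for feat in feats})
    column_of = {key: col for col, key in enumerate(vocab)}

    matrix = []
    for sample in samples:
        row = [0] * len(vocab)
        for cat, feats in sample.items():
            for feat in feats:
                row[column_of[(cat, feat)]] = 1  # feature present in this sample
        matrix.append(row)
    return vocab, matrix


# Hypothetical toy input: two applications with a few extracted features.
apps = [
    {"Permission": {"SEND_SMS", "INTERNET"}, "Intent": {"BOOT_COMPLETED"}},
    {"Permission": {"INTERNET"}, "Intent": set()},
]
vocab, X = build_feature_matrix(apps)
```

Each row of `X` is one sample and each column one feature, so the rows together form the feature matrix used in the later steps.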

3.2.2. Two-Step Feature Selection

A single type of feature can hardly describe the differences between malware and normal applications comprehensively, but a combination of multiple feature types has a high dimension, which leads to a significant increase in computation and feature redundancy. Effectively reducing the feature dimension and constructing a good feature subset are therefore important prerequisites for improving the performance of the classifiers.
InfoGain is used to implement the first selection step. The InfoGain value is defined by information entropy and conditional entropy, and indicates how much the information complexity or uncertainty is reduced under a specific condition. In feature selection, it indicates how much information a feature brings to the whole classification system. The higher the InfoGain value, the more information the feature brings and the more important it is. With all the features sorted by this criterion, we select the top-ranked features to construct the candidate feature set.
In the second step, the Chi-square Test is applied to the candidate feature set to further remove features with low correlation. The Chi-square Test describes the correlation between two variables by calculating the χ² statistic. Taking features as variables, the Chi-square values between each feature and the class labels are calculated and sorted. The features with high correlation are retained, and finally, the key feature subsets for detection are selected.
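The two-step selection can be illustrated with a small standard-library sketch for binary features and binary labels. It assumes InfoGain is computed as H(Y) − H(Y | X) and that the Chi-square score is the Pearson statistic of the 2×2 feature/label contingency table; the source does not state the exact variant or implementation, and the function names and the cut-offs `k1` and `k2` are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def info_gain(column, labels):
    """InfoGain = H(Y) - H(Y | X) for one binary feature column."""
    n = len(labels)
    conditional = 0.0
    for v in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def chi_square(column, labels):
    """Pearson chi-square statistic of the 2x2 feature/label table."""
    n = len(labels)
    stat = 0.0
    for v in (0, 1):
        for c in (0, 1):
            observed = sum(1 for x, y in zip(column, labels) if x == v and y == c)
            expected = column.count(v) * labels.count(c) / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def two_step_select(X, y, k1, k2):
    """Keep the k1 best features by InfoGain, then the k2 best of those by chi-square."""
    def column(j):
        return [row[j] for row in X]

    candidates = sorted(range(len(X[0])),
                        key=lambda j: info_gain(column(j), y), reverse=True)[:k1]
    selected = sorted(candidates,
                      key=lambda j: chi_square(column(j), y), reverse=True)[:k2]
    return sorted(selected)


# Toy data: feature 0 matches the label exactly, feature 1 is uninformative,
# feature 2 is partially informative.
X = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0]
kept = two_step_select(X, y, k1=2, k2=1)
```

On this toy input the first step discards the uninformative feature and the second step keeps only the feature that matches the label.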

3.3. GA-StackingMD

GA-StackingMD improves the hyperparameters of the classifier combination in three steps.

3.3.1. Training the First Layer Classifiers

In this paper, the first layer classifiers are selected by two criteria. The first is that the algorithms should belong to different categories, so that they cover different aspects of classification. For example, some algorithms are based on probability and some on distance, so they may learn from different aspects of the data set. The second is that the algorithms should be relatively simple to implement and inexpensive to run, so that the consumption of system resources is kept as low as possible when several algorithms are combined. The candidates are preliminarily categorized as follows: Decision Tree-based RF, probability-based linear classifier SVM, distance-based KNN, and ensemble learning-based LGBM and CatBoost. This paper chooses these five classifiers as the first layer classifiers and trains them by 5-fold cross-validation. The output of each classifier and the sample labels are combined to form a new data set for the second layer. Figure 3 shows the first layer process.
Figure 3. Training of the first layer base classifiers.
The training and testing samples are divided in a 7:3 ratio. In each round of the 5-fold cross-validation, each base classifier is trained on about 80% of the training set and predicts the remaining 20%; the predictions from all rounds are combined to form new feature vectors. At the same time, each base classifier also predicts the testing set. Therefore, in the newly constructed data set, each column corresponds to a base classifier, the number of rows equals the number of training and test samples, and the test-set entries are taken as the average of the predictions across folds.
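The construction of the second-layer data set from held-out predictions can be sketched as follows. This is an illustration only: the two toy base learners (`one_nn`, `majority`) stand in for the five classifiers named above, contiguous folds replace whatever shuffling or stratification was actually used, and all function names are hypothetical.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold(fit_predict, X, y, k=5):
    """Predict every sample with a model that did not see it during training."""
    predictions = [None] * len(X)
    for held_out in kfold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        fold_preds = fit_predict([X[i] for i in train],
                                 [y[i] for i in train],
                                 [X[i] for i in held_out])
        for i, p in zip(held_out, fold_preds):
            predictions[i] = p
    return predictions


def one_nn(train_X, train_y, test_X):
    """Toy base learner: 1-nearest neighbour under Hamming distance."""
    def predict(x):
        dists = [sum(a != b for a, b in zip(x, t)) for t in train_X]
        return train_y[dists.index(min(dists))]
    return [predict(x) for x in test_X]


def majority(train_X, train_y, test_X):
    """Toy base learner: always predicts the most common training label."""
    label = max(set(train_y), key=train_y.count)
    return [label] * len(test_X)


def build_meta_features(base_learners, X, y, k=5):
    """One column per base learner, one row per training sample."""
    columns = [out_of_fold(learner, X, y, k) for learner in base_learners]
    return [list(row) for row in zip(*columns)]


# Toy binary feature matrix with ten samples.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
meta_X = build_meta_features([one_nn, majority], X, y, k=5)
```

The rows of `meta_X`, paired with the original labels `y`, would then serve as the training input of the second-layer classifier.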

3.3.2. Training the Second Layer Classifier

In order to prevent over-fitting, the Logistic Regression model is used as the second layer classifier to process the data set built in the previous step. Different combinations of classifiers are compared according to classification accuracy. For example, with m base classifiers, the number of possible non-empty combinations is $2^m - 1$. The combination with the highest accuracy is selected as the optimized combination of the integrated model for this data set.
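Enumerating the candidate combinations is straightforward; the sketch below lists every non-empty subset of the five base classifiers, which gives the 31 combinations compared in the experiments. The `select_best` helper and its `evaluate` callback are hypothetical placeholders for training the stacked model and measuring its accuracy.

```python
from itertools import combinations


def all_combinations(names):
    """Every non-empty subset of the base classifiers, smallest subsets first."""
    return [combo
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


def select_best(combos, evaluate):
    """Return the combination with the highest score from `evaluate`.

    `evaluate` is a caller-supplied function (hypothetical here) that trains
    the stacked model for a combination and returns its accuracy. On ties,
    max() keeps the first combination encountered.
    """
    return max(combos, key=evaluate)


base_classifiers = ["RF", "SVM", "KNN", "LGBM", "CatBoost"]
combos = all_combinations(base_classifiers)
```

Because the number of subsets grows exponentially with the number of base classifiers, exhaustive comparison is practical for five classifiers but would become costly for much larger pools.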

3.3.3. Optimizing Hyperparameters by GA

After determining the best combination of base classifiers, the combination of multiple algorithms can give full play to the advantages by selecting the appropriate hyperparameter. GA is used to optimize the hyperparameter combinations of the classifiers. The optimization process is shown in Figure 4.
Figure 4. Flow chart of GA optimization.
First of all, the initial population is randomly generated. Several chromosomes form the first generation of feasible solutions to start the iteration. Taking accuracy as the fitness function, the next-generation population is produced by crossover, mutation, and selection. Then the best individual in the population is selected generation by generation. If the iteration meets the termination condition, the chromosome with the highest fitness generated in this iteration is output as the optimized solution. Otherwise, the iteration continues until the stop condition is met or the maximum number of iterations is reached.
Assuming the selected optimized combination contains n base classifiers, parameter j of classifier i is $p_{ij}$ with the value range $[a_{ij}, c_{ij}]$. If the number of parameters of classifier i is $s_i$, the parameter vector of classifier i can be expressed as $p_i = [p_{i1}, p_{i2}, \ldots, p_{is_i}]$, and the parameter matrix is

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1s_1} \\ p_{21} & p_{22} & \cdots & p_{2s_2} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{ns_n} \end{bmatrix}$$

The optimization method takes each $p_i$ in the matrix and iterates within the set range of parameters. The parameter setting is considered optimal when the accuracy reaches its maximum.
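One way to represent the parameter ranges in code is a nested dictionary with one entry per classifier, each holding its own (lower, upper) bounds, which also accommodates classifiers with different numbers of parameters. The sketch below is illustrative only: the hyperparameter names and ranges are assumptions chosen for the example and are not the values from Table 4.

```python
import random

# Hypothetical search space: classifier -> hyperparameter -> (lower, upper) bound.
SEARCH_SPACE = {
    "KNN": {"n_neighbors": (1, 30)},
    "LGBM": {"num_leaves": (8, 128), "n_estimators": (50, 500)},
}


def random_candidate(space, rng):
    """Draw one integer value per hyperparameter, uniformly within its range."""
    return {clf: {name: rng.randint(low, high)
                  for name, (low, high) in params.items()}
            for clf, params in space.items()}


def within_ranges(candidate, space):
    """Check that every value of a candidate lies inside its declared range."""
    return all(space[clf][name][0] <= value <= space[clf][name][1]
               for clf, params in candidate.items()
               for name, value in params.items())


rng = random.Random(42)  # fixed seed so the example is repeatable
candidate = random_candidate(SEARCH_SPACE, rng)
```

A candidate produced this way corresponds to one point in the parameter matrix, with each row holding the parameters of one classifier.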

3.4. GA-StackingMD Description

Input: Data set represented by feature matrix.
Output: Optimized parameter combination, and fitness function value.
Step 1: Integrating all possible base model combinations by Stacking, setting accuracy as the fitness function f(x), and selecting the combinations with the highest accuracy.
Step 2: Setting the hyperparameters of each base model and their value ranges, as well as the population size S, the maximum number of iterations M, and the fitness change threshold t.
Step 3: Encoding the hyperparameters in binary and initializing the population.
Step 4: Carrying out natural selection, crossover, and gene mutation, and keeping the population size no more than S.
Step 5: Generating new populations iteratively, and selecting the parameters of the optimized combination and the corresponding fitness function value.
Step 6: If two adjacent iterations of the algorithm meet the stop condition $|f_i(x) - f_{i-1}(x)| \le t$, the algorithm stops. Otherwise, continue to Step 7.
Step 7: If the algorithm reaches the maximum number of iterations M, the algorithm stops; otherwise, it goes back to Step 4.
The stop condition in Step 6, $|f_i(x) - f_{i-1}(x)| \le t$, is the stop condition of the genetic algorithm and has nothing to do with the classifiers. Here $f_i(x)$ represents the accuracy at iteration i. When the difference between the accuracy at iteration i and at iteration i − 1 falls to the threshold t or below, that is, when the accuracy is stable, the stop condition of the genetic algorithm is reached. If the accuracy does not stabilize, the algorithm iterates until the maximum number of iterations. To sum up, there are two stop conditions for GA-StackingMD: the accuracy becomes stable, or the maximum number of iterations M is reached.
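Steps 3 to 7 can be sketched as a small genetic algorithm over binary chromosomes. This is a minimal illustration under several assumptions: tournament selection, one-point crossover, bit-flip mutation, and elitism are common choices that the source does not specify, and `toy_fitness` is a hypothetical stand-in for the stacked model's accuracy, which would otherwise require training the classifiers.

```python
import random


def decode(bits, low, high):
    """Map a list of bits to an integer in [low, high]."""
    value = int("".join(map(str, bits)), 2)
    return low + value * (high - low) // (2 ** len(bits) - 1)


def run_ga(fitness, n_bits, pop_size=20, max_iter=50, tol=1e-4,
           p_mut=0.05, seed=0):
    """Return (best chromosome, its fitness, iterations used)."""
    rng = random.Random(seed)
    # Step 3: random binary population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    previous = None
    for iteration in range(1, max_iter + 1):
        # Step 4a: selection by binary tournament.
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        # Step 4b: one-point crossover on consecutive parent pairs.
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randint(1, n_bits - 1)
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Step 4c: bit-flip mutation.
        pop = [[bit ^ 1 if rng.random() < p_mut else bit for bit in child]
               for child in children]
        pop[0] = best  # elitism: never lose the best individual found so far
        # Step 5: best individual of the new generation.
        best = max(pop, key=fitness)
        current = fitness(best)
        # Step 6: stop when the best fitness changes by at most tol.
        if previous is not None and abs(current - previous) <= tol:
            break
        previous = current
    # Step 7 is the loop bound: at most max_iter generations.
    return best, fitness(best), iteration


def toy_fitness(bits):
    """Hypothetical accuracy surrogate that peaks when the decoded value is 7."""
    return 1.0 - abs(decode(bits, 1, 32) - 7) / 31


best_bits, best_score, used = run_ga(toy_fitness, n_bits=5)
```

With elitism the best fitness never decreases, so the Step 6 test can fire as soon as one generation fails to improve; a practical implementation might require stability over several consecutive generations before stopping.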

4. Experiments

4.1. Experimental Environment and Evaluation Index

The experiments are performed on a 64-bit Windows operating system with 16 GB of memory and Python 3.8.3. The data sets are CIC-AndMal2017 and CICMalDroid2020.
Define four parameters: TP (True Positive), FP (False positive), FN (False Negative), and TN (True Negative), to calculate the following evaluation indexes.
(1) Precision. It indicates the proportion of correctly identified positive samples among all samples identified as positive.
Precision = TP/(TP + FP)
(2) Accuracy. It indicates the proportion of correctly identified positive and negative samples among all samples.
Accuracy = (TP + TN)/(TP + TN + FP + FN)
(3) Recall. It indicates the proportion of correctly identified positive samples among all actual positive samples.
Recall = TP/(TP + FN)
(4) F1-score. It is the harmonic mean of precision and recall, which comprehensively evaluates the classification performance.
F1-score = 2 × Precision × Recall/(Precision + Recall)
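The four indexes follow directly from the confusion-matrix counts, as the short sketch below shows. The counts in the example are made up for illustration and are not results from the paper; the function assumes non-zero denominators.

```python
def evaluation_indexes(tp, fp, fn, tn):
    """Compute precision, accuracy, recall, and F1-score from confusion counts.

    Assumes tp + fp > 0, tp + fn > 0, and precision + recall > 0.
    """
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "accuracy": accuracy,
            "recall": recall, "f1": f1}


# Illustrative counts only (not taken from the experiments).
scores = evaluation_indexes(tp=90, fp=10, fn=5, tn=95)
```

For these counts, precision is 90/100, recall is 90/95, accuracy is 185/200, and the F1-score equals 2·TP/(2·TP + FP + FN) = 180/195.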

4.2. Experiment Steps

4.2.1. Key Feature Subset Construction

We extract Permission, API, Dalvik, Intent, and Hardware from CIC-AndMal2017 and CICMalDroid2020, and respectively get 7666 and 14,600 raw features. After the two-step feature selection, we select seven feature subsets to compare their detection performance. The feature dimension ranges from 800 to 1400 with an interval of 100. The results are shown in Figure 5.
Figure 5. (a) CIC-AndMal2017: Comparison of seven feature sets; (b) CICMalDroid2020: Comparison of seven feature sets.
The evaluation indexes on the two data sets show a similar trend. Among the seven groups, accuracy first increases and then decreases as the number of features grows. The overall indexes are best when the number of features is 1000. These 1000 selected features are therefore retained to construct the key feature subsets used in subsequent experiments.

4.2.2. Compare Stacking with Single Classifiers

Five independent algorithms SVM, KNN, LGBM, CatBoost, and RF are applied to malware detection on the two data sets, and then they are composed as the base classifiers of Stacking. Table 1 and Table 2 show the results of the two data sets, respectively.
Table 1. CIC-AndMal2017 detection results.
Table 2. CICMalDroid2020 detection results.
From the perspective of single algorithms, LGBM has advantages over others. But after combining the five methods, Stacking achieves the best detection performance with 96.55% and 98.56% accuracy. The results show that compared with single algorithms, the integrated Stacking effectively improves malware detection performance.

4.2.3. The Optimized Combination of Base Classifier Selection

The first step of GA-StackingMD is to select the best combination of the base classifiers. The five algorithms have 31 combinations. The parameters are fixed in order to compare the performance of the different combinations. Taking CIC-AndMal2017 as an example, the results are shown in Table 3.
Table 3. Comparison of different combinations on CIC-AndMal2017.
The 31 combinations are sorted by accuracy. The combination “KNN+LGBM” achieves the best accuracy of 94.67% and also the best F1-score, precision, and recall. It should be noted that several combinations reach the same accuracy: in this step, eight combinations tie for the highest accuracy. Since all of them include the two algorithms KNN and LGBM, we choose the KNN and LGBM combination as the best one. In the same way, the selected combination for CICMalDroid2020 is “RF+KNN+LGBM”, with 94.30% accuracy and 96.18% F1-score.

4.2.4. Hyperparameters’ Optimization by GA

The hyperparameters of the base classifiers are different, and some hyperparameters affect the performance of the algorithm, while others do not. Thus GA is used to optimize the specific hyperparameters which are influential. Taking CICMalDroid2020 as an example, we set the value range of these parameters as in Table 4, and the parameters selected after GA are shown in Table 5.
Table 4. The ranges of several hyperparameters.
Table 5. The selected hyperparameters by GA.
After parameter selection, the detection results on the two data sets are shown in Figure 6.
Figure 6. (a) CIC-AndMal2017: Comparison of GA-StackingMD and Initial Stacking Model; (b) CICMalDroid2020: Comparison of GA-StackingMD and Initial Stacking Model.
The first step is setting the parameter selection ranges, then in the iterative process, the classifiers run with the given parameters, and the accuracies of various parameter combinations are compared in order to select the best combination. After this process, the best combination of these parameters is selected within the given range. The results in Figure 6 show that the optimized hyperparameters could effectively improve malware detection. Among them, in the data set CIC-AndMal2017, four evaluation indicators have significantly improved. In the data set CICMalDroid2020, although recall decreased by 0.03 percentage points, accuracy, F1-scores, and precision have improved to varying degrees, of which accuracy has increased by 1.88 percentage points and 0.1 percentage points, respectively.

4.2.5. Comparison of GA-StackingMD and Other Classifiers

The proposed GA-StackingMD is compared with other algorithms that are widely used in state-of-the-art classification methods, including XGB [35] (Extreme Gradient Boosting), NB [36] (Naive Bayes), CART [37] (Classification And Regression Tree), MLP [38] (Multi-layer Perceptron), and ERT [39] (Extremely Randomized Trees). The results on the two data sets are shown in Figure 7.
Figure 7. (a) CIC-AndMal2017: Comparison of GA-StackingMD and other classifiers; (b) CICMalDroid2020: Comparison of GA-StackingMD and other classifiers.
As shown in Figure 7, GA-StackingMD performs best on both data sets. In addition, XGB achieves a similar detection result.
We analyze the presented method from five aspects, including selections of key feature subsets and best combinations of base classifiers, comparison of initial Stacking with single algorithms and the proposed algorithms with other widely used algorithms, and optimization of the hyperparameters by GA. The experiment results show that the proposed GA-StackingMD could achieve better detection results.

4.3. Comparison with Literature

In order to compare with the existing literature, this paper summarizes some studies that have used the same data set and summarizes the methods they use, including the features used, classification methods, and experimental results.
Table 6 summarizes the proposed model together with relevant research and performance results from the literature. It is noteworthy that Reference [40] also uses the Stacking integration model. Three machine learning models, ET (ExtraTree), XGB, and RF, are analyzed and compared to select the most effective integration model. The highest value was obtained with the ensemble model in the voting structure, with an accuracy of 90.4%. The main differences between that study and this paper lie in two aspects. On the one hand, its selection of the base classifier and meta-classifier is relatively random, and it does not propose how to choose a better model combination or how to set the hyperparameters. On the other hand, its number of data set samples is relatively small.
Table 6. GA-StackingMD comparison with the relevant studies in the literature.
In the study [41], the extracted features are first processed by a CNN, and then an adaptive network-based fuzzy inference system (ANFIS) model is used to classify the features obtained. The proposed model achieved 94.67% accuracy on the CICMalDroid2020 data set. Reference [42] proposed F2DC, a dynamic method for Android malware classification based on network traffic. The byte sequence of application data transmitted over the TCP/IP network is called the raw payload. This method characterizes Android malware by its raw payload and uses a CNN to learn a latent representation of the raw payload for effective classification. Reference [43] extracted system calls as features for Android malware detection through dynamic analysis, compared five different machine learning algorithms, and found that KNN performed best, reaching 85% accuracy. Reference [44] first analyzed the text features and visual features of the samples and then used a CNN to mine deep features. After this complex multi-stage feature engineering, the balanced features were input into the Voting-Based Extensible Learning model, and the accuracy was 97.76% on the CICMalDroid2020 data set. The presented method was carefully compared with five other machine learning methods. Ksibi et al. [45] explored detecting Android malware based on images rather than features. The sample files are first converted into color images, and a deep CNN is then developed on the produced images to extract higher-level semantics associated with malware. A customized CNN and the deep convolutional neural network model VGG-16 were compared for Android malware detection, with an accuracy of 97.81%. Compared with these models, GA-StackingMD, the GA-based ensemble approach proposed in our work, demonstrates more competitive performance.

5. Conclusions

The open architecture and wide usage of Android make it vulnerable to malware attacks. In recent years, with the improvement of code obfuscation, packing, and other technologies, malware has become easier to produce and the number of variants is increasing. Research on Android malware detection should take practical applications into consideration. How to reduce the computational cost while improving the detection accuracy is the problem that needs to be addressed.
In order to reduce the high dimension caused by multiple features fusion, a two-step feature selection method based on InfoGain and Chi-square Test is proposed, which reduces the original feature dimensions of the two data sets to 13% and 10%. A Stacking model with five base classifiers is implemented which significantly improves the detection accuracy compared with single classifiers. Furthermore, GA is used to optimize the hyperparameters of the Stacking, and finally achieves the accuracies of 98.43% and 98.66% on the two data sets.
In future work, we will further optimize the selection of base classifiers, so that they can give full play to their respective advantages, and further improve the selection range of hyperparameters, in order to improve the effectiveness and scientific nature of the algorithm proposed in this paper and promote the application in the virtual environment. We are trying to implement the stacking algorithm on a distributed platform, thus significantly reducing the running consumption of the algorithm, which can play a better role in real-time applications.

Author Contributions

Conceptualization, N.X. and Z.Q.; methodology, N.X. and Z.Q.; software, X.D.; validation, N.X. and Z.Q.; formal analysis, N.X., Z.Q. and X.D.; investigation, Z.Q.; resources, N.X.; data curation, N.X. and Z.Q.; writing—original draft preparation, N.X. and Z.Q.; writing—review and editing, Z.Q.; visualization, X.D.; supervision, X.D.; project administration, N.X.; funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Science and Technology Research Project of the Education Department of Jilin Province (Grant No. JJKH20231539KJ) and the Opening Project of Guangdong Province Key Laboratory of Information Security Technology (Grant No. 2020B1212060078).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data used in this paper can be obtained by contacting the authors of this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mobile Operating System Market Share Worldwide. Available online: https://gs.statcounter.com/os-market-share/mobile/worldwide (accessed on 30 October 2022).
  2. Mobile Malware Evolution. 2021. Available online: https://securelist.com/mobile-malware-evolution-2021/105876/ (accessed on 15 November 2022).
  3. Tsfaty, C.; Fire, M. Malicious Source Code Detection Using Transformer. arXiv 2022, arXiv:2209.07957.
  4. Gao, Y.; Lu, Z.; Luo, Y. Survey on malware anti-analysis. In Proceedings of the Fifth International Conference on Intelligent Control and Information Processing, Dalian, China, 18–20 August 2014; pp. 270–275.
  5. Singh, J.; Singh, J. A survey on machine learning-based malware detection in executable files. J. Syst. Archit. 2021, 112, 101861.
  6. Qiang, W.; Yang, L.; Jin, H. Efficient and Robust Malware Detection Based on Control Flow Traces Using Deep Neural Networks. Comput. Secur. 2022, 122, 102871.
  7. Lindorfer, M.; Neugschwandtner, M.; Platzer, C. Marvin: Efficient and comprehensive mobile app classification through static and dynamic analysis. In Proceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference, Taichung, Taiwan, 1–5 July 2015; pp. 422–433.
  8. Zhu, H.; Li, Y.; Li, R.; Li, J.; Song, H. SEDMDroid: An enhanced stacking ensemble of deep learning framework for Android malware detection. IEEE Trans. Netw. Sci. Eng. 2020, 8, 984–994.
  9. Wang, X.; Zhang, L.; Zhao, K.; Ding, X.; Yu, M. MFDroid: A Stacking Ensemble Learning Framework for Android Malware Detection. Sensors 2022, 22, 2597.
  10. Cen, L.; Gates, C.S.; Si, L.; Li, N. A probabilistic discriminative model for android malware detection with decompiled source code. IEEE Trans. Dependable Secur. Comput. 2014, 12, 400–412.
  11. Saxe, J.; Berlin, K. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 20–22 October 2015; pp. 11–20.
  12. Singh, D.; Karpa, S.; Chawla, I. “Emerging Trends in Computational Intelligence to Solve Real-World Problems” Android Malware Detection Using Machine Learning. In Proceedings of the International Conference on Innovative Computing and Communications: Proceedings of ICICC 2021, Delhi, India, 20–21 February 2021; pp. 329–341.
  13. Vashishtha, L.K.; Chatterjee, K.; Sahu, S.K.; Mohapatra, D.P. A Random Forest-Based Ensemble Technique for Malware Detection. In Proceedings of the Information Systems and Management Science: Conference Proceedings of 4th International Conference on Information Systems and Management Science (ISMS) 2021, Msida, Malta, 14–15 December 2021; pp. 454–463.
  14. Wang, Z.; Li, K.; Hu, Y.; Fukuda, A.; Kong, W. Multilevel permission extraction in android applications for malware detection. In Proceedings of the 2019 International Conference on Computer, Information and Telecommunication Systems (CITS), Beijing, China, 28–31 August 2019; pp. 1–5.
  15. Peiravian, N.; Zhu, X. Machine learning for android malware detection using permission and api calls. In Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; pp. 300–305.
  16. Han, W.; Xue, J.; Wang, Y.; Huang, L.; Kong, Z.; Mao, L. MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Comput. Secur. 2019, 83, 208–233.
  17. de la Puerta, J.G.; Sanz, B. Using dalvik opcodes for malware detection on android. Log. J. IGPL 2017, 25, 938–948.
  18. Zhang, J.; Qin, Z.; Zhang, K.; Yin, H.; Zou, J. Dalvik opcode graph based android malware variants detection using global topology features. IEEE Access 2018, 6, 51964–51974.
  19. Sewak, M.; Sahay, S.K.; Rathore, H. Comparison of deep learning and the classical machine learning algorithm for the malware detection. In Proceedings of the 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Busan, Republic of Korea, 27–29 June 2018; pp. 293–296.
  20. Feizollah, A.; Anuar, N.B.; Salleh, R.; Suarez-Tangil, G.; Furnell, S. Androdialysis: Analysis of android intent effectiveness in malware detection. Comput. Secur. 2017, 65, 121–134.
  21. Santos, I.; Devesa, J.; Brezo, F.; Nieves, J.; Bringas, P.G. Opem: A static-dynamic approach for machine-learning-based malware detection. In Proceedings of the International Joint Conference CISIS’12-ICEUTE’ 12-SOCO’ 12 Special Sessions, Ostrava, Czech Republic, 5–7 September 2012; pp. 271–280.
  22. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
  23. Maulik, U.; Bandyopadhyay, S. Genetic algorithm-based clustering technique. Pattern Recognit. 2000, 33, 1455–1465.
  24. Mala, C.; Sridevi, M. Multilevel threshold selection for image segmentation using soft computing techniques. Soft Comput. 2016, 20, 1793–1810.
  25. Cpalka, K.; Łapa, K.; Przybył, A. A new approach to design of control systems using genetic programming. Inf. Technol. Control 2015, 44, 433–442.
  26. Qiang, X.J. Computer application under the management of network information security technology using genetic algorithm. Soft Comput. 2022, 26, 7871–7876.
  27. Changxing, Q.; Yiming, B.; Yong, L. Improved BP neural network algorithm model based on chaos genetic algorithm. In Proceedings of the 2017 3rd IEEE International Conference on Control Science and Systems Engineering (ICCSSE), Beijing, China, 17–19 August 2017; pp. 679–682.
  28. Elhefnawy, R.; Abounaser, H.; Badr, A. A hybrid nested genetic-fuzzy algorithm framework for intrusion detection and attacks. IEEE Access 2020, 8, 98218–98233.
  29. Yildiz, O.; Doğru, I.A. Permission-based android malware detection system using feature selection with genetic algorithm. Int. J. Softw. Eng. Knowl. Eng. 2019, 29, 245–262.
  30. Sesmero, M.P.; Ledezma, A.I.; Sanchis, A. Generating ensembles of heterogeneous classifiers using stacked generalization. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2015, 5, 21–34.
  31. Zheng, R.; Wang, Q.; Lin, Z.; Jiang, Z.; Fu, J.; Peng, G. Cryptocurrency malware detection in real-world environment: Based on multi-results stacking learning. Appl. Soft Comput. 2022, 124, 109044.
  32. Jiang, W.; Chen, Z.; Xiang, Y.; Shao, D.; Ma, L.; Zhang, J. SSEM: A novel self-adaptive stacking ensemble model for classification. IEEE Access 2019, 7, 120337–120349.
  33. Lashkari, A.H.; Kadir, A.; Taheri, L.; Ghorbani, A.A. Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification. In Proceedings of the 2018 International Carnahan Conference on Security Technology (ICCST), Montreal, QC, Canada, 22–25 October 2018; pp. 1–7. [Google Scholar]
  34. Mahdavifar, S.; Kadir, A.F.A.; Fatemi, R.; Alhadidi, D.; Ghorbani, A.A. Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning. In Proceedings of the 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 515–522. [Google Scholar]
  35. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  36. Webb, G.I.; Keogh, E.; Miikkulainen, R. Naïve Bayes. Encycl. Mach. Learn. 2010, 15, 713–714. [Google Scholar]
  37. Rutkowski, L.; Jaworski, M.; Pietruczuk, L.; Duda, P. The CART decision tree for mining data streams. Inf. Sci. 2014, 266, 1–15. [Google Scholar] [CrossRef]
  38. Taud, H.; Mas, J. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455. [Google Scholar]
  39. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
  40. Arslan, R.S. Identify Type of Android Malware with Machine Learning Based Ensemble Model. In Proceedings of the 2021 5th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 21–23 October 2021; pp. 628–632. [Google Scholar]
  41. Atacak, İ.; Kılıç, K.; Doğru, İ.A. Android malware detection using hybrid ANFIS architecture with low computational cost convolutional layers. PeerJ Comput. Sci. 2022, 8, e1092. [Google Scholar] [CrossRef]
  42. Lu, T.; Wang, J. F2DC: Android malware classification based on raw traffic and neural networks. Comput. Netw. 2022, 217, 109320. [Google Scholar]
  43. Shakya, S.; Dave, M. Analysis, Detection, and Classification of Android Malware using System Calls. arXiv 2022, arXiv:2208.06130. [Google Scholar]
  44. Ullah, F.; Alsirhani, A.; Alshahrani, M.M.; Alomari, A.; Naeem, H.; Shah, S.A. Explainable malware detection system using transformers-based transfer learning and multi-model visual representation. Sensors 2022, 22, 6766. [Google Scholar] [CrossRef]
  45. Ksibi, A.; Zakariah, M.; Almuqren, L.A.; Alluhaidan, A.S. Deep Convolution Neural Networks and Image Processing for Malware Detection. Preprint (Version 1). 27 January 2023. Available online: https://www.researchsquare.com/article/rs-2508967/v1 (accessed on 4 February 2023).
  46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
