Tree-Based Classifier Ensembles for PE Malware Analysis: A Performance Revisit
Abstract
1. Introduction
- (a)
- (b)
- The performance differences among classifier ensembles on the most recent datasets, i.e., BODMAS [27], Kaggle, and CIC-MalMem-2022 [28], are benchmarked using statistical significance tests. This study is among the first to use the recent BODMAS and CIC-MalMem-2022 malware datasets. On BODMAS and CIC-MalMem-2022, our proposed approaches outperform other baselines, with accuracy rates of 99.96% and 100%, respectively.
- (c)
- An in-depth exploratory analysis of each malware dataset is presented to better understand its characteristics. The analysis includes a feature correlation analysis and a t-SNE visualization of pairwise sample similarities.
2. Related Work
3. Materials and Methods
3.1. Datasets
- (a)
- BODMAS [27]. The dataset contains 57,293 malicious and 77,142 benign samples (134,435 in total). The malware samples were arbitrarily picked each month from the internal malware database of a security company. The malicious samples were collected between 29 August 2019 and 30 September 2020, and the benign samples between 1 January 2007 and 30 September 2020. To reflect the distribution of benign PE binaries in real-world traffic, the benign samples were also drawn from the security company's database. The SHA-256 hash, the actual PE binary, and a pre-extracted feature vector are provided for each malicious sample, whereas only the SHA-256 hash and the pre-extracted feature vector are provided for each benign sample. Each sample is described by 2381 input features and one class label, where 0 denotes benign and 1 denotes malicious.
- (b)
- Kaggle (https://tinyurl.com/22z7u898, accessed on 25 August 2022). The dataset was developed using a Python library (https://tinyurl.com/w75zewvr, accessed on 25 August 2022), a multi-platform module used to parse and work with PE files. The Kaggle dataset contains 14,599 malicious and 5012 benign samples (19,611 in total). It comprises 78 input features describing PE header fields and one class label attribute.
- (c)
- CIC-MalMem-2022 [28]. Unlike the two above-mentioned datasets, CIC-MalMem-2022 is an obfuscated malware dataset intended for evaluating memory-based obfuscated malware detection algorithms. The dataset was designed to mimic a realistic scenario as accurately as possible using renowned malware. Obfuscated malware is malicious software that conceals itself to escape detection and eradication. The dataset consists of an equal ratio of malicious and benign memory dumps (58,596 samples in total). In addition, CIC-MalMem-2022 is made up of 56 features that serve as inputs for machine learning algorithms.
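The class balance of the three datasets follows directly from the sample counts quoted above. The short sketch below only re-derives the totals and the malicious share; the dictionary layout and function name are illustrative and not part of the original study.

```python
# Sample counts as stated in the dataset descriptions above.
DATASETS = {
    "BODMAS": {"malicious": 57_293, "benign": 77_142},
    "Kaggle": {"malicious": 14_599, "benign": 5_012},
    # CIC-MalMem-2022: 58,596 samples in total with an equal class ratio.
    "CIC-MalMem-2022": {"malicious": 58_596 // 2, "benign": 58_596 // 2},
}


def class_balance(counts):
    """Return (total samples, fraction of malicious samples) for one dataset."""
    total = counts["malicious"] + counts["benign"]
    return total, counts["malicious"] / total


for name, counts in DATASETS.items():
    total, malicious_share = class_balance(counts)
    print(f"{name}: total={total}, malicious share={malicious_share:.3f}")
```

Under these counts, BODMAS leans toward benign samples (about 43% malicious), Kaggle leans toward malicious samples (about 74%), and CIC-MalMem-2022 is balanced, which is one reason to report MCC alongside accuracy.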
3.2. Tree-Based Ensemble Learning
- (a)
- Random forest [19]. As its name implies, a random forest is a tree-based ensemble in which each tree depends on a collection of random variables. The original formulation of the random forest algorithm provided by Breiman [19] is as follows. A random forest employs trees as its base learners. For training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i = (x_{i,1}, \ldots, x_{i,p})^T$ represents the $p$ predictors and $y_i$ represents the response, and a specific manifestation $\theta_j$ of $\Theta_j$, the fitted tree is given as $\hat{h}_j(x, \theta_j, \mathcal{D})$. More precisely, the steps involved in the random forest algorithm are described in Algorithm 1. We use a fast random forest implementation called ranger [46], available in R, which is suitable for high-dimensional data such as ours. The list of random forest hyperparameters for each malware dataset is provided in Table 2. The search space for hyperparameter tuning is as follows: number of trees = {50, 100, 250, 500, 750, 1000}, split rule = {‘gini’, ‘extratrees’}, minimum node size = {1, 2, …, 10}, mtry = number of features × {0.05, 0.15, 0.25, 0.333, 0.4}, sample fraction = {0.5, 0.63, 0.8}, and replace = {TRUE, FALSE}.
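| Algorithm 1: A common procedure of the random forest algorithm for a classification task. | 
| Training: | 
| Require: Original training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, with $x_i = (x_{i,1}, \ldots, x_{i,p})^T$ | 
| 1. for $j = 1$ to $J$ | 
| 2. Perform a bootstrap sample $\mathcal{D}_j$ of size $N$ from $\mathcal{D}$. | 
| 3. Using binary recursive partitioning, fit a tree on $\mathcal{D}_j$. | 
| 4. end for | 
| Testing: | 
| Require: An instance to be classified $x$. | 
| 1. $\hat{f}(x) = \arg\max_{y} \sum_{j=1}^{J} I\left(\hat{h}_j(x) = y\right)$ | 
| where $\hat{h}_j(x)$ denotes the response variable at $x$ using the j-th tree. | 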
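The training and testing steps of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the ranger implementation used in the study: to stay short it swaps full binary recursive partitioning for a one-split tree (a decision stump) chosen by Gini impurity, and the function names and the toy `mtry` handling are assumptions made for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, mtry, rng):
    """Fit a one-split tree using a random subset of `mtry` features.

    Simplification: a real random forest grows a full tree by recursive
    partitioning; a single split keeps the sketch short.
    """
    majority = Counter(y).most_common(1)[0][0]
    best = None  # (weighted impurity, feature, threshold, left label, right label)
    for f in rng.sample(range(len(X[0])), mtry):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split found: always predict the majority class
        return lambda x: majority
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label


def fit_forest(X, y, n_trees=25, mtry=1, seed=0):
    """Training: draw J bootstrap samples of size N and fit one tree on each."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest


def predict(forest, x):
    """Testing: majority vote over the predictions of all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy, hypothetical data: one feature, label 0 = benign, 1 = malicious.
X_toy = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y_toy = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X_toy, y_toy, n_trees=25, mtry=1, seed=0)
print(predict(forest, [0.05]), predict(forest, [0.95]))
```

The bootstrap step corresponds to sampling with replacement; setting a sample fraction below 1 without replacement, as some of the tuned configurations do, would replace that line with a draw of distinct indices.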
- (b)
- Gradient Boosting Decision Trees. In this paper, we also consider various tree-based boosting ensemble approaches for malware detection, namely XGBoost [23], CatBoost [24], GBM [25], and LightGBM [26]. As a rule, a GBDT ensemble is a linear additive model in which a tree-based classifier (e.g., CART) is utilized as the base model. Let $\mathcal{D} = \{(x_k, y_k)\}_{k=1}^{n}$ denote the malware dataset comprising $m$ features and $n$ samples. Considering a collection of $j$ trees, the prediction output $\hat{y}^{(j)}(x)$ for an input $x$ is obtained by summing the predictions from each tree $f_i(x)$, as shown in the following formula: $\hat{y}^{(j)}(x) = \sum_{i=1}^{j} f_i(x)$, where $f_i$ represents the output of the i-th regression tree of the j-tree ensemble. GBDTs minimize a regularized objective function $\mathcal{L}^{(j+1)}$ in order to create the $(j+1)$-th tree, as follows: $\mathcal{L}^{(j+1)} = \sum_{k=1}^{n} l\left(y_k, \hat{y}^{(j)}(x_k) + f_{j+1}(x_k)\right) + \Omega(f_{j+1})$, where $l(\cdot)$ represents the loss function and $\Omega$ is a regularization function to control over-fitting. The loss function measures the difference between the prediction $\hat{y}$ and the target $y$. The regularization function is defined as $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$, where $T$ and $w$ indicate the number of leaves and the leaf weights in the tree, respectively.
- (i)
- XGBoost [23]. XGBoost is a scalable end-to-end tree-boosting strategy that generates a large number of sequentially trained trees. Each succeeding tree corrects the errors made by the preceding ones, resulting in an efficient classification model. Through sparsity-aware metrics and multi-threading approaches, XGBoost not only mitigates overfitting but also speeds up most real-world computational tasks. This study utilizes two different XGBoost implementations, namely the native implementation in R [47] and the H2O implementation [48]. The search space of the native XGBoost hyperparameters is as follows: maximum depth = {2, 3, …, 24}, eta = {0, 0.1, 0.2, …, 1.0}, subsample = {0.5, 0.6, 0.7, 0.8}, and column sample by tree = {0.5, 0.6, 0.7, 0.8, 0.9}. The search space of the XGBoost hyperparameters implemented in H2O is as follows: maximum depth = {1, 3, 5, …, 29}, sample rate = {0.2, 0.3, …, 1}, column sample rate = {0.2, 0.21, 0.22, …, 1}, column sample rate per tree = {0.2, 0.21, 0.22, …, 1}, and minimum rows = {0, 1, …, log(number of rows) − 1}. The final learning parameters for both XGBoost implementations are presented in Table 3.
- (ii)
- CatBoost [24]. CatBoost is built with symmetric decision trees. It is acknowledged as a classification algorithm capable of producing excellent performance and up to ten times the prediction speed of methods that do not employ symmetric decision trees. Unlike other GBDT algorithms, CatBoost addresses gradient bias and prediction shift to increase prediction accuracy and generalization ability on large datasets. In addition, CatBoost comprises two essential algorithms: ordered boosting, which estimates leaf values during tree structure selection to avoid overfitting, and a unique technique for handling categorical data throughout the training process. An implementation of CatBoost in R is employed in this paper, and the search space of each hyperparameter is as follows: depth = {1, 2, …, 10}, learning rate = {0.03, 0.001, 0.01, 0.1, 0.2, 0.3}, l2 leaf regularization = {3, 1, 5, 10, 100}, border count = {32, 5, 10, 20, 50, 100, 200}, and boosting type = {“Ordered”, “Plain”}. The final learning parameters of CatBoost for each malware dataset are given in Table 4.
- (iii)
- Gradient boosting machine [25]. GBM is the first implementation of GBDT to utilize a forward learning technique. Trees are generated sequentially, with later trees depending on the results of the preceding trees. Formally, GBM is achieved by iteratively constructing a collection of functions $F^0, F^1, \ldots, F^t$, given a loss function $L(y_i, F^t)$. We can improve our estimates of $y_i$ by finding another function $F^{t+1} = F^t + h^{t+1}(x)$, such that $h^{t+1}$ reduces the estimated value of the loss function. In this study, we adopt the GBM implementation in H2O, and the hyperparameter search space is specified as follows: maximum depth = {1, 3, 5, …, 29}, sample rate = {0.2, 0.3, …, 1}, column sample rate per tree = {0.2, 0.21, 0.22, …, 1}, column sample rate change per level = {0.9, 0.91, …, 1.1}, number of bins = powers of 2, and minimum rows = {0, 1, …, log(number of rows) − 1}. Table 5 lists the final GBM hyperparameters used on each malware dataset.
- (iv)
- LightGBM [26]. LightGBM is an inexpensive gradient boosting tree implementation that employs histogram and leaf-wise techniques to increase both processing speed and prediction precision. The histogram method is used to combine features that are incompatible with each other. The core idea is to discretize continuous features into n integers and then build a histogram of width n. Based on the discretized values of the histogram, the training data are scanned to locate the split points of the decision tree. The histogram method considerably reduces the runtime complexity. In addition, LightGBM grows trees leaf by leaf: the leaf with the greatest splitting gain is found and then split. Leaf-wise optimization may result in overfitting and a deeper decision tree. To ensure high efficiency and prevent overfitting, LightGBM applies a maximum depth constraint to leaf-wise growth. In this study, we employ a LightGBM implementation in R with the following hyperparameter search space: maximum bin = {100, 255}, maximum depth = {1, 2, …, 15}, number of leaves = powers of 2, minimum data in leaf = {100, 200, …, 1000}, learning rate = {0.01, 0.02, …, 0.3}, lambda l1 = {0, 10, 20, …, 100}, lambda l2 = {0, 10, 20, …, 100}, feature fraction = {0.5, 0.9}, bagging fraction = {0.5, 0.9}, path smooth = {, }, and minimum gain to split = {0, 1, 2, …, 15}. Table 6 contains the list of all final LightGBM hyperparameters used for each malware dataset.
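The additive GBDT model and its regularized objective described in item (b) can be made concrete with a small numeric sketch. Two things here are assumptions for illustration only: a squared-error loss stands in for the unspecified loss function, and the regularizer follows the XGBoost form $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$ [23]. Trees are represented as plain Python callables, and all names are hypothetical.

```python
def ensemble_predict(trees, x):
    """Additive model: the prediction is the sum of the outputs of all trees."""
    return sum(tree(x) for tree in trees)


def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * ||w||^2, with T the number of leaves."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(trees, new_tree, new_leaf_weights, X, y, gamma=1.0, lam=1.0):
    """Regularized objective for adding `new_tree` to the current ensemble.

    Assumption: squared-error loss l(y, y_hat) = (y - y_hat)^2.
    """
    loss = sum(
        (yi - (ensemble_predict(trees, xi) + new_tree(xi))) ** 2
        for xi, yi in zip(X, y)
    )
    return loss + regularization(new_leaf_weights, gamma, lam)


# Hypothetical toy ensemble: one constant tree plus a candidate two-leaf tree.
current_trees = [lambda x: 0.5]
candidate = lambda x: 0.25 if x[0] > 0 else -0.25
X_toy = [[1.0], [-1.0]]
y_toy = [1.0, 0.0]
print(objective(current_trees, candidate, [0.25, -0.25], X_toy, y_toy))
```

In this toy case the loss term is 0.125 and the penalty is 2.0625, so the penalty dominates; larger `gamma` or `lam` values push the booster toward trees with fewer leaves and smaller leaf weights.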
 
4. Results and Discussion
4.1. Exploratory Analysis
4.2. Comparison Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
List of Acronyms
| AUC | Area Under ROC Curve. | 
| CART | Classification and Regression Tree. | 
| CNN | Convolutional Neural Network. | 
| CPS | Cyber-Physical Systems. | 
| CV | Cross Validation. | 
| DBN | Deep Belief Network. | 
| DR | Detection Rate. | 
| DT | Decision Tree. | 
| FPR | False Positive Rate. | 
| GBDT | Gradient Boosting Decision Tree. | 
| GBM | Gradient Boosting Machine. | 
| IoT | Internet of Things. | 
| MCC | Matthews Correlation Coefficient. | 
| NB | Naive Bayes. | 
| PE | Portable Executable. | 
| RF | Random Forest. | 
| SAEs | Stacked AutoEncoders. | 
| SVM | Support Vector Machine. | 
| t-SNE | t-Distributed Stochastic Neighbor Embedding. | 
| TPR | True Positive Rate. | 
References
- Kleidermacher, D.; Kleidermacher, M. Embedded Systems Security: Practical Methods for Safe and Secure Software and Systems Development; Elsevier: Amsterdam, The Netherlands, 2012.
- Xhafa, F. Autonomous and Connected Heavy Vehicle Technology; Academic Press: Cambridge, MA, USA, 2022.
- Smith, D.J.; Simpson, K.G. Safety Critical Systems Handbook; Elsevier: Amsterdam, The Netherlands, 2010.
- Damodaran, A.; Troia, F.D.; Visaggio, C.A.; Austin, T.H.; Stamp, M. A comparison of static, dynamic, and hybrid analysis for malware detection. J. Comput. Virol. Hacking Tech. 2017, 13, 1–12.
- Mahdavifar, S.; Ghorbani, A.A. Application of deep learning to cybersecurity: A survey. Neurocomputing 2019, 347, 149–176.
- Zhang, W.; Wang, H.; He, H.; Liu, P. DAMBA: Detecting android malware by ORGB analysis. IEEE Trans. Reliab. 2020, 69, 55–69.
- Ucci, D.; Aniello, L.; Baldoni, R. Survey of machine learning techniques for malware analysis. Comput. Secur. 2019, 81, 123–147.
- Milosevic, J.; Malek, M.; Ferrante, A. Time, accuracy and power consumption tradeoff in mobile malware detection systems. Comput. Secur. 2019, 82, 314–328.
- Zhang, H.; Xiao, X.; Mercaldo, F.; Ni, S.; Martinelli, F.; Sangaiah, A.K. Classification of ransomware families with machine learning based on N-gram of opcodes. Future Gener. Comput. Syst. 2019, 90, 211–221.
- Maniriho, P.; Mahmood, A.N.; Chowdhury, M.J.M. A study on malicious software behaviour analysis and detection techniques: Taxonomy, current trends and challenges. Future Gener. Comput. Syst. 2022, 130, 1–18.
- Montes, F.; Bermejo, J.; Sanchez, L.; Bermejo, J.; Sicilia, J.A. Detecting Malware in Cyberphysical Systems Using Machine Learning: A Survey. KSII Trans. Internet Inf. Syst. (TIIS) 2021, 15, 1119–1139.
- Singh, J.; Singh, J. Detection of malicious software by analyzing the behavioral artifacts using machine learning algorithms. Inf. Softw. Technol. 2020, 121, 106273.
- Amer, E.; Zelinka, I. An ensemble-based malware detection model using minimum feature set. Mendel 2019, 25, 1–10.
- Atluri, V. Malware Classification of Portable Executables using Tree-Based Ensemble Machine Learning. In Proceedings of the 2019 SoutheastCon, Huntsville, AL, USA, 11–14 April 2019; pp. 1–6.
- Azeez, N.A.; Odufuwa, O.E.; Misra, S.; Oluranti, J.; Damaševičius, R. Windows PE Malware Detection Using Ensemble Learning. Informatics 2021, 8, 10.
- Damaševičius, R.; Venčkauskas, A.; Toldinas, J.; Grigaliūnas, Š. Ensemble-Based Classification Using Neural Networks and Machine Learning Models for Windows PE Malware Detection. Electronics 2021, 10, 485.
- Mills, A.; Spyridopoulos, T.; Legg, P. Efficient and Interpretable Real-Time Malware Detection Using Random-Forest. In Proceedings of the International Conference on Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), Oxford, UK, 3–4 June 2019; pp. 1–8.
- Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Chen, C.; Zhang, Q.; Ma, Q.; Yu, B. LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion. Chemom. Intell. Lab. Syst. 2019, 191, 54–64.
- Cai, J.; Li, X.; Tan, Z.; Peng, S. An assembly-level neutronic calculation method based on LightGBM algorithm. Ann. Nucl. Energy 2021, 150, 107871.
- Csizmadia, G.; Liszkai-Peres, K.; Ferdinandy, B.; Miklósi, Á.; Konok, V. Human activity recognition of children with wearable devices using LightGBM machine learning. Sci. Rep. 2022, 12, 5472.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. Xgboost: Extreme gradient boosting. In R Package Version 0.4–2; R Foundation for Statistical Computing: Vienna, Austria, 2015; Volume 1, pp. 1–4.
- Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363.
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 1–9.
- Yang, L.; Ciptadi, A.; Laziuk, I.; Ahmadzadeh, A.; Wang, G. BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware. In Proceedings of the IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 27 May 2021; pp. 78–84.
- Carrier, T.; Victor, P.; Tekeoglu, A.; Lashkari, A.H. Detecting Obfuscated Malware using Memory Feature Engineering. In Proceedings of the ICISSP, Online, 9–11 February 2022; pp. 177–188.
- Singh, J.; Singh, J. A survey on machine learning-based malware detection in executable files. J. Syst. Archit. 2021, 112, 101861.
- Albishry, N.; AlGhamdi, R.; Almalawi, A.; Khan, A.I.; Kshirsagar, P.R. An Attribute Extraction for Automated Malware Attack Classification and Detection Using Soft Computing Techniques. Comput. Intell. Neurosci. 2022, 2022, 5061059.
- Vadrevu, P.; Rahbarinia, B.; Perdisci, R.; Li, K.; Antonakakis, M. Measuring and detecting malware downloads in live network traffic. In Proceedings of the European Symposium on Research in Computer Security, Egham, UK, 9–13 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 556–573.
- Uppal, D.; Sinha, R.; Mehra, V.; Jain, V. Malware detection and classification based on extraction of API sequences. In Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Delhi, India, 24–27 September 2014; pp. 2337–2342.
- Kwon, B.J.; Mondal, J.; Jang, J.; Bilge, L.; Dumitraş, T. The dropper effect: Insights into malware distribution with downloader graph analytics. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 1118–1129.
- Mao, W.; Cai, Z.; Towsley, D.; Guan, X. Probabilistic inference on integrity for access behavior based malware detection. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection, Tokyo, Japan, 2–4 November 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 155–176.
- Wüchner, T.; Ochoa, M.; Pretschner, A. Robust and effective malware detection through quantitative data flow graph metrics. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Milan, Italy, 9–10 July 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 98–118.
- Ahmadi, M.; Ulyanov, D.; Semenov, S.; Trofimov, M.; Giacinto, G. Novel feature extraction, selection and fusion for effective malware family classification. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, New Orleans, LA, USA, 1–9 March 2016; pp. 183–194.
- Dener, M.; Ok, G.; Orman, A. Malware Detection Using Memory Analysis Data in Big Data Environment. Appl. Sci. 2022, 12, 8604.
- Azmee, A.; Choudhury, P.P.; Alam, M.A.; Dutta, O.; Hossai, M.I. Performance Analysis of Machine Learning Classifiers for Detecting PE Malware. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 510–517.
- Liu, X.; Lin, Y.; Li, H.; Zhang, J. A novel method for malware detection on ML-based visualization technique. Comput. Secur. 2020, 89, 101682.
- Asam, M.; Hussain, S.J.; Mohatram, M.; Khan, S.H.; Jamal, T.; Zafar, A.; Khan, A.; Ali, M.U.; Zahoora, U. Detection of Exceptional Malware Variants Using Deep Boosted Feature Spaces and Machine Learning. Appl. Sci. 2021, 11, 10464.
- Hao, J.; Luo, S.; Pan, L. EII-MBS: Malware Family Classification via Enhanced Instruction-level Behavior Semantic Learning. Comput. Secur. 2022, 112, 102905.
- Hou, S.; Saas, A.; Ye, Y.; Chen, L. Droiddelver: An android malware detection system using deep belief network based on api call blocks. In Proceedings of the International Conference on Web-Age Information Management, Nanchang, China, 3–5 June 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 54–66.
- Hou, S.; Saas, A.; Chen, L.; Ye, Y.; Bourlai, T. Deep neural networks for automatic android malware detection. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, 31 July–3 August 2017; pp. 803–810.
- Lu, Q.; Zhang, H.; Kinawi, H.; Niu, D. Self-Attentive Models for Real-Time Malware Classification. IEEE Access 2022, 10, 95970–95985.
- Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
- Wright, M.N.; Ziegler, A. Ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv 2015, arXiv:1508.04409.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V. Package ‘xgboost’. R Version 2019, 90, 1–66.
- Cook, D. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2016.
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- Chicco, D.; Tötsch, N.; Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. 2021, 14, 13.
- Conover, W.J. Practical Nonparametric Statistics; John Wiley & Sons: Hoboken, NJ, USA, 1999; Volume 350.
- Arik, S.Ö.; Pfister, T. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; Volume 35, pp. 6679–6687.




| Study | Algorithm(s) | Data Set | Validation Technique | Best Result | 
|---|---|---|---|---|
| Mills et al. [17] | RF | Private | 7-CV | - | 
| Vadrevu et al. [31] | RF | Private | CV and Holdout | TPR: 90%, FPR: 0.1% | 
| Uppal et al. [32] | NB, DT, RF, and SVM | Private | 10-CV | Accuracy: 98.5% | 
| Kwon et al. [33] | RF | Private | 10-CV | TPR: 98.0%, FPR: 2.00%, F1: 98.0%, AUC: 99.8% | 
| Mao et al. [34] | RF | Private | Repeated hold-out | TPR: 99.88%, FPR: 0.1% | 
| Wüchner et al. [35] | RF | Malicia | 10-CV | DR: 98.01%, FPR: 0.48% | 
| Ahmadi et al. [36] | XGBoost | Kaggle | 5-CV | Accuracy: 98.62% | 
| Amer and Zelinka [13] | RF and extra trees | Kaggle | Hold-out | Accuracy: 99.8%, FPR: 0.2% | 
| Liu et al. [39] | CNN and autoencoder | MS BIG and Ember | 10-CV | Accuracy: 96.25% | 
| Asam et al. [40] | CNN and SVM | MalImg | Hold-out | Accuracy: 98.61%, precision: 96.27%, recall: 96.30%, F1: 96.32% | 
| Azeez et al. [15] | 1D CNN and Extra trees | Kaggle | 10-CV | Accuracy: 100%, precision: 100%, recall: 100%, F1: 100% | 
| Damaševičius et al. [16] | Stacked CNN | ClaMP | 10-CV | Accuracy: 99.9%, precision: 99.9%, recall: 99.8%, F1: 99.9% | 
| Hou et al. [42] | DBN | Comodo cloud | 10-CV | Accuracy: 96.66% | 
| Hou et al. [43] | DBN and SAEs | Comodo cloud | 10-CV | Accuracy: 96.66% | 
| Azmee et al. [38] | XGBoost | Kaggle | 10-CV | Accuracy: 98.6%, AUC: 0.99, TPR: 99.0%, FPR: 3.7% | 
| Hao et al. [41] | CNN | MS BIG and BODMAS | 10-CV | (MS BIG) accuracy: 99.40%, (BODMAS) accuracy: 99.26% | 
| Lu et al. [44] | Transformer | BODMAS and MS BIG | Hold-out | (MS BIG) accuracy: 98.17%, F1: 98.14%, (BODMAS) accuracy: 96.96%, F1: 96.96% | 
| Dener et al. [37] | Logistic regression | CIC-MalMem-2022 | Repeated hold-out | Accuracy: 99.97% | 

Table 2. Random forest hyperparameters for each malware dataset.

| Hyperparameter | BODMAS | Kaggle | CIC-MalMem-2022 | 
|---|---|---|---|
| Number of trees | 100 | 1000 | 500 | 
| Split rule | ‘gini’ | ‘gini’ | ‘extratrees’ | 
| Minimum node size | 4 | 8 | 6 | 
| mtry | 119 | 30 | 18 | 
| Sample fraction | 0.63 | 0.80 | 0.63 | 
| Replace | FALSE | FALSE | TRUE | 

Table 3. Final learning parameters for both XGBoost implementations.

| Hyperparameter | BODMAS | Kaggle | CIC-MalMem-2022 | 
|---|---|---|---|
| Native | | | | 
| Maximum depth | 11 | 19 | 19 | 
| eta | 0.2 | 0.3 | 0.1 | 
| Subsample | 0.6 | 0.8 | 0.6 | 
| Column sample by tree | 0.7 | 0.5 | 0.6 | 
| H2O | | | | 
| Maximum depth | 24 | 23 | 26 | 
| Sample rate | 0.52 | 0.94 | 0.99 | 
| Column sample rate | 0.42 | 0.62 | 0.6 | 
| Column sample rate per tree | 0.6 | 0.25 | 0.5 | 
| Minimum rows | 2 | 2 | 2 | 

Table 4. Final learning parameters of CatBoost for each malware dataset.

| Hyperparameter | BODMAS | Kaggle | CIC-MalMem-2022 | 
|---|---|---|---|
| Depth | 10 | 4 | 2 | 
| Learning rate | 0.2 | 0.2 | 0.2 | 
| L2 leaf regularization | 5 | 3 | 5 | 
| Border count | 100 | 100 | 50 | 
| Boosting type | “Plain” | “Plain” | “Ordered” | 

Table 5. Final GBM hyperparameters used on each malware dataset.

| Hyperparameter | BODMAS | Kaggle | CIC-MalMem-2022 | 
|---|---|---|---|
| Maximum depth | 24 | 25 | 27 | 
| Sample rate | 0.52 | 0.44 | 0.72 | 
| Column sample rate per tree | 0.42 | 0.64 | 0.61 | 
| Column sample rate change per level | 1.02 | 1.04 | 0.92 | 
| Number of bins | 64 | 512 | 1024 | 
| Minimum rows | 2 | 2 | 8 | 

Table 6. Final LightGBM hyperparameters used for each malware dataset.

| Hyperparameter | BODMAS | Kaggle | CIC-MalMem-2022 | 
|---|---|---|---|
| Maximum bin | 100 | 100 | 255 | 
| Maximum depth | 10 | 9 | 3 | 
| Number of leaves | 8192 | 8 | 512 | 
| Minimum data in leaf | 1000 | 800 | 700 | 
| Learning rate | 0.29 | 0.27 | 0.07 | 
| Lambda l1 | 40 | 0 | 0 | 
| Lambda l2 | 90 | 20 | 90 | 
| Feature fraction | 0.5 | 0.9 | 0.9 | 
| Bagging fraction | 0.5 | 0.5 | 0.5 | 
| Path smooth | 0.001 | 0.001 | |
| Minimum gain to split | 2 | 15 | 11 | 
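The search spaces listed in Section 3.2 can be written down as plain dictionaries and sampled from. The sketch below does this for the CatBoost space and draws configurations at random, in the spirit of random search [45]; whether the study tuned its models this way is not stated in this section, so the procedure, the dictionary keys, and the `score_fn` placeholder are assumptions for illustration. In practice `score_fn` would train a model and return a validation score.

```python
import math
import random

# CatBoost search space as listed in Section 3.2 (keys are illustrative labels).
CATBOOST_SPACE = {
    "depth": list(range(1, 11)),
    "learning_rate": [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
    "l2_leaf_reg": [3, 1, 5, 10, 100],
    "border_count": [32, 5, 10, 20, 50, 100, 200],
    "boosting_type": ["Ordered", "Plain"],
}


def grid_size(space):
    """Number of distinct configurations in the full grid."""
    return math.prod(len(values) for values in space.values())


def sample_config(space, rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(space, score_fn, n_iter=20, seed=0):
    """Evaluate `n_iter` random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = sample_config(space, rng)
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_score(cfg):
    """Hypothetical stand-in for a validation metric: prefers deeper trees."""
    return -abs(cfg["depth"] - 10)


best_cfg, best_score = random_search(CATBOOST_SPACE, toy_score, n_iter=50, seed=1)
print(grid_size(CATBOOST_SPACE), best_cfg, best_score)
```

The full CatBoost grid has 4200 combinations, which illustrates why sampling a subset of configurations can be attractive when each evaluation requires training a model on a dataset as large as BODMAS.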

| Ensemble Algorithms | Accuracy | MCC | Precision | Recall | AUC | F1 | 
|---|---|---|---|---|---|---|
| CatBoost | 0.9940 | 0.9851 | 0.9943 | 0.9856 | 0.9988 | 0.9899 | 
| XGBoost (native) | 0.9968 | 0.9922 | 0.9975 | 0.9923 | 0.9994 | 0.9949 | 
| LightGBM | 0.9927 | 0.9823 | 0.9945 | 0.9828 | 0.9977 | 0.9885 | 
| Random forest | 0.9961 | 0.9906 | 0.9959 | 0.9921 | 0.9994 | 0.9940 | 
| GBM (H2O) | 0.9967 | 0.9920 | 0.9964 | 0.9978 | 0.9995 | 0.9971 | 
| XGBoost (H2O) | 0.9960 | 0.9902 | 0.9956 | 0.9977 | 0.9994 | 0.9967 | 
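Most of the metrics reported in the table above are simple functions of the binary confusion matrix. The sketch below computes accuracy, precision, recall, F1, and MCC from hypothetical counts (the study's confusion matrices are not given in this section); AUC is omitted because it requires ranked prediction scores rather than counts. Function names are illustrative.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

As a consistency check against the table, the GBM (H2O) precision of 0.9964 and recall of 0.9978 give `f1_score(0.9964, 0.9978)` ≈ 0.9971, which matches the reported F1.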

| CatBoost | XGBoost (Native) | LightGBM | Random Forest | GBM (H2O) | XGBoost (H2O) | 
|---|---|---|---|---|---|
| 0.250153 | - | 0.125201 | 0.7014781 | 0.6092802 | 0.8983268 | 
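The p-values above can be read against a significance threshold with a few lines of code. This sketch assumes that each value compares the named ensemble with XGBoost (native), which is marked "-", that the last column refers to XGBoost (H2O), and that a conventional level of α = 0.05 applies; none of these points is confirmed in this section, and the specific test that produced the p-values is not named here.

```python
# p-values copied from the table above; the column labels are assumptions.
P_VALUES = {
    "CatBoost": 0.250153,
    "LightGBM": 0.125201,
    "Random forest": 0.7014781,
    "GBM (H2O)": 0.6092802,
    "XGBoost (H2O)": 0.8983268,
}
ALPHA = 0.05  # assumed significance level


def significant(p_values, alpha):
    """Names of the comparisons whose p-value falls below `alpha`."""
    return sorted(name for name, p in p_values.items() if p < alpha)


print(significant(P_VALUES, ALPHA))
```

Under these assumptions no comparison falls below 0.05, so the listed p-values alone do not indicate a statistically significant difference between the other ensembles and XGBoost (native).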

| Study | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) | 
|---|---|---|---|---|
| Kaggle | | | | | 
| Hou et al. [42] | 93.68 | 93.96 | 93.36 | 93.68 | 
| Hou et al. [43] | 96.66 | 96.55 | 96.76 | 96.66 | 
| Azmee et al. [38] | 98.60 | 96.30 | 99.00 | - | 
| This study (GBM (H2O)) | 99.39 | 99.27 | 99.92 | 99.59 | 
| BODMAS | | | | | 
| Hao et al. [41] | 99.29 | 98.07 | 98.26 | 94.23 | 
| Lu et al. [44] | 96.96 | - | - | 96.96 | 
| This study (XGBoost (native)) | 99.96 | 99.65 | 99.81 | 99.73 | 
| CIC-MalMem-2022 | | | | | 
| Dener et al. [37] | 99.97 | 99.98 | 99.97 | 99.97 | 
| This study (Random forest) | 100.00 | 100.00 | 99.99 | 100.00 | 
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Louk, M.H.L.; Tama, B.A. Tree-Based Classifier Ensembles for PE Malware Analysis: A Performance Revisit. Algorithms 2022, 15, 332. https://doi.org/10.3390/a15090332