Static Malware Detection and Classification Using Machine Learning: A Random Forest Approach
Abstract
1. Introduction
1.1. Background
1.2. Research Objectives
- Develop an automated malware dataset collection system from MalwareBazaar to build a rich and representative database for research [12].
- Perform static feature extraction from malware samples to identify patterns that can be used in the classification process. The extracted features include the PE header, entropy, API calls, and PE sections [8].
- Build a classification model based on the Random Forest Classifier that can effectively distinguish between malware and benign files [10].
- Evaluate the model’s performance using standard classification metrics, such as accuracy, precision, recall, F1-score, and the AUC-ROC curve, to quantify detection effectiveness [11].
1.3. Research Contribution
2. Literature Review
2.1. Malware Detection Approaches
- Static analysis inspects files without execution, relying on attributes such as Portable Executable (PE) headers, imported functions, and entropy values [1,8]. This method is lightweight and safe, making it suitable for large-scale analysis. However, it can be bypassed through obfuscation or packing techniques [5].
- Dynamic analysis, in contrast, executes the program in a controlled environment to observe its runtime behavior, including API calls and system interactions [13]. While it provides higher accuracy against obfuscated samples, it requires significant computational resources and may be evaded by sophisticated malware [14].
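As an illustration of one static attribute named above, the following minimal sketch computes the Shannon entropy of a byte sequence without executing it. The function name and the toy inputs are assumptions made for illustration; they are not taken from the paper's code or dataset. Packed or encrypted content tends toward values near 8 bits per byte, which is why entropy is commonly used as a static feature.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum over byte values of p * log2(1 / p), with p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Illustrative inputs only, not samples from the paper's dataset.
uniform = bytes(range(256))   # every byte value exactly once
constant = b"\x00" * 1024     # a single repeated byte value
print(shannon_entropy(uniform), shannon_entropy(constant))
```

The same function could be applied per PE section rather than to the whole file, which is one common way to obtain section-level entropy features.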
2.2. Static Analysis in Malware Detection
2.3. Application of Machine Learning in Malware Detection
2.3.1. Malware Detection Process with Machine Learning
- Feature Extraction: Features obtained from static analysis, such as the PE header, list of called APIs, file size, entropy, and a set of strings found in the file, are converted into a numerical format to be used in machine learning models.
- Data Preprocessing: The collected data often contains irrelevant or redundant values. Therefore, normalization, cleaning, and dimensionality reduction are carried out to improve the model’s efficiency.
- Feature Selection: Not all extracted features have a significant impact on malware classification. Therefore, feature selection is performed using techniques such as SHAP (SHapley Additive Explanations) and Permutation Importance to retain only the features that contribute most to the classification results.
- Training Machine Learning Models: The machine learning model is trained using the processed dataset. The model used in this research is the Random Forest Classifier, which is known for its ability to handle large datasets with complex features.
- Model Evaluation: The trained model is tested on validation data. It is evaluated with metrics such as precision, recall, F1-score, and AUC-ROC to assess how well it distinguishes between malware and benign files.
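The feature selection step can be sketched with a plain-Python version of permutation importance: shuffle one feature column at a time and record how much accuracy drops. Everything here is an assumption for illustration, including the toy data, the `toy_model` stand-in for a trained classifier, and the function names; it is not the paper's implementation.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [list(row[:col]) + [v] + list(row[col + 1:])
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): column 0 stands in for an informative feature such
# as entropy; column 1 is pure noise.
noise = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[0.9 if label else 0.1, noise.random()] for label in y]


def toy_model(row):
    # Hypothetical stand-in for a trained classifier: thresholds column 0 only.
    return int(row[0] > 0.5)


scores = permutation_importance(toy_model, X, y)
print(scores)
```

A feature whose shuffling leaves accuracy unchanged, like the noise column here, is a candidate for removal. scikit-learn ships a comparable utility in `sklearn.inspection.permutation_importance` for fitted estimators.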
2.3.2. Algorithm
- Can handle many features with high efficiency.
- Less prone to overfitting than a single decision tree.
- Has better interpretability compared to deep learning.
2.4. Random Forest Algorithm
- Efficiency: Can process datasets with many features effectively.
- Reduced Overfitting: Less prone to overfitting compared to traditional decision trees.
- Interpretability: Offers insights into feature importance, unlike most deep learning models.
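The overfitting point can be made concrete with a deliberately simplified sketch of the two ideas behind Random Forest: training each tree on a bootstrap sample with a randomly chosen feature, and combining trees by majority vote. To stay short, this sketch uses one-split decision stumps instead of full decision trees, and the data and function names are hypothetical. In practice a library implementation such as scikit-learn's `RandomForestClassifier` would be used.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Find the threshold on one feature that best separates the two classes."""
    best = (None, -1.0, 1)  # (threshold, accuracy, label predicted above it)
    for threshold in sorted({row[feature] for row in X}):
        for above in (0, 1):
            preds = [above if row[feature] > threshold else 1 - above for row in X]
            acc = sum(p == t for p, t in zip(preds, y)) / len(y)
            if acc > best[1]:
                best = (threshold, acc, above)
    return feature, best[0], best[2]


def predict_stump(stump, row):
    feature, threshold, above = stump
    return above if row[feature] > threshold else 1 - above


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # sample rows with replacement
        feature = rng.randrange(len(X[0]))         # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data: label 1 rows have higher values in both columns.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.05, 0.05]), predict_forest(forest, [0.95, 0.95]))
```

Because each stump sees a different resample of the data, an unlucky individual stump is outvoted by the rest, which is the intuition behind the reduced overfitting noted above.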
3. Research Methodology
3.1. Research Design
- Data Collection: Data is obtained from the MalwareBazaar API based on the specified malware category [17].
- Data Preprocessing: Data is cleaned, features are extracted, and encoding is performed if necessary.
- Model Training: The machine learning model is trained with the processed dataset.
- Model Evaluation: The model is evaluated using metrics such as accuracy, precision, recall, and F1-score.
- Result Analysis: The model’s results are compared with other methods and analyzed further.
3.2. Statistical Analysis
- Column “filename” (A): Contains the name or unique identifier of each analyzed file. The values appear to be hash strings or obfuscated file names, a common way for security systems to identify files without revealing their original names.
- Column “size” (B): Gives the size of each file in bytes. In the visible data, sizes range from 96,811 bytes to 3,386,334 bytes. This wide spread may be relevant to later analysis, such as storage management or anomaly detection.
- Column “type” (C): Identifies the file type in MIME (Multipurpose Internet Mail Extensions) format. All files in the table are classified as “application/zip”, meaning they are ZIP archives. This suggests the samples were compressed for storage and distribution.
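A summary of such a metadata table might be computed as in the sketch below. The CSV text is a hypothetical stand-in: only the column names and the two quoted sizes (96,811 and 3,386,334 bytes) come from the description above, while the file names and the third row are invented placeholders.

```python
import csv
import io
from collections import Counter
from statistics import median

# Hypothetical rows mirroring the three described columns.
SAMPLE_CSV = """filename,size,type
sample_a,96811,application/zip
sample_b,3386334,application/zip
sample_c,500000,application/zip
"""


def summarize(csv_text: str) -> dict:
    """Return basic size statistics and a count of MIME types."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sizes = [int(row["size"]) for row in rows]
    return {
        "files": len(rows),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "median_size": median(sizes),
        "types": dict(Counter(row["type"] for row in rows)),
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

A check like this makes it easy to confirm that every sample has the expected MIME type before feature extraction begins.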
3.3. Data Visualization
3.4. Data Analysis Techniques
- Data Preprocessing: A crucial step that prepares the dataset for use in a machine learning model. It begins with cleaning: removing irrelevant information and handling missing values so that data quality is consistent [22]. Categorical features are then encoded, for example with one-hot or label encoding, so that machine learning algorithms can process them. Numerical features are often normalized to a common scale, which can speed up training and avoids bias caused by differing feature ranges. Proper preprocessing leaves the dataset better suited to producing an accurate and reliable model [23].
- Model Selection: A critical stage in building a machine learning system. The Random Forest Classifier was chosen for its ability to handle datasets with many features and because its decision-tree structure yields results that are comparatively easy to interpret. Random Forest is also known to be robust against overfitting. Its performance is compared with Support Vector Machine (SVM) and Neural Network models to evaluate its relative advantages in accuracy, speed, and complexity. The selection is therefore based not only on performance but also on the balance between accuracy and ease of interpretation [6].
- Model Training and Evaluation: The dataset is split into training and test data at an 80:20 ratio so that the model is evaluated on data it has not seen during training. Performance is assessed with the confusion matrix, classification report, and accuracy score, which together give an overview of precision, recall, and accuracy. The model is also compared with baseline and alternative methods using the AUC-ROC curve and cross-validation, which help gauge its resilience to overfitting and the consistency of its classifications. This gives a more thorough picture of whether the resulting model is both accurate and reliable [7].
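The 80:20 split and the confusion-matrix metrics described above can be sketched in plain Python as follows. The label convention (1 = malware, 0 = benign), the function names, and the example predictions are assumptions for illustration, not values from the paper. scikit-learn provides equivalents such as `train_test_split`, `confusion_matrix`, and `classification_report`.

```python
import random


def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle row indices and hold out `test_ratio` of them as the test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics (assumes 1 = malware, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # also the true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0      # false positive rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


# Hypothetical example: 10 samples split 80:20, then metrics on made-up predictions.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, y_train, X_test, y_test = train_test_split(rows, labels)
metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
print(len(X_train), len(X_test), metrics)
```

In the made-up predictions, one malware file is missed and one benign file is flagged, so accuracy, precision, recall, and F1 all come out at 0.75 and the false positive rate at 0.25.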
4. Results and Discussion
4.1. Model Performance
4.2. Analysis and Interpretation of Results
- Model Performance in Malware Classification: The developed model shows quite good performance in classifying certain types of malware, especially in malware classes that have sufficient data representation in the dataset. This is evident from the high accuracy and evaluation metrics for those classes. However, there are several malware classes that have a higher prediction error rate. These errors are most likely caused by two main factors: (1) the lack of data for these classes, which makes it difficult for the model to learn patterns well, and (2) the presence of overlapping features between malware classes, making it difficult for the model to distinguish between classes with similar characteristics.
- The Influence of Data Quantity on Model Performance: Models tend to perform better in detecting malware classes with a larger amount of data. This is reasonable because machine learning models require sufficient data to effectively learn patterns. Classes with little data often produce less accurate predictions, which can affect the overall performance of the model.
- Potential Performance Improvement: The model’s performance can be enhanced with several strategies. First, the addition of more relevant and informative features can help the model better distinguish between different classes of malware. Second, if there is class imbalance, data balancing techniques such as oversampling, undersampling, or the use of methods like SMOTE (Synthetic Minority Over-sampling Technique) can be applied to balance class distribution and improve the model’s ability to predict the minority class.
- Comparison with Other Models: Compared to other models such as Support Vector Machine (SVM) or Neural Network, Random Forest shows more stable and consistent results.
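The simplest of the balancing techniques mentioned above, random oversampling, can be sketched as follows. This is plain duplication of minority-class rows, not SMOTE, which instead synthesizes new points by interpolating between neighbours. The class names and feature values are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Hypothetical imbalanced training set: six rows of one class, two of another.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [0.8]]
y = ["trojan"] * 6 + ["ransomware"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Balancing should be applied to the training split only, so that duplicated rows never appear in the test data and inflate the measured performance.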
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Tahir, R. A Study on Malware and Malware Detection Techniques. IJEME 2018, 8, 20–30.
- Shatnawi, A.S.; Yassen, Q.; Yateem, A. An Android Malware Detection Approach Based on Static Feature Analysis Using Machine Learning Algorithms. Procedia Comput. Sci. 2022, 201, 653–658.
- Almobaideen, W.; Abu Alghanam, O.; Abdullah, M.; Hussain, S.B.; Alam, U. Comprehensive Review on Machine Learning and Deep Learning Techniques for Malware Detection in Android and IoT Devices. Int. J. Inf. Secur. 2025, 24, 110.
- Qomariah, N.; Alwi, E.I.; Asis, M.A. Analisis Malware Hummingbad dan Copycat pada Android Menggunakan Metode Hybrid. Cyber Secur. Forensik Digit. 2024, 6, 39–47.
- Ferdous, J.; Islam, R.; Mahboubi, A.; Islam, M.Z. AI-Based Ransomware Detection: A Comprehensive Review. IEEE Access 2024, 12, 136666–136695.
- Aslan, O.; Yilmaz, A.A. A New Malware Classification Framework Based on Deep Learning Algorithms. IEEE Access 2021, 9, 87936–87951.
- Marais, B.; Quertier, T.; Morucci, S. AI-based Malware and Ransomware Detection Models. arXiv 2022, arXiv:2207.02108.
- Wu, Y.; Chang, Y. Ransomware Detection on Linux Using Machine Learning with Random Forest Algorithm. TechRxiv 2024.
- Dolesi, K.; Steinbach, E.; Velasquez, A.; Whitaker, L.; Baranov, M.; Atherton, L. A Machine Learning Approach to Ransomware Detection Using Opcode Features and K-Nearest Neighbors on Windows. TechRxiv 2024.
- Argene, M.; Ravenscroft, C.; Kingswell, I. Ransomware Detection via Cosine Similarity-Based Machine Learning on Bytecode Representations. Authorea 2024.
- Rafapa, J.; Konokix, A. Ransomware Detection Using Aggregated Random Forest Technique with Recent Variants. Authorea 2024. Available online: https://www.authorea.com/users/816233/articles/1216996 (accessed on 8 September 2025).
- Ispahany, J.; Islam, M.R.; Khan, M.A.; Islam, M.Z. A Sysmon Incremental Learning System for Ransomware Analysis and Detection. arXiv 2025, arXiv:2501.01089.
- Alhogail, A.; Alharbi, R.A. Effective ML-Based Android Malware Detection and Categorization. Electronics 2025, 14, 1486.
- Hadiprakoso, R.B.; Aditya, W.R.; Pramitha, F.N. Static Analysis of Android Malware Detection Using Supervised Machine Learning Algorithm. Cyber Secur. Forensik Digit. 2022, 5, 1–5. (In Indonesian)
- Syeda, D.Z.; Asghar, M.N. Dynamic Malware Classification and API Categorisation of Windows Portable Executable Files Using Machine Learning. Appl. Sci. 2024, 14, 1015.
- Hasan, R.; Biswas, B.; Samiun, M.; Saleh, M.A.; Prabha, M.; Akter, J.; Joya, F.H.; Abdullah, M. Enhancing Malware Detection with Feature Selection and Scaling Techniques Using Machine Learning Models. Sci. Rep. 2025, 15, 93447.
- Akhtar, M.S.; Feng, T. Malware Analysis and Detection Using Machine Learning Algorithms. Symmetry 2022, 14, 2304.
- Khalda, K.; Wibowo, D.K. Malware Behavior Analysis Using Static and Dynamic Analysis Approaches. J. Sains Nalar Apl. Teknol. Inf. 2025, 4, 1–8.
- Yusirwan, S.; Prayudi, Y.; Riadi, I. Implementation of Malware Analysis using Static and Dynamic Analysis Method. Int. J. Comput. Appl. 2015, 117, 11–15.
- Yuniati, T.; Tambunan, A.R.; Setyoko, Y.A. Implementation of Static Analysis and Background Process to Detect Malware in Android Applications with Mobile Security Framework. Ledger J. Inform. Inf. Technol. 2022, 1, 24–28. (In Indonesian)
- Chowdhury, M.S. Comparison of Accuracy and Reliability of Random Forest, Support Vector Machine, Artificial Neural Network and Maximum Likelihood Method in Land Use/Cover Classification of Urban Setting. Environ. Chall. 2024, 14, 100800.
- Bayazit, E.C.; Sahingoz, O.K.; Dogan, B. Deep Learning-Based Malware Detection for Android Systems: A Comparative Analysis. Tech. Vjesn. 2023, 30, 787–796.
- Haque, M.A.; Ahmad, S.; Sonal, D.; Abdeljaber, H.A.M.; Mishra, B.K.; Eljialy, A.E.M.; Alanazi, S.; Nazeer, J. Achieving Organizational Effectiveness through Machine Learning-Based Approaches for Malware Analysis and Detection. Data Metadata 2023, 2, 139.
- Gibert, D.; Mateu, C.; Planes, J. The Rise of Machine Learning for Detection and Classification of Malware: Research Developments, Trends and Challenges. J. Netw. Comput. Appl. 2020, 153, 102526.
- Rathore, H.; Agarwal, S.; Sahay, S.K.; Sewak, M. Malware Detection Using Machine Learning and Deep Learning. In Lecture Notes in Computer Science, Proceedings of the 6th International Big Data Analytics Conference (BDA 2018), Warangal, India, 18–21 December 2018; Springer: Cham, Switzerland, 2019; pp. 402–411.
- Rele, M.; Samuel, J.; Patil, D.; Krishnan, U. Exploring Ransomware Detection Based on Artificial Intelligence and Machine Learning. Procedia Comput. Sci. 2025, 230, 548–556.
- Chowdhury, R.R.; Idris, A.C.; Abas, P.E. Identifying SH-IoT Devices from Network Traffic Characteristics Using Random Forest Classifier. Wirel. Netw. 2024, 30, 405–419.
- Kurniawan, F.; Stiawan, D.; Antoni, D.; Heriyanto, A.; Idris, M.Y.; Budiarto, R. Hybrid Machine Learning Model for Anticipating Cyber Crime Malware in Android: Work on Progress. In Proceedings of the 2024 11th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), Yogyakarta, Indonesia, 26–27 September 2024; pp. 499–505.
- Ilham, K.F.; Ahmad, T.; Putra, M.A.R. Malware Analysis and Classification Using Grid Search Optimization. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–6.
- Thakur, P.; Kansal, V.; Rishiwal, V. Hybrid Deep Learning Approach Based on LSTM and CNN for Malware Detection. Wirel. Pers. Commun. 2024, 136, 1879–1901.
Classifier | Accuracy | F1-Score | AUC-ROC | TPR | FPR | Training Time (s) |
---|---|---|---|---|---|---|
Random Forest | 0.498049 | 0.502577 | 0.498086 | 0.507812 | 0.511688 | 0.236529 |
Decision Tree | 0.488947 | 0.610505 | 0.482745 | 0.802083 | 0.823377 | 0.036165 |
SVM | 0.531860 | 0.531250 | 0.478957 | 0.531250 | 0.467532 | 0.618762 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kamdan; Pratama, Y.; Munzi, R.S.; Mustafa, A.B.; Kharisma, I.L. Static Malware Detection and Classification Using Machine Learning: A Random Forest Approach. Eng. Proc. 2025, 107, 76. https://doi.org/10.3390/engproc2025107076