Recent Advancements in Machine Learning Models for Malware Detection: A Systematic Literature Review

Hasanah, Nurul Islam; Insany, Gina Purnama; Kharisma, Ivana Lucia; Rahayu, Natasya Dewi

doi:10.3390/engproc2025107078

Open Access Proceeding Paper

Recent Advancements in Machine Learning Models for Malware Detection: A Systematic Literature Review^†

by Nurul Islam Hasanah ¹, Gina Purnama Insany ¹, Ivana Lucia Kharisma ¹,* and Natasya Dewi Rahayu ²

¹ Department of Informatics Engineering, Nusa Putra University, Sukabumi 43152, Indonesia
² Department of Air Transportation, Trisakti Institute of Transportation and Logistics, Jakarta 13210, Indonesia
* Author to whom correspondence should be addressed.
† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.

Eng. Proc. 2025, 107(1), 78; https://doi.org/10.3390/engproc2025107078

Published: 10 September 2025

(This article belongs to the Proceedings of The 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society)

Abstract

Malware detection has become a critical area of research due to the increasing sophistication of cyberattacks targeting various platforms, including IoT devices, Android systems, and desktop environments. This study employed the systematic literature review (SLR) method, following PRISMA guidelines, to analyze recent advancements in malware detection using machine learning (ML) models. A total of six studies were selected based on strict inclusion and exclusion criteria, focusing on algorithms, datasets, performance metrics, and targeted platforms. The review reveals that ensemble methods like Gradient Boosting and XGBoost achieve high detection accuracy, with several models exceeding 90% on benchmark datasets such as VirusShare and MSCAD. Additionally, IoT platforms emerged as the most commonly targeted environment in malware detection research, emphasizing their vulnerability. Despite these advancements, the review identifies gaps in dataset diversity and platform-specific optimizations. This study provides insights into the current trends, challenges, and future directions for machine learning-based malware detection.

Keywords:

malware detection; machine learning; ensemble learning; IoT security; dataset diversity; systematic literature review

1. Introduction

The rapid development of information and communication technology over the last two decades has brought considerable convenience [1,2], but it has also increased cybersecurity risks [3,4], especially those related to malware [5,6]. Malware, which includes viruses, worms, trojans, and ransomware [7,8], has become one of the major threats to computer systems and networks around the world [9]. According to a report by Cybersecurity Ventures [10], global losses due to cyberattacks were estimated to reach six trillion dollars in 2021, and this figure continues to increase as attack techniques become more sophisticated [11]. This trend shows that malware threats affect not only large organizations [12], but also individuals and small businesses [13].

In this context, traditional approaches to malware detection, such as signature-based detection [14], are beginning to show significant limitations [15,16]. Such approaches are ineffective against the new malware variants [17] that constantly emerge and use obfuscation techniques to evade detection [18]. In response, research and development of machine learning models has become a major focus in the field of cybersecurity [19]. Machine learning models can learn from data [20] and identify patterns that are not readily visible [21], and can therefore detect new and unknown malware more effectively [22].

A number of studies have shown the potential of machine learning models in detecting malware with high accuracy [23,24]. For example, A. Kumar et al. [25] reported that the Random Forest algorithm can achieve up to 99.7% accuracy in detecting malware on a given dataset. However, the implementation of machine learning techniques also faces various challenges [26]. One of the main challenges is the selection of representative and varied datasets [27]. Datasets that do not reflect the diversity of threats in the real world can result in models that are neither robust nor reliable [28]. This research is motivated by the urgent need to understand the latest advances in the development of machine learning models for malware detection [29,30]. It also aims to identify solutions to existing challenges, such as dataset limitations [31] and algorithm complexity [32].

This study aimed to conduct a systematic review of the latest advances in machine learning models for malware detection [33,34]. Other objectives were to identify the most effective machine learning algorithms in detecting malware [35], evaluate the types of datasets commonly used in training and testing machine learning models [36,37], and analyze the performance of machine learning models in detecting different types of malware on different platforms [38,39,40]. Thus, this research is expected to provide in-depth insights into the latest trends and challenges in machine learning-based malware detection [41]. The results of this study are also expected to be a practical guide for researchers and practitioners in the field of cybersecurity.

The ever-evolving malware threat has driven the need for more sophisticated and adaptive detection solutions [42,43]. Although much research has been performed in this area, there are several research gaps that need to be addressed [44]. One of them is the limitation of datasets [45], as many previous studies used datasets that were less representative or did not cover the latest malware variants [46]. This hinders the generalization of the model to real-world situations [47]. In addition, most of the research only focused on a specific platform [48]. There has been no systematic evaluation comparing the performance of various machine learning algorithms to detect malware in various scenarios [49]. This research seeks to fill in the gaps by analyzing available datasets, evaluating model performance across different platforms, and compiling a comprehensive comparison of the machine learning algorithms used [50].

To achieve the research objectives, the following research questions (RQs) were formulated: (RQ1) What are the recent advancements in machine learning algorithms for malware detection? (RQ2) What types of datasets are commonly used in training and evaluating machine learning models for malware detection? (RQ3) How do machine learning models perform in detecting malware across different platforms and types of malware? By answering these questions, this research is expected to contribute to the development of more effective and efficient machine learning-based malware detection technology.

2. Methodology

The methodology used in this study is a systematic literature review (SLR) conducted in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. This methodology aims to identify, evaluate, and analyze relevant research in the field of malware detection using machine learning. The review process involved several stages: identifying the relevant literature sources, screening them against predetermined inclusion and exclusion criteria [51], and systematically presenting the data according to the PRISMA framework [52].

2.1. Search Strategy and Criteria

The authors searched for articles in the Scopus database using the following Boolean query: (“Advancements” OR “Advances” OR “Developments”) AND (“Machine Learning” OR “ML”) AND (“Machine Learning Models”) AND (“Malware Detection” OR “Malware Classification” OR “Threat Detection” OR “Malicious Software” OR “Cyber Threat”). The search was carried out on 16 December 2024 and returned 125 articles. Applying the criteria below narrowed the set to 24 articles for further analysis. These articles were then studied in depth to answer the main research questions on the advancement of machine learning models in malware detection on various platforms, as summarized in Table 1.
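The structure of the Boolean query can be made explicit with a short sketch. The snippet below is only an illustration: the term groups are copied from the query reported above, while `TERM_GROUPS` and `build_query` are hypothetical names, and the code does not call any Scopus interface or add database-specific field codes.

```python
# Term groups copied from the search string reported in this section.
# Names such as TERM_GROUPS and build_query are illustrative assumptions.
TERM_GROUPS = [
    ["Advancements", "Advances", "Developments"],
    ["Machine Learning", "ML"],
    ["Machine Learning Models"],
    ["Malware Detection", "Malware Classification", "Threat Detection",
     "Malicious Software", "Cyber Threat"],
]


def build_query(groups):
    """Join terms with OR inside each group and join the groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


query = build_query(TERM_GROUPS)
print(query)
```

Keeping the synonyms in separate groups makes it easy to see that a record must match at least one term from every group to be returned.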

2.2. Eligibility Criteria

The eligibility criteria in this SLR were designed to ensure that only studies that were relevant, of high quality, and aligned with the research objectives were included in the analysis [53,54]. First, a selected article had to specifically address the application of machine learning models or algorithms to malware detection; studies that covered only the theory or development of algorithms, without a real implementation in the context of malware detection, were excluded. Second, the included studies had to present measurable empirical data, such as accuracy, precision, recall, F1-score, or other quantitatively reported performance metrics, to allow comparisons between studies [55]. Third, the article had to discuss or implement at least one machine learning algorithm, such as K-Nearest Neighbors (KNN), Random Forest, or Gradient Boosting [56]. In addition, the article had to use a well-described dataset, whether from public sources or collected independently, with detailed information on the size, type, and source of the dataset. Only articles published in English within the last 5 years were considered, to ensure that the information was relevant and up to date. After the selection process based on these criteria, only 6 articles were found to be eligible for further analysis, as shown in Figure 1. These articles were selected because they made a significant contribution to the understanding and development of machine learning-based malware detection methods.
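The selection counts reported in Sections 2.1 and 2.2 can be checked with simple arithmetic. The sketch below assumes that the three reported numbers (125, 24, and 6) are consecutive stages of a single selection funnel; the stage labels and the function name `stage_exclusions` are illustrative, and the per-reason exclusion counts of Figure 1 are not reproduced.

```python
# Counts reported in the text; the stage labels are illustrative assumptions.
STAGES = [
    ("records returned by the Scopus search", 125),
    ("articles retained for further analysis", 24),
    ("articles eligible for the final review", 6),
]


def stage_exclusions(stages):
    """Return (from_stage, to_stage, number_excluded) for consecutive stages."""
    return [
        (prev_name, next_name, prev_n - next_n)
        for (prev_name, prev_n), (next_name, next_n) in zip(stages, stages[1:])
    ]


exclusions = stage_exclusions(STAGES)
for src, dst, excluded in exclusions:
    print(f"{src} -> {dst}: {excluded} excluded")

# Share of the initial search results that reached the final review.
retention = STAGES[-1][1] / STAGES[0][1]
print(f"overall retention: {retention:.1%}")
```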

2.3. Data Synthesis

The synthesis stage in this study was carried out with a descriptive synthesis approach, which aims to organize, analyze, and summarize information from studies that have passed the appraisal stage. This process began by categorizing the studies based on four main criteria to ensure a systematic and in-depth analysis. The first criterion was the machine learning techniques used in malware detection. These techniques included various methods such as Support Vector Machine (SVM), Random Forest, deep learning (e.g., convolutional and recursive neural networks), as well as hybrid approaches that combine multiple algorithms to improve detection performance. The second criterion was the dataset used in the research. The dataset included commonly used datasets, such as VirusShare [57], CICIDS2017 [58], and other datasets that are public or self-collected by researchers. These datasets were further analyzed to understand the data source, dataset size, and representation of malware and non-malware samples.

The third criterion was the performance evaluation metrics reported in each study. These metrics included accuracy, to evaluate the effectiveness of the algorithm in detecting malware [59]. The fourth criterion was the platform that is the focus of malware detection, such as desktop-based systems, mobile devices, cloud-based environments, and Internet of Things (IoT) devices. This information helps identify how a particular approach is applied in a variety of operational contexts.
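For reference, accuracy and the other metrics named in the eligibility criteria (precision, recall, and F1-score) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch; the function name and the example counts are hypothetical and are not taken from any reviewed study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp: malware correctly flagged      fp: benign samples wrongly flagged
    tn: benign correctly passed        fn: malware that was missed
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced test set (110 malware, 890 benign).
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On imbalanced data such as this example, accuracy alone can look high even when a noticeable share of malware is missed, which is why recall and F1 are often reported alongside it.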

The data collected from these studies were then organized into tables and graphs. Tables were used to present a detailed comparison, such as the algorithms used, datasets, performance metrics, and relevant platforms. Meanwhile, graphs helped visualize research trends, the distribution of the most widely used machine learning techniques, as well as changes in the research focus over time. With this descriptive synthesis approach, the research was able to map trends, evaluate key contributions, and identify similarities and differences among the studies reviewed. The results of this synthesis process provide comprehensive insights to understand the progress of research related to malware detection using machine learning.

2.4. Analysis

The analysis stage in this study was designed to evaluate the synthesis results, focusing on answering the three main research questions (RQs). The approach used was qualitative analysis, organized into three processes, one for each research question.

First, to answer RQ1 (What are the recent advancements in machine learning algorithms for malware detection?), the analysis process was carried out by identifying patterns and trends in the application of machine learning techniques. The studies analyzed were grouped based on the type of algorithm used, such as Random Forest, Support Vector Machine (SVM), deep learning (including convolutional and recursive neural networks), as well as hybrid models that combined several techniques to improve detection performance. This analysis highlighted the latest innovations in algorithm development, such as the adoption of transfer learning, ensemble learning, and explainable AI methods that are beginning to be applied in the context of malware detection. In addition, temporal trends were evaluated to see whether the preferred techniques shifted from traditional algorithms to deep learning-based models.

Second, to answer RQ2 (What types of datasets are commonly used in training and evaluating machine learning models for malware detection?), an evaluation of the source, size, and characteristics of the dataset used in the study was carried out. Commonly used datasets, such as VirusShare, CICIDS2017, and MalwareBazaar, were analyzed to understand how they support model training and evaluation. In addition, the study also evaluated the representation of the sample in the dataset, including the proportion between malware and non-malware, the types of malware covered (such as ransomware, trojans, and adware), as well as the presence of datasets that reflected the latest malware threats. The research gap was found in the lack of dynamic and representative datasets, especially for new types of malware and certain operational environments such as IoT devices.

Third, to answer RQ3 (How do machine learning models for malware detection perform across different platforms and types of malware?), the analysis focused on evaluating the performance of the models in various application contexts. The analyzed studies were differentiated by platform, such as desktops, mobile devices, cloud environments, and the Internet of Things (IoT). In addition, each algorithm’s performance was evaluated based on metrics such as accuracy. This analysis aimed to identify which algorithms showed superior performance on a particular platform and for a specific type of malware. For example, deep learning algorithms tend to excel on IoT devices due to their ability to handle unstructured data, while traditional algorithms such as Random Forest are more commonly used on desktop platforms that have more stable data.

Through this analysis, the research was able to provide structured answers to the research questions, by highlighting the latest algorithm innovations, commonly used dataset characteristics, and model performance in various platforms and types of malware. The results of this analysis not only support a deep understanding, but also provide recommendations for future research directions.

3. Results and Discussion

The summary of the six reviewed articles is presented in the form of data that included machine learning techniques, datasets, accuracy, and platforms used in malware detection. This information provides an overview of the approaches applied in malware detection research.

Based on the data in Table 2, research on malware detection using machine learning models has shown significant growth in recent years, particularly between 2022 and 2024. Countries such as China, Saudi Arabia, South Korea, India, and Mexico dominate research in this field. This dominance can be attributed to the rapid development of digital technologies and the rising cybersecurity threats accompanying the growth of digital ecosystems in these regions. For instance, China and India have vast internet user populations and continuously expanding technological infrastructures, driving the demand for advanced security systems. Meanwhile, South Korea is renowned as a hub of technological innovation, with a rapidly growing startup ecosystem focused on IoT and artificial intelligence (AI). Saudi Arabia stands out with its Saudi Vision 2030, which prioritizes digital transformation and reinforces the need for cybersecurity to protect national digital infrastructure. On the other hand, Mexico’s contribution reflects the increasing adoption of IoT technologies in Latin America, necessitating more effective security solutions.

This trend highlights that countries with advanced digital ecosystems and large internet user bases tend to be more proactive in developing cybersecurity technologies based on machine learning. Additionally, the research focus in these regions is driven by government initiatives and industry commitments to safeguard critical infrastructure and support the growth of digital ecosystems. These observations underscore the importance of global collaboration and ongoing research to address the evolving cybersecurity challenges in the era of digital transformation.

3.1. Recent Advancements in Machine Learning Algorithms for Malware Detection

In recent years, machine learning (ML) algorithms for malware detection have developed significantly, with studies proposing traditional classification-based methods, ensemble learning, and deep learning-based approaches.

In paper ID 1, traditional algorithms such as K-Nearest Neighbors (KNN) were used to detect cyberattacks based on the multi-step cyberattack dataset (MSCAD) with an accuracy of up to 82.75%. KNN offers simplicity and effectiveness in low-dimensional data, making it suitable for preliminary malware detection tasks. However, its performance often degrades with high-dimensional datasets due to the curse of dimensionality, leading to slower computation and reduced accuracy in complex scenarios.
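To illustrate how KNN classifies a sample, the sketch below implements the basic idea in plain Python: compute the distance from the query to every training sample and take a majority vote among the k closest. It is a generic illustration, not the pipeline of paper ID 1; the two features and all data points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples scaled to [0, 1]; not from any real dataset.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.30),
           (0.80, 0.90), (0.90, 0.70), (0.85, 0.80)]
train_y = ["benign", "benign", "benign", "malware", "malware", "malware"]

print(knn_predict(train_X, train_y, (0.82, 0.85)))
print(knn_predict(train_X, train_y, (0.12, 0.18)))
```

Because every prediction scans all training samples and relies on distances, cost grows with dataset size and distances become less informative as the number of features grows, which is the degradation described above.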

Ensemble algorithms such as Random Forest, Gradient Boosting, LightGBM, and XGBoost are trending in modern malware detection. For example, in paper ID 2, the use of Random Forest for IoT malware detection on the VirusShare dataset recorded an accuracy of 75.50%. Random Forest provides robustness against overfitting by combining multiple decision trees, making it effective for handling imbalanced data and reducing variance. Meanwhile, in paper ID 3, the combination of LightGBM and XGBoost in multi-platform research using the PE dataset and ELF dataset managed to achieve an accuracy of 96.57%, demonstrating the efficiency of the ensemble algorithm for more complex platforms. LightGBM and XGBoost offer faster training and prediction speeds due to their gradient-boosting mechanisms, enabling scalable performance on large datasets and improved generalization.

In addition, deep learning and optimization-based approaches also dominated the latest research. In paper ID 4, the study combined Deep Neural Networks (DNNs) with Gradient Descent, Gradient Descent with Momentum, and Quasi-Newton optimization and achieved the highest accuracy among the reviewed studies, 98.80%, on the Android platform with app datasets from the Google Play Store. Deep Neural Networks excel at learning hierarchical features from raw data, enabling automatic feature extraction and superior performance in recognizing complex patterns. Furthermore, in paper ID 6, Artificial Neural Networks (ANNs) were used alongside SVMs and Gradient Boosting Machines (GBMs) in IoT malware detection, recording an accuracy of 93.44%. ANNs are particularly advantageous for their adaptability in non-linear data modeling and their ability to capture intricate relationships within datasets.
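The difference between plain gradient descent and gradient descent with momentum can be shown on a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the training procedure of paper ID 4: it minimizes f(x) = (x - 3)^2 instead of a neural network loss, the function names and hyperparameters are hypothetical, and Quasi-Newton methods are not covered.

```python
def minimize(grad, x0, lr=0.1, momentum=0.0, steps=200):
    """Gradient descent; momentum > 0 adds a fraction of the previous update."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        # With momentum=0.0 this reduces to the plain update x -= lr * grad(x).
        velocity = momentum * velocity - lr * grad(x)
        x = x + velocity
    return x


def grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2, minimized at x = 3."""
    return 2.0 * (x - 3.0)


plain = minimize(grad, x0=0.0, momentum=0.0)
with_momentum = minimize(grad, x0=0.0, momentum=0.5)
print(f"plain gradient descent: {plain:.6f}")
print(f"with momentum:          {with_momentum:.6f}")
```

Both runs approach the minimizer on this simple objective; the velocity term matters more on loss surfaces where the gradient direction changes slowly or oscillates.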

A hybrid approach that combined multiple algorithms also showed good performance. In paper ID 5, the study, which used Logistic Regression, SVM, Random Forest, and XGBoost to analyze PE header features, achieved 94.6% accuracy using the 2018 Information Protection R&D Data Challenge dataset for the desktop platform. Hybrid approaches leverage the strengths of different algorithms to improve predictive accuracy and robustness, effectively compensating for the limitations of individual techniques.
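One common way to combine several classifiers is hard (majority) voting, sketched below. This is an assumption made for illustration only: the text does not state how paper ID 5 combined its models, and the three prediction lists are hypothetical outputs rather than results from real classifiers.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (all the same length) by majority vote."""
    combined = []
    for sample_votes in zip(*predictions):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three models on four samples (1 = malware, 0 = benign).
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

With an odd number of models and two classes there are no ties, and an individual model's mistake is corrected whenever the other two agree on the right label.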

Overall, these advancements highlight the shift from traditional ML techniques to more sophisticated ensemble and deep learning methods, driven by the need for higher accuracy, scalability, and adaptability to diverse malware behaviors and platforms. The distribution of the machine learning algorithms used in the reviewed studies is illustrated in Figure 2.

Support Vector Machines (SVM), XGBoost, and Random Forest are widely used in research due to their high performance, flexibility, and accuracy [60]. SVM is well-known for its effectiveness in handling high-dimensional data and its ability to optimally separate classes by maximizing the margin between them [61]. With the support of the kernel trick, SVM can also handle non-linear data, making it a reliable choice for various datasets. Meanwhile, XGBoost stands out for its computational efficiency and fast execution, especially with large datasets [62]. This algorithm includes built-in mechanisms to handle missing values and applies effective regularization to prevent overfitting, which is why it is often used in data science competitions. On the other hand, Random Forest offers exceptional stability and accuracy by combining multiple decision trees through an ensemble learning approach [63]. It is capable of dealing with complex and non-linear data while providing interpretability through feature importance analysis [64]. The combination of resistance to overfitting, the ability to handle unstructured data, and consistent performance makes these three algorithms highly popular in various studies and applications, including malware detection and complex data classification.
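The margin maximization mentioned above can be written compactly. The block below gives the standard textbook hard-margin formulation, not a formula taken from the reviewed studies; the symbols (training samples $x_i$, labels $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, feature map $\phi$, kernel $K$) are assumptions introduced here.

```latex
% Hard-margin SVM: the margin width is 2 / ||w||, so maximizing the margin
% is equivalent to minimizing ||w||^2 while classifying every sample correctly.
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n

% Kernel trick: inner products are replaced by a kernel function,
% which corresponds to a linear separation in the feature space of \phi.
K(x_i, x_j) = \phi(x_i)^{\top} \phi(x_j)
```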

3.2. Common Datasets Used in Machine Learning for Malware Detection

The use of varied datasets also plays a key role in training and evaluating ML models. From the results of the review, some of the most frequently used datasets include the following:

VirusShare: This dataset is the most frequently used and provides a wide range of real-world malware samples, particularly for IoT platforms. It ensures access to diverse and up-to-date malware threats, enabling comprehensive model training.
VirusTotal: Commonly used alongside VirusShare, it offers extensive malware samples for evaluation. It supports quick analysis and comparison due to its wide adoption and integration capabilities.
PE Dataset and ELF Dataset: These datasets are frequently used for detecting malware across platforms using advanced methods like wavelet transforms. They focus on platform-specific characteristics, allowing tailored detection approaches.
Android App Dataset: This dataset includes applications from Google Play Store and third-party sources, useful for mobile malware detection research. It aids in analyzing Android-specific malware patterns and supports optimization-based deep learning techniques.
Multi-Step Cyberattack Dataset (MSCAD): This dataset is useful for evaluating sequential attacks and traditional algorithms like KNN. It is particularly effective for testing multi-step threat detection models.
2018 Information Protection R&D Data Challenge: Focused on PE Header feature-based studies, this dataset supports malware detection on desktop platforms. It offers rich metadata for feature-based analysis, improving binary classification models.

The VirusShare dataset is more frequently used in AI-based malware detection research compared to other datasets due to several advantages. VirusShare provides a vast and diverse collection of malware samples, encompassing various types of threats across multiple platforms, making it highly valuable for training AI models with broad coverage [57,65]. Additionally, VirusShare is relatively easy to access for qualified researchers, with a registration process that is less restrictive compared to some other datasets with stricter access limitations. Its popularity within the cybersecurity community further supports its widespread adoption, as it is often cited in various studies, encouraging other researchers to use it for ensuring the relevance and comparability of their work. Furthermore, VirusShare is typically available in standard formats compatible with commonly used malware analysis tools, facilitating analysis and model training. These factors combined make VirusShare a preferred choice in AI-based malware detection research [66].

As shown in Figure 3, the distribution of datasets used in the reviewed studies demonstrates the dominance of VirusShare compared to other datasets.

3.3. Performance of Machine Learning Models Across Platforms

The performance of ML models varies depending on the targeted platform and the dataset used. On the Android platform, the combination of gradient descent and DNN optimization achieved an accuracy of 98.80%, the highest accuracy reported among the reviewed studies.

On the other hand, the multi-platform study using LightGBM and XGBoost recorded an accuracy of 96.57%, demonstrating the effectiveness of the ensemble method in handling PE and ELF datasets. For IoT platforms, research with Random Forest on the VirusShare dataset recorded an accuracy of 75.50%, while ANN- and GBM-based approaches improved accuracy to 93.44%. This shows that deep learning and optimization-based methods have the potential to significantly improve IoT malware detection performance.

On the desktop platform, the hybrid approach using Logistic Regression, SVM, Random Forest, and XGBoost managed to achieve an accuracy of 94.6%, proving the effectiveness of the combination of methods for detecting malware based on the PE Header feature.
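The accuracies discussed in this section can be collected into one structure and ranked. The sketch below transcribes the values from the prose of Section 3; `None` marks details that the prose does not state (the platform of paper ID 1 and the dataset of paper ID 6), and the field names are illustrative. Because the studies use different datasets, the ranking is descriptive only.

```python
# Values transcribed from the prose; None marks a detail the text does not state.
REPORTED = [
    {"paper": 1, "algorithms": "KNN", "dataset": "MSCAD",
     "platform": None, "accuracy": 82.75},
    {"paper": 2, "algorithms": "Random Forest", "dataset": "VirusShare",
     "platform": "IoT", "accuracy": 75.50},
    {"paper": 3, "algorithms": "LightGBM + XGBoost",
     "dataset": "PE and ELF datasets",
     "platform": "Multi-platform", "accuracy": 96.57},
    {"paper": 4, "algorithms": "DNN with gradient-based optimizers",
     "dataset": "Android app dataset",
     "platform": "Android", "accuracy": 98.80},
    {"paper": 5, "algorithms": "Logistic Regression, SVM, Random Forest, XGBoost",
     "dataset": "2018 Information Protection R&D Data Challenge",
     "platform": "Desktop", "accuracy": 94.60},
    {"paper": 6, "algorithms": "ANN, SVM, GBM", "dataset": None,
     "platform": "IoT", "accuracy": 93.44},
]

# Rank all studies by reported accuracy, highest first.
ranked = sorted(REPORTED, key=lambda row: row["accuracy"], reverse=True)

# Keep the highest-accuracy study per platform (first hit in the ranked list).
best_by_platform = {}
for row in ranked:
    platform = row["platform"] or "not stated"
    best_by_platform.setdefault(platform, row)

for row in ranked:
    platform = row["platform"] or "not stated"
    print(f"paper {row['paper']}: {row['accuracy']:.2f}% ({platform})")
```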

The distribution of platforms targeted for malware detection is shown in Figure 4: IoT-focused studies are the most prevalent at 50.0%, followed by desktop and multi-platform studies at 16.7% each and Android studies at 16.6%.
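These shares follow from simple counting over the six reviewed studies. In the sketch below, the per-platform counts are an inference from the reported percentages and the six-study total rather than numbers stated in the text, and the variable names are illustrative.

```python
# Study counts inferred from the reported percentages and the six-study total.
PLATFORM_COUNTS = {"IoT": 3, "Desktop": 1, "Multi-platform": 1, "Android": 1}

total = sum(PLATFORM_COUNTS.values())
shares = {
    platform: round(100 * count / total, 1)
    for platform, count in PLATFORM_COUNTS.items()
}

for platform, share in shares.items():
    print(f"{platform}: {share}%")
```

One study out of six is 16.67%, so rounding each share independently gives 16.7% three times and a total slightly above 100%; the 16.6% reported for Android is presumably an adjustment so that the labels sum to exactly 100%.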

This indicates that IoT remains a significant area of concern due to its rapid adoption and vulnerability, emphasizing the need for enhanced security mechanisms in IoT environments.

The IoT (Internet of Things) platform has become a major target for malware detection in recent research due to several key reasons. First, the number of IoT devices is rapidly increasing across various domains, such as smart homes, healthcare, industry, and transportation [67]. This makes IoT an attractive target for attackers, as many IoT devices have weak security measures and often lack regular system updates [68]. Second, IoT devices are typically connected to broader networks, meaning a compromise in one device can have significant impacts, such as spreading malware to other devices in the network [69]. Third, malware attacks on IoT devices can lead to serious consequences, including physical damage to infrastructure or disruptions to critical services [70]. Lastly, the diversity of IoT devices from different vendors presents unique challenges in malware detection, which draws researchers’ interest in developing more adaptive AI-based solutions. As a result, the focus on malware detection in IoT platforms reflects the effort to address the growing security threats in the increasingly complex IoT ecosystem [71].

4. Conclusions

This study identified the latest developments in the application of machine learning (ML) for malware detection across various platforms. Deep Neural Network (DNN) algorithms showed the best performance on the Android platform with an accuracy of 98.80%, while XGBoost and Support Vector Machines (SVM) excelled on the desktop platform with an accuracy of 94.60%. On IoT platforms, the Artificial Neural Network (ANN) and Gradient Boosting Machine (GBM) approaches achieved 93.44% accuracy, demonstrating the effectiveness of ensemble learning in detecting malware. The most commonly used datasets included VirusShare, VirusTotal, PE Dataset, and ELF Dataset, which support the evaluation and development of ML models. This study emphasizes that the selection of the right ML algorithm and the quality of the datasets play key roles in improving the accuracy of malware detection. Going forward, hybrid AI approaches and model optimization are expected to help address challenges such as limited public datasets and the evolution of increasingly complex malware.

Author Contributions

Conceptualization, N.I.H. and G.P.I.; methodology, N.I.H.; software, N.D.R.; validation, N.I.H., G.P.I. and I.L.K.; formal analysis, N.I.H.; investigation, G.P.I.; resources, I.L.K.; data curation, N.D.R.; writing—original draft preparation, N.I.H.; writing—review and editing, G.P.I. and I.L.K.; visualization, N.D.R.; supervision, I.L.K.; project administration, I.L.K.; funding acquisition, I.L.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Nusa Putra University through the Nutral project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are not publicly available due to privacy or ethical restrictions.

Acknowledgments

The authors would like to express their gratitude to Nusa Putra University for providing the resources and support necessary to conduct this research. Special thanks are extended to colleagues in the Department of Informatics Engineering for their valuable insights and encouragement throughout the study. The authors also acknowledge the researchers and organizations whose studies formed the foundation of this systematic review. This work would not have been possible without the collaborative efforts and contributions of all involved.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. PRISMA diagram.

Figure 2. Distribution of the machine learning algorithms used in the studies.

Figure 3. Distribution of datasets used in studies.

Figure 4. Distribution of the targeted platforms in studies.

Table 1. Inclusion and exclusion criteria.

Criteria	Inclusion	Exclusion
Publication Year	2020–2024	Outside this time range
Document Type	Articles (not conference papers or book reviews)	Conference papers, book reviews, or literature reviews without implementation
Keywords	Contains keywords “machine learning” and “malware”	Does not contain relevant keywords
Accessibility	Open access	Not publicly accessible
Methodological Approach	Utilizes machine learning methods for malware detection	Uses traditional techniques without machine learning
Data Analysis	Provides empirical data or in-depth analysis related to the methods used	Does not provide empirical data or focuses solely on theoretical discussions without experiments
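
The criteria in Table 1 can be read as a conjunction of simple checks applied to each candidate record. The following is a minimal sketch of that screening step, not the authors' actual tooling: the `Candidate` class, its field names, and the sample records are all hypothetical and only illustrate how the six rows of the table combine.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One search hit; every field name here is hypothetical, not from the paper."""
    title: str
    abstract: str
    year: int
    doc_type: str               # e.g. "article", "conference", "book review"
    open_access: bool
    uses_ml_detection: bool     # applies ML methods to malware detection
    has_empirical_results: bool


def meets_criteria(c: Candidate) -> bool:
    """Return True only if the record passes every row of Table 1."""
    text = f"{c.title} {c.abstract}".lower()
    return (
        2020 <= c.year <= 2024                 # Publication Year
        and c.doc_type == "article"            # Document Type
        and "machine learning" in text         # Keywords
        and "malware" in text
        and c.open_access                      # Accessibility
        and c.uses_ml_detection                # Methodological Approach
        and c.has_empirical_results            # Data Analysis
    )


# Made-up records for illustration only.
candidates = [
    Candidate("Machine learning for IoT malware detection", "", 2023,
              "article", True, True, True),
    Candidate("Machine learning for malware triage", "", 2019,
              "article", True, True, True),      # outside 2020-2024
    Candidate("Malware detection with machine learning", "", 2022,
              "conference", True, True, True),   # excluded document type
]
selected = [c for c in candidates if meets_criteria(c)]
print(len(selected))
```

In practice the last two checks require reading the paper, so they are modeled here as flags set by a reviewer rather than something computed from metadata.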

Table 2. Review result table.

Title	Algorithm	Dataset	Accuracy	Platform	Year	Country
Detection And Prevention Of Cyber Defense Attacks Using Machine Learning Algorithms	K-Nearest Neighbors (KNN)	Multi-step cyberattack dataset (MSCAD)	82.75%	IoT	2024	China
A Deep Reinforcement Learning Framework to Evade Black-Box Machine Learning Based IoT Malware Detectors Using GAN-Generated Influential Features	Random Forest, Gradient Boosting, Multi-Layer Perceptron (MLP), Decision Tree	VirusShare and VirusTotal	75.5%	IoT	2023	Saudi Arabia
HMLET: Hunt Malware Using Wavelet Transform on Cross-Platform	LightGBM (LGBM) and XGBoost	PE dataset and ELF dataset	96.57%	Multi-platform	2022	South Korea
PermDroid: a framework developed using proposed feature selection approach and machine learning techniques for Android malware detection	Gradient Descent, Quasi-Newton, Gradient Descent with Momentum, Levenberg–Marquardt, Gradient Descent with Adaptive learning rate, and Deep Neural Network	Android applications (Google Play Store and third-party app stores)	98.8%	Android	2024	India
Static Analysis and Machine Learning-based Malware Detection System using PE Header Feature Values	Logistic Regression, SVM, Random Forest, XGBoost	2018 Information Protection R&D Data Challenge, AI-based malware detection track	94.6%	Desktop	2022	South Korea
Static Malware Analysis Using Low-Parameter Machine Learning Models	Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Gradient Boosting Machines (GBMs)	VirusShare	93.44%	IoT	2024	Mexico
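
As a quick check on the summary statistics that Table 2 supports, the sketch below tallies the targeted platforms and the reported accuracies. The platform and accuracy values are copied from the table; the variable names are illustrative. Because the six accuracies were measured on different datasets, the mean is descriptive only and should not be read as a comparison between models.

```python
from collections import Counter

# (platform, accuracy in % as listed) for the six rows of Table 2.
studies = [
    ("IoT", 82.75),
    ("IoT", 75.5),
    ("Multi-platform", 96.57),
    ("Android", 98.8),
    ("Desktop", 94.6),
    ("IoT", 93.44),
]

# How often each platform is targeted (the quantity plotted in Figure 4).
platform_counts = Counter(platform for platform, _ in studies)

# Number of studies whose listed accuracy exceeds 90%.
above_90 = sum(1 for _, acc in studies if acc > 90.0)

# Unweighted mean of the listed accuracies.
mean_accuracy = sum(acc for _, acc in studies) / len(studies)

print(platform_counts.most_common())
print(above_90, round(mean_accuracy, 2))
```

On these six rows, IoT accounts for three studies and four of the six listed accuracies exceed 90%.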

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
