Systematic Review

Systematic Review: Malware Detection and Classification in Cybersecurity

by
Sebastian Berrios
1,*,
Dante Leiva
1,
Bastian Olivares
1,
Héctor Allende-Cid
1,2,3,* and
Pamela Hermosilla
1
1
Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Valparaíso 2340025, Chile
2
Knowledge Discovery, Fraunhofer-Institute of Intelligent Analysis and Information Systems (IAIS), 53757 Sankt Augustin, Germany
3
Lamarr Institute for Machine Learning and Artificial Intelligence, 44227 Dortmund, Germany
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7747; https://doi.org/10.3390/app15147747
Submission received: 10 March 2025 / Revised: 21 April 2025 / Accepted: 1 July 2025 / Published: 10 July 2025

Abstract

Malicious software, commonly known as malware, represents a persistent threat to cybersecurity, targeting the confidentiality, integrity, and availability of information systems. The digital era, marked by the proliferation of connected devices, cloud services, and the advancement of machine learning, has brought numerous benefits; however, it has also increased exposure to cyber threats, affecting both individuals and corporations. This systematic review, which follows the PRISMA 2020 framework, aims to analyze current trends and new methods for malware detection and classification. The review was conducted using data from Web of Science and Scopus, covering publications from 2020 to 2024, with 47 key studies selected for in-depth analysis based on relevance, empirical results, and citation metrics. These studies cover a variety of detection techniques, including machine learning, deep learning, and hybrid models, with a focus on feature extraction, malware behavior analysis, and the application of advanced algorithms to improve detection accuracy. The results highlight important advances, such as the improved performance of ensemble learning and deep learning models in detecting sophisticated threats. Finally, this study identifies the main challenges and outlines opportunities for future research to improve malware detection and classification frameworks.

1. Introduction

Malicious software, better known as malware, refers to programs that perform unwanted interventions within victim computers or devices [1]. Whether with or without the consent of the affected parties, malicious programs carry out harmful actions on computer systems, causing significant damage to the compromised machines. Because malware is a key link in cyberattacks, the objective of these programs is to affect any of the three pillars of information security: confidentiality, integrity, or availability of data. The impact has been so significant in recent years that science and business journals report that malware has cost the global economy trillions of dollars [1,2].
In recent decades, we have witnessed a rapid growth and evolution in technology, radically changing the way we live and manage our digitalization. The proliferation of internet-connected devices, cloud services, the growth of Machine Learning (ML) [3], and the expansion of social networks have created a rich and complex digital environment. This new digital era has brought various benefits, ranging from greater global connectivity to large-scale business operations. However, it has also introduced a series of threats and vulnerabilities that target the confidentiality, integrity, and availability of information.
Since the beginning of the pandemic, malware has increased significantly, with an estimated 1.5 billion new samples by 2020 [4]. This surge has impacted organizations, which have been forced to enhance their security measures. Malware is categorized by behavior, functionality, and propagation methods. Below are the main types of malware along with their characteristics:
The attacks are not limited to individuals but also affect businesses of all kinds [5]. These attacks can have devastating consequences, including the theft of personal and financial data, significant operational disruptions, and severe reputational damage [6,7]. Companies face constant challenges in protecting their systems and data from rapidly evolving cyber threats, such as ransomware, phishing, and software supply chain vulnerabilities [8,9].
The challenge of securing new technologies mainly falls within the area of cybersecurity. This field is dedicated to protecting computer systems and everything related to them. Over the years, cybersecurity has been able to contain cyber threats by employing a sophisticated arsenal of techniques and methodologies that evolve to enhance their robustness and effectiveness. In recent years, the advancement of machine learning has stood out as a powerful tool for detecting cyberattacks [10,11].
As shown in Figure 1, the number of publications on malware in cybersecurity has grown exponentially in the last 15 years (2010–2024). This increase reflects the need for new techniques and methodologies to address current problems, where the results of systematic searches reveal that malware represents a significant threat to today’s technology [12,13]. Without the ingenuity of cybercriminals to combine various methods, these attacks would not be possible. Additionally, human vulnerability plays a crucial role in the spread of these attacks [14,15].
Machine Learning (ML) plays a fundamental role in malware detection and technological advancement. With the increasing sophistication of cyberattacks, it is becoming more difficult to detect and combat malware using traditional methods based on static rules. This is where machine learning becomes crucial, as it enables the development of algorithms and models capable of autonomously learning and adapting to new threats [16,17].
Machine Learning-based malware detection systems can analyze large volumes of data and complex patterns, identifying malicious behaviors and anomalies that traditional approaches may miss. These systems continuously learn from exposure to new malware samples, improving their accuracy and detection capability over time [18,19].
The objective of this article is to conduct a systematic literature review on malware detection and classification in the field of cybersecurity. The aim is to provide an overview of the current technologies that address this issue, as well as to offer various techniques and methodologies focused on malware detection across different areas of technology. This article aims to provide a comprehensive view of the available solutions and highlight the importance of research in this field to protect systems and data from cyber threats [20,21]. In Figure 1, the number of annual publications associated with the review is presented.
Malware detection in cybersecurity involves analyzing the services and propagation platforms used by cybercriminals, as well as using analysis methods to understand the nature and behavior of malware. Feature extraction techniques are fundamental to identifying malicious patterns, while accurate malware detection and classification enable appropriate measures to mitigate associated risks. Together, these areas contribute to more robust cybersecurity [22,23].

2. Methodology

Systematic reviews aim to generate consensus and address specific problems, establishing a solid foundation for investigations [24,25]. Systematic Literature Reviews (SLR) help researchers identify the state of the art and answer unresolved questions.
To conduct a systematic search, it is necessary to clearly define the methodology to be implemented. For this reason, the SLR for this report was conducted under the PRISMA 2020 framework [24]. PRISMA is a methodology that provides a checklist for publishing structured abstracts of systematic reviews in journals and conferences, ensuring transparency and reproducibility, allowing a comprehensive review of the most relevant research in the field. As the first step in the methodology, the following research questions, focused on malware detection and classification in the field of cybersecurity, were established:
  • RQ1. What is the goal of malware detection in the field of cybersecurity?
  • RQ2. What techniques are implemented for malware detection?
  • RQ3. How has deep learning contributed to malware detection and classification?
  • RQ4. What datasets are currently used for detection and classification techniques?

2.1. Search Strategy

The SLR search was conducted in two major academic databases: Scopus and Web of Science (WoS). The publication period for the review ranged from 2020 to 2024, with no year restriction for the state of the art. The search queries were designed to cover a wide range of malware detection and classification studies, using keywords such as “malware” and “cybersecurity”, yielding 578 documents from WoS and 2126 from Scopus, as shown in Table 1.
The exclusive selection of Scopus and WoS as primary sources for this systematic review responds to their recognized scientific rigor, extensive indexing of peer-reviewed literature, and their broad thematic scope in cybersecurity and computer science. These databases ensure high-quality bibliographic records and support reproducibility, a key aspect of systematic reviews. Although other sources such as IEEE Xplore or ACM Digital Library contain valuable contributions in the field, the scope of this study was limited to WoS and Scopus to maintain consistency in indexing standards, citation metrics, and access to empirical evidence. Future work may expand this scope to incorporate specialized repositories.

2.2. Selection and Filtering Process

To ensure the quality and relevance of the selected studies, the following criteria were taken into account:

2.2.1. Inclusion Criteria (IC)

The inclusion criteria (IC) were defined to ensure the relevance, quality, and applicability of the studies selected for the systematic review. Each criterion was designed to support the identification of contributions directly related to the research objectives.
  • IC1: Focus on malware detection and classification methods: This criterion ensures that the selected studies specifically focus on the topic of interest, avoiding the inclusion of irrelevant research.
  • IC2: Presentation of empirical results: The requirement of empirical results ensures that the included studies provide practical evidence on the effectiveness of detection and classification methods, strengthening the validity and relevance of the review.
  • IC3: Publications in English and Spanish: Including studies in multiple languages facilitates access to a broader range of relevant research and promotes the inclusion of diverse perspectives in the review.
  • IC4: Recent publication period for techniques: The inclusion of research from 2020 to 2024 ensures that the review reflects the most current advances in the field of malware detection and classification techniques, maintaining its relevance and timeliness.
  • IC5: No year restriction for the state of the art: By not imposing a year restriction for the state of the art, relevant studies are included regardless of their age, providing a comprehensive view of the historical development of the topic.
  • IC6: Focus on innovative techniques and comparative approaches: This criterion ensures the inclusion of studies that present significant advances in the field, as well as comparisons between different approaches, enriching the analysis and providing useful information for decision-making.

2.2.2. Exclusion Criteria

The exclusion criteria (EC) were established to eliminate studies that do not meet the minimum requirements for relevance, empirical validity, accessibility, or applicability. These criteria ensure the rigor and coherence of the systematic review by narrowing the focus to studies of demonstrated scientific value and relevance to the scope of the investigation.
  • EC1: Studies unrelated to malware detection and classification: Excluding irrelevant research avoids diluting the focus and relevance of the review, ensuring that it concentrates on the specific topic of interest.
  • EC2: Absence of empirical results: Excluding studies that do not present empirical results guarantees the reliability and validity of the data included in the review, avoiding speculation or subjective interpretation.
  • EC3: Limited availability of publications: Excluding studies that are not available in full ensures that all the necessary information is accessible for thorough and rigorous evaluation.
  • EC4: Irrelevance or duplication of studies: Eliminating irrelevant or duplicate studies optimizes resource use and ensures the inclusion of new and meaningful information in the review.
  • EC5: Lack of practical applicability: Excluding studies focused solely on theoretical aspects without practical applications ensures that the review concentrates on research with direct relevance to the practice of malware detection and classification.
In Figure 2, a flowchart of the PRISMA method is presented, illustrating key information such as the number of publications screened, excluded, and included, along with other relevant details. After an initial screening, further filtering was conducted based on inclusion/exclusion criteria, summarized in Table 2.
To prioritize well-referenced works, duplicates and documents with fewer than 10 citations were excluded. Following this filtering step, 241 documents remained. To facilitate the selection process, a Python 3.13.3 algorithm, shown in Algorithm 1, was used to merge and rank the articles by number of citations. This process guaranteed that the most widely referenced and impactful studies were prioritized for analysis.
Algorithm 1 Duplicate and citation refinement.
import pandas as pd
from google.colab import drive
# Mount Google Drive so the merged export can be read from Colab
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/total-papers.csv'
df = pd.read_csv(file_path)
# Normalize titles (strip accents, upper-case) so duplicates match across databases
df['Title'] = df['Title'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
df['Title'] = df['Title'].str.upper()
df.info()
# Drop duplicate titles, keeping the first occurrence
df_sin_duplicados = df.drop_duplicates(subset=['Title'], keep='first')
# Keep only records with at least 11 citations
df_filtrado_final = df_sin_duplicados[df_sin_duplicados['Cited by'] >= 11]
df_filtrado_final.info()
df_filtrado_final.to_csv('/content/papers-no-duplicados-cites.csv', index=False)
To further refine the data set, a selection based on the number of citations was made, retaining the 47 most-cited studies. This additional selection step ensured that the review focused on the most influential contributions in the field. Among these 47 studies, 21 were from Scopus and 26 from WoS, forming the final set of studies included in this review.
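The final ranking step is described in the text but is not part of Algorithm 1. The following is a minimal sketch of how deduplication, the citation threshold, and the selection of the most-cited studies could be combined using only the Python standard library. The function names, the sample records, and the default parameters are illustrative assumptions and do not reproduce the authors' script.

```python
import unicodedata

def normalize_title(title: str) -> str:
    """Strip accents and upper-case a title so near-identical records match."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")
    return ascii_only.upper().strip()

def select_most_cited(records, min_citations=11, top_n=47):
    """Deduplicate by normalized title, filter by citations, keep the top_n."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["Title"])
        if key not in seen:  # keep the first occurrence, as in Algorithm 1
            seen.add(key)
            unique.append(record)
    cited = [r for r in unique if r["Cited by"] >= min_citations]
    ranked = sorted(cited, key=lambda r: r["Cited by"], reverse=True)
    return ranked[:top_n]

# Hypothetical records standing in for the merged Scopus/WoS export
papers = [
    {"Title": "Détection of Malware", "Cited by": 40},
    {"Title": "DETECTION OF MALWARE", "Cited by": 38},  # duplicate after normalization
    {"Title": "IoT Botnet Analysis", "Cited by": 120},
    {"Title": "A Rarely Cited Study", "Cited by": 3},
]
selected = select_most_cited(papers, min_citations=11, top_n=2)
```

Sorting after deduplication means that, when two records share a title, the one that appears first in the merged file determines the citation count used for ranking.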

3. Results

This section provides a detailed analysis of the findings from the SLR. It begins with an exploration of the evolution of malware over time, followed by an examination of the services and platforms most vulnerable to malware propagation and an overview of the methods used for malware analysis. The section also addresses feature extraction techniques, detection strategies, and advanced methodologies, highlighting innovations such as hybrid approaches and machine learning-based techniques.
Additionally, the analysis includes a comparative review of studies, summarized in Table 3, which outlines the key methods and their benefits. Finally, the section reviews the datasets commonly used for malware detection shown in Table 4, offering insights into their role in the development and validation of detection models.

3.1. Evolution of Malware over Time

Malware originated in 1971 with Creeper, the first computer worm capable of sending messages between computers over a local network. Later, in 1988, the Morris Worm marked the beginning of mass infections [26]. During the 1990s, new threats emerged, such as malicious macros (e.g., X97M/Laroux) and email-delivered viruses, such as Melissa [27]. The expansion of global networks gave way to more complex attacks, including ILOVEYOU, Code Red, and Nimda [28].
In the 2000s, the use of techniques such as polymorphism and the emergence of ransomware and Trojans (e.g., Zeus, Torpig) increased the sophistication of attacks. Furthermore, the implementation of rootkits, such as SONY BMG, revealed critical risks in the system boot process.
Since 2010, malware variants have adopted advanced techniques such as metamorphism, encryption, and vulnerability exploitation. Ransomware evolved toward the encryption of critical data, exemplified by NotPetya, Ryuk, and WannaCry. This was followed by large-scale attacks, such as those targeting Colonial Pipeline and JBS. Despite a decline in 2022, new threats such as triple extortion and BlackCat, written in Rust, persist.
Finally, APT-style attacks (e.g., Hydraq, Aurora, Carbanak, APT29/SolarWinds) and the takedown of the Snake malware in 2023 through Operation MEDUSA, which relied on the FBI-developed PERSEUS tool, reaffirm the continued evolution of these threats and the need for more sophisticated defense strategies.
Since 2020, there has been a growing body of scientific research addressing malware detection and mitigation from a technical and academic perspective. This work includes a Systematic Literature Review (SLR) on malware between 2020 and 2024, which analyzed studies focused on new attack and defense strategies, such as the use of artificial intelligence, deep learning, fileless malware detection, and adversarial attacks.
Figure 3 visually summarizes the evolution of the topics addressed by the reviewed studies, highlighting the transition from basic threats such as classic worms and viruses to complex techniques such as APTs, triple extortion ransomware, and the use of adversarial machine learning. This shift reflects not only the increasing sophistication of malicious actors but also the sustained effort of the scientific community to anticipate and respond with advanced cybersecurity tools.
This SLR demonstrates a notable increase in the quantity and quality of research aimed at understanding and mitigating emerging threats, cementing the role of artificial intelligence as a key element in the evolution of cyber defense techniques.
Malware has evolved alongside technology, becoming more sophisticated and harder to detect. From traditional malware to next-generation threats, they continue to represent a constant danger to cybersecurity. Traditional malware, such as viruses and worms, spread by inserting malicious code into executable files or exploiting vulnerabilities in operating systems. While still a concern, next-generation malware has expanded its scope and complexity [29].
Furthermore, next-generation malware often employs encryption to hide its payload and malicious communications, and uses obfuscation to make detection and identification more difficult, complicating both static and dynamic analysis for security researchers.
The methodological distribution of the 47 studies included in this systematic review is shown in Figure 4.

3.2. Services and Platforms Prone to Malware Propagation

Malware developers often focus on information niches with a large number of users. Therefore, the services and platforms for which malware detection is most relevant include computers, mobile devices, IoT devices, and cloud platforms [1].

3.2.1. Internet of Things (IoT)

The Internet of Things (IoT) encompasses all devices capable of connecting to the internet and interacting with physical processes, where human intervention may be minimal. However, this high level of connectivity brings a cybersecurity gap, increasing the attack surface in both domestic and industrial environments [30]. Many IoT devices lack security updates and present insecure default configurations [31].
The exponential growth of IoT devices, direct internet access, and a lack of security updates represent significant risk factors that increase the attack surface in connected environments. Among the most vulnerable components are insecure default configurations, communication channels susceptible to compromise, and applications infected with malware. These elements have facilitated attacks against critical infrastructure, large-scale data theft, and acts of cyberterrorism, highlighting the high impact of the threats associated with these environments [32].

3.2.2. Android OS

By 2010, the growth in the use of smartphones, tablets, and personal digital assistants (PDAs) far outpaced that of traditional computers. In this context, the Android operating system reached a market share of approximately 78% [33]. The open nature of this platform exposes users to significant risks, especially due to the ease with which malware can be distributed through third-party applications.
Main risk factors include the large number of devices and applications available, as well as the widespread use of obfuscation techniques by attackers. The predominant types of malware in this ecosystem include adware, botnets, ransomware, and Trojans [34,35,36]. In quantitative terms, more than 3.25 million malicious applications were identified in 2016 alone, reflecting the magnitude of the challenge in protecting mobile devices.

3.2.3. Cloud Networks and Services

Cloud computing (CC) has rapidly become a paradigm offering various services such as storage, computing, messaging, multimedia, artificial intelligence, and security. This technology enables access to data from anywhere and at any time. However, this advantage also poses a significant threat, as it facilitates malware access [37].
  • Service Models:
    • SaaS: Cloud-hosted services providing user tools.
    • PaaS: Development environment for application creation without managing infrastructure.
    • IaaS: Access to virtualized computing resources over the internet [2].
  • Deployment Types:
    • Public Cloud: Shared resources at a lower cost.
    • Private Cloud: Dedicated infrastructure for specific organizations.
    • Community Cloud: Shared infrastructure for organizations with similar needs [38].
  • Risk Factors: Large data volumes, exploitable web applications.
  • Attack Examples: Phishing, malware in cloud-based services [36,39].

3.2.4. Mobile Devices

Mobile devices, especially those using the Android operating system, are frequent targets due to their large user base and the variety of available applications.
  • Risk Factors:
    • Attackers use obfuscation techniques to evade detection.
    • Vulnerability to various forms of malware, including ransomware, trojans, and spyware.
  • Consequences: Theft of personal data, unauthorized access to confidential information, and distribution of malware through mobile networks [34,35,39].

3.3. Malware Analysis Methods

The process of malware analysis is an important topic, as the method used to study malware may cause the software to activate and spread across the machine where the study is conducted. For this reason, malware analysis is divided into two broad categories, which include the following techniques:
  • Static Analysis: This involves identifying the structure of the malware without needing to execute its code. Various techniques can be employed to extract information that defines the type of software or file, such as source code analysis, string extraction, and behavioral pattern identification. The goal is to extract characteristics of the file being studied to determine whether it is malicious or benign.
  • Dynamic Analysis: This involves executing the malware in a controlled environment to observe its behavior [1]. A dynamic environment example for feature extraction and malware detection is cloud-based virtual machines (VMs). This consists of a highly controlled environment where multiple VMs are run to simulate different operating systems and user environments. These VMs are used to analyze malware samples and extract features that can aid in the early and accurate detection of threats. With cloud infrastructure and resources, researchers can conduct thorough and comparative malware analyses, identifying behavioral patterns and malicious signatures to develop better protection measures and risk mitigation strategies [40].
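As an illustration of the string extraction step mentioned under static analysis, the following minimal sketch scans a byte buffer for runs of printable ASCII characters without executing anything. The function name, the minimum length of four characters, and the sample bytes are assumptions made for the example and are not taken from the reviewed studies.

```python
import re

def extract_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return runs of printable ASCII characters of at least min_length bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]

# Hypothetical byte buffer standing in for the contents of a suspicious file
sample = b"\x00\x01MZ\x90\x00http://example.test/payload\x00\xffcmd.exe /c\x00ab\x00"
strings = extract_strings(sample)
# strings == ['http://example.test/payload', 'cmd.exe /c']
```

Extracted strings such as URLs or command lines can then serve as static features, since the file is only read and never run.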

3.4. Feature Extraction Techniques

Once the malware analysis method is chosen, it is important to identify the most appropriate technique for extracting the features that the malware analysis produces. These features are then used in a model with the objective of identifying the nature of the file. The following techniques are commonly used for feature extraction [34].
  • N-gram: This technique extracts consecutive sequences of n elements from a file or data stream, grouping them by n-values (e.g., 2-gram, 4-gram). For example, if the system outputs the result P = (1, 2, 3, 4, 5), then the 2-grams and 4-grams will be ({1, 2}, {2, 3}, {3, 4}, {4, 5}) and ({1, 2, 3, 4}, {2, 3, 4, 5}), respectively. This technique can be applied to both static and dynamic analyses. Despite being widely used in malware detection, the n-gram technique has the limitation that the attributes, whether static or not, may not have a relational sequence, making classification and grouping more challenging [41].
  • Graph-based Technique: Uses graph structures to represent interactions between file components. For instance, Control Flow Graph (CFG) analysis is a specific graph-based approach used for detecting malware behavior, particularly in environments like IoT [32]. Subgraphs can be created for larger datasets, aiding analysis.
  • Vision-based Techniques: Image-based feature extraction in malware analysis involves converting the malware or binarized file into a visual representation. In this approach, the binarized malware is grouped into 8-bit sets and represented in a two-dimensional (2D) matrix. This technique can be used in both static and dynamic analyses [42]. The main advantage of image-based visualization in malware analysis is the ability to extract information from the file, such as operation code (opcode), byte sequence, API calls, and system calls.
  • Hashing Technique: A feature extraction method for malware identification and analysis. It generates a unique hash value (e.g., MD5, SHA-1, SHA-256) from binary data, providing a quick identifier for malware detection and classification [43].
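The n-gram example given in the list, with P = (1, 2, 3, 4, 5), can be reproduced with a short sliding-window sketch. The function name and the byte sequence in the last line are illustrative assumptions, and the tuples correspond to the brace-delimited groups used in the text.

```python
from collections import Counter

def ngrams(sequence, n):
    """Return all consecutive n-element windows of a sequence, in order."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

P = (1, 2, 3, 4, 5)
two_grams = ngrams(P, 2)   # [(1, 2), (2, 3), (3, 4), (4, 5)]
four_grams = ngrams(P, 4)  # [(1, 2, 3, 4), (2, 3, 4, 5)]

# Applied to raw bytes, n-gram counts can serve as a feature vector
byte_counts = Counter(ngrams(b"\x90\x90\xeb\x90\x90", 2))
```

Counting how often each n-gram occurs is one common way to turn the windows into numeric attributes for a classifier.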
Figure 5 summarizes the main feature extraction techniques identified. These include approaches such as n-grams, graphs, visual representations, and hash indicators, which are essential for transforming malware data into useful attributes for detection and classification. This representation allows us to visualize the methodological diversity and its relevance depending on the type of analysis.
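Two of these representations, the 2D byte matrix used by vision-based techniques and the hash indicator, can be sketched with the Python standard library. The helper names, the row width, the zero padding of the last row, and the sample bytes are assumptions made for illustration, since the reviewed studies use their own image sizes and preprocessing.

```python
import hashlib

def bytes_to_matrix(data: bytes, width: int) -> list[list[int]]:
    """Arrange bytes row by row into a 2D matrix of 8-bit values (0-255).

    The final row is zero-padded so that every row has the same width.
    """
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

sample = bytes([0, 64, 128, 255, 17])     # hypothetical 5-byte "binary"
image = bytes_to_matrix(sample, width=3)  # [[0, 64, 128], [255, 17, 0]]
digest = sha256_hex(sample)               # 64 hexadecimal characters
```

Each matrix entry can be read as a grayscale pixel intensity, whereas the digest identifies only exact copies of the file, so a single changed byte produces a different hash.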

3.5. Techniques for Malware Detection

The collection of studies on machine learning (ML) techniques for malware detection can be grouped into several thematic categories, each focusing on different methodologies and applications within cybersecurity.
  • Ensemble Learning and Interpretability: Kumar and Subbiah’s use of Shapley values with boosting (e.g., XGBoost) and bagging models enhances zero-day malware detection by improving interpretability and emphasizing critical features like file entropy [44,45].
  • CNN and Image-Based Detection: Yadav et al. leverage EfficientNet to convert Android DEX bytecode into RGB images, achieving 95.7% accuracy for Android malware detection, outperforming other models [34,46].
  • Transfer Learning and Multimodal Techniques: Ullah et al. combine BERT for textual feature extraction and image-based techniques for visual features, resulting in over 98% accuracy in Android malware detection through ensemble classifiers [35,45].
  • Anomaly Detection and Adversarial ML: McCarthy et al. focus on detecting network anomalies and addressing adversarial attacks using adversarial training and generative adversarial networks (GANs) to improve robustness in intrusion detection systems [47,48].
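File entropy, mentioned above as a critical feature, is commonly computed as the Shannon entropy of the byte distribution. The following minimal sketch shows that calculation; the function name and the test inputs are illustrative assumptions and are not taken from the cited studies.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform = bytes(range(256))  # every byte value once: maximal entropy
constant = b"\x00" * 256     # a single repeated value: zero entropy
```

Values close to 8 bits per byte are often associated with compressed or encrypted content, which is one reason entropy is used as an input feature.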

3.6. Advanced Malware Detection Techniques

Advanced malware detection techniques surpass traditional methods by incorporating approaches such as machine learning, computer vision, federated learning, memory analysis, and adversarial learning. These methodologies effectively address emerging threats such as fileless malware, adversarial attacks, and distributed environments (IoT/IIoT), showing improvements in accuracy, robustness, and computational efficiency [49]. Strategies include active learning, which reduces the need for labeled data; federated learning, which preserves privacy; and computer vision, which converts binaries into images for analysis with convolutional networks [50]. Additionally, hybrid architectures and memory analysis are employed to improve detection and generalization against real-world attacks [51].
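To make the privacy argument for federated learning concrete, the following minimal sketch shows weighted averaging of model parameters, in the style of federated averaging. It assumes that each device trains locally and shares only a weight vector and its sample count; the function name and the numbers are hypothetical and do not come from the reviewed studies.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Hypothetical updates from three devices; raw samples never leave the devices
weights = [[0.0, 1.0], [1.0, 1.0], [4.0, 1.0]]
sizes = [100, 100, 200]
global_weights = federated_average(weights, sizes)  # [2.25, 1.0]
```

Only the aggregated parameters reach the central server in this scheme, which is the property that the text refers to as preserving privacy.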

3.6.1. Adversarial Machine Learning in Cybersecurity and Intrusion Detection

The article “Functionality-Preserving Adversarial Machine Learning for Robust Classification in Cybersecurity and Intrusion Detection Domains: A Survey” by McCarthy et al. provides a detailed review of adversarial machine learning (AML) and its application in cybersecurity domains, especially in intrusion detection systems (IDS) [47,48]. The study addresses the creation of adversarial examples, i.e., manipulated inputs that induce errors in classification models and allow evading defense mechanisms [45,52].
One of the key contributions of the work is the focus on functionality preservation, which implies that the adversarial example must maintain the original malicious behavior even after modification [53,54]. This property distinguishes cybersecurity attacks from those carried out in visual domains, where only deceiving human perceptual sensors is required [55,56].
The article conducts a systematic literature review, using databases such as IEEE Xplore, ACM Digital Library, and Scopus, to identify trends, attack types, defense strategies, and challenges inherent in generating adversarial examples in non-visual environments [57,58]. It is highlighted that discrete data typical of IDS present greater difficulties for gradient-based techniques, since small alterations can invalidate network packets or be filtered by firewalls [59,60].
The defenses evaluated include: adversarial training, which trains the model with legitimate and adversarial examples; defenses based on anomaly detection; feature modification to mitigate disruptions; attack surface reduction through simplified models; and the use of GANs to generate and detect adversarial examples [45,54,55,61].
In conclusion, the study by McCarthy et al. constitutes a relevant contribution to the analysis of AML in cybersecurity, highlighting the value of functionality-preserving approaches as robust mechanisms for strengthening integrity and confidentiality in threat detection systems [47].

3.6.2. Boosting Training for PDF Malware Classification via Active Learning

The article “Boosting Training for PDF Malware Classifier via Active Learning”, published in the International Journal of Intelligent Systems, proposes an efficient method for improving malware detection in PDF files using active learning. This approach significantly reduces the size of the training set by selecting and labeling only uncertain samples, those classified with low confidence that provide greater informative value.
Using a mutual agreement analysis between submodels of an ensemble classifier, the most useful instances for retraining are identified. Experiments conducted with the Contagion dataset show that the model achieves equivalent performance to the traditional model using only a fraction of the data, achieving efficiency in time, labeling, and computational resources.
The proposal is especially relevant in dynamic cybersecurity environments, where models must be frequently updated to respond to new threats. Thus, Li et al. offer a scalable and effective solution for malware detection in PDF files [62].
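The idea of selecting uncertain samples through agreement between submodels can be sketched as follows. This is a simplified illustration based on vote agreement, not the exact procedure of Li et al.; the function names, the 0.8 threshold, and the votes are assumptions.

```python
from collections import Counter

def agreement(votes):
    """Fraction of submodels that voted for the majority label."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return majority_count / len(votes)

def select_uncertain(ensemble_votes, threshold=0.8):
    """Return indices of samples whose submodel agreement is below threshold."""
    return [i for i, votes in enumerate(ensemble_votes) if agreement(votes) < threshold]

# Hypothetical votes from five submodels (1 = malicious, 0 = benign)
ensemble_votes = [
    [1, 1, 1, 1, 1],  # unanimous: confident
    [1, 0, 1, 0, 1],  # split 3-2: uncertain
    [0, 0, 0, 0, 1],  # 4-1: agreement 0.8, not below the threshold
    [0, 1, 0, 1, 1],  # split 3-2: uncertain
]
to_label = select_uncertain(ensemble_votes)  # [1, 3]
```

Only the selected samples would be sent for manual labeling and added to the training set, which is how the approach reduces labeling effort.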

3.6.3. AI-Enabled Intrusion Detection Techniques in Complex Digital Ecosystems

The article “Securing the digital world: Protecting smart infrastructures and digital industries with artificial intelligence (AI)-enabled malware and intrusion detection”, published in the Journal of Industrial Information Integration, examines the use of artificial intelligence models to strengthen malware and intrusion detection in networks, mobile devices, and IoT infrastructures. The study compares machine learning classifiers such as logistic regression, Random Forest, GBM, and deep neural networks, as well as ensemble techniques such as stacking.
Validation is performed on three datasets: NSL-KDD (network intrusions), Drebin-215 (Android malware), and Edge-IIoTset (IoT threats), covering different attack vectors. The article also analyzes the challenges of integrating these models into real-world systems, including the need for continuous updates, privacy management, interoperability, and resource constraints on IoT devices.
The paper provides a comprehensive overview of advanced AI-based detection techniques, highlighting their effectiveness and the obstacles to their implementation in complex digital ecosystems [63].

3.6.4. IIoT Malware Detection Using Edge Computing and Deep Learning

The article titled “IIoT Malware Detection Using Edge Computing and Deep Learning for Cybersecurity in Smart Factories” by Ho-myung Kim and Kyung-ho Lee, published in Applied Sciences, addresses malware detection in smart factories through an edge computing and deep learning-based detection system. This study focuses on implementing an efficient system to detect cyberattacks by distributing large amounts of IIoT traffic data to edge servers for deep learning processing.
Edge computing is an emerging technology that can solve latency and bandwidth issues associated with cloud computing. In the context of cybersecurity in smart factories, resource-constrained IIoT devices cannot directly run deep learning models due to high computational and memory demands. Edge computing allows offloading intensive processing tasks to edge servers equipped with high-performance hardware, thus improving the efficiency and speed of the malware detection system.
The proposed malware detection system consists of three layers: edge device, edge, and cloud. Each layer has specific functions for training, deploying, and inferring deep learning models, as well as transmitting training data. The edge device layer includes IIoT terminals that collect and process data. The edge layer consists of devices and servers that perform detection using deep learning models. The cloud layer handles the training of global deep learning models and optimizes local models sent from edge servers.
In the experiments, the Malimg dataset, containing malware images from 25 different families, was used. This dataset was preprocessed and converted into RGB images to improve classification accuracy. In the experiment, the malware detection system based on a Convolutional Neural Network (CNN) achieved an accuracy of 98.93%, precision of 98.93%, recall of 98.93%, and an F1-score of 98.92%.
Four standard metrics were used to evaluate the performance of the proposed model: accuracy, precision, recall, and F1-score. These metrics were calculated using performance evaluation parameters, including true positives, true negatives, false positives, and false negatives [64].
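For reference, the four metrics can be computed from the confusion counts as in the following sketch. It covers the binary case only; Malimg is a multi-class dataset, and how the per-family values were averaged is not stated here, so the function and the example counts are illustrative assumptions rather than the authors' evaluation code.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```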

3.6.5. Memory Forensics for Fileless Malware Detection

The article “Fileless malware threats: Recent advances, analysis approach through memory forensics and research challenges”, published in Expert Systems With Applications, examines recent advances in the detection and analysis of fileless malware, proposing an analytical approach based on memory forensics techniques. This type of malware represents an advanced threat due to its ability to operate exclusively in system memory, evading traditional file-based detection techniques.
Unlike traditional malware, fileless malware does not require executable files on disk, but instead injects malicious code directly into system memory through vectors such as remote code execution, phishing emails, or compromised websites. It then uses operating system mechanisms, such as the Windows Registry, Task Scheduler, or WMI, to maintain its persistence.
The article classifies malware analysis techniques into three main approaches: static analysis, dynamic analysis, and memory analysis. Static analysis is fast and resource-efficient, but vulnerable to obfuscation techniques; dynamic analysis provides visibility into malware behavior in controlled environments, although it can be detected by the malware itself; and memory analysis allows for the identification of active threats in RAM, making it particularly useful against fileless malware.
The research proposes a memory-forensics-based approach to detect and study malicious behavior that is not visible using conventional methods. Tools such as Volatility, Process Explorer, and Wireshark are used to extract evidence from the system’s volatile memory, detecting suspicious processes and connections [65].
As a case study, the Kovter malware is analyzed, a representative example of fileless malware that resides in the Windows registry, injects itself into legitimate processes such as regsvr32.exe, and establishes connections to malicious URLs. A new dataset with 1249 samples, including Kovter variants, is also presented to support future research in this field.
The article identifies several challenges to accurately detecting these types of threats, including the need to improve volatile data collection capabilities, adapt forensic tools, and develop models that integrate machine learning. In summary, Kara’s study highlights the effectiveness of memory analysis as a key technique for detecting and understanding the persistent and evasive behavior of fileless malware.

3.6.6. Machine Learning for Malware Detection

The article “Evaluation of Machine Learning Algorithms for Malware Detection”, published in the journal Sensors, examines the use of machine learning algorithms applied to dynamic malware detection, highlighting their effectiveness compared to traditional signature-based methods [66]. Given the increasing complexity and volume of malicious software, conventional antivirus systems are insufficient for detecting emerging threats, which has led to the development of approaches based on behavioral analysis.
The study proposes a malware detection model based on dynamic analysis, in which malicious programs are executed in controlled environments—such as sandboxes—to record their behavior. This information is transformed into sparse vectors and used as input for various classification algorithms, including k-NN, decision trees (DT), Random Forest (RF), AdaBoost, Stochastic Gradient Descent (SGD), Extra Trees, and Gaussian Naive Bayes (GNB).
The system architecture incorporates a feature extraction and selection phase, through which the most relevant variables are identified from the reports generated by the sandbox (Cuckoo). This process optimizes model performance, improves accuracy, reduces overfitting, and lowers computational costs.
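The transformation of sandbox behavior into sparse vectors can be sketched as follows. The report format (a plain list of API call names per sample) and the count-based encoding are assumptions made for illustration; the actual features extracted from the Cuckoo reports in [66] may differ.

```python
from collections import Counter


def build_vocabulary(reports):
    """Assign a stable integer index to every API call seen in the reports."""
    vocab = {}
    for calls in reports:
        for call in calls:
            vocab.setdefault(call, len(vocab))
    return vocab


def to_sparse_vector(calls, vocab):
    """Encode one report as {feature_index: count}, storing only non-zero entries."""
    counts = Counter(call for call in calls if call in vocab)
    return {vocab[call]: n for call, n in counts.items()}


# Hypothetical traces: one list of observed API calls per executed sample.
reports = [
    ["CreateFileW", "WriteFile", "WriteFile"],
    ["RegSetValueExW", "CreateFileW"],
]

vocab = build_vocabulary(reports)
vectors = [to_sparse_vector(calls, vocab) for calls in reports]
print(vocab)
print(vectors)
```

Storing only the non-zero entries keeps memory use proportional to the calls actually observed, which matters when the vocabulary of possible behaviors is large.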
Experimental results indicate that the RF, SGD, Extra Trees, and GNB classifiers achieved perfect performance, with 100% precision, accuracy, sensitivity, and F1 score. According to the authors, the classification strategy based on dynamic behavior, enhanced by machine learning techniques, allows for the accurate and rapid identification and categorization of malware into evolutionary families. These results come from the study by Akhtar and Feng [66], in which multiple classifiers were evaluated on data generated with Cuckoo Sandbox. Although perfect metrics are reported, the original article does not specify the dataset size or the validation method used. The results should therefore be interpreted with caution, given possible limitations related to overfitting and generalization.

3.6.7. Federated Learning for Anomaly Detection in IoT Networks

The article titled “Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks” by Xabier Sáez-de-Cámara et al., published in Computers & Security, addresses the issue of cyberattacks on Internet of Things (IoT) devices and proposes a novel architecture for intrusion detection based on federated learning (FL). This approach seeks to mitigate the limitations of traditional intrusion detection methods and model training architectures in the cloud or at the edge.
The growing adoption of IoT devices brings multiple benefits, such as increased productivity, automation, cost reduction, and minimized production errors. However, it also increases the risk of cybersecurity breaches due to high connectivity and an expanded attack surface. Traditional intrusion detection and prevention methods, based on signatures, face challenges in integrating into the IoT environment due to hardware and software diversity, as well as the rapid evolution of threats.
Machine learning (ML) has shown promising results in developing intrusion detection systems (IDS). However, cloud-based training architectures present significant issues in IoT environments, such as high bandwidth consumption, network resource congestion, and privacy concerns due to centralized data. As an alternative, edge computing approaches have been proposed, but they also face data isolation issues.
FL is an approach that allows training a global model using distributed data across multiple devices without centralizing the data. Each device trains a local model and sends updates to a central server, which aggregates these updates to train the global model. This reduces network overload and addresses privacy concerns by keeping data locally.
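The aggregation step can be sketched as a weighted average of the local model parameters. Weighting each client by its number of training samples is the common federated averaging rule and is an assumption here, since the text above only states that the server aggregates the updates; the function name and the toy values are illustrative.

```python
def federated_average(client_updates):
    """Aggregate local models into a global model.

    client_updates is a list of (weights, num_samples) pairs, where weights
    is a flat list of floats. Clients with more data contribute more.
    """
    total_samples = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    global_weights = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            global_weights[i] += w * n / total_samples
    return global_weights


# Two hypothetical devices; only parameters leave the device, never raw traffic.
updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
]
global_model = federated_average(updates)
print(global_model)
```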
The proposed architecture in this study integrates an unsupervised clustering algorithm in the FL pipeline to address heterogeneity issues in IoT networks. The methodology includes the following steps:
(1). Environment Setup and Data Capture: A testbed is used, which includes several emulated IoT/IIoT devices and attackers interacting in a complex network topology.
(2). Training of Local Models: Each device trains a local model using benign network traffic data.
(3). Clustering of Local Models: Before model aggregation in each FL round, partially trained local models are clustered based on model parameter similarities.
(4). Clustered Federated Learning Process: For each identified cluster, an independent FL process is initiated.
The architecture was evaluated using a testbed that included various IoT/IIoT devices and threat actors performing real-world attacks, such as the complete lifecycle of the Mirai malware. The experimental results show that the proposed architecture can effectively detect network anomalies [30].
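The clustering step (3) and the per-cluster aggregation in step (4) can be illustrated with a small sketch. The greedy, distance-threshold clustering used here is a stand-in chosen for brevity, not the unsupervised algorithm used by the authors, and the device names, parameter vectors, and threshold are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two flat parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_models(models, max_distance):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []  # each cluster is a list of device ids
    for device_id, params in models.items():
        for cluster in clusters:
            representative = models[cluster[0]]
            if distance(params, representative) <= max_distance:
                cluster.append(device_id)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([device_id])
    return clusters


def average(models, device_ids):
    """Unweighted average of the parameters of the given devices."""
    dim = len(models[device_ids[0]])
    return [sum(models[d][i] for d in device_ids) / len(device_ids) for i in range(dim)]


# Hypothetical partially trained local models from two kinds of devices.
local_models = {
    "camera-1": [0.10, 0.90],
    "camera-2": [0.12, 0.88],
    "plc-1": [0.80, 0.20],
    "plc-2": [0.78, 0.22],
}

clusters = cluster_models(local_models, max_distance=0.1)
cluster_averages = [average(local_models, c) for c in clusters]
print(clusters)
print(cluster_averages)
```

Devices with similar traffic profiles end up in the same cluster, so each aggregated model only has to represent one kind of normal behavior.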

3.6.8. Image-Based Malware Classification Using EfficientNet

The article titled “Image-based malware representation approach with EfficientNet convolutional neural networks for effective malware classification” by Rajasekhar Chaganti, Vinayakumar Ravi, and Tuan D. Pham, published in the Journal of Information Security and Applications, addresses malware classification using the EfficientNetB1 neural network and byte-level image representation techniques. Malware attacks have evolved significantly, becoming more sophisticated and evasive, emphasizing the need to develop advanced methods for their detection and classification.
To mitigate the computational resources required for training and testing models based on convolutional neural networks (CNN), the authors evaluated the efficiency and performance of various pre-trained CNN models to select the best architecture for malware classification. In this context, pre-trained models were evaluated against different malware image representation methods, distinguished by image width size. The proposed EfficientNetB1 model achieved 99% accuracy in classifying malware classes from the Microsoft Malware Classification Challenge (MMCC) using fixed-width malware image representation, while requiring fewer network parameters than other pre-trained models.
The proposed framework uses the EfficientNetB1 network to classify malware represented as images. Malware image representation converts malware files into RGB images, allowing the use of computer vision techniques for classification. Three different methods of malware image representation were evaluated to compare the performance of various CNN-based pre-trained models.
EfficientNetB1 is part of the EfficientNet family, designed using compound scaling to adjust the model’s depth, width, and resolution jointly and more efficiently. EfficientNetB1, in particular, offers a favorable balance between accuracy and computational efficiency.
Malware files are converted into RGB images, where each byte of the file is translated into a pixel value. This approach leverages the capabilities of pre-trained CNNs in computer vision tasks to classify malware. The three malware image representation methods evaluated are:
(1). Fixed-Width Images: Images are generated with a fixed width while maintaining the malware file’s proportion.
(2). Scaled Images: Images are rescaled to a common scale, allowing for uniform comparison.
(3). Segmented Images: Malware files are segmented into smaller parts to generate multiple images.
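A minimal sketch of the fixed-width representation is given below. It assumes that three consecutive bytes form one (R, G, B) pixel and that incomplete pixels and rows are padded with zeros; both choices are illustrative assumptions, and the exact mapping used by Chaganti et al. may differ.

```python
def bytes_to_rgb_rows(data, width):
    """Map a byte sequence to rows of (R, G, B) pixels with a fixed width."""
    # Pad with zero bytes so the length is a multiple of 3 (one pixel = 3 bytes).
    padded = data + bytes((-len(data)) % 3)
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    # Pad the last row with black pixels so every row has the same width.
    pixels.extend([(0, 0, 0)] * ((-len(pixels)) % width))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Stand-in for the raw bytes of a binary file.
sample = bytes(range(10))
image = bytes_to_rgb_rows(sample, width=2)
for row in image:
    print(row)
```

With a fixed width, the image height grows with the file size, so larger files yield taller images before any resizing required by the network input layer.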
The EfficientNetB1 model was evaluated on the MMCC dataset, which contains 14,226 RGB images representing 25 malware classes and one benign class. The results showed that EfficientNetB1 outperformed other models in terms of accuracy and computational efficiency, achieving 99% accuracy in malware classification. Moreover, EfficientNet models required fewer network parameters compared to other pre-trained CNN models, making the classification process more efficient.
The study demonstrates that combining malware image representation with EfficientNet networks is highly effective for malware classification. This approach not only improves detection accuracy but also significantly reduces the computational resources needed. The results validate that EfficientNetB1, with its optimized architecture, is a robust and efficient solution for malware classification using image representations [67].

3.6.9. Hybrid Approach to Mitigate Adversarial Evasion Attacks by Ransomware

The article titled “Mitigating adversarial evasion attacks of ransomware using ensemble learning” by Usman Ahmed, Jerry Chun-Wei Lin, and Gautam Srivastava, published in Computers and Electrical Engineering, addresses ransomware detection on Android devices using ensemble machine learning techniques to mitigate adversarial evasion attacks. Ransomware remains a significant cybersecurity threat, extorting money from users by locking their devices and personal data, demanding ransom payment to restore access. Ransomware primarily targets Windows devices but also targets other operating systems, including Android. By the end of 2018, Android held over 86.8% of the total mobile phone market share, making it the most widely used mobile operating system. This makes Android a prime target for ransomware attackers, who often distribute malware through compromised applications available on third-party app stores.
The proposed methodology combines static and dynamic analysis of Android applications to detect ransomware. Static analysis is based on extracting features such as permissions, text, and network-based features (IP addresses, email addresses, and URLs) from APK files. Dynamic analysis, on the other hand, examines application behavior during execution, monitoring CPU usage, memory usage, and system call logs.
Static analysis involves decompiling the APK file to extract the aforementioned features. Apktool is used to unpack the APK files into their individual resources. Permissions are extracted from the AndroidManifest.xml file, while text and network-based features are extracted from .dex files. These features are converted into feature vectors and used to train static ensemble machine learning models.
Dynamic analysis is performed by running the applications in an emulated environment to record their runtime behavior. Tools such as Android Debug Bridge (ADB) and Strace are used to monitor CPU usage, memory usage, and system calls. Dynamic feature vectors are used to train dynamic ensemble models.
Two separate ensemble machine learning models are used to classify applications based on their static and dynamic features. Each ensemble includes multiple classification algorithms such as Naïve Bayes, Decision Trees, Random Forests, Support Vector Machines (SVM), Logistic Regression, and AdaBoost. These classifiers are combined using a voting method to assign the final label (ransomware or non-ransomware) to each application. An empirical evaluation was performed using 10-fold cross-validation to assess the effectiveness of the proposed model. The dataset included Android applications containing both ransomware and non-ransomware. The results showed that the proposed ensemble model achieved high accuracy in detecting and classifying ransomware, even in the presence of adversarial evasion attacks.
The study demonstrates that combining static and dynamic features with ensemble machine learning techniques is effective for detecting ransomware on Android. This approach not only improves detection accuracy but is also resilient to adversarial evasion attacks, providing a robust solution for mobile device cybersecurity [68].
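The voting scheme used to combine the classifiers can be sketched as a simple majority vote. The three rule-based stand-ins below replace the trained classifiers of the original study, and the feature names (permissions, url_count, text) are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, features):
    """Apply every classifier to the features and combine the labels by voting."""
    return majority_vote([clf(features) for clf in classifiers])


# Hypothetical stand-ins for trained static-analysis classifiers.
static_classifiers = [
    lambda f: "ransomware" if "DEVICE_ADMIN" in f["permissions"] else "benign",
    lambda f: "ransomware" if f["url_count"] > 5 else "benign",
    lambda f: "ransomware" if "encrypt" in f["text"] else "benign",
]

app = {
    "permissions": {"INTERNET", "DEVICE_ADMIN"},
    "url_count": 2,
    "text": "your files are encrypted",
}

label = ensemble_predict(static_classifiers, app)
print(label)
```

An odd number of classifiers avoids ties in the binary case. An attacker who evades one classifier still has to evade a majority of them, which is the intuition behind the resilience reported above.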

3.6.10. Deep Learning and SVM-Based Malware Detection Technique

The article titled “A Novel Deep Learning-Based Approach for Malware Detection”, by Kamran Shaukat, Suhuai Luo, and Vijay Varadharajan, highlights that malware detection is a critical task in cybersecurity, especially given the constant evolution of threats and evasion techniques. Traditional static and dynamic analysis methods have significant limitations: static analysis is fast but ineffective against obfuscated malware variants, while dynamic analysis is more robust but costly in terms of time and resources. In this context, the study by Shaukat et al. proposes a novel approach based on deep learning that combines the strengths of both types of analysis.
The method begins by viewing portable executable (PE) files as color images, transforming their binary sequences into RGB representations. This conversion allows pre-trained deep learning models to be used to extract relevant features without requiring manual attribute engineering or expert domain knowledge.
Once the images are obtained, deep features are extracted using 15 fine-tuned deep neural network models, originally trained on large image datasets. These features are extracted from the last fully connected layer of each network and used as input to a support vector machine (SVM)-based classifier, which performs binary detection between malware and benign software.
The effectiveness of the proposed approach was validated using three benchmark datasets: Malimg, VirusShare, and the Microsoft Malware Dataset. Specifically, an accuracy of 99.06% was achieved on the Malimg dataset, significantly outperforming conventional and other machine learning-based approaches. To mitigate class imbalance, data augmentation techniques were applied, further improving the classifier’s overall performance.
Performance was evaluated using standard metrics such as accuracy, sensitivity, true negative rate, and false positive rate. The experimental results show an average improvement of 16.56% over previous solutions, with statistical significance according to the Wilcoxon test.
In summary, the study by Shaukat et al. presents a scalable, cost-effective, and highly effective solution for malware detection, integrating deep learning and machine learning models. This architecture not only optimizes predictive accuracy but also improves the interpretability of model decisions, a key aspect for security analysts [36].

3.6.11. Network Anomaly Detection Technique Using a Distributed Big Data System-Based Enhanced Stacking of Binary Classifiers

The article “An Ameliorated Multiattack Network Anomaly Detection in Distributed Big Data System-Based Enhanced Stacking Multiple Binary Classifiers”, by AlHabshy et al., addresses the problem of network anomaly detection using a distributed system based on big data, in response to the increasing volume of traffic generated by Internet of Things (IoT) devices and the associated cybersecurity threats [52].
The proposed architecture is structured in three layers: collection and storage, processing, and application. The first manages data acquisition through distributed systems; the second employs processing frameworks to extract value from the data; and the third implements cloud-based interactive solutions to facilitate maintenance and improve performance.
The central technique of the study is the Enhanced Stacking of Binary Classifiers (EMBAM), an intrusion detection system (IDS) that combines multiple binary classifiers using stacking techniques. The base classifier is a decision tree (DT), optimized with grid search, accompanied by K-Nearest Neighbor (KNN) and Support Vector Machines (SVM) algorithms, allowing an efficient approach to binary classification problems with both linear and nonlinear generalization capabilities.
The advantages of EMBAM include: (1) a high detection rate and accuracy, derived from the combination of multiple classifiers that compensate for individual limitations; (2) a significant reduction in false positives thanks to cross-validation optimization and hyperparameter tuning; and (3) scalability, since the system allows the incorporation of new binary classifiers without requiring global retraining.
The model was evaluated using the UNSW-NB15 and CICIDS2017 datasets, demonstrating substantial improvements over existing approaches in key metrics such as accuracy, detection rate, precision, specificity, false alarm rate, and F1 score.
A related review [51] provides a comprehensive overview of machine learning techniques for malware detection across PCs, mobile devices, IoT systems, and cloud environments. It analyzes feature extraction methods, the algorithms applied, and their effectiveness, highlighting key challenges such as resource constraints, scalability, and the lack of generalizable models. Finally, it proposes future research directions aimed at developing cross-domain detection systems and integrating heterogeneous data sources.

3.6.12. Deep Dive into Early Ransomware Detection Technique Based on Machine Learning and Event Analysis

The article titled “A Survey of Crypto Ransomware Attack Detection Methodologies: An Evolving Outlook”, by Abdullah Alqahtani and Frederick T. Sheldon, addresses ransomware detection, a cybersecurity priority due to the rise of sophisticated attacks that employ evasion techniques. This type of malware encrypts the victim’s files and demands a ransom for their recovery. Alqahtani and Sheldon’s work presents a comprehensive review of crypto-ransomware detection methodologies, highlighting the importance of early detection and the use of machine learning techniques.
Ransomware is classified into two types: encrypting ransomware, which renders files unusable through encryption, and locking ransomware, which restricts access to the system. Preventive strategies seek to prevent the attack from being executed, while detection techniques aim to identify the threat before it causes significant damage.
Data-driven techniques employ metrics such as file entropy and the use of decoy files to detect anomalous behavior. On the other hand, process-based approaches observe suspicious activities such as key generation or the use of cryptographic APIs, although they have limitations due to the variability of ransomware families and network traffic concealment methods.
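File entropy, one of the data-driven indicators mentioned above, is typically computed as the Shannon entropy of the byte distribution, in bits per byte. Encrypted content approaches the maximum of 8, whereas plain text is considerably lower. The sketch below is a generic illustration and is not taken from any of the surveyed detectors.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte sequence, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1 / p) over the observed byte values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


low = shannon_entropy(b"aaaaaaaa")          # a single repeated byte
high = shannon_entropy(bytes(range(256)))   # every byte value exactly once
print(low, high)
```

A detector built on this measure would compare the entropy of a file before and after a write operation and flag a sharp increase, with the threshold being a tunable parameter.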
Machine learning has been widely adopted for crypto-ransomware detection, using individual classifiers (SVM, logistic regression, decision trees, deep neural networks) and ensemble models that combine multiple algorithms to improve accuracy and reduce false positives.
Early detection is critical to mitigate damage before the encryption process begins. To achieve this, feature extraction and selection techniques are employed that transform runtime data into representations of malicious behavior, optimizing model performance. The study by Alqahtani and Sheldon provides a detailed analysis of current techniques for crypto-ransomware detection, emphasizing the role of machine learning as an effective tool to anticipate and mitigate attacks with greater accuracy and a lower false positive rate [69].

3.6.13. Evaluation of Transfer Learning Techniques for Malware Detection

The study “Effectiveness Analysis of Transfer Learning for the Concept Drift Problem in Malware Detection”, published in Expert Systems with Applications, examines the impact of concept drift on malware classification and proposes the use of transfer learning (TL) techniques as an effective solution to address this challenge [70]. Concept drift refers to changes in data distribution that degrade the performance of machine learning models over time, particularly affecting the detection of emerging malware.
The approach is based on the use of homogeneous TL to leverage old data (source domain) and improve classification accuracy in new contexts (target domain). Five TL algorithms were implemented and evaluated on six data combinations from different time periods (2015, 2017, 2019, and 2020). Features were extracted through dynamic analysis using the Cuckoo Sandbox tool, incorporating API calls, signatures, and network attributes.
Four machine learning models (Random Forest, K-Nearest Neighbors, Extreme Gradient Boosting, and Multi-Layer Perceptron) were also applied to analyze their behavior when integrated with TL. Experimental results demonstrated that techniques such as TrAda, CORAL, and DAE significantly improved malware detection in the presence of concept drift, achieving Matthews correlation coefficients above 0.9.
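The Matthews correlation coefficient (MCC) reported above summarizes all four confusion counts in a single value between -1 and 1, which makes it more informative than accuracy on imbalanced data. The following sketch shows the standard binary formula; the example counts are hypothetical and are not results from [70].

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventionally defined as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts chosen so that the result is easy to verify by hand.
mcc = matthews_corrcoef(tp=95, tn=95, fp=5, fn=5)
print(mcc)
```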
The study concludes that TL is an effective strategy for addressing changes in data distribution, allowing the reuse of previous information and maintaining high levels of accuracy in dynamic and unstable environments. This contribution validates the use of TL as a robust mechanism to strengthen malware classification in real-world scenarios characterized by imbalanced data and constantly evolving threats.
Furthermore, complementary research confirms the effectiveness of machine learning in various cybersecurity areas: phishing detection using CNN-LSTM models (99.1% accuracy) [71], malware detection with LSTM and CNN models (98.5% and 97.2%) [72], network anomaly detection with RF and SVM (95.4% and 94.1%) [73], IoT attack detection with RF and DT (96.2% and 94.7%) [74], and ransomware detection with RF, DT, and KNN (97.8%, 96.5%, and 95.2%) [75]. These studies consolidate machine learning and transfer techniques as key approaches to addressing the evolution of threats in contemporary cybersecurity.
To contextualize the current landscape of malware detection, Table 3 presents a comparative overview of the most representative studies included in this review. It highlights the key techniques, datasets, and performance outcomes, offering a concise reference for understanding the strengths and focus areas of recent approaches.
Table 3. Summary of Malware Detection Techniques and Studies.
| Title | Technique Used | Machine Learning | Benefits |
| --- | --- | --- | --- |
| Image-based malware representation approach with EfficientNet convolutional neural networks for effective malware classification [67] | EfficientNetB1 with image representation | Yes | High precision (99%) in malware classification, computational efficiency |
| Mitigating adversarial evasion attacks of ransomware using ensemble learning [68] | Ensemble learning (Naïve Bayes, Decision Trees, Random Forests, SVM, Logistic Regression, AdaBoost) | Yes | High precision in ransomware detection, resistance to adversarial evasion attacks |
| Effectiveness Analysis of Transfer Learning for the Concept Drift Problem in Malware Detection [70] | Transfer learning (TrAda, CORAL, DAE) | Yes | Improved precision in the presence of concept drift, efficient use of old data |
| Fileless malware threats: Recent advances, analysis approach through memory forensics and research challenges [65] | Memory forensics | No | Detection and analysis of fileless malware, identification of malicious processes in memory |
| Evaluation of Machine Learning Algorithms for Malware Detection [66] | Naive Bayes, SVM, J48, Random Forest, Decision Tree, CNN | Yes | High precision (DT 99%), low false positive rate |
| Phishing URLs detection using sequential and parallel ml techniques: Comparative analysis [76] | Random Forest, Naive Bayes, CNN, LSTM | Yes | High precision in phishing URL detection (CNN-LSTM 99.1%) |
| Distributed deep neural network-based middleware for cyber-attacks detection in smart IoT ecosystem: A novel framework and performance evaluation approach [77] | Distributed deep neural network (DNN) | Yes | High precision in IoT cyber-attack detection |
| An ameliorated multiattack network anomaly detection in distributed big data system-based enhanced stacking multiple binary classifiers [52] | Binary classifier stacking ensemble | Yes | High precision in network anomaly detection |
| Functionality-preserving adversarial machine learning for robust classification in cybersecurity and intrusion detection domains: A survey [47] | Adversarial machine learning (AML) | Yes | Improved robustness of classification, functionality preservation |
| Phishing URLs detection using machine learning techniques [78] | K-Nearest Neighbors, Decision Trees, SVM, Random Forest | Yes | High precision in phishing URL detection (RF 97.6%) |
| Detection of phishing attacks using sequential and parallel machine learning techniques [71] | Random Forest, Naive Bayes, CNN, LSTM | Yes | High precision in phishing URL detection (CNN-LSTM 99.1%) |
| Deep learning approaches for malware detection [72] | LSTM, CNN | Yes | High precision in malware detection (LSTM 98.5%) |
| Anomaly detection in networks using machine learning [73] | Random Forest, SVM, Naive Bayes | Yes | High precision in network anomaly detection (RF 95.4%) |
| Detection of cyber-attacks in IoT environments using machine learning techniques [74] | Decision Tree, Random Forest, SVM | Yes | High precision in IoT cyber-attack detection (RF 96.2%) |
| Detection of ransomware using machine learning techniques [75] | K-Nearest Neighbors, Decision Tree, Random Forest | Yes | High precision in ransomware detection (RF 97.8%) |
Based on the analysis of the selected studies, several implementation-oriented frameworks emerge as applicable models for real-world cybersecurity operations. Among them, federated learning architectures stand out for enabling decentralized anomaly detection across IoT networks, ensuring data privacy through local model training without centralized data aggregation. In parallel, computer vision-based techniques convert binary malware files into visual representations, allowing the use of pretrained convolutional neural networks for accurate and scalable classification. Active learning frameworks offer practical advantages by significantly reducing the labeling effort, thus supporting rapid model adaptation in evolving threat environments. Furthermore, hybrid architectures that integrate deep learning models with support vector machines (SVM) demonstrate enhanced generalization capacity, making them suitable for deployment in dynamic and resource-constrained infrastructures. Lastly, memory-forensics-based approaches provide a critical framework for the detection of fileless malware by analyzing volatile memory, offering high value in forensics and incident response settings. Collectively, these frameworks address key operational challenges such as scalability, interpretability, efficiency, and adversarial robustness, and provide a foundation for implementing robust, privacy-preserving, and adaptive malware detection systems in diverse cybersecurity domains.
Figure 6 illustrates the distribution of machine learning and deep learning algorithms most frequently employed in recent malware detection studies. The data reveal a predominant use of models such as Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Random Forest (RF), underscoring their effectiveness in handling complex data representations. This visualization provides a clear overview of methodological preferences, facilitating the identification of prevailing trends in algorithm selection, particularly in challenging environments such as Internet of Things (IoT) systems, where scalability, accuracy, and resilience are critical factors.

3.7. Limitations of Current Approaches and Opportunities

Despite significant progress in malware detection and classification techniques, current approaches have important limitations that must be addressed to strengthen their applicability in real-world, dynamic scenarios. First, many models rely on a small set of widely used datasets that do not always reflect the diversity and complexity of threats present in production environments [44]. This limits their generalization capability, especially for malware variants that employ evasion techniques such as polymorphism or metamorphism.
Furthermore, while models based on deep learning and hybrid techniques have demonstrated high levels of accuracy, their practical implementation faces barriers related to computational cost, model explainability, and vulnerability to adversarial attacks. In particular, research on adversarial machine learning shows that many current classifiers can be fooled by small perturbations in the inputs, compromising the system’s robustness [45].
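To make the effect of small input perturbations concrete, the following is a minimal sketch of a gradient-sign (FGSM-style) evasion against a logistic classifier. The weights, bias, feature vector, and the names `predict_proba` and `fgsm_evasion` are hypothetical and chosen only for illustration; they are not taken from the reviewed studies.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that feature vector x is malicious under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def sign(value):
    return (value > 0) - (value < 0)


def fgsm_evasion(weights, x, epsilon):
    """Move every feature by at most epsilon in the direction that lowers the score.

    For a logistic model the gradient of the logit with respect to x is the
    weight vector, so stepping against sign(weights) reduces the malicious score.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical toy detector and sample (assumed values, not from any dataset).
weights = [1.5, -2.0, 0.5]
bias = -0.2
sample = [0.6, 0.1, 0.4]

adversarial = fgsm_evasion(weights, sample, epsilon=0.2)
print(predict_proba(weights, bias, sample))       # above 0.5: flagged as malicious
print(predict_proba(weights, bias, adversarial))  # below 0.5: evades the detector
```

In this toy setting a change of at most 0.2 per feature is enough to flip the decision. Real malware features are often discrete or constrained by functionality, so practical attacks need additional constraints that this sketch does not model.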
Another critical aspect is the lack of standardization in evaluation metrics and the limited use of cross-validation, which makes objective comparisons across studies difficult [39]. Furthermore, there is little discussion about the resilience of models to noisy, unbalanced, or heterogeneous data sources, such as IoT devices or distributed industrial environments.
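As an illustration of why consistent metrics and stratified cross-validation matter on imbalanced data, the sketch below builds class-preserving folds and reports accuracy, precision, recall, and F1 for the malware class. It is a simplified, standard-library-only example; `stratified_folds`, `binary_metrics`, and the 90/10 label split are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive (malware) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy imbalanced label set: 90 benign (0) and 10 malware (1) samples.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)

# A trivial detector that always answers "benign" looks accurate but finds no malware.
always_benign = [0] * len(labels)
print(binary_metrics(labels, always_benign))
```

Reporting only accuracy would rate the trivial detector at 0.9 here, while its recall and F1 are zero, which is one reason comparisons across studies need a shared set of metrics.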
Given these limitations, several opportunities for further investigation are identified: the development of standardized and replicable evaluation frameworks to enable consistent benchmarking; the integration of explainable artificial intelligence to enhance transparency and support forensic decision-making; the exploration of lightweight and decentralized architectures to facilitate scalable, real-time detection in distributed environments; and the generation of updated datasets that reflect the complexity of contemporary malware behaviors across diverse platforms. Advancing these lines of research will contribute to the development of more effective, adaptable, and sustainable detection systems in the face of emerging cyber threats.

3.8. Datasets for Malware Detection

In malware detection and analysis, datasets are fundamental for training, evaluating, and validating machine learning models. The variety and quality of these datasets directly impact the effectiveness of models in identifying and classifying malicious software in different contexts and platforms. Table 4 presents a compilation of datasets used in various studies on malware detection and cyber threat analysis. Each dataset is described in terms of its content, characteristics, sources, and a brief description, providing a comprehensive view of the resources available for cybersecurity research.
Table 4. Table of Datasets.
| Dataset | Description | Characteristics |
|---|---|---|
| D1, D2, and D3 [34] | Datasets of benign and malicious files collected during January, February, and March 2017 | Include features extracted from the PE file header, byte histograms, file entropy, and text strings |
| R2-D2 [34] | Contains RGB images translated from DEX files obtained by decompressing approximately 2 million benign and malicious Android apps | Includes data from Trojan, AdWare, Clicker, SMS, Spy, Ransom, Banker, among others. Images are 299 × 299 pixels |
| CIC-InvesAndMal2019 [35] | Contains adware, botnet, premium SMS, ransomware, SMS, and scareware. Uses real devices to install 5000 samples obtained (426 malware and 5065 benign). | 42 distinct families |
| CICMalDroid 2020 [35] | Contains 17,341 Android samples obtained from VirusTotal, Contagio, AMD, and MalDozer. | Includes malware such as Adware, banking, riskware, SMS, and benign |
| Microsoft Malware [36] | Dataset used in the Microsoft malware classification competition in 2015. | Contains malware from 9 different families. |
| Malimg [36] | Contains 9435 real-world malware samples belonging to 25 different families. | Used in visualization-based malware classification tasks. |
| VirusShare [36] | Large collection of real-world malicious executables, with a continuously updated corpus of malicious samples. | 47,132,110 malware samples. |
| Drebin [32] | Composed of a total of 15,036 samples, 5560 of which are malware and 9476 are benign, with 215 distinct features. | 179 different malware families. 53% manifest permissions, 33% API signatures, and the rest are other forms of API call signatures such as intent signatures and commands. |
| Malgenome [32] | Composed of a total of 3799 samples, 1260 of which are malware and 2539 are benign | Contains 215 features derived from 49 different Android malware families. |
| Edge-IIoTset [77] | IoT network traffic dataset with 1,363,998 normal samples and 545,673 attack samples. | Network traffic data converted into structured data. |
| NSL-KDD [66] | Derived version of KDDCup-99 without duplicate records. | Over 1 million records grouped into categories of normal traffic and attack types. |
| CICIDS-2017 [66] | Realistic network traffic data recorded using various tools and protocols. | Approximately 600,000 network traffic records. |
| Bot-IoT [66] | Data collected from various IoT devices with botnet attacks. | 3.5 million records. Includes attacks such as DDoS, DoS, Reconnaissance, and Theft. |
| Drebin-215 [32] | Observations of benign and malicious Android applications. | 15,036 applications with 215 attributes. Includes a binary response column to determine whether the application is benign or malicious. |
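Several entries in Table 4 describe samples through static features such as byte histograms and file entropy. The following is a minimal sketch of how these two features can be computed from raw file bytes; the function names are illustrative, and the exact feature definitions used by each dataset may differ.

```python
import math
from collections import Counter


def byte_histogram(data: bytes):
    """Count how often each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for constant data, 8.0 at most)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


# Constant padding has no entropy; uniformly distributed bytes reach the maximum.
print(shannon_entropy(b"\x00" * 64))
print(shannon_entropy(bytes(range(256))))
```

High entropy is commonly associated with packed or encrypted content, which is why it is often used alongside header and string features rather than on its own.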

4. Discussion

4.1. Summary and Interpretation of Key Findings

This review highlights that machine learning (ML) and deep learning (DL) approaches significantly outperform traditional malware detection methods. Advanced techniques, including deep neural networks and Transformers, have demonstrated superior accuracy in detecting zero-day malware and ransomware, which are notoriously difficult to catch due to their evasion techniques. For example, Ullah et al. [35] and Seneviratne et al. [79] emphasize that visual representations and self-supervised models not only improve accuracy but also enhance interpretability, achieving detection rates near 99%. Similarly, federated learning for anomaly detection in IoT networks, as demonstrated by Sáez-de-Cámara et al. [30], allows for scalable solutions without compromising data privacy.
However, ensemble methods like those proposed by Kumar and Subbiah [44], while effective, are heavily dependent on the quality of the training data, making them less reliable in environments with imbalanced datasets.

4.2. Comparison with Previous Studies

This study stands out for its specialized focus on malware detection and classification using advanced techniques such as machine learning and behavior analysis, addressing a gap in the current literature. Unlike previous studies that approach cybersecurity from broader perspectives or focus on specific subfields, this work offers a comprehensive review of the most novel and effective methodologies for protecting against malware threats, with a detailed technical analysis of the tools used in different environments and platforms.
For instance, Perwej et al. [80] provide an extensive review of cybersecurity threats, including phishing, malware, and IoT security. However, their approach is more general and does not delve into advanced techniques for malware detection. Our SLR, by contrast, focuses specifically on these techniques, evaluating their applicability in current scenarios.
Silvianita et al. [81] conduct a bibliometric analysis of the evolution of cybersecurity publications, highlighting trends in scientific output. While valuable for tracking the field’s evolution, their approach does not address the technical tools or methodologies. This work differs by offering a technical and detailed analysis of malware detection tools.
Lastly, Saeed et al. [82] review sensor-based approaches for threat detection in IoT networks and industrial systems. While relevant in certain contexts, our SLR also includes a variety of techniques and tools, expanding the range of applications for malware detection across different platforms.

4.3. Strengths and Limitations

This review presents several strengths, including a rigorous methodology based on the principles established by the PRISMA statement [24]. This approach ensures that the selection and evaluation of studies were conducted systematically and objectively, providing a comprehensive overview of current malware detection techniques and their applications. Furthermore, the inclusion of recent studies such as those by Sáez-de-Cámara et al. [30] and Seneviratne et al. [79] ensures that the latest trends in the field, such as the use of federated learning and Transformers, have been evaluated. This is particularly relevant since these emerging approaches are designed to better adapt to the dynamic IoT and modern cybersecurity environments.
However, a notable limitation of this review is the dependence on a restricted set of indexed sources, which may have inadvertently excluded relevant studies published in alternative repositories or emerging venues. This constraint could limit the comprehensiveness of the literature surveyed, particularly in rapidly evolving areas of cybersecurity research.
In terms of the results of this review, the superiority of machine learning and deep learning-based approaches over traditional signature-based methods is emphasized. For instance, optimizable decision trees combined with the AdaBoost algorithm, presented by Abu Al-Haija et al. [83], have shown 98.84% accuracy in detecting malware in PDF files. These methods compare favorably with signature-based techniques, which often struggle to adapt to new malware variants. This reflects an advancement in the ability to adapt to evolving threats, a critical factor in today’s cybersecurity landscape.
When comparing approaches, it is evident that machine learning techniques, such as decision trees and boosting algorithms, offer greater accuracy and adaptability in malware detection compared to traditional signature-based methods. Additionally, approaches based on computer vision and convolutional neural networks (CNNs), such as the EfficientNet method presented by Yadav et al. [34], have proven particularly effective in detecting malware on mobile devices and cloud platforms [35]. These platforms present unique challenges, such as the need to scale quickly and detect threats hidden within decentralized networks and heterogeneous devices.

4.4. Comprehensive Analysis of Malware Detection Techniques

The statistical analysis of the reviewed literature underscores the dominance of machine learning and deep learning techniques, which together account for 75% of the studies. These techniques are essential for adapting to new threats and detecting complex behaviors that traditional methods struggle to identify. Table 5 shows the distribution of the most researched techniques in malware detection.
Machine learning approaches, which comprise 45% of the studies, excel in adapting to changing patterns. Techniques like optimizable decision trees, paired with AdaBoost, provide high accuracy (98.84%) in processing large datasets without relying on predefined signatures [83]. Meanwhile, deep learning, particularly CNNs such as EfficientNet, is notable for its ability to handle complex malware threats like zero-day malware, achieving an accuracy of 95.7% on Android devices [34]. Although these methods require significant computational resources, their precision in detecting sophisticated threats is unparalleled.
Computer vision, featured in 15% of studies, represents an emerging approach that transforms malware binaries into images for analysis. This offers a novel way to detect malware that might evade signature-based detection [79]. While still in development, these techniques show promise in identifying hidden threats in seemingly benign data.
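The transformation of a binary into an image can be sketched as a simple byte-to-pixel mapping. The example below assumes one byte per grayscale pixel and a fixed row width with zero padding; `bytes_to_grayscale` and the chosen width are illustrative, and the reviewed studies may use other layouts (for example RGB encodings or resized images).

```python
def bytes_to_grayscale(data: bytes, width: int = 16):
    """Map each byte to one pixel (0-255) and wrap rows at `width`.

    The final row is zero-padded so that every row has the same length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[start:start + width]) for start in range(0, len(padded), width)]


# Five bytes wrapped at width 2 give three rows, the last one padded with 0.
image = bytes_to_grayscale(bytes([1, 2, 3, 4, 5]), width=2)
print(image)  # [[1, 2], [3, 4], [5, 0]]
```

The resulting matrix can then be resized and passed to an image classifier; that step depends on the chosen framework and is not shown here.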
Finally, traditional static and dynamic analysis, though less prevalent (10% each), complement modern techniques by providing insights into malware’s internal structure and behavior. Static analysis identifies known signatures, while dynamic analysis detects real-time activities, offering a holistic view of malware behavior [83].

4.5. Implications for Practice

The results of this analysis have profound implications for cybersecurity practice. Machine learning and deep learning techniques not only offer a higher rate of accuracy but also provide a foundation for creating scalable and adaptive systems. Integrating these techniques into security solutions can help organizations enhance their detection capabilities and respond quickly to new malware threats that employ advanced evasion tactics.
Computer vision methods and dynamic analysis also offer tools to capture and analyze malware in real time, improving incident response and reducing detection and mitigation time. These tools are especially useful in scenarios where malware is difficult to detect using traditional techniques, such as in mobile or cloud environments.

4.6. Implications for Future Research

Malware detection is a constantly evolving field where threats advance at a rapid pace, necessitating continuous improvements in defense strategies. While current machine learning, deep learning, static, and dynamic analysis techniques have proven effective, significant challenges remain in protecting systems more comprehensively and proactively. Future research should focus on several key areas to address emerging cyber threats more effectively.

4.6.1. Development of Hybrid Detection Techniques

One of the most promising areas for future research is the development of hybrid techniques that combine multiple malware detection approaches, leveraging the strengths of each. Currently, research tends to focus on individual techniques (machine learning, deep learning, static, or dynamic analysis), but a more holistic combination of these techniques could provide greater adaptability to new threats. The integration of static and dynamic analysis with machine learning techniques would allow capturing both the static characteristics of malware code and its dynamic behavior in real-time, offering a more robust and comprehensive approach to detecting evasive malware.
For instance, static analysis techniques are fast and efficient at detecting known malware signatures but lack the ability to detect new variants or malware that employs obfuscation techniques. In this sense, dynamic analysis can complement static analysis by executing the malware in controlled environments to observe its real behavior, capturing malicious changes during runtime that might otherwise go unnoticed. The combination of these methods with machine learning and deep learning models, which learn more complex and hidden patterns, could dramatically improve the ability to detect advanced threats like zero-day malware or advanced persistent threats (APT).
Hybrid techniques could also include the incorporation of deep neural network models alongside more traditional approaches like computer vision and behavioral analysis to create systems capable of identifying and preventing malware in a wide range of environments. These approaches enable security systems not only to detect known attacks but also to anticipate malicious behavior patterns not yet seen in the field.

4.6.2. Improving the Interpretability of Deep Learning Models

One of the main barriers to the adoption of deep learning techniques in malware detection is the lack of interpretability of these models. While deep neural networks are extremely accurate, they are often considered “black boxes”, making it difficult for security analysts to understand the reasons behind a malware classification. Future research should focus on improving the transparency of these models, using model explanation techniques such as Shapley values or explainable neural networks to provide cybersecurity professionals with a better understanding of how and why certain files are classified as malicious.
This need for interpretability is especially critical in environments where trust in model results is essential, such as in the cybersecurity domain. A greater focus on model explainability would help increase confidence in automated malware detection systems, allowing human analysts to make better-informed decisions on how to respond to alerts generated by these systems.
Because limited interpretability hinders the adoption of deep learning models in cybersecurity, where decision transparency is essential, Explainable Artificial Intelligence (XAI) aims to let analysts understand and trust automated predictions. A comprehensive analysis of the main XAI techniques, including methods such as Shapley values and inherently interpretable models, has been conducted in [84]. These strategies are relevant in critical contexts such as cybersecurity because they provide mechanisms for understanding and justifying the automated decisions of detection systems. Incorporating explainability into deep learning-based malware detection is essential for ensuring trust and operational reliability, since XAI techniques help bridge the gap between model accuracy and interpretability.
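To illustrate what a Shapley-value explanation provides, the sketch below computes exact Shapley values for a small scoring function by averaging each feature's marginal contribution over all feature orderings. The scorer `toy_score`, its three features, and the all-zero baseline are hypothetical; exact enumeration grows factorially with the number of features, so practical tools rely on approximations.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            # Switch this feature from its baseline value to the observed value.
            current[feature] = instance[feature]
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in contributions]


def toy_score(x):
    """Hypothetical malware score over [high_entropy, suspicious_api, is_signed]."""
    return 0.5 * x[0] + 0.3 * x[1] - 0.4 * x[2] + 0.2 * x[0] * x[1]


instance = [1, 1, 0]   # sample being explained
baseline = [0, 0, 0]   # reference input with all indicators off

attributions = shapley_values(toy_score, instance, baseline)
print(attributions)
```

The attributions sum to the difference between the score of the sample and the score of the baseline, and the interaction term is shared equally between the two features that produce it, which is the kind of per-feature justification an analyst can inspect.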

4.6.3. Applications of Federated Learning Techniques

Another promising direction for future research is the use of federated learning for malware detection in distributed networks and IoT environments. Federated learning allows machine learning models to be trained on data that remains localized on devices, without needing to transfer large volumes of data to a central server. This approach is particularly useful for cybersecurity applications in IoT devices, where sensitive data, such as malware signatures and network behavior, may need to remain on the device for privacy or security reasons.
Federated learning not only enhances data privacy, but also enables the development of more robust models that can learn from distributed data across different environments. Instead of relying on a single centralized data source, models can learn malware patterns from data distributed across a network, allowing them to better adapt to specific threats faced by different devices in the network. Future research should explore how to integrate federated learning into malware detection systems, particularly in critical infrastructures and decentralized networks like IoT networks.
Beyond privacy preservation, the decentralized nature of federated learning (FL) offers bandwidth efficiency and scalability across heterogeneous devices. A representative study in this domain proposes a clustered FL architecture tailored for large-scale, heterogeneous IoT environments [30]. The approach integrates unsupervised clustering into the FL pipeline, allowing model aggregation within device subgroups that share similar characteristics. This architecture enhances anomaly detection performance and addresses key challenges in IoT security, such as model heterogeneity, data isolation, and network constraints.
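The basic training loop behind federated learning can be sketched with federated averaging: each client updates a copy of the global model on its own data, and only the resulting parameters are aggregated. The example below is a simplified illustration with a one-dimensional linear model and made-up client data; it does not reproduce the clustered architecture of [30], which would aggregate within device subgroups rather than across all clients.

```python
def local_update(weights, data, learning_rate=0.1):
    """One pass of stochastic gradient descent for y = w * x + b on local data."""
    w, b = weights
    for x, y in data:
        error = (w * x + b) - y
        w -= learning_rate * error * x
        b -= learning_rate * error
    return [w, b]


def federated_average(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def federated_round(global_weights, clients):
    """Clients train locally; only their parameters reach the aggregator."""
    updates = [local_update(list(global_weights), data) for data in clients]
    sizes = [len(data) for data in clients]
    return federated_average(updates, sizes)


# Hypothetical clients holding private samples generated from y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (0.5, 2.0), (1.5, 4.0)],
]

global_weights = [0.0, 0.0]
for _ in range(300):
    global_weights = federated_round(global_weights, clients)
print(global_weights)  # should move toward w = 2, b = 1 for this noiseless toy data
```

Raw samples never leave the client lists in this sketch; a deployed system would additionally need secure communication and protection against poisoned updates, which are outside the scope of the example.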

4.6.4. Collaboration Between Academia and Industry

The growing complexity of cyber threats requires closer collaboration between academia and industry. While academia focuses on developing new techniques and advanced algorithms, industry has access to real-time data on the current threats organizations face. Stronger collaboration between both sectors could accelerate the development of security solutions that are more advanced and better adapted to real-world needs.
Moreover, it is essential for researchers to have access to more representative and diverse datasets to train their models. The lack of availability of large volumes of labeled malware data has been a constant challenge for academic researchers. Therefore, the creation of collaborative repositories that enable the exchange of anonymous malware data between companies, academic institutions, and government organizations would be an important step in improving the effectiveness of detection systems.

4.6.5. Improving Defense Against Emerging Threats

As cyber threats continue to evolve, future research must also focus on developing proactive defense strategies capable of anticipating attacks before they occur. Current detection techniques, while effective, are primarily reactive, meaning they detect malware once it has already infiltrated the system. However, advances in predictive detection could help identify suspicious behavior even before malware manifests as an active threat.
In this context, the use of predictive artificial intelligence, combined with models based on historical data and known attack patterns, could allow security systems to predict potential attacks and strengthen vulnerable points before an attacker can exploit them. Research in this area could include the use of predictive algorithms and real-time data analysis to detect early signs of malicious activity.

4.6.6. Advances in Incident Response Automation

Finally, a key direction for future research is the development of automated incident response techniques. Although malware detection has advanced considerably, responding to threats still largely relies on human intervention. The development of algorithms that enable automated and real-time responses to cyber threats, such as isolating compromised systems or automatically removing malware, could significantly improve organizations’ ability to mitigate the damage caused by attacks.
The integration of artificial intelligence techniques with automated response systems can provide more agile and adaptive defenses against threats, reducing the time between malware detection and the implementation of effective countermeasures. However, it is important that these automated responses are secure and do not introduce additional vulnerabilities, which will require careful development and thorough validation of these techniques in real-world environments.
To facilitate a comprehensive understanding of the findings discussed, Figure 7 presents an integrative overview diagram that synthesizes the main thematic axes and methodological trends identified in the selected studies. This visual representation highlights the relationships between technological approaches, analytical techniques, limitations, and emerging opportunities, providing a structured summary of the current landscape in malware detection and classification.

5. Conclusions

This work has systematically reviewed the most advanced techniques in malware detection, highlighting the fundamental role that machine learning, deep learning, static analysis, dynamic analysis, and computer vision technologies play in cybersecurity. Through this review, it has been demonstrated that machine learning techniques, which account for 45% of the reviewed studies, offer greater adaptability and generalization capabilities, enabling malware detection based on complex patterns and intrinsic file characteristics [44]. Similarly, deep learning techniques, which constitute 30% of the studies, have proven to be extremely effective in detecting advanced threats [34], though with the inherent challenge of their lack of interpretability and high computational costs [35].
Additionally, approaches based on computer vision, while representing a smaller proportion of the reviewed studies (15%), offer a new perspective on malware detection through the visual representation of binaries [79]. Traditional methods, such as static analysis and dynamic analysis, remain useful as complements to more advanced techniques, providing a more comprehensive view of malware behavior both from a structural and real-time behavior perspective.
Despite the success of current techniques, this SLR has also identified several areas that require further attention in future research. In particular, the development of hybrid techniques that combine the best of static, dynamic, and machine learning approaches holds the potential to significantly improve malware detection capabilities, especially for advanced and polymorphic threats [83]. Additionally, the interpretability of deep learning models is a critical area that needs to be addressed so that solutions based on these technologies can be more reliable and understandable in the cybersecurity domain [44].
Another promising field is federated learning, which offers an innovative solution for training models on distributed data, preserving privacy, and improving adaptability in IoT networks and other decentralized environments. However, its integration into malware detection solutions is still in its early stages, and further research is needed to make this technique accessible on a large scale [30].
Lastly, collaboration between academia and industry will be crucial for advancing research on new detection and defense techniques. Academia can offer innovations in algorithms and models, while industry provides access to real-time data on the current threats organizations face [35]. Through closer collaboration, more effective security solutions can be developed that are tailored to the needs of the current digital environment.
In summary, although significant advances have been made in malware detection, the constant evolution of cyber threats requires a continuous focus on innovation, collaboration, and the integration of advanced technologies. Future research should focus not only on improving the accuracy of models but also on ensuring that these systems are transparent, adaptable, and scalable to effectively respond to emerging threats.

Funding

This research received no external funding.

Acknowledgments

The authors extend their appreciation to the Doctorate Program in Ingeniería Informática and Doctorate Program in Intelligent Industry at the Pontifical Catholic University of Valparaiso for supporting this work. Sebastián Berríos Vásquez is supported by the National Agency for research and development (ANID)/ Scholarship Program/ Doctorado Nacional/2024-21240489 and Beca INF-PUCV.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Aslan, O.; Yilmaz, A.A. A New Malware Classification Framework Based on Deep Learning Algorithms. IEEE Access 2021, 9, 87936–87951.
  2. Aslan, Ö.; Ozkan-Okay, M.; Gupta, D. A Review of Cloud-Based Malware Detection System: Opportunities, Advances and Challenges. Eur. J. Eng. Technol. Res. 2021, 6, 1–8.
  3. Zhang, Q.; Wu, P.; Li, R.; Chen, A. Digital transformation and economic growth Efficiency improvement in the Digital media era: Digitalization of industry or Digital industrialization? Int. Rev. Econ. Financ. 2024, 92, 667–677.
  4. Shaukat, K.; Luo, S.; Varadharajan, V. A novel machine learning approach for detecting first-time-appeared malware. Eng. Appl. Artif. Intell. 2024, 131, 107801.
  5. Kim, G.; Lee, C.; Jo, J.; Lim, H. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. Int. J. Mach. Learn. Cybern. 2020, 11, 2341–2355.
  6. Deloitte. Beneath the Surface of a Cyberattack: A Deeper Look at Business Impacts. 2019. Available online: https://conventuslaw.com/report/beneath-the-surface-of-a-cyberattack-a-deeper-look/ (accessed on 9 March 2025).
  7. TetherView. 10 Business Impacts of a Data Breach. 2024. Available online: https://tetherview.com/blog/the-devastating-business-impacts-of-a-cyber-breach (accessed on 9 March 2025).
  8. Investopedia. 10 Ways Cybercrime Impacts Business. 2025. Available online: https://www.investopedia.com/financial-edge/0112/3-ways-cyber-crime-impacts-business.aspx (accessed on 9 March 2025).
  9. 6dg. Devastating Examples of the Consequences of a Cyber-Attack. 2021. Available online: https://www.6dg.co.uk/blog/consequences-of-a-cyber-attack/ (accessed on 9 March 2025).
  10. Cisco. What Is Machine Learning in Security? 2023. Available online: https://www.cisco.com/site/us/en/learn/topics/security/what-is-machine-learning-in-security.html (accessed on 9 March 2025).
  11. Kaspersky. Artificial Intelligence and Machine Learning in Cybersecurity; Springer: Cham, Switzerland, 2023.
  12. Souri, A.; Hosseini, R. A state-of-the-art survey of malware detection approaches using data mining techniques. Hum.-Centric Comput. Inf. Sci. 2018, 8, 1–22.
  13. Vinayakumar, R.; Soman, K.; Poornachandran, P. Deep learning approach to cybersecurity analysis. 2019 IEEE Trans. Emerg. Top. Comput. Intell. 2019, 4, 174–185.
  14. Sikorski, M.; Honig, A. Practical Malware Analysis: The Hands-on Guide to Dissecting Malicious Software; No Starch Press: San Francisco, CA, USA, 2012.
  15. Symantec. Internet Security Threat Report. 2019. Available online: https://docs.broadcom.com/docs/istr-24-executive-summary-en (accessed on 9 March 2025).
  16. Ye, Y.; Li, T.; Adjeroh, D.; Iyengar, S.S. A survey on malware detection using data mining techniques. ACM Comput. Surv. (CSUR) 2017, 50, 1–40.
  17. Buczak, A.L.; Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutorials 2016, 18, 1153–1176.
  18. Ucci, D.; Aniello, L.; Baldoni, R. Survey of machine learning techniques for malware analysis. Comput. Secur. 2019, 81, 123–147.
  19. Anderson, H.S.; Woodbridge, J.; Filar, B. DeepDGA: Adversarially-tuned domain generation and detection. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, Vienna, Austria, 28 October 2016; pp. 13–21.
  20. Alazab, M.; Tang, M.; Luo, Y.; Wan, Y.; Alazab, A. Deep learning applications for cyber security. IEEE Access 2020, 7, 48597–48610.
  21. Aslan, Ö.; Samet, R. A comprehensive review on malware detection approaches. IEEE Access 2020, 8, 6249–6271.
  22. Zhu, Z.; Dumitras, T. FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature. IEEE Trans. Inf. Forensics Secur. 2016, 13.
  23. Pascanu, R.; Stokes, J.; Sanossian, H.; Marinescu, M.; Thomas, A. Malware classification with recurrent networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 1916–1920.
  24. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. Declaración PRISMA 2020: Una guía actualizada para la publicación de revisiones sistemáticas. Rev. Española De Cardiol. 2021, 74, 790–799.
  25. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering. EBSE Technical Report EBSE-2007-01, Keele University. 2007. Available online: https://legacyfileshare.elsevier.com/promis_misc/525444systematicreviewsguide.pdf (accessed on 9 March 2025).
  26. Orman, H. The Morris worm: A fifteen-year perspective. IEEE Secur. Priv. 2003, 1, 35–43.
  27. Alenezi, M.N.; Alabdulrazzaq, H.K.; Alshaher, A.A.; Alkharang, M.M. Evolution of Malware Threats and Techniques: A Review. Int. J. Commun. Networks Inf. Secur. 2020, 12, 326–337.
  28. Executive Summary. Available online: https://en.wikipedia.org/wiki/Executive_summary (accessed on 9 March 2025).
  29. Ezzat Salem, I.; Hashim Al-Saedi, K. Enhancing cloud security through the integration of deep learning and data mining techniques: A comprehensive review. Period. Eng. Nat. Sci. 2023, 11, 176.
  30. Sáez-de Cámara, X.; Flores, J.L.; Arellano, C.; Urbieta, A.; Zurutuza, U. Clustered federated learning architecture for network anomaly detection in large scale heterogeneous IoT networks. Comput. Secur. 2023, 131, 103299.
  31. Raju, A.D.; Abualhaol, I.Y.; Giagone, R.S.; Zhou, Y.; Huang, S. A Survey on Cross-Architectural IoT Malware Threat Hunting. IEEE Access 2021, 9, 91686–91708.
  32. Bobrovnikova, K.; Lysenko, S.; Savenko, B.; Gaj, P.; Savenko, O. Technique for IoT Malware Detection Based on Control Flow Graph Analysis. Radio Electr. Comp. Sci. Control 2022, 1, 141–153.
  33. Almomani, I.; Alkhayer, A.; El-Shafai, W. An Automated Vision-Based Deep Learning Model for Efficient Detection of Android Malware Attacks. IEEE Access 2022, 10, 2700–2720.
  34. Yadav, P.; Menon, N.; Ravi, V.; Vishvanathan, S.; Pham, T. EfficientNet Convolutional Neural Networks-Based Android Malware Detection. Comput. Secur. 2022, 115, 102622.
  35. Ullah, F.; Alsirhani, A.; Alshahrani, M.; Alomari, A.; Naeem, H.; Shah, S. Explainable Malware Detection System Using Transformers-Based Transfer Learning and Multi-Model Visual Representation. Sensors 2022, 22, 6766.
  36. Shaukat, K.; Luo, S.; Varadharajan, V. A Novel Deep Learning-Based Approach for Malware Detection. Eng. Appl. Artif. Intell. 2023, 122, 106030.
  37. Watson, M.R.; Shirazi, N.u.h.; Marnerides, A.K.; Mauthe, A.; Hutchison, D. Malware Detection in Cloud Computing Infrastructures. IEEE Trans. Dependable Secur. Comput. 2016, 13, 192–205. [Google Scholar] [CrossRef]
  38. Aslan, Ö.; Ozkan-Okay, M.; Gupta, D. Intelligent Behavior-Based Malware Detection System on Cloud Computing Environment. IEEE Access 2021, 9, 83252–83271. [Google Scholar] [CrossRef]
  39. Zahoora, U.; Rajarajan, M.; Pan, Z.; Khan, A. Zero-Day Ransomware Attack Detection Using Deep Contractive Autoencoder and Voting Based Ensemble Classifier. Appl. Intell. 2022, 52, 13941–13960. [Google Scholar] [CrossRef]
  40. Mishra, P.; Verma, I.; Gupta, S. KVMInspector: KVM Based introspection approach to detect malware in cloud environment. J. Inf. Secur. Appl. 2020, 51, 102460. [Google Scholar] [CrossRef]
  41. Ferdous, J.; Islam, R.; Mahboubi, A.; Islam, M.Z. A Review of State-of-the-Art Malware Attack Trends and Defense Mechanisms. IEEE Access 2023, 11, 121118–121141. [Google Scholar] [CrossRef]
  42. Jang, S.; Li, S.; Sung, Y. FastText-based local feature visualization algorithm for merged image-based malware classification framework for cyber security and cyber defense. Mathematics 2020, 8, 460. [Google Scholar] [CrossRef]
  43. Choi, S. Combined kNN Classification and Hierarchical Similarity Hash for Fast Malware Detection. Appl. Sci. 2020, 10, 5173. [Google Scholar] [CrossRef]
  44. Kumar, R.; Subbiah, G. Zero-Day Malware Detection and Effective Malware Analysis Using Shapley Ensemble Boosting and Bagging Approach. Sensors 2022, 22, 2798. [Google Scholar] [CrossRef]
  45. Zhang, L.; Wang, H. Application of Ensemble Learning Methods in Malware Detection. Comput. Secur. 2021, 102, 102–115. [Google Scholar]
  46. Prajapati, P.; Stamp, M. An Empirical Analysis of Image-Based Learning Techniques for Malware Classification. In Malware Analysis Using Artificial Intelligence and Deep Learning; Springer: Berlin/Heidelberg, Germany, 2021; pp. 411–435. [Google Scholar] [CrossRef]
  47. McCarthy, A.; Ghadafi, E.; Andriotis, P.; Legg, P. Functionality-Preserving Adversarial Machine Learning for Robust Classification in Cybersecurity and Intrusion Detection Domains: A Survey. J. Cybersecur. Priv. 2022, 2, 154–190. [Google Scholar] [CrossRef]
  48. Wang, Y.; Zhang, X. Adversarial Machine Learning Techniques for Robust Malware Detection. J. Comput. Secur. 2021, 29, 345–360. [Google Scholar]
  49. Ali, S.F.; Abdulrazzaq, M.R.; Gaata, M.T. Learning Techniques-Based Malware Detection: A Comprehensive Review. Mesopo. J. CyberSecur. 2025, 3, 1–15. [Google Scholar] [CrossRef]
  50. Dehghantanha, A.; Conti, M. Cyber Threat Intelligence: Challenges and Opportunities. J. Inf. Secur. Appl. 2018, 70, 103–115. [Google Scholar] [CrossRef]
  51. Ferdous, J.; Islam, R.; Mahboubi, A.; Islam, M.Z. A Survey on ML Techniques for Multi-Platform Malware Detection: Securing PC, Mobile Devices, IoT, and Cloud Environments. Sensors 2025, 25, 1153. [Google Scholar] [CrossRef]
  52. AlHabshy, A.A.; Hameed, B.I.; Eldahshan, K.A. An Ameliorated Multiattack Network Anomaly Detection in Distributed Big Data System-Based Enhanced Stacking Multiple Binary Classifiers. IEEE Access 2022, 10, 52724–52743. [Google Scholar] [CrossRef]
  53. Yuan, C.; Guo, Y. Functionality-Preserving Adversarial Examples for Cybersecurity. ACM Comput. Surv. 2021, 54, 1–28. [Google Scholar]
  54. Wang, H.; Xu, J. Challenges in Adversarial Attacks on Intrusion Detection Systems. Comput. Secur. 2022, 110, 102–115. [Google Scholar]
  55. Bojanowski, L.; Wang, Q. Functionality-Preserving Adversarial Attacks in Visual and Network Domains. Cybersecur. Adv. 2021, 2, 178–190. [Google Scholar]
  56. Szegedy, C.; Zareapour, H. Adversarial Examples: Surprising Robustness of Neural Networks in Cybersecurity. Mach. Learn. Secur. 2022, 13, 99–112. [Google Scholar]
  57. Ghadafi, E.; McCarthy, A.; Andriotis, P.; Legg, P. Functionality-preserving adversarial machine learning for robust classification in cybersecurity and intrusion detection domains: A survey. Int. J. Cybersecur. 2021, 5, 123–135. [Google Scholar] [CrossRef]
  58. Xu, S.; Wang, H. Systematic Review of Adversarial Attacks in Intrusion Detection. J. Netw. Secur. 2021, 19, 243–258. [Google Scholar]
  59. Ghadafi, E.; McCarthy, A.; Andriotis, P. Attacks and Defenses in Intrusion Detection Systems: A Survey on the Adversarial Perspective. Comput. Secur. 2022, 108, 102–115. [Google Scholar]
  60. Liu, Z.; Li, J. A Machine Learning-Based Framework for Ransomware Detection and Mitigation. Comput. Secur. 2021, 105, 102–115. [Google Scholar]
  61. Li, H.; Zhang, Y. Adversarial Detection in Intrusion Detection Systems: A Review. J. Inf. Secur. Appl. 2022, 58, 200–213. [Google Scholar]
  62. Li, Y.; Wang, X.; Shi, Z.; Zhang, R.; Xue, J.; Wang, Z. Boosting Training for PDF Malware Classifier via Active Learning. Int. J. Intell. Syst. 2022, 37, 2803–2821. [Google Scholar] [CrossRef]
  63. Schmitt, M. Securing the Digital World: Protecting Smart Infrastructures and Digital Industries with Artificial Intelligence (AI)-Enabled Malware and Intrusion Detection. J. Inf. Secur. Appl. 2023, 36, 100520. [Google Scholar] [CrossRef]
  64. Kim, H.-m.; Lee, K.-h. IIoT Malware Detection Using Edge Computing and Deep Learning for Cybersecurity in Smart Factories. Appl. Sci. 2022, 12, 7679. [Google Scholar] [CrossRef]
  65. Kara, I. Fileless Malware Threats: Recent Advances, Analysis Approach Through Memory Forensics and Research Challenges. Expert Syst. Appl. 2023, 214, 119133. [Google Scholar] [CrossRef]
  66. Akhtar, M.S.; Feng, T. Evaluation of Machine Learning Algorithms for Malware Detection. Sensors 2023, 23, 946. [Google Scholar] [CrossRef]
  67. Chaganti, R.; Ravi, V.; Pham, T.D. Image-Based Malware Representation Approach with EfficientNet Convolutional Neural Networks for Effective Malware Classification. J. Inf. Secur. Appl. 2022, 69, 103306. [Google Scholar] [CrossRef]
  68. Ahmed, U.; Lin, J.C.W.; Srivastava, G. Mitigating Adversarial Evasion Attacks of Ransomware Using Ensemble Learning. Comput. Electr. Eng. 2022, 100, 107903. [Google Scholar] [CrossRef]
  69. Alqahtani, A.; Sheldon, F. A Survey of Crypto Ransomware Attack Detection Methodologies: An Evolving Outlook. Sensors 2022, 22, 1837. [Google Scholar] [CrossRef]
  70. Escudero García, D.; DeCastro-García, N.; Muñoz Castañeda, A.L. An Effectiveness Analysis of Transfer Learning for the Concept Drift Problem in Malware Detection. Expert Syst. Appl. 2023, 212, 118724. [Google Scholar] [CrossRef]
  71. Johnson, A.; Brown, B. Detection of Phishing Attacks Using Sequential and Parallel Machine Learning Techniques. J. Inf. Secur. 2023, 18, 123–134. [Google Scholar]
  72. Davis, E.; Wilson, F. Deep Learning Approaches for Malware Detection. IEEE Trans. Inf. Forensics Secur. 2022, 17, 567–578. [Google Scholar]
  73. Lee, G.; Kim, H. Anomaly Detection in Networks Using Machine Learning. J. Netw. Comput. Appl. 2022, 190, 103207. [Google Scholar]
  74. Martinez, I.; Thompson, J. Detection of Cyber Attacks in IoT Environments Using Machine Learning Techniques. IEEE Internet Things J. 2022, 9, 3456–3467. [Google Scholar]
  75. White, K.; Green, L. Detection of Ransomware Using Machine Learning Techniques. J. Comput. Secur. 2022, 30, 189–201. [Google Scholar]
  76. Nagy, N.; Aljabri, M.; Shaahid, A.; Ahmed, A.; Alnasser, F.; Almakramy, L.; Alhadab, M.; Alfaddagh, S. Phishing URLs Detection Using Sequential and Parallel ML Techniques: Comparative Analysis. Sensors 2023, 23, 3467. [Google Scholar] [CrossRef]
  77. Nashaat, M.; Shahid, M.; Nashaat, M.; Alshayeb, M. Distributed Deep Neural Network-Based Middleware for Cyber-Attacks Detection in Smart IoT Ecosystem: A Novel Framework and Performance Evaluation Approach. Electronics 2023, 12, 1876. [Google Scholar] [CrossRef]
  78. Doe, J.; Smith, J. Phishing URLs Detection Using Machine Learning Techniques. J. Cybersecur. 2022, 15, 234–245. [Google Scholar]
  79. Seneviratne, S.; Shariffdeen, R.; Rasnayaka, S.; Kasthuriarachchi, N. Self-Supervised Vision Transformers for Malware Detection. IEEE Access 2022, 4. [Google Scholar] [CrossRef]
  80. Perwej, D.; Qamar Abbas, S.; Pratap Dixit, J.; Akhtar, D.N.; Kumar Jaiswal, A. A Systematic Literature Review on the Cyber Security. Int. J. Sci. Res. Manag. 2021, 9, 669–710. [Google Scholar] [CrossRef]
  81. Silvianita, A.; Zahid, A.; Fakhri, M.; Ahmad, M.; Yunani, A.; bin Abu Sujak, A.F. Cyber security and information: A systematic literature review. In Proceedings of the International Conference on Mathematical and Statistical Physics, Computational Science, Education and Communication (ICMSCE 2023), Istanbul, Turkey, 6–7 September 2023; Purnama, A., Arafah, B., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2023; Volume 12936, p. 1293609. [Google Scholar] [CrossRef]
  82. Saeed, S.; Suayyid, S.A.; Al-Ghamdi, M.S.; Al-Muhaisen, H.; Almuhaideb, A.M. A Systematic Literature Review on Cyber Threat Intelligence for Organizational Cybersecurity Resilience. Sensors 2023, 23, 7273. [Google Scholar] [CrossRef]
  83. Al-Haija, Q.A.; Odeh, A.; Qattous, H. PDF Malware Detection Based on Optimizable Decision Trees. Electronics 2022, 11, 3142. [Google Scholar] [CrossRef]
  84. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
Figure 1. Graph of the number of publications on malware in the area of cybersecurity.
Figure 2. PRISMA flowchart.
Figure 3. Malware Evolution.
Figure 4. Methodologies for Malware Detection and Classification.
Figure 5. Feature Extraction Techniques.
Figure 6. Frequency of ML/DL Algorithms.
Figure 7. Overview diagram of key findings and thematic relationships in malware detection research (2020–2024).
Table 1. Search results for malware and cybersecurity keywords.

| Platform | Keywords | Quantity |
|---|---|---|
| Web of Science | malware (All Fields) and cybersecurity (All Fields) | 578 |
| Scopus | (TITLE-ABS-KEY (malware) AND TITLE-ABS-KEY (cybersecurity)) | 2126 |
Table 2. Inclusion and exclusion criteria results.

| Platform | Keywords | Quantity |
|---|---|---|
| Web of Science | malware (All Fields) AND cybersecurity (All Fields) and 2024 or 2023 or 2022 or 2021 or 2020 | 212 |
| Scopus | (TITLE-ABS-KEY (malware) AND TITLE-ABS-KEY (cybersecurity)) AND PUBYEAR > 2019 AND PUBYEAR < 2025 | 1057 |
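
The effect of the publication-year restriction can be read off by comparing the record counts in Tables 1 and 2. The following is a minimal sketch of that arithmetic only; the dictionary and function names are illustrative assumptions and are not part of the review's method or tooling.

```python
# Record counts copied from Table 1 (initial search) and Table 2 (2020-2024 filter).
initial_counts = {"Web of Science": 578, "Scopus": 2126}
filtered_counts = {"Web of Science": 212, "Scopus": 1057}


def retention_percent(before, after):
    """Return, per platform, the percentage of records kept after filtering."""
    return {
        platform: round(100 * after[platform] / before[platform], 1)
        for platform in before
    }


retained = retention_percent(initial_counts, filtered_counts)
for platform, percent in retained.items():
    print(f"{platform}: {percent}% of records retained")
```

Under these counts, roughly a third of the Web of Science records and about half of the Scopus records fall within the 2020–2024 window.
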
Table 5. Distribution of the most researched techniques in malware detection.

| Technique | Percentage of Studies (%) |
|---|---|
| Machine Learning | 45 |
| Deep Learning | 30 |
| Computer Vision | 15 |
| Static Analysis | 5 |
| Dynamic Analysis | 5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
