The extraction of representative features from volatile memory is followed by the final step of the methodology: classification of the memory features using AI models. In order to identify the most accurate solutions, various data engineering techniques and ML/DL models and architectures are tested. The derived architecture comprises a two-stage malware detection process capable of performing well on the source dataset as well as on data collected from a different setup (also using a drift-correction methodology), to ensure applicability in different environments. Furthermore, explainable AI techniques are adopted to support malware analysts’ understanding of the reasons behind the framework’s decisions.
3.3.1. Data Preprocessing
Several data pre-processing procedures were conducted before training and classification to guarantee consistency and model compatibility. Categorical malware family labels were first derived from hierarchical category strings using regular expressions. At the same time, checks for missing or corrupt feature values were implemented so that such cases could be removed. Duplicate records and high-cardinality features with excessive sparsity were eliminated to increase generalizability.
More specifically, before model training, we first merged the benign and malware subsets, resulting in 58,168 samples with 59 original features. We then removed 34 features (57.63%) that were either metadata unusable for real-time detection (e.g., the Year/Month in which the malware was created) or features whose extraction increases memory-dump time, yielding a retained set of 25 features (24 + Category or Class, depending on the case). After this feature-filtering step, rows with missing values in the retained feature set were removed (14 rows, 0.024%), and exact duplicate records (row-level duplicates over the full feature vector) were also removed, eliminating 1821 samples (3.131%). Overall, 1835 rows (3.155%) were excluded.
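As an illustration, the merging, label-derivation, and cleaning steps above can be sketched with pandas; the column names (Category, Year) and the category-string format are hypothetical stand-ins for the actual dataset schema.

```python
import pandas as pd

# Hypothetical sketch of the preprocessing described above; column names
# and the category-string format are illustrative assumptions.
def preprocess(benign: pd.DataFrame, malware: pd.DataFrame,
               drop_cols=("Year", "Month")) -> pd.DataFrame:
    df = pd.concat([benign, malware], ignore_index=True)      # merge subsets
    # Derive the top-level family label from a hierarchical category string,
    # e.g. "Trojan-emotet-..." -> "Trojan".
    df["Class"] = df["Category"].str.extract(r"^([A-Za-z]+)", expand=False)
    # Drop metadata that is unusable at run time (e.g., creation Year/Month).
    df = df.drop(columns=[c for c in drop_cols if c in df])
    df = df.dropna()                                          # missing values
    df = df.drop_duplicates()                                 # row-level duplicates
    return df.reset_index(drop=True)
```

A call such as `preprocess(benign_df, malware_df)` would then yield the cleaned, merged table used for model training.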
To avoid information leakage, the final feature set was fixed a priori based on operational constraints (features that increased acquisition/extraction latency or were not available at run time were excluded). Correlation analysis was also used as an additional exploratory redundancy inspection on the training data to better understand inter-feature dependencies (Figure 2). Tree-based approaches were trained on raw values (they are insensitive to feature scaling), whereas inputs to non-tree-based models were normalized with StandardScaler [39].
We then trained models via stratified 5-fold cross-validation (k = 5). In each fold, the model was trained on 4/5 of the data (80%) and evaluated on the remaining 1/5 (20%) [60,61,62].
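A minimal sketch of this protocol, with the scaler fitted inside each training fold only so that no test-fold statistics leak into preprocessing; the logistic-regression classifier and the synthetic data are stand-ins for the models and features actually compared.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in skf.split(X, y):
    scaler = StandardScaler().fit(X[train_idx])        # fit on the 80% split only
    clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    accs.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
mean_acc = float(np.mean(accs))
```

Each fold thus reproduces the 80/20 train/evaluate split, and the per-fold accuracies are averaged into a single cross-validated score.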
Furthermore, as described in Section 3.3.5, we validated our methodology on additional data extracted from custom malware executions. During this validation, we observed a significant domain drift between the MemMal-D2024 dataset and the data collected from malware execution on a sandboxed VM. To mitigate this issue, we tested an affine-transformation methodology on the MemMal-D2024 data. The associated results are described in Section 4.2.5.
3.3.4. Two-Stage Host-Based Malware Detection
The initial approach was implemented on a combined dataset with four classes (Benign, Spyware, Ransomware, Trojan), as illustrated in Figure 3. The distribution plot shows a class imbalance between the Benign (29,298 cases), Spyware (10,020 cases), Ransomware (9791 cases), and Trojan (9059 cases) classes. For this implementation, several models were tested in a multi-model comparison. The results of this approach are provided in Section 4.2.1.
Based on the results of this initial approach, a two-stage host-based malware detection scheme was implemented to boost performance.
The proposed framework employs a two-stage workflow that combines a DL model optimized for tabular data with a heterogeneous ensemble of gradient boosting classifiers. This approach enables both high accuracy and interpretability, while maintaining low inference latency.
A single multi-class classifier would need to jointly solve two qualitatively different problems: firstly, the benign–malware separation, and, secondly, fine-grained family attribution among malware classes. In our setting, the first task is highly separable using memory-forensics artifacts, whereas the second is inherently harder, due to class imbalance and feature overlap between families. As a result, a single 4-class (Benign, Spyware, Ransomware, Trojan) model can be dominated by the benign–malware decision boundary and may sacrifice minority-family recall (
Figure 3).
The first stage focused on distinguishing benign from malicious memory snapshots that contain malware (
Figure 4). The implementation used the TabNet Classifier from the PyTorch [
59] TabNet library, which uses sequential attention to select relevant features at each decision step, providing inherent interpretability through its feature-mask mechanism. The model was trained on the MemMal-D2024 samples, whose malware creation metadata spans 2006–2021, with a stratified split between training and testing data. Prior to training, categorical malware family strings were label-encoded, unnecessary columns were dropped, and feature vectors were normalized using a standard scaler. The available Year/Month fields reflect malware creation metadata and are not used as predictive features; thus, they should not be interpreted as chronological acquisition timestamps for temporal-split evaluation. The TabNet model outputs class probabilities, to which the final binary decision threshold is applied in order to determine whether a sample proceeds to the second, multiclass stage of malware categorization.
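The gating step at the end of the first stage can be sketched as follows; the probabilities would come from the trained TabNet model’s predicted class probabilities, and the 0.5 threshold is an illustrative assumption rather than the tuned operating point.

```python
import numpy as np

# Stage-1 gating logic: samples whose malware probability meets the
# threshold are forwarded to the stage-2 family classifier; the rest
# are reported as benign and the pipeline terminates for them.
def route_to_stage2(p_malware: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask of samples forwarded to the family classifier."""
    return p_malware >= threshold

probs = np.array([0.05, 0.62, 0.49, 0.93])   # illustrative TabNet outputs
mask = route_to_stage2(probs)                # -> [False, True, False, True]
```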
The second stage performed malware family classification for instances identified as malicious in the first stage. It integrated a Voting Classifier using LGBM, HGB, and XGB, with the three algorithms trained independently on the same pre-processed feature space for multiclass malware-family classification. The training followed the same split and pre-processing as the first classification stage, to enable a like-for-like comparison, and each model output a discrete family label (e.g., Ransomware, Spyware, Trojan) for subsequent evaluation. All quantitative evaluations of the model runs are reported in the section on implementation results.
The two-stage pipeline architecture enhances the detection accuracy and interpretability of host-based malware classification. The central concept is to divide the classification task into two distinct, sequential subtasks:
The first step, Classifier 1 (C1) (Figure 5), is modeled as a binary classification task that differentiates benign samples from malicious ones. This stage acts as a high-level filter that screens out non-threats before the malware samples move on to multiclass classification. As mentioned, a TabNet DL model was implemented using labeled benign and malware memory dumps (
Figure 4).
If a sample is classified as benign, the pipeline terminates. If the input sample is labeled as malware, it proceeds to Stage 2, utilizing Classifier 2 (C2) (
Figure 5).
The second step of the pipeline is initiated if a sample is detected as malware: the multiclass classifier C2 is activated to identify the specific malware family. The three malware types are Trojan, Ransomware, and Spyware, with the class distribution illustrated in
Figure 6. The two-step pipeline is illustrated in
Figure 5.
After a comprehensive comparison between several algorithms, the selected model for this step was a Voting Classifier ensemble combining the LGBM, XGB, and HGB models. The predicted class is the one with the highest class probability averaged across the three base models (soft voting).
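The soft-voting rule can be sketched as follows; the three probability matrices stand in for the outputs of the LGBM, XGB, and HGB base models, and the class indices are an illustrative encoding of the three families.

```python
import numpy as np

# Soft voting: the predicted family is the class with the highest
# probability averaged over the three base models.
def soft_vote(prob_matrices):
    avg = np.mean(prob_matrices, axis=0)   # shape (n_samples, n_classes)
    return np.argmax(avg, axis=1)

# Two samples, three classes (e.g. 0=Ransomware, 1=Spyware, 2=Trojan).
p_lgbm = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_xgb  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p_hgb  = np.array([[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]])
labels = soft_vote([p_lgbm, p_xgb, p_hgb])   # -> [0, 2]
```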
3.3.5. Validation on Testbed Malware Data
In cybersecurity threat detection, the underlying data distributions often evolve over time, due to changing network patterns, new protocols, or user behavior, creating a phenomenon known as domain drift. Models trained on historical (source domain) data may therefore become less effective as the statistical properties of features change [
64]. Furthermore, a model trained in an environment configured in a certain way may perform very well in that environment yet fail when applied in a different one. In the host-based malware detection case, a different environment may involve hosts with different hardware, OS versions, running processes/services, etc. Such differences heavily influence memory feature distributions and, therefore, model performance. It is nevertheless crucial to ensure that the AI models can also perform well in real-world deployments. For that purpose, external validation with data collected from a source different from the training dataset is essential.
For the AI pipeline’s prediction evaluation on independent data from a different environment, a laboratory testbed was set up in which three different attack scenarios were applied, leveraging custom Trojan, Spyware, and Ransomware malware that do not appear in the source dataset. A Windows Server 2016 host was configured to run typical enterprise services, including an HTTP/web service and a Microsoft SQL Server (MSSQL) database. This server stands in for a realistic production machine: IIS or another web server can serve web pages or APIs over HTTP, while MSSQL provides backend database services. The server runs standard Windows services under service accounts (or built-in accounts) as needed by MSSQL, and these background services launch automatically on boot. Furthermore, a Debian machine was used to carry out the attacker’s actions.
The first malware scenario represents a Trojan use case. A backdoor Windows executable was built with the msfvenom tool. The malicious file was served from the attacker’s machine and downloaded to the target Windows server. Upon successful download, the malware was executed to open a reverse TCP Meterpreter session to the attacker’s machine. Once the session was open, migration to a legitimate Windows process (such as winlogon.exe) followed, in order to achieve stability and stealthiness.
The second malware scenario is an extension of the Trojan scenario but focuses on Spyware activities. An attacker, having gained access to the target system through a reverse TCP Meterpreter session, can try to gain elevated SYSTEM privileges via the getsystem Meterpreter command. After elevating privileges, the attacker can gather OS and system information or even dump credentials from LSASS memory. In this Spyware scenario, the kiwi Meterpreter extension was loaded, which contains commands to perform Mimikatz-style credential dumping.
The final malware scenario represents a Ransomware attack. First, the Ransomware payload was built and served from the attacker’s machine using a Python HTTP server. For the Ransomware implementation, PSRansom was used: a Ransomware simulation tool designed to demonstrate how Ransomware operates by encrypting files in a target directory using AES-256 encryption, communicating with a Command & Control (C2) server to exfiltrate encryption keys and data, and creating a ransom note. The Ransomware was downloaded and executed on the Windows server, and, as a result, critical data on the server were encrypted.
The validation scenarios were designed to cover a representative set of MITRE ATT&CK tactics and techniques across different phases of adversarial behavior. UC1 (Trojan) includes techniques associated with the Resource Development tactic, namely, T1587.001 (Develop Capabilities: Malware) and T1608.001 (Stage Capabilities: Upload Malware), followed by Command and Control through T1105 (Ingress Tool Transfer). The scenario further incorporates the Execution tactic via T1059.001 (Command and Scripting Interpreter: PowerShell) and achieves Persistence through T1055 (Process Injection).
UC2 (Spyware) emphasizes post-compromise activities and spans multiple tactics. It begins with Privilege Escalation using T1134.001 (Access Token Manipulation: Token Impersonation/Theft), followed by Discovery through T1082 (System Information Discovery). The scenario concludes with Credential Access using T1003.001 (OS Credential Dumping: LSASS Memory).
UC3 (Ransomware) focuses on later-stage attack behavior, incorporating Command and Control via T1105 (Ingress Tool Transfer) and Execution through T1059.001 (PowerShell). The scenario culminates in the Impact tactic, represented by T1486 (Data Encrypted for Impact), which reflects the primary objective of Ransomware operations.
Based on those malware scenarios, a validation dataset was created by collecting memory dump features from normal operation of the Windows server, as well as from several executions of each malware scenario. Specifically, the final dataset included 20 benign samples from the Windows server, 20 samples from Trojan execution after the migration of the Meterpreter session to the legitimate Windows process, 20 samples from the Spyware execution when the credential dumping commands were executed, and 34 Ransomware samples during encryption.
Evaluating the binary TabNet and multiclass voting models on the aforementioned dataset revealed that the two malware detection models do not generalize well beyond their training environment, performing very poorly in other domains.
Drift-aware pre-processing techniques that realign feature distributions can mitigate this problem, ensuring that ML–based intrusion detection systems continue to perform reliably [
65].
One effective approach for distribution alignment is based on minimizing the Wasserstein distance, which measures the optimal transport cost between two probability distributions. Recent research demonstrates the applicability of Wasserstein-based domain adaptation in cybersecurity, showing that minimizing this distance between source and target domains can improve intrusion detection performance [
66]. The 1-D Wasserstein distance $W_1$ between two one-dimensional probability distributions $P$ and $Q$, with cumulative distribution functions $F_P$ and $F_Q$, is defined as follows:

$$W_1(P, Q) = \int_{-\infty}^{+\infty} \left| F_P(x) - F_Q(x) \right| \, dx$$
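For equally weighted empirical samples, this definition can be evaluated directly from the two empirical CDFs; the sketch below is a plain NumPy version of what scipy.stats.wasserstein_distance computes.

```python
import numpy as np

# Empirical 1-D Wasserstein distance between two samples, computed as the
# integral of the absolute difference between their empirical CDFs.
def wasserstein_1d(u: np.ndarray, v: np.ndarray) -> float:
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)                 # widths of the CDF step intervals
    # Empirical CDF of each sample evaluated on the pooled support.
    cdf_u = np.searchsorted(np.sort(u), all_vals[:-1], side="right") / len(u)
    cdf_v = np.searchsorted(np.sort(v), all_vals[:-1], side="right") / len(v)
    return float(np.sum(np.abs(cdf_u - cdf_v) * deltas))

# Two point masses shifted by 5 units -> distance 5.
d = wasserstein_1d(np.array([0.0, 1.0]), np.array([5.0, 6.0]))   # -> 5.0
```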
Building on these principles, an affine transformation method (using scale and shift) provides a lightweight and practical way to align the distributions of numerical features. By optimizing the transformation to minimize the Wasserstein distance between the historical (source) and current (target) distributions, source-domain training data can be reconfigured to align with the target domain, as seen in
Figure 7,
Figure 8 and
Figure 9. This specific approach leverages both linear and log-space scaling and shifting to directly correct the drift of each feature, maintaining the validity of model assumptions and improving robustness against both natural and adversarial distribution shifts [
67].
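A minimal per-feature sketch of such a scale-and-shift correction follows. Here the scale and shift are obtained in closed form by matching the target mean and standard deviation, which is a simpler stand-in for the Wasserstein-minimizing optimization described above; the log-space variants are omitted.

```python
import numpy as np

# Fit an affine map (scale, shift) that sends the source feature
# distribution onto the target one by matching first/second moments.
def fit_affine(source: np.ndarray, target: np.ndarray):
    scale = np.std(target) / (np.std(source) + 1e-12)
    shift = np.mean(target) - scale * np.mean(source)
    return scale, shift

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=2000)     # source (historical) feature
tgt = rng.normal(5.0, 2.0, size=2000)     # drifted target feature
scale, shift = fit_affine(src, tgt)
aligned = scale * src + shift             # source re-expressed in the target domain

# Empirical W1 for equal-size samples: mean gap between sorted values.
w_before = float(np.mean(np.abs(np.sort(src) - np.sort(tgt))))
w_after = float(np.mean(np.abs(np.sort(aligned) - np.sort(tgt))))
```

The transformation is applied per numerical feature, and the drop from `w_before` to `w_after` shows the distribution alignment achieved.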
The results of implementing this method, along with comparisons of performance with and without its use, are presented in
Section 4.2.5.
As a comparative approach, a transformer-based domain adaptation method similar to that described in [
66] was also evaluated. Each feature was treated as a token and embedded using a lightweight transformer encoder to capture inter-feature dependencies. The encoder was trained to map both source and target samples into a shared latent representation space [
64]. The target data used for training the encoder were excluded from testing to prevent data leakage. The training goal of the encoder was to minimize the sliced Wasserstein distance between the latent domain distributions, thereby promoting geometric alignment, while maintaining computational efficiency. The malware detection and classification models were then trained using the latent domain representations of the source data. Additionally, for a more mainstream comparison, CORAL domain adaptation methodology, as explained in [
68], was also examined.
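For reference, the CORAL baseline can be sketched in a few lines of NumPy: the source features are whitened with their own covariance and then re-colored with the target covariance, so that the second-order statistics of the two domains match. Features are assumed to be approximately zero-mean (or standardized beforehand), and the regularization constant is an illustrative choice.

```python
import numpy as np

# Minimal CORAL (CORrelation ALignment) sketch: align source second-order
# statistics to the target domain via whitening and re-coloring.
def coral(Xs: np.ndarray, Xt: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    # Matrix square roots via eigendecomposition (covariances are symmetric PSD).
    def sqrtm(C, inv=False):
        w, V = np.linalg.eigh(C)
        w = np.clip(w, eps, None)
        p = -0.5 if inv else 0.5
        return (V * w**p) @ V.T

    return Xs @ sqrtm(Cs, inv=True) @ sqrtm(Ct)   # whiten, then re-color

rng = np.random.default_rng(2)
Xs = rng.normal(size=(500, 3))                          # source domain
A = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 1.5]])
Xt = rng.normal(size=(500, 3)) @ A.T                    # correlated target domain
Xs_aligned = coral(Xs, Xt)
```

After alignment, a classifier trained on `Xs_aligned` sees feature correlations matching those of the target domain.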
The evaluation of both approaches was performed using the same test protocol and target-domain testbed dataset as the ones employed for the affine transformation method in
Section 4.2.5, the results of which are demonstrated in
Table 3 and
Table 4. As shown, applying transformer-based domain adaptation with such a small available target dataset proved detrimental: the source-domain accuracy of the proposed two-stage pipeline dropped below 50%, while the target-domain accuracy saw only marginal improvement compared to that of the original models. This behavior indicates that the learned mapping distorted task-relevant feature geometry. Applying CORAL domain adaptation, on the other hand, yielded a 7% improvement in target-domain multiclass classification performance but dropped target-domain binary model performance to 50%. As a result, neither approach was pursued further in subsequent experiments.