1. Introduction
Ransomware attacks have grown into one of the most disruptive threats in cybersecurity, with widespread implications for personal users, enterprises, and critical infrastructure [1]. These attacks typically encrypt user data and demand payment in exchange for decryption keys [2]. As attackers continue to modify payloads and infection methods, conventional rule-based antivirus and static signature detection approaches have shown limitations in identifying novel or mutated threats [3]. This shifting threat landscape has increased demand for intelligent detection frameworks that can adapt to emerging attack vectors [4]. The growing availability of behavioral logs and sandbox-based forensic tools presents an opportunity to explore learning-based approaches that monitor system activities and detect threats before encryption occurs [5].
Behavior-based detection methods offer promising capabilities by observing process patterns, API call sequences, file system modifications, and CPU usage [6]. These indicators reflect the internal states and behaviors of ransomware during execution. With proper modeling, they allow security systems to make preventive decisions in real time [7]. While machine learning has been applied to classify ransomware using such features, many systems rely on batch-trained models, which are not well suited to sequential decision making or dynamic threat environments [8]. Reinforcement learning (RL), in contrast, provides a feedback-driven mechanism for continuously learning how to respond to new behaviors through interaction with the environment [9]. By framing the problem as a Markov Decision Process (MDP), systems can learn optimal actions, such as blocking a process or allowing it to run [10].
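To make the MDP framing concrete, the observe-act-reward loop can be sketched as a toy environment step. The feature names, thresholds, and reward values below are illustrative assumptions for exposition only, not the specification used later in the paper:

```python
# Illustrative MDP transition for ransomware detection.
# State features, thresholds, and reward values are hypothetical,
# chosen only to show the structure (observe -> act -> reward).

ALLOW, BLOCK = 0, 1

def step(state, action):
    """One MDP transition: returns (reward, done).

    state: dict of behavioral indicators for the current window.
    action: ALLOW or BLOCK.
    """
    # A crude stand-in for ground truth: high write entropy plus a
    # burst of file writes suggests encryption-like behavior.
    malicious = state["file_entropy"] > 0.9 and state["write_burst"] > 50
    if action == BLOCK:
        # Reward a correct block; penalize blocking a benign process.
        reward = 1.0 if malicious else -0.5
        done = True
    else:
        # Allowing ransomware to continue is penalized.
        reward = -1.0 if malicious else 0.1
        done = False
    return reward, done

r, done = step({"file_entropy": 0.97, "write_burst": 120}, BLOCK)
```

The essential point is the sequential structure: each action earns a reward whose sign depends on whether the underlying process was actually malicious, which is what allows a policy to be learned from interaction rather than from labeled batches.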
Cyberattacks, particularly ransomware, have evolved into a global threat with severe economic, operational, and societal consequences. Recent industry and government reports indicate that ransomware incidents have resulted in billions of dollars in financial losses annually, disrupting healthcare systems, financial institutions, smart city infrastructures, and critical public services. The increasing frequency and sophistication of ransomware attacks highlight the limitations of traditional signature-based defenses and underscore the urgent need for intelligent, adaptive detection mechanisms. Prior research surveys further emphasize this challenge. For example, the survey in [11] systematically analyzes how modern malware leverages dynamic execution behaviors to evade static detection mechanisms, concluding that behavior-aware and learning-based models are essential for combating evolving threats. Similar observations have been reported across intrusion detection and ransomware research, where static feature dependence limits robustness against zero-day and polymorphic attacks. These findings collectively justify the growing shift towards behavioral analysis and reinforcement learning-based frameworks capable of adapting to novel attack patterns in real time. This concern is particularly critical in smart cities and cyber-physical environments, where ransomware attacks can impact public safety, transportation systems, and essential services at scale.
Existing works often treat ransomware detection as a supervised classification problem, focusing on accuracy but not on policy learning or long-term adaptation [12]. This gap motivates the need for approaches that can function in uncertain, sequential environments where the optimal action depends on current and historical observations [13]. Modeling ransomware behavior in an RL setting creates the opportunity for systems that not only detect, but also decide when and how to intervene. Such a system could prevent encryption before it completes, mitigating damage without prior knowledge of the malware family involved.
Despite ongoing improvements in supervised learning and behavior modeling, most methods remain limited in flexibility and fail to address zero-day ransomware [14]. Traditional models lack exploration capabilities and perform poorly when facing new families or modified variants. Moreover, many datasets used in prior works are restricted to known samples and do not generalize across execution traces collected in varied environments [15]. This constraint limits the models’ relevance to real-world deployment scenarios. RL, which operates on trial-and-error interactions, offers a way to overcome these weaknesses by learning policies that are directly shaped by observed reward and punishment in the environment [16]. Several researchers have proposed hybrid frameworks combining static features and machine learning classifiers [17,18]. For instance, Anderson et al. focused on black-box adversarial environments using RL but without ransomware-specific models. Choi, Choi, and Buu proposed a proximal policy optimization (PPO) agent for Ethereum fraud detection but did not address behavioral ransomware traits. Sakthidevi et al. implemented machine learning on file behavior logs but did not support learning-based control or real-time adaptation. These approaches highlight ongoing exploration in the field, but few integrate environment interaction, online decision-making, and generalization across multiple ransomware families.
In this research paper, we propose a novel RL-based detection agent that models system behavior as a state space, defines reward based on process context, and learns to block ransomware before encryption completes. The agent is trained using logs generated from Cuckoo Sandbox and includes dynamic features such as file creation rates, CPU spikes, and registry access. We evaluated this agent across known ransomware variants including WannaCry, Ryuk, and Cerber to test generalization performance. Unlike previous works, our approach adapts online, updates its policy iteratively, and is guided by performance-based rewards. The goal is not only to detect ransomware, but to learn behavior patterns that allow prevention in real time.
This study addresses the following research questions:
RQ1: How can RL be used to detect ransomware behavior based on file, CPU, and registry activity in real time?
RQ2: How does the proposed RL agent compare with existing models in terms of accuracy, generalization, and error metrics?
RQ3: Can the RL agent effectively generalize to previously unseen ransomware families without retraining?
This paper makes the following core contributions. First, we formulate ransomware detection as a sequential decision-making problem and propose a reinforcement learning-based framework that enables real-time and proactive mitigation rather than offline classification. Second, we design a behavior-driven state representation using dynamic execution features extracted from sandbox analysis, allowing the model to generalize across diverse ransomware families without reliance on static signatures. Third, we conduct extensive experimental evaluation, including family-wise generalization analysis, demonstrating that the proposed approach achieves improved detection accuracy and early blocking capability compared to existing machine learning and deep learning-based methods.
The rest of the paper is organized as follows:
Section 2 presents the literature review on ransomware detection techniques, focusing on RL, behavioral analysis, and real-time monitoring strategies.
Section 3 details the construction of the RL environment, including the state representation, reward function, and deep Q-network (DQN) agent design.
Section 4 outlines the experimental setup, datasets, ransomware families, feature engineering, and training configurations.
Section 5 reports the detection performance metrics, comparative evaluation, training curves, Q-value mappings, and ablation studies.
Finally, Section 6 summarizes the findings and highlights future directions for adaptive and proactive ransomware mitigation systems.
2. Literature Review
Ransomware has rapidly emerged as a dominant form of cyberattack, causing extensive disruption to both public and private digital infrastructures [19]. It encrypts files, halts operations, and demands ransom payments—often in cryptocurrency—leaving organizations with few viable recovery options [20]. In recent years, ransomware groups have developed advanced techniques such as double extortion, fileless execution, and polymorphic payloads, making detection harder for static or signature-based antivirus systems [21]. These approaches depend on known patterns, making them ineffective against zero-day threats or mutated ransomware strains [22]. This challenge has accelerated interest in alternative models that can track behavioral characteristics rather than rely on predefined malware signatures [23].
Behavioral features—such as sudden file creation bursts, abnormal CPU spikes, unauthorized registry modifications, and anomalous network traffic—are more resilient indicators of ransomware activity [7]. These traits tend to remain consistent across variants, even as binary signatures change [24]. With access to execution traces via dynamic analysis tools such as Cuckoo Sandbox, researchers can extract system-level metrics that help profile malicious behavior [25]. Traditional machine learning techniques use these features to train binary classifiers; however, they typically assume i.i.d. data and static learning environments, which do not reflect the adversarial and evolving nature of ransomware attacks in real-world systems [26].
RL offers an alternative view of the detection problem. It frames the environment as a sequence of states influenced by actions taken by an agent, which receives rewards or penalties depending on the consequences of each action [27]. This structure suits ransomware detection well: the agent observes system behavior over time, learns what constitutes an anomaly, and adapts policies to intervene before encryption can complete. Unlike batch classification models, RL agents learn from delayed feedback and optimize cumulative reward. This enables better anticipation of multi-step attacks and supports generalization across ransomware families with different strategies [28].
Despite growing interest in intelligent cybersecurity systems, few studies have modeled ransomware defense explicitly as an RL environment [29]. Most research focuses on predictive modeling without active decision making. For example, Anderson et al. introduced an RL framework for black-box malware manipulation but did not target ransomware. Choi, Choi, and Buu explored PPO in fraud detection using Ethereum graphs but did not address real-time defense. Sakthidevi et al. applied machine learning on local logs to monitor threats but offered no learning-based policy control [30]. These efforts illustrate partial progress toward adaptive defense but stop short of fully interactive, environment-responsive systems tailored for ransomware behaviors.
The absence of control feedback, real-time learning, and sequential decision making in prior models reveals a gap in ransomware defense research [31]. Supervised models may offer high accuracy in static test environments, yet struggle with out-of-distribution samples or zero-day strains. Furthermore, most datasets are composed of known ransomware collected in limited sandbox scenarios and are not optimized for learning agent behavior. This makes it difficult to train models that adapt policies based on long-term reward. As threats become more unpredictable, there is a need for proactive systems capable of learning optimal defense strategies in dynamic contexts [32]. Our proposed framework introduces an RL agent that learns directly from environment observations by mapping behavioral features—such as file system operations, registry access, and CPU usage—into a structured state space. The agent interacts with a virtual sandbox, receives positive reward for correct blocks and negative reward for delayed or missed actions, and continuously updates its Q-values or policy networks [33]. Training involves both deep Q-network (DQN) and proximal policy optimization (PPO) agents, evaluated on logs sourced from custom executions of ransomware families including WannaCry, Ryuk, Cerber, and Locky. This method outperforms static baselines by incorporating adaptive learning, action-based intervention, and temporal feedback to detect and halt ransomware in real time [34].
Von der Assen et al. [35] proposed a deep Q-learning approach to model offensive ransomware that adapts to bypass security. Their results showed over 90% stealth success, which raised concern about defensive gaps. Anderson et al. [36] developed an RL-based black-box evasion agent, causing a 33% drop in model detection. Wang et al. [37] introduced a low-latency detection system that achieved a true positive rate of 99.65% with minimal system overhead. Svet et al. [38] utilized unsupervised learning and deep models to capture behavioral signals. Their method reduced false positives and performed well in high-velocity settings.
Berrueta et al. [39] applied classical ML methods on encrypted traffic, reaching 99.8% detection accuracy and a false positive rate of 0.004%. Sakthidevi et al. [40] developed a real-time behavioral framework with visual analysis, though it lacked evaluation across ransomware variants. Gazzan and Sheldon [41] improved deep belief networks using uncertainty-aware early stopping, boosting accuracy from 94% to 98%. Rani and Dhavale [42] compared multiple ML classifiers, with their best model achieving 98.21% accuracy using static behavioral features from labeled samples.
Choi, Choi, and Buu [43] implemented PPO-based tuning in graph neural networks and achieved an F1 score of 0.9478. Their approach showed potential for tasks involving time-sensitive drift but was not designed for ransomware. Amaizu et al. [44] designed a federated model with privacy protection and tested it on medical images, reaching 79% accuracy. Their work emphasized distributed privacy but did not handle malicious activity. Al-Fawa’reh et al. [45] proposed a hybrid system combining GAN-generated synthetic data and ensemble classifiers. Their method achieved 99.1% accuracy on NSL-KDD and custom ransomware traffic.
Hurley et al. [46] used both static and dynamic features to train LSTM networks, reporting an F1 score of 0.965 and latency below 2.3 s. Their focus on feature fusion helped reduce detection delay. These studies provided various approaches to ransomware identification, but many lacked policy-based defense or were tested only on specific families. Most works applied supervised classifiers without adaptability to unseen variants. Only von der Assen et al. [35] and Anderson et al. [36] used RL to simulate adversarial agents, and neither addressed proactive blocking. Additionally, studies like Sakthidevi et al. and Rani and Dhavale relied on behavioral logs without generalization testing.
Real-time decision-making was partially covered in Wang et al. and Svet et al., yet these lacked formal learning policies. Studies by Berrueta et al. and Gazzan and Sheldon achieved high accuracy but were restricted to narrow setups. PPO and federated strategies from Choi, Choi, and Buu and Amaizu et al. were tested in non-ransomware domains, showing architecture potential but limited cross-domain adaptability. A majority of the reviewed methods focused on classification without state modeling or action policies, which limited their use in automated defense settings. This gap presents a case for RL agents that operate based on behavioral patterns and system transitions.
No reviewed study tested blocking policies across multiple ransomware families in a dynamic learning environment. The evaluation of GAN-based generalization in Al-Fawa’reh et al. and LSTM-based early detection in Hurley et al. showed promising results, but did not incorporate adaptive decision layers. Models that targeted encryption behavior (e.g., von der Assen et al.) were offensive and not defense-focused. Only Wang et al. measured response time explicitly. Hence, a real-time system built using DQN or PPO with behavioral logs and sandbox outputs could improve generalization and reduce latency while maintaining detection accuracy across variants like WannaCry and Ryuk. Such a system could fill the present methodological gap.
Table 1 shows the summary of the related work on ransomware detection and related techniques.
3. Proposed Methodology
This section models ransomware detection as a sequential decision-making problem using RL. We define a custom MDP to simulate system behavior under ransomware threats and train a DQN agent to proactively block malicious activity.
The proposed methodology maps system behavior onto an MDP. The environment is defined using behavioral traces extracted from Cuckoo Sandbox, including features such as file entropy, CPU usage, registry edits, and API call entropy. The agent interacts with this environment through actions such as allowing or blocking processes. Rewards are shaped based on entropy thresholds, encryption-like activity, and system responsiveness. A DQN is used to learn the optimal policy by minimizing the temporal difference loss across replayed experiences. The learning process is stabilized using target networks and experience replay buffers. The RL agent iteratively updates its policy based on observed transitions, enabling real-time adaptation to novel ransomware families. Hyperparameters such as the learning rate, discount factor, and exploration schedule were optimized through extensive tuning, and generalization was validated across WannaCry, Ryuk, Cerber, and Locky variants.
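The interaction of replay buffer, online estimates, and a periodically synced target can be illustrated with a minimal tabular stand-in for the two networks of a DQN. The hyperparameters, state labels, and transitions below are illustrative only, not the tuned values used in the experiments:

```python
import random
from collections import deque, defaultdict

GAMMA, ALPHA = 0.95, 0.1  # illustrative discount factor and step size

# Q-tables stand in for the online and (frozen) target networks.
q_online = defaultdict(lambda: [0.0, 0.0])   # state -> [Q(allow), Q(block)]
q_target = defaultdict(lambda: [0.0, 0.0])
replay = deque(maxlen=1000)                  # experience replay buffer

def td_update(batch):
    """Reduce the temporal difference error on a replayed mini-batch."""
    for s, a, r, s_next, done in batch:
        # The bootstrap target uses the frozen target table for stability.
        target = r if done else r + GAMMA * max(q_target[s_next])
        q_online[s][a] += ALPHA * (target - q_online[s][a])

def sync_target():
    """Copy online estimates into the target table (target-network sync)."""
    for s, qs in q_online.items():
        q_target[s] = list(qs)

random.seed(0)
# Two illustrative transitions: a correct terminal block, a benign allow.
replay.append(("high_entropy", 1, 1.0, "terminal", True))
replay.append(("benign", 0, 0.1, "benign", False))
for _ in range(50):
    td_update(random.sample(list(replay), min(len(replay), 2)))
    sync_target()
```

In the actual framework a neural network replaces the table and the target copy happens every fixed number of steps rather than every batch, but the TD-target structure is the same.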
The architecture diagram (Figure 1) presents a top–down view of the system design. It begins with real-time system monitoring at the top, collecting file system, process, and registry activity. This data flows into a feature extraction module that transforms raw logs into normalized vectors. These vectors are then fed into the RL agent, which evaluates Q-values for possible actions using a deep neural network. The selected action—such as terminating a process or allowing it to continue—is executed by the policy handler. The feedback loop completes as the environment responds to the action, providing a new state and reward to the agent. This design allows the system to operate in real time while continuously learning from system behavior. As supported in prior works such as Anderson et al. [36] and von der Assen et al. [35], this architecture emphasizes decision making under uncertainty, which is an essential trait in dynamic threat environments.
4. Experimental Setup
The experimental setup in this research was designed to rigorously evaluate the performance of the proposed RL-based ransomware detection framework in a controlled yet realistic environment. The datasets used in this study are drawn from both simulated and real-world ransomware activity, ensuring a broad representation of system behavior under attack and benign conditions. All logs and metadata were collected and preprocessed to match the RL environment assumptions defined in Section 3.
The primary dataset was constructed using behavioral logs captured from the Cuckoo Sandbox, an open-source automated malware analysis system. We deployed variants of well-known ransomware families including WannaCry, Locky, Ryuk, and Cerber. These families were selected for their diversity in encryption strategy, propagation mechanisms, and process injection techniques. Each malware sample was executed in isolation within the sandbox, and the system-level logs were captured for a runtime duration of 180 s per sample. These logs contained detailed traces of file system access, memory allocation, CPU usage, registry changes, and network activity. The environment was instrumented to monitor API calls and I/O entropy, which are known indicators of cryptographic operations. The extracted features were mapped to the same behavioral categories defined in Section 3.2 to ensure consistent state representation across training and evaluation.
Benign samples were obtained from standard software repositories and productivity tools. These included applications such as Notepad, Adobe Reader, web browsers, and installation scripts for common software. As with the ransomware samples, benign processes were executed in the Cuckoo Sandbox under identical configurations. The resulting system activity logs were labeled accordingly to differentiate malicious and non-malicious behavior.
For evaluation, we constructed a time-series dataset with a frame window of five seconds, where each window represented an individual state observation. The extracted features for each state included file entropy rate, the number of file writes, registry key edit count, memory write operations, CPU load percentage, and count of suspicious API calls. Each state was associated with a label indicating whether the observed behavior was benign or malicious.
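The windowing step described above can be sketched under an assumed event-log schema; the log format and event kinds below are hypothetical illustrations, while the aggregated features mirror the list in the text:

```python
import math
from collections import Counter

WINDOW = 5.0  # window length in seconds, as in the evaluation dataset

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0..8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def window_features(events, t0):
    """Aggregate events with t0 <= t < t0 + WINDOW into one state observation."""
    w = [e for e in events if t0 <= e["t"] < t0 + WINDOW]
    writes = [e for e in w if e["kind"] == "file_write"]
    return {
        "file_writes": len(writes),
        "registry_edits": sum(1 for e in w if e["kind"] == "reg_edit"),
        "suspicious_api": sum(
            1 for e in w if e["kind"] == "api" and e.get("suspicious")
        ),
        # Mean write entropy approximates the "file entropy rate" feature.
        "mean_write_entropy": (
            sum(byte_entropy(e["payload"]) for e in writes) / max(1, len(writes))
        ),
    }

log = [
    {"t": 0.5, "kind": "file_write", "payload": bytes(range(256))},  # high entropy
    {"t": 1.2, "kind": "reg_edit"},
    {"t": 6.0, "kind": "file_write", "payload": b"aaaa"},  # outside first window
]
state = window_features(log, 0.0)
```

Each such dictionary corresponds to one labeled state observation in the time-series dataset.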
To ensure robustness and prevent data leakage, we split the dataset temporally into training, validation, and test sets. The training set included early-stage behavior from the ransomware samples, while the test set included both early and late-stage encryption activity to evaluate generalization. We further ensured that ransomware families present in the test set were not included in the training set, allowing us to test the model’s ability to detect unseen variants.
Additional validation was performed using a manually curated dataset from VirusShare and VirusTotal. Samples were cross-verified for uniqueness and active payload behavior. This external set included ransomware samples with stealth techniques such as delayed execution and encryption masking. Logs from these executions were parsed and normalized to fit the same state-action representation as used in the Cuckoo-based data.
All datasets were normalized using min–max scaling to ensure uniform feature distribution. Entropy values were scaled to [0, 1] using their empirical min/max from the benign baseline. The final aggregated dataset had over 50,000 state–action–reward tuples, with a malware-to-benign ratio of approximately 1:3. This ensured that the agent learned to identify rare malicious patterns without being biased toward frequent benign transitions. The dataset comprises behavioral traces collected from 128 unique ransomware binaries spanning 4 ransomware families (WannaCry, Ryuk, Cerber, and Locky) and 384 benign applications, each executed in isolated sandbox environments. Multiple execution traces were collected per sample to capture behavioral variability, resulting in over 50,000 state–action tuples used for training and evaluation. This design ensures that the reinforcement learning agent learns generalized ransomware behaviors rather than overfitting to specific execution trajectories or individual binaries. The experiment setting was tuned to reflect real-time conditions by introducing noise, delay, and API call jitter to emulate practical evasion scenarios.
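The normalization step can be sketched as fitting per-feature bounds on the benign baseline and clipping at inference time, so that attack-time outliers saturate at 1.0 rather than distorting the scale. Feature names and values below are illustrative:

```python
# Min-max scaling fitted on the benign baseline, as described above.
# Values outside the benign range are clipped to [0, 1].

def fit_minmax(baseline_rows):
    """Per-feature (min, max) bounds from the benign baseline rows."""
    keys = baseline_rows[0].keys()
    return {
        k: (min(r[k] for r in baseline_rows), max(r[k] for r in baseline_rows))
        for k in keys
    }

def scale(row, bounds):
    """Scale one observation into [0, 1] per feature, with clipping."""
    out = {}
    for k, (lo, hi) in bounds.items():
        span = hi - lo
        x = 0.0 if span == 0 else (row[k] - lo) / span
        out[k] = min(1.0, max(0.0, x))
    return out

benign = [{"entropy": 1.0, "writes": 2}, {"entropy": 5.0, "writes": 10}]
bounds = fit_minmax(benign)
# An encryption-like observation saturates the entropy feature at 1.0.
scaled = scale({"entropy": 7.9, "writes": 6}, bounds)
```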
5. Results and Analysis
This section presents a comprehensive evaluation of the proposed real-time ransomware detection framework using RL agents. The experimental results include detection performance metrics, policy behavior visualizations, robustness against unseen families, and latency analysis. All experiments were conducted using an Intel i9-13900K workstation with 64 GB RAM and an NVIDIA RTX 4090 GPU. The implementation was developed in Python 3.12 using PyTorch (v2.1.0) and a custom reinforcement learning environment built on OpenAI Gym (v0.26.2).
The training reward curve in Figure 2 illustrates the learning progression of the RL agent over 3000 episodes, with rewards smoothed using an exponential moving average to reduce variance caused by exploratory actions. The consistent upward trajectory indicates that the agent increasingly learned to make optimal decisions in the environment, successfully distinguishing between ransomware-like and benign behaviors. Initially, the reward remained near zero, reflecting the agent’s trial-and-error exploration. As training progressed, the agent began to exploit more effective policies, leading to a steady improvement in cumulative reward. The curve’s smooth and monotonic rise suggests stable convergence, robust policy refinement, and reliable generalization across diverse behavioral patterns encountered during training.
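The smoothing referred to here is a standard exponential moving average; a minimal version, with an illustrative smoothing factor, is:

```python
# Exponential moving average used to smooth a noisy episode-reward
# series. alpha = 0.1 is an illustrative choice, not the paper's value.

def ema(rewards, alpha=0.1):
    out, avg = [], None
    for r in rewards:
        # Blend each new reward into the running average.
        avg = r if avg is None else alpha * r + (1 - alpha) * avg
        out.append(avg)
    return out

smoothed = ema([0.0, 1.0, 0.0, 1.0])
```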
The Q-value contour map over the principal feature axes in Figure 3 provides a 2D visualization of the RL agent’s decision surface, where high-dimensional system states—composed of features like file entropy, write count, and API activity—are projected using principal component analysis. Contour gradients represent the Q-values associated with the “block” action, with higher values indicating states the agent deems more threatening. Dense high-Q regions align with ransomware behavior such as encryption activity, while flatter low-Q regions correspond to benign processes. The smooth transition across contours reflects the learned decision boundary, confirming that the agent has generalized a structured, interpretable policy rather than overfitting. This visualization offers both diagnostic insight and verification of the model’s real-time discrimination ability.
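A decision surface of this kind can be produced by evaluating the learned Q-value for the "block" action over a grid spanning the two projected axes; the scoring function below is a hypothetical stand-in for the trained network, included only to show the grid construction:

```python
def q_block(entropy, writes):
    # Hypothetical stand-in for the trained Q-network's "block" output;
    # a real evaluation would invert the PCA projection and query the net.
    return entropy * 0.7 + writes * 0.3

def decision_grid(n=5):
    """n x n grid of Q(block) values over the unit square of two features."""
    return [
        [q_block(i / (n - 1), j / (n - 1)) for j in range(n)]
        for i in range(n)
    ]

grid = decision_grid()
```

Plotting such a grid as filled contours yields the kind of surface shown in Figure 3.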
Table 2 compares the proposed RL agent with Random Forest and LSTM-based classifiers. The RL agent outperformed traditional classifiers across all detection metrics.
To test generalization, we evaluated the model on ransomware families not present in training. Table 3 reports detection accuracy for each family. To rigorously evaluate generalization across unseen ransomware variants, a strict Leave-One-Group-Out (LOGO) cross-validation strategy was employed for the experiments reported in Table 3. In this setting, ransomware families were treated as distinct groups, and all samples belonging to the test family were completely excluded from the training phase. This ensures that no behavioral traces from the test families were observed during training, thereby providing a realistic assessment of the model’s ability to generalize to previously unseen ransomware families.
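The LOGO protocol can be sketched as follows, with placeholder sample records; each family in turn is held out entirely from training:

```python
# Leave-One-Group-Out splits with ransomware families as groups.
# Sample records are illustrative placeholders.

def logo_splits(samples):
    """Yield (held_out_family, train_set, test_set) for each family."""
    families = sorted({s["family"] for s in samples})
    for fam in families:
        train = [s for s in samples if s["family"] != fam]
        test = [s for s in samples if s["family"] == fam]
        yield fam, train, test

data = [
    {"family": "WannaCry", "id": 1},
    {"family": "Ryuk", "id": 2},
    {"family": "WannaCry", "id": 3},
]
splits = list(logo_splits(data))
```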
An ablation study (Table 4) assessed feature contributions. Removing entropy-based features yielded the largest performance drop, confirming their criticality in detecting cryptographic behavior.
Latency benchmarks in Table 5 confirm that the model can operate under real-time constraints, requiring less than 5 ms per decision.
Finally, Table 6 compares our method to six recent studies. Metrics including precision, recall, F1-score, MSE, and RMSE indicate that our agent achieves superior performance across the board.
These results confirm that our RL-based approach offers improved accuracy, generalization, and latency compared to existing models, making it a strong candidate for real-time deployment in ransomware prevention systems.
Figure 4 presents the precision comparison, illustrating how accurately each model identified ransomware samples without misclassifying benign processes. The proposed RL agent achieved the highest precision at 0.96, outperforming all baseline models, including von der Assen et al. at 0.91 and Sakthidevi et al. at 0.89. This indicates that the proposed agent produces fewer false positives, which is crucial in reducing unnecessary disruptions in benign software execution. The high contrast in the bar heights also reflects the agent’s ability to isolate decisive features associated with ransomware without being overly sensitive to benign anomalies.
The recall comparison line plot in Figure 5 highlights how well each model identifies actual ransomware threats. The proposed RL-based detection system achieved the top recall value of 0.91, indicating high sensitivity and low false negative rates. Von der Assen et al. [35] followed closely at 0.90, reflecting strong performance but slightly lower adaptability. Sakthidevi et al. [40] and Berrueta et al. [39] maintained respectable values around 0.88 and 0.85, respectively, though their approaches lacked adaptive policy tuning. Anderson et al. [36] and Svet et al. [38] showed lower scores of 0.82 and 0.84, which may reflect difficulties in generalizing across unseen ransomware variants. Choi, Choi, and Buu [43] also showed moderate recall at 0.85 but were not focused on ransomware-specific behavior. Overall, the proposed model demonstrated the most consistent ability to identify ransomware accurately, reducing missed detections and enhancing threat mitigation capabilities.
The line plot in Figure 6 compares the F1-scores of various ransomware detection methods, highlighting the performance balance between precision and recall. The proposed RL-based model achieved the highest F1-score of 0.93, indicating superior consistency in detecting ransomware with minimal false positives and false negatives. In contrast, Anderson et al. [36] and Svet et al. [38] trailed with F1-scores of 0.83 and 0.85, reflecting limitations in capturing both detection completeness and precision. Von der Assen et al. [35] recorded 0.90, showing competitive but slightly lower generalization ability. Choi, Choi, and Buu [43] and Berrueta et al. [39] maintained average scores near 0.86, emphasizing adequate performance but lacking robustness under real-time conditions. Sakthidevi et al. [40] performed reasonably with 0.88 but showed no RL adaptation. The downward trend across baselines reinforces the proposed agent’s advantage in maintaining detection stability and generalizability across complex ransomware behaviors.
Figure 7 illustrates the mean squared error (MSE) values for all compared models using a horizontal bar plot. The proposed RL model achieved the lowest MSE at 0.045, confirming its high fidelity in decision accuracy during both training and inference. The closest competitor, Sakthidevi et al. [40], recorded 0.059, while Anderson et al. [36] showed the highest deviation with an MSE of 0.081, implying frequent prediction inconsistencies. Models from von der Assen et al. [35] and Svet et al. [38] also exceeded 0.065, reflecting noisier decision boundaries under real-time threats. The horizontal layout aligns method names with their corresponding bars, helping visualize how error magnitude accumulates across each approach. Overall, the proposed agent exhibits the lowest output variance, reinforcing the benefit of experience-based policy convergence in minimizing prediction uncertainty.
The RMSE bar chart with hatching in Figure 8 presents the deviation of each model’s predictions from the true labels, providing a tangible view of predictive consistency. The proposed RL agent showed the lowest RMSE at 0.212, confirming its ability to produce stable and accurate ransomware classification outputs. Comparatively, von der Assen et al. [35] and Sakthidevi et al. [40] followed with RMSE values of 0.255 and 0.243, suggesting slightly less precise decision boundaries. Models from Anderson et al. [36] and Svet et al. [38] exceeded 0.270, highlighting greater variability and potential overfitting or under-generalization. Choi, Choi, and Buu [43] and Berrueta et al. [39] presented intermediate RMSE levels, though their focus was not tailored to fine-grained ransomware recognition. The lighter hue and lined bars enhance visual contrast, emphasizing the proposed model’s reduced error propagation in real-time operational contexts.