1. Introduction
Ransomware attacks have grown into one of the most disruptive threats in cybersecurity, with widespread implications for personal users, enterprises, and critical infrastructure [1]. These attacks typically encrypt user data and demand payment in exchange for decryption keys [2]. As attackers continue to modify payloads and infection methods, conventional rule-based antivirus and static signature detection approaches have shown limitations in identifying novel or mutated threats [3]. This shifting threat landscape has increased demand for intelligent detection frameworks that can adapt to emerging attack vectors [4]. The growing availability of behavioral logs and sandbox-based forensic tools presents an opportunity to explore learning-based approaches that monitor system activities and detect threats before encryption occurs [5].
Behavior-based detection methods offer promising capabilities by observing process patterns, API call sequences, file system modifications, and CPU usage [6]. These indicators reflect the internal states and behaviors of ransomware during execution. With proper modeling, they allow security systems to make preventive decisions in real time [7]. While machine learning has been applied to classify ransomware using such features, many systems rely on batch-trained models, which are not well suited to sequential decision making or dynamic threat environments [8]. Reinforcement learning (RL), in contrast, provides a feedback-driven mechanism for continuously learning how to respond to new behaviors through interaction with the environment [9]. By framing the problem as a Markov Decision Process (MDP), systems can learn optimal actions, such as blocking a process or allowing it to run [10].
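To make the MDP framing concrete, the observe-act-reward loop can be sketched as a toy environment step. The feature names, thresholds, and reward values below are illustrative assumptions for exposition only, not the specification used later in the paper:

```python
# Illustrative MDP transition for ransomware detection.
# State features, thresholds, and reward values are hypothetical,
# chosen only to show the structure (observe -> act -> reward).

ALLOW, BLOCK = 0, 1

def step(state, action):
    """One MDP transition: returns (reward, done).

    state: dict of behavioral indicators for the current window.
    action: ALLOW or BLOCK.
    """
    # A crude stand-in for ground truth: high write entropy plus a
    # burst of file writes suggests encryption-like behavior.
    malicious = state["file_entropy"] > 0.9 and state["write_burst"] > 50
    if action == BLOCK:
        # Reward a correct block; penalize blocking a benign process.
        reward = 1.0 if malicious else -0.5
        done = True
    else:
        # Allowing ransomware to continue is penalized.
        reward = -1.0 if malicious else 0.1
        done = False
    return reward, done

r, done = step({"file_entropy": 0.97, "write_burst": 120}, BLOCK)
```

The essential point is the sequential structure: each action earns a reward whose sign depends on whether the underlying process was actually malicious, which is what allows a policy to be learned from interaction rather than from labeled batches.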
Cyberattacks, particularly ransomware, have evolved into a global threat with severe economic, operational, and societal consequences. Recent industry and government reports indicate that ransomware incidents have resulted in billions of dollars in financial losses annually, disrupting healthcare systems, financial institutions, smart city infrastructures, and critical public services. The increasing frequency and sophistication of ransomware attacks highlight the limitations of traditional signature-based defenses and underscore the urgent need for intelligent, adaptive detection mechanisms. Prior research surveys further emphasize this challenge. For example, the survey in [11] systematically analyzes how modern malware leverages dynamic execution behaviors to evade static detection mechanisms, concluding that behavior-aware and learning-based models are essential for combating evolving threats. Similar observations have been reported across intrusion detection and ransomware research, where static feature dependence limits robustness against zero-day and polymorphic attacks. These findings collectively justify the growing shift towards behavioral analysis and reinforcement learning-based frameworks capable of adapting to novel attack patterns in real time. This concern is particularly critical in smart cities and cyber-physical environments, where ransomware attacks can impact public safety, transportation systems, and essential services at scale.
Existing works often treat ransomware detection as a supervised classification problem, focusing on accuracy but not on policy learning or long-term adaptation [12]. This gap motivates the need for approaches that can function in uncertain, sequential environments where the optimal action depends on current and historical observations [13]. Modeling ransomware behavior in an RL setting creates the opportunity for systems that not only detect, but also decide when and how to intervene. Such a system could prevent encryption before it completes, mitigating damage without prior knowledge of the malware family involved.
Despite ongoing improvements in supervised learning and behavior modeling, most methods remain limited in flexibility and fail to address zero-day ransomware [14]. Traditional models lack exploration capabilities and perform poorly when facing new families or modified variants. Moreover, many datasets used in prior works are restricted to known samples and do not generalize across execution traces collected in varied environments [15]. This constraint limits the models’ relevance to real-world deployment scenarios. RL, which operates on trial-and-error interactions, offers a way to overcome these weaknesses by learning policies that are directly shaped by observed reward and punishment in the environment [16]. Several researchers have proposed hybrid frameworks combining static features and machine learning classifiers [17,18]. For instance, Anderson et al. focused on black-box adversarial environments using RL but without ransomware-specific models. Choi, Choi, and Buu proposed a proximal policy optimization (PPO) agent for Ethereum fraud detection but did not address behavioral ransomware traits. Sakthidevi et al. implemented machine learning on file behavior logs but did not support learning-based control or real-time adaptation. These approaches highlight ongoing exploration in the field, but few integrate environment interaction, online decision-making, and generalization across multiple ransomware families.
In this research paper, we propose a novel RL-based detection agent that models system behavior as a state space, defines reward based on process context, and learns to block ransomware before encryption completes. The agent is trained using logs generated from Cuckoo Sandbox and includes dynamic features such as file creation rates, CPU spikes, and registry access. We evaluated this agent across known ransomware variants including WannaCry, Ryuk, and Cerber to test generalization performance. Unlike previous works, our approach adapts online, updates its policy iteratively, and is guided by performance-based rewards. The goal is not only to detect ransomware, but to learn behavior patterns that allow prevention in real time.
This study addresses the following research questions:
RQ1: How can RL be used to detect ransomware behavior based on file, CPU, and registry activity in real time?
RQ2: How does the proposed RL agent compare with existing models in terms of accuracy, generalization, and error metrics?
RQ3: Can the RL agent effectively generalize to previously unseen ransomware families without retraining?
This paper makes the following core contributions. First, we formulate ransomware detection as a sequential decision-making problem and propose a reinforcement learning-based framework that enables real-time and proactive mitigation rather than offline classification. Second, we design a behavior-driven state representation using dynamic execution features extracted from sandbox analysis, allowing the model to generalize across diverse ransomware families without reliance on static signatures. Third, we conduct extensive experimental evaluation, including family-wise generalization analysis, demonstrating that the proposed approach achieves improved detection accuracy and early blocking capability compared to existing machine learning and deep learning-based methods.
The rest of the paper is organized as follows:
Section 2 presents the literature review on ransomware detection techniques, focusing on RL, behavioral analysis, and real-time monitoring strategies.
Section 3 details the construction of the RL environment, including the state representation, reward function, and deep Q-network (DQN) agent design.
Section 4 outlines the experimental setup, datasets, ransomware families, feature engineering, and training configurations.
Section 5 reports the detection performance metrics, comparative evaluation, training curves, Q-value mappings, and ablation studies.
Finally, Section 6 summarizes the findings and highlights future directions for adaptive and proactive ransomware mitigation systems.
2. Literature Review
Ransomware has rapidly emerged as a dominant form of cyberattack, causing extensive disruption to both public and private digital infrastructures [19]. It encrypts files, halts operations, and demands ransom payments—often in cryptocurrency—leaving organizations with few viable recovery options [20]. In recent years, ransomware groups have developed advanced techniques such as double extortion, fileless execution, and polymorphic payloads, making detection harder for static or signature-based antivirus systems [21]. These approaches depend on known patterns, making them ineffective against zero-day threats or mutated ransomware strains [22]. This challenge has accelerated interest in alternative models that can track behavioral characteristics rather than rely on predefined malware signatures [23].
Behavioral features—such as sudden file creation bursts, abnormal CPU spikes, unauthorized registry modifications, and anomalous network traffic—are more resilient indicators of ransomware activity [7]. These traits tend to remain consistent across variants, even as binary signatures change [24]. With access to execution traces via dynamic analysis tools such as Cuckoo Sandbox, researchers can extract system-level metrics that help profile malicious behavior [25]. Traditional machine learning techniques use these features to train binary classifiers; however, they typically assume i.i.d. data and static learning environments, which do not reflect the adversarial and evolving nature of ransomware attacks in real-world systems [26].
RL offers an alternative view of the detection problem. It frames the environment as a sequence of states influenced by actions taken by an agent, which receives rewards or penalties depending on the consequences of each action [27]. This structure suits ransomware detection well: the agent observes system behavior over time, learns what constitutes an anomaly, and adapts policies to intervene before encryption can complete. Unlike batch classification models, RL agents learn from delayed feedback and optimize cumulative reward. This enables better anticipation of multi-step attacks and supports generalization across ransomware families with different strategies [28].
Despite growing interest in intelligent cybersecurity systems, few studies have modeled ransomware defense explicitly as an RL environment [29]. Most research focuses on predictive modeling without active decision making. For example, Anderson et al. introduced an RL framework for black-box malware manipulation but did not target ransomware. Choi, Choi, and Buu explored PPO in fraud detection using Ethereum graphs but did not address real-time defense. Sakthidevi et al. applied machine learning on local logs to monitor threats but offered no learning-based policy control [30]. These efforts illustrate partial progress toward adaptive defense but stop short of fully interactive, environment-responsive systems tailored for ransomware behaviors.
The absence of control feedback, real-time learning, and sequential decision making in prior models reveals a gap in ransomware defense research [31]. Supervised models may offer high accuracy in static test environments, yet struggle with out-of-distribution samples or zero-day strains. Furthermore, most datasets are composed of known ransomware collected in limited sandbox scenarios and are not optimized for learning agent behavior. This makes it difficult to train models that adapt policies based on long-term reward. As threats become more unpredictable, there is a need for proactive systems capable of learning optimal defense strategies in dynamic contexts [32]. Our proposed framework introduces an RL agent that learns directly from environment observations by mapping behavioral features—such as file system operations, registry access, and CPU usage—into a structured state space. The agent interacts with a virtual sandbox, receives positive reward for correct blocks and negative reward for delayed or missed actions, and continuously updates its Q-values or policy networks [33]. Training involves both deep Q-network (DQN) and proximal policy optimization (PPO) agents, evaluated on logs sourced from custom executions of ransomware families including WannaCry, Ryuk, Cerber, and Locky. This method outperforms static baselines by incorporating adaptive learning, action-based intervention, and temporal feedback to detect and halt ransomware in real time [34].
Von der Assen et al. [35] proposed a deep Q-learning approach to model offensive ransomware that adapts to bypass security. Their results showed over 90% stealth success, which raised concern about defensive gaps. Anderson et al. [36] developed an RL-based black-box evasion agent, causing a 33% drop in model detection. Wang et al. [37] introduced a low-latency detection system that achieved a true positive rate of 99.65% with minimal system overhead. Svet et al. [38] utilized unsupervised learning and deep models to capture behavioral signals. Their method reduced false positives and performed well in high-velocity settings.
Berrueta et al. [39] applied classical ML methods on encrypted traffic, reaching 99.8% detection accuracy and a false positive rate of 0.004%. Sakthidevi et al. [40] developed a real-time behavioral framework with visual analysis, though it lacked evaluation across ransomware variants. Gazzan and Sheldon [41] improved deep belief networks using uncertainty-aware early stopping, boosting accuracy from 94% to 98%. Rani and Dhavale [42] compared multiple ML classifiers, with their best model achieving 98.21% accuracy using static behavioral features from labeled samples.
Choi, Choi, and Buu [43] implemented PPO-based tuning in graph neural networks and achieved an F1 score of 0.9478. Their approach showed potential for tasks involving time-sensitive drift but was not designed for ransomware. Amaizu et al. [44] designed a federated model with privacy protection and tested it on medical images, reaching 79% accuracy. Their work emphasized distributed privacy but did not handle malicious activity. Al-Fawa’reh et al. [45] proposed a hybrid system combining GAN-generated synthetic data and ensemble classifiers. Their method achieved 99.1% accuracy on NSL-KDD and custom ransomware traffic.
Hurley et al. [46] used both static and dynamic features to train LSTM networks, reporting an F1 score of 0.965 and latency below 2.3 s. Their focus on feature fusion helped reduce detection delay. These studies provided various approaches to ransomware identification, but many lacked policy-based defense or were tested only on specific families. Most works applied supervised classifiers without adaptability to unseen variants. Only von der Assen et al. [35] and Anderson et al. [36] used RL to simulate adversarial agents, and neither addressed proactive blocking. Additionally, studies like Sakthidevi et al. and Rani and Dhavale relied on behavioral logs without generalization testing.
Real-time decision-making was partially covered in Wang et al. and Svet et al., yet these lacked formal learning policies. Studies by Berrueta et al. and Gazzan and Sheldon achieved high accuracy but were restricted to narrow setups. PPO and federated strategies from Choi, Choi, and Buu and Amaizu et al. were tested in non-ransomware domains, showing architecture potential but limited cross-domain adaptability. A majority of the reviewed methods focused on classification without state modeling or action policies, which limited their use in automated defense settings. This gap presents a case for RL agents that operate based on behavioral patterns and system transitions.
No reviewed study tested blocking policies across multiple ransomware families in a dynamic learning environment. The evaluation of GAN-based generalization in Al-Fawa’reh et al. and LSTM-based early detection in Hurley et al. showed promising results, but did not incorporate adaptive decision layers. Models that targeted encryption behavior (e.g., von der Assen et al.) were offensive and not defense-focused. Only Wang et al. measured response time explicitly. Hence, a real-time system built using DQN or PPO with behavioral logs and sandbox outputs could improve generalization and reduce latency while maintaining detection accuracy across variants like WannaCry and Ryuk. Such a system could fill the present methodological gap.
Table 1 shows the summary of the related work on ransomware detection and related techniques.
3. Proposed Methodology
This section models ransomware detection as a sequential decision-making problem using RL. We define a custom MDP to simulate system behavior under ransomware threats and train a DQN agent to proactively block malicious activity.
The proposed methodology maps system behavior onto an MDP. The environment is defined using behavioral traces extracted from Cuckoo Sandbox, including features such as file entropy, CPU usage, registry edits, and API call entropy. The agent interacts with this environment through actions such as allowing or blocking processes. Rewards are shaped based on entropy thresholds, encryption-like activity, and system responsiveness. A DQN is used to learn the optimal policy by minimizing the temporal difference loss across replayed experiences. The learning process is stabilized using target networks and experience replay buffers. The RL agent iteratively updates its policy based on observed transitions, enabling real-time adaptation to novel ransomware families. Hyperparameters such as the learning rate, discount factor, and exploration schedule were optimized through extensive tuning, and generalization was validated across WannaCry, Ryuk, Cerber, and Locky variants.
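The interaction of replay buffer, online estimates, and a periodically synced target can be illustrated with a minimal tabular stand-in for the two networks of a DQN. The hyperparameters, state labels, and transitions below are illustrative only, not the tuned values used in the experiments:

```python
import random
from collections import deque, defaultdict

GAMMA, ALPHA = 0.95, 0.1  # illustrative discount factor and step size

# Q-tables stand in for the online and (frozen) target networks.
q_online = defaultdict(lambda: [0.0, 0.0])   # state -> [Q(allow), Q(block)]
q_target = defaultdict(lambda: [0.0, 0.0])
replay = deque(maxlen=1000)                  # experience replay buffer

def td_update(batch):
    """Reduce the temporal difference error on a replayed mini-batch."""
    for s, a, r, s_next, done in batch:
        # The bootstrap target uses the frozen target table for stability.
        target = r if done else r + GAMMA * max(q_target[s_next])
        q_online[s][a] += ALPHA * (target - q_online[s][a])

def sync_target():
    """Copy online estimates into the target table (target-network sync)."""
    for s, qs in q_online.items():
        q_target[s] = list(qs)

random.seed(0)
# Two illustrative transitions: a correct terminal block, a benign allow.
replay.append(("high_entropy", 1, 1.0, "terminal", True))
replay.append(("benign", 0, 0.1, "benign", False))
for _ in range(50):
    td_update(random.sample(list(replay), min(len(replay), 2)))
    sync_target()
```

In the actual framework a neural network replaces the table and the target copy happens every fixed number of steps rather than every batch, but the TD-target structure is the same.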
The architecture diagram (Figure 1) presents a top–down view of the system design. It begins with real-time system monitoring at the top, collecting file system, process, and registry activity. This data flows into a feature extraction module that transforms raw logs into normalized vectors. These vectors are then fed into the RL agent, which evaluates Q-values for possible actions using a deep neural network. The selected action—such as terminating a process or allowing it to continue—is executed by the policy handler. The feedback loop completes as the environment responds to the action, providing a new state and reward to the agent. This design allows the system to operate in real time while continuously learning from system behavior. As supported in prior works such as Anderson et al. [36] and von der Assen et al. [35], this architecture emphasizes decision making under uncertainty, which is an essential trait in dynamic threat environments.
4. Experimental Setup
The experimental setup in this research was designed to rigorously evaluate the performance of the proposed RL-based ransomware detection framework in a controlled yet realistic environment. The datasets used in this study are drawn from both simulated and real-world ransomware activity, ensuring a broad representation of system behavior under attack and benign conditions. All logs and metadata were collected and preprocessed to match the RL environment assumptions defined in Section 3.
The primary dataset was constructed using behavioral logs captured from the Cuckoo Sandbox, an open-source automated malware analysis system. We deployed variants of well-known ransomware families including WannaCry, Locky, Ryuk, and Cerber. These families were selected for their diversity in encryption strategy, propagation mechanisms, and process injection techniques. Each malware sample was executed in isolation within the sandbox, and the system-level logs were captured for a runtime duration of 180 s per sample. These logs contained detailed traces of file system access, memory allocation, CPU usage, registry changes, and network activity. The environment was instrumented to monitor API calls and I/O entropy, which are known indicators of cryptographic operations. The extracted features were mapped to the same behavioral categories defined in Section 3.2 to ensure consistent state representation across training and evaluation.
Benign samples were obtained from standard software repositories and productivity tools. These included applications such as Notepad, Adobe Reader, web browsers, and installation scripts for common software. As with the ransomware samples, benign processes were executed in the Cuckoo Sandbox under identical configurations. The resulting system activity logs were labeled accordingly to differentiate malicious and non-malicious behavior.
For evaluation, we constructed a time-series dataset with a frame window of five seconds, where each window represented an individual state observation. The extracted features for each state included file entropy rate, the number of file writes, registry key edit count, memory write operations, CPU load percentage, and count of suspicious API calls. Each state was associated with a label indicating whether the observed behavior was benign or malicious.
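The windowing step described above can be sketched under an assumed event-log schema; the log format and event kinds below are hypothetical illustrations, while the aggregated features mirror the list in the text:

```python
import math
from collections import Counter

WINDOW = 5.0  # window length in seconds, as in the evaluation dataset

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0..8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def window_features(events, t0):
    """Aggregate events with t0 <= t < t0 + WINDOW into one state observation."""
    w = [e for e in events if t0 <= e["t"] < t0 + WINDOW]
    writes = [e for e in w if e["kind"] == "file_write"]
    return {
        "file_writes": len(writes),
        "registry_edits": sum(1 for e in w if e["kind"] == "reg_edit"),
        "suspicious_api": sum(
            1 for e in w if e["kind"] == "api" and e.get("suspicious")
        ),
        # Mean write entropy approximates the "file entropy rate" feature.
        "mean_write_entropy": (
            sum(byte_entropy(e["payload"]) for e in writes) / max(1, len(writes))
        ),
    }

log = [
    {"t": 0.5, "kind": "file_write", "payload": bytes(range(256))},  # high entropy
    {"t": 1.2, "kind": "reg_edit"},
    {"t": 6.0, "kind": "file_write", "payload": b"aaaa"},  # outside first window
]
state = window_features(log, 0.0)
```

Each such dictionary corresponds to one labeled state observation in the time-series dataset.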
To ensure robustness and prevent data leakage, we split the dataset temporally into training, validation, and test sets. The training set included early-stage behavior from the ransomware samples, while the test set included both early and late-stage encryption activity to evaluate generalization. We further ensured that ransomware families present in the test set were not included in the training set, allowing us to test the model’s ability to detect unseen variants.
Additional validation was performed using a manually curated dataset from VirusShare and VirusTotal. Samples were cross-verified for uniqueness and active payload behavior. This external set included ransomware samples with stealth techniques such as delayed execution and encryption masking. Logs from these executions were parsed and normalized to fit the same state-action representation as used in the Cuckoo-based data.
All datasets were normalized using min–max scaling to ensure uniform feature distribution. Entropy values were scaled to [0, 1] using their empirical min/max from the benign baseline. The final aggregated dataset had over 50,000 state–action–reward tuples, with a malware-to-benign ratio of approximately 1:3. This ensured that the agent learned to identify rare malicious patterns without being biased toward frequent benign transitions. The dataset comprises behavioral traces collected from 128 unique ransomware binaries spanning 4 ransomware families (WannaCry, Ryuk, Cerber, and Locky) and 384 benign applications, each executed in isolated sandbox environments. Multiple execution traces were collected per sample to capture behavioral variability, resulting in over 50,000 state–action tuples used for training and evaluation. This design ensures that the reinforcement learning agent learns generalized ransomware behaviors rather than overfitting to specific execution trajectories or individual binaries. The experiment setting was tuned to reflect real-time conditions by introducing noise, delay, and API call jitter to emulate practical evasion scenarios.
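The normalization step can be sketched as fitting per-feature bounds on the benign baseline and clipping at inference time, so that attack-time outliers saturate at 1.0 rather than distorting the scale. Feature names and values below are illustrative:

```python
# Min-max scaling fitted on the benign baseline, as described above.
# Values outside the benign range are clipped to [0, 1].

def fit_minmax(baseline_rows):
    """Per-feature (min, max) bounds from the benign baseline rows."""
    keys = baseline_rows[0].keys()
    return {
        k: (min(r[k] for r in baseline_rows), max(r[k] for r in baseline_rows))
        for k in keys
    }

def scale(row, bounds):
    """Scale one observation into [0, 1] per feature, with clipping."""
    out = {}
    for k, (lo, hi) in bounds.items():
        span = hi - lo
        x = 0.0 if span == 0 else (row[k] - lo) / span
        out[k] = min(1.0, max(0.0, x))
    return out

benign = [{"entropy": 1.0, "writes": 2}, {"entropy": 5.0, "writes": 10}]
bounds = fit_minmax(benign)
# An encryption-like observation saturates the entropy feature at 1.0.
scaled = scale({"entropy": 7.9, "writes": 6}, bounds)
```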
5. Results and Analysis
This section presents a comprehensive evaluation of the proposed real-time ransomware detection framework using RL agents. The experimental results include detection performance metrics, policy behavior visualizations, robustness against unseen families, and latency analysis. All experiments were conducted using an Intel i9-13900K workstation with 64 GB RAM and an NVIDIA RTX 4090 GPU. The implementation was developed in Python 3.12 using PyTorch (v2.1.0) and a custom reinforcement learning environment built on OpenAI Gym (v0.26.2).
The training reward curve in Figure 2 illustrates the learning progression of the RL agent over 3000 episodes, with rewards smoothed using an exponential moving average to reduce variance caused by exploratory actions. The consistent upward trajectory indicates that the agent increasingly learned to make optimal decisions in the environment, successfully distinguishing between ransomware-like and benign behaviors. Initially, the reward remained near zero, reflecting the agent’s trial-and-error exploration. As training progressed, the agent began to exploit more effective policies, leading to a steady improvement in cumulative reward. The curve’s smooth and monotonic rise suggests stable convergence, robust policy refinement, and reliable generalization across diverse behavioral patterns encountered during training.
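The smoothing referred to here is a standard exponential moving average; a minimal version, with an illustrative smoothing factor, is:

```python
# Exponential moving average used to smooth a noisy episode-reward
# series. alpha = 0.1 is an illustrative choice, not the paper's value.

def ema(rewards, alpha=0.1):
    out, avg = [], None
    for r in rewards:
        # Blend each new reward into the running average.
        avg = r if avg is None else alpha * r + (1 - alpha) * avg
        out.append(avg)
    return out

smoothed = ema([0.0, 1.0, 0.0, 1.0])
```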
The Q-value contour map over the principal feature axes in Figure 3 provides a 2D visualization of the RL agent’s decision surface, where high-dimensional system states—composed of features like file entropy, write count, and API activity—are projected using principal component analysis. Contour gradients represent the Q-values associated with the “block” action, with higher values indicating states the agent deems more threatening. Dense high-Q regions align with ransomware behavior such as encryption activity, while flatter low-Q regions correspond to benign processes. The smooth transition across contours reflects the learned decision boundary, confirming that the agent has generalized a structured, interpretable policy rather than overfitting. This visualization offers both diagnostic insight and verification of the model’s real-time discrimination ability.
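A decision surface of this kind can be produced by evaluating the learned Q-value for the "block" action over a grid spanning the two projected axes; the scoring function below is a hypothetical stand-in for the trained network, included only to show the grid construction:

```python
def q_block(entropy, writes):
    # Hypothetical stand-in for the trained Q-network's "block" output;
    # a real evaluation would invert the PCA projection and query the net.
    return entropy * 0.7 + writes * 0.3

def decision_grid(n=5):
    """n x n grid of Q(block) values over the unit square of two features."""
    return [
        [q_block(i / (n - 1), j / (n - 1)) for j in range(n)]
        for i in range(n)
    ]

grid = decision_grid()
```

Plotting such a grid as filled contours yields the kind of surface shown in Figure 3.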
Table 2 compares the proposed RL agent with Random Forest and LSTM-based classifiers. The RL agent outperformed traditional classifiers across all detection metrics.
To test generalization, we evaluated the model on ransomware families not present in training. Table 3 reports detection accuracy for each family. To rigorously evaluate generalization across unseen ransomware variants, a strict Leave-One-Group-Out (LOGO) cross-validation strategy was employed for the experiments reported in Table 3. In this setting, ransomware families were treated as distinct groups, and all samples belonging to the test family were completely excluded from the training phase. This ensures that no behavioral traces from the test families were observed during training, thereby providing a realistic assessment of the model’s ability to generalize to previously unseen ransomware families.
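The LOGO protocol can be sketched as follows, with placeholder sample records; each family in turn is held out entirely from training:

```python
# Leave-One-Group-Out splits with ransomware families as groups.
# Sample records are illustrative placeholders.

def logo_splits(samples):
    """Yield (held_out_family, train_set, test_set) for each family."""
    families = sorted({s["family"] for s in samples})
    for fam in families:
        train = [s for s in samples if s["family"] != fam]
        test = [s for s in samples if s["family"] == fam]
        yield fam, train, test

data = [
    {"family": "WannaCry", "id": 1},
    {"family": "Ryuk", "id": 2},
    {"family": "WannaCry", "id": 3},
]
splits = list(logo_splits(data))
```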
An ablation study (Table 4) assessed feature contributions. Removing entropy-based features yielded the largest performance drop, confirming their criticality in detecting cryptographic behavior.
Latency benchmarks in Table 5 confirm that the model can operate under real-time constraints, requiring less than 5 ms per decision.
Finally, Table 6 compares our method to six recent studies. Metrics including precision, recall, F1-score, MSE, and RMSE indicate that our agent achieves superior performance across the board.
These results confirm that our RL-based approach offers improved accuracy, generalization, and latency compared to existing models, making it a strong candidate for real-time deployment in ransomware prevention systems.
Figure 4 presents the precision comparison, illustrating how accurately each model identified ransomware samples without misclassifying benign processes. The proposed RL agent achieved the highest precision at 0.96, outperforming all baseline models, including von der Assen et al. at 0.91 and Sakthidevi et al. at 0.89. This indicates that the proposed agent produces fewer false positives, which is crucial in reducing unnecessary disruptions in benign software execution. The high contrast in the bar heights also reflects the agent’s ability to isolate decisive features associated with ransomware without being overly sensitive to benign anomalies.
The recall comparison line plot in Figure 5 highlights how well each model identifies actual ransomware threats. The proposed RL-based detection system achieved the top recall value of 0.91, indicating high sensitivity and low false negative rates. Von der Assen et al. [35] followed closely at 0.90, reflecting strong performance but slightly lower adaptability. Sakthidevi et al. [40] and Berrueta et al. [39] maintained respectable values around 0.88 and 0.85, respectively, though their approaches lacked adaptive policy tuning. Anderson et al. [36] and Svet et al. [38] showed lower scores of 0.82 and 0.84, which may reflect difficulties in generalizing across unseen ransomware variants. Choi, Choi, and Buu [43] also showed moderate recall at 0.85 but were not focused on ransomware-specific behavior. Overall, the proposed model demonstrated the most consistent ability to identify ransomware accurately, reducing missed detections and enhancing threat mitigation capabilities.
The line plot in Figure 6 compares the F1-scores of various ransomware detection methods, highlighting the performance balance between precision and recall. The proposed RL-based model achieved the highest F1-score of 0.93, indicating superior consistency in detecting ransomware with minimal false positives and false negatives. In contrast, Anderson et al. [36] and Svet et al. [38] trailed with F1-scores of 0.83 and 0.85, reflecting limitations in capturing both detection completeness and precision. Von der Assen et al. [35] recorded 0.90, showing competitive but slightly lower generalization ability. Choi, Choi, and Buu [43] and Berrueta et al. [39] maintained average scores near 0.86, emphasizing adequate performance but lacking robustness under real-time conditions. Sakthidevi et al. [40] performed reasonably with 0.88 but showed no RL adaptation. The downward trend across baselines reinforces the proposed agent’s advantage in maintaining detection stability and generalizability across complex ransomware behaviors.
Figure 7 illustrates the mean squared error (MSE) values for all compared models using a horizontal bar plot. The proposed RL model achieved the lowest MSE at 0.045, confirming its high fidelity in decision accuracy during both training and inference. The closest competitor, Sakthidevi et al. [40], recorded 0.059, while Anderson et al. [36] showed the highest deviation with an MSE of 0.081, implying frequent prediction inconsistencies. Models from von der Assen et al. [35] and Svet et al. [38] also exceeded 0.065, reflecting noisier decision boundaries under real-time threats. The horizontal layout aligns method names with their corresponding bars, helping visualize how error magnitude accumulates across each approach. Overall, the proposed agent exhibits the lowest output variance, reinforcing the benefit of experience-based policy convergence in minimizing prediction uncertainty.
The RMSE bar chart with hatching in Figure 8 presents the deviation of each model’s predictions from the true labels, providing a tangible view of predictive consistency. The proposed RL agent showed the lowest RMSE at 0.212, confirming its ability to produce stable and accurate ransomware classification outputs. Comparatively, von der Assen et al. [35] and Sakthidevi et al. [40] followed with RMSE values of 0.255 and 0.243, suggesting slightly less precise decision boundaries. Models from Anderson et al. [36] and Svet et al. [38] exceeded 0.270, highlighting greater variability and potential overfitting or under-generalization. Choi, Choi, and Buu [43] and Berrueta et al. [39] presented intermediate RMSE levels, though their focus was not tailored to fine-grained ransomware recognition. The lighter hue and lined bars enhance visual contrast, emphasizing the proposed model’s reduced error propagation in real-time operational contexts.