Article

Dynamic Defense Strategy Selection Through Reinforcement Learning in Heterogeneous Redundancy Systems for Critical Data Protection

1 Purple Mountain Laboratories, Nanjing 211189, China
2 School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
3 China West Construction Group Co., Ltd., Chengdu 610000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9111; https://doi.org/10.3390/app15169111
Submission received: 2 July 2025 / Revised: 5 August 2025 / Accepted: 7 August 2025 / Published: 19 August 2025

Abstract

In recent years, the evolution of cyber-attacks has exposed critical vulnerabilities in conventional defense mechanisms, particularly across national infrastructure systems such as power, transportation, and finance. Attackers are increasingly deploying persistent and sophisticated techniques to exfiltrate or manipulate sensitive data, surpassing static defense methods that depend on known vulnerabilities. This growing threat landscape underscores the urgent need for more advanced and adaptive defensive strategies to counter continuously evolving attack vectors. To address this challenge, this paper proposes a novel reinforcement learning-based optimization framework integrated with a Dynamic Heterogeneous Redundancy (DHR) architecture. Our approach uniquely utilizes reinforcement learning for the dynamic scheduling of encryption-layer configurations within the DHR framework, enabling adaptive adjustment of defense policies based on system status and threat progression. We evaluate the proposed system in a simulated adversarial environment, where reinforcement learning continuously adjusts encryption strategies and defense behaviors in response to evolving attack patterns and operational dynamics. Experimental results demonstrate that our method achieves a higher defense success rate while maintaining lower defense costs, thereby enhancing system resilience against cyber threats and improving the efficiency of defensive resource allocation.

1. Introduction

With the rapid advancement of computer network technologies, cyber-attacks have become increasingly diverse and sophisticated, posing severe threats to the critical data that underpin national infrastructure systems, including power grids, transportation networks, and financial institutions. Critical data such as system configurations, user credentials, transaction records, and real-time operational commands is essential for the operation, control, and security of key services, and its integrity and confidentiality are vital to national security and economic stability [1].
However, the protection of critical data remains a formidable challenge [2,3,4]. Traditional cybersecurity approaches primarily rely on predefined attack signatures, static configurations, and known vulnerability models. These reactive strategies are often inadequate against modern threats that are stealthy, adaptive, and frequently unknown [5]. As a result, systems remain vulnerable to zero-day exploits and advanced persistent threats (APTs), limiting the effectiveness of conventional defenses in dynamic and complex threat environments.
To overcome these limitations, mimic defense technology has emerged as a promising paradigm. It constructs a Dynamic Heterogeneous Redundancy (DHR) architecture that enhances system resilience through diversity and dynamic reconfiguration [6]. By introducing functional redundancy via heterogeneous implementations and continuously altering execution environments, the DHR architecture significantly increases the difficulty for attackers to predict or exploit system behavior. However, the application of mimic defense is still in its early stages, and developing effective strategies for dynamic defense remains a significant research challenge.
Motivated by the need for intelligent and adaptive defense mechanisms, this work introduces a novel integration of reinforcement learning (RL) within the endogenous security framework, particularly the Dynamic Heterogeneous Redundancy (DHR) architecture. By learning and adapting through continuous interaction with the environment, RL provides a powerful foundation for designing context-aware defense strategies that dynamically respond to evolving threats. Unlike traditional methods that rely on static rule sets or known attack signatures, our approach employs RL-driven scheduling to continuously adjust encryption configurations and defense actions. This dynamic adaptation disrupts the predictability that attackers often exploit, thereby significantly strengthening the protection of critical data and enhancing overall system resilience.
This paper makes three contributions to the field of cybersecurity.
  • It introduces a reinforcement learning-based framework designed to enhance the dynamic and adaptive defense capabilities of the DHR architecture. By integrating reinforcement learning to control the strategy scheduling of the DHR architecture, the proposed framework outputs defense strategies based on environmental information, effectively responding to emerging threats.
  • It provides a simulation environment that mimics attack–defense scenarios, offering a robust platform to validate the proposed approach.
  • It presents a comparative analysis, demonstrating the superiority of the proposed reinforcement learning-enhanced mimic defense system over traditional static and heuristic approaches.
The remainder of this paper is organized as follows. Section 2 reviews related work on critical data protection strategies and the application of reinforcement learning in network security. Section 3 introduces the methodology, including the design of the attack–defense system and the reinforcement learning framework. Section 4 presents the experimental setup and results. Section 5 discusses the contributions, ethical considerations, and limitations of the proposed approach. Finally, we conclude the paper in Section 6.

2. Related Work

This section provides an overview of the current approaches in network security defense and the application of reinforcement learning in network security. We first discuss traditional defense strategies, highlighting their strengths and limitations in addressing the evolving nature of cybersecurity threats. Next, we explore the integration of reinforcement learning into network security, focusing on its potential to enhance defense adaptability and decision-making. Finally, we examine the limitations of existing methods and identify the gap in research that the proposed work aims to address.

2.1. Defense Strategies for Critical Data Protection in Network Systems

Currently, computer networks face numerous security challenges, including various forms of malicious code and system vulnerabilities. Malicious code such as computer viruses, Trojan horses, worms, and logic bombs has become a significant issue in political, economic, and military contexts [7,8,9,10,11,12,13].
To protect critical data and ensure system reliability, a variety of defensive techniques have been implemented, including installing antivirus software, host firewalls, hiding IP addresses, closing non-essential ports, and adjusting browser security settings. While these measures can improve network security to some extent, they are often reactive and fail to adapt to new and sophisticated threats.
Researchers have proposed various models to enhance defense mechanisms against these challenges. For example, Gao [14] utilized particle swarm optimization based on Bayesian attack graphs to construct a model using attack benefits and defense costs. Huang [15] developed a defense measure selection model against distributed denial-of-service (DDoS) attacks using objective attack history. Si [16] proposed a criticality metric for network states to remove the maximum criticality from attack graphs to maintain network integrity. Despite these efforts, traditional methods often rely on static models and lack the flexibility to adapt to evolving threats. While dynamic encryption technologies, such as those that change algorithms and keys over time, aim to address some of these limitations, they remain limited in scalability and adaptability and are unable to handle increasingly complex attack vectors [17].
Endogenous security, which integrates mimic computing with network defense, is a promising direction for enhancing system security against unknown threats [18]. The theory posits that systems will always have “dark functions” that present potential vulnerabilities. By leveraging mimic defense technology, which introduces dynamic, heterogeneous redundancy (DHR) architecture, systems can respond to attacks in a less predictable and more resilient manner [19]. Endogenous security has been enhanced by combining it with other techniques, including honeypot technology and federated learning, which contribute to the robustness and adaptability of defense mechanisms [20,21,22]. However, existing research still lacks a unified approach for dynamically scheduling defense strategies, a gap that the proposed framework in this paper aims to address.

2.2. Application of Reinforcement Learning in Network Security

Traditional network attack and defense strategies rely on logical reasoning and equilibrium solutions, often failing to effectively address complex, dynamic security challenges. One key limitation is the lack of flexibility and scalability, which is especially evident in large-scale systems with high traffic variability. To overcome these issues, some researchers have turned to reinforcement learning, which offers a promising approach to enhance network defense mechanisms [23].
Reinforcement learning, first proposed by Minsky [24], has evolved into a powerful tool in artificial intelligence (AI) and network security. With its ability to learn and adapt through interactions with the environment, RL provides a way to develop dynamic defense strategies that are more capable of handling unexpected or novel attacks. Key advancements in RL include dynamic programming methods [25,26], Q-learning [27,28], and deep reinforcement learning (DRL) [29,30], which has been increasingly applied in network security [31].
In the context of network security, reinforcement learning (RL) has been applied across several domains, including intrusion detection systems (IDSs), software-defined networking (SDN), and advanced persistent threats (APTs). For example, Louati et al. [32] adopted a distributed multi-agent reinforcement learning approach to address the challenges of processing massive heterogeneous data in real time and detecting stealthy attacks in large-scale networks. Similarly, Khayat et al. [33] introduced a DQN-based hybrid optimization framework for IoT devices, achieving 99.8% accuracy on the BoT-IoT dataset through feature fusion and chaos-based feature selection, effectively detecting zero-day attacks in resource-constrained environments. In the SDN domain, RL has been utilized to enhance network performance and fault recovery. Kumar et al. [34] proposed an RL-driven SDN model that significantly improved throughput, efficiency, and quality of service (QoS) compared to conventional networking models, providing intelligent and efficient connectivity solutions for next-generation SIoT systems. Additionally, Huang et al. [35] introduced a deep reinforcement learning-based faulty-link recovery approach, DDPG-LBBP, which outperformed baseline methods in terms of packet loss rate, recovery delay, and success rate. With respect to APTs, Xiao et al. [36] proposed an RL-based defense scheme to optimize both scanning intervals and repair strategies in large-scale smart grid meter systems. Furthermore, Chen et al. [37] developed a Grouped Multi-Agent Deep RL framework to jointly optimize defense effectiveness and resource utilization under cost constraints, effectively mitigating the dimensionality explosion through adaptive cooperation strategies. RL has also been applied in adversarial cyber-attack simulation. Oh et al. [31] proposed a deep RL-based adversarial simulation framework that dynamically adjusted exploration-exploitation trade-offs by tuning epsilon and decay rates, achieving a 65% improvement in convergence speed over random search and enhancing attack success rates. In another study, Shah et al. [38] employed a game-theoretic adversarial RL framework to evaluate alert detection systems in Cybersecurity Operation Centers (CSOCs), revealing the robustness of defensive strategies against intelligent attacks.
To highlight key characteristics of these representative studies, Table 1 provides a comparative overview across several important dimensions, including their application domains, RL algorithms used, and whether they support dynamic heterogeneous redundancy (DHR). These methods demonstrate the growing applicability of reinforcement learning in designing autonomous and adaptive defense strategies capable of responding to complex and evolving cyber threats. However, most existing RL-based cybersecurity frameworks struggle to simultaneously accommodate dynamic behavior, system heterogeneity, and resource redundancy. The integration of RL into mimic defense architectures, specifically through dynamic strategy scheduling in DHR systems, presents a unique opportunity to overcome these limitations. Current research lacks a comprehensive framework that not only employs RL for defense decision-making but also dynamically adapts defense strategies, fully leveraging the potential of mimic defense technology. This gap is precisely what the proposed work aims to address.

3. Methodology

To simulate attack and defense scenarios, we construct an interactive attack–defense system using Unity. This system illustrates the behavior logic of both attacker and defender, simulating the state changes of resources. To enhance the effectiveness of defense strategies, we apply an RL agent for policy scheduling within the DHR architecture. As a result, the system can implement dynamic defense measures to effectively counter emerging threats in cyberspace. The diagram below (Figure 1) illustrates the overall architecture of the proposed RL-based model (ERL model), integrated within the endogenous security framework. It highlights the interactions between the core components of the system. The RL agent receives environment information, such as attack operations, and selects the current encryption method set from the encryption resource pool through policy scheduling. This selected set is then used to counteract attack operations, protect resources, and ensure system resilience.
In our ERL model, encryption switching is implemented as a dynamic and adaptive mechanism governed by reinforcement learning. Each encryption algorithm within the system is abstracted as a heterogeneous executor, reflecting the diversity necessary to enhance unpredictability and resilience. During each simulation round, the RL agent evaluates the current threat and system state to determine whether a defensive action is warranted. If a defense is initiated, the agent selects a specific target resource and dynamically replaces one of its encryption layers with an algorithm from the heterogeneous pool. This process mirrors real-world security practices, such as periodic encryption updates or reactive configuration adjustments based on observed threats. Furthermore, the defender operates under a constrained cost budget per round, requiring the agent to strategically balance defense effectiveness with resource expenditure. This cost-aware scheduling ensures the system maintains robust protection while optimizing costs and operational efficiency.

3.1. Attack and Defense Scenario System Design

In the context of offensive and defensive confrontations, the attributes and value of network resources play a critical role in the security and performance of the system. To simulate real-world scenarios, we designed five different types of resources, each containing a string of text information. These resources are protected using six widely recognized and commonly used encryption algorithms: DES, 3DES, AES, RSA, ECC, and RC4 [39,40,41]. Each resource carries 4 to 5 pieces of text information, each piece encrypted by a randomly selected algorithm. For example, if a resource has an encryption layer {DES, RC4, AES, RSA}, it indicates that the resource contains four pieces of text information, each protected by its respective encryption algorithm. When establishing the attack–defense scenario, the system randomly generates resources along with their text information and encryption algorithms, thereby simulating the dynamic changes in a real environment. This design validates the system's stability and security in handling sensitive data and critical files, providing practical verification of the overall system's security and attack resistance capabilities.
In our system design, we simulate persistent threats (e.g., APTs) that dynamically select encryption layers to attack. The attacker aims to breach all encryption layers of a given resource at minimal cost by selecting optimal target resources and attack tools. We develop an algorithm to identify the appropriate target resource and select the optimal attack tools. We first define the cost consumption of switching attack tools, $\mathrm{Cost}_n$, as linear in the number of algorithms the attack tool applies to, with a ratio of 1. After determining the cost of attack tools, the key issue is how to breach all encryption layers of a resource at the lowest cost. First, we consider a sequence of encryption layers $\{a_0, a_1, a_2, \ldots, a_n\}$. Since the attacker must breach the encryption layers in order (i.e., they can only breach from $a_0$ to $a_n$ sequentially), for the last three consecutive layers $\{a_{n-2}, a_{n-1}, a_n\}$, if these three layers can reuse one tool, the cost consumption is reduced compared to not reusing the tool for these three layers. Similarly, for the last two consecutive layers $\{a_{n-1}, a_n\}$, if these layers can reuse one tool, the cost consumption is also reduced. There are two reuse strategies described above, but it is not immediately clear which one incurs a lower cost. Therefore, we consider the following recurrence relation:
$$
\mathrm{Cost}_n = \min
\begin{cases}
\mathrm{Cost}_{n-3} + \mathrm{matches}(\mathrm{seq}_{n-2..n}), & \text{if } \mathrm{matches}(\mathrm{seq}_{n-2..n}) > 0 \text{ and } n \ge 2 \\
\mathrm{Cost}_{n-2} + \mathrm{matches}(\mathrm{seq}_{n-1..n}), & \text{if } \mathrm{matches}(\mathrm{seq}_{n-1..n}) > 0 \text{ and } n \ge 1 \\
\mathrm{Cost}_{n-1} + 2, & \text{otherwise}
\end{cases}
$$
Here, $\mathrm{Cost}_n$ represents the cost consumed by the subsequence from the beginning (index 0) up to index $n$ of the sequence. The function matches() is a utility function implemented in the system. Given an array, this function searches through all attack-tool arrays, starting with arrays of length 2 up to length 3, to find the first matching array, and returns the length of the matching array. If no match is found, it returns −1.
The recurrence formula listed above ultimately bottoms out at the term $\mathrm{Cost}_{-1}$. After analysis, $\mathrm{Cost}$ should also have an initial value, $\mathrm{Cost}_{-1} = 0$. Thus, using the above formulas, the cost consumed for each encryption-layer sequence can be calculated. By comparing these costs, the most advantageous resource to target for an attack can be determined.
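As a minimal sketch, the cost recurrence and the matches() utility can be written in Python as follows. The attack-tool pool and algorithm names here are hypothetical (the paper does not specify the tool set); tool costs follow the ratio-1 rule stated above, so a matching tool of length k contributes cost k, and the fallback uses a length-2 tool per layer at cost 2.

```python
from functools import lru_cache

# Hypothetical attack-tool pool: each tool lists, in order, the encryption
# algorithms it can break in one pass (contents are illustrative only).
TOOLS = [
    ["DES", "RC4"],
    ["AES", "RSA"],
    ["RC4", "AES", "RSA"],
]

def matches(seq):
    """Return the length of the first tool that matches `seq` exactly,
    checking length-2 tools before length-3 ones; -1 if none matches."""
    for tool in sorted(TOOLS, key=len):
        if tool == list(seq):
            return len(tool)
    return -1

def cost(layers):
    """Minimum cost to breach `layers` in order, via the recurrence
    Cost_n with the initial value Cost_{-1} = 0."""
    @lru_cache(maxsize=None)
    def c(n):
        if n < 0:
            return 0                                   # Cost_{-1} = 0
        options = [c(n - 1) + 2]                       # else-branch: 2 per layer
        if n >= 1 and matches(tuple(layers[n - 1:n + 1])) > 0:
            options.append(c(n - 2) + 2)               # reuse a 2-layer tool
        if n >= 2 and matches(tuple(layers[n - 2:n + 1])) > 0:
            options.append(c(n - 3) + 3)               # reuse a 3-layer tool
        return min(options)
    return c(len(layers) - 1)
```

For the example resource {DES, RC4, AES, RSA} under this hypothetical tool pool, reusing the two 2-layer tools yields a total cost of 4 instead of the 8 incurred by attacking each layer separately.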
The specific algorithm flow for the attacker’s target resource selection is shown in Figure 2, and the pseudocode for the algorithm is presented in Algorithm 1. Below is a concise overview of the algorithm process:
(1) Enter the attacker's round.
(2) The attacker checks (corresponding to "Judgment 1" in Figure 2) whether the tool selected in the previous round can be reused on the previously chosen target resource. If yes, the attacker reuses the tool to continue attacking the previous target resource, ending the round. If no, proceed to the next step.
(3) For the outermost layer of each resource, determine whether the tool from the previous round can be reused. If yes, calculate the Cost value of the resource after removing the outermost layer; if no, directly calculate the Cost value including the outermost layer. This step corresponds to "Calculate Cost for Each Resource" in Figure 2.
(4) Compare the Cost values of the resources and select the one with the smallest Cost value as the target resource for this round. This step corresponds to "Select New Target" in Figure 2.
(5) The above process determines the target resource for the attacker's round. Once the target resource is determined, choosing the attack tool becomes relatively straightforward. Next, determine whether the Cost value of the new target calculated in Step (3) was derived after removing the outermost layer (corresponding to "Judgment 2" in Figure 2). If the calculation was based on reusing the tool from the previous round and removing the outermost layer, the attacker reuses the tool from the previous round (corresponding to "Reuse Tool" in Figure 2). If not, the attacker selects a tool with higher reusability for the top 2 to 3 layers. Here, the matches() function can be used directly to match the top 2 or 3 layers. If the matches() function reveals that the top 2 or 3 layers of the new target resource cannot be reused, the attacker directly selects an attack tool of length 2 that can attack the outermost layer (corresponding to "Select New Tool" in Figure 2).
(6) After confirming both the target resource and the tool, the attacker launches the attack, ending the round.
Thus, the attacker’s logic is guided by the above-designed target resource selection algorithm. In the system’s design, the attacker uses this algorithm to consistently determine the most suitable target selection and tool selection method to completely compromise a resource while consuming the least cost.
The defender’s role is to protect resources and reduce the threat posed by attackers, simulating the defense strategies and countermeasures of a network administrator. Defenders select target resources and switch algorithms based on the encryption layers, thereby increasing the attacker’s cost and effort. This defense strategy aims to increase the difficulty of attacks and reduce their success rate. Additionally, defenders must consider cost consumption and diverse response strategies. In the confrontation with attackers, defenders might take various actions, which could inadvertently favor the attackers. Therefore, the defender’s actions are dynamically changing. Throughout the attack–defense process, defenders need to decide whether to defend, identify the target resources to defend, and select the encryption algorithm to switch. The logical flowchart of the defender’s process from the start to the end of each round is shown in Figure 3.
Algorithm 1 Target Resource Selection Algorithm
Input: none
Output: the target resource and attack tool of the current round

if last resource and last tool can be reused then
    target resource ← last resource
    attack tool ← last tool
else
    costs ← empty array
    for r in resources do
        if last tool can be reused on r then
            remove the outermost layer of r    ▷ tool reuse makes this layer free
        end if
        costs[r] ← cost(r)
    end for
    target resource ← the r with the lowest cost
    if last tool can be reused on target resource then
        attack tool ← last tool
    else
        attack tool ← the tool in tools that matches the top layers of target resource
    end if
end if
return target resource, attack tool

3.2. Reinforcement Learning Framework Platform Construction

We introduce an RL agent to perform policy scheduling within the DHR architecture, where the agent continuously learns from its interactions with the environment to improve its decision-making process. The goal of the agent is to obtain a strategy π suitable for the environment, where π represents the probability of the agent choosing a specific action in a given state. Typically, this is defined as
$\pi : S \times A \rightarrow [0, 1]$
Strategies can be either deterministic or stochastic. A deterministic strategy means that the agent takes the same action when it is in the same state at different times. A stochastic strategy, on the other hand, is a probability distribution that provides the probabilities of choosing different actions in a given state. In other words, when the environmental state is the same, a stochastic strategy offers the agent the optimal action based on this probability distribution. Essentially, it is a probabilistic problem without a definite answer [42]. A stochastic strategy can typically be defined as follows:
$\pi(a \mid s) = p\left(a_t = a \mid s_t = s\right)$
In the process of interaction between the agent and the environment, the strategy updates itself in real time, continuously optimizing through each reinforcement learning interaction. This ongoing optimization aims for the agent’s strategy to become optimal, known as strategy updating. To evaluate the performance of the agent in a particular state s at a given moment, the following parameter is defined:
$V_\pi(s) = \mathbb{E}\left[G_t \mid s_t = s, \pi\right]$
This parameter is called the state-value function. Similarly, to evaluate the performance of the action a taken by the agent in a particular state s at a given moment, the action-value function is defined as follows:
$Q_\pi(s, a) = \mathbb{E}\left[G_t \mid s_t = s, a_t = a, \pi\right]$
Here, $G_t$ represents the total reward the agent receives from the current state $s$ until the end of the interaction process, i.e., the cumulative reward mentioned above. These are the two types of value functions in reinforcement learning. During the reinforcement learning process, the agent updates and evaluates these value functions, continuously refining the strategy toward optimal performance. The details of the reinforcement learning components are introduced in the following subsections.
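For completeness, the return $G_t$ used in both value functions is the standard discounted cumulative reward; assuming a discount factor $\gamma \in [0, 1]$ (the text does not state its value), the two definitions are linked as follows:

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad
V_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, Q_\pi(s, a)
$$

That is, the state value is the policy-weighted average of the action values in that state.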

3.2.1. Action Space Design

In reinforcement learning, the action space defines the set of all possible actions an agent can perform, reflecting the range of operations or decisions the agent can take. The type of action space depends on the nature of the problem and the characteristics of the environment. Action spaces are generally divided into the following two categories:
(1) Discrete Action Space: In this type of action space, the agent's selectable actions form a discrete set. For example, in board games, the action space might include all legal moves of the pieces or specific positions on the board. In a discrete action space, each action represents a specific choice.
(2) Continuous Action Space: In this type of action space, the agent can choose a series of continuous values as actions. For instance, in robot control problems, the action space might consist of a continuous action vector, where each element represents the angle or speed of different robot joints. In a continuous action space, the agent has an infinite number of possible action choices, typically achieved by selecting specific values within a continuous range.
The structure of the action space is shown in Figure 4. When defining the action space of the defender agent, we consider three discrete parameters: isDefend (deciding whether to execute a defensive operation), chooseToDefend (selecting the resource to defend), and changeToEnId (choosing the encryption algorithm). These parameters have 2, 5, and 6 possible values, respectively, forming a discrete action space with three branches corresponding to the above parameters.
During the training process of the reinforcement learning model, the values within this action space are dynamically generated by the reinforcement learning policy. In each training cycle, these action values are produced in the agent’s action reception function and adjust the defender agent’s attributes according to predefined mapping relationships, thereby guiding the defensive actions in its decision-making process.
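As an illustration, the three-branch discrete action can be decoded into concrete defender parameters roughly as follows. This is a sketch, not the paper's implementation: the ordering of the encryption pool and the use of a plain resource index are assumptions; only the branch sizes (2, 5, 6) and the parameter names come from the text.

```python
# Encryption pool ordering is an assumption; the six algorithms themselves
# are listed in Section 3.1 (DES, 3DES, AES, RSA, ECC, RC4).
ENCRYPTION_POOL = ["DES", "3DES", "AES", "RSA", "ECC", "RC4"]

def decode_action(branches):
    """Map the three discrete branches (sizes 2, 5, and 6) onto the
    defender's parameters isDefend, chooseToDefend, and changeToEnId."""
    is_defend, resource_idx, enc_idx = branches
    return {
        "defend": bool(is_defend),               # branch 1: 2 values
        "resource": resource_idx,                # branch 2: 5 resources
        "algorithm": ENCRYPTION_POOL[enc_idx],   # branch 3: 6 algorithms
    }
```

In each training cycle, such a mapping would translate the policy's raw branch outputs into the defensive action applied to the simulated environment.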

3.2.2. State Space Design

The state space of reinforcement learning plays a crucial role in introducing intelligent and adaptive decision-making mechanisms. It integrates the state information of key elements within the system, such as attacker behavior, resource status, and defender strategies, thereby reflecting the dynamic changes in the system. This enables the DHR architecture to accurately perceive the characteristics of the network environment through the reinforcement learning state space and optimize decision-making strategies during continuous learning, adapting to the ever-changing network attack scenarios to achieve intelligent decision-making and operations. By thoroughly analyzing the state space, the DHR architecture can more accurately assess risks, optimize resource allocation, and respond promptly, thereby enhancing network security and resistance to attacks.
Within the reinforcement learning framework, the state space represents the entire set of environment states observable by the agent, encompassing all possible states. States can be discrete or continuous, depending on the specific problem and environmental characteristics. In a discrete state space, the environment states form a distinctly defined set; in a continuous state space, the environment states are represented by continuous values or vectors, allowing the agent to observe an infinite number of possible states. For the defender agent, the goal is to effectively respond to attacks on resources, select appropriate resources for defense, and adjust the encryption layer algorithms of the resources, achieving the switching of heterogeneous redundant executors. The state space of the defender agent includes the status of resource encryption layers and the frequency of attacks on resources. It also incorporates the remaining cost value of the defender to accurately measure the current cost state of the system. Given that the defender might adopt different behavioral strategies at various cost stages, this variation is crucial for the adaptability of the behavioral strategy. Therefore, when facing attacks, the defender can effectively adjust its behavioral strategy according to different cost scenarios, achieving more targeted and efficient dynamic adjustments, thereby enhancing the system’s security and response efficiency.
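A minimal sketch of how such an observation vector might be assembled is shown below. The field layout and encodings are assumptions for illustration (encryption layers are represented by integer algorithm ids); the three ingredients themselves, namely encryption-layer status, attack frequency, and remaining defense cost, follow the description above.

```python
def build_state(enc_layers, attack_counts, remaining_cost):
    """Flatten per-resource encryption-layer ids, per-resource attack
    frequencies, and the defender's remaining cost budget into a single
    observation vector for the RL agent."""
    state = []
    for layers in enc_layers:
        state.extend(layers)        # encryption-layer status per resource
    state.extend(attack_counts)     # attack frequency per resource
    state.append(remaining_cost)    # remaining defense budget
    return state
```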

3.2.3. Reward Function Design

In the reinforcement learning framework, reward information provides a dynamic way to adjust and optimize system security. By adjusting the reward function based on the current security status, resource conditions, and changes in attacker behavior, the system can flexibly respond to different security threat scenarios and optimize its defense strategies.
For the defender agent, the effectiveness of its actions in the attack–defense interaction is mainly evaluated by whether it successfully prevents an attack on specific resources. If the attacker successfully breaches a resource, the defender agent receives a significant penalty, with a reward value set to −5. This penalty mechanism ensures that the defender avoids repeating past failed decisions, adhering to the trial-and-error principle of reinforcement learning. If the attacker’s attempt fails due to random factors rather than the defender’s actions, the defender agent receives a low reward, with a reward value of +1. This low reward aims to prevent the defender from making ineffective decisions by adopting random strategies against inherently secure resources.
During the attack–defense interactions, to incentivize the defender agent to adopt effective defense strategies and efficiently utilize heterogeneous redundant executors, a reward value of +1 is also set when the defender’s effective measures cause the attack to fail. This reward structure aims to encourage the defender agent to take more effective and precise defense measures, thereby enhancing the overall security and responsiveness of the system.

4. Experiment

Our reinforcement learning framework is designed and implemented with the ML-Agents library [43]; the design of its action space, state space, and reward mechanism is described in Section 3. ML-Agents is a machine learning toolkit provided by the Unity community that allows agents to be trained to perform various tasks within the Unity engine. Because the toolkit is tightly integrated with the Unity editor, developers can build the environment and train machine learning models in the same place, without switching between tools and platforms. ML-Agents provides a set of simple yet powerful APIs for defining agent behavior and agent–environment interaction. It also uses configuration files to define the parameters and settings for training, including the agent, algorithm, hyperparameters, and training parameters, which eliminates the need for developers to implement neural networks manually.
The reinforcement learning model implemented in this paper is trained on Windows 10 with an NVIDIA RTX 3080 Ti GPU, in a Python 3.8.18 environment with torch 1.7.1, cudatoolkit 10.2, and mlagents 0.29.0. The training process involves approximately 1.5 million steps and completes in about 4 h.
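For illustration, an ML-Agents 0.29.0 trainer configuration for such a setup might contain fields like the following, expressed here as a Python dict (in practice it is a YAML file passed to `mlagents-learn`). The behavior name and hyperparameter values are representative assumptions, not the paper's reported settings, except for the roughly 1.5-million-step budget:

```python
# Representative ML-Agents PPO trainer settings; the behavior name
# "DefenderAgent" and all hyperparameter values below are hypothetical.
trainer_config = {
    "DefenderAgent": {
        "trainer_type": "ppo",
        "max_steps": 1_500_000,            # ~1.5 million steps, as reported
        "hyperparameters": {
            "batch_size": 1024,
            "buffer_size": 10240,
            "learning_rate": 3.0e-4,
            "beta": 5.0e-3,                # entropy regularization strength
            "epsilon": 0.2,                # PPO clipping range
            "lambd": 0.95,                 # GAE lambda
            "num_epoch": 3,
        },
        "network_settings": {"hidden_units": 128, "num_layers": 2},
    }
}
```

Keeping these settings in a configuration file is what allows the framework to swap among A2C, PPO, and SAC (Section 4.1) without hand-written network code.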

4.1. Preliminary Model Experiment

In our study, we evaluated the performance of candidate reinforcement learning algorithms within our framework through pre-tests. The algorithms tested were A2C, PPO, and SAC. After training each algorithm, we averaged the results of five runs and processed the data to generate the reward and entropy curves shown in Figure 5 and Figure 6. As the number of training steps increased, the cumulative average reward rose consistently, indicating improved learning effectiveness, while the entropy decreased, indicating that the decision-making policy gradually stabilized. The curves also show that PPO reached higher reward values faster and declined to a lower entropy value more stably.
These clear contrasts confirm the value of the preliminary experiments and show that PPO is the best fit for our framework, so PPO was selected as the reinforcement learning algorithm for the practical ERL model in the subsequent tests and evaluations. Note that we tuned the configuration parameters of each algorithm to its characteristics so that each could reach its best performance.
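The entropy curves in Figure 6 track how deterministic each policy becomes over training; for a discrete action space, the plotted quantity is typically the Shannon entropy of the action distribution, sketched here:

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution;
    falling entropy means the policy is becoming more deterministic."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25] * 4                 # early training: near-uniform, high entropy
peaked = [0.97, 0.01, 0.01, 0.01]    # late training: confident, low entropy
assert policy_entropy(uniform) > policy_entropy(peaked)
```

A stable decline toward a low plateau, as observed for PPO, signals convergence to a consistent strategy rather than continued random exploration.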

4.2. Performance of the Model at Different Training Stages in Reinforcement Learning

During the formal experiment phase, to meticulously record the learning progress of the system at different training stages, we saved a copy of the model every 100,000 training steps. This approach allowed us to capture the state changes in the system throughout the training process, visually demonstrating the evolution of the model’s performance. Using the ML-Agents training framework, we were able to monitor the learning progress of the system model at different time points, which helped us gain a deeper understanding of the model’s robustness and intelligent decision-making capabilities, as well as determine the optimal endpoint for model training. After the training concluded, we applied models from different training stages to the defender agent and conducted 3000 attack tests in our attack–defense system. In these attack tests, we do not focus on the specific execution processes of the encryption and decryption algorithms. Our primary concern is the impact of the attacks on the state of the resources and the responses made by the defender agents. These attacks are designed to alter the encryption layers of the resources, simulating a range of potential attack scenarios. The effectiveness of the defense agents is evaluated based on their ability to preserve resource integrity and prevent the attacker from successfully compromising the resources. Upon completion of the experiments, we recorded the number of successful defenses by the defender agent and calculated the defense success rate.
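The 3000-attack evaluation described above can be sketched as follows; `simulate_attack` and `model_defend` are hypothetical stand-ins for the paper's attack–defense system and the staged RL models:

```python
import itertools

def run_attack_tests(model_defend, simulate_attack, n_tests=3000):
    """Replay n_tests attacks against a defense policy and report the
    defense success rate. `model_defend` returns True when resource
    integrity is preserved; both callables are illustrative stubs."""
    defended = 0
    for _ in range(n_tests):
        attack = simulate_attack()
        if model_defend(attack):
            defended += 1
    return defended / n_tests

# Toy usage: attacks cycle over 5 resources; the defender blocks even-numbered ones.
counter = itertools.count()
rate = run_attack_tests(lambda a: a % 2 == 0, lambda: next(counter) % 5)
```

Applying this loop to the checkpoints saved every 100,000 steps yields the success-rate trajectory reported in Figure 7.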
According to Figure 7, as the number of training steps increases, the defender’s success rate rises from 64.5% to 90.7%, showing a clear upward trend. In the early stages of training, the success rate increases rapidly, indicating that the model quickly adapts and enhances its defensive capabilities. As training progresses, the rate of increase slows and stabilizes, reflecting that the defensive strategies mature through continuous learning. This demonstrates the strengths of reinforcement learning in continuously adjusting and optimizing defense strategies, particularly in executing heterogeneous redundancy switching strategies.
Simultaneously, as shown in Figure 8, the average cost consumption for both attack and defense shows an upward trend and gradually stabilizes with ongoing training. This indicates that as the defender’s behavior strategies become more mature and intelligent, attackers need to expend more resources to succeed. In the initial stages of training, due to chaotic defense strategies, attackers can easily succeed at a lower cost. As the defender’s strategies are optimized and become more intelligent, the difficulty for attackers increases, leading to higher cost consumption. For the defender, as effective defense strategies are learned and mastered, their actions become more precise. Although there may be an initial cost increase due to frequent attempts at different defense strategies, ultimately, as the model’s training effectiveness improves, the defender can achieve more efficient defense at a lower cost, resulting in a trend where costs first rise and then stabilize. This change reflects the model’s ability to adapt to system dynamics and optimize defense strategies, ultimately enhancing the overall system defense performance.
This process aligns with the expected outcomes of reinforcement learning training in the endogenous security DHR architecture. During the training process, by reasonably designing the action space, state space, and reward mechanism, the defender can learn and adjust its defense strategies through trial and error. This continuous process of strategy adjustment and optimization enables the defender to gradually explore and implement effective heterogeneous redundancy switching strategies, thus achieving the core goal of the endogenous security DHR architecture: leveraging learning mechanisms to adapt to system dynamic changes, thereby significantly improving the system’s defense capabilities.

4.3. Comparison of ERL Model Strategies with Other Strategies

Based on the experimental results, we selected the model with the highest defender success rate (trained for 1.5 million steps) and compared it with traditional defense scheduling strategies to verify whether the ERL model effectively learns attackers' attack patterns and has predictive capability. We designed two conventional defense scheduling strategies. In random defense scheduling (Baseline), in each round with sufficient cost a defense is selected at random: both the target resource and the switching algorithm are chosen randomly. In greedy defense scheduling (Greedy), in each round with sufficient cost the defense targets the resource attacked in the previous round, while the switching algorithm is chosen randomly. Contrasting these non-learning baselines with the ERL model enables a comparative evaluation of its core components, such as policy learning and dynamic decision-making. For each strategy, three independent experiments were conducted, each involving 3000 attack tests. We recorded the number of successful defenses and the cost consumption of both attackers and defenders, calculated the defense success rate, and computed the mean and standard error of the mean (SEM) for each metric. The results are illustrated in the comparison chart in Figure 9.
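A minimal sketch of the two baseline schedulers (function names are ours, and the per-round cost check is omitted for brevity):

```python
import random

def random_defense(resources, algorithms, rng):
    """Baseline: pick both the target resource and the switching
    algorithm uniformly at random."""
    return rng.choice(resources), rng.choice(algorithms)

def greedy_defense(resources, algorithms, last_attacked, rng):
    """Greedy: defend the resource attacked in the previous round;
    the switching algorithm is still chosen at random."""
    target = last_attacked if last_attacked is not None else rng.choice(resources)
    return target, rng.choice(algorithms)

rng = random.Random(7)
resource, algorithm = greedy_defense(list(range(5)), list(range(6)),
                                     last_attacked=3, rng=rng)
```

Neither scheduler conditions on the encryption-layer state or the remaining cost, which is precisely what the learned ERL policy exploits.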
The analysis based on test data and statistical results shows that the ERL model significantly improves the defender’s success rate compared to the random strategy and the greedy strategy, with increases of 60.767% and 53.799%, respectively. Additionally, as shown in Figure 9, under the ERL model strategy, the defender’s cost consumption is substantially reduced compared to the other two conventional strategies, while the attacker’s cost consumption significantly increases. This indicates that the ERL model achieves its intended goal of reducing the attacker’s threat at a lower cost. Therefore, the reinforcement learning strategy enables the defender to effectively learn the attacker’s mindset, demonstrating the effectiveness and feasibility of this reinforcement learning model compared to conventional strategies.

5. Discussion

This section discusses the proposed method with respect to its contributions, ethical and legal considerations, and limitations.

5.1. Contributions

This study explores the application of a reinforcement learning model based on endogenous security in network defense. First, the reinforcement learning model demonstrates immense potential in handling complex and dynamic environments. Experimental results show that the model is capable of dynamically scheduling the DHR architecture, adapting quickly to environmental changes and optimizing defense strategies. This leads to enhanced defense performance, demonstrated by improved defense success rates and reduced costs. The integration of endogenous security, which focuses on a system’s inherent capability to counteract security threats, further strengthens system resilience. By incorporating dynamic encryption technology, security is enhanced even more. The experimental results indicate that our model shows significant advantages in defense success rate and cost consumption when compared to random defense scheduling and greedy defense scheduling strategies. Through learning and predicting attack patterns, the model can select more accurate and cost-effective defense strategies.
Furthermore, the proposed framework demonstrates clear potential for practical deployment in critical infrastructure sectors such as energy and finance. For instance, in smart grid environments, the RL agent could function as part of a local controller to dynamically adjust communication protocols and encryption layers, thereby reducing vulnerability to data tampering and replay attacks. In financial systems, the framework can be integrated with transaction validation modules to reinforce multi-layer encryption policies and enhance resistance to advanced persistent threats (APTs). By addressing both defense performance and cost constraints, this research lays a foundation for developing adaptive, resource-efficient cybersecurity solutions for real-world applications.

5.2. Ethical and Legal Considerations

Deploying autonomous RL-driven defense systems in critical infrastructure introduces ethical and legal challenges that must be carefully addressed. Autonomous decision-making raises concerns about accountability and transparency, especially when incorrect or overly aggressive defense actions could disrupt essential services such as power distribution or financial transactions. In such safety-critical contexts, the lack of interpretability of RL models becomes a significant issue. While our framework demonstrates strong performance, its decision-making process remains inherently opaque. To mitigate this, future work could incorporate explainable reinforcement learning (XRL) techniques—such as policy visualization, saliency maps, or reward attribution methods—to ensure that defense decisions are interpretable and auditable by human operators.
From a legal perspective, deploying autonomous cybersecurity agents in regulated sectors must comply with data protection laws, operational safety standards, and AI governance frameworks. As global focus on responsible AI intensifies, systems like ours will require rigorous validation and auditing procedures to prevent misuse, ensure fairness, and minimize unintended consequences. By acknowledging these challenges, this research aligns with ongoing discussions on AI safety and ethics in critical infrastructure protection.

5.3. Limitations

Although the experimental results validate the proposed model’s potential, several limitations must be acknowledged. First, the evaluation is conducted within a self-constructed simulation environment, which cannot fully replicate the complexity and unpredictability of real-world attack vectors. Advanced threats such as zero-day exploits, adversarial samples, and coordinated multi-vector attacks may require additional defense layers and adaptive learning mechanisms not fully captured in our current experiments.
Another limitation is the lack of comprehensive comparisons with state-of-the-art methods beyond the baseline random and greedy strategies. While this study establishes a strong proof of concept, future work should include more rigorous benchmarking against recent RL-based cybersecurity solutions and advanced heuristic models. Additionally, the computational overhead of model training and inference, including memory and energy consumption, needs further optimization for large-scale or resource-constrained environments. To address this, future research may explore distributed RL training, lightweight model architectures, and sample-efficient learning techniques to reduce computational burden and enhance scalability.
Finally, although we outlined potential real-world applications in sectors such as energy and finance, practical deployment requires systematic testing on live infrastructure or realistic digital twins to validate robustness and reliability. Furthermore, more detailed statistical analyses should be conducted to assess the performance, stability, and generalization of RL-based scheduling under diverse threat scenarios. Future research should also investigate integrating human-in-the-loop oversight and hybrid defense mechanisms that combine automated RL scheduling with expert-driven policy adjustments, ensuring both high performance and operational safety.

6. Conclusions

The application of the ERL model in network security has demonstrated significant potential in managing complex and dynamic environments. Experimental results highlight the importance of continuously optimizing learning algorithms and strategies, particularly during the later stages of training. This study also encourages the integration of reinforcement learning with existing defense models. First, reinforcement learning models based on endogenous security can complement current defense strategies, especially in scenarios requiring rapid response and dynamic adjustment. Second, integrating multi-layer defense strategies with traditional mechanisms, such as firewalls and intrusion detection systems, can form a comprehensive defense system that enhances overall security. Finally, leveraging the adaptive capabilities of reinforcement learning models allows for the continuous adjustment and optimization of defense strategies, providing effective countermeasures against emerging threats.

Author Contributions

Conceptualization, L.H. and H.Z.; methodology, X.Y.; software, X.Y.; validation, X.Y. and J.G.; formal analysis, Z.L.; investigation, X.Y.; resources, Z.L.; data curation, X.Y.; writing—original draft preparation, J.G.; writing—review and editing, X.Y.; visualization, Z.G.; supervision, L.H.; project administration, L.H.; funding acquisition, L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China under grant no. 2022YFB3104300.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

Author Zhou Gan was employed by the company China West Construction Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DHR: Dynamic Heterogeneous Redundancy
APTs: Advanced Persistent Threats
RL: Reinforcement Learning
DDoS: Distributed Denial of Service
AI: Artificial Intelligence
DRL: Deep Reinforcement Learning
IDS: Intrusion Detection Systems
SDN: Software Defined Networking
QoS: Quality of Service
CSOCs: Cybersecurity Operation Centers
XRL: Explainable Reinforcement Learning
ERL: Reinforcement Learning-based Endogenous Security

References

  1. Blanco, J.M.; Del Alamo, J.M.; Duenas, J.C.; Cuadrado, F. A Formal Model for Reliable Data Acquisition and Control in Legacy Critical Infrastructures. Electronics 2024, 13, 1219. [Google Scholar] [CrossRef]
  2. Lallie, H.S.; Shepherd, L.A.; Nurse, J.R.; Erola, A.; Epiphaniou, G.; Maple, C.; Bellekens, X. Cyber security in the age of COVID-19: A timeline and analysis of cyber-crime and cyber-attacks during the pandemic. Comput. Secur. 2021, 105, 102248. [Google Scholar] [CrossRef] [PubMed]
  3. Yoo, S.K.; Baik, D.K. Comprehensive damage assessment of cyberattacks on defense mission systems. IEICE Trans. Inf. Syst. 2019, 102, 402–405. [Google Scholar] [CrossRef]
  4. Williams, B.; Qian, L. Semi-Supervised Learning for Intrusion Detection in Large Computer Networks. Appl. Sci. 2025, 15, 5930. [Google Scholar] [CrossRef]
  5. Park, T.; Kim, K. Strengthening network-based moving target defense with disposable identifiers. IEICE Trans. Inf. Syst. 2022, 105, 1799–1802. [Google Scholar] [CrossRef]
  6. Wu, J.; Ji, X.; He, L.; Cao, Z.; Xie, Y. Endogenous Security Enabled Cyber Resilience Research. Inf. Commun. Technol. 2023, 17, 4–11. [Google Scholar]
  7. Yu, A.; Guan, G. The Research of Computer Network Security Protection Strategy. Comput. Study 2010, 5, 47–49. [Google Scholar]
  8. Yungaicela-Naula, N.M.; Vargas-Rosales, C.; Pérez-Díaz, J.A.; Zareei, M. Towards security automation in Software Defined Networks. Comput. Commun. 2022, 183, 64–82. [Google Scholar] [CrossRef]
  9. Sheikh, Z.A.; Singh, Y.; Singh, P.K.; Ghafoor, K.Z. Intelligent and secure framework for critical infrastructure (CPS): Current trends, challenges, and future scope. Comput. Commun. 2022, 193, 302–331. [Google Scholar] [CrossRef]
  10. Mutambik, I.; Almuqrin, A. Balancing Efficiency and Efficacy: A Contextual Bandit-Driven Framework for Multi-Tier Cyber Threat Detection. Appl. Sci. 2025, 15, 6362. [Google Scholar] [CrossRef]
  11. Ko, K.; Kim, S.; Kwon, H. Selective Audio Perturbations for Targeting Specific Phrases in Speech Recognition Systems. Int. J. Comput. Intell. Syst. 2025, 18, 103. [Google Scholar] [CrossRef]
  12. Ko, K.; Kim, S.; Kwon, H. Multi-targeted audio adversarial example for use against speech recognition systems. Comput. Secur. 2023, 128, 103168. [Google Scholar] [CrossRef]
  13. Ko, K.; Gwak, H.; Thoummala, N.; Kwon, H.; Kim, S. SqueezeFace: Integrative face recognition methods with LiDAR sensors. J. Sensors 2021, 2021, 4312245. [Google Scholar] [CrossRef]
  14. Gao, N.; Gao, L.; He, Y.; Le, Y.; Gao, Q. Dynamic Security Risk Assessment Model Based on Bayesian Attack Graph. J. Sichuan Univ. Sci. Ed. 2016, 48, 111–118. [Google Scholar]
  15. Huang, L.; Feng, D.; Lian, Y.; Chen, K.; Zhang, Y.; Liu, Y. Method of DDoS Countermeasure Selection Based on Multi-Attribute Decision Making. J. Softw. 2015, 26, 1742–1756. [Google Scholar]
  16. Si, J.; Zhang, B.; Man, D.; Yang, W. Approach to making strategies for network security enhancement based on attack graphs. J. Commun. 2009, 30, 123–128. [Google Scholar]
  17. Jin, Z.; Li, D.; Zhang, X. Research on dynamic searchable encryption method based on Bloom filter. Appl. Sci. 2024, 14, 3379. [Google Scholar] [CrossRef]
  18. Wu, J. Research on Cyber Mimic Defense. J. Cyber Secur. 2016, 1, 1–10. [Google Scholar]
  19. Wu, J.; Zou, H.; Xue, X.; Zhang, F.; Shang, Y. Cyber Resilience Enabled by Endogenous Security and Safety: Vision, Techniques, and Strategies. Strateg. Study Chin. Acad. Eng. 2023, 25, 106–115. [Google Scholar]
  20. Yuan, H.; Guo, J.; Mingyang, X. Research on honeypot based on endogenous safety and security architecture. Appl. Res. Comput. 2023, 40, 1194–1202. [Google Scholar]
  21. Chen, H.; Han, X.; Zhang, Y. Endogenous Security Formal Definition, Innovation Mechanisms, and Experiment Research in Industrial Internet. Tsinghua Sci. Technol. 2024, 29, 492–505. [Google Scholar] [CrossRef]
  22. Cai, N.; He, G. Multi-cloud resource scheduling intelligent system with endogenous security. Electron. Res. Arch. 2024, 32, 1380–1405. [Google Scholar] [CrossRef]
  23. Hu, Z.; Zhu, M.; Liu, P. Adaptive Cyber Defense Against Multi-Stage Attacks Using Learning-Based POMDP. ACM Trans. Priv. Secur. 2020, 24, 1–25. [Google Scholar] [CrossRef]
  24. Minsky, M. Steps toward Artificial Intelligence. Proc. IRE 1961, 49, 8–30. [Google Scholar] [CrossRef]
  25. Bellman, R. Dynamic programming. Science 1966, 153, 34–37. [Google Scholar] [CrossRef] [PubMed]
  26. Ruszczyński, A. Risk-averse dynamic programming for Markov decision processes. Math. Program. 2010, 125, 235–261. [Google Scholar] [CrossRef]
  27. Waltz, M.; Fu, K. A heuristic approach to reinforcement learning control systems. IEEE Trans. Autom. Control 1965, 10, 390–398. [Google Scholar] [CrossRef]
  28. Kröse, B.J. Learning from delayed rewards. Robot. Auton. Syst. 1995, 15, 233–235. [Google Scholar] [CrossRef]
  29. Bueff, A.; Belle, V. Logic + Reinforcement Learning + Deep Learning: A Survey. In Proceedings of the 15th International Conference on Agents and Artificial Intelligence, Lisbon, Portugal, 22–24 February 2023; SCITEPRESS-Science and Technology Publications.
  30. Nguyen, T.T.; Reddi, V.J. Deep Reinforcement Learning for Cyber Security. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3779–3795. [Google Scholar] [CrossRef]
  31. Oh, S.H.; Jeong, M.K.; Kim, H.C.; Park, J. Applying Reinforcement Learning for Enhanced Cybersecurity against Adversarial Simulation. Sensors 2023, 23, 3000. [Google Scholar] [CrossRef]
  32. Louati, F.; Ktata, F.B.; Amous, I. Big-IDS: A decentralized multi agent reinforcement learning approach for distributed intrusion detection in big data networks. Clust. Comput. 2024, 27, 6823–6841. [Google Scholar] [CrossRef]
  33. Khayat, M.; Barka, E.; Serhani, M.A.; Sallabi, F.; Shuaib, K.; Khater, H.M. Reinforcement Learning with Deep Features: A Dynamic Approach for Intrusion Detection in IOT Networks. IEEE Access 2025, 13, 92319–92337. [Google Scholar] [CrossRef]
  34. Kumar, A.; Chakravarty, S.; Nanthaamornphong, A. Investigation of the satellite internet of things and reinforcement learning via complex software defined network modeling. Int. J. Electr. Comput. Eng. 2025, 15, 3506–3518. [Google Scholar] [CrossRef]
  35. Huang, W.; Gui, W.; Li, Y.; Lv, Q.; Zhang, J.; He, X. Faulty Links’ Fast Recovery Method Based on Deep Reinforcement Learning. Algorithms 2025, 18, 241. [Google Scholar] [CrossRef]
  36. Xiao, L.; Liu, H.; Lv, Z.; Chen, Y.; Lin, Z.; Du, Y. Reinforcement Learning Based APT Defense for Large-scale Smart Grids. IEEE Internet Things J. 2024, 12, 11917–11925. [Google Scholar] [CrossRef]
  37. Chen, J.; Lan, X.; Zhang, Q.; Ma, W.; Fang, W.; He, J. Defending Against APT Attacks in Cloud Computing Environments Using Grouped Multi-Agent Deep Reinforcement Learning. IEEE Internet Things J. 2025, 12, 19459–19470. [Google Scholar] [CrossRef]
  38. Shah, A.; Sinha, A.; Ganesan, R.; Jajodia, S.; Cam, H. Two can play that game: An adversarial evaluation of a cyber-alert inspection system. ACM Trans. Intell. Syst. Technol. (TIST) 2020, 11, 1–20. [Google Scholar] [CrossRef]
  39. Yasmin, N.; Gupta, R. Modified lightweight cryptography scheme and its applications in IoT environment. Int. J. Inf. Technol. 2023, 15, 4403–4414. [Google Scholar] [CrossRef]
  40. Shen, X.; Li, X.; Yin, H.; Cao, C.; Zhang, L. Lattice-based multi-authority ciphertext-policy attribute-based searchable encryption with attribute revocation for cloud storage. Comput. Netw. 2024, 250, 110559. [Google Scholar] [CrossRef]
  41. Zhang, L.; Wang, Z.; Lu, J. Differential-Neural Cryptanalysis on AES. IEICE Trans. Inf. Syst. 2024, 107, 1372–1375. [Google Scholar] [CrossRef]
  42. Benhamou, E. Similarities Between Policy Gradient Methods (PGM) in Reinforcement Learning (RL) and Supervised Learning (SL). SSRN Electron. J. 2019. [Google Scholar] [CrossRef]
  43. Juliani, A.; Berges, V.P.; Teng, E.; Cohen, A.; Harper, J.; Elion, C.; Goy, C.; Gao, Y.; Henry, H.; Mattar, M.; et al. Unity: A general platform for intelligent agents. arXiv 2020, arXiv:1809.02627. [Google Scholar]
Figure 1. In the ERL model, the RL agent selects encryption algorithms from the encryption algorithm resource pool to be combined into the encryption method set for policy scheduling, dynamically replacing the current encryption method set.
Figure 2. The attacker’s target resource selection algorithm. It involves two key judgments. The first judgment (corresponding to “Judgment 1” in the figure) is whether the tool selected in the previous round can be reused for the previously chosen target resource. The second judgment is whether the calculation process of the new target’s Cost value was derived after removing the outermost layer.
Figure 3. Defender flowchart. The defender’s strategy involves selecting three key parameters: whether to choose to defend, which resource to target, and the algorithm number used for switching.
Figure 4. Parameters of the action space representation. The action space of the defender agent specifically consists of three branches, with sizes of 2, 5, and 6 respectively. These correspond to deciding whether to execute a defensive operation, selecting the resource to defend, and choosing the encryption algorithm.
Figure 5. Comparison of rewards for three reinforcement learning algorithms. The PPO algorithm achieved higher average reward values more quickly.
Figure 6. Comparison of entropy values for three reinforcement learning algorithms. The PPO algorithm decreased to a lower entropy value at a more stable rate.
Figure 7. Defense success rate at different stages of reinforcement learning. The defense success rate gradually increases with the number of training steps and reaches a stable range.
Figure 8. Comparison of attack–defense cost consumption at different training stages of reinforcement learning. The defender agent’s cost consumption is significantly lower than the attacker’s cost consumption.
Figure 9. Comparison of attack–defense success rates and cost consumption. The ERL model significantly improves the defender’s success rate compared to the random and greedy strategies while also substantially reducing the defender’s cost consumption compared to the other two conventional strategies.
Table 1. Comparison of Representative RL-Based Cybersecurity Applications.
Domain | Work | RL Type | Dynamic | Heterogeneous | Redundant
Intrusion Detection Systems | [32] | MARL | YES | NO | NO
Intrusion Detection Systems | [33] | DQN | YES | NO | NO
Software Defined Networking | [34] | DRL | YES | NO | YES
Software Defined Networking | [35] | DDPG | YES | YES | NO
Advanced Persistent Threats | [36] | Actor-Critic | YES | NO | YES
Advanced Persistent Threats | [37] | MADRL | YES | YES | NO
Adversarial Cyber-attack | [31] | DRL | YES | NO | NO
Adversarial Cyber-attack | [38] | ARL | YES | YES | NO

Share and Cite

MDPI and ACS Style

Yu, X.; He, L.; Geng, J.; Liang, Z.; Gan, Z.; Zhao, H. Dynamic Defense Strategy Selection Through Reinforcement Learning in Heterogeneous Redundancy Systems for Critical Data Protection. Appl. Sci. 2025, 15, 9111. https://doi.org/10.3390/app15169111


