5.1. Training and Experiment Setup
For performance evaluation, experiments are conducted using the CybORG environment, described in detail in Section 3. CybORG provides a controlled yet realistic setting for adversarial cyber operations, enabling the training and assessment of defensive agents under diverse attack scenarios. The environment incorporates multiple types of agents and network configurations, allowing systematic benchmarking of defensive strategies.
In our framework, PPO is employed for RL-based training, as it has demonstrated superior stability and effectiveness over alternative methods such as Deep Q-Networks when designing defensive cyber agents [18]. For the SL stage, multiple algorithms were explored, including Random Forests, XGBoost, and neural models. Among these, the Random Forest classifier achieved the most consistent performance following hyperparameter optimization. Consequently, the results presented in this section focus on blue agents trained via the proposed RL–SL framework with PPO for RL training and Random Forest for SL distillation.
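For concreteness, a minimal sketch of this two-stage pipeline is shown below. It assumes a Gym-compatible wrapper around CybORG (the helper `make_cyborg_env` is hypothetical) together with the stable-baselines3 and scikit-learn libraries; the hyperparameters are illustrative placeholders rather than the values tuned for our experiments.

```python
# Minimal sketch of the RL-SL pipeline: train a PPO blue agent, roll it out,
# and distil the resulting policy into a Random Forest classifier.
import numpy as np
from stable_baselines3 import PPO
from sklearn.ensemble import RandomForestClassifier

env = make_cyborg_env()  # hypothetical Gym-compatible CybORG wrapper

# Stage 1: RL training of the defensive (blue) policy with PPO.
ppo_agent = PPO("MlpPolicy", env, verbose=0)
ppo_agent.learn(total_timesteps=200_000)  # illustrative training budget

# Stage 2: roll out the trained policy and record observation-action pairs.
observations, actions = [], []
for _ in range(100):  # illustrative number of rollout episodes
    obs, _ = env.reset()
    done = False
    while not done:
        action, _ = ppo_agent.predict(obs, deterministic=True)
        observations.append(obs)
        actions.append(int(action))
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

# Stage 3 (SL distillation): fit a Random Forest that maps observations to
# the actions selected by the PPO policy.
rf_policy = RandomForestClassifier(n_estimators=200)
rf_policy.fit(np.array(observations), np.array(actions))
```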
Evaluation metrics are aligned with the reward structures defined in CybORG, which are designed to quantify the effectiveness of defensive actions in mitigating adversarial activity. Specifically, the blue agent receives positive rewards for critical defensive outcomes, as summarized in Table 4. Conversely, relatively large negative rewards are assigned when the red team achieves key objectives, such as escalating privileges or compromising operational hosts, as shown in Table 5. Together, these reward mechanisms capture both the benefits of effective defense and the costs of adversarial success, providing a balanced and interpretable basis for evaluating defensive agent performance.
The following subsections present detailed evaluation results for both adversary-specific defensive agents and the generic defensive agent described in Section 4. In particular, the following agents are considered:
- Defensive agents trained using RL alone to counter the BL and RED attacking agents, respectively (referred to below as the BL and RED agents).
- Defensive agents trained using the proposed framework specifically for the BL and RED attacking agents.
- Defensive agents trained using the proposed framework, with the observation space augmented with the immediate reward received, tailored for the BL and RED attacking agents.
- Defensive agents trained using the proposed framework, with the observation space enhanced to include both the reward from the previous action and the action itself, designed for the BL and RED attacking agents.
The proposed framework integrates RL with SL, enabling each trained defensive agent to engage in strategic encounters against its designated attacking counterpart. During these interactions, the agent captures observation–action, observation–reward–action, or observation–previous action–reward–action tuples to support the SL stage. Each defensive agent is trained over 2000 episodes, with each episode spanning 100 steps. To ensure high-quality learning, data is selectively retained from only the top fraction of highest-performing episodes, optimizing the SL training process for defensive agents. The training of defensive agents and the evaluation of performance are performed in the network topology shown in Figure 1. Each defensive agent is evaluated against the attacking agents for 100 episodes, each consisting of 100 steps. The mean total episode reward is used as the performance metric, along with the 95% confidence interval (CI).
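The episode filtering and the reported metric can be sketched as follows; this is a simplified illustration in which the retained fraction (`TOP_FRACTION`), the episode data structure, and the use of a normal approximation for the 95% CI are assumptions rather than a description of the exact implementation.

```python
import numpy as np

TOP_FRACTION = 0.1  # assumed placeholder for the retained share of episodes

def filter_top_episodes(episodes):
    """Keep only the tuples from the highest-return episodes for SL training.

    `episodes` is a list of (total_reward, [(obs, action), ...]) pairs
    collected from rollouts against the designated attacking agent.
    """
    ranked = sorted(episodes, key=lambda e: e[0], reverse=True)
    keep = max(1, int(len(ranked) * TOP_FRACTION))
    return [pair for _, pairs in ranked[:keep] for pair in pairs]

def mean_reward_with_ci(episode_rewards, z=1.96):
    """Mean total episode reward with a 95% confidence interval,
    using a normal approximation over the evaluation episodes."""
    rewards = np.asarray(episode_rewards, dtype=float)
    mean = rewards.mean()
    half_width = z * rewards.std(ddof=1) / np.sqrt(len(rewards))
    return mean, (mean - half_width, mean + half_width)
```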
Figure 3 presents the mean total reward, along with the 95% CI, for the different defensive agents evaluated against the BL and RED attackers. When the BL attacker is active, defensive agents specifically trained against it using the proposed framework achieve a statistically significant improvement in mean reward over the agent trained solely with RL for that attacker. The framework-trained agents with augmented observations likewise outperform the RL-only agent and achieve better mean rewards than the framework-trained agent without augmentation; however, while this latter improvement is evident, it is not statistically significant. Conversely, a defensive agent trained against the RED attacker performs poorly against the BL attacker, as it was not trained for that specific attack strategy. Similarly, when the network is attacked by the RED attacker, defensive agents trained specifically against it using the proposed framework achieve a statistically significant performance boost compared to the RL-trained agent designed for that attacker. However, a defensive agent trained for the BL attacker struggles against the RED attacker, reinforcing the importance of training against the specific attack type.
The proposed framework also provides a mechanism for training a generalized defensive agent; since reward and action augmentation showed similar performance above, the results in this study are reported without augmentation. The performance of this generic agent is compared against another generalized agent that differs in its approach to SL: while the proposed agent uses data from only the top fraction of highest-performing episodes, the alternative leverages the entire dataset collected from interactions between defensive and attacking agents. Additionally, for comparison we consider a hierarchical strategy in which an agent accurately predicts the attack type and deploys the most appropriate specialized defensive agent accordingly.
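The differences between these baselines can be illustrated with the following sketch, which reuses the hypothetical `filter_top_episodes` helper from the earlier snippet; the data layout and the attack-type classifier are schematic assumptions rather than the exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_generalized_policy(episodes_per_attacker, use_filtering=True):
    """Distil a single generalized Random Forest policy from rollouts
    collected against several attacking agents.

    `episodes_per_attacker` maps an attacker label to a list of
    (total_reward, [(obs, action), ...]) episodes.  With
    `use_filtering=False` the entire dataset is used, which corresponds
    to the full-data baseline described above.
    """
    pairs = []
    for episodes in episodes_per_attacker.values():
        if use_filtering:
            pairs += filter_top_episodes(episodes)
        else:
            pairs += [pair for _, ep_pairs in episodes for pair in ep_pairs]
    obs, acts = map(np.array, zip(*pairs))
    return RandomForestClassifier(n_estimators=200).fit(obs, acts)

def hierarchical_act(observation, attack_classifier, specialists):
    """Hierarchical baseline: predict the attack type from the current
    observation, then delegate the decision to the matching specialist."""
    attack_type = attack_classifier.predict([observation])[0]
    return specialists[attack_type].predict([observation])[0]
```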
Figure 4 displays the mean total reward of the generalized defensive agents with a 95% CI. The generic agent trained with the proposed framework outperforms all competitors, including the hierarchical agent that can accurately deploy a specialized agent based on the attack type. In contrast, the full-dataset variant exhibits weaker performance due to its reliance on both high- and low-performing episode data for SL.
To further validate the robustness of the proposed framework, we repeat the experiments with episodes shortened to 50 steps. By halving the episode length, we can assess whether the performance trends observed with longer episodes remain consistent under a different temporal resolution. This evaluation is particularly important for scenarios in which decisions must be made more rapidly, so the defensive agent must operate effectively with fewer steps, and hence less accumulated information, per episode.
Figure 5 reports the mean total reward with 95% CIs for all variants of the defensive agent when episodes are shortened to 50 steps. When the BL attacker is active, the BL-oriented agents trained through the proposed framework (with plain, reward-augmented, and reward-and-action-augmented observations) demonstrate comparable performance: their CIs substantially overlap, and all three significantly exceed the RL-only BL agent. The two augmented variants achieve slightly better mean rewards than the non-augmented framework agent, but these improvements are not statistically significant. In contrast, the RED-based agents (the RL-only RED agent and its three framework-trained variants) perform noticeably worse against the BL attacker, with more negative mean rewards and CIs that do not overlap with those of the BL family. This indicates a statistically significant degradation when defenders trained against RED-oriented scenarios are deployed against the BL attacker.
When the attacker is RED, the trend reverses. The BL-based agents now perform poorly, while the RED variants trained with the proposed framework (with plain, reward-augmented, and reward-and-action-augmented observations) achieve substantially better mean rewards. Their CIs do not overlap with those of the BL family, confirming that these improvements are statistically significant. In addition, these framework-trained RED agents also outperform the RL-only RED agent, which highlights the effectiveness of the proposed training approach under more complex adversarial behavior.
The comparison between Figure 5 and the 100-step evaluation in Figure 3 reveals consistent findings across both temporal horizons. In both cases, agents trained specifically against a given attack using the proposed framework achieve statistically significant improvements on that same attack, while generalization across attack types remains limited. The modest, non-significant advantage of the reward- and reward-and-action-augmented BL agents over the non-augmented framework agent at 50 steps mirrors the 100-step results, where observation augmentation improves mean performance but does not yield statistically significant gains over the best non-augmented agent. The same holds for the agents trained against the RED attacker. Overall, the consistency of these results across episode lengths demonstrates the robustness of the proposed framework.
The evaluation of the generalized defensive agents is likewise extended to 50-step episodes to examine whether their performance remains consistent under a finer temporal resolution. Figure 6 presents the mean reward with a 95% CI for the same set of agents. The observed trends closely mirror those reported for the 100-step experiments. Specifically, the generic agent trained with the proposed framework continues to outperform both the full-dataset variant and the hierarchical agent, despite the latter being able to identify the attack type and deploy a specialized agent accordingly. The full-dataset variant again exhibits weaker performance due to its reliance on both high- and low-performing episodes for SL, which dilutes the quality of the training data. The consistent superiority of the proposed generic agent across both 100-step and 50-step settings demonstrates that the framework is robust to changes in temporal granularity and is effective in training a single generalized agent capable of defending against diverse attack strategies.