Proceeding Paper

Reinforcement Learning for the Optimization of Adaptive Intrusion Detection Systems †

by Óscar Mogollón-Gutiérrez 1,*, David Escudero García 2, José Carlos Sancho Núñez 1 and Noemí DeCastro-García 3

1 Centro Universitario de Mérida, Universidad de Extremadura, 06800 Mérida, Spain
2 Research Institute of Applied Science in Cybersecurity (RIASC), Universidad de León, 24071 León, Spain
3 Departamento de Matemáticas, Universidad de León, 24071 León, Spain
* Author to whom correspondence should be addressed.
Presented at the First Summer School on Artificial Intelligence in Cybersecurity, Cancun, Mexico, 3–7 November 2025.
Eng. Proc. 2026, 123(1), 2; https://doi.org/10.3390/engproc2026123002
Published: 29 January 2026
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)

Abstract

Network intrusion detection is increasingly important due to the annual rise in attacks. In the literature, machine learning is one of the most common mechanisms for improving the performance of detection systems. One of the main problems with these approaches is data imbalance: the volume of malicious traffic is much lower than that of normal traffic, making it difficult to train an effective model. Ensemble models, which combine several individual models, can increase robustness against data imbalance. To further improve the effectiveness of ensembles against imbalance, in this work we apply reinforcement learning to combine the individual predictions of the ensemble's models, with the objective of improving predictions compared to classic weighted voting algorithms.

1. Introduction

The rapid development of Internet technologies, as well as the implementation of different types of applications, has made protecting network services increasingly challenging. Classic security measures, such as firewalls and access control systems, are necessary to protect the network from external attacks. However, these tools have limitations in detecting internal attacks or complicated intrusion cases. Consequently, intrusion detection systems (IDSs) have become an essential set of defenses designed to detect malicious activities that evade simpler security countermeasures like firewalls. IDSs can be host-based or network-based, each with a different approach, from monitoring unauthorized logins to analyzing network packets [1].
In particular, the use of machine learning tools is one of the most used approaches in the literature to improve the effectiveness of IDSs [2]. One of the main problems in applying machine learning, not only to the problem of intrusion detection [3], but to others like malware detection [4], is data imbalance: in general, the proportion of malicious traffic will be significantly lower than normal, making it difficult for the model to learn to distinguish them effectively. Furthermore, intrusion detection requires constant network monitoring [3] and implies that models must be kept updated. Imbalance is particularly negative in this scenario [5], as it can lead to an underweighting of malicious instances.
To limit the impact of imbalance, one of the proposed techniques is the application of ensemble models [6], in which several different models are trained in parallel and whose predictions are combined, generally with a weighted voting algorithm [7] that gives more importance to the predictions of the best models.
In this work, we analyze the effectiveness of applying reinforcement learning (RL) to combine the individual predictions of each model in the ensemble and thus improve the overall prediction. We chose RL because it has proven effective for improving IDSs [8] and has the capacity to process large volumes of data, not only from a detection standpoint [9], but also to dynamically select defensive measures once an attack has been detected [10]. Our hypothesis is that reinforcement learning can provide better predictions than voting algorithms by analyzing the patterns in the predictions of the individual models more deeply. For example, one of the models might perfectly distinguish between normal traffic and a denial-of-service attack, but be unable to distinguish between the other attacks. Although by itself it would be of little use and a voting algorithm might assign it a low weight in the ensemble, it should be taken into account whenever its prediction tends to be accurate. Therefore, we believe that the adaptability of RL can make intrusion detection more robust and intelligent in mitigating evolving cyber threats.
The rest of the article is organized as follows. In Section 2, we present the fundamentals of reinforcement learning, including its theoretical formulation and its application in the context of intrusion detection. Section 3 reviews related works that have addressed similar problems, analyzing previous approaches based on ensemble models and reinforcement learning techniques. In Section 4, we detail the experimental methodology used, including the description of the datasets, the models employed, and the evaluation procedures. Subsequently, in Section 5, the results obtained are presented and analyzed, comparing the performance of the proposed approach with traditional strategies. Finally, in Section 6, the conclusions of the work are presented and possible future lines of research are discussed.

2. Fundamentals of Reinforcement Learning

Reinforcement learning [11] is a branch of machine learning whose objective is to obtain agents that receive information from an environment and act to maximize a numerical reward function, which guides the model to achieve the desired objective, whether it be moving a robot or setting a defense policy to mitigate an attack. Reinforcement learning is commonly formalized as a Markov Decision Process (MDP), a tuple ⟨S, A, T, R⟩, where S denotes the set of states, A the set of actions, T a map that determines the transition from a state–action pair to a new state, and R the reward function that assigns a reward to each action performed in each possible state.
Following the example of attack mitigation, the state reflects the information from the environment that the agent receives, such as the number of flows marked as suspicious by the IDS from that IP; the actions define how the RL agent can interact with the environment, for example, by allowing, blocking, or analyzing the connection in greater depth; T reflects the state change that occurs when taking an action in a specific state. In the case of a robot’s movement, the relative position can be part of the state, so moving in any direction will modify the state for the next iteration. However, in other problems such as attack mitigation, the state transition will generally be completely unknown. Finally, the reward assigns a numerical value to the action taken in a specific state, depending on how it contributes to achieving the agent’s objective. The reward guides the agent’s behavior, so it is important to set it appropriately: in the mitigation example, if only whether the action is correct (blocking malicious and allowing the rest) is taken into account, the agent might choose to analyze a large number of connections more deeply to avoid errors, but the analysis entails an additional cost and may not be feasible. Therefore, the reward could include a component that penalizes the cost of the selected measure to avoid this situation.
According to the RL approach, the agent interacts sequentially with the environment: at each moment it receives a state, selects an action that triggers a transition to a new state, receives a reward, and repeats the process until the objective is reached or a prefixed number of iterations is exceeded. The agent's objective is to determine a policy, which assigns to each state an action that maximizes the accumulated reward. Therefore, the reward must have a certain correspondence with the objective to be achieved. The accumulated reward is modeled using a value function such as the Bellman Q-function (Equation (1)). This function is the expected accumulated reward for taking action a in state s. Here, π denotes the policy, R_{t+1} the immediate reward for taking action a in state s, and γ a discount factor: higher values give more importance to long-term rewards than to immediate ones.
Q(s, a) = E_π[ R_{t+1} + γ Q(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ]    (1)
By maximizing the optimal Q-function Q*(s, a), the optimal policy π*(a|s) can be obtained. However, in many cases, T and R are not known in advance [12], so the Bellman Q-function cannot be used directly, and it is necessary to use some approach, generally iterative, to approximate the value function and obtain the optimal policy. In this work, we have opted to use Deep Q-Networks, which are described below.

Deep Q-Networks

Deep Q-Networks (DQN) [13] apply neural networks as function approximators within Q-learning, which approximates the Q-value function iteratively using Equation (2). α is a parameter that controls the size of the update. The algorithm iteratively updates the estimated values using the estimate of the optimal action's value in the next state.
Q(s_t, a_t) = (1 − α) Q(s_t, a_t) + α [ R_{t+1} + γ max_a Q(s_{t+1}, a) ]    (2)
To balance exploitation (selecting actions that maximize Q) and exploration (selecting new actions to determine their effectiveness), Q-learning adopts an ε-greedy strategy during training. Under this strategy, a_t is a random action with probability ε and the optimal action (according to current knowledge) with probability 1 − ε. The value of ε is usually changed dynamically during training, for example, by using a higher value at the start of training and decreasing it as training progresses, so as to exploit the collected knowledge to optimize Q.
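The ε-greedy selection and the tabular update of Equation (2) can be sketched in a few lines (an illustrative sketch; the function names and table layout are our own choices, not taken from the paper's implementation):

```python
import random
from collections import defaultdict

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random action); otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, as in Equation (2):
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

# Q as a table defaulting to 0 for unseen state-action pairs.
Q = defaultdict(float)
```

Repeated calls to `q_update` on observed transitions make the table converge toward the fixed point of Equation (2); the tabular form only works while the state space stays small.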
The algorithm can be implemented as a table that records the estimated Q value for each action–state pair, which is updated as the algorithm advances. However, if the state space has high dimensionality, the tabular implementation of Q may be unfeasible. In this case, it is necessary to use some type of approximator for Q. A widely used approach is Deep Reinforcement Learning (DRL) [11], which uses a neural network as an approximator of the value function and is capable of dealing with high-dimensional state spaces. In this work, we use DQNs [13], as they have been used in other works focused on intrusion detection with good results [12,14].
In DQN, the neural network outputs an estimate of Q for each action in the current state, and an action a is selected according to the ε-greedy strategy. The target value of Q for the selected action is calculated according to Equation (3).
y = R_{t+1} + γ max_a Q(s_{t+1}, a)    (3)
The network parameters are updated by gradient descent on the mean squared loss function defined in Equation (4).
(1/n) Σ_{i=1}^{n} ( y_i − Q(s_i, a_i) )²    (4)
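On a batch, Equations (3) and (4) amount to the following computation (a minimal sketch with plain Python lists; a full DQN would also zero the bootstrap term at terminal states and evaluate the next-state Q-values with a target network, details we omit here):

```python
def dqn_targets(rewards, next_q_values, gamma=0.99):
    """Bootstrap targets y_i = r_i + gamma * max_a Q(s_{i+1}, a), Equation (3)."""
    return [r + gamma * max(q) for r, q in zip(rewards, next_q_values)]

def mse_loss(targets, q_taken):
    """Mean squared error of Equation (4) between the targets and the
    Q-values predicted for the actions actually taken."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)
```

Gradient descent on `mse_loss` with respect to the network parameters (the targets held fixed) gives the DQN update.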
We use the DQN implementation from the Python package stable-baselines3 [15].

3. Related Work

One of the main problems in detecting network attacks is data imbalance [16]: attacks tend to represent a relatively small proportion of the total traffic, and some of these attacks can be rare. This means it will be difficult for the model to identify these attacks given their scarcity. The application of ensemble models is one of the main approaches to deal with data imbalance [6] due to their general robustness to difficult instances. Works like [7] apply ensembles to the problem of intrusion detection with good results: in this work, they manage to reduce false positives by 50% using Bayesian selection to eliminate the models with the worst performance. In [17], a genetic algorithm is applied along with the ensemble to combine predictions, with an improvement of 0.05 in the F1-score. This leads us to think that RL can also significantly improve the performance of an ensemble by combining its predictions, given its greater learning capacity.
On the other hand, there are some works that apply RL to cybersecurity problems, in particular to deal with network attacks. Most try to exploit RL’s capabilities to adapt to changes in data distribution and learn with a limited amount of labeled data [9]. Several works have obtained good results by applying RL [12,18] to these problems. In [12], several RL models are applied as classifiers to detect intrusions using the NSL-KDD [19] and AWID [20] datasets. In both datasets, the RL model DDQN [21] outperforms a series of commonly used traditional ML models, such as Random Forest or Gradient Boosting, with the additional advantage of lower prediction times. In [22], RL is applied to predict the volume of different types of network traffic (transport and application layer) in IoT networks in order to detect denial of service (DoS) attacks. The RL model is compared with three other traffic prediction algorithms. RL improves the F1-score by 0.1 compared to the second-best performing algorithm, while maintaining a limited false positive rate.
Other works use the adaptability of RL and introduce additional data preprocessing techniques to guide learning and achieve some objective: generally, to improve detection in imbalanced datasets. The work in [23] applies RL to network anomaly detection in a semi-supervised environment, with only a limited amount of labeled instances. The sampling of instances in each episode is modified to maximize the use of labeled anomalous instances, and the chosen reward function penalizes false positives. The algorithm improves the area under the curve (AUC) by up to 10% compared to other anomaly detection algorithms. In [24], RL is used to generate synthetic instances to train a detector, in order to strengthen models against machine learning model evasion attacks. This approach improves the F1-score by up to 30% in the presence of evasion attacks.
In relation to our approach, some works in other fields apply RL as a tool to select the models that should be kept within the ensemble. The ensemble can be integrated directly into the RL classifier as in [25]. In this case, the algorithms use multiple DQN approximations (with shared weights in the innermost layers) to produce a diverse set of estimates that improve generalization and minimize the amount of data needed to achieve acceptable performance. On the other hand, RL can be applied to improve a set of traditional models. The work in [14] uses ensembles to train with imbalanced data and applies RL to select which models to remove from the ensemble to keep only the most informative models. This strategy outperforms other specialized ensembles on different datasets. In this work, we seek not so much to select models but to more effectively combine the predictions of the existing models, assuming that if the model does not provide useful information, the RL algorithm will not take it into account.

4. Experimental Methodology

In this section, we describe the development of the experiments and the analysis of the results.

4.1. Experimental Protocol

Our objective is to determine whether RL algorithms can improve the performance of ensemble models by effectively merging the predictions of each individual model. Therefore, the RL model will be used as a classifier. In this context, the classic MDP parameters can be modeled as follows:
  • States s_i: Instances from the dataset. The RL model will receive the stacked predictions from each model in the ensemble as input. Therefore, for the RL model, each instance will be characterized by the probabilities of belonging to each class given by each model, instead of the original features. This is illustrated in Figure 1. The label of each instance will remain unchanged.
  • Actions a_i: The possible labels to assign to the instance. The RL model will thus function as a classifier.
  • Reward R_i: We follow the strategy of previous works [12] and use a binary reward. Its value will be 1 if the selected label for the instance is correct and 0 otherwise. Although some works like [26] use rewards that take data imbalance into account, in [12] the binary reward achieves the best performance on the same problem we address in this work.
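Under these definitions, the state construction and the binary reward are straightforward; the following sketch illustrates them (names are ours for illustration, as the paper does not give implementation-level details):

```python
def build_state(per_model_probs):
    """Stack each ensemble member's class-probability vector into a single
    state vector, replacing the instance's original features."""
    return [p for probs in per_model_probs for p in probs]

def binary_reward(predicted_label, true_label):
    """Binary reward: 1 for a correct label, 0 otherwise."""
    return 1 if predicted_label == true_label else 0
```

For an ensemble of m models on a k-class problem, the state is thus a vector of m·k probabilities, and each RL action is one of the k labels.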
The procedure followed in the experiments for each dataset is as follows:
  • The base ensemble is trained, and the predictions of each component from the training and test subsets are extracted.
  • The predictions from each model in the ensemble are stacked and will be used as input to the RL model, for both training and testing.
  • The hyperparameters of the RL model are optimized, and the best configuration is selected.
  • The RL model is trained, and its predictions on the test set are extracted.
  • We calculate different metrics to evaluate the quality of the predictions from both the base model and the RL model.
To fully train and test the selected configurations, we follow the usual convention for RL [12]: instances are passed to the model in random batches. Training continues until the model has processed a number of instances equal to twice the size of the training set.
The process described in this section is repeated 5 times with different initializations for each dataset, in order to determine whether the results are statistically significant.

4.2. Models and Hyperparameters

As the base ensemble, we use the XGBoost model. The outputs of this model (the prediction probabilities for each class) are used as the state input for the RL agent. For the RL model, we apply hyperparameter optimization following the recommendations of the library’s authors [15]. We use the Optuna library [27] to carry out the optimization. We use the Tree Parzen Estimator sampler [28] and configure the number of iterations for optimization to 1000.
Table 1 shows the hyperparameter search space for the reinforcement learning model.
To adjust the hyperparameters, we take 80% of the training set to fit each configuration and the remaining 20% to evaluate its suitability.

4.3. Datasets

We will use two datasets in this work: NSL-KDD [19] and UNSW-NB15 [29]. NSL-KDD contains features extracted from synthetic traffic used for DARPA IDS evaluation. This version of the dataset includes modifications to address deficiencies of the KDD99 dataset on which it is based. The dataset contains normal traffic and 4 types of attacks: denial of service, probe, remote access, and privilege escalation. UNSW-NB15 combines real normal traffic with synthetic malicious traffic that attempts to improve similarity to real traffic. The dataset contains 9 types of attacks.
Both datasets include features that reflect both traffic statistics (bytes sent/received) and content (number of HTTP POST and GET requests).
For both datasets, we will use the training and test data split provided by the authors. The class distribution for both datasets is shown in Table 2. As can be observed, in both datasets, there is significant imbalance among the classes, so we can effectively evaluate whether the use of RL allows for improving the performance of ensembles in data imbalance situations.

4.4. Analysis

We will use the F1-score as the performance metric for the models. It is defined in Equation (5), where TP denotes true positives, FP false positives, and FN false negatives. We use this metric because it is more informative than accuracy when working with imbalanced data, as is the case in this work.
F1 = 2·TP / (2·TP + FP + FN)    (5)
To apply the metric to multiclass problems, it is necessary to average the F1-score over the different classes. We report both the weighted and macro averages of the F1-score to give a more complete view of performance. The weighted average weights each class's F1-score by its frequency (support), so it can give a more optimistic value if a poorly detected class is very infrequent. The macro average computes the F1-score independently for each class and takes the unweighted mean, which can lead to lower values if the F1-score for any of the minority classes is low.
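The difference between the two averages can be made concrete with a small sketch (pure Python; the `counts` layout of per-class (TP, FP, FN, support) tuples is our own choice for illustration):

```python
def f1(tp, fp, fn):
    """Per-class F1 as in Equation (5)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1(tp, fp, fn) for tp, fp, fn, _ in counts) / len(counts)

def weighted_f1(counts):
    """Per-class F1 weighted by class frequency (support)."""
    total = sum(support for *_, support in counts)
    return sum(f1(tp, fp, fn) * support
               for tp, fp, fn, support in counts) / total

# A frequent, well-detected class and a rare, poorly detected one:
counts = [(90, 5, 5, 95), (1, 4, 9, 10)]
```

On this example the weighted F1 is about 0.87, dominated by the majority class, while the macro F1 of about 0.54 exposes the weak minority class — exactly the gap discussed above.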
Furthermore, we will perform a statistical analysis of the weighted and macro F1-score values obtained to determine the validity of our approach. The process is detailed below:
1. First, we apply the Kolmogorov–Smirnov test for normality with the Lilliefors correction. We obtain non-significant results in all cases, so we apply parametric inferential analyses and use the mean as a measure of central tendency.
2. We apply Student's t-test for paired samples between the F1 values obtained by the base ensemble and the predictions adjusted by RL. If there are significant differences, we carry out a descriptive analysis.
For all analyses, a significance threshold of α = 0.05 is set.
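Step 2 above reduces to the following computation (a sketch of the paired t statistic in pure Python; in practice a library routine such as scipy.stats.ttest_rel would also return the p-value, which we do not reproduce here):

```python
import math

def paired_t(xs, ys):
    """Student's t for paired samples: t = mean(d) / (sd(d) / sqrt(n)),
    where d holds the per-run differences and sd is the sample
    standard deviation (n - 1 denominator)."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)
```

With five repetitions per dataset, the statistic is compared against the t distribution with 4 degrees of freedom at the α = 0.05 threshold.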

5. Discussion of Results

In this section, the results obtained with the DQN-based approach on the NSL-KDD and UNSW-NB15 datasets are presented and analyzed. To do this, the results obtained in the five repetitions of the experiment are compared, and the effectiveness of the prediction combination strategy in each case is discussed.

5.1. Results in NSL-KDD

Table 3 shows the results of the statistical analyses, and Table 4 shows the results obtained for the NSL-KDD dataset with different random seed initializations in the training of the DQN model.
The analysis indicates statistically significant differences between the results of the base XGBoost model and the RL model for both macro F1 and weighted F1 (Table 3, p < 0.01 for both metrics). Therefore, we can conclude that the observed differences are not due to chance, but reflect a real improvement produced by the DQN model.
It is observed that the best configuration, corresponding to seed 1062237619, achieves a macro F1 of 0.6426 and a weighted F1 of 0.8050, surpassing the other initializations. These results are superior to those achieved by the base XGBoost model, which obtains, for the same seed, a macro F1 of 0.6026 and a weighted F1 of 0.7788. The average improvement is 0.041 in macro F1 and 0.035 in weighted F1. This reinforces the idea that an adaptive combination of predictions can be more effective than a static aggregation, allowing DQN to assign greater weight to the predictions that truly provide relevant information.
To analyze the results in more detail, Table 5 and Table 6 show the normalized confusion matrices of the best results achieved by the base XGBoost model and the RL model, respectively. It can be seen that the RL model improves the performance of the base XGBoost model, although the improvement is not uniform across all classes. The DoS, Probe, and R2L classes see improvements in their hit rate (values on the diagonal) of 0.157, 0.044, and 0.040, respectively. The Normal class shows a slight improvement (0.01), while the U2R class (which already had a very high base performance of 0.970) suffers a slight decrease of 0.003.
The most substantial improvement occurs in the DoS class. On the other hand, the R2L class, despite improving, still has a low hit rate (0.085), starting from 0.045 in the base model. Therefore, although there is a significant improvement in performance, the model does not guarantee that the improvement will be balanced or particularly benefit the classes with the worst results.
This disparity in improvement may be due to several factors. On the one hand, there is the use of a binary reward function: no distinction is made between correct predictions of different classes, so minority classes implicitly carry less weight due to their lower frequency. A reward function that gives more weight to minority classes could improve their performance; however, in [12], it is mentioned that this strategy is not useful with individual models. On the other hand, for the global prediction to improve upon that of the individual models, they need a hit rate higher than random class selection. As can be seen in the confusion matrix (Table 5), the hit rate for the R2L class is 0.045, so an improvement cannot be guaranteed in this case.
In any case, for the five seeds tested, DQN produces improvements in all cases, for both F1 averages. However, there is a certain variability in DQN's performance, particularly in weighted F1, where the model reaches 0.8303 in the best case and 0.8004 in the worst, even when evaluating on the same dataset. This variability between runs is an aspect to consider when implementing this approach, as small differences in the learned policy can significantly impact the model's performance.

5.2. Results in UNSW-NB15

Table 7 shows the results of the statistical analyses, and Table 8 shows the results obtained for the UNSW-NB15 dataset with different random seed initializations in the DQN model training. The analysis indicates statistically significant differences between the results of the base XGBoost model and the RL model for both macro F1 and weighted F1 (Table 7). Therefore, we can conclude that the observed differences are not due to chance. However, the nature of these differences is mixed and markedly different from that observed in NSL-KDD. The RL model shows a slight but consistent improvement in weighted F1 (t = 5.675, p = 0.0048), but a statistically significant worsening in macro F1 (t = 6.775, p = 0.0025).
As shown in Table 8, the RL model's macro F1 is consistently lower than that of the base XGBoost model across all seeds, while the weighted F1 improves in all cases. On average, macro F1 worsens by 0.043, while weighted F1 improves by 0.015. This strongly suggests that the DQN has learned a policy that optimizes performance for the majority classes (increasing weighted F1) at the expense of the minority classes (sinking macro F1). The configuration with the highest RL macro F1 (seed 2112) achieves a value of 0.4696, which is lower than the macro F1 of its corresponding base model (0.5072).
To analyze this phenomenon in more detail, Table 9 and Table 10 show the normalized confusion matrices of the results for seed 2112. A very disparate behavior can be observed: the RL model has learned to completely sacrifice several classes. The Analysis (0), Backdoors (1), and Worms (9) classes see their true positive rate (TPR) plummet to 0.0 (from 0.034, 0.160, and 0.432, respectively). Performance also worsens for the Exploits (3), Normal (4), Reconnaissance (7), and Shellcode (8) classes.
All this lost precision seems to be transferred to dramatically improve the DoS (2) class, which goes from a TPR of 0.122 to 0.577, and the Generic (6) class, which goes from 0.775 to 0.857. The Fuzzers (5) class (with a base TPR of 0.971) remains almost unchanged.
This disparity may be due to the use of a binary reward function, which does not distinguish between correct predictions of different classes. In this dataset, the DQN agent appears to have determined that it is more "profitable" (in terms of total reward) to be correct on the DoS and Generic classes than on minority classes like Analysis, Backdoors, or Worms, collapsing its policy to a solution that ignores the latter.
For the five seeds tested, DQN produces consistent improvements in weighted F1, but consistent worsening in macro F1. Unlike NSL-KDD, there is very low variability in DQN's performance for weighted F1, where the model reaches 0.8058 in the best case and 0.7986 in the worst. This low variability, combined with the worsening of macro F1, suggests that the model consistently converges to a policy that is sub-optimal for balanced performance on this dataset.

6. Conclusions and Future Work

In this work, we have explored the use of reinforcement learning for combining predictions in ensemble models applied to network intrusion detection. Our proposal is based on RL's ability to dynamically adapt the weighting of each model within the ensemble, thus overcoming the limitations of traditional weighted voting methods. The results obtained on the NSL-KDD and UNSW-NB15 datasets show a mixed picture regarding DQN's ability to manage environments with high class imbalance. For NSL-KDD, the RL-based approach managed to outperform the base XGBoost model, with significant improvements in both the macro F1 and weighted F1 metrics. This suggests that, in this scenario, the DQN was able to learn a beneficial combination policy.
However, in UNSW-NB15, the results demonstrate the risks of this approach: while the weighted F1 (influenced by majority classes) saw a slight improvement, the macro F1 was significantly harmed. As discussed in Section 5, the RL agent learned to optimize the reward by sacrificing the performance of multiple minority classes (like 'Analysis' or 'Backdoors'), demonstrating that performance improvement is not guaranteed and can, in fact, exacerbate the imbalance rather than mitigate it.
Therefore, depending on the objective of the ensemble, the application of RL may require additional effort to ensure the improvement aligns with specific parameters, whether it be better detection in minority classes or specific classes.
As future lines of work, the application of other reinforcement learning approaches, such as policy-based methods (e.g., PPO or A2C), will be explored to evaluate their performance in the task of combining predictions. The design of reward functions that explicitly penalize failure in minority classes will also be investigated to guide the agent towards more balanced solutions than those obtained in UNSW-NB15. Secondly, the impact of data preprocessing techniques, such as class balancing through synthetic sample generation or the selection of more representative features, will be investigated to further improve the model’s generalization capability.

Author Contributions

Conceptualization, D.E.G. and N.D.-G.; methodology, D.E.G.; software, D.E.G.; validation, D.E.G., Ó.M.-G. and J.C.S.N.; formal analysis, D.E.G.; investigation, D.E.G.; writing—original draft, D.E.G.; writing—review and editing, Ó.M.-G., J.C.S.N. and N.D.-G.; supervision, N.D.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This publication is part of the project Data science for an artificial intelligence model in cybersecurity (C073/23) with key X67, funded by the European Union NextGeneration-EU, Recovery, Transformation and Resilience Plan, through INCIBE. It has also been carried out within the framework of the funds from the Recovery, Transformation and Resilience Plan, financed by the European Union (Next Generation)—National Institute of Cybersecurity (INCIBE) in Project C108/23 “Detection of Identity Document Forgery using Computer Vision and Artificial Intelligence Techniques”.

Institutional Review Board Statement

Not applicable. The study did not involve humans or animals.

Informed Consent Statement

Not applicable. The study did not involve humans.

Data Availability Statement

The datasets used in this study are publicly available. The NSL-KDD dataset [19] is available at https://www.unb.ca/cic/datasets/nsl.html (accessed on 20 October 2025). The UNSW-NB15 dataset [29] is available at https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/ (accessed on 20 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
RL: Reinforcement Learning
IDS: Intrusion Detection System
MDP: Markov Decision Process
DQN: Deep Q-Network
DRL: Deep Reinforcement Learning
LR: Logistic Regression
DT: Decision Tree
KNN: K-Nearest Neighbors
MLP: Multilayer Perceptron
DoS: Denial of Service
AUC: Area Under the Curve

References

  1. Sethi, K.; Sai Rupesh, E.; Kumar, R.; Bera, P.; Venu Madhav, Y. A context-aware robust intrusion detection system: A reinforcement learning-based approach. Int. J. Inf. Secur. 2019, 19, 657–678. [Google Scholar] [CrossRef]
  2. Thakkar, A.; Lohiya, R. A survey on intrusion detection system: Feature selection, model, performance measures, application perspective, challenges, and future research directions. Artif. Intell. Rev. 2021, 55, 453–563. [Google Scholar] [CrossRef]
  3. Shahraki, A.; Abbasi, M.; Taherkordi, A.; Jurcut, A.D. A comparative study on online machine learning techniques for network traffic streams analysis. Comput. Netw. 2022, 207, 108836. [Google Scholar] [CrossRef]
  4. Gibert, D.; Mateu, C.; Planes, J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J. Netw. Comput. Appl. 2020, 153, 102526. [Google Scholar] [CrossRef]
  5. Kegelmeyer, W.P.; Chiang, K.; Ingram, J. Streaming Malware Classification in the Presence of Concept Drift and Class Imbalance. In Proceedings of the 2013 12th International Conference on Machine Learning and Applications, Miami, FL, USA, 4–7 December 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 48–53. [Google Scholar] [CrossRef]
  6. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef]
  7. Wu, Z.; Gao, P.; Cui, L.; Chen, J. An Incremental Learning Method Based on Dynamic Ensemble RVM for Intrusion Detection. IEEE Trans. Netw. Serv. Manag. 2022, 19, 671–685. [Google Scholar] [CrossRef]
  8. Alavizadeh, H.; Alavizadeh, H.; Jang-Jaccard, J. Deep Q-Learning Based Reinforcement Learning Approach for Network Intrusion Detection. Computers 2022, 11, 41. [Google Scholar] [CrossRef]
  9. Louati, F.; Ktata, F.B.; Amous, I. Enhancing Intrusion Detection Systems with Reinforcement Learning: A Comprehensive Survey of RL-based Approaches and Techniques. SN Comput. Sci. 2024, 5. [Google Scholar] [CrossRef]
  10. Yungaicela-Naula, N.M.; Vargas-Rosales, C.; Pérez-Díaz, J.A. SDN/NFV-based framework for autonomous defense against slow-rate DDoS attacks by using reinforcement learning. Future Gener. Comput. Syst. 2023, 149, 637–649. [Google Scholar] [CrossRef]
  11. Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B.; Miao, Q. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5064–5078. [Google Scholar] [CrossRef]
  12. Lopez-Martin, M.; Carro, B.; Sanchez-Esguevillas, A. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Syst. Appl. 2020, 141, 112963. [Google Scholar] [CrossRef]
  13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  14. Usman, M.; Chen, H. EMRIL: Ensemble Method based on ReInforcement Learning for binary classification in imbalanced drifting data streams. Neurocomputing 2024, 605, 128259. [Google Scholar] [CrossRef]
  15. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  16. Gupta, N.; Jindal, V.; Bedi, P. CSE-IDS: Using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in network-based intrusion detection systems. Comput. Secur. 2022, 112, 102499. [Google Scholar] [CrossRef]
  17. Shyaa, M.A.; Zainol, Z.; Abdullah, R.; Anbar, M.; Alzubaidi, L.; Santamaría, J. Enhanced Intrusion Detection with Data Stream Classification and Concept Drift Guided by the Incremental Learning Genetic Programming Combiner. Sensors 2023, 23, 3736. [Google Scholar] [CrossRef] [PubMed]
  18. Mohamed, S.; Ejbali, R. Deep SARSA-based reinforcement learning approach for anomaly network intrusion detection system. Int. J. Inf. Secur. 2022, 22, 235–247. [Google Scholar] [CrossRef]
  19. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 8–10 July 2009; IEEE: Ottawa, ON, Canada, 2009; pp. 1–6. [Google Scholar] [CrossRef]
  20. Kolias, C.; Kambourakis, G.; Stavrou, A.; Gritzalis, S. Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset. IEEE Commun. Surv. Tutor. 2016, 18, 184–208. [Google Scholar] [CrossRef]
  21. van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16), Phoenix, AZ, USA, 12–17 February 2016; AAAI Press: Palo Alto, CA, USA, 2016; pp. 2094–2100. [Google Scholar]
  22. Nie, L.; Sun, W.; Wang, S.; Ning, Z.; Rodrigues, J.J.P.C.; Wu, Y.; Li, S. Intrusion Detection in Green Internet of Things: A Deep Deterministic Policy Gradient-Based Algorithm. IEEE Trans. Green Commun. Netw. 2021, 5, 778–788. [Google Scholar] [CrossRef]
  23. Pang, G.; van den Hengel, A.; Shen, C.; Cao, L. Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data. In KDD ’21 Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; ACM: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  24. Apruzzese, G.; Andreolini, M.; Marchetti, M.; Venturi, A.; Colajanni, M. Deep Reinforcement Adversarial Learning Against Botnet Evasion Attacks. IEEE Trans. Netw. Serv. Manag. 2020, 17, 1975–1987. [Google Scholar] [CrossRef]
  25. Agarwal, R.; Schuurmans, D.; Norouzi, M. An Optimistic Perspective on Offline Reinforcement Learning. Proc. Mach. Learn. Res. 2020, 119, 104–114. [Google Scholar]
  26. Al-Fawa’reh, M.; Abu-Khalaf, J.; Szewczyk, P.; Kang, J.J. MalBoT-DRL: Malware Botnet Detection Using Deep Reinforcement Learning in IoT Networks. IEEE Internet Things J. 2024, 11, 9610–9629. [Google Scholar] [CrossRef]
  27. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  28. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In NIPS’11 Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 2546–2554. [Google Scholar]
  29. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Input to the RL model.
Table 1. Hyperparameters of the DQN model.
Parameter | Values
learning_rate | [0.00001, 1]
batch_size | [16, 32, 64, 100, 128, 256, 512]
buffer_size | [10,000, 100,000, 1,000,000]
exploration_final_eps | [0, 0.2]
exploration_fraction | [0, 0.5]
target_update_interval | [1, 1000, 5000, 10,000, 15,000, 20,000]
learning_starts | [0, 1000, 5000, 10,000, 20,000]
train_freq | [1, 4, 8, 16, 128, 256, 1000]
subsample_steps | [1, 2, 4, 8]
net_arch | {[64, 64], [256, 256]}
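The search space of Table 1 can be encoded directly in code. The sketch below uses only the standard library: the ranges are taken from the table, while the random sampling routine is a simplified stand-in for the Optuna-based search [27], and log-uniform sampling of the learning rate is an assumption on our part:

```python
import math
import random

# Search space from Table 1: continuous intervals as (low, high) tuples,
# discrete choices as lists.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1.0),          # sampled log-uniformly (assumption)
    "batch_size": [16, 32, 64, 100, 128, 256, 512],
    "buffer_size": [10_000, 100_000, 1_000_000],
    "exploration_final_eps": (0.0, 0.2),
    "exploration_fraction": (0.0, 0.5),
    "target_update_interval": [1, 1000, 5000, 10_000, 15_000, 20_000],
    "learning_starts": [0, 1000, 5000, 10_000, 20_000],
    "train_freq": [1, 4, 8, 16, 128, 256, 1000],
    "subsample_steps": [1, 2, 4, 8],
    "net_arch": [[64, 64], [256, 256]],
}

def sample_config(rng=random):
    """Draw one hyperparameter configuration from the space above."""
    cfg = {}
    for name, space in SEARCH_SPACE.items():
        if isinstance(space, tuple):        # continuous interval
            low, high = space
            if name == "learning_rate":     # spans several orders of magnitude
                cfg[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
            else:
                cfg[name] = rng.uniform(low, high)
        else:                               # discrete choice
            cfg[name] = rng.choice(space)
    return cfg

cfg = sample_config()
```

Each sampled configuration would then be passed to the DQN trainer and scored, with the search keeping the best-performing trial.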
Table 2. Distribution of the datasets.
Dataset | Class | Training Frequency | Test Frequency
NSL-KDD | Normal | 67,343 (53.45%) | 9711 (43.07%)
NSL-KDD | DoS | 45,927 (36.45%) | 7460 (33.08%)
NSL-KDD | Probe | 11,656 (9.25%) | 2421 (10.73%)
NSL-KDD | R2L | 995 (0.79%) | 2885 (12.21%)
NSL-KDD | U2R | 52 (0.041%) | 67 (0.88%)
NSL-KDD | Total | 125,973 | 22,544
UNSW-NB15 | Analysis | 2000 (1.14%) | 677 (0.82%)
UNSW-NB15 | Backdoor | 1746 (0.99%) | 583 (0.70%)
UNSW-NB15 | DoS | 12,264 (6.99%) | 4089 (4.96%)
UNSW-NB15 | Exploit | 33,393 (19.04%) | 11,132 (13.52%)
UNSW-NB15 | Normal | 56,000 (31.93%) | 37,000 (44.93%)
UNSW-NB15 | Fuzzers | 18,184 (10.37%) | 6062 (7.36%)
UNSW-NB15 | Generic | 40,000 (22.81%) | 18,871 (22.92%)
UNSW-NB15 | Recon | 10,491 (5.98%) | 3496 (4.24%)
UNSW-NB15 | Shellcode | 1133 (0.64%) | 378 (0.45%)
UNSW-NB15 | Worm | 130 (0.07%) | 44 (0.05%)
UNSW-NB15 | Total | 175,341 | 82,332
Table 3. Results of the Student’s t-test comparing the metrics of the base XGBoost model and the RL model for NSL-KDD.
Metric | Statistic | p-Value
F1-macro | 5.0971 | 0.0070
F1-weighted | 6.0187 | 0.0038
Table 4. Results for NSL-KDD.
Experiment (Seed) | Macro F1 (Base) | Macro F1 (RL) | Weighted F1 (Base) | Weighted F1 (RL)
1062237619 | 0.6026 | 0.6426 | 0.7788 | 0.8050
2112 | 0.5797 | 0.6105 | 0.7804 | 0.8027
249651232 | 0.5651 | 0.6347 | 0.7680 | 0.8209
308118868 | 0.5754 | 0.5976 | 0.7709 | 0.8004
798844875 | 0.5865 | 0.6273 | 0.7857 | 0.8303
Mean | 0.5818 | 0.6225 | 0.7768 | 0.8119
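The statistics in Table 3 can be reproduced from the per-seed results in Table 4 with a paired Student's t-test over the five seeds. A standard-library sketch (the small discrepancy with the reported 5.0971 comes from the four-decimal rounding of the scores in Table 4):

```python
import math
from statistics import mean, stdev

# Macro-F1 per seed from Table 4 (NSL-KDD).
base = [0.6026, 0.5797, 0.5651, 0.5754, 0.5865]
rl   = [0.6426, 0.6105, 0.6347, 0.5976, 0.6273]

# Paired Student's t statistic on the per-seed differences.
diffs = [r - b for r, b in zip(rl, base)]
t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
# t comes out near 5.09, matching the 5.0971 of Table 3 up to rounding.
```

The same computation with scipy.stats.ttest_rel would additionally return the two-sided p-value against the t distribution with 4 degrees of freedom.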
Table 5. Normalized confusion matrix of the base XGBoost model (Best seed: 1062237619).
True Class \ Predicted Class | Normal | DoS | Probe | R2L | U2R
Normal | 0.7786 | 0.0286 | 0.0000 | 0.0000 | 0.1928
DoS | 0.0454 | 0.6848 | 0.0000 | 0.0000 | 0.2697
Probe | 0.0004 | 0.0025 | 0.3500 | 0.0004 | 0.6467
R2L | 0.0050 | 0.0050 | 0.0000 | 0.0450 | 0.9450
U2R | 0.0044 | 0.0213 | 0.0038 | 0.0000 | 0.9704
Table 6. Normalized confusion matrix of the RL model (Best seed: 1062237619).
True Class \ Predicted Class | Normal | DoS | Probe | R2L | U2R
Normal | 0.7884 | 0.0320 | 0.0000 | 0.0000 | 0.1795
DoS | 0.0582 | 0.8414 | 0.0000 | 0.0000 | 0.1004
Probe | 0.0000 | 0.0795 | 0.3936 | 0.0007 | 0.5261
R2L | 0.0000 | 0.3150 | 0.0050 | 0.0850 | 0.5950
U2R | 0.0084 | 0.0232 | 0.0004 | 0.0001 | 0.9679
Table 7. Results of the Student’s t-test comparing the metrics of the base XGBoost model and RL model for UNSW-NB15.
Metric | Statistic | p-Value
F1-macro | −6.7754 | 0.0025
F1-weighted | 5.6754 | 0.0048
Table 8. Results for UNSW-NB15.
Experiment (Seed) | Macro F1 (Base) | Macro F1 (RL) | Weighted F1 (Base) | Weighted F1 (RL)
1062237619 | 0.4913 | 0.4631 | 0.7914 | 0.7986
2112 | 0.5072 | 0.4696 | 0.7842 | 0.8058
249651232 | 0.5052 | 0.4426 | 0.7886 | 0.8042
308118868 | 0.4820 | 0.4479 | 0.7863 | 0.8038
798844875 | 0.4917 | 0.4388 | 0.7912 | 0.8017
Mean | 0.4955 | 0.4524 | 0.7883 | 0.8028
Table 9. Normalized confusion matrix of the base XGBoost model (Seed: 2112).
True Class \ Predicted Class | An | Ba | DoS | Ex | No | Fu | Ge | Re | Sh | Wo
Analysis | 0.0340 | 0.1403 | 0.1152 | 0.5628 | 0.1433 | 0.0000 | 0.0044 | 0.0000 | 0.0000 | 0.0000
Backdoors | 0.0395 | 0.1595 | 0.1269 | 0.4717 | 0.1818 | 0.0103 | 0.0051 | 0.0000 | 0.0051 | 0.0000
DoS | 0.0408 | 0.1551 | 0.1215 | 0.6026 | 0.0399 | 0.0081 | 0.0127 | 0.0068 | 0.0122 | 0.0002
Exploits | 0.0162 | 0.0567 | 0.0313 | 0.8232 | 0.0299 | 0.0034 | 0.0136 | 0.0159 | 0.0093 | 0.0005
Normal | 0.0106 | 0.0325 | 0.0277 | 0.1727 | 0.5762 | 0.0008 | 0.1391 | 0.0028 | 0.0376 | 0.0000
Fuzzers | 0.0000 | 0.0002 | 0.0028 | 0.0208 | 0.0024 | 0.9713 | 0.0009 | 0.0001 | 0.0014 | 0.0001
Generic | 0.0126 | 0.0000 | 0.0012 | 0.0227 | 0.1848 | 0.0001 | 0.7746 | 0.0004 | 0.0034 | 0.0000
Reconnaissance | 0.0051 | 0.0203 | 0.0074 | 0.1430 | 0.0066 | 0.0000 | 0.0034 | 0.8012 | 0.0129 | 0.0000
Shellcode | 0.0000 | 0.0000 | 0.0132 | 0.1190 | 0.1005 | 0.0026 | 0.0238 | 0.0053 | 0.7355 | 0.0000
Worms | 0.0000 | 0.0000 | 0.0227 | 0.4318 | 0.0682 | 0.0000 | 0.0000 | 0.0000 | 0.0455 | 0.4318
(An = Analysis, Ba = Backdoors, Ex = Exploits, No = Normal, Fu = Fuzzers, Ge = Generic, Re = Reconnaissance, Sh = Shellcode, Wo = Worms)
Table 10. Normalized confusion matrix of the RL model (Seed: 2112).
True Class \ Predicted Class | An | Ba | DoS | Ex | No | Fu | Ge | Re | Sh | Wo
Analysis | 0.0000 | 0.0000 | 0.6573 | 0.2659 | 0.0532 | 0.0000 | 0.0222 | 0.0015 | 0.0000 | 0.0000
Backdoors | 0.0000 | 0.0000 | 0.6398 | 0.2281 | 0.0652 | 0.0103 | 0.0154 | 0.0360 | 0.0051 | 0.0000
DoS | 0.0000 | 0.0000 | 0.5769 | 0.3656 | 0.0171 | 0.0078 | 0.0127 | 0.0081 | 0.0117 | 0.0000
Exploits | 0.0000 | 0.0000 | 0.2182 | 0.7251 | 0.0149 | 0.0038 | 0.0213 | 0.0128 | 0.0040 | 0.0000
Normal | 0.0000 | 0.0000 | 0.1501 | 0.1290 | 0.3895 | 0.0018 | 0.3098 | 0.0028 | 0.0170 | 0.0000
Fuzzers | 0.0000 | 0.0000 | 0.0051 | 0.0192 | 0.0015 | 0.9717 | 0.0018 | 0.0002 | 0.0005 | 0.0000
Generic | 0.0000 | 0.0000 | 0.0029 | 0.0208 | 0.1168 | 0.0001 | 0.8570 | 0.0005 | 0.0019 | 0.0000
Reconnaissance | 0.0000 | 0.0000 | 0.0698 | 0.1287 | 0.0031 | 0.0003 | 0.0057 | 0.7898 | 0.0026 | 0.0000
Shellcode | 0.0000 | 0.0000 | 0.0608 | 0.2037 | 0.0847 | 0.0000 | 0.0635 | 0.0053 | 0.5820 | 0.0000
Worms | 0.0000 | 0.0000 | 0.0909 | 0.6818 | 0.0682 | 0.1136 | 0.0455 | 0.0000 | 0.0000 | 0.0000
(An = Analysis, Ba = Backdoors, Ex = Exploits, No = Normal, Fu = Fuzzers, Ge = Generic, Re = Reconnaissance, Sh = Shellcode, Wo = Worms)
