Proceeding Paper

Reinforcement Learning for the Optimization of Adaptive Intrusion Detection Systems †

by Óscar Mogollón-Gutiérrez 1,*, David Escudero García 2, José Carlos Sancho Núñez 1 and Noemí DeCastro-García 3

1 Centro Universitario de Mérida, Universidad de Extremadura, 06800 Mérida, Spain
2 Research Institute of Applied Science in Cybersecurity (RIASC), Universidad de León, 24071 León, Spain
3 Departamento de Matemáticas, Universidad de León, 24071 León, Spain
* Author to whom correspondence should be addressed.
Presented at the First Summer School on Artificial Intelligence in Cybersecurity, Cancun, Mexico, 3–7 November 2025.
Eng. Proc. 2026, 123(1), 2; https://doi.org/10.3390/engproc2026123002
Published: 29 January 2026
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)

Abstract

Network intrusion detection is increasingly important due to the annual rise in attacks. In the literature, machine learning is one of the most common mechanisms for improving the performance of detection systems. One of the main problems with these approaches is data imbalance: the volume of malicious traffic is much lower than that of normal traffic, making it difficult to train an effective model. Ensemble models, which combine several individual models, can increase robustness against data imbalance. To further improve the effectiveness of ensembles against imbalance, in this work we apply reinforcement learning to combine the individual predictions of the ensemble's models, with the objective of improving predictions compared to classic weighted voting algorithms.

1. Introduction

The rapid development of Internet technologies, as well as the implementation of different types of applications, has made protecting network services increasingly challenging. Classic security measures, such as firewalls and access control systems, are necessary to protect the network from external attacks. However, these tools have limitations in detecting internal attacks or complicated intrusion cases. Consequently, intrusion detection systems (IDSs) have become an essential set of defenses designed to detect malicious activities that evade simpler security countermeasures like firewalls. IDSs can be host-based or network-based, each with a different approach, from monitoring unauthorized logins to analyzing network packets [1].
In particular, the use of machine learning tools is one of the most used approaches in the literature to improve the effectiveness of IDSs [2]. One of the main problems in applying machine learning, not only to the problem of intrusion detection [3], but to others like malware detection [4], is data imbalance: in general, the proportion of malicious traffic will be significantly lower than normal, making it difficult for the model to learn to distinguish them effectively. Furthermore, intrusion detection requires constant network monitoring [3] and implies that models must be kept updated. Imbalance is particularly negative in this scenario [5], as it can lead to an underweighting of malicious instances.
To limit the impact of imbalance, one of the proposed techniques is the application of ensemble models [6], in which several different models are trained in parallel and whose predictions are combined, generally with a weighted voting algorithm [7] that gives more importance to the predictions of the best models.
In this work, we analyze the effectiveness of applying reinforcement learning (RL) to combine the individual predictions of each model in the ensemble and thus improve the overall prediction. We chose RL because it has proven effective for improving IDSs [8] and has the capacity to process large volumes of data, not only from a detection standpoint [9], but also to dynamically select defensive measures once an attack has been detected [10]. Our hypothesis is that reinforcement learning can provide better predictions than voting algorithms by analyzing the patterns in the predictions of the individual models more deeply. For example, one of the models might perfectly distinguish between normal traffic and a denial-of-service attack, but be unable to distinguish between the other attacks. Although by itself it would be of little use and a voting algorithm might assign it a low weight in the ensemble, it should be taken into account whenever its prediction tends to be accurate. Therefore, we believe that the adaptability of RL can make intrusion detection more robust and intelligent in mitigating evolving cyber threats.
The rest of the article is organized as follows. In Section 2, we present the fundamentals of reinforcement learning, including its theoretical formulation and its application in the context of intrusion detection. Section 3 reviews related works that have addressed similar problems, analyzing previous approaches based on ensemble models and reinforcement learning techniques. In Section 4, we detail the experimental methodology used, including the description of the datasets, the models employed, and the evaluation procedures. Subsequently, in Section 5, the results obtained are presented and analyzed, comparing the performance of the proposed approach with traditional strategies. Finally, in Section 6, the conclusions of the work are presented and possible future lines of research are discussed.

2. Fundamentals of Reinforcement Learning

Reinforcement learning [11] is a branch of machine learning whose objective is to obtain agents that receive information from an environment and act to maximize a numerical reward function, which guides the model to achieve the desired objective, whether it be moving a robot or setting a defense policy to mitigate an attack. Reinforcement learning is commonly formalized as a Markov Decision Process (MDP), a tuple ⟨S, A, T, R⟩, where S denotes the set of states, A the set of actions, T a map that determines the transition from a state–action pair to a new state, and R the reward function that assigns a reward to each action performed in each possible state.
Following the example of attack mitigation, the state reflects the information from the environment that the agent receives, such as the number of flows marked as suspicious by the IDS from that IP; the actions define how the RL agent can interact with the environment, for example, by allowing, blocking, or analyzing the connection in greater depth; T reflects the state change that occurs when taking an action in a specific state. In the case of a robot’s movement, the relative position can be part of the state, so moving in any direction will modify the state for the next iteration. However, in other problems such as attack mitigation, the state transition will generally be completely unknown. Finally, the reward assigns a numerical value to the action taken in a specific state, depending on how it contributes to achieving the agent’s objective. The reward guides the agent’s behavior, so it is important to set it appropriately: in the mitigation example, if only whether the action is correct (blocking malicious and allowing the rest) is taken into account, the agent might choose to analyze a large number of connections more deeply to avoid errors, but the analysis entails an additional cost and may not be feasible. Therefore, the reward could include a component that penalizes the cost of the selected measure to avoid this situation.
According to the RL approach, the agent interacts sequentially with the environment: at each moment it receives a state, selects an action that triggers a transition to a new state, receives a reward, and repeats the process until the objective is reached or a prefixed number of iterations is exceeded. The agent's objective is to determine a policy, which assigns to each state an action that maximizes the accumulated reward. Therefore, the reward must have a certain correspondence with the objective to be achieved. The accumulated reward is modeled using a value function such as the Bellman Q-function (Equation (1)). This function is the expected accumulated reward for taking action a in state s. Here, π denotes the policy, R_{t+1} the immediate reward for taking action a in state s, and γ a discount factor: higher values give more importance to long-term rewards than to immediate ones.
Q(s, a) = E_π[ R_{t+1} + γ Q(s_{t+1}, a_{t+1}) | s_t = s, a_t = a ]    (1)
By maximizing the optimal Q-function Q*(s, a), the optimal policy π*(a|s) can be obtained. However, in many cases, T and R are not known in advance [12], so the Bellman Q-function cannot be used directly, and it is necessary to use some approach, generally iterative, to approximate the value function and obtain the optimal policy. In this work, we have opted to use Deep Q-Networks, which are described below.

Deep Q-Networks

Deep Q-Networks (DQN) [13] apply neural networks as function approximators within Q-learning, which approximates the Q-value function iteratively using Equation (2). α is a parameter that controls the size of the update. The algorithm iteratively updates the estimated values using the estimate of the optimal action's value in the next state.
Q(s_t, a_t) = (1 − α) Q(s_t, a_t) + α [ R_{t+1} + γ max_a Q(s_{t+1}, a) ]    (2)
To balance exploitation (selecting actions that maximize Q) and exploration (selecting new actions to determine their effectiveness), Q-learning adopts an ε-greedy strategy during training. Under this strategy, a_t is a random action with probability ε and the optimal action (according to current knowledge) with probability 1 − ε. The value of ε is usually changed dynamically during training, for example, by using a higher value at the start of training and decreasing it as training progresses, so as to exploit the collected knowledge to optimize Q.
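The ε-greedy selection and the tabular update of Equation (2) can be sketched in a few lines (an illustrative sketch; the function names and table layout are our own choices, not taken from the paper's implementation):

```python
import random
from collections import defaultdict

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random action); otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, as in Equation (2):
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

# Q as a table defaulting to 0 for unseen state-action pairs.
Q = defaultdict(float)
```

Repeated calls to `q_update` on observed transitions make the table converge toward the fixed point of Equation (2); the tabular form only works while the state space stays small.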
The algorithm can be implemented as a table that records the estimated Q value for each action–state pair, which is updated as the algorithm advances. However, if the state space has high dimensionality, the tabular implementation of Q may be unfeasible. In this case, it is necessary to use some type of approximator for Q. A widely used approach is Deep Reinforcement Learning (DRL) [11], which uses a neural network as an approximator of the value function and is capable of dealing with high-dimensional state spaces. In this work, we use DQNs [13], as they have been used in other works focused on intrusion detection with good results [12,14].
In DQN, the neural network outputs an estimate of Q for each action in the current state, and an action a is selected according to the ε-greedy strategy. The target value of Q for the selected action is calculated according to Equation (3).
y = R_{t+1} + γ max_a Q(s_{t+1}, a)    (3)
The network parameters are updated by gradient descent on the mean squared loss function defined in Equation (4).
(1/n) Σ_{i=1}^{n} ( y_i − Q(s_i, a_i) )²    (4)
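On a batch, Equations (3) and (4) amount to the following computation (a minimal sketch with plain Python lists; a full DQN would also zero the bootstrap term at terminal states and evaluate the next-state Q-values with a target network, details we omit here):

```python
def dqn_targets(rewards, next_q_values, gamma=0.99):
    """Bootstrap targets y_i = r_i + gamma * max_a Q(s_{i+1}, a), Equation (3)."""
    return [r + gamma * max(q) for r, q in zip(rewards, next_q_values)]

def mse_loss(targets, q_taken):
    """Mean squared error of Equation (4) between the targets and the
    Q-values predicted for the actions actually taken."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)
```

Gradient descent on `mse_loss` with respect to the network parameters (the targets held fixed) gives the DQN update.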
We use the DQN implementation from the Python package stable-baselines3 [15].

3. Related Work

One of the main problems in detecting network attacks is data imbalance [16]: attacks tend to represent a relatively small proportion of the total traffic, and some of these attacks can be rare. This means it will be difficult for the model to identify these attacks given their scarcity. The application of ensemble models is one of the main approaches to deal with data imbalance [6] due to their general robustness to difficult instances. Works like [7] apply ensembles to the problem of intrusion detection with good results: in this work, they manage to reduce false positives by 50% using Bayesian selection to eliminate the models with the worst performance. In [17], a genetic algorithm is applied along with the ensemble to combine predictions, with an improvement of 0.05 in the F1-score. This leads us to think that RL can also significantly improve the performance of an ensemble by combining its predictions, given its greater learning capacity.
On the other hand, there are some works that apply RL to cybersecurity problems, in particular to deal with network attacks. Most try to exploit RL’s capabilities to adapt to changes in data distribution and learn with a limited amount of labeled data [9]. Several works have obtained good results by applying RL [12,18] to these problems. In [12], several RL models are applied as classifiers to detect intrusions using the NSL-KDD [19] and AWID [20] datasets. In both datasets, the RL model DDQN [21] outperforms a series of commonly used traditional ML models, such as Random Forest or Gradient Boosting, with the additional advantage of lower prediction times. In [22], RL is applied to predict the volume of different types of network traffic (transport and application layer) in IoT networks in order to detect denial of service (DoS) attacks. The RL model is compared with three other traffic prediction algorithms. RL improves the F1-score by 0.1 compared to the second-best performing algorithm, while maintaining a limited false positive rate.
Other works use the adaptability of RL and introduce additional data preprocessing techniques to guide learning and achieve some objective: generally, to improve detection in imbalanced datasets. The work in [23] applies RL to network anomaly detection in a semi-supervised environment, with only a limited amount of labeled instances. The sampling of instances in each episode is modified to maximize the use of labeled anomalous instances, and the chosen reward function penalizes false positives. The algorithm improves the area under the curve (AUC) by up to 10% compared to other anomaly detection algorithms. In [24], RL is used to generate synthetic instances to train a detector, in order to strengthen models against machine learning model evasion attacks. This approach improves the F1-score by up to 30% in the presence of evasion attacks.
In relation to our approach, some works in other fields apply RL as a tool to select the models that should be kept within the ensemble. The ensemble can be integrated directly into the RL classifier as in [25]. In this case, the algorithms use multiple DQN approximations (with shared weights in the innermost layers) to produce a diverse set of estimates that improve generalization and minimize the amount of data needed to achieve acceptable performance. On the other hand, RL can be applied to improve a set of traditional models. The work in [14] uses ensembles to train with imbalanced data and applies RL to select which models to remove from the ensemble to keep only the most informative models. This strategy outperforms other specialized ensembles on different datasets. In this work, we seek not so much to select models but to more effectively combine the predictions of the existing models, assuming that if the model does not provide useful information, the RL algorithm will not take it into account.

4. Experimental Methodology

In this section, we describe the development of the experiments and the analysis of the results.

4.1. Experimental Protocol

Our objective is to determine whether RL algorithms can improve the performance of ensemble models by effectively merging the predictions of each individual model. Therefore, the RL model will be used as a classifier. In this context, the classic MDP parameters can be modeled as follows:
  • States s_i: Instances from the dataset. The RL model will receive the stacked predictions from each model in the ensemble as input. Therefore, for the RL model, each instance will be characterized by the probabilities of belonging to each class given by each model, instead of the original features. This is illustrated in Figure 1. The label of each instance will remain unchanged.
  • Actions a_i: The possible labels to assign to the instance. The RL model will thus function as a classifier.
  • Reward R_i: We follow the strategy of previous works [12] and use a binary reward. Its value will be 1 if the selected label for the instance is correct and 0 otherwise. Although some works like [26] use rewards that take data imbalance into account, in [12] the binary reward achieves the best performance on the same problem we address in this work.
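Under these definitions, the state construction and the binary reward are straightforward; the following sketch illustrates them (names are ours for illustration, as the paper does not give implementation-level details):

```python
def build_state(per_model_probs):
    """Stack each ensemble member's class-probability vector into a single
    state vector, replacing the instance's original features."""
    return [p for probs in per_model_probs for p in probs]

def binary_reward(predicted_label, true_label):
    """Binary reward: 1 for a correct label, 0 otherwise."""
    return 1 if predicted_label == true_label else 0
```

For an ensemble of m models on a k-class problem, the state is thus a vector of m·k probabilities, and each RL action is one of the k labels.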
The procedure followed in the experiments for each dataset is as follows:
  • The base ensemble is trained, and the predictions of each component from the training and test subsets are extracted.
  • The predictions from each model in the ensemble are stacked and will be used as input to the RL model, for both training and testing.
  • The hyperparameters of the RL model are optimized, and the best configuration is selected.
  • The RL model is trained, and its predictions on the test set are extracted.
  • We calculate different metrics to evaluate the quality of the predictions from both the base model and the RL model.
To fully train and test the selected configurations, we follow the usual convention for RL [12]: instances are passed to the model in random batches. Training continues until the model has processed a number of instances equal to twice the size of the training set.
The process described in this section is repeated 5 times with different initializations for each dataset, in order to determine whether the results are statistically significant.

4.2. Models and Hyperparameters

As the base ensemble, we use the XGBoost model. The outputs of this model (the prediction probabilities for each class) are used as the state input for the RL agent. For the RL model, we apply hyperparameter optimization following the recommendations of the library’s authors [15]. We use the Optuna library [27] to carry out the optimization. We use the Tree Parzen Estimator sampler [28] and configure the number of iterations for optimization to 1000.
Table 1 shows the hyperparameter search space for the reinforcement learning model.
To adjust the hyperparameters, we take 80% of the training set to fit each configuration and the remaining 20% to evaluate its suitability.

4.3. Datasets

We will use two datasets in this work: NSL-KDD [19] and UNSW-NB15 [29]. NSL-KDD contains features extracted from synthetic traffic used for DARPA IDS evaluation. This version of the dataset includes modifications to address deficiencies of the KDD99 dataset on which it is based. The dataset contains normal traffic and 4 types of attacks: denial of service, probe, remote access, and privilege escalation. UNSW-NB15 combines real normal traffic with synthetic malicious traffic that attempts to improve similarity to real traffic. The dataset contains 9 types of attacks.
Both datasets include features that reflect both traffic statistics (bytes sent/received) and content (number of HTTP POST and GET requests).
For both datasets, we will use the training and test data split provided by the authors. The class distribution for both datasets is shown in Table 2. As can be observed, in both datasets, there is significant imbalance among the classes, so we can effectively evaluate whether the use of RL allows for improving the performance of ensembles in data imbalance situations.

4.4. Analysis

We will use the F1-score as the performance metric for the models. It is defined in Equation (5), where TP denotes true positives, FP false positives, and FN false negatives. We use this metric because it is more informative than accuracy when working with imbalanced data, as is the case in this work.
F1 = 2·TP / (2·TP + FP + FN)    (5)
To apply the metric to multiclass problems, it is necessary to average the F1-score over the different classes. We report both the weighted and macro averages of the F1-score to give a more complete view of performance. The weighted average weights each class's F1-score by its frequency (support), so it can give a more optimistic value if a poorly detected class is very infrequent. The macro average computes the F1-score independently for each class and takes the unweighted mean, which can lead to lower values if the F1-score for any of the minority classes is low.
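The difference between the two averages can be made concrete with a small sketch (pure Python; the `counts` layout of per-class (TP, FP, FN, support) tuples is our own choice for illustration):

```python
def f1(tp, fp, fn):
    """Per-class F1 as in Equation (5)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1(tp, fp, fn) for tp, fp, fn, _ in counts) / len(counts)

def weighted_f1(counts):
    """Per-class F1 weighted by class frequency (support)."""
    total = sum(support for *_, support in counts)
    return sum(f1(tp, fp, fn) * support
               for tp, fp, fn, support in counts) / total

# A frequent, well-detected class and a rare, poorly detected one:
counts = [(90, 5, 5, 95), (1, 4, 9, 10)]
```

On this example the weighted F1 is about 0.87, dominated by the majority class, while the macro F1 of about 0.54 exposes the weak minority class — exactly the gap discussed above.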
Furthermore, we will perform a statistical analysis of the weighted and macro F1-score values obtained to determine the validity of our approach. The process is detailed below:
1. First, we apply the Kolmogorov–Smirnov test for normality with the Lilliefors correction. We obtain non-significant results in all cases, so we apply parametric inferential analyses and use the mean as a measure of central tendency.
2. We apply Student's t-test for paired samples between the F1 values obtained by the base ensemble and the predictions adjusted by RL. If there are significant differences, we carry out a descriptive analysis.
For all analyses, a significance threshold of α = 0.05 is set.
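Step 2 above reduces to the following computation (a sketch of the paired t statistic in pure Python; in practice a library routine such as scipy.stats.ttest_rel would also return the p-value, which we do not reproduce here):

```python
import math

def paired_t(xs, ys):
    """Student's t for paired samples: t = mean(d) / (sd(d) / sqrt(n)),
    where d holds the per-run differences and sd is the sample
    standard deviation (n - 1 denominator)."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)
```

With five repetitions per dataset, the statistic is compared against the t distribution with 4 degrees of freedom at the α = 0.05 threshold.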

5. Discussion of Results

In this section, the results obtained with the DQN-based approach on the NSL-KDD and UNSW-NB15 datasets are presented and analyzed. To do this, the results obtained in the five repetitions of the experiment are compared, and the effectiveness of the prediction combination strategy in each case is discussed.

5.1. Results in NSL-KDD

Table 3 shows the results of the statistical analyses, and Table 4 shows the results obtained for the NSL-KDD dataset with different random seed initializations in the training of the DQN model.
The analysis indicates statistically significant differences between the results of the base XGBoost model and the RL model for both macro F1 and weighted F1 (Table 3, p < 0.01 for both metrics). Therefore, we can conclude that the observed differences are not due to chance, but reflect a real improvement produced by the DQN model.
It is observed that the best configuration, corresponding to seed 1062237619, achieves a macro F1 of 0.6426 and a weighted F1 of 0.8050, surpassing the other initializations. These results are superior to those achieved by the base XGBoost model, which obtains, for the same seed, a macro F1 of 0.6026 and a weighted F1 of 0.7788. The average improvement is 0.041 in macro F1 and 0.035 in weighted F1. This reinforces the idea that an adaptive combination of predictions can be more effective than a static aggregation, allowing DQN to assign greater weight to the predictions that truly provide relevant information.
To analyze the results in more detail, Table 5 and Table 6 show the normalized confusion matrices of the best results achieved by the base XGBoost model and the RL model, respectively. It can be seen that the RL model improves the performance of the base XGBoost model, although the improvement is not uniform across all classes. The DoS, Probe, and R2L classes see improvements in their hit rate (values on the diagonal) of 0.157, 0.044, and 0.040, respectively. The Normal class shows a slight improvement (0.01), while the U2R class (which already had a very high base performance of 0.970) suffers a slight decrease of 0.003.
The most substantial improvement occurs in the DoS class. On the other hand, the R2L class, despite improving, still has a low hit rate (0.085), starting from 0.045 in the base model. Therefore, although there is a significant improvement in performance, the model does not guarantee that the improvement will be balanced or particularly benefit the classes with the worst results.
This disparity in improvement may be due to several factors. On the one hand, there is the use of a binary reward function: no distinction is made between correct predictions of different classes, so minority classes implicitly carry less weight due to their lower frequency. A reward function that gives more weight to minority classes could improve their performance; however, in [12], it is mentioned that this strategy is not useful with individual models. On the other hand, for the global prediction to improve upon that of the individual models, they need a hit rate higher than random class selection. As can be seen in the confusion matrix (Table 5), the hit rate for the R2L class is 0.045, so an improvement cannot be guaranteed in this case.
In any case, for the five seeds tested, DQN produces improvements in all cases, for both F1 averages. However, there is a certain variability in DQN's performance, particularly in weighted F1, where the model reaches 0.8303 in the best case and 0.8004 in the worst, even when evaluating on the same dataset. This variability between runs is an aspect to consider when implementing this approach, as small differences in the learned policy can significantly impact the model's performance.

5.2. Results in UNSW-NB15

Table 7 shows the results of the statistical analyses, and Table 8 shows the results obtained for the UNSW-NB15 dataset with different random seed initializations in the DQN model training. The analysis indicates statistically significant differences between the results of the base XGBoost model and the RL model for both macro F1 and weighted F1 (Table 7). Therefore, we can conclude that the observed differences are not due to chance. However, the nature of these differences is mixed and markedly different from that observed in NSL-KDD. The RL model shows a slight but consistent improvement in weighted F1 (t = 5.675, p = 0.0048), but a statistically significant worsening in macro F1 (t = 6.775, p = 0.0025).
As shown in Table 8, the RL model's macro F1 is consistently lower than that of the base XGBoost model across all seeds, while the weighted F1 improves in all cases. On average, macro F1 worsens by 0.043, while weighted F1 improves by 0.015. This strongly suggests that the DQN has learned a policy that optimizes performance for the majority classes (increasing weighted F1) at the expense of the minority classes (sinking macro F1). The configuration with the highest RL macro F1 (seed 2112) achieves a value of 0.4696, which is lower than the macro F1 of its corresponding base model (0.5072).
To analyze this phenomenon in more detail, Table 9 and Table 10 show the normalized confusion matrices of the results for seed 2112. A very disparate behavior can be observed: the RL model has learned to completely sacrifice several classes. The Analysis (0), Backdoors (1), and Worms (9) classes see their true positive rate (TPR) plummet to 0.0 (from 0.034, 0.160, and 0.432, respectively). Performance also worsens for the Exploits (3), Normal (4), Reconnaissance (7), and Shellcode (8) classes.
All this lost precision seems to be transferred to dramatically improve the DoS (2) class, which goes from a TPR of 0.122 to 0.577, and the Generic (6) class, which goes from 0.775 to 0.857. The Fuzzers (5) class (with a base TPR of 0.971) remains almost unchanged.
This disparity may be due to the use of a binary reward function, which does not distinguish between correct predictions of different classes. In this dataset, the DQN agent appears to have determined that it is more "profitable" (in terms of total reward) to be correct on the DoS and Generic classes than on minority classes like Analysis, Backdoors, or Worms, collapsing its policy to a solution that ignores the latter.
For the five seeds tested, DQN produces consistent improvements in weighted F1, but consistent worsening in macro F1. Unlike NSL-KDD, there is very low variability in DQN's performance for weighted F1, where the model reaches 0.8058 in the best case and 0.7986 in the worst. This low variability, combined with the worsening of macro F1, suggests that the model consistently converges to a policy that is sub-optimal for balanced performance on this dataset.

6. Conclusions and Future Work

In this work, we have explored the use of reinforcement learning for combining predictions in ensemble models applied to network intrusion detection. Our proposal is based on RL's ability to dynamically adapt the weighting of each model within the ensemble, thus overcoming the limitations of traditional weighted voting methods. The results obtained on the NSL-KDD and UNSW-NB15 datasets show a mixed picture regarding DQN's ability to manage environments with high class imbalance. For NSL-KDD, the RL-based approach managed to outperform the base XGBoost model, with significant improvements in both the macro F1 and weighted F1 metrics. This suggests that, in this scenario, the DQN was able to learn a beneficial combination policy.
However, in UNSW-NB15, the results demonstrate the risks of this approach: while the weighted F1 (influenced by majority classes) saw a slight improvement, the macro F1 was significantly harmed. As discussed in Section 5, the RL agent learned to optimize the reward by sacrificing the performance of multiple minority classes (like 'Analysis' or 'Backdoors'), demonstrating that performance improvement is not guaranteed and can, in fact, exacerbate the imbalance rather than mitigate it.
Therefore, depending on the objective of the ensemble, the application of RL may require additional effort to ensure the improvement aligns with specific parameters, whether it be better detection in minority classes or specific classes.
As future lines of work, the application of other reinforcement learning approaches, such as policy-based methods (e.g., PPO or A2C), will be explored to evaluate their performance in the task of combining predictions. The design of reward functions that explicitly penalize failure in minority classes will also be investigated to guide the agent towards more balanced solutions than those obtained in UNSW-NB15. Secondly, the impact of data preprocessing techniques, such as class balancing through synthetic sample generation or the selection of more representative features, will be investigated to further improve the model’s generalization capability.

Author Contributions

Conceptualization, D.E.G. and N.D.-G.; methodology, D.E.G.; software, D.E.G.; validation, D.E.G., Ó.M.-G. and J.C.S.N.; formal analysis, D.E.G.; investigation, D.E.G.; writing—original draft, D.E.G.; writing—review and editing, Ó.M.-G., J.C.S.N. and N.D.-G.; supervision, N.D.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This publication is part of the project Data science for an artificial intelligence model in cybersecurity (C073/23) with key X67, funded by the European Union NextGeneration-EU, Recovery, Transformation and Resilience Plan, through INCIBE. It has also been carried out within the framework of the funds from the Recovery, Transformation and Resilience Plan, financed by the European Union (Next Generation)—National Institute of Cybersecurity (INCIBE) in Project C108/23 “Detection of Identity Document Forgery using Computer Vision and Artificial Intelligence Techniques”.

Institutional Review Board Statement

Not applicable. The study did not involve humans or animals.

Informed Consent Statement

Not applicable. The study did not involve humans.

Data Availability Statement

The datasets used in this study are publicly available. The NSL-KDD dataset [19] is available at https://www.unb.ca/cic/datasets/nsl.html (accessed on 20 October 2025). The UNSW-NB15 dataset [29] is available at https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/ (accessed on 20 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
RL: Reinforcement Learning
IDS: Intrusion Detection System
MDP: Markov Decision Process
DQN: Deep Q-Network
DRL: Deep Reinforcement Learning
LR: Logistic Regression
DT: Decision Tree
KNN: K-Nearest Neighbors
MLP: Multilayer Perceptron
DoS: Denial of Service
AUC: Area Under the Curve

References

  1. Sethi, K.; Sai Rupesh, E.; Kumar, R.; Bera, P.; Venu Madhav, Y. A context-aware robust intrusion detection system: A reinforcement learning-based approach. Int. J. Inf. Secur. 2019, 19, 657–678. [Google Scholar] [CrossRef]
  2. Thakkar, A.; Lohiya, R. A survey on intrusion detection system: Feature selection, model, performance measures, application perspective, challenges, and future research directions. Artif. Intell. Rev. 2021, 55, 453–563. [Google Scholar] [CrossRef]
  3. Shahraki, A.; Abbasi, M.; Taherkordi, A.; Jurcut, A.D. A comparative study on online machine learning techniques for network traffic streams analysis. Comput. Netw. 2022, 207, 108836. [Google Scholar] [CrossRef]
  4. Gibert, D.; Mateu, C.; Planes, J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J. Netw. Comput. Appl. 2020, 153, 102526. [Google Scholar] [CrossRef]
  5. Kegelmeyer, W.P.; Chiang, K.; Ingram, J. Streaming Malware Classification in the Presence of Concept Drift and Class Imbalance. In Proceedings of the 2013 12th International Conference on Machine Learning and Applications, Miami, FL, USA, 4–7 December 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 48–53. [Google Scholar] [CrossRef]
  6. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef]
  7. Wu, Z.; Gao, P.; Cui, L.; Chen, J. An Incremental Learning Method Based on Dynamic Ensemble RVM for Intrusion Detection. IEEE Trans. Netw. Serv. Manag. 2022, 19, 671–685. [Google Scholar] [CrossRef]
  8. Alavizadeh, H.; Alavizadeh, H.; Jang-Jaccard, J. Deep Q-Learning Based Reinforcement Learning Approach for Network Intrusion Detection. Computers 2022, 11, 41. [Google Scholar] [CrossRef]
  9. Louati, F.; Ktata, F.B.; Amous, I. Enhancing Intrusion Detection Systems with Reinforcement Learning: A Comprehensive Survey of RL-based Approaches and Techniques. SN Comput. Sci. 2024, 5. [Google Scholar] [CrossRef]
  10. Yungaicela-Naula, N.M.; Vargas-Rosales, C.; Pérez-Díaz, J.A. SDN/NFV-based framework for autonomous defense against slow-rate DDoS attacks by using reinforcement learning. Future Gener. Comput. Syst. 2023, 149, 637–649. [Google Scholar] [CrossRef]
  11. Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B.; Miao, Q. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5064–5078. [Google Scholar] [CrossRef]
  12. Lopez-Martin, M.; Carro, B.; Sanchez-Esguevillas, A. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Syst. Appl. 2020, 141, 112963. [Google Scholar] [CrossRef]
  13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  14. Usman, M.; Chen, H. EMRIL: Ensemble Method based on ReInforcement Learning for binary classification in imbalanced drifting data streams. Neurocomputing 2024, 605, 128259. [Google Scholar] [CrossRef]
  15. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  16. Gupta, N.; Jindal, V.; Bedi, P. CSE-IDS: Using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in network-based intrusion detection systems. Comput. Secur. 2022, 112, 102499. [Google Scholar] [CrossRef]
  17. Shyaa, M.A.; Zainol, Z.; Abdullah, R.; Anbar, M.; Alzubaidi, L.; Santamaría, J. Enhanced Intrusion Detection with Data Stream Classification and Concept Drift Guided by the Incremental Learning Genetic Programming Combiner. Sensors 2023, 23, 3736. [Google Scholar] [CrossRef] [PubMed]
  18. Mohamed, S.; Ejbali, R. Deep SARSA-based reinforcement learning approach for anomaly network intrusion detection system. Int. J. Inf. Secur. 2022, 22, 235–247. [Google Scholar] [CrossRef]
  19. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 8–10 July 2009; IEEE: Ottawa, ON, Canada, 2009; pp. 1–6. [Google Scholar] [CrossRef]
  20. Kolias, C.; Kambourakis, G.; Stavrou, A.; Gritzalis, S. Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset. IEEE Commun. Surv. Tutor. 2016, 18, 184–208. [Google Scholar] [CrossRef]
  21. van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16), Phoenix, AZ, USA, 12–17 February 2016; AAAI Press: Palo Alto, CA, USA, 2016; pp. 2094–2100. [Google Scholar]
  22. Nie, L.; Sun, W.; Wang, S.; Ning, Z.; Rodrigues, J.J.P.C.; Wu, Y.; Li, S. Intrusion Detection in Green Internet of Things: A Deep Deterministic Policy Gradient-Based Algorithm. IEEE Trans. Green Commun. Netw. 2021, 5, 778–788. [Google Scholar] [CrossRef]
  23. Pang, G.; van den Hengel, A.; Shen, C.; Cao, L. Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data. In KDD ’21 Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; ACM: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
  24. Apruzzese, G.; Andreolini, M.; Marchetti, M.; Venturi, A.; Colajanni, M. Deep Reinforcement Adversarial Learning Against Botnet Evasion Attacks. IEEE Trans. Netw. Serv. Manag. 2020, 17, 1975–1987. [Google Scholar] [CrossRef]
  25. Agarwal, R.; Schuurmans, D.; Norouzi, M. An Optimistic Perspective on Offline Reinforcement Learning. Proc. Mach. Learn. Res. 2020, 119, 104–114. [Google Scholar]
  26. Al-Fawa’reh, M.; Abu-Khalaf, J.; Szewczyk, P.; Kang, J.J. MalBoT-DRL: Malware Botnet Detection Using Deep Reinforcement Learning in IoT Networks. IEEE Internet Things J. 2024, 11, 9610–9629. [Google Scholar] [CrossRef]
  27. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  28. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In NIPS’11 Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 2546–2554. [Google Scholar]
  29. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Input to the RL model.
Table 1. Hyperparameters of the DQN model.
Parameter | Values
learning_rate | [0.00001, 1]
batch_size | [16, 32, 64, 100, 128, 256, 512]
buffer_size | [10,000, 100,000, 1,000,000]
exploration_final_eps | [0, 0.2]
exploration_fraction | [0, 0.5]
target_update_interval | [1, 1000, 5000, 10,000, 15,000, 20,000]
learning_starts | [0, 1000, 5000, 10,000, 20,000]
train_freq | [1, 4, 8, 16, 128, 256, 1000]
subsample_steps | [1, 2, 4, 8]
net_arch | {[64, 64], [256, 256]}
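The search space of Table 1 can be encoded directly in code. The sketch below uses only the standard library: the ranges are taken from the table, while the random sampling routine is a simplified stand-in for the Optuna-based search [27], and log-uniform sampling of the learning rate is an assumption on our part:

```python
import math
import random

# Search space from Table 1: continuous intervals as (low, high) tuples,
# discrete choices as lists.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1.0),          # sampled log-uniformly (assumption)
    "batch_size": [16, 32, 64, 100, 128, 256, 512],
    "buffer_size": [10_000, 100_000, 1_000_000],
    "exploration_final_eps": (0.0, 0.2),
    "exploration_fraction": (0.0, 0.5),
    "target_update_interval": [1, 1000, 5000, 10_000, 15_000, 20_000],
    "learning_starts": [0, 1000, 5000, 10_000, 20_000],
    "train_freq": [1, 4, 8, 16, 128, 256, 1000],
    "subsample_steps": [1, 2, 4, 8],
    "net_arch": [[64, 64], [256, 256]],
}

def sample_config(rng=random):
    """Draw one hyperparameter configuration from the space above."""
    cfg = {}
    for name, space in SEARCH_SPACE.items():
        if isinstance(space, tuple):        # continuous interval
            low, high = space
            if name == "learning_rate":     # spans several orders of magnitude
                cfg[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
            else:
                cfg[name] = rng.uniform(low, high)
        else:                               # discrete choice
            cfg[name] = rng.choice(space)
    return cfg

cfg = sample_config()
```

Each sampled configuration would then be passed to the DQN trainer and scored, with the search keeping the best-performing trial.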
Table 2. Distribution of the datasets.
Dataset | Class | Training Frequency | Test Frequency
NSL-KDD | Normal | 67,343 (53.45%) | 9711 (43.07%)
NSL-KDD | DoS | 45,927 (36.45%) | 7460 (33.08%)
NSL-KDD | Probe | 11,656 (9.25%) | 2421 (10.73%)
NSL-KDD | R2L | 995 (0.79%) | 2885 (12.21%)
NSL-KDD | U2R | 52 (0.041%) | 67 (0.88%)
NSL-KDD | Total | 125,973 | 22,544
UNSW-NB15 | Analysis | 2000 (1.14%) | 677 (0.82%)
UNSW-NB15 | Backdoor | 1746 (0.99%) | 583 (0.70%)
UNSW-NB15 | DoS | 12,264 (6.99%) | 4089 (4.96%)
UNSW-NB15 | Exploit | 33,393 (19.04%) | 11,132 (13.52%)
UNSW-NB15 | Normal | 56,000 (31.93%) | 37,000 (44.93%)
UNSW-NB15 | Fuzzers | 18,184 (10.37%) | 6062 (7.36%)
UNSW-NB15 | Generic | 40,000 (22.81%) | 18,871 (22.92%)
UNSW-NB15 | Recon | 10,491 (5.98%) | 3496 (4.24%)
UNSW-NB15 | Shellcode | 1133 (0.64%) | 378 (0.45%)
UNSW-NB15 | Worm | 130 (0.07%) | 44 (0.05%)
UNSW-NB15 | Total | 175,341 | 82,332
Table 3. Results of the Student’s t-test comparing the metrics of the base XGBoost model and the RL model for NSL-KDD.
Metric | Statistic | p-Value
F1-macro | 5.0971 | 0.0070
F1-weighted | 6.0187 | 0.0038
Table 4. Results for NSL-KDD.
Experiment (Seed) | Macro F1 (Base) | Macro F1 (RL) | Weighted F1 (Base) | Weighted F1 (RL)
1062237619 | 0.6026 | 0.6426 | 0.7788 | 0.8050
2112 | 0.5797 | 0.6105 | 0.7804 | 0.8027
249651232 | 0.5651 | 0.6347 | 0.7680 | 0.8209
308118868 | 0.5754 | 0.5976 | 0.7709 | 0.8004
798844875 | 0.5865 | 0.6273 | 0.7857 | 0.8303
Mean | 0.5818 | 0.6225 | 0.7768 | 0.8119
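The statistics in Table 3 can be reproduced from the per-seed results in Table 4 with a paired Student's t-test over the five seeds. A standard-library sketch (the small discrepancy with the reported 5.0971 comes from the four-decimal rounding of the scores in Table 4):

```python
import math
from statistics import mean, stdev

# Macro-F1 per seed from Table 4 (NSL-KDD).
base = [0.6026, 0.5797, 0.5651, 0.5754, 0.5865]
rl   = [0.6426, 0.6105, 0.6347, 0.5976, 0.6273]

# Paired Student's t statistic on the per-seed differences.
diffs = [r - b for r, b in zip(rl, base)]
t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
# t comes out near 5.09, matching the 5.0971 of Table 3 up to rounding.
```

The same computation with scipy.stats.ttest_rel would additionally return the two-sided p-value against the t distribution with 4 degrees of freedom.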
Table 5. Normalized confusion matrix of the base XGBoost model (Best seed: 1062237619).
True Class \ Predicted Class | Normal | DoS | Probe | R2L | U2R
Normal | 0.7786 | 0.0286 | 0.0000 | 0.0000 | 0.1928
DoS | 0.0454 | 0.6848 | 0.0000 | 0.0000 | 0.2697
Probe | 0.0004 | 0.0025 | 0.3500 | 0.0004 | 0.6467
R2L | 0.0050 | 0.0050 | 0.0000 | 0.0450 | 0.9450
U2R | 0.0044 | 0.0213 | 0.0038 | 0.0000 | 0.9704
Table 6. Normalized confusion matrix of the RL model (Best seed: 1062237619).
True Class \ Predicted Class | Normal | DoS | Probe | R2L | U2R
Normal | 0.7884 | 0.0320 | 0.0000 | 0.0000 | 0.1795
DoS | 0.0582 | 0.8414 | 0.0000 | 0.0000 | 0.1004
Probe | 0.0000 | 0.0795 | 0.3936 | 0.0007 | 0.5261
R2L | 0.0000 | 0.3150 | 0.0050 | 0.0850 | 0.5950
U2R | 0.0084 | 0.0232 | 0.0004 | 0.0001 | 0.9679
Table 7. Results of the Student’s t-test comparing the metrics of the base XGBoost model and RL model for UNSW-NB15.
Metric | Statistic | p-Value
F1-macro | −6.7754 | 0.0025
F1-weighted | 5.6754 | 0.0048
Table 8. Results for UNSW-NB15.
Experiment (Seed) | Macro F1 (Base) | Macro F1 (RL) | Weighted F1 (Base) | Weighted F1 (RL)
1062237619 | 0.4913 | 0.4631 | 0.7914 | 0.7986
2112 | 0.5072 | 0.4696 | 0.7842 | 0.8058
249651232 | 0.5052 | 0.4426 | 0.7886 | 0.8042
308118868 | 0.4820 | 0.4479 | 0.7863 | 0.8038
798844875 | 0.4917 | 0.4388 | 0.7912 | 0.8017
Mean | 0.4955 | 0.4524 | 0.7883 | 0.8028
Table 9. Normalized confusion matrix of the base XGBoost model (Seed: 2112).
True Class \ Predicted Class | An | Ba | DoS | Ex | No | Fu | Ge | Re | Sh | Wo
Analysis | 0.0340 | 0.1403 | 0.1152 | 0.5628 | 0.1433 | 0.0000 | 0.0044 | 0.0000 | 0.0000 | 0.0000
Backdoors | 0.0395 | 0.1595 | 0.1269 | 0.4717 | 0.1818 | 0.0103 | 0.0051 | 0.0000 | 0.0051 | 0.0000
DoS | 0.0408 | 0.1551 | 0.1215 | 0.6026 | 0.0399 | 0.0081 | 0.0127 | 0.0068 | 0.0122 | 0.0002
Exploits | 0.0162 | 0.0567 | 0.0313 | 0.8232 | 0.0299 | 0.0034 | 0.0136 | 0.0159 | 0.0093 | 0.0005
Normal | 0.0106 | 0.0325 | 0.0277 | 0.1727 | 0.5762 | 0.0008 | 0.1391 | 0.0028 | 0.0376 | 0.0000
Fuzzers | 0.0000 | 0.0002 | 0.0028 | 0.0208 | 0.0024 | 0.9713 | 0.0009 | 0.0001 | 0.0014 | 0.0001
Generic | 0.0126 | 0.0000 | 0.0012 | 0.0227 | 0.1848 | 0.0001 | 0.7746 | 0.0004 | 0.0034 | 0.0000
Reconnaissance | 0.0051 | 0.0203 | 0.0074 | 0.1430 | 0.0066 | 0.0000 | 0.0034 | 0.8012 | 0.0129 | 0.0000
Shellcode | 0.0000 | 0.0000 | 0.0132 | 0.1190 | 0.1005 | 0.0026 | 0.0238 | 0.0053 | 0.7355 | 0.0000
Worms | 0.0000 | 0.0000 | 0.0227 | 0.4318 | 0.0682 | 0.0000 | 0.0000 | 0.0000 | 0.0455 | 0.4318
(An = Analysis, Ba = Backdoors, Ex = Exploits, No = Normal, Fu = Fuzzers, Ge = Generic, Re = Reconnaissance, Sh = Shellcode, Wo = Worms)
Table 10. Normalized confusion matrix of the RL model (Seed: 2112).
True Class \ Predicted Class | An | Ba | DoS | Ex | No | Fu | Ge | Re | Sh | Wo
Analysis | 0.0000 | 0.0000 | 0.6573 | 0.2659 | 0.0532 | 0.0000 | 0.0222 | 0.0015 | 0.0000 | 0.0000
Backdoors | 0.0000 | 0.0000 | 0.6398 | 0.2281 | 0.0652 | 0.0103 | 0.0154 | 0.0360 | 0.0051 | 0.0000
DoS | 0.0000 | 0.0000 | 0.5769 | 0.3656 | 0.0171 | 0.0078 | 0.0127 | 0.0081 | 0.0117 | 0.0000
Exploits | 0.0000 | 0.0000 | 0.2182 | 0.7251 | 0.0149 | 0.0038 | 0.0213 | 0.0128 | 0.0040 | 0.0000
Normal | 0.0000 | 0.0000 | 0.1501 | 0.1290 | 0.3895 | 0.0018 | 0.3098 | 0.0028 | 0.0170 | 0.0000
Fuzzers | 0.0000 | 0.0000 | 0.0051 | 0.0192 | 0.0015 | 0.9717 | 0.0018 | 0.0002 | 0.0005 | 0.0000
Generic | 0.0000 | 0.0000 | 0.0029 | 0.0208 | 0.1168 | 0.0001 | 0.8570 | 0.0005 | 0.0019 | 0.0000
Reconnaissance | 0.0000 | 0.0000 | 0.0698 | 0.1287 | 0.0031 | 0.0003 | 0.0057 | 0.7898 | 0.0026 | 0.0000
Shellcode | 0.0000 | 0.0000 | 0.0608 | 0.2037 | 0.0847 | 0.0000 | 0.0635 | 0.0053 | 0.5820 | 0.0000
Worms | 0.0000 | 0.0000 | 0.0909 | 0.6818 | 0.0682 | 0.1136 | 0.0455 | 0.0000 | 0.0000 | 0.0000
(An = Analysis, Ba = Backdoors, Ex = Exploits, No = Normal, Fu = Fuzzers, Ge = Generic, Re = Reconnaissance, Sh = Shellcode, Wo = Worms)
