A New Hybrid Method: CDRL-QNN for Stable IoT Intrusion Detection

Küçükkara, Muhammed Yusuf; Atban, Furkan; Bayılmış, Cüneyt

doi:10.3390/math14101608

by Muhammed Yusuf Küçükkara ^1,2,†, Furkan Atban ^1,2,† and Cüneyt Bayılmış ^3,*

¹ Department of Computer Engineering, Faculty of Technology, Sakarya University of Applied Sciences, Sakarya 54050, Türkiye

² Department of Computer Engineering, Institute of Natural Sciences, Sakarya University, Sakarya 54050, Türkiye

³ Department of Computer Engineering, Faculty of Computer and Information Sciences, Sakarya University, Sakarya 54050, Türkiye

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Mathematics 2026, 14(10), 1608; https://doi.org/10.3390/math14101608

Submission received: 1 April 2026 / Revised: 25 April 2026 / Accepted: 4 May 2026 / Published: 9 May 2026

(This article belongs to the Special Issue Cybersecurity and Data Protection: Modern Methods and New Applications)

Abstract

The rapid expansion of the Internet of Things (IoT) has increased the risk of large-scale Distributed Denial-of-Service (DDoS) attacks. In high-availability IoT environments, the operational costs of false positives and false negatives are asymmetric, whereas conventional deep learning models usually optimize static accuracy-based objectives. To address this, we propose CDRL-QNN, a cost-aware and chaos-driven reinforcement learning quantum neural network framework in which a parameterized quantum circuit serves as the action-value function approximator within a Deep Q-Network (DQN) agent. The framework incorporates asymmetric operational penalties through both the reward function and sample-wise weighted Bellman optimization, while a logistic-map-based deterministic perturbation mechanism is used to promote exploration under constrained quantum-circuit training conditions. Evaluated on a computationally constrained balanced subset of the CIC-DDoS2019 dataset, the proposed framework reduced false negatives from 49 to 33 without increasing false positives, improving recall from 0.9673 to 0.9780 and F1-score from 0.9738 to 0.9793 while lowering operational cost. These findings suggest that hybrid quantum representations can be integrated into cost-sensitive reinforcement learning pipelines for IoT intrusion detection under constrained experimental conditions.

Keywords: IoT security; DDoS detection; cost-sensitive learning; reinforcement learning; quantum neural networks; chaos-driven exploration; operational stability

MSC: 68T05

1. Introduction

The Internet of Things (IoT) has been widely adopted over the past decade across numerous critical domains, ranging from smart cities and industrial control systems to healthcare infrastructures and energy grids. With billions of devices connected globally, IoT has formed a large-scale, distributed, and heterogeneous digital ecosystem. However, this rapid expansion has simultaneously led to an enlarged attack surface, as security architectures have not matured at the same pace as deployment. Device firmware vulnerabilities, weak authentication mechanisms, and inadequate update practices in IoT environments have paved the way for large-scale exploitation operations [1]. In particular, Mirai and its derivative botnets have compromised millions of vulnerable IoT devices and orchestrated Internet-scale Distributed Denial-of-Service (DDoS) attacks, severely threatening the availability of critical services [2,3]. In this context, IoT security should be regarded not merely as a technical attack detection problem, but also as a challenge of operational continuity and system reliability.

The impact of DDoS attacks extends far beyond network saturation. Service interruptions may lead to Service Level Agreement (SLA) violations, reputational damage, and direct financial losses. Large-scale attacks have been reported to cause multi-million-dollar damages and, in cloud-based or shared infrastructures, may even affect non-target tenants, amplifying operational and financial consequences [4,5]. Furthermore, under DDoS conditions, significant degradation is observed in Quality-of-Service (QoS) metrics such as latency, jitter, and packet loss, thereby substantially increasing operational risk in high-availability systems [6]. Detecting the attack alone is therefore insufficient: high False Positive (FP) rates may trigger service disruptions, unnecessary mitigation mechanisms, and alarm fatigue, ultimately undermining system reliability [7,8]. Similarly, a high False Negative (FN) rate may allow malicious traffic to remain undetected, enabling attacks to persist within the system, compromise service availability, and expose critical infrastructures to severe operational and financial risks. Therefore, in IoT-based intrusion detection systems, the fundamental issue is not simply improving accuracy but minimizing operational cost and ensuring reliable decision-making.

Deep learning (DL)-based intrusion detection systems have been widely employed in IoT-DDoS detection due to their ability to automatically extract features from high-dimensional network traffic. Nevertheless, conventional supervised DL models generally rely on static training methodologies. Concept drift, distribution shifts, and the emergence of zero-day attacks in network traffic can cause performance degradation over time, while periodic retraining imposes significant computational and scalability burdens [9]. Moreover, the problem of catastrophic forgetting may lead to the loss of previously learned knowledge when adapting to new attack patterns [9]. These limitations indicate that static classification models may be insufficient to provide sustainable solutions in dynamic and adversarial IoT environments.

In this regard, Reinforcement Learning (RL) has attracted attention in cybersecurity due to its ability to provide adaptive and self-learning decision mechanisms. In particular, Deep Q-Network (DQN)-based approaches dynamically update decision policies by penalizing incorrect classifications through reward-driven optimization, thereby demonstrating higher adaptability compared to static models [10,11]. However, most RL-based IDS approaches in the literature still evaluate performance primarily through conventional metrics such as accuracy or F1-score, without explicitly integrating the costs of false alarms and missed attacks into the optimization process. In high-reliability systems, the costs associated with false alarms and missed intrusions are asymmetric, and decision threshold selection directly determines the level of operational risk [12,13]. Recent studies on cost-sensitive learning and decision calibration have shown that incorporating cost structures at the decision stage can be more effective than merely modifying the training loss function [12,14]. This highlights the need for a methodological shift in intrusion detection systems—from accuracy-driven approaches toward cost-aware decision calibration.

On the other hand, Quantum Neural Networks (QNNs) have begun to attract interest in cybersecurity applications due to their high expressibility and potential to represent complex decision boundaries [15,16]. However, the training of variational quantum circuits involves significant optimization challenges, particularly the barren plateau phenomenon characterized by vanishing gradients [17,18]. As circuit expressibility increases, the variance of the loss function decreases, and gradients may vanish exponentially, limiting the trainability of QNN-based models [19]. Therefore, the application of QNNs in operational security contexts depends not only on their representational power but also on achieving optimization stability. In the context of IoT intrusion detection, this optimization issue is not merely a theoretical training difficulty. If the QNN component cannot be trained in a stable manner, the resulting decision policy may become unreliable under asymmetric operational risk, leading to unstable false-positive/false-negative trade-offs. In high-availability IoT environments, such instability is particularly important because missed attacks may disrupt critical services, whereas excessive false alarms may trigger unnecessary mitigation actions and degrade service continuity. For this reason, optimization stability in QNN-based security models is directly relevant to operational reliability, not only to model trainability.

Because optimization instability in QNN-based intrusion detection may directly affect the reliability of security decisions, stabilization strategies become important not only from a learning perspective but also from an operational security perspective. Deterministic yet high-entropy chaotic dynamics, such as those generated by the logistic map, can provide more comprehensive exploration behavior than purely random search, thereby reducing the likelihood of convergence to local minima [20]. Similarly, chaos-driven RL approaches demonstrate that maintaining exploration through intrinsic chaotic dynamics can improve learning stability and adaptability [21,22]. Nevertheless, the integration of chaos-driven exploration within a cost-sensitive RL framework, particularly in combination with QNN-based value functions for IoT-DDoS detection, has not yet been addressed in the literature.

In this study, to fill this gap, we propose a novel hybrid framework named CDRL-QNN (Cost-Aware and Chaos-Driven Reinforcement Learning Quantum Neural Network). In the proposed model, the QNN is positioned not as a conventional classifier but as the value function of a reinforcement learning agent. The costs associated with false alarms and missed attacks are explicitly incorporated through both the reward function and sample-wise weighted Bellman-loss minimization, enabling cost-sensitive learning. Additionally, a logistic-map-based chaotic exploration mechanism is employed to enhance optimization stability in high-dimensional policy spaces. The model is evaluated using the realistic and comprehensive CIC-DDoS2019 benchmark dataset [23].

Rather than focusing solely on marginal improvements in accuracy, this study introduces a new perspective to the IoT intrusion detection literature by emphasizing operational cost minimization and decision calibration. By integrating cost-aware optimization, adaptive reinforcement learning, and chaos-driven exploration within a QNN-based framework, the proposed approach aims to provide a more reliable and stable decision-making mechanism for high-availability IoT environments.

The main contributions of this study are summarized as follows:

We propose CDRL-QNN, an integrated hybrid quantum–classical reinforcement learning framework that combines QNN-based value approximation, cost-aware learning, and chaos-driven exploration for IoT intrusion detection.
We incorporate asymmetric operational penalties through both reward shaping and sample-wise weighted Bellman-loss minimization for policy calibration.
We integrate a deterministic chaos-driven exploration mechanism to enhance convergence stability in QNN-based reinforcement learning.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 details the dataset, the reinforcement learning formulation, the cost-aware reward mechanism, and the chaos-driven exploration module. Section 4 describes the experimental setup. Section 5 presents the results and operational analysis. Section 6 discusses the findings and limitations, and Section 7 concludes the paper and outlines future directions.

2. Related Work

This section provides a comprehensive review of the literature concerning IoT intrusion detection, focusing on deep learning architectures, cost-sensitive optimization, quantum-inspired models, and reinforcement learning strategies. The review highlights the progression from static classification to adaptive, chaos-driven optimization frameworks. This comparative perspective helps position the present study more precisely: the aim is not to argue that each individual component is new in isolation but to investigate whether their integration can improve cost-aware decision calibration in IoT intrusion detection.

2.1. Deep Learning and Hybrid Approaches in IoT Security

Deep learning (DL) models have become the standard for detecting DDoS attacks in IoT networks due to their feature extraction capabilities. Recent studies have largely focused on hybrid architectures to improve detection accuracy. Hızal et al. and Almaraz-Rivera et al. demonstrated high-accuracy DDoS detection using DNN, CNN, and LSTM models across various layers of IoT traffic [24,25]. To address the computational constraints of IoT devices, Abualhassan et al. conducted a resource-efficiency analysis of lightweight models like MobileNetV3, while Sajid et al. proposed a hybrid XGBoost/CNN-LSTM architecture to balance detection rates with feature reduction [26,27].

More complex architectures have also been explored. Wahab et al. utilized Transformers combined with CNNs, addressing class imbalance via SMOTE [28]. Similarly, Ain et al. and Yang et al. employed hybrid autoencoders and metaheuristic optimizers (e.g., Pelican Optimization) to prune features and enhance robustness [29,30]. While these methods achieve high classification accuracy, they primarily rely on static training paradigms. Furthermore, studies on distributed security, such as those by Rahmati and Pagano, Sorour et al., and Begum et al., have integrated Federated Learning and blockchain to ensure privacy and scalability [31,32,33]. However, these static or distributed DL approaches often lack the dynamic adaptability required to handle adversarial perturbations without frequent retraining.

2.2. Reinforcement Learning for Adaptive Threat Detection

To overcome the rigidity of static models, Deep Reinforcement Learning (DRL) has been adopted for dynamic decision-making in cybersecurity [34]. Satpathy et al. proposed an actor-critic DRL framework for cloud-based DDoS detection, utilizing reward shaping to handle class imbalance [35]. In the context of botnet detection, Al-Fawa’reh et al. introduced the MalBoT-DRL model, which uses an attention-based reward mechanism to adapt to concept drift [36].

Further advancements include the work of Suresh and Jose, who employed Proximal Policy Optimization (PPO) to dynamically adjust ensemble weights for varying attack contexts [37]. Similarly, Hossain, Kanimozhi, and Ramesh utilized DQN and DRL-based routing optimization to enhance self-learning capabilities in SDN environments [10,38]. Sethi et al. and Jayakrishna and Prasanth also demonstrated that context-aware RL agents could effectively mitigate attacks in vehicular networks (VANETs) [39,40]. Despite these advances, most RL-based IDSs focus on maximizing standard accuracy metrics. They rarely incorporate chaos-driven exploration to prevent the agent from converging to sub-optimal policies (local minima) in high-dimensional parameter spaces.

2.3. Cost-Sensitive Learning and Operational Stability

In high-availability IoT environments, the operational cost of false negatives (FNs) is a critical constraint. Generic DL models often ignore the asymmetry between the cost of a false alarm and a missed attack. Recent approaches have attempted to address this: Prasad et al. utilized cosine similarity-based dataset balancing, while Zahoora et al. developed a Cost-Sensitive Pareto Ensemble for ransomware detection [41,42]. Nissar et al. explicitly integrated cost-sensitive learning into CAN-bus anomaly detection, and Nayak et al. introduced an “anti-phishing score” to weigh FPs against FNs [43,44].

However, these methods typically address cost-sensitivity at the data level (resampling) or the ensemble level. There is a lack of frameworks that embed cost constraints directly into the optimization loop of a neural network, particularly within a reinforcement learning agent, to ensure operational stability under adversarial noise.

2.4. Quantum Neural Networks and Chaos-Driven Optimization

Quantum Neural Networks (QNNs) offer theoretical advantages in expressibility and processing high-dimensional data. Recent applications in intrusion detection include the work of Kim and Madhavi, who used quantum outlier analysis, and Küçükkara et al., who applied QNNs to the CIC-DDoS2019 dataset [15,45]. Kukliansky et al. and Kadi et al. further explored QNNs’ resilience to noise and the impact of quantum encoding strategies [46,47]. Additionally, Nalayini et al. proposed a quantum-inspired evolutionary selection method for SDNs [48].

Despite their potential, QNNs suffer from significant optimization challenges, notably the “barren plateau” problem where gradients vanish, and vulnerability to adversarial attacks [49,50,51]. To address optimization stability, chaos theory has been applied in other domains. Matsuki et al. introduced Chaos-based Reinforcement Learning (CBRL), and Naruse et al. utilized laser chaos for bandit problems [21,52]. Bilban and Inan demonstrated that chaotic Lévy flights could improve PPO stability in autonomous vehicles [53]. However, the integration of chaos-driven exploration specifically to stabilize the training of cost-aware QNNs for IoT intrusion detection remains an unaddressed research gap. This gap is particularly relevant for IoT security because instability in QNN training may weaken the consistency of operationally sensitive intrusion decisions under asymmetric false-positive and false-negative costs.

Table 1 summarizes the key studies discussed. The reviewed literature shows that prior studies have addressed important but mostly separate aspects of IoT intrusion detection, including high-accuracy deep learning architectures, adaptive reinforcement learning, cost-sensitive optimization, and quantum-inspired modeling. However, these dimensions are rarely integrated within a single framework. In particular, the combination of a QNN-based value function, cost-aware reinforcement learning, and chaos-driven exploration has not been sufficiently examined for IoT DDoS detection. Accordingly, the contribution of the present study is not framed as a universal superiority claim over all existing methods but as the design and evaluation of an integrated framework that brings these components together within a unified cost-aware decision-making pipeline.

3. Materials and Methods

3.1. Dataset Information

This study utilizes a processed subset of the CIC-DDoS2019 dataset [23], a widely adopted benchmark for DDoS detection research. The subset contains 3000 labeled samples represented by seven retained numerical features (Fwd Packet Length Min, Fwd Packet Length Mean, Packet Length Min, Packet Length Mean, Down/Up Ratio, Avg Packet Size, and Avg Fwd Segment Size) and a binary class label (0: Benign, 1: Attack). This specific subset size was deliberately chosen to navigate the severe computational bottleneck inherent in simulating Parameterized Quantum Circuits (PQCs) on classical hardware. In a reinforcement learning framework, the QNN must be evaluated and updated continuously across thousands of episodes and experience replay steps. Simulating these quantum states for the entire dataset is currently computationally prohibitive without fault-tolerant quantum hardware. To ensure fair evaluation and stable cost-aware optimization, balanced sampling is applied, resulting in an equal class distribution (50% benign, 50% attack).

Prior to training, standard preprocessing steps are performed. Missing values are replaced with zero, while positive and negative infinity values are clipped to 10^{6} and -10^{6}, respectively, to ensure numerical stability. Features with near-zero variance (standard deviation < 10^{-8}) are removed, although all seven features in the processed dataset are retained. Subsequently, Z-score normalization is applied to all features to standardize the input space and improve convergence behavior.
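The preprocessing steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, the population standard deviation is assumed for both the variance filter and the Z-score, and the statistics are computed on whatever rows are passed in (in practice they would normally be fitted on the training split only).

```python
import math
from statistics import fmean, pstdev

CLIP = 1e6       # bound used for +/- infinity, as stated in the text
MIN_STD = 1e-8   # near-zero-variance threshold, as stated in the text


def clean_value(v):
    """Replace missing values with 0 and map +/- infinity to +/- CLIP."""
    if v is None:
        return 0.0
    v = float(v)
    if math.isnan(v):
        return 0.0
    if math.isinf(v):
        return math.copysign(CLIP, v)
    return v


def preprocess(rows):
    """Clean values, drop near-constant columns, then Z-score the rest.

    Returns (normalized_rows, kept_column_indices).
    """
    cleaned = [[clean_value(v) for v in row] for row in rows]
    columns = list(zip(*cleaned))
    # Assumption: population standard deviation is used for the filter.
    kept = [j for j, col in enumerate(columns) if pstdev(col) >= MIN_STD]
    stats = {j: (fmean(columns[j]), pstdev(columns[j])) for j in kept}
    normalized = [
        [(row[j] - stats[j][0]) / stats[j][1] for j in kept] for row in cleaned
    ]
    return normalized, kept
```

Calling `preprocess` on a small table with a constant column returns the normalized rows together with the indices of the columns that survived the variance filter, which makes it easy to confirm that all seven features are retained on a given subset.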

In the reinforcement learning framework, the state representation extends beyond raw features. In addition to the seven normalized inputs, the state vector includes one auxiliary attack-probability channel and one contextual summary channel, forming a 9-dimensional representation in total. The attack-probability term is not taken from the ground-truth label of the current sample; rather, it is included as an auxiliary estimate available at decision time, while short-term contextual information is incorporated through window-based temporal statistics (window size N = 5). This design enables cost-aware decision-making by integrating both instantaneous and contextual information during optimization while avoiding direct label leakage into the state representation.

The attack-probability component is computed by applying a pre-trained base classifier to the current feature vector and extracting the corresponding attack-class probability via predict_proba. Therefore, this quantity is not derived from the ground-truth label of the current sample and is not produced from the agent’s own previous decisions; rather, it serves as an auxiliary decision-time contextual estimate.
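The construction of the 9-dimensional state can be illustrated as follows. The sketch makes two assumptions that the text does not settle: the base classifier is represented by a hypothetical callable `attack_probability` that returns the attack-class probability (standing in for the `predict_proba` output), and the contextual summary channel is taken to be the mean attack probability over the last N = 5 samples. The class name `StateBuilder` is also illustrative.

```python
from collections import deque

WINDOW = 5  # window size N = 5, as stated in the text


class StateBuilder:
    """Builds 9-dimensional states: 7 features + attack probability + context."""

    def __init__(self, attack_probability, window=WINDOW):
        # attack_probability: callable mapping a feature vector to P(attack);
        # a stand-in for the pre-trained base classifier described in the text.
        self.attack_probability = attack_probability
        self.recent = deque(maxlen=window)

    def build(self, features):
        if len(features) != 7:
            raise ValueError("expected the seven normalized features")
        p_attack = float(self.attack_probability(features))
        self.recent.append(p_attack)
        # Assumption: the contextual channel is the mean attack probability
        # over the most recent `window` samples; the source does not specify it.
        context = sum(self.recent) / len(self.recent)
        return list(features) + [p_attack, context]
```

Because neither channel reads the ground-truth label, the state built this way is consistent with the stated goal of avoiding label leakage.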

3.2. Quantum Neural Networks

Quantum Neural Networks (QNNs) are variational quantum models that combine parameterized quantum circuits (PQCs) with classical optimization procedures [54,55]. A typical QNN consists of three main components: data encoding, a trainable variational circuit, and measurement. Classical input features are first mapped to quantum states through an encoding scheme (e.g., rotation-based encoding), after which a parameterized unitary transformation is applied. The trainable parameters are updated using classical gradient-based or gradient-free optimization algorithms based on measurement outcomes [56].
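To make the three components concrete, the following toy example simulates a single-qubit circuit in plain Python: an RY rotation encodes the input, a second RY rotation carries the trainable parameter, and the Pauli-Z expectation value is measured. It is only a sketch under simplifying assumptions (one qubit, real amplitudes, exact simulation) and does not reproduce the circuit used in this study. The parameter-shift rule is shown as one common gradient-based option.

```python
import math


def ry(angle):
    """2x2 matrix of the single-qubit RY rotation gate."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Multiply a 2x2 real gate by a 2-component real state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def qnn_output(x, theta):
    """Encode x, apply a trainable rotation, and return <Z>."""
    state = [1.0, 0.0]                 # start in |0>
    state = apply(ry(x), state)        # data encoding
    state = apply(ry(theta), state)    # trainable variational layer
    return state[0] ** 2 - state[1] ** 2   # expectation value of Pauli-Z


def parameter_shift_grad(x, theta):
    """Gradient of <Z> with respect to theta via the parameter-shift rule."""
    return 0.5 * (qnn_output(x, theta + math.pi / 2)
                  - qnn_output(x, theta - math.pi / 2))
```

For this toy circuit the output equals cos(x + θ), so the gradient returned by `parameter_shift_grad` can be checked analytically against -sin(x + θ).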

In this study, the QNN is not positioned as a standalone supervised classifier. Instead, it is employed as a value function approximator within a reinforcement learning framework. Specifically, the QNN estimates action-value (Q) functions, enabling policy updates through Bellman-based optimization rather than direct minimization of classification loss. This formulation allows the model to adapt its decision boundaries according to cost-aware reward signals rather than solely maximizing classification accuracy.

Despite their theoretical expressibility advantages, QNNs face well-known optimization challenges. As circuit depth and expressibility increase, gradients may diminish significantly, leading to unstable or slow convergence behavior [57]. This challenge, often discussed in the context of variational quantum algorithms, limits the practical trainability of deep quantum circuits. Therefore, the application of QNNs in operational security contexts requires careful architectural design and stabilization strategies.

To mitigate optimization instability without making strong theoretical claims, this work integrates the QNN within a cost-aware reinforcement learning loop and augments exploration through a deterministic chaos-driven mechanism. Rather than asserting a resolution to fundamental gradient issues, the proposed approach aims to empirically enhance convergence stability and operational reliability under asymmetric cost constraints.

3.3. Reinforcement Learning and Bellman Optimization

Reinforcement Learning (RL) provides a sequential decision-making framework in which an agent learns to select actions by interacting with an environment and receiving feedback in the form of rewards [58]. In contrast to static supervised learning, RL enables adaptive policy updates based on long-term cumulative reward rather than minimizing instantaneous classification loss [59].

In the present study, reinforcement learning is not introduced as a claim that RL is universally superior to simpler cost-sensitive supervised classifiers. Rather, it is used as a sequential decision-making framework that allows asymmetric operational penalties to be incorporated directly into cumulative policy updates. This formulation is particularly useful here because the agent optimizes long-term cost-aware reward under evolving state representations that include both instantaneous traffic features and short-horizon contextual statistics, instead of operating only on isolated static feature vectors.

In this study, the intrusion detection problem is formulated as a Markov Decision Process (MDP), defined by the tuple (S, A, P, R, γ), where S denotes the state space, A the action space, P the state transition probability, R the reward function, and γ \in [0, 1] the discount factor. The state representation incorporates normalized traffic features and short-term contextual statistics, while the action space consists of binary decisions (Benign or Attack).

Instead of directly minimizing classification error, the proposed framework optimizes the expected cumulative reward defined as:

Q^{π}(s, a) = E_{π} [\sum_{t=0}^{\infty} γ^{t} r_{t} \mid s_{0} = s, a_{0} = a],    (1)

where Q^{π}(s, a) denotes the action-value function under policy π. The optimal policy π^{*} is obtained by maximizing the expected return.
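As a small worked illustration of the sum inside Equation (1), the discounted return of a finite reward sequence can be computed as below. The function name and the example rewards are hypothetical; the infinite sum is truncated to the length of the supplied sequence.

```python
def discounted_return(rewards, gamma):
    """Finite-horizon version of the sum in Eq. (1): sum_t gamma^t * r_t."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards so each step is r_t + gamma * (return from t+1).
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, with rewards 1, 1, 1 and γ = 0.5 the return is 1 + 0.5 + 0.25 = 1.75, which shows how later rewards contribute less to the value of the first decision.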

The Q-function is updated using the Bellman optimality equation:

Q(s, a) \leftarrow r + γ \max_{a'} Q(s', a').    (2)

In practice, the QNN serves as a function approximator for Q(s, a; θ), where θ represents the trainable quantum circuit parameters. The temporal-difference target is computed using Bellman optimization, while the detailed training loss is implemented as a sample-wise weighted Smooth L1 objective, as described in Section 3.5.5. A target network with parameters θ^{-} is periodically updated to stabilize training.

The Bellman target itself is not modified in the present study. Instead, cost sensitivity is introduced at the optimization stage by reweighting the per-sample temporal-difference loss according to the operational importance of different error types. Accordingly, the proposed formulation should be interpreted as a cost-sensitive training extension built on the standard DQN Bellman target, rather than as a new proof of Bellman optimality.
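The separation between an unmodified Bellman target and a reweighted per-sample loss can be sketched as follows. The weight values and the dictionary-based batch layout are hypothetical and chosen only for illustration; the Smooth L1 definition follows the usual quadratic-then-linear form with threshold `beta`, and the actual weighting scheme is the one specified in Section 3.5.5.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Standard DQN target: r + gamma * max_a' Q(s', a'; theta^-)."""
    return reward if done else reward + gamma * max(next_q_values)


def smooth_l1(error, beta=1.0):
    """Smooth L1 penalty: quadratic below beta, linear above it."""
    e = abs(error)
    return 0.5 * e * e / beta if e < beta else e - 0.5 * beta


def weighted_td_loss(batch, gamma, weights):
    """Mean of per-sample Smooth L1 TD losses scaled by an error-type weight.

    batch: list of dicts with keys q_sa, reward, next_q (target-network
           action values for the next state), and outcome.
    weights: hypothetical mapping from outcome ('TP', 'TN', 'FP', 'FN')
             to a positive weight.
    """
    total = 0.0
    for item in batch:
        # The target is the unmodified Bellman target.
        target = td_target(item["reward"], item["next_q"], gamma)
        # Cost sensitivity enters only through the per-sample weight.
        total += weights[item["outcome"]] * smooth_l1(item["q_sa"] - target)
    return total / len(batch)
```

With weights such as `{"TP": 1, "TN": 1, "FP": 2, "FN": 5}`, a transition that corresponds to a missed attack contributes five times as much to the gradient as a correctly classified one with the same temporal-difference error, while the target value itself is unchanged.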

By updating QNN parameters through Bellman-based optimization rather than cross-entropy loss, the model aligns decision boundaries with long-term cost-aware reward signals. This allows the agent to calibrate its policy according to asymmetric penalties associated with false positives and false negatives.

Rather than focusing solely on immediate classification correctness, this reinforcement learning formulation enables adaptive optimization toward operational stability and cost efficiency. Consequently, the decision-making process becomes sensitive to the long-term impact of misclassification under dynamic IoT traffic conditions.

3.4. Chaos-Driven Exploration Mechanism

Efficient exploration is a fundamental challenge in reinforcement learning, particularly in high-dimensional and non-convex optimization landscapes [60]. Conventional exploration strategies such as ϵ-greedy or Gaussian noise injection rely on stochastic perturbations, which may be insufficient to consistently escape suboptimal policies under complex decision boundaries [61].

To enhance exploration behavior, this study incorporates a deterministic chaos-driven mechanism based on the logistic map

x_{t+1} = r x_{t} (1 - x_{t}),    (3)

where x_{t} \in (0, 1) and r is the control parameter. For appropriate values of r, the logistic map generates bounded yet high-entropy dynamics, providing structured variability without purely random fluctuations.

The logistic map was selected in this study as a practical chaos generator because of its simple one-dimensional formulation, bounded output range, and negligible computational overhead during repeated reinforcement learning updates. These properties make it convenient for state-level perturbation without introducing additional architectural complexity. In this work, the logistic map is used as a representative deterministic chaotic process rather than as evidence that it is superior to alternative chaotic maps.

In the present study, the control parameter was fixed at r = 3.9 as an empirical operating point that produces sufficiently rich chaotic behavior while preserving bounded state-level perturbations during repeated reinforcement learning updates. This value was selected for practical stability and implementation convenience in the current experimental setting, rather than as a theoretically proven optimum. Accordingly, the present study does not claim that r = 3.9 is universally superior to other logistic-map settings or to alternative chaotic generators. A broader sensitivity analysis over different r values remains an important direction for future work.
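The behavior of the logistic map at r = 3.9 can be inspected with a short script. The function name and the initial values are illustrative assumptions; only the recurrence and the value of r come from the text.

```python
def logistic_map(x0, r=3.9, steps=1):
    """Iterate x_{t+1} = r * x_t * (1 - x_t) and return the full trajectory."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie in the open interval (0, 1)")
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


if __name__ == "__main__":
    a = logistic_map(0.2, steps=200)
    b = logistic_map(0.2 + 1e-9, steps=200)   # tiny change in the start value
    print("largest value reached:", max(a))
    print("largest gap between trajectories:",
          max(abs(u - v) for u, v in zip(a, b)))
```

Since r x (1 - x) is at most r / 4, every iterate stays below 0.975 for r = 3.9, which is the boundedness property referred to above. Two trajectories started 10^{-9} apart separate to a macroscopic distance within a few dozen iterations, which illustrates the sensitivity to initial conditions that makes the sequence deterministic yet irregular.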

Rather than perturbing the quantum circuit parameters directly, the chaotic signal is integrated into the state observation. At each time step, the state vector is perturbed by a chaos-derived component, effectively inducing controlled variability in the agent’s perception of the environment without altering the dimensionality of the state representation. This design preserves the stability of the QNN parameter space while encouraging broader policy exploration.
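One way such a state-level perturbation could be realized is sketched below. The additive form, the centring of the chaotic value around zero, the amplitude `scale`, and the initial value `x0` are all assumptions made for illustration; the text specifies only that the perturbation is chaos-derived, applied to the state observation, and dimension-preserving.

```python
class ChaoticPerturber:
    """Adds a bounded logistic-map signal to each state component."""

    def __init__(self, x0=0.37, r=3.9, scale=0.05):
        # x0 and scale are hypothetical; only r = 3.9 comes from the text.
        if not 0.0 < x0 < 1.0:
            raise ValueError("x0 must lie in the open interval (0, 1)")
        self.x, self.r, self.scale = x0, r, scale

    def _next(self):
        """Advance the logistic map by one step and return the new value."""
        self.x = self.r * self.x * (1.0 - self.x)
        return self.x

    def perturb(self, state):
        # Centre the chaotic value around zero so every shift lies within
        # +/- scale / 2; the length of the state vector is unchanged.
        return [s + self.scale * (self._next() - 0.5) for s in state]
```

Because the generator is deterministic, two perturbers created with the same `x0` produce identical perturbation sequences, so training runs remain reproducible without a random seed for this component.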

From an optimization perspective, the chaos-driven perturbation increases diversity in visited state-action pairs, reducing the likelihood of premature convergence to policies associated with suboptimal operational cost. Unlike purely random noise, deterministic chaos maintains long-term dynamical structure, which can contribute to more stable exploration trajectories across training episodes.

Importantly, this mechanism does not claim to theoretically eliminate gradient-related challenges in variational quantum circuits. Instead, it empirically promotes adaptive policy refinement by preventing the reinforcement learning agent from repeatedly exploiting narrow regions of the decision space. In the present experimental setting, the integration of chaos-driven exploration was intended to promote broader policy exploration and more stable adaptive updates under asymmetric penalty conditions.

In summary, the chaos-driven exploration module functions as a deterministic diversification strategy that complements Bellman-based optimization, supporting more stable and operationally robust decision-making in IoT intrusion detection scenarios.

3.5. Proposed Framework: CDRL-QNN

This study proposes a hybrid framework named CDRL-QNN (Cost-Aware and Chaos-Driven Reinforcement Learning Quantum Neural Network), designed to enhance operational stability in IoT intrusion detection systems. As illustrated in Figure 1, the framework integrates three core components: (i) a Quantum Neural Network (QNN) acting as a value function approximator, (ii) a cost-sensitive reinforcement learning agent optimized through Bellman updates, and (iii) a deterministic chaos-driven exploration mechanism.

Unlike conventional intrusion detection architectures, where neural networks operate as static classifiers trained via cross-entropy minimization, the proposed framework reformulates the detection task as a sequential decision-making problem. The QNN does not directly output class probabilities as a final decision; instead, it estimates action-value functions Q(s, a; θ) within the reinforcement learning loop. The policy is then derived from these value estimates, allowing decisions to be shaped by long-term cumulative reward rather than instantaneous classification accuracy.

In the proposed CDRL-QNN configuration, the parameters of the hybrid QNN-DQN value function, including the quantum circuit parameters, are updated during training through the reinforcement learning optimization loop. This differs from the frozen-reference baseline used for comparison, in which the quantum layer is kept fixed and only the classical layers are updated.

Operational cost asymmetry is explicitly incorporated into the learning process through both the reward function and sample-wise weighted Bellman optimization. Misclassifications are penalized according to their associated risk levels, enabling the agent to calibrate its decision boundary based on asymmetric false-positive and false-negative costs. This design shifts the optimization objective from purely accuracy-driven training toward cost-aware decision calibration.

To support stable exploration in the non-convex optimization landscape induced by variational quantum circuits, a chaos-driven perturbation module is integrated into the state representation. The deterministic chaotic signal increases diversity in explored state-action trajectories without directly disturbing quantum parameters, thereby preserving circuit stability while preventing premature policy convergence.

The overall training process proceeds iteratively. For each state observation derived from network traffic features and contextual statistics, the QNN estimates action values. The agent selects actions based on an exploration strategy augmented by chaotic perturbation. Rewards reflecting operational cost are assigned, and QNN parameters are updated using Bellman loss minimization. This closed-loop structure enables adaptive policy refinement under dynamic traffic conditions. The resulting framework enables cost-aware policy calibration supported by quantum-enhanced value approximation and structured exploration dynamics.

3.5.1. System Architecture

The system architecture operationalizes the CDRL-QNN framework as a closed-loop hybrid quantum–classical learning pipeline. The architecture is structured into three sequential processing layers: state construction, quantum value approximation, and reinforcement-based policy optimization.

First, incoming network traffic samples are transformed into a normalized numerical representation through preprocessing and feature scaling. In addition to the selected statistical features, short-term contextual information is incorporated via a sliding temporal window, producing a 9-dimensional state vector. This extended representation enables the agent to capture both instantaneous traffic patterns and short-horizon behavioral trends, approximating the Markov property required for reinforcement learning.

The constructed state is passed through a classical embedding block that maps the feature vector to a lower-dimensional latent space suitable for quantum encoding. This latent representation is angle-encoded into a 4-qubit variational quantum circuit composed of StronglyEntanglingLayers. The quantum circuit produces expectation values that serve as intermediate representations, which are subsequently transformed by a classical post-processing layer into Q-values corresponding to the available actions.

Policy execution is governed by a DQN-based reinforcement learning module. Action selection follows an $\epsilon$-greedy strategy augmented with deterministic chaos-based state perturbation. The chaotic signal is injected at the observation level prior to value evaluation, increasing state-space diversity while preserving quantum parameter stability.

Following action execution, a cost-aware reward is computed according to asymmetric false positive and false negative penalties. The transition tuple is stored in an experience replay buffer, and the QNN parameters are updated via sample-wise weighted Bellman loss minimization using mini-batch sampling and a periodically updated target network.

Through this iterative interaction cycle—state encoding, quantum value estimation, cost-sensitive reward assignment, and Bellman-based parameter update—the architecture enables adaptive decision calibration under asymmetric operational risk conditions. The modular separation between state perturbation, quantum approximation, and reinforcement optimization ensures stability while maintaining flexibility for future extensions such as end-to-end quantum parameter training or noise-aware deployment.

3.5.2. Problem Formulation as a Markov Decision Process

The intrusion detection task is formulated as a Markov Decision Process (MDP), defined by the tuple

$$(S, A, P, R, \gamma), \tag{4}$$

where $S$ denotes the state space, $A$ the action space, $P$ the state transition dynamics, $R$ the reward function, and $\gamma \in [0, 1]$ the discount factor.

State space ($S$). Each state $s_{t} \in S$ represents a processed network traffic instance augmented with short-term contextual information. The state vector includes the seven normalized numerical features together with two additional contextual components: an auxiliary attack-probability estimate and a short-horizon temporal summary computed over a sliding window. The attack-probability component is used only as an estimated contextual signal available at decision time and is not taken from the ground-truth label of the current instance. This formulation enables the agent to capture both instantaneous traffic characteristics and short-term behavioral trends while reducing the risk of circularity in state construction.
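The state construction described above can be sketched as follows. This is a minimal illustration only: the function name `build_state`, the example feature values, and in particular the choice of the window mean of recent attack-probability estimates as the temporal summary are assumptions, since the text does not specify how the summary is computed. The window size of 5 is the value reported in Section 4.3.

```python
from collections import deque

WINDOW = 5  # temporal window size reported in Section 4.3


def build_state(features, attack_prob, history):
    """Return a 9-dimensional state: 7 features + probability + window summary.

    `history` holds the most recent attack-probability estimates. Using their
    mean as the temporal summary is an assumption made for this sketch.
    """
    if len(features) != 7:
        raise ValueError("expected seven normalized features")
    history.append(attack_prob)  # deque(maxlen=WINDOW) drops the oldest entry
    window_mean = sum(history) / len(history)
    return list(features) + [attack_prob, window_mean]


history = deque(maxlen=WINDOW)
state = build_state([0.1, -0.3, 0.0, 0.5, 1.2, -0.7, 0.4], 0.8, history)
next_state = build_state([0.0] * 7, 0.2, history)
```

Because the probability estimate is passed in as an argument, the sketch keeps the ground-truth label out of the state, in line with the non-circularity requirement stated above.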

Action space ($A$). The action space is binary:

$$A = \{a_{0}, a_{1}\}, \tag{5}$$

where $a_{0}$ corresponds to classifying the instance as Benign and $a_{1}$ corresponds to Attack.

Transition dynamics ($P$). The environment transitions to the next state $s_{t+1}$ after an action is taken. In the offline training setting used in this study, the prepared traffic subset is randomly permuted at the beginning of each episode, and transitions are then generated by sequentially traversing this permuted sample stream. Accordingly, the sequential dependency modeled here does not arise from live online interaction but from context-aware state transitions defined over ordered traffic samples within the prepared experimental stream, while short-horizon contextual summaries are updated over a sliding window. Although transition probabilities are not explicitly modeled, the Markov property is approximated by conditioning decisions on the current augmented state representation.
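The offline transition scheme described above (permute once per episode, then traverse sequentially) might look like the sketch below. The generator name `episode_stream` and the `done` flag marking the last sample are illustrative assumptions; the fixed seed is used only to make the example reproducible.

```python
import random


def episode_stream(num_samples, rng):
    """Yield (current_index, next_index, done) over one shuffled episode."""
    order = list(range(num_samples))
    rng.shuffle(order)  # fresh random permutation at the start of each episode
    for pos, idx in enumerate(order):
        done = pos == num_samples - 1
        next_idx = None if done else order[pos + 1]
        yield idx, next_idx, done


rng = random.Random(0)  # fixed seed only for reproducibility of the sketch
transitions = list(episode_stream(6, rng))
```

Each sample is visited exactly once per episode, and the next state of one transition is the current state of the following one.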

Reward function ($R$). After each action, a scalar reward $r_{t}$ is assigned based on the correctness of the decision and its associated operational cost. The reward structure is asymmetric, reflecting the different impacts of false positives and false negatives in high-availability IoT systems. Correct classifications yield positive rewards, while misclassifications incur penalties weighted according to predefined cost parameters.

Discount factor ($\gamma$). The discount factor controls the balance between immediate and long-term rewards. By incorporating $\gamma$, the agent optimizes cumulative cost-aware performance rather than isolated classification outcomes.

Under this formulation, the objective is to learn an optimal policy $\pi^{*}$ that maximizes the expected cumulative discounted reward:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]. \tag{6}$$

The QNN parameterizes the action-value function $Q(s, a; \theta)$ and is updated using Bellman-based optimization. This formulation shifts the learning objective from minimizing classification loss to optimizing long-term cost-sensitive decision quality under dynamic traffic conditions.
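To make the quantity inside the expectation of Eq. (6) concrete, the sketch below evaluates the discounted sum for a short, finite reward sequence. The reward values in the example are hypothetical apart from the false-negative penalty of 100 and the discount factor of 0.99, which are the settings reported in Section 4.3.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    for reward in reversed(rewards):  # backward recursion: G_t = r_t + gamma * G_{t+1}
        total = reward + gamma * total
    return total


# Two correct decisions (hypothetical reward 1.0) followed by one missed attack.
example = discounted_return([1.0, 1.0, -100.0], gamma=0.99)
```

With $\gamma$ close to 1, a later costly error still weighs heavily on the return of earlier steps, which is the sense in which the policy is shaped by cumulative rather than instantaneous outcomes.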

3.5.3. Cost-Aware Reward Design

In high-availability IoT environments, the operational impact of misclassification is asymmetric. A false positive (FP) may trigger unnecessary mitigation mechanisms, degrade service quality, or interrupt legitimate traffic, while a false negative (FN) may allow malicious activity to persist. Therefore, the reward structure is designed to explicitly incorporate cost asymmetry into the learning objective.

Let $y_{t} \in \{0, 1\}$ denote the true label at time $t$, and $a_{t} \in \{0, 1\}$ the action selected by the agent. The reward function is defined as

$$r_{t} = \begin{cases} +R_{c}, & \text{if } a_{t} = y_{t}, \\ -C_{FP}, & \text{if } a_{t} = 1 \text{ and } y_{t} = 0, \\ -C_{FN}, & \text{if } a_{t} = 0 \text{ and } y_{t} = 1, \end{cases} \tag{7}$$

where $R_{c} > 0$ denotes the reward for correct classification, and $C_{FP}, C_{FN} > 0$ represent the penalty coefficients for false positives and false negatives, respectively.

In this study, the penalty parameters are configured such that $C_{FN} > C_{FP}$, reflecting the greater operational risk associated with missed attacks in high-availability IoT systems. By embedding these costs directly into the reward signal, the reinforcement learning agent optimizes long-term expected cost rather than short-term classification accuracy.
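Eq. (7) translates directly into a small function. In the sketch below, the penalties of 60 and 100 are the values reported in Section 4.3; the correct-decision reward `R_C = 10.0` is a hypothetical placeholder, because the value of $R_{c}$ is not stated in the text.

```python
C_FP = 60.0   # false-positive penalty reported in Section 4.3
C_FN = 100.0  # false-negative penalty reported in Section 4.3
R_C = 10.0    # hypothetical: the value of R_c is not stated in the text


def cost_aware_reward(action, label, r_c=R_C, c_fp=C_FP, c_fn=C_FN):
    """Eq. (7); action and label are 0 (Benign) or 1 (Attack)."""
    if action == label:
        return r_c       # correct decision
    if action == 1 and label == 0:
        return -c_fp     # false positive: benign traffic flagged as attack
    return -c_fn         # false negative: attack classified as benign
```

For instance, `cost_aware_reward(0, 1)` returns the false-negative penalty, which is the largest in magnitude under the configured asymmetry.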

The specific false-positive and false-negative cost settings used in this study were selected as application-dependent asymmetric risk coefficients rather than as universally optimal constants. In high-availability IoT settings, a missed attack may lead to prolonged service disruption, whereas a false alarm typically causes unnecessary mitigation or degraded service quality. For this reason, false negatives were assigned a higher relative cost than false positives throughout the present design. These values should therefore be interpreted as scenario-specific operational assumptions used to study cost-aware policy calibration under asymmetric risk.

Unlike traditional cost-sensitive approaches that modify class weights in supervised loss functions, the proposed design integrates cost constraints within the sequential decision-making process. Consequently, the learned policy adapts its decision boundary according to cumulative operational risk, enabling calibrated trade-offs between detection sensitivity and false alarm reduction.

This reward formulation ensures that the QNN-based value function is updated through Bellman optimization under cost-aware feedback, aligning the learning dynamics with practical deployment requirements.

In addition to asymmetric reward shaping, cost sensitivity is also incorporated during DQN optimization through sample-wise loss weighting. Specifically, false positive and false negative transitions are assigned different weighting coefficients in the Bellman loss, allowing the learning process to emphasize high-risk misclassification patterns during parameter updates. This dual design enables cost-awareness both at the environment-feedback level and at the optimization level.

3.5.4. Chaos-Driven State Perturbation

To enhance exploration diversity without directly perturbing quantum circuit parameters, a deterministic chaos-driven mechanism is integrated into the state representation. The chaotic sequence is generated using the logistic map:

$$z_{t+1} = r z_{t} (1 - z_{t}), \tag{8}$$

where $z_{t} \in (0, 1)$ and $r$ is the control parameter governing system dynamics. For suitable values of $r$, the logistic map produces bounded yet high-entropy behavior, enabling structured variability across iterations.
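A minimal sketch of Eq. (8) is given below, using $r = 3.9$ as reported in Section 4.3. The function name and the initial condition `z0 = 0.5` are illustrative choices, not values taken from the study.

```python
def logistic_sequence(z0, r=3.9, steps=5):
    """Iterate z_{t+1} = r * z_t * (1 - z_t) and return the generated values."""
    if not 0.0 < z0 < 1.0:
        raise ValueError("z0 must lie strictly between 0 and 1")
    values = []
    z = z0
    for _ in range(steps):
        z = r * z * (1.0 - z)
        values.append(z)
    return values


seq = logistic_sequence(0.5, steps=3)  # first value is 3.9 * 0.25 = 0.975
```

Since $z(1 - z) \le 1/4$, every iterate is bounded above by $r/4 = 0.975$ for this setting, which is what keeps the signal inside a fixed range.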

At each time step $t$, the original state vector $s_{t}$ is perturbed by a chaos-derived component without changing its dimensionality. Let $s_{t}^{(\text{base})}$ denote the original state representation constructed from normalized traffic features and contextual statistics. The perturbed state is defined as

$$\tilde{s}_{t} = s_{t}^{(\text{base})} + \alpha \cdot z_{t}, \tag{9}$$

where $\alpha$ is a scaling coefficient controlling the influence of the chaotic signal.
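One possible reading of Eq. (9) is sketched below. Several points are assumptions rather than details confirmed by the text: the scalar $\alpha z_{t}$ is added to every state component, the perturbation factor of 0.05 from Section 4.3 is taken to play the role of $\alpha$, the 20% application rate is realized by a random gate, and the initial condition `z0` and the class name `ChaosPerturber` are hypothetical.

```python
import random


class ChaosPerturber:
    """Observation-level perturbation driven by the logistic map."""

    def __init__(self, z0=0.37, r=3.9, alpha=0.05, prob=0.2, seed=0):
        self.z = z0          # hypothetical initial condition in (0, 1)
        self.r = r           # logistic-map control parameter (Section 4.3)
        self.alpha = alpha   # assumed to correspond to the 0.05 perturbation factor
        self.prob = prob     # fraction of steps perturbed (Section 4.3)
        self.rng = random.Random(seed)

    def __call__(self, base_state):
        self.z = self.r * self.z * (1.0 - self.z)      # Eq. (8)
        if self.rng.random() >= self.prob:
            return list(base_state)                    # step left unperturbed
        return [x + self.alpha * self.z for x in base_state]  # Eq. (9)


perturber = ChaosPerturber(prob=1.0)  # always perturb, for demonstration
out = perturber([0.0] * 9)
```

The output has the same length as the input, and each component moves by at most $\alpha$ because $z_{t}$ stays below 1.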

This perturbation is applied at the observation level rather than at the parameter level. In other words, the quantum circuit parameters $\theta$ remain unaffected by direct noise injection. Instead, variability is introduced in the input state, encouraging the reinforcement learning agent to explore a broader range of state-action trajectories.

From a learning perspective, this mechanism increases diversity in visited regions of the state space while maintaining deterministic structure in the exploration process. Unlike purely stochastic noise, the chaotic signal evolves according to deterministic nonlinear dynamics, which can reduce repetitive exploitation of narrow policy regions during training.

The perturbation magnitude $\alpha$ is selected to preserve numerical stability after normalization, ensuring that the augmented state remains within a bounded range compatible with QNN encoding. This design supports controlled exploration without destabilizing the Bellman optimization process.

By incorporating chaos at the state level, the framework enhances adaptive policy refinement under asymmetric cost constraints while maintaining modular separation between value approximation and exploration control.

3.5.5. Training Procedure

The CDRL-QNN framework is trained using a Deep Q-Network (DQN)-style reinforcement learning procedure, where the QNN serves as the action-value function approximator. Training proceeds in an episodic fashion over the dataset.

At each time step $t$, the agent observes the perturbed state $\tilde{s}_{t}$, selects an action $a_{t}$ according to an exploration policy, receives a cost-aware reward $r_{t}$, and transitions to the next state $\tilde{s}_{t+1}$. The tuple $(\tilde{s}_{t}, a_{t}, r_{t}, \tilde{s}_{t+1})$ is stored in a replay buffer.

Experience Replay. To reduce temporal correlation between consecutive updates and improve learning stability, mini-batches are sampled uniformly from the replay buffer during optimization. This mechanism enables more efficient reuse of past experiences and mitigates instability caused by sequential sample dependence. To preserve cost sensitivity during optimization, each replayed transition is assigned a sample-dependent weight before aggregation of the Bellman loss, allowing the optimizer to prioritize high-risk error types during gradient updates.
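A uniform replay buffer of the kind described above can be sketched with the standard library. The capacity of 30,000 and batch size of 64 are the values reported in Section 4.3; the class name and the extra `weight` field stored alongside each transition (so that the loss can later be weighted per sample) are assumptions of this sketch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer with uniform mini-batch sampling."""

    def __init__(self, capacity=30_000, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, weight=1.0):
        # `weight` is a hypothetical per-sample loss weight kept with the tuple.
        self.memory.append((state, action, reward, next_state, weight))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

Optimization would typically start only once `len(buffer) >= batch_size`, although the warm-up rule used in the study is not stated.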

Target Network. A separate target network with parameters $\theta^{-}$ is maintained to compute the temporal difference target:

$$y_{t} = r_{t} + \gamma \max_{a'} Q(\tilde{s}_{t+1}, a'; \theta^{-}). \tag{10}$$

The target network parameters are periodically updated from the online QNN parameters $\theta$, providing additional stabilization during Bellman updates.
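The target computation in Eq. (10) and a periodic synchronization step can be sketched as below. The discount factor of 0.99 and the 300-step update interval are taken from Section 4.3. Treating the update as a hard copy of the online parameters is an assumption, as are the function names and the plain-list representation of the parameters.

```python
GAMMA = 0.99               # discount factor reported in Section 4.3
TARGET_UPDATE_EVERY = 300  # target-network update interval (Section 4.3)


def td_target(reward, next_q_values, gamma=GAMMA):
    """Eq. (10): r + gamma * max_a' Q(s', a'; theta^-)."""
    return reward + gamma * max(next_q_values)


def maybe_sync(step, online_params, target_params, every=TARGET_UPDATE_EVERY):
    """Copy online parameters into the target copy every `every` steps."""
    if step % every == 0:
        target_params[:] = online_params  # in-place hard update (assumed)
        return True
    return False
```

Here `next_q_values` stands for the two Q-values produced by the target network for the next perturbed state; Eq. (10) as written does not include a terminal-state mask, so none is applied.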

Optimization Step. The QNN parameters are updated by minimizing a cost-sensitive, sample-weighted temporal-difference loss derived from the standard Bellman target:

$$L(\theta) = \frac{1}{B} \sum_{i=1}^{B} \lambda_{i} \, \ell\left(y_{i} - Q(\tilde{s}_{i}, a_{i}; \theta)\right), \tag{11}$$

where $B$ is the mini-batch size, $\ell(\cdot)$ denotes the per-sample Smooth L1 (Huber) loss, and $\lambda_{i}$ is a cost-sensitive weighting term defined according to the misclassification type:

$$\lambda_{i} = \begin{cases} w_{FP}, & \text{if the transition corresponds to a false positive}, \\ w_{FN}, & \text{if the transition corresponds to a false negative}, \\ w_{N}, & \text{otherwise}. \end{cases} \tag{12}$$

In this study, the optimization process uses per-sample Smooth L1 loss together with cost-sensitive weighting, where $w_{FP} = 2.0$, $w_{FN} = 5.0$, and $w_{N} = 1.0$. The weighting coefficients were chosen to reflect an asymmetric operational-risk assumption in which false negatives are more harmful than false positives in the considered IoT security setting. Accordingly, $w_{FN}$ was set higher than $w_{FP}$ so that false-negative transitions exert a stronger influence during parameter updates. These coefficients were used as practical cost-modeling parameters for the present study and should not be interpreted as theoretically optimal or universally transferable values. This weighted objective does not alter the form of the Bellman target itself. Rather, it changes how temporal-difference errors contribute to parameter updates by assigning larger optimization weights to transitions associated with operationally more costly errors. Therefore, the proposed formulation is presented as a cost-sensitive extension of the standard DQN loss, not as a new theoretical result on Bellman optimization.
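The weighted loss of Eqs. (11) and (12) can be illustrated in plain Python. The weights 2.0, 5.0, and 1.0 are those stated above; the Smooth L1 threshold `beta = 1.0` is an assumption (it is the default in common deep learning libraries but is not stated in the text), and the batch values are hypothetical.

```python
W_FP, W_FN, W_N = 2.0, 5.0, 1.0  # weights from Eq. (12) as configured in the study


def smooth_l1(x, beta=1.0):
    """Per-sample Smooth L1 (Huber) loss; beta = 1.0 is an assumed threshold."""
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta


def sample_weight(action, label):
    """Eq. (12): weight by misclassification type."""
    if action == 1 and label == 0:
        return W_FP  # false positive
    if action == 0 and label == 1:
        return W_FN  # false negative
    return W_N       # correct decision


def weighted_td_loss(batch):
    """Eq. (11); batch items are (td_target, q_value, action, label)."""
    total = 0.0
    for target, q_value, action, label in batch:
        total += sample_weight(action, label) * smooth_l1(target - q_value)
    return total / len(batch)


# Hypothetical mini-batch: one correct decision, one FP, one FN.
batch = [(1.0, 0.5, 1, 1), (-60.0, -58.0, 1, 0), (-100.0, -99.5, 0, 1)]
loss = weighted_td_loss(batch)
```

In this toy batch the correct decision and the false negative have the same absolute TD error (0.5), yet the false negative contributes five times as much to the loss, which is the intended effect of the weighting.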

Exploration Strategy. An $\epsilon$-greedy policy is employed during training. The exploration rate $\epsilon$ is gradually decayed across episodes to shift from exploration toward exploitation as learning progresses. The chaos-driven perturbation mechanism operates concurrently with this policy to enhance diversity in explored trajectories.
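The $\epsilon$-greedy rule with a decaying exploration rate might be implemented as in the sketch below. The start value, end value, and decay scale are the settings reported in Section 4.3; the specific functional form $\epsilon_{\text{end}} + (\epsilon_{\text{start}} - \epsilon_{\text{end}}) e^{-\text{step}/\text{decay}}$ is a common choice assumed here, not one confirmed by the text.

```python
import math
import random

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.001, 750_000  # Section 4.3 settings


def epsilon_at(step):
    """Assumed exponential schedule from EPS_START toward EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)


def select_action(q_values, step, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


rng = random.Random(0)
action = select_action([0.2, 0.7], step=0, rng=rng)  # epsilon = 1.0, so random
```

In the full pipeline, `q_values` would be computed from the perturbed state, so the chaotic signal and the $\epsilon$-greedy rule act together.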

Training continues until convergence criteria are met or a predefined number of episodes is completed. The resulting QNN parameters define a cost-aware policy that balances detection sensitivity and false alarm reduction under asymmetric penalty conditions.

4. Experimental Setup

4.1. Data Preprocessing

The label column is automatically detected; if unavailable, the last column is treated as the target variable. In the CIC-DDoS2019 dataset, “BENIGN” samples are mapped to 0 and all attack types to 1.

Only numerical features were retained, while non-informative or categorical columns were removed. Missing and infinite values were replaced with bounded numerical constants to ensure training stability. Features with near-zero variance were discarded, and Z-score normalization was applied to standardize the feature space.
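The cleaning and scaling steps above can be sketched in plain Python. The replacement constants (`0.0` for missing values, `±1e6` for infinities) and the variance threshold are hypothetical, since the text only says that bounded constants and a near-zero-variance filter were used; the toy column names and values are likewise illustrative.

```python
import math

CLIP = 1e6      # hypothetical bound substituted for infinite values
VAR_EPS = 1e-8  # hypothetical near-zero-variance threshold


def clean_value(x):
    """Replace missing or infinite entries with bounded constants."""
    if x is None or math.isnan(x):
        return 0.0
    if math.isinf(x):
        return CLIP if x > 0 else -CLIP
    return x


def preprocess(columns):
    """columns: dict of name -> list of floats. Returns Z-scored columns."""
    out = {}
    for name, values in columns.items():
        cleaned = [clean_value(v) for v in values]
        mean = sum(cleaned) / len(cleaned)
        var = sum((v - mean) ** 2 for v in cleaned) / len(cleaned)
        if var < VAR_EPS:
            continue  # discard near-zero-variance feature
        std = math.sqrt(var)
        out[name] = [(v - mean) / std for v in cleaned]
    return out


raw = {
    "Packet Length Mean": [10.0, 20.0, float("nan"), 30.0],
    "Constant": [1.0, 1.0, 1.0, 1.0],
}
scaled = preprocess(raw)
```

After this step each retained column has zero mean and unit (population) variance, and constant columns are dropped.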

To mitigate class imbalance, balanced sampling was performed when a sample limit was specified, ensuring equal representation of benign and attack instances. When dimensionality reduction was enabled, SelectKBest with ANOVA F-statistics was used to retain the most informative features. In the final experimental setting, seven numerical features were retained: Fwd Packet Length Min, Fwd Packet Length Mean, Packet Length Min, Packet Length Mean, Down/Up Ratio, Avg Packet Size, and Avg Fwd Segment Size. For the balanced 3000-sample setting, the subset was constructed by drawing 1500 benign and 1500 attack samples from the preprocessed candidate pool so that both classes were equally represented in the final experimental set. This balanced subset was used as a controlled evaluation setting motivated by the computational cost of repeated hybrid quantum–classical training.
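The balanced 3000-sample subset (1500 benign, 1500 attack) could be drawn as in the following sketch. The function name, the seed, and the imbalanced toy label pool are assumptions used only for illustration.

```python
import random


def balanced_subset(labels, per_class, seed=0):
    """Return shuffled indices with `per_class` samples of each label (0/1)."""
    rng = random.Random(seed)
    benign = [i for i, y in enumerate(labels) if y == 0]
    attack = [i for i, y in enumerate(labels) if y == 1]
    if min(len(benign), len(attack)) < per_class:
        raise ValueError("not enough samples in one of the classes")
    chosen = rng.sample(benign, per_class) + rng.sample(attack, per_class)
    rng.shuffle(chosen)  # avoid a benign-then-attack ordering
    return chosen


labels = [0] * 2000 + [1] * 8000  # hypothetical imbalanced candidate pool
subset = balanced_subset(labels, per_class=1500)
```

Sampling without replacement within each class guarantees that no instance appears twice in the final subset.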

This reduction should not be interpreted as an arbitrary simplification but as a computationally constrained design choice for the present study. Because the proposed framework repeatedly evaluates a variational quantum circuit within a reinforcement learning loop, increasing the input dimensionality directly increases simulation cost and reduces training tractability on classical hardware. Therefore, a compact subset of informative numerical features was adopted to preserve discriminative information while keeping the hybrid quantum–classical training process computationally feasible.

This preprocessing pipeline produces a normalized, numerically stable, and balanced dataset suitable for cost-aware reinforcement learning and QNN-based optimization.

4.2. Quantum Circuit Configuration

The proposed CDRL-QNN framework employs a hybrid quantum–classical neural architecture implemented using PennyLane v0.35.0 and integrated with PyTorch v2.1.0+cpu. The quantum component functions as the value approximation module within the reinforcement learning agent.

The architecture consists of three main components. First, a classical pre-processing block maps the 9-dimensional state vector to a 4-dimensional latent representation through two fully connected layers ($9 \to 64 \to 4$) with ReLU and Tanh activations. This transformation ensures that the inputs are scaled within the range $[-1, 1]$, making them suitable for quantum angle encoding.

The quantum layer is constructed with four qubits and two variational layers. Classical inputs are embedded into the quantum circuit using angle embedding with Y-rotations, where each feature controls the rotation angle of a corresponding qubit. Following data encoding, two StronglyEntanglingLayers are applied. Each layer consists of parameterized single-qubit rotations and entangling CNOT operations arranged in a ring topology. In total, the quantum circuit contains 24 trainable parameters (2 layers × 4 qubits × 3 rotation parameters per qubit).

Measurement is performed by computing the expectation value of the Pauli-Z operator for each qubit, producing a 4-dimensional output vector within the range $[-1, 1]$. These expectation values are then passed to a classical post-processing block ($4 \to 64 \to 2$), which outputs Q-values corresponding to the two possible actions: Benign and Attack.
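To illustrate the data flow through the quantum layer without depending on a particular library, the sketch below simulates a 4-qubit state vector in plain Python: Y-rotation angle embedding, two layers of general single-qubit rotations followed by a CNOT ring, and Pauli-Z expectation values. It is not the authors' PennyLane implementation. In particular, the nearest-neighbour CNOT ring used in every layer is a simplification that may differ from the entangling ranges applied by the `StronglyEntanglingLayers` template, and all function names are hypothetical.

```python
import cmath
import math

N_QUBITS = 4
N_LAYERS = 2


def apply_gate(state, gate, wire):
    """Apply a 2x2 single-qubit gate to `wire` (wire 0 is the leftmost bit)."""
    shift = N_QUBITS - 1 - wire
    new_state = list(state)
    for idx in range(len(state)):
        if not (idx >> shift) & 1:
            pair = idx | (1 << shift)
            a, b = state[idx], state[pair]
            new_state[idx] = gate[0][0] * a + gate[0][1] * b
            new_state[pair] = gate[1][0] * a + gate[1][1] * b
    return new_state


def apply_cnot(state, control, target):
    """Flip `target` on every basis state whose `control` bit is 1."""
    c_shift = N_QUBITS - 1 - control
    t_shift = N_QUBITS - 1 - target
    new_state = list(state)
    for idx in range(len(state)):
        if (idx >> c_shift) & 1 and not (idx >> t_shift) & 1:
            pair = idx | (1 << t_shift)
            new_state[idx], new_state[pair] = state[pair], state[idx]
    return new_state


def ry(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def rz(phi):
    return [[cmath.exp(-0.5j * phi), 0.0], [0.0, cmath.exp(0.5j * phi)]]


def qnn_forward(inputs, weights):
    """Return <Z> per qubit for 4 inputs and weights[layer][qubit][3]."""
    state = [0j] * (2 ** N_QUBITS)
    state[0] = 1 + 0j                                   # start in |0000>
    for wire, x in enumerate(inputs):                   # angle embedding (RY)
        state = apply_gate(state, ry(x), wire)
    for layer in weights:
        for wire, (phi, theta, omega) in enumerate(layer):
            for gate in (rz(phi), ry(theta), rz(omega)):  # RZ-RY-RZ rotation
                state = apply_gate(state, gate, wire)
        for wire in range(N_QUBITS):                    # simplified CNOT ring
            state = apply_cnot(state, wire, (wire + 1) % N_QUBITS)
    expvals = []
    for wire in range(N_QUBITS):
        shift = N_QUBITS - 1 - wire
        expvals.append(sum(
            (-1.0 if (idx >> shift) & 1 else 1.0) * abs(amp) ** 2
            for idx, amp in enumerate(state)
        ))
    return expvals


zero_weights = [[[0.0, 0.0, 0.0] for _ in range(N_QUBITS)] for _ in range(N_LAYERS)]
n_params = sum(len(q) for layer in zero_weights for q in layer)  # 2 * 4 * 3
```

The weight tensor has shape (2, 4, 3), which gives the 24 trainable quantum parameters stated above, and every output lies in $[-1, 1]$ as required by the post-processing block.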

The circuit is executed on the default.qubit state-vector simulator with backpropagation-based gradient computation. Gradient clipping is applied during training to improve numerical stability. The selected configuration (4 qubits, 2 layers) balances expressivity against optimization stability in a compact circuit while keeping simulation computationally feasible on classical hardware.

4.3. Hyperparameter Settings

The CDRL-QNN framework was trained using a balanced subset of 3000 samples (50% benign, 50% attack) with the seven retained numerical features described in Section 4.1 and a temporal window size of 5 for state construction. The probability-based state representation (9-dimensional) was employed in all experiments.

The quantum component was configured with 4 qubits and 2 variational layers, resulting in 24 trainable quantum parameters. The classical hidden dimension was set to 64. This configuration provides a balance between representational capacity and optimization stability while remaining computationally feasible on a state-vector simulator.

For the DQN training procedure, the batch size was set to 64 and the replay buffer capacity to 30,000. The target network was updated every 300 steps to stabilize Q-value estimation. The learning rate was fixed at 0.001 using the Adam optimizer, with a discount factor $\gamma = 0.99$. Optimization was performed every 4 steps. Training was conducted for 800 episodes, and gradient clipping (maximum norm = 10.0) was applied to prevent instability during hybrid quantum–classical optimization.

Exploration followed an $\epsilon$-greedy strategy with $\epsilon$ decaying exponentially from 1.0 to 0.001 (decay rate = 750,000 steps). To enhance robustness and exploration diversity, chaotic perturbations based on the logistic map ($r = 3.9$) were applied in 20% of training steps with a small perturbation factor ($\epsilon_{\text{chaos}} = 0.05$). In this study, $r = 3.9$ was used as a practical empirical setting to induce sufficiently rich chaotic behavior within a bounded range, rather than as a universally optimal value.

The reward structure assigns a lower penalty to false positives ($-60$) and a higher penalty to false negatives ($-100$), reflecting the greater operational risk of missed attacks. In addition, sample-wise cost-sensitive weighting was applied during Bellman optimization with $w_{FP} = 2.0$, $w_{FN} = 5.0$, and $w_{N} = 1.0$, so that false-negative transitions exert a stronger effect on parameter updates than false-positive transitions. In the present study, these cost settings were introduced as application-dependent asymmetric risk parameters rather than as universally optimal constants.
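For reference, the settings listed in this section can be gathered into a single configuration object. The sketch below is only an organizational aid: the class and field names are hypothetical, while the values are those reported in Sections 4.1 to 4.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDRLQNNConfig:
    # Data and state construction
    n_samples: int = 3000
    n_features: int = 7
    window_size: int = 5
    state_dim: int = 9
    # Hybrid network
    n_qubits: int = 4
    n_layers: int = 2
    hidden_dim: int = 64
    # DQN optimization
    batch_size: int = 64
    buffer_capacity: int = 30_000
    target_update_steps: int = 300
    learning_rate: float = 1e-3
    gamma: float = 0.99
    optimize_every: int = 4
    episodes: int = 800
    grad_clip_norm: float = 10.0
    # Exploration and chaotic perturbation
    eps_start: float = 1.0
    eps_end: float = 0.001
    eps_decay_steps: int = 750_000
    logistic_r: float = 3.9
    chaos_prob: float = 0.2
    chaos_scale: float = 0.05
    # Cost model
    penalty_fp: float = 60.0
    penalty_fn: float = 100.0
    w_fp: float = 2.0
    w_fn: float = 5.0
    w_n: float = 1.0

    @property
    def n_quantum_params(self) -> int:
        """Layers x qubits x 3 rotation angles per qubit."""
        return self.n_layers * self.n_qubits * 3


cfg = CDRLQNNConfig()
```

Keeping the reward penalties (`penalty_fp`, `penalty_fn`) separate from the loss weights (`w_fp`, `w_fn`) mirrors the distinction the paper draws between environment-level and optimization-level cost sensitivity.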

4.4. Evaluation Metrics

The performance of the proposed CDRL-QNN framework is evaluated using both conventional classification metrics and cost-sensitive operational metrics.

At the end of each episode, a confusion matrix is computed, consisting of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Based on these values, the following standard classification metrics are calculated:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{13}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{14}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{15}$$

$$\text{F1-Score} = \frac{2 \cdot (\text{Precision} \cdot \text{Recall})}{\text{Precision} + \text{Recall}}. \tag{16}$$

In addition to conventional metrics, cost-sensitive evaluation is central to this study. Given the asymmetric operational risk in IoT environments, false positives and false negatives are weighted differently. The total operational cost is computed as

$$\text{Total Cost} = w_{FP} \cdot FP + w_{FN} \cdot FN, \tag{17}$$

where $w_{FP}$ and $w_{FN}$ represent the predefined cost weights for false positives and false negatives, respectively. In this study, $w_{FP} = 2.0$ and $w_{FN} = 5.0$, assigning a higher operational cost to false negatives than to false positives.

These operational cost weights are used for evaluation and reporting, whereas reward penalties and sample-wise Bellman-loss weights are used during training to guide policy learning and parameter updates.
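Eqs. (13) to (17) can be checked with a short function. In the example call, FP = 29 and FN = 33 are the counts reported for the proposed model in Section 5.2; TP = 1467 and TN = 1471 are inferred here under the assumption that evaluation covers the 1500 attack and 1500 benign samples of the balanced subset.

```python
def evaluate(tp, tn, fp, fn, w_fp=2.0, w_fn=5.0):
    """Return the metrics of Eqs. (13)-(16) and the total cost of Eq. (17)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "total_cost": w_fp * fp + w_fn * fn,
    }


# TP and TN are inferred from 1500 samples per class (an assumption).
report = evaluate(tp=1467, tn=1471, fp=29, fn=33)
```

Under that assumption, the recall, F1-score, and total cost computed this way round to the values reported for the proposed model (0.9780, 0.9793, and 223).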

Furthermore, reinforcement-learning-specific metrics are monitored during training. These include episode total reward (derived from the cost-aware reward function), average training loss, exploration rate ($\epsilon$), and learning rate evolution. The reward and cost trajectories are analyzed to assess convergence behavior and optimization stability.

This combined evaluation strategy enables assessment not only of detection capability but also of operational reliability and cost-aware decision calibration.

5. Results and Operational Analysis

The proposed CDRL-QNN framework was evaluated not only against the original frozen-quantum reference configuration but also against additional classical baselines in order to provide a broader empirical context. In particular, the revised experiments include a classical DQN-based model and a Random Forest baseline alongside the hybrid RL-QNN variants. These comparisons were conducted on the same balanced subset setting to examine how the proposed framework behaves relative to both reinforcement learning and conventional supervised learning alternatives under controlled conditions.

Because the revised baseline set is broader than in the original version, the results are interpreted more cautiously. The purpose of the expanded comparison is not to claim universal superiority of the proposed model over all classical methods but to assess whether the integrated cost-aware hybrid RL-QNN design remains competitive within the evaluated setting.

To reduce dependence on a single training realization, the revised comparison also reports repeated-run statistics in the form of mean and standard deviation for the main performance and operational-cost metrics wherever repeated runs were available.

The final confusion matrices of the frozen-quantum reference model and the proposed CDRL-QNN framework are presented in Figure 2 and Figure 3, respectively.

5.1. Detection Performance Comparison

As shown in Figure 2 and Figure 3, the original frozen-reference and proposed configurations exhibit different false-negative behavior under the controlled balanced-subset setting. To broaden the empirical context beyond this pairwise comparison, an extended baseline analysis was also conducted with additional classical methods.

Table 2 summarizes the extended baseline comparison on the balanced 3000-sample setting.

As shown in Table 2, the proposed CDRL-QNN framework achieves competitive performance on the balanced 3000-sample setting, with an accuracy of 0.9750 and an F1-score of 0.9751. These results place the proposed model close to the Random Forest baseline in terms of classification performance while also preserving the methodological objective of integrating cost-aware reinforcement learning, chaos-driven exploration, and hybrid quantum value approximation within a unified decision framework. Nevertheless, the Random Forest baseline still achieves the lowest operational cost in this controlled comparison.

Importantly, the CDRL-QNN row in Table 2 corresponds to an additional alternative-seed validation run, intended to assess robustness against random initialization. It complements, but does not replace, the main controlled-comparison results reported throughout the manuscript (Recall = 0.9780, F1-score = 0.9793, Operational Cost = 223 in Section 5.2, Section 5.3 and Section 5.4, the Abstract, and the Conclusion), which were obtained under the original seed used in the frozen-reference comparison. The small numerical difference between the two seeds (F1: 0.9793 vs. 0.9751; Cost: 223 vs. 240) suggests that the proposed framework behaves consistently across random initializations within the evaluated balanced-subset setting, partially addressing concerns regarding single-run sensitivity in reinforcement learning. Therefore, the revised results should be interpreted as evidence of competitive methodological feasibility and seed-level stability rather than as a universal superiority claim over classical models.

To further examine whether the observed behavior is restricted to the original 3000-sample setting, an additional larger balanced-subset experiment was conducted using 10,000 samples (5000 benign and 5000 attack instances). In this available run, the proposed CDRL-QNN model showed stable convergence across training episodes and reached a final accuracy and F1-score of 0.9971 with an operational cost of 94.0 at Episode 500. This result provides supplementary evidence that the proposed framework remains feasible beyond the smallest controlled subset; however, it should still be interpreted cautiously, since the setting remains balanced and computationally constrained rather than reflecting full validation under realistic large-scale and imbalanced deployment conditions.

5.2. Operational Cost Reduction Analysis

A detailed confusion matrix analysis reveals that the baseline model produces 49 false negatives and 29 false positives. The proposed framework reduces false negatives to 33 while maintaining the same number of false positives (29). This comparison corresponds to the controlled reference evaluation between the frozen-reference QNN-DQN baseline and the proposed CDRL-QNN configuration.

Given the asymmetric cost weights used for operational evaluation, the total operational cost is computed as:

$$\text{Total Cost} = w_{FP} \cdot FP + w_{FN} \cdot FN. \tag{18}$$

Under the configured penalty weights, the final total cost of the proposed model is 223.00, with the lowest observed episode cost reaching 156.00 during training.
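As a quick check of Eq. (18), the evaluation weights $w_{FP} = 2.0$ and $w_{FN} = 5.0$ from Section 4.4 can be applied to the confusion-matrix counts reported above:

$$\text{Total Cost}_{\text{proposed}} = 2.0 \cdot 29 + 5.0 \cdot 33 = 58 + 165 = 223,$$

$$\text{Total Cost}_{\text{baseline}} = 2.0 \cdot 29 + 5.0 \cdot 49 = 58 + 245 = 303.$$

The baseline figure of 303 is not stated explicitly in the text; it is derived here from the reported counts. On that basis, the reduction in this reference run would be 80 cost units, or roughly 26% of the baseline cost, and it comes entirely from the false-negative term because the false-positive count is unchanged.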

Because the proposed and baseline configurations differ in whether the quantum layer is trainable, these improvements should be interpreted cautiously. In the present study, they provide evidence that the integrated hybrid design can achieve improved policy calibration under the evaluated setting, rather than proving that the gains arise solely from the cost-aware component in isolation.

5.3. Reward Convergence and Stability Analysis

The cumulative episode reward increased from 10,141.00 in the baseline configuration to 11,391.00 in the proposed framework in the illustrated reference run. Since both models use identical exploration schedules and chaos-driven perturbation mechanisms, this difference is consistent with improved cost-sensitive policy adaptation in that run. However, such trajectory-level evidence should be interpreted together with the repeated-run summary statistics reported elsewhere in the revised results, rather than as a standalone proof of universal training stability.

Accordingly, the reward trajectory is used here as an illustrative example of learning behavior under the evaluated setting, while broader conclusions are based more cautiously on the repeated-run metric summaries currently available.

5.4. Trade-Off Analysis Between False Positives and False Negatives

In intrusion detection systems, the trade-off between false positives (FP) and false negatives (FN) directly impacts operational reliability. Excessive false positives may trigger unnecessary mitigation mechanisms, while false negatives allow malicious traffic to persist.

The proposed framework improves recall by reducing false negatives (49 → 33) without increasing false positives (29 → 29). This indicates that the decision boundary is recalibrated rather than shifted toward over-detection.

The ability to improve detection sensitivity while preserving false alarm stability supports the core objective of the CDRL-QNN framework: operational cost-aware decision calibration rather than purely accuracy-driven optimization.

6. Discussion

The experimental results indicate that the integrated CDRL-QNN design can improve operational calibration under asymmetric risk within the evaluated setting. In the frozen-reference baseline, the quantum layer is kept fixed, and only the classical layers are optimized. In contrast, the proposed configuration updates the hybrid QNN-DQN value function, including the quantum circuit parameters, while incorporating cost-aware reward shaping and sample-wise weighted Bellman optimization. Therefore, the observed gains should be interpreted as the outcome of the integrated trainable hybrid design rather than as an isolated effect of cost-aware optimization alone.

The reduction in false negatives (49 → 33) without an increase in false positives indicates improved decision calibration under asymmetric risk conditions in the evaluated setting. However, because the frozen-reference baseline and the proposed model do not share identical trainability conditions in the quantum layer, the present results should not be interpreted as a clean isolation of the cost-aware component alone. Rather, they suggest that the integrated hybrid design combining trainable quantum value approximation, cost-aware reinforcement learning, and chaos-driven exploration can yield operationally meaningful behavior under constrained experimental conditions.

The chaos-driven exploration mechanism is employed in both configurations, ensuring that exploration dynamics remain consistent across the compared settings. Therefore, the observed differences should not be attributed to exploration variability alone. Instead, they should be interpreted more cautiously as evidence of integrated framework behavior under differing trainability conditions.

From a quantum modeling perspective, the results suggest that a compact variational circuit can be integrated as a trainable value approximator within reinforcement learning pipelines. However, the present study does not claim quantum advantage in a computational complexity sense. Rather, it demonstrates that hybrid quantum representations can be integrated into cost-sensitive reinforcement learning frameworks to achieve stable and practically meaningful improvements.

In the expanded baseline comparison, the proposed method remained competitive with the classical DQN baseline, whereas the Random Forest baseline achieved stronger predictive performance on the controlled balanced subset. Therefore, the present findings should be interpreted as evidence of methodological integration and cost-aware hybrid design feasibility, rather than as proof that the proposed framework universally outperforms simpler classical alternatives. At the same time, the revised evaluation includes repeated-run summary statistics for the classical baselines (Classical DQN and Random Forest) and an additional alternative-seed validation run for the proposed CDRL-QNN model in Table 2, which together provide a more reliable basis for interpretation than a single training trajectory alone. Nevertheless, broader repeated-run analysis of the proposed model across all ablation and realism settings remains incomplete and should be regarded as an important direction for future work.

Several limitations should be acknowledged. The main limitation of the present study is practical rather than conceptual, arising from the computational burden of repeatedly simulating variational quantum circuits on classical hardware during continuous reinforcement learning updates. For this reason, the core experiments were conducted on a balanced subset of 3000 samples in order to provide a controlled and computationally feasible evaluation setting. Accordingly, the current results should be interpreted as evidence of methodological feasibility and cost-aware decision calibration under constrained experimental conditions, rather than as deployment-scale validation on the full dataset. In addition, quantum noise effects were not modeled because the experiments were performed on an ideal simulator. As reported in Section 5.1, the additional 10,000-sample balanced-subset experiment indicates that the observed behavior is not restricted to the smallest controlled setting alone, but it still reflects a balanced and computationally constrained evaluation rather than full validation under realistic large-scale and imbalanced deployment conditions. In addition, although the revised manuscript clarifies the distinction between the trainable proposed model and the frozen-reference baseline, a fully matched trainable ablation without the cost-aware component was not completed as a final replicated experiment in the present revision cycle. Therefore, the current findings should be interpreted as evidence of integrated framework behavior rather than as a perfectly isolated attribution of gains to cost-aware learning alone.

In addition, although the revised experiments broaden the baseline set beyond the original frozen-quantum reference model, they do not yet include a modern LSTM-based sequence model. Therefore, the present comparison should be interpreted as substantially expanded, but not exhaustive.

Another limitation is that the chaos mechanism was implemented using only the logistic map. Alternative chaotic generators were not compared in the current study. Therefore, the present findings should be interpreted as evidence for the feasibility of chaos-driven exploration within the proposed framework, rather than as proof that the logistic map is the optimal chaotic choice for IoT intrusion detection. In addition, the logistic-map control parameter was fixed at r = 3.9 in the present experiments. Thus, the current findings should not be interpreted as evidence that this value is optimal, since a broader sensitivity analysis over alternative r settings was beyond the scope of the present study.
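To make the role of the control parameter concrete, the following minimal sketch iterates the standard logistic map x_{n+1} = r x_n (1 - x_n). It only generates the sequence; how the values are injected into the agent is defined earlier in the paper and is not reproduced here. The function name `logistic_map`, the initial value 0.2, and the comparison setting r = 2.5 are illustrative assumptions.

```python
def logistic_map(x0, r=3.9, steps=5):
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and return the visited values."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    values = []
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


if __name__ == "__main__":
    # r = 3.9 is the value fixed in the experiments (chaotic regime).
    print(logistic_map(0.2, r=3.9, steps=5))
    # Illustrative contrast: r = 2.5 settles on the fixed point 1 - 1/r = 0.6.
    print(logistic_map(0.2, r=2.5, steps=50)[-1])
```

The contrast between the two calls shows why a sensitivity analysis over r matters: for small r the sequence collapses to a fixed point and carries no exploratory variation, whereas at r = 3.9 it stays bounded in (0, 1) without settling.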

In addition, the present study does not validate direct deployment at the scale of millions of IoT devices. The experiments were conducted in an offline, flow-based setting and should not be interpreted as requiring one quantum-learning process per device. In a large-scale operational scenario, the proposed framework would more realistically function at the level of processed traffic flows, gateway observations, or edge-level aggregation rather than as an independent model instantiated for each device. Therefore, a million-device operation should be interpreted as a scalability consideration and future engineering direction, not as a capability validated by the current experiments.

Finally, the asymmetric cost coefficients used in both reward design and sample-wise Bellman weighting were fixed in the present experiments. These values were introduced as application-dependent risk parameters for the considered IoT security setting, not as universally optimal constants. Therefore, the current findings should not be interpreted as evidence that the selected cost values are optimal, since a broader sensitivity analysis over alternative cost settings was beyond the scope of the present study. Future work may explore end-to-end quantum parameter optimization, noise-aware simulations, broader sensitivity analyses, and additional ablation studies to isolate the individual contributions of cost-aware learning and chaos-driven exploration.
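As a rough illustration of how fixed asymmetric coefficients can enter both the reward and the Bellman loss, consider the sketch below. It is not the authors' implementation: the coefficient values `C_FP` and `C_FN`, the reward of 1.0 for a correct decision, the per-sample weighting rule, and the discount factor are all hypothetical choices made for this example.

```python
# Hypothetical coefficients; the actual values are not reported in this section.
C_FP = 1.0   # assumed penalty for a false alarm
C_FN = 5.0   # assumed penalty for a missed attack


def cost_aware_reward(action, label, c_fp=C_FP, c_fn=C_FN):
    """Reward for one decision (1 = attack, 0 = benign)."""
    if action == label:
        return 1.0                       # assumed reward for a correct decision
    return -c_fn if label == 1 else -c_fp


def sample_weight(label, c_fp=C_FP, c_fn=C_FN):
    """Assumed weighting rule: attack samples weigh c_fn, benign samples c_fp."""
    return c_fn if label == 1 else c_fp


def td_target(reward, next_q_values, gamma=0.99):
    """Standard one-step DQN target: r + gamma * max_a' Q_target(s', a')."""
    return reward + gamma * max(next_q_values)


def weighted_td_loss(q_values, targets, weights):
    """Mean of w_i * (y_i - Q_i)^2 over the batch (normalised by batch size)."""
    total = sum(w * (y - q) ** 2 for q, y, w in zip(q_values, targets, weights))
    return total / len(q_values)


if __name__ == "__main__":
    labels = [0, 1]
    weights = [sample_weight(y) for y in labels]
    print(cost_aware_reward(action=0, label=1))               # missed attack
    print(weighted_td_loss([0.0, 0.0], [1.0, 1.0], weights))  # weighted error
```

With these assumed values, an equal TD error on an attack sample contributes five times as much to the loss as the same error on a benign sample. Re-running such a sketch with other coefficient ratios is one simple way to frame the sensitivity analysis mentioned above.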

7. Conclusions

This study introduced CDRL-QNN, a cost-aware hybrid quantum–classical reinforcement learning framework for IoT intrusion detection. The proposed approach integrates asymmetric operational penalties through both reward shaping and sample-wise weighted Bellman optimization while employing a compact variational quantum circuit as a nonlinear function approximator within a DQN architecture.

Experimental results on a balanced subset of the CIC-DDoS2019 dataset indicated that the integrated design, which embeds cost-aware learning into the reinforcement objective, improved decision calibration under asymmetric risk conditions. Compared with a baseline configuration in which the quantum layer remains frozen and only the classical layers are optimized, the proposed framework reduced false negatives from 49 to 33 without increasing false positives. This led to measurable improvements in recall (0.9673 → 0.9780), F1-score (0.9738 → 0.9793), and cumulative episode reward, while lowering total operational cost.

Because the proposed and reference configurations do not isolate all architectural and trainability factors in exactly the same way, the present findings should not be interpreted as proof that the observed gains arise solely from the cost-aware component in isolation. Rather, they indicate that the integrated CDRL-QNN design can support improved decision calibration under asymmetric operational risk within the constrained experimental setting considered in this study.

The results do not claim quantum computational advantage; rather, they suggest that hybrid quantum representations can be integrated into cost-sensitive reinforcement learning pipelines in a methodologically meaningful way under the constrained experimental setting considered in this study. The 4-qubit, 2-layer variational circuit provides a compact representation compatible with near-term quantum hardware constraints.

Future research may explore end-to-end training of quantum parameters, ablation studies isolating chaos-driven exploration effects, evaluation on imbalanced real-world traffic distributions, and noise-aware simulations to assess hardware feasibility. Extending the framework to larger datasets and multi-class intrusion scenarios also remains an important direction.

Overall, CDRL-QNN highlights the practical value of embedding operational cost modeling within hybrid reinforcement learning architectures for secure and stable IoT intrusion detection.

Author Contributions

Conceptualization, M.Y.K.; Methodology, M.Y.K., F.A. and C.B.; Software, M.Y.K.; Validation, F.A.; Resources, M.Y.K. and F.A.; Data curation, M.Y.K. and F.A.; Writing—original draft, M.Y.K. and F.A.; Writing—review and editing, M.Y.K., F.A. and C.B.; Visualization, M.Y.K.; Supervision, C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experiments in this study were conducted using the publicly available CIC-DDoS2019 dataset developed by the Canadian Institute for Cybersecurity (CIC). The dataset contains raw network traffic captures (PCAP files) and labeled flow-based features extracted via CICFlowMeter-V3. It is accessible at https://www.unb.ca/cic/datasets/ddos-2019.html (accessed on 10 September 2025). No new data were generated during this study. All data preprocessing, feature selection, normalization procedures, and model configurations are thoroughly described in the manuscript to ensure transparency and reproducibility. The dataset is distributed under a license requiring citation of the original dataset publication [23].

Acknowledgments

The authors sincerely thank the editor and the anonymous reviewers for their valuable and helpful comments, which improved the manuscript. M.Y.K. was personally supported during his doctoral studies by the TUBITAK 2211/C National PhD Scholarship Program in the Priority Fields in Science and Technology; this scholarship provided general PhD support and did not fund this specific research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Neshenko, N.; Bou-Harb, E.; Crichigno, J.; Kaddoum, G.; Ghani, N. Demystifying IoT security: An exhaustive survey on IoT vulnerabilities and a first empirical look on Internet-scale IoT exploitations. IEEE Commun. Surv. Tutor. 2019, 21, 2702–2733.
2. Anand, P.; Singh, Y.; Selwal, A.; Singh, P.K.; Felseghi, R.A.; Raboaca, M.S. IoVT: Internet of vulnerable things? threat architecture, attack surfaces, and vulnerabilities in internet of things and its applications towards smart grids. Energies 2020, 13, 4813.
3. Gelgi, M.; Guan, Y.; Arunachala, S.; Samba Siva Rao, M.; Dragoni, N. Systematic literature review of IoT botnet DDOS attacks and evaluation of detection techniques. Sensors 2024, 24, 3571.
4. Alhalabi, W.; Gaurav, A.; Arya, V.; Zamzami, I.F.; Aboalela, R.A. Machine learning-based distributed denial of services (DDoS) attack detection in intelligent information systems. Int. J. Semant. Web Inf. Syst. (IJSWIS) 2023, 19, 1–17.
5. Kabanda, R.; Byera, B.; Emeka, H.; Mohiuddin, K.T. The history, trend, types, and mitigation of distributed denial of service attacks. J. Inf. Secur. 2023, 14, 464–471.
6. Praptodiyono, S.; Firmansyah, T.; Anwar, M.H.; Wicaksana, C.A.; Pramudyo, A.S.; Al-Allawee, A. Development of hybrid intrusion detection system based on Suricata with pfSense method for high reduction of DDoS attacks on IPv6 networks. East.-Eur. J. Enterp. Technol. 2023, 125, 61–70.
7. Shiju, S.; Nair, S.S. False Alert Mitigation in Intrusion Detection System: A Deep Learning Approach. In Proceedings of the 2025 4th International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS), Ernakulam, India, 11–13 June 2025; IEEE: New York, NY, USA, 2025; pp. 803–809.
8. Hashmi, A.; Barukab, O.M.; Hamza Osman, A. A hybrid feature weighted attention based deep learning approach for an intrusion detection system using the random forest algorithm. PLoS ONE 2024, 19, e0302294.
9. Prasath, S.; Sethi, K.; Mohanty, D.; Bera, P.; Samantaray, S.R. Analysis of continual learning models for intrusion detection system. IEEE Access 2022, 10, 121444–121464.
10. Hossain, M.A. Deep Q-learning intrusion detection system (DQ-IDS): A novel reinforcement learning approach for adaptive and self-learning cybersecurity. ICT Express 2025, 11, 875–880.
11. Shen, S.; Cai, C.; Li, Z.; Shen, Y.; Wu, G.; Yu, S. Deep Q-network-based heuristic intrusion detection against edge-based SIoT zero-day attacks. Appl. Soft Comput. 2024, 150, 111080.
12. Vanderschueren, T.; Verdonck, T.; Baesens, B.; Verbeke, W. Predict-then-optimize or predict-and-optimize? An empirical evaluation of cost-sensitive learning strategies. Inf. Sci. 2022, 594, 400–415.
13. Sayin, B.; Zoppi, T.; Marchini, N.; Khokhar, F.A.; Passerini, A. Bringing Machine Learning Classifiers Into Critical Cyber-Physical Systems: A Matter of Design. IEEE Access 2025, 13, 94858–94877.
14. Yang, M.; Bi, X. Cost-Aware Calibration of Classifiers. INFORMS J. Data Sci. 2025, 4, 101–113.
15. Küçükkara, M.Y.; Atban, F.; Bayılmış, C. Quantum-Neural Network Model for Platform Independent DDoS Attack Classification in Cyber Security. Adv. Quantum Technol. 2024, 7, 2400084.
16. Atban, F.; Küçükkara, M.Y.; Bayılmış, C. Enhancing variational quantum classifier performance with meta-heuristic feature selection for credit card fraud detection. Eur. Phys. J. Spec. Top. 2025, 234, 3705–3718.
17. Fontana, E.; Herman, D.; Chakrabarti, S.; Kumar, N.; Yalovetzky, R.; Heredge, J.; Sureshbabu, S.H.; Pistoia, M. Characterizing barren plateaus in quantum ansätze with the adjoint representation. Nat. Commun. 2024, 15, 7171.
18. Ragone, M.; Bakalov, B.N.; Sauvage, F.; Kemper, A.F.; Ortiz Marrero, C.; Larocca, M.; Cerezo, M. A lie algebraic theory of barren plateaus for deep parameterized quantum circuits. Nat. Commun. 2024, 15, 7172.
19. Letcher, A.; Woerner, S.; Zoufal, C. Tight and efficient gradient bounds for parameterized quantum circuits. Quantum 2024, 8, 1484.
20. Li, Y.; Li, L.; Lian, Z.; Zhou, K.; Dai, Y. A quasi-opposition learning and chaos local search based on walrus optimization for global optimization problems. Sci. Rep. 2025, 15, 2881.
21. Matsuki, T.; Sakemi, Y.; Aihara, K. Chaos-based reinforcement learning with TD3. arXiv 2024, arXiv:2405.09086.
22. Kaewdornhan, N.; Chatthaworn, R. Improved Deep Reinforcement Learning with Logistic Map for Microgrid Energy Management. E3S Web Conf. 2025, 629, 06001.
23. Sharafaldin, I.; Lashkari, A.H.; Hakak, S.; Ghorbani, A.A. Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy. In Proceedings of the 2019 International Carnahan Conference on Security Technology (ICCST), Chennai, India, 1–3 October 2019; IEEE: New York, NY, USA, 2019; pp. 1–8.
24. Hizal, S.; Cavusoglu, U.; Akgun, D. A novel deep learning-based intrusion detection system for IoT DDoS security. Internet Things 2024, 28, 101336.
25. Almaraz-Rivera, J.G.; Perez-Diaz, J.A.; Cantoral-Ceballos, J.A. Transport and application layer DDoS attacks detection to IoT devices by using machine learning and deep learning models. Sensors 2022, 22, 3367.
26. Abulhassan, A.; Rashid, I.; Imam, M.; Binbeshr, F. DDoS attack detection in IoT: A comparative resource and performance analysis of deep learning and machine learning models. IEEE Access 2025, 13, 116529–116547.
27. Sajid, M.; Malik, K.R.; Almogren, A.; Malik, T.S.; Khan, A.H.; Tanveer, J.; Rehman, A.U. Enhancing intrusion detection: A hybrid machine and deep learning approach. J. Cloud Comput. 2024, 13, 123.
28. Wahab, S.A.; Sultana, S.; Tariq, N.; Mujahid, M.; Khan, J.A.; Mylonas, A. A multi-class intrusion detection system for DDoS attacks in IoT networks using deep learning and transformers. Sensors 2025, 25, 4845.
29. Ain, N.U.; Sardaraz, M.; Tahir, M.; Abo Elsoud, M.W.; Alourani, A. Securing IoT networks against DDoS attacks: A hybrid deep learning approach. Sensors 2025, 25, 1346.
30. Yang, E.; Jeong, S.; Seo, C. Harnessing feature pruning with optimal deep learning based DDoS cyberattack detection on IoT environment. Sci. Rep. 2025, 15, 17516.
31. Rahmati, M.; Pagano, A. Federated Learning-Driven Cybersecurity Framework for IoT Networks with Privacy Preserving and Real-Time Threat Detection Capabilities. Informatics 2025, 12, 62.
32. Sorour, S.E.; Aljaafari, M.; Shaker, A.M.; Amin, A.E. LSTM-JSO framework for privacy preserving adaptive intrusion detection in federated IoT networks. Sci. Rep. 2025, 15, 11321.
33. Begum, K.; Mozumder, M.A.I.; Joo, M.I.; Kim, H.C. BFLIDS: Blockchain-driven federated learning for intrusion detection in IoMT networks. Sensors 2024, 24, 4591.
34. Nguyen, T.T.; Reddi, V.J. Deep reinforcement learning for cyber security. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3779–3795.
35. Satpathy, S.; Tripathy, U.; Swain, P.K. Cloud-based DDoS detection using hybrid feature selection with deep reinforcement learning (DRL). Sci. Rep. 2025, 15, 36546.
36. Al-Fawa’reh, M.; Abu-Khalaf, J.; Szewczyk, P.; Kang, J.J. MalBoT-DRL: Malware botnet detection using deep reinforcement learning in IoT networks. IEEE Internet Things J. 2023, 11, 9610–9629.
37. Suresh, A.; Cyril Jose, A. Adaptive network intrusion detection using reinforcement learning with proximal policy optimization. ACM Trans. Priv. Secur. 2025, 28, 1–24.
38. Kanimozhi, R.; Ramesh, P. Deep reinforcement learning-based intrusion detection scheme for software-defined networking. Sci. Rep. 2025, 15, 38827.
39. Sethi, K.; Sai Rupesh, E.; Kumar, R.; Bera, P.; Venu Madhav, Y. A context-aware robust intrusion detection system: A reinforcement learning-based approach. Int. J. Inf. Secur. 2020, 19, 657–678.
40. Jayakrishna, N.; Prasanth, N.N. A hybrid deep learning model for detection and mitigation of DDoS attacks in VANETs. Sci. Rep. 2025, 15, 34170.
41. Prasad, A.; Mohammad Alenazy, W.; Ahmad, N.; Ali, G.; Abdallah, H.A.; Ahmad, S. Optimizing IoT intrusion detection with cosine similarity based dataset balancing and hybrid deep learning. Sci. Rep. 2025, 15, 30939.
42. Zahoora, U.; Khan, A.; Rajarajan, M.; Khan, S.H.; Asam, M.; Jamal, T. Ransomware detection using deep learning based unsupervised feature extraction and a cost sensitive Pareto Ensemble classifier. Sci. Rep. 2022, 12, 15647.
43. Nissar, N.; Naja, N.; Jamali, A. Cost-Sensitive Detection of DoS Attacks in Automotive Cybersecurity Using Artificial Neural Networks and CatBoost. J. Netw. Syst. Manag. 2025, 33, 28.
44. Nayak, G.S.; Muniyal, B.; Belavagi, M.C. Enhancing phishing detection: A machine learning approach with feature selection and deep learning models. IEEE Access 2025, 13, 33308–33320.
45. Kim, T.H.; Madhavi, S. Quantum intrusion detection system using outlier analysis. Sci. Rep. 2024, 14, 27114.
46. Kukliansky, A.; Orescanin, M.; Bollmann, C.; Huffmire, T. Network anomaly detection using quantum neural networks on noisy quantum computers. IEEE Trans. Quantum Eng. 2024, 5, 3100611.
47. Kadi, A.; Selamnia, A.; Abou El Houda, Z.; Moudoud, H.; Brik, B.; Khoukhi, L. An In-Depth Comparative Study of Quantum-Classical Encoding Methods for Network Intrusion Detection. IEEE Open J. Commun. Soc. 2025, 6, 1129–1148.
48. Nalayini, C.; Soumya, T.; Lalitha, S.; Tamijetchelvy, R. A novel adaptive transformer based quantum intrusion detection system for software defined networks. Sci. Rep. 2025, 15, 36505.
49. McClean, J.R.; Boixo, S.; Smelyanskiy, V.N.; Babbush, R.; Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 2018, 9, 4812.
50. Cerezo, M.; Sone, A.; Volkoff, T.; Cincio, L.; Coles, P.J. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat. Commun. 2021, 12, 1791.
51. Dowling, N.; West, M.T.; Southwell, A.; Nakhl, A.C.; Sevior, M.; Usman, M.; Modi, K. Adversarial robustness guarantees for quantum classifiers. npj Quantum Inf. 2026, 12, 16.
52. Naruse, M.; Mihana, T.; Hori, H.; Saigo, H.; Okamura, K.; Hasegawa, M.; Uchida, A. Scalable photonic reinforcement learning by time-division multiplexing of laser chaos. Sci. Rep. 2018, 8, 10890.
53. Bilban, M.; İnan, O. Optimizing Autonomous Vehicle Performance Using Improved Proximal Policy Optimization. Sensors 2025, 25, 1941.
54. Li, W.; Lu, Z.D.; Deng, D.L. Quantum neural network classifiers: A tutorial. SciPost Phys. Lect. Notes 2022, 61.
55. Zhou, M.G.; Liu, Z.P.; Yin, H.L.; Li, C.L.; Xu, T.K.; Chen, Z.B. Quantum neural network for quantum neural computing. Research 2023, 6, 134.
56. Devadas, R.M.; Sowmya, T. Quantum machine learning: A comprehensive review of integrating AI with quantum computing for computational advancements. MethodsX 2025, 14, 103318.
57. Gong, L.H.; Pei, J.J.; Zhang, T.F.; Zhou, N.R. Quantum convolutional neural network based on variational quantum circuits. Opt. Commun. 2024, 550, 129993.
58. Shakya, A.K.; Pillai, G.; Chakrabarty, S. Reinforcement learning algorithms: A brief survey. Expert Syst. Appl. 2023, 231, 120495.
59. Ladosz, P.; Weng, L.; Kim, M.; Oh, H. Exploration in deep reinforcement learning: A survey. Inf. Fusion 2022, 85, 1–22.
60. Fotopoulos, G.B.; Popovich, P.; Papadopoulos, N.H. Review non-convex optimization method for machine learning. arXiv 2024, arXiv:2410.02017.
61. Wang, S.; Yang, R.; Li, B.; Kan, Z. Structural parameter space exploration for reinforcement learning via a matrix variate distribution. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 7, 1025–1035.

Figure 1. Architecture of the CDRL-QNN framework integrating quantum value approximation, cost-aware reinforcement learning, and chaos-driven exploration for IoT intrusion detection.

Figure 2. Final confusion matrix of the baseline QNN-DQN model at Episode 800. The model achieves high overall accuracy but exhibits a higher number of false negatives compared to the proposed cost-aware framework.

Figure 3. Final confusion matrix of the proposed CDRL-QNN model at Episode 800. The cost-aware and chaos-driven optimization reduces false negatives while maintaining the same false positive count, resulting in improved recall and F1-score.

Table 1. Comparative analysis of related works and the proposed CDRL-QNN method, highlighting methodological focus, optimization strategy, cost awareness, and key limitations in relation to quantum-enhanced intrusion detection systems.

Study	Methodology	Optimization Focus	Cost-Sensitivity	Key Limitation Regarding This Study
Wahab et al. [28]	CNN + Transformers	Class balancing (SMOTE)	Indirect (data-level)	High computational cost; static training.
Satpathy et al. [35]	Actor-Critic DRL	Reward shaping	Yes (reward-based)	Focuses on cloud, not edge/IoT QNN stability.
Suresh & Jose [37]	PPO-RL Ensemble	Dynamic weighting	Indirect	Classical ML ensemble; no quantum component.
Nissar et al. [43]	ANN + CatBoost	Hyperparameter tuning (Optuna)	Explicit (cost learning)	Static model; lacks adaptive RL exploration.
Nalayini et al. [48]	Quantum-inspired IDS	Evolutionary selection	No (focus on overhead)	SDN-focused; uses evolutionary algorithm instead of chaos.
Kukliansky et al. [46]	QNN on noisy HW	Noise robustness	No	Focuses on hardware noise, not algorithmic stability.
Matsuki et al. [21]	Chaos-based RL (TD3)	Chaos exploration	No	General control task; not applied to IDS or QNNs.
Proposed CDRL-QNN	QNN + Chaos + RL	Chaos-driven stability	Explicit (FP/FN costs)	Integrates cost-aware RL, QNN-based value approximation, and chaos-driven exploration within a unified framework.

Table 2. Comparison of the proposed method with additional classical baselines on the balanced 3000-sample setting. Classical baselines (DQN, Random Forest) are reported as mean ± standard deviation across repeated runs. The CDRL-QNN row corresponds to an additional alternative-seed validation run, complementing the controlled frozen-reference comparison reported in Section 5.2 and Section 5.4.

Model	Accuracy	F1-Score	Operational Cost
Classical DQN	0.9269 ± 0.0115	0.9271 ± 0.0117	754.0 ± 128.7
Random Forest	0.9825 ± 0.0035	0.9823 ± 0.0036	51.0 ± 8.5
Proposed CDRL-QNN (alternative seed)	0.9750	0.9751	240.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

