1. Introduction
Reinforcement learning (RL) provides a principled framework for sequential decision-making in which an agent learns, through trial-and-error interaction with an environment, to maximize long-term return [1]. In recent years, the intersection of RL and quantum computing has attracted growing interest, driven by the possibility that quantum information processing may introduce new computational primitives for learning and control [2,3]. This interest has been further amplified by rapid progress in quantum hardware, including landmark demonstrations of programmable superconducting processors [4] and mounting evidence that useful quantum computations may be feasible prior to full fault tolerance [5]. Nevertheless, today’s landscape is still dominated by noisy intermediate-scale quantum (NISQ) devices, where noise and imperfect operations can substantially degrade the performance of quantum algorithms and learning systems.
Variational quantum circuits (VQCs) have emerged as a widely adopted paradigm for NISQ-era learning because they can be optimized in hybrid quantum–classical loops while keeping circuit depth relatively modest. VQCs have been incorporated into deep reinforcement learning in multiple ways, including parameterized quantum policies [6] and value-function approximators for Q-learning style agents [7]. Empirical studies suggest that VQC-based deep RL can be effective across standard control benchmarks [8,9]. At the same time, VQC-based agents are often sensitive to circuit architecture: properties such as expressibility and entangling capability can shape both the optimization landscape and the effective hypothesis class available to the learner [10]. This sensitivity becomes particularly consequential in the presence of noise, where seemingly minor architectural choices may translate into large differences in learning stability.
Noise remains a central obstacle to deploying quantum reinforcement learning (QRL) on NISQ hardware. Depolarizing and other realistic noise processes can distort quantum states, blur measurement statistics, and destabilize iterative learning updates. From a simulation perspective, depolarizing noise is also known to qualitatively alter circuit behavior and can erase computational structure when the noise level is sufficiently large [11]. Moreover, thermal-like noise models further highlight the difficulty of maintaining coherent learning signals at the qubit level [12]. These challenges motivate a shift from treating noise robustness as an afterthought to considering it a first-class design objective.
In this work, we take a structural approach to robustness. Rather than relying on external error mitigation, we redesign the VQC used in quantum deep Q-learning (QDQN) by incorporating architectural motifs inspired by quantum convolutional neural networks (QCNNs). QCNNs were originally proposed as a quantum analogue of convolutional processing and hierarchical feature extraction [13]. Subsequent analyses suggest that QCNN-style designs can alleviate barren-plateau-like optimization pathologies in certain regimes [14]. Related convolution-inspired quantum models (e.g., quanvolution) further indicate that structured local processing can be beneficial for learning tasks [15]. Meanwhile, quantum convolutional and hybrid quantum–classical convolutional architectures have been explored for classical data processing and image recognition [16,17,18]. Motivated by this line of work, we integrate QCNN-inspired two-qubit building blocks into the VQC of QDQN and, in some variants, introduce a fully connected quantum layer to enhance global information mixing, drawing inspiration from dense connectivity in classical deep networks [19].
We evaluate the resulting QDQN variants on the CartPole-v1 task from OpenAI Gym [20] under controlled depolarizing noise. To improve the reliability of comparisons under RL stochasticity, we adopt repeated-run evaluation guided by principles similar to k-fold validation [21]. We report mean episode-to-threshold metrics, complemented by a boundary-style test that estimates the maximum tolerable noise level. Our empirical results show that the QCNN-inspired structure can improve noise robustness; however, the gains depend strongly on circuit-level parameterization, underscoring the need for principled architecture design and rigorous evaluation.
Contributions
We propose QCNN-inspired VQC architectures for QDQN, combining structured two-qubit motifs with enhanced qubit connectivity via a fully connected quantum layer.
We conduct controlled robustness evaluations under depolarizing noise on a standard control benchmark, reporting both sample-efficiency metrics and empirical noise-tolerance boundaries.
We employ repeated-run evaluation to reduce variance and improve the reliability of model comparisons.
2. Related Work
The motivation for quantum machine learning and quantum reinforcement learning (QRL) traces back to foundational ideas in quantum computation and simulation, including early arguments for simulating physics with quantum systems [22] and the broader development of programmable quantum hardware [4]. Recent experimental progress has strengthened the case for near-term utility prior to full fault tolerance [5], which in turn motivates algorithmic approaches compatible with noisy intermediate-scale quantum (NISQ) constraints. Variational methods fit this requirement because they enable hybrid quantum–classical optimization while keeping circuit depth relatively modest, effectively trading depth for trainable parameterizations.
Early formulations of QRL examined how quantum systems can support learning and control, including models that explicitly incorporate quantum dynamics [2]. Subsequent work developed broader frameworks for quantum-enhanced learning, highlighting potential speedups and conceptual advantages in agent-based settings [3]. More recent studies have analyzed hybrid agents and quantum-accessible RL settings [23] and investigated robust control objectives in partially observed QRL scenarios [24]. Together, these works motivate continued efforts toward practical QRL implementations under realistic hardware constraints.
Variational quantum circuits (VQCs) have become a common building block for deep QRL because they can serve as compact, trainable function approximators. Parameterized quantum policy models have been proposed and evaluated on benchmark tasks [6]. For value-based methods, Skolik et al. introduced QDQN-style agents in Gym environments using variational circuits as Q-function approximators [7]. Additional studies reported feasibility and scalability improvements for VQC-based deep RL [8,9]. A recurring theme is that circuit architecture critically impacts trainability and performance: circuit properties such as expressibility and entangling capability correlate with learning outcomes [10] and may interact strongly with noise.
Quantum convolutional neural networks (QCNNs) were proposed as a quantum analogue of convolutional processing, enabling hierarchical feature extraction via structured circuit blocks [13]. Follow-up analyses suggest that QCNN-style architectures can avoid barren plateaus under certain conditions, thereby improving optimization behavior [14]. Beyond QCNNs, quanvolutional approaches have been explored for image recognition, indicating that convolution-like quantum processing can be useful even within classical data pipelines [15]. Related quantum convolutional and hybrid quantum–classical convolutional models have also been studied for classical data classification and image recognition [16,17,18]. These results provide architectural motivation for transferring QCNN-inspired motifs to QRL settings, where improved inductive bias and trainability may translate into robustness gains.
Noise remains a central obstacle for NISQ learning systems. Depolarizing noise provides a widely used abstraction and has been studied in the classical simulation of noisy circuits, revealing how increasing local depolarization can rapidly suppress useful signal [11]. Thermal-like noise models further emphasize the fragility of qubit-level information under realistic conditions [12]. In addition to algorithmic robustness, empirical evaluation in RL requires careful treatment of stochasticity. While classical k-fold cross-validation is a standard tool for obtaining reliable performance estimates in supervised learning [21], RL studies often rely on repeated runs with different random seeds; reporting repeated-run averages aligned with cross-validation principles can improve result stability and comparability. Finally, our experimental environment is based on standard Gym benchmarks [20], enabling consistent comparison with prior deep RL and quantum RL studies.
Relative to prior QDQN studies [7,8,9], our work focuses on architectural robustness under noise by integrating QCNN-inspired motifs [13,14] and enhanced connectivity motivated by dense architectures in classical deep learning [19,25]. We complement this design with repeated-run evaluation for robust comparison and a boundary-style test that quantifies noise tolerance.
Physical qubit platforms. While our study is platform-agnostic and focuses on circuit-level robustness under a controlled noise model, it is worth noting that NISQ devices can be realized using different physical qubit technologies, such as superconducting (charge/flux/transmon) qubits, spin qubits in semiconductor quantum dots, and photonic qubits. These platforms differ in coherence properties, native gate sets, and dominant noise mechanisms, which can influence practical deployments.
3. Method
Figure 1 summarizes our method. We retain the standard DQN training loop (replay buffer, target network, and ε-greedy exploration) and modify only the VQC-based Q-function approximator. Our contribution is an architecture-level redesign of the VQC using QCNN-inspired two-qubit motifs and, in some variants, an additional fully connected quantum layer. Robustness is evaluated by injecting controlled depolarizing noise during circuit execution.
We aim to improve the noise robustness of quantum deep Q-learning (QDQN) by redesigning the variational quantum circuit (VQC) that serves as the Q-function approximator. Our approach keeps the overall QDQN learning framework unchanged, but replaces the baseline VQC with QCNN-inspired circuit motifs and (for some variants) an explicit fully connected quantum layer. The resulting agents are denoted as Model A–D.
We follow the standard DQN paradigm and adopt a VQC to approximate action-values, consistent with recent QDQN-style quantum agents. Given a state s, the VQC produces a vector of Q-values Q(s, ·; θ), where θ denotes the trainable circuit parameters. Training minimizes the temporal-difference (TD) error over mini-batches sampled from an experience replay buffer:

y = r + γ max_{a'} Q(s', a'; θ⁻),  (1)
L(θ) = E_{(s,a,r,s')} [ (y − Q(s, a; θ))² ],  (2)

where γ is the discount factor and θ⁻ parameterizes a target network updated periodically (or via soft updates). During training, actions are selected using an ε-greedy policy.
Equations (1) and (2) implement the standard temporal-difference (TD) learning objective used in DQN-style methods. Specifically, given a transition (s, a, r, s') sampled from the replay buffer, the TD target is defined as y = r + γ max_{a'} Q(s', a'; θ⁻), where γ is the discount factor and θ⁻ denotes the target network that is periodically synchronized to stabilize learning. The online Q-network produces Q(s, a; θ), and the loss minimizes the squared TD error, (y − Q(s, a; θ))², over mini-batches. In our setting, Q(s, a; θ) is approximated by a VQC: the state s is encoded into a quantum register, processed by a parameterized circuit, and then measured to obtain classical expectation values, which are mapped to action-values (or to the value of the selected action) used in the TD update.
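As a classical-side illustration of this update, the TD target of Equation (1) and ε-greedy selection can be sketched as follows; `td_targets`, `epsilon_greedy`, and `q_target` are illustrative names (any Q-function approximator, VQC-based or otherwise, could stand in for `q_target`), not identifiers from our implementation:

```python
import random

def td_targets(batch, q_target, gamma=0.99):
    """TD targets y = r + gamma * max_a' Q_target(s', a') for a mini-batch.

    batch: iterable of (s, a, r, s_next, done) transitions.
    q_target: maps a state to a list of action-values (the target network).
    Terminal transitions bootstrap with 0.
    """
    ys = []
    for (s, a, r, s_next, done) in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        ys.append(r + bootstrap)
    return ys

def epsilon_greedy(q_values, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

The online network would then be trained to regress its Q-value for the taken action onto these targets, as in Equation (2).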
The baseline VQC is a layered four-qubit circuit consisting of (i) a data-encoding stage that maps classical state features to parameterized single-qubit rotations and (ii) repeated trainable layers with single-qubit rotations and an entangling chain. Such layered VQCs are widely used, and their expressibility and entangling capability are known to influence learnability and performance.
Quantum convolutional neural networks (QCNNs) introduce structured local processing and pooling-like behavior via parameterized quantum gates. Beyond their representational appeal, QCNN-style architectures have been associated with improved trainability in certain regimes, including reduced barren-plateau effects. Motivated by these properties, we construct candidate VQCs from QCNN-inspired two-qubit templates. We consider a library of two-qubit circuit motifs that differ in entangling strength and expressibility, and select two “balanced” motifs and two motifs designed for higher expressibility, following the common perspective that both factors can affect learning behavior. Related convolution-inspired quantum designs (e.g., quanvolution) further support the usefulness of structured local processing for learning tasks.
Figure 2 shows the candidate two-qubit circuit templates used to form our QCNN-inspired building blocks.
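For intuition, a parameterized two-qubit motif of this kind is just a 4×4 unitary assembled from single-qubit rotations and an entangler. The specific gate choice below (RY rotations sandwiching a CNOT) is an illustrative sketch, not necessarily one of the Figure 2 templates:

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with qubit 0 as control, qubit 1 as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def two_qubit_motif(t0, t1, t2, t3):
    """One candidate two-qubit motif: RY on each qubit, a CNOT entangler,
    then another RY pair. Four trainable angles per motif."""
    layer1 = np.kron(ry(t0), ry(t1))
    layer2 = np.kron(ry(t2), ry(t3))
    return layer2 @ CNOT @ layer1
```

Varying the rotation axes and the entangler within such templates is exactly what changes the motif's expressibility and entangling capability.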
To further enhance global information mixing, we introduce an explicit fully connected quantum layer. In classical deep learning, densely connected/fully connected structures can improve representation learning in nonlinear regression and related problems by promoting feature reuse and global connectivity. Analogously, the fully connected quantum layer increases cross-qubit connectivity by implementing an all-to-all interaction pattern among qubits using a fixed entangling scheme and parameterized single-qubit rotations. The goal is to strengthen information propagation across qubits and better capture complex dependencies that may be hard to represent with only nearest-neighbor entangling chains.
Figure 3 provides a conceptual illustration of a fully connected layer.
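A minimal sketch of this connectivity pattern, assuming one parameterized rotation per qubit followed by a fixed two-qubit entangler on every qubit pair (the gate names "ROT" and "CZ" are placeholders, not our exact gate set):

```python
from itertools import combinations

def fc_layer_ops(n_qubits, thetas):
    """Operation list for one fully connected quantum layer: a parameterized
    single-qubit rotation on each qubit, then a fixed entangling gate on
    every pair of qubits (all-to-all connectivity)."""
    assert len(thetas) == n_qubits
    ops = [("ROT", q, thetas[q]) for q in range(n_qubits)]
    ops += [("CZ", i, j) for i, j in combinations(range(n_qubits), 2)]
    return ops
```

For n qubits this adds n trainable rotations and n(n − 1)/2 entanglers, so on four qubits the layer contains six two-qubit gates, versus three for a nearest-neighbor chain.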
Based on these design principles, we construct four new four-qubit VQCs and integrate each into the same QDQN framework, yielding Models A–D. Models A and B extend the selected balanced QCNN motifs into four-qubit circuits without the fully connected layer, differing in their two-qubit building block and the resulting entangling pattern. The corresponding circuit designs are shown in Figure 4. Models C and D are built from the more expressive QCNN motif and additionally incorporate the fully connected quantum layer; their circuit designs are shown in Figure 5. Models C and D share the same overall structure and connectivity, but differ in the parameterized single-qubit rotation gate used in the fully connected part, where the two models rotate about different axes. Despite this seemingly small change, the two variants can exhibit different effective expressibility and different sensitivity to noise.
Across all variants, the agent components (replay buffer, optimizer, target updates, and exploration schedule) are kept consistent; only the VQC architecture is changed. This isolates the impact of circuit design on robustness and learning stability when evaluated under controlled noise settings.
4. Experiments
All experiments are conducted on the CartPole-v1 task in the OpenAI Gym benchmark suite. The environment has a continuous four-dimensional state and a discrete two-action space (left/right), and each executed action yields a reward of +1 until termination. The quantum models are implemented and trained under the TensorFlow Quantum framework. To ensure a fair comparison, the agent configuration follows the same baseline setting across all QDQN-type models (Baseline and Models A–D), and we only change the internal VQC architecture.
Our evaluation follows a three-step protocol. (1) Noise-free comparison: we remove the depolarizing noise to obtain a sanity check and a preliminary comparison under an ideal setting, evaluating whether each model can reach the target reward of 500 within a maximum of 2000 episodes. (2) Main robustness experiment under noise: since the baseline QDQN exhibits a very small noise-tolerance boundary under the strict criterion (Step 3), we fix the depolarizing probability p, increase the maximum episode budget to 5000, and relax the reward threshold to 300. We then measure the number of episodes required to reach reward 300; smaller values indicate stronger robustness and higher sample efficiency. (3) Noise-tolerance boundary: we return to the strict requirement (reward 500 within 2000 episodes) and estimate the largest depolarizing probability p under which the model can still stably meet the criterion, searching for p with a binary-search-style procedure.
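The boundary search in Step 3 amounts to a standard bisection over the noise level, assuming the success predicate is (approximately) monotone in p; the function and parameter names below are illustrative:

```python
def noise_boundary(passes, lo=0.0, hi=0.1, iters=20):
    """Binary search for the largest noise level p in [lo, hi] at which
    `passes(p)` still holds (e.g., 'the agent reaches reward 500 within
    2000 episodes at depolarizing probability p'). Assumes passes(lo) is
    True and that success is monotone: once it fails, it stays failed."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if passes(mid):
            lo = mid   # still succeeds: boundary is at or above mid
        else:
            hi = mid   # fails: boundary is below mid
    return lo
```

In practice `passes(p)` would itself average several training runs, since a single stochastic run can pass or fail near the boundary by chance.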
Repeated-run (“10-fold”) evaluation for RL. Because RL training results are stochastic, we adopt a 10-run repeated evaluation procedure inspired by k-fold validation. Unlike supervised learning, there is no fixed dataset split; instead, we repeat the entire training-and-evaluation process 10 times under the same setting and report the mean number of episodes to reach the target reward. The workflow is shown in Figure 6. We additionally report the per-run episode counts (Figure 6), which helps distinguish occasional outliers from consistent improvements.
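The repeated-run procedure reduces to the following sketch, where `train_once` stands for one full training-and-evaluation run under a given seed (an assumed interface, not our exact code):

```python
import statistics

def repeated_run_eval(train_once, n_runs=10):
    """Repeat the full training-and-evaluation process n_runs times and
    summarize the episodes-to-threshold metric.

    train_once(seed) -> number of episodes needed to reach the target
    reward in one independent run. Returns (mean, sample SD, per-run list).
    """
    counts = [train_once(seed) for seed in range(n_runs)]
    return statistics.mean(counts), statistics.stdev(counts), counts
```

Reporting both the mean and the per-run list mirrors the distinction drawn above between consistent improvements and occasional outliers.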
All experiments are conducted in simulation under a controlled depolarizing noise model; therefore, we do not assume a specific physical qubit platform (e.g., superconducting, spin, or photonic qubits). The “two-qubit building blocks” in Models A–D refer to circuit-level two-qubit gate motifs and connectivity patterns, rather than a particular physical construction of two-qubit states.
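For concreteness, the single-qubit depolarizing channel used as the noise model can be written in one common parameterization, ρ → (1 − p)ρ + (p/3)(XρX + YρY + ZρZ); note that other equivalent conventions (e.g., ρ → (1 − p)ρ + p·I/2) rescale p:

```python
import numpy as np

# Pauli matrices.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def depolarize(rho, p):
    """Apply a single-qubit depolarizing channel with error probability p:
    with probability 1 - p the state is untouched; with probability p one
    of X, Y, Z is applied uniformly at random."""
    return (1 - p) * rho + (p / 3) * (X @ rho @ X + Y @ rho @ Y + Z @ rho @ Z)
```

As p grows, the output approaches the maximally mixed state and measurement statistics lose the structure the learner depends on, which is the mechanism behind the noise-tolerance boundary in Step 3.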
Step 1: Noise-free comparison. Figure 7 shows a representative run without injected noise. The baseline typically reaches the target reward in the high-1k episode range. Models B and C show a similar trend to the baseline (sometimes earlier, sometimes later), whereas Model A often struggles to reach the target within 2000 episodes. In contrast, Model D tends to reach the target in fewer episodes and already exhibits a favorable trend under the ideal setting. This step serves as an informative pre-check and provides a useful signal for the subsequent noisy evaluation.
Step 2: Robustness comparison under depolarizing noise. Figure 8 shows a representative run with the fixed depolarizing noise level and the relaxed reward threshold (300). In this noisy setting, the trajectories are best interpreted with the “smaller-is-better” principle: models whose curves lie lower (reaching the threshold earlier) are more robust. We observe that Models A and C degrade noticeably under noise (their curves tend to lie above the baseline), Model B behaves similarly to the baseline after accounting for randomness, and Model D exhibits the strongest robustness, reaching the target reward with fewer episodes.
To reduce the impact of run-to-run variance, we repeat the experiment for 10 independent runs and summarize the episode counts in Figure 9. Although an individual run may occasionally look exceptionally good for a particular model, averaging over 10 runs yields a more stable comparison.
Table 1 reports the mean number of episodes required to reach reward 300. Model D achieves the lowest mean (1243), improving upon the baseline (1981) by 738 episodes (about a 37.3% reduction), while Models A and C are worse than the baseline and Model B is close to the baseline.
An interesting observation is that Models C and D share the same overall circuit structure and differ only in the rotation gate used in the fully connected part. Nevertheless, their robustness trends diverge sharply: Model D improves substantially, while Model C becomes slower and can even underperform the baseline under noise. This highlights that seemingly minor circuit-level choices can have outsized effects on robustness in noisy QRL.
Step 3: Noise-tolerance boundary. Finally, we estimate the depolarizing-noise tolerance boundary under the strict requirement (reward 500 within 2000 episodes) using binary search over p. The resulting boundaries are shown in Table 2. The baseline and Models A–C share the same boundary, while Model D doubles it, demonstrating a tangible improvement in noise tolerance.
Because reinforcement learning training is stochastic, we repeat the full training-and-evaluation pipeline 10 times for each model and report summary statistics of the episodes-to-threshold metric. In addition to reporting the mean, we report the sample standard deviation (SD), standard error (SE), and 95% confidence intervals (CI) for each model (Table 1). To support the key robustness claims, we further conduct hypothesis tests for the most relevant pairwise comparisons (Baseline vs. Model D; Model C vs. Model D) and report effect sizes and confidence intervals of mean differences (Table 3). When runs are seed-paired across models, we use the Wilcoxon signed-rank test; otherwise we use an unpaired non-parametric test (Mann–Whitney U) and optionally report Welch’s t-test as a robustness check.
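The per-model summary statistics can be computed as in the following sketch; for simplicity it uses a normal-approximation z value for the 95% CI, whereas a t critical value would be more appropriate at n = 10:

```python
import math
import statistics

def summary_stats(samples, z=1.96):
    """Mean, sample SD, standard error, and an approximate 95% CI for a
    list of episodes-to-threshold values from repeated runs.

    z = 1.96 is the normal-approximation critical value; for small n a
    Student-t critical value would widen the interval slightly.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)        # sample SD (n - 1 denominator)
    se = sd / math.sqrt(n)
    return mean, sd, se, (mean - z * se, mean + z * se)
```

The pairwise tests mentioned above (Wilcoxon signed-rank, Mann–Whitney U, Welch's t) are available in standard statistics packages and operate directly on the two per-run lists being compared.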
5. Discussion and Conclusions
Our contribution is an architecture-level redesign of the VQC used in QDQN by incorporating QCNN-inspired two-qubit motifs and (optionally) a fully connected quantum layer, while keeping the DQN training loop unchanged. This design improves robustness under controlled depolarizing noise, yielding better sample efficiency and a higher empirical noise-tolerance boundary in CartPole-v1. The main advantage is that robustness is achieved structurally, without relying on external error mitigation, and comparisons are stabilized via repeated-run evaluation. The limitations are that results are shown on a single benchmark with a single noise model and a limited set of circuit variants; broader tasks and real-hardware validation remain future work.
This study evaluates whether QCNN-inspired circuit motifs can improve the noise robustness of QDQN-style agents. Across the proposed variants, the main conclusion is that architecture matters substantially under noise, but the effect is highly non-monotonic: not every QCNN-style modification is beneficial. Under the fixed depolarizing noise level and a relaxed success criterion (reaching reward 300 within 5000 episodes), only Model D achieves a consistent improvement, reducing the mean episodes-to-threshold from 1981 (baseline) to 1243, i.e., a 738-episode reduction (approximately 37.3%). Under the stricter criterion (reward 500 within 2000 episodes), Model D doubles the empirically observed noise-tolerance boundary. In contrast, Models A–C do not improve upon the baseline and can even degrade performance in the same noisy regime.
QCNNs introduce structured locality and hierarchical information processing in quantum circuits, and have been associated with improved trainability in certain settings (e.g., reduced barren-plateau behavior for QCNN-like architectures). From the perspective of variational models, circuit expressibility and entangling capability influence the effective hypothesis class and optimization landscape. Our results are consistent with the view that an appropriate balance between expressibility, connectivity, and parameterization is required: merely swapping in a different two-qubit motif (Models A–B) or increasing nominal expressibility without careful parametrization (Model C) does not guarantee robustness gains.
A particularly informative outcome is the sharp divergence between Models C and D. These two variants share the same high-level architecture and both include the fully connected quantum layer, but differ only in the rotation axis used in the dense part. Despite this minimal change, Model D is markedly more robust under noise while Model C is not. This highlights an important practical implication for noisy QRL: seemingly minor gate-level design decisions can induce large changes in learning dynamics. One plausible explanation is that the rotation choice alters the circuit’s effective expressibility and gradient geometry, which in turn affects the stability of temporal-difference learning in the presence of stochasticity and noise.
The fully connected quantum layer is motivated by the classical intuition that dense connectivity improves information mixing and feature reuse. In our setting, adding global connectivity appears to be helpful only when combined with an effective parametrization (Model D), suggesting that connectivity alone is insufficient; rather, connectivity must be matched with a parameterization that yields a stable and useful function class under noise. This observation supports a design principle for QDQN-like agents on NISQ devices: global mixing can be beneficial, but it must be implemented with care to avoid destabilizing the optimization process.
Several limitations should be acknowledged. First, experiments are conducted on a single control benchmark (CartPole-v1) and a single noise model (depolarizing channel). While depolarizing noise is a standard abstraction and is widely studied, real devices exhibit additional effects (e.g., coherent errors and measurement noise) that may alter comparative outcomes. Second, the evaluation focuses on episode-to-threshold metrics and empirical boundary estimates; additional diagnostics (e.g., variance across random seeds, stability of Q-value estimates, and sensitivity to circuit depth and training hyper-parameters) would yield deeper mechanistic insight. Third, the architectural space explored is deliberately small; the negative results for Models A–C indicate that broader, more systematic architecture search is likely necessary to obtain consistently robust designs.
We proposed QCNN-inspired VQC architectures for QDQN and performed controlled robustness evaluations under depolarizing noise. Among the tested designs, Model D delivers a clear and reproducible robustness improvement, both in sample efficiency under fixed noise and in the estimated noise-tolerance boundary. These results reinforce the thesis that QCNN-style inductive biases and careful circuit parameterization can strengthen QDQN-like agents on noisy quantum hardware. Future work should extend validation to additional environments and more realistic noise models, and should incorporate systematic circuit discovery (e.g., ablation-driven refinement or automated search) to better understand which architectural features reliably translate into noise robustness.
Because our study is platform-agnostic and uses a controlled depolarizing noise model for fair architectural comparison, we do not commit to a specific qubit technology or device topology. On real hardware, additional effects such as measurement noise, coherent errors, and connectivity constraints (which may require SWAP networks for all-to-all interactions) can affect both performance and robustness. Validating the proposed architectures on specific devices with calibrated noise models is an important direction for future work.