1. Introduction
Reinforcement learning (RL) provides a principled framework for sequential decision-making in which an agent learns, through trial-and-error interaction with an environment, to maximize long-term return [1]. In recent years, the intersection of RL and quantum computing has attracted growing interest, driven by the possibility that quantum information processing may introduce new computational primitives for learning and control [2,3]. This interest has been further amplified by rapid progress in quantum hardware, including landmark demonstrations of programmable superconducting processors [4] and mounting evidence that useful quantum computations may be feasible prior to full fault tolerance [5]. Nevertheless, today’s landscape is still dominated by noisy intermediate-scale quantum (NISQ) devices, where noise and imperfect operations can substantially degrade the performance of quantum algorithms and learning systems.
Variational quantum circuits (VQCs) have emerged as a widely adopted paradigm for NISQ-era learning because they can be optimized in hybrid quantum–classical loops while keeping circuit depth relatively modest. VQCs have been incorporated into deep reinforcement learning in multiple ways, including parameterized quantum policies [6] and value-function approximators for Q-learning style agents [7]. Empirical studies suggest that VQC-based deep RL can be effective across standard control benchmarks [8,9]. At the same time, VQC-based agents are often sensitive to circuit architecture: properties such as expressibility and entangling capability can shape both the optimization landscape and the effective hypothesis class available to the learner [10]. This sensitivity becomes particularly consequential in the presence of noise, where seemingly minor architectural choices may translate into large differences in learning stability.
Noise remains a central obstacle to deploying quantum reinforcement learning (QRL) on NISQ hardware. Depolarizing and other realistic noise processes can distort quantum states, blur measurement statistics, and destabilize iterative learning updates. From a simulation perspective, depolarizing noise is also known to qualitatively alter circuit behavior and can erase computational structure when the noise level is sufficiently large [11]. Moreover, thermal-like noise models further highlight the difficulty of maintaining coherent learning signals at the qubit level [12]. These challenges motivate a shift from treating noise robustness as an afterthought to considering it a first-class design objective.
In this work, we take a structural approach to robustness. Rather than relying on external error mitigation, we redesign the VQC used in quantum deep Q-learning (QDQN) by incorporating architectural motifs inspired by quantum convolutional neural networks (QCNNs). QCNNs were originally proposed as a quantum analogue of convolutional processing and hierarchical feature extraction [13]. Subsequent analyses suggest that QCNN-style designs can alleviate barren-plateau-like optimization pathologies in certain regimes [14]. Related convolution-inspired quantum models (e.g., quanvolution) further indicate that structured local processing can be beneficial for learning tasks [15]. Meanwhile, quantum convolutional and hybrid quantum–classical convolutional architectures have been explored for classical data processing and image recognition [16,17,18]. Motivated by this line of work, we integrate QCNN-inspired two-qubit building blocks into the VQC of QDQN and, in some variants, introduce a fully connected quantum layer to enhance global information mixing, drawing inspiration from dense connectivity in classical deep networks [19].
We evaluate the resulting QDQN variants on the CartPole-v1 task from OpenAI Gym [20] under controlled depolarizing noise. To improve the reliability of comparisons under RL stochasticity, we adopt repeated-run evaluation guided by principles similar to k-fold validation [21]. We report mean episode-to-threshold metrics, complemented by a boundary-style test that estimates the maximum tolerable noise level. Our empirical results show that the QCNN-inspired structure can improve noise robustness; however, the gains depend strongly on circuit-level parameterization, underscoring the need for principled architecture design and rigorous evaluation.
Contributions
We propose QCNN-inspired VQC architectures for QDQN, combining structured two-qubit motifs with enhanced qubit connectivity via a fully connected quantum layer.
We conduct controlled robustness evaluations under depolarizing noise on a standard control benchmark, reporting both sample-efficiency metrics and empirical noise-tolerance boundaries.
We employ repeated-run evaluation to reduce variance and improve the reliability of model comparisons.
2. Related Work
The motivation for quantum machine learning and quantum reinforcement learning (QRL) traces back to foundational ideas in quantum computation and simulation, including early arguments for simulating physics with quantum systems [22] and the broader development of programmable quantum hardware [4]. Recent experimental progress has strengthened the case for near-term utility prior to full fault tolerance [5], which in turn motivates algorithmic approaches compatible with noisy intermediate-scale quantum (NISQ) constraints. Variational methods fit this requirement because they enable hybrid quantum–classical optimization while keeping circuit depth relatively modest, effectively trading depth for trainable parameterizations.
Early formulations of QRL examined how quantum systems can support learning and control, including models that explicitly incorporate quantum dynamics [2]. Subsequent work developed broader frameworks for quantum-enhanced learning, highlighting potential speedups and conceptual advantages in agent-based settings [3]. More recent studies have analyzed hybrid agents and quantum-accessible RL settings [23] and investigated robust control objectives in partially observed QRL scenarios [24]. Together, these works motivate continued efforts toward practical QRL implementations under realistic hardware constraints.
Variational quantum circuits (VQCs) have become a common building block for deep QRL because they can serve as compact, trainable function approximators. Parameterized quantum policy models have been proposed and evaluated on benchmark tasks [6]. For value-based methods, Skolik et al. introduced QDQN-style agents in Gym environments using variational circuits as Q-function approximators [7]. Additional studies reported feasibility and scalability improvements for VQC-based deep RL [8,9]. A recurring theme is that circuit architecture critically impacts trainability and performance: circuit properties such as expressibility and entangling capability correlate with learning outcomes [10] and may interact strongly with noise.
Quantum convolutional neural networks (QCNNs) were proposed as a quantum analogue of convolutional processing, enabling hierarchical feature extraction via structured circuit blocks [13]. Follow-up analyses suggest that QCNN-style architectures can avoid barren plateaus under certain conditions, thereby improving optimization behavior [14]. Beyond QCNNs, quanvolutional approaches have been explored for image recognition, indicating that convolution-like quantum processing can be useful even within classical data pipelines [15]. Related quantum convolutional and hybrid quantum–classical convolutional models have also been studied for classical data classification and image recognition [16,17,18]. These results provide architectural motivation for transferring QCNN-inspired motifs to QRL settings, where improved inductive bias and trainability may translate into robustness gains.
Noise remains a central obstacle for NISQ learning systems. Depolarizing noise provides a widely used abstraction and has been studied in the classical simulation of noisy circuits, revealing how increasing local depolarization can rapidly suppress useful signal [11]. Thermal-like noise models further emphasize the fragility of qubit-level information under realistic conditions [12]. In addition to algorithmic robustness, empirical evaluation in RL requires careful treatment of stochasticity. While classical k-fold cross-validation is a standard tool for obtaining reliable performance estimates in supervised learning [21], RL studies often rely on repeated runs with different random seeds; reporting repeated-run averages aligned with cross-validation principles can improve result stability and comparability. Finally, our experimental environment is based on standard Gym benchmarks [20], enabling consistent comparison with prior deep RL and quantum RL studies.
Relative to prior QDQN studies [7,8,9], our work focuses on architectural robustness under noise by integrating QCNN-inspired motifs [13,14] and enhanced connectivity motivated by dense architectures in classical deep learning [19,25]. We complement this design with repeated-run evaluation for robust comparison and a boundary-style test that quantifies noise tolerance.
Physical qubit platforms. While our study is platform-agnostic and focuses on circuit-level robustness under a controlled noise model, it is worth noting that NISQ devices can be realized using different physical qubit technologies, such as superconducting (charge/flux/transmon) qubits, spin qubits in semiconductor quantum dots, and photonic qubits. These platforms differ in coherence properties, native gate sets, and dominant noise mechanisms, which can influence practical deployments.
3. Method
Figure 1 summarizes our method. We retain the standard DQN training loop (replay buffer, target network, and ε-greedy exploration) and modify only the VQC-based Q-function approximator. Our contribution is an architecture-level redesign of the VQC using QCNN-inspired two-qubit motifs and, in some variants, an additional fully connected quantum layer. Robustness is evaluated by injecting controlled depolarizing noise during circuit execution.
We aim to improve the noise robustness of quantum deep Q-learning (QDQN) by redesigning the variational quantum circuit (VQC) that serves as the Q-function approximator. Our approach keeps the overall QDQN learning framework unchanged, but replaces the baseline VQC with QCNN-inspired circuit motifs and (for some variants) an explicit fully connected quantum layer. The resulting agents are denoted as Model A–D.
We follow the standard DQN paradigm and adopt a VQC to approximate action-values, consistent with recent QDQN-style quantum agents. Given a state s, the VQC produces a vector of Q-values Q(s, ·; θ), where θ denotes the trainable circuit parameters. Training minimizes the temporal-difference (TD) error over mini-batches sampled from an experience replay buffer:

y = r + γ max_{a'} Q(s', a'; θ⁻),  (1)
L(θ) = E_{(s,a,r,s')} [ (y − Q(s, a; θ))² ],  (2)

where γ is the discount factor and θ⁻ parameterizes a target network updated periodically (or via soft updates). During training, actions are selected using an ε-greedy policy.
Equations (1) and (2) implement the standard temporal-difference (TD) learning objective used in DQN-style methods. Specifically, given a transition (s, a, r, s') sampled from the replay buffer, the TD target is defined as y = r + γ max_{a'} Q(s', a'; θ⁻), where γ is the discount factor and θ⁻ denotes the target network that is periodically synchronized to stabilize learning. The online Q-network produces Q(s, a; θ), and the loss minimizes the squared TD error, (y − Q(s, a; θ))², over mini-batches. In our setting, Q(s, a; θ) is approximated by a VQC: the state s is encoded into a quantum register, processed by a parameterized circuit, and then measured to obtain classical expectation values, which are mapped to action-values (or to the value of the selected action) used in the TD update.
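As a classical-side illustration of this update, the TD target of Equation (1) and ε-greedy selection can be sketched as follows; `td_targets`, `epsilon_greedy`, and `q_target` are illustrative names (any Q-function approximator, VQC-based or otherwise, could stand in for `q_target`), not identifiers from our implementation:

```python
import random

def td_targets(batch, q_target, gamma=0.99):
    """TD targets y = r + gamma * max_a' Q_target(s', a') for a mini-batch.

    batch: iterable of (s, a, r, s_next, done) transitions.
    q_target: maps a state to a list of action-values (the target network).
    Terminal transitions bootstrap with 0.
    """
    ys = []
    for (s, a, r, s_next, done) in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        ys.append(r + bootstrap)
    return ys

def epsilon_greedy(q_values, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

The online network would then be trained to regress its Q-value for the taken action onto these targets, as in Equation (2).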
The baseline VQC is a layered four-qubit circuit consisting of (i) a data-encoding stage that maps classical state features to parameterized single-qubit rotations and (ii) repeated trainable layers with single-qubit rotations and an entangling chain. Such layered VQCs are widely used, and their expressibility and entangling capability are known to influence learnability and performance.
Quantum convolutional neural networks (QCNNs) introduce structured local processing and pooling-like behavior via parameterized quantum gates. Beyond their representational appeal, QCNN-style architectures have been associated with improved trainability in certain regimes, including reduced barren-plateau effects. Motivated by these properties, we construct candidate VQCs from QCNN-inspired two-qubit templates. We consider a library of two-qubit circuit motifs that differ in entangling strength and expressibility, and select two “balanced” motifs and two motifs designed for higher expressibility, following the common perspective that both factors can affect learning behavior. Related convolution-inspired quantum designs (e.g., quanvolution) further support the usefulness of structured local processing for learning tasks.
Figure 2 shows the candidate two-qubit circuit templates used to form our QCNN-inspired building blocks.
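For intuition, a parameterized two-qubit motif of this kind is just a 4×4 unitary assembled from single-qubit rotations and an entangler. The specific gate choice below (RY rotations sandwiching a CNOT) is an illustrative sketch, not necessarily one of the Figure 2 templates:

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with qubit 0 as control, qubit 1 as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def two_qubit_motif(t0, t1, t2, t3):
    """One candidate two-qubit motif: RY on each qubit, a CNOT entangler,
    then another RY pair. Four trainable angles per motif."""
    layer1 = np.kron(ry(t0), ry(t1))
    layer2 = np.kron(ry(t2), ry(t3))
    return layer2 @ CNOT @ layer1
```

Varying the rotation axes and the entangler within such templates is exactly what changes the motif's expressibility and entangling capability.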
To further enhance global information mixing, we introduce an explicit fully connected quantum layer. In classical deep learning, densely connected/fully connected structures can improve representation learning in nonlinear regression and related problems by promoting feature reuse and global connectivity. Analogously, the fully connected quantum layer increases cross-qubit connectivity by implementing an all-to-all interaction pattern among qubits using a fixed entangling scheme and parameterized single-qubit rotations. The goal is to strengthen information propagation across qubits and better capture complex dependencies that may be hard to represent with only nearest-neighbor entangling chains.
Figure 3 provides a conceptual illustration of a fully connected layer.
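A minimal sketch of this connectivity pattern, assuming one parameterized rotation per qubit followed by a fixed two-qubit entangler on every qubit pair (the gate names "ROT" and "CZ" are placeholders, not our exact gate set):

```python
from itertools import combinations

def fc_layer_ops(n_qubits, thetas):
    """Operation list for one fully connected quantum layer: a parameterized
    single-qubit rotation on each qubit, then a fixed entangling gate on
    every pair of qubits (all-to-all connectivity)."""
    assert len(thetas) == n_qubits
    ops = [("ROT", q, thetas[q]) for q in range(n_qubits)]
    ops += [("CZ", i, j) for i, j in combinations(range(n_qubits), 2)]
    return ops
```

For n qubits this adds n trainable rotations and n(n − 1)/2 entanglers, so on four qubits the layer contains six two-qubit gates, versus three for a nearest-neighbor chain.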
Based on these design principles, we construct four new four-qubit VQCs and integrate each into the same QDQN framework, yielding Models A–D. Models A and B extend the selected balanced QCNN motifs into four-qubit circuits without the fully connected layer, differing in their two-qubit building block and the resulting entangling pattern. The corresponding circuit designs are shown in Figure 4. Models C and D are built from the more expressive QCNN motif and additionally incorporate the fully connected quantum layer; their circuit designs are shown in Figure 5. Models C and D share the same overall structure and connectivity, but differ in the parameterized single-qubit rotation gate used in the fully connected part, where the two models rotate about different axes. Despite this seemingly small change, the two variants can exhibit different effective expressibility and different sensitivity to noise.
Across all variants, the agent components (replay buffer, optimizer, target updates, and exploration schedule) are kept consistent; only the VQC architecture is changed. This isolates the impact of circuit design on robustness and learning stability when evaluated under controlled noise settings.
4. Experiments
All experiments are conducted on the CartPole-v1 task in the OpenAI Gym benchmark suite. The environment has a continuous four-dimensional state and a discrete two-action space (left/right), and each executed action yields a reward of +1 until termination. The quantum models are implemented and trained under the TensorFlow Quantum framework. To ensure a fair comparison, the agent configuration follows the same baseline setting across all QDQN-type models (Baseline and Models A–D), and we only change the internal VQC architecture.
Our evaluation follows a three-step protocol. (1) Noise-free comparison: we remove the depolarizing noise to obtain a sanity check and a preliminary comparison under an ideal setting, evaluating whether each model can reach the target reward of 500 within a maximum of 2000 episodes. (2) Main robustness experiment under noise: since the baseline QDQN exhibits a very small noise-tolerance boundary under the strict criterion (Step 3), we fix the depolarizing probability p, increase the maximum episode budget to 5000, and relax the reward threshold to 300. We then measure the number of episodes required to reach reward 300; smaller values indicate stronger robustness and higher sample efficiency. (3) Noise-tolerance boundary: we return to the strict requirement (reward 500 within 2000 episodes) and estimate the largest depolarizing probability p under which the model can still stably meet the criterion, searching for p with a binary-search-style procedure.
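The boundary search in Step 3 amounts to a standard bisection over the noise level, assuming the success predicate is (approximately) monotone in p; the function and parameter names below are illustrative:

```python
def noise_boundary(passes, lo=0.0, hi=0.1, iters=20):
    """Binary search for the largest noise level p in [lo, hi] at which
    `passes(p)` still holds (e.g., 'the agent reaches reward 500 within
    2000 episodes at depolarizing probability p'). Assumes passes(lo) is
    True and that success is monotone: once it fails, it stays failed."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if passes(mid):
            lo = mid   # still succeeds: boundary is at or above mid
        else:
            hi = mid   # fails: boundary is below mid
    return lo
```

In practice `passes(p)` would itself average several training runs, since a single stochastic run can pass or fail near the boundary by chance.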
Repeated-run (“10-fold”) evaluation for RL. Because RL training results are stochastic, we adopt a 10-run repeated evaluation procedure inspired by k-fold validation. Unlike supervised learning, there is no fixed dataset split; instead, we repeat the entire training-and-evaluation process 10 times under the same setting and report the mean number of episodes to reach the target reward. The workflow is shown in Figure 6. We additionally report the per-run episode counts (Figure 6), which helps distinguish occasional outliers from consistent improvements.
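The repeated-run procedure reduces to the following sketch, where `train_once` stands for one full training-and-evaluation run under a given seed (an assumed interface, not our exact code):

```python
import statistics

def repeated_run_eval(train_once, n_runs=10):
    """Repeat the full training-and-evaluation process n_runs times and
    summarize the episodes-to-threshold metric.

    train_once(seed) -> number of episodes needed to reach the target
    reward in one independent run. Returns (mean, sample SD, per-run list).
    """
    counts = [train_once(seed) for seed in range(n_runs)]
    return statistics.mean(counts), statistics.stdev(counts), counts
```

Reporting both the mean and the per-run list mirrors the distinction drawn above between consistent improvements and occasional outliers.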
All experiments are conducted in simulation under a controlled depolarizing noise model; therefore, we do not assume a specific physical qubit platform (e.g., superconducting, spin, or photonic qubits). The “two-qubit building blocks” in Models A–D refer to circuit-level two-qubit gate motifs and connectivity patterns, rather than a particular physical construction of two-qubit states.
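For concreteness, the single-qubit depolarizing channel used as the noise model can be written in one common parameterization, ρ → (1 − p)ρ + (p/3)(XρX + YρY + ZρZ); note that other equivalent conventions (e.g., ρ → (1 − p)ρ + p·I/2) rescale p:

```python
import numpy as np

# Pauli matrices.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def depolarize(rho, p):
    """Apply a single-qubit depolarizing channel with error probability p:
    with probability 1 - p the state is untouched; with probability p one
    of X, Y, Z is applied uniformly at random."""
    return (1 - p) * rho + (p / 3) * (X @ rho @ X + Y @ rho @ Y + Z @ rho @ Z)
```

As p grows, the output approaches the maximally mixed state and measurement statistics lose the structure the learner depends on, which is the mechanism behind the noise-tolerance boundary in Step 3.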
Step 1: Noise-free comparison. Figure 7 shows a representative run without injected noise. The baseline typically reaches the target reward in the high-1k episode range. Models B and C show a similar trend to the baseline (sometimes earlier, sometimes later), whereas Model A often struggles to reach the target within 2000 episodes. In contrast, Model D tends to reach the target in fewer episodes and already exhibits a favorable trend under the ideal setting. This step serves as an informative pre-check and provides a useful signal for the subsequent noisy evaluation.
Step 2: Robustness comparison under depolarizing noise. Figure 8 shows a representative run with the fixed depolarizing noise level and the relaxed reward threshold (300). In this noisy setting, the trajectories are best interpreted with the “smaller-is-better” principle: models whose curves lie lower (reaching the threshold earlier) are more robust. We observe that Models A and C degrade noticeably under noise (their curves tend to lie above the baseline), Model B behaves similarly to the baseline after accounting for randomness, and Model D exhibits the strongest robustness, reaching the target reward with fewer episodes.
To reduce the impact of run-to-run variance, we repeat the experiment for 10 independent runs and summarize the episode counts in Figure 9. Although an individual run may occasionally look exceptionally good for a particular model, averaging over 10 runs yields a more stable comparison.
Table 1 reports the mean number of episodes required to reach reward 300. Model D achieves the lowest mean (1243), improving upon the baseline (1981) by 738 episodes (about a 37.3% reduction), while Models A and C are worse than the baseline and Model B is close to the baseline.
An interesting observation is that Models C and D share the same overall circuit structure and differ only in the rotation gate used in the fully connected part. Nevertheless, their robustness trends diverge sharply: Model D improves substantially, while Model C becomes slower and can even underperform the baseline under noise. This highlights that seemingly minor circuit-level choices can have outsized effects on robustness in noisy QRL.
Step 3: Noise-tolerance boundary. Finally, we estimate the depolarizing-noise tolerance boundary under the strict requirement (reward 500 within 2000 episodes) using binary search over p. The resulting boundaries are shown in Table 2. The baseline and Models A–C share the same boundary, while Model D doubles it, demonstrating a tangible improvement in noise tolerance.
Because reinforcement learning training is stochastic, we repeat the full training-and-evaluation pipeline 10 times for each model and report summary statistics of the episodes-to-threshold metric. In addition to reporting the mean, we report the sample standard deviation (SD), standard error (SE), and 95% confidence intervals (CI) for each model (Table 1). To support the key robustness claims, we further conduct hypothesis tests for the most relevant pairwise comparisons (Baseline vs. Model D; Model C vs. Model D) and report effect sizes and confidence intervals of mean differences (Table 3). When runs are seed-paired across models, we use the Wilcoxon signed-rank test; otherwise we use an unpaired non-parametric test (Mann–Whitney U) and optionally report Welch’s t-test as a robustness check.
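The per-model summary statistics can be computed as in the following sketch; for simplicity it uses a normal-approximation z value for the 95% CI, whereas a t critical value would be more appropriate at n = 10:

```python
import math
import statistics

def summary_stats(samples, z=1.96):
    """Mean, sample SD, standard error, and an approximate 95% CI for a
    list of episodes-to-threshold values from repeated runs.

    z = 1.96 is the normal-approximation critical value; for small n a
    Student-t critical value would widen the interval slightly.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)        # sample SD (n - 1 denominator)
    se = sd / math.sqrt(n)
    return mean, sd, se, (mean - z * se, mean + z * se)
```

The pairwise tests mentioned above (Wilcoxon signed-rank, Mann–Whitney U, Welch's t) are available in standard statistics packages and operate directly on the two per-run lists being compared.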
5. Discussion and Conclusions
Our contribution is an architecture-level redesign of the VQC used in QDQN by incorporating QCNN-inspired two-qubit motifs and (optionally) a fully connected quantum layer, while keeping the DQN training loop unchanged. This design improves robustness under controlled depolarizing noise, yielding better sample efficiency and a higher empirical noise-tolerance boundary in CartPole-v1. The main advantage is that robustness is achieved structurally, without relying on external error mitigation, and comparisons are stabilized via repeated-run evaluation. The limitations are that results are shown on a single benchmark with a single noise model and a limited set of circuit variants; broader tasks and real-hardware validation remain future work.
This study evaluates whether QCNN-inspired circuit motifs can improve the noise robustness of QDQN-style agents. Across the proposed variants, the main conclusion is that architecture matters substantially under noise, but the effect is highly non-monotonic: not every QCNN-style modification is beneficial. Under the fixed depolarizing noise level and a relaxed success criterion (reaching reward 300 within 5000 episodes), only Model D achieves a consistent improvement, reducing the mean episodes-to-threshold from 1981 (baseline) to 1243, i.e., a 738-episode reduction (approximately 37.3%). Under the stricter criterion (reward 500 within 2000 episodes), Model D doubles the empirically observed noise-tolerance boundary. In contrast, Models A–C do not improve upon the baseline and can even degrade performance in the same noisy regime.
QCNNs introduce structured locality and hierarchical information processing in quantum circuits, and have been associated with improved trainability in certain settings (e.g., reduced barren-plateau behavior for QCNN-like architectures). From the perspective of variational models, circuit expressibility and entangling capability influence the effective hypothesis class and optimization landscape. Our results are consistent with the view that an appropriate balance between expressibility, connectivity, and parameterization is required: merely swapping in a different two-qubit motif (Models A–B) or increasing nominal expressibility without careful parametrization (Model C) does not guarantee robustness gains.
A particularly informative outcome is the sharp divergence between Models C and D. These two variants share the same high-level architecture and both include the fully connected quantum layer, but differ only in the rotation axis used in the dense part. Despite this minimal change, Model D is markedly more robust under noise while Model C is not. This highlights an important practical implication for noisy QRL: seemingly minor gate-level design decisions can induce large changes in learning dynamics. One plausible explanation is that the rotation choice alters the circuit’s effective expressibility and gradient geometry, which in turn affects the stability of temporal-difference learning in the presence of stochasticity and noise.
The fully connected quantum layer is motivated by the classical intuition that dense connectivity improves information mixing and feature reuse. In our setting, adding global connectivity appears to be helpful only when combined with an effective parametrization (Model D), suggesting that connectivity alone is insufficient; rather, connectivity must be matched with a parameterization that yields a stable and useful function class under noise. This observation supports a design principle for QDQN-like agents on NISQ devices: global mixing can be beneficial, but it must be implemented with care to avoid destabilizing the optimization process.
Several limitations should be acknowledged. First, experiments are conducted on a single control benchmark (CartPole-v1) and a single noise model (depolarizing channel). While depolarizing noise is a standard abstraction and is widely studied, real devices exhibit additional effects (e.g., coherent errors and measurement noise) that may alter comparative outcomes. Second, the evaluation focuses on episode-to-threshold metrics and empirical boundary estimates; additional diagnostics (e.g., variance across random seeds, stability of Q-value estimates, and sensitivity to circuit depth and training hyper-parameters) would yield deeper mechanistic insight. Third, the architectural space explored is deliberately small; the negative results for Models A–C indicate that broader, more systematic architecture search is likely necessary to obtain consistently robust designs.
We proposed QCNN-inspired VQC architectures for QDQN and performed controlled robustness evaluations under depolarizing noise. Among the tested designs, Model D delivers a clear and reproducible robustness improvement, both in sample efficiency under fixed noise and in the estimated noise-tolerance boundary. These results reinforce the thesis that QCNN-style inductive biases and careful circuit parameterization can strengthen QDQN-like agents on noisy quantum hardware. Future work should extend validation to additional environments and more realistic noise models, and should incorporate systematic circuit discovery (e.g., ablation-driven refinement or automated search) to better understand which architectural features reliably translate into noise robustness.
Because our study is platform-agnostic and uses a controlled depolarizing noise model for fair architectural comparison, we do not commit to a specific qubit technology or device topology. On real hardware, additional effects such as measurement noise, coherent errors, and connectivity constraints (which may require SWAP networks for all-to-all interactions) can affect both performance and robustness. Validating the proposed architectures on specific devices with calibrated noise models is an important direction for future work.