3.1. RL-Based Quantum Routing
The RL-based approach is essential for improving quantum circuit performance on noisy intermediate-scale quantum (NISQ) devices. Efficient routing reduces the need for error-prone swap operations, resulting in reduced circuit depth and higher algorithmic success rates. RL adapts routing techniques to various quantum architectures and circuit requirements, overcoming the limits of inflexible heuristic approaches. This subsection summarizes Q-learning-based reinforcement learning (qRL) [26] and concludes with a brief comparative takeaway to support the systematic comparison of Section 3.
In the qRL protocol [26], Q-learning is used to improve routing decisions in dynamic quantum networks (Figure 4). The objective is to maximize end-to-end entanglement integrity and request success rates while reducing resource consumption (qubits and EPR pairs). In qRL, an agent learns routing policies via Q-learning by interacting with a simulated quantum network environment [42,43]. The environment consists of nodes, communication channels, EPR pairs, and fidelity measures that provide real-time feedback on network status. The state (S) represents the current network configuration, including channel quality, EPR pair availability, and the number of qubits per node. Based on this state, the agent chooses an action (A), such as selecting a route or assigning EPR pairs to connections, to optimize routing decisions. The reward (R) is determined by the fidelity improvement of the chosen route and the effective use of network resources, encouraging the agent to prefer high-quality paths while reducing resource consumption. The agent follows a policy (π), specifically an ε-greedy strategy, that balances exploration (random actions) with exploitation (choosing the action with the highest estimated value) to adapt to dynamic quantum networks. This enables qRL to dynamically optimize routing decisions, producing high-fidelity communication and efficient resource use in quantum entanglement networks. The schematic is included to explicitly map the RL components (state–action–reward) to the quantum networking primitives used in entanglement routing. The approach integrates two critical quantum networking quantities: the end-to-end (E2E) entanglement rate and the fidelity after purification. For a route with N repeaters, the E2E entanglement rate is determined as follows:
where γ is the EPR generation rate and q is the per-stage success probability of entanglement extension (including entanglement swapping, as applicable). Furthermore, to improve fidelity by purifying two low-fidelity EPR pairs with fidelities f1 and f2, the following is used:
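The Q-value for state-action pairs is updated using
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where α denotes the learning rate, γ represents the discount factor (not to be confused with the EPR generation rate above), R(s, a) is the reward, and s′ is the state reached after taking action a in state s.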
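The Q-learning update and ε-greedy policy described above can be sketched in a few lines. This is a minimal tabular illustration, not the implementation from [26]: the state label `s0`, the route names, and the toy reward values are hypothetical stand-ins for a fidelity gain minus a resource cost.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, actions, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q <- Q + alpha * (reward + gamma * max Q' - Q)."""
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Toy usage: two hypothetical candidate routes from a single network state.
q_table = defaultdict(float)
rng = random.Random(0)
actions = ["route_a", "route_b"]
toy_reward = {"route_a": 0.9, "route_b": 0.4}  # assumed values, for illustration only
for _ in range(200):
    chosen = epsilon_greedy(q_table, "s0", actions, epsilon=0.2, rng=rng)
    # The episode ends after one routing decision, so there are no next actions.
    q_update(q_table, "s0", chosen, toy_reward[chosen], "terminal", [])
```

In this toy setting the learned value of the higher-reward route ends up larger, so the greedy branch of the policy selects it while the ε branch keeps sampling the alternative.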
The authors evaluate multiple experimental settings in which the network follows a lattice topology, expressed as G = (V, E). This structure consists of teleportation nodes and an embedded system (ES), which handles communication and processing. The configurations of the experimental setup are presented in Table 1; the key parameters (memory, initial fidelity, and EPR success probability) are varied across four configurations.
According to the study, the proposed qRL technique outperforms all baseline approaches, with around 20% greater fidelity in Config1, which models high-resource networks. Notably, in Config4, which corresponds to low-resource situations, qRL maintains a fidelity above 0.6, whereas the qDijkstra technique falls below 0.4. In larger networks with n > 30, redundant routes improve system stability and dependability. Moreover, qRL maintains a request success rate of more than 80% under a 3000-request demand, even in the resource-constrained Config4 environment. Traditional approaches, such as R0 and qDijkstra, show severe performance deterioration, with success rates falling below 50% under heavy network loads. This demonstrates the stability and scalability of the qRL framework in dynamic network situations.
Comparative takeaway: Compared with heuristic baselines (qDijkstra and rule-based routing), qRL improves adaptivity under changing resource and load conditions, which helps sustain both fidelity and request success in constrained regimes. The trade-off is additional learning and tuning overhead (e.g., reward design and exploration strategy), and performance generalization depends on the reported simulation assumptions.
In another study, the authors propose proactive entanglement swapping and entanglement caching to improve quantum routing algorithms [28], particularly SEER [44] and REPS [45]. The technique stores idle entangled links for future use while applying RL to proactively swap entanglement in high-demand network segments. The workflow of the approach proposed in [28] is shown in Figure 5.
The study analyzes and optimizes the quantum routing process using several mathematical formulations, including the Bellman equation, swapped segments, the topology matrix, the request matrix, and the segment vector. These can be stated as follows:
In Equation (4), the authors use the Q-learning update rule, where α is the learning rate, γ is the discount factor, and R(s, a) is the reward for taking action a in state s.
The proposed approach’s reward system is based on three components: positive rewards (PRs), negative rewards (NRs), and penalties (Ps). In this framework, PRs are awarded when proactively swapped segments are successfully used by future requests, which promotes effective allocation of entanglement resources. In contrast, NRs are assigned when proactively swapped segments expire without being used, resulting in wasted resources. Furthermore, a P is applied when the algorithm fails to swap a segment that is later required, which degrades overall routing. Finally, the authors define the deep Q-learning state as follows:
Here, the first component is the topology matrix, which serves as the adjacency matrix of entangled links. The second is the request matrix, which captures the number of requests between node pairs at a given time step. The third is the segment vector, a binary vector that indicates the candidate segments available for routing entanglement.
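As a rough illustration of how the three reward components and the three-part state could fit together, the sketch below combines them in plain Python. The reward magnitudes, the function names, and the flattening order are assumptions made for this example; the source names the components but this section does not give their values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Hypothetical magnitudes, chosen only to make the example concrete.
    positive: float = 1.0   # PR: a proactively swapped segment was later used
    negative: float = -0.5  # NR: a proactively swapped segment expired unused
    penalty: float = -1.0   # P: a segment that was needed had not been swapped


def step_reward(used, expired, missed, cfg=RewardConfig()):
    """Sum the three reward components for one time step."""
    return used * cfg.positive + expired * cfg.negative + missed * cfg.penalty


def build_state(topology, requests, segments):
    """Flatten topology matrix, request matrix, and segment vector into one input list."""
    flat = [x for row in topology for x in row]
    flat += [x for row in requests for x in row]
    flat += list(segments)
    return flat


# Two-node toy network: one entangled link, two pending requests, three candidate segments.
state = build_state([[0, 1], [1, 0]], [[0, 2], [0, 0]], [1, 0, 1])
reward = step_reward(used=2, expired=1, missed=0)
```

A flat list is used here only because it is the simplest input for a fully connected Q-network; other encodings are equally plausible.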
Furthermore, the authors use several features to formulate their proposed approach. Table 2 provides the specific experimental conditions used in that study.
The proposed DQRL technique improves request success rates for both SEER and REPS. For SEER, DQRL raises the success rate from 50% to 74%, a relative improvement of 48%. Similarly, the success rate of REPS rises from 45% to 73%, a relative improvement of 61%. Additionally, the entanglement lifetime has a significant impact on the effectiveness of entanglement caching. When entangled links persist for six or more time slots, caching improves request success rates by 12% to 19%. Furthermore, the probability of successful entanglement swapping is critical for network performance. When the swap probability is increased to 0.9, the system achieves a 55% improvement over the baseline, highlighting the importance of reliable entanglement swapping in improving quantum network stability.
Comparative takeaway: Compared with baseline SEER/REPS and heuristic proactive swapping, this RL-guided caching and swapping strategy provides higher request success rates by explicitly targeting high-demand segments and reusing idle entanglement resources. The approach is particularly effective when entanglement lifetime and swap success probability are sufficiently high, suggesting a trade-off in which gains depend on underlying physical-layer reliability and assumed timing parameters.
In another study, the authors propose a deep quantum routing agent (DQRA), a DRL architecture for routing entanglement requests in quantum networks, as illustrated in Figure 6 [30]. The proposed hybrid framework combines a scheduling neural network (SNN) with a qubit-preserving shortest-path strategy to maximize the number of requests served within each time window. DRL is primarily used for dynamic request scheduling to improve resource allocation across qubits and repeaters. The authors also argue that the approach supports scalability by keeping training and routing complexity polynomial with respect to network size, which targets large-scale quantum networks. The experimental setup of the study is presented in Table 3, and the overall mathematical formulation can be expressed as follows:
The reward at step t is designed to maximize the number of served requests. It combines the number of resolved requests, a success reward, and a failure penalty, each applied under its own condition. γ ∈ [0, 1] is the discount factor, and f is a binary flag that marks the end of the episode. For the Q-learning update, the study uses
In Equation (8), lr denotes the learning rate, and the remaining term is the target network’s estimate. The study also defines an edge-weight metric for shortest-path computation based on qubit capacity, where lower weights favor nodes with greater qubit availability.
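To make the idea of a qubit-preserving shortest path concrete, the sketch below runs Dijkstra's algorithm with an edge weight that decreases as the free qubit capacity of the endpoints increases. The weight 1/c_u + 1/c_v is an assumption chosen for illustration and is not the metric defined in the study; the node names and capacities are likewise hypothetical.

```python
import heapq


def edge_weight(cap_u, cap_v):
    """Assumed weight: smaller when both endpoints have more free qubits."""
    return 1.0 / cap_u + 1.0 / cap_v


def qubit_aware_shortest_path(adj, capacity, src, dst):
    """Dijkstra over capacity-dependent weights; returns a node list or None."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in adj[u]:
            # A node without free qubits cannot take part in a route.
            if capacity[u] <= 0 or capacity[v] <= 0:
                continue
            nd = d + edge_weight(capacity[u], capacity[v])
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy square network: two equal-hop routes from A to D, via B or via C.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
capacity = {"A": 2, "B": 1, "C": 4, "D": 2}
route = qubit_aware_shortest_path(adj, capacity, "A", "D")
```

Both candidate routes have two hops, so a hop-count metric could not separate them; the capacity-dependent weight steers the route through the node with more free qubits.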
The evaluation results show that both DQN and DRN achieve high success rates in qubit-rich settings, ranging from 80% to 90%. Even under constrained settings with c_i = 2, they retain a success rate of 60% to 75%, exceeding baseline approaches by 20% to 40%. In terms of scalability, training time grows polynomially with network size, whereas routing time remains linear, even for large networks such as 25 × 25 nodes. In terms of training dynamics, DQN converges faster than DRN but tends to plateau earlier. Additionally, the parameter count of the SNN scales quadratically with network size.
Comparative takeaway: Relative to heuristic baselines that rely on fixed shortest-path scheduling, DQRA adds adaptivity through DRL-based scheduling and explicitly prioritizes qubit availability in routing decisions. This improves the success rate under both qubit-rich and qubit-constrained regimes, but introduces additional training overhead and a growing parameter footprint as network size increases.
In another study [27], the authors examine proximal policy optimization (PPO)-based routing for quantum networks, which optimizes the singlet fraction (f) to mitigate quantum-specific errors such as amplitude damping and phase noise [46]. The approach improves error resilience and uses a 6G-inspired low-density parity-check topology to provide reliable quantum communication [47]. The overall procedure is illustrated in Figure 7. The study presents several formulations:
Here, the fidelity quantifies entanglement quality in terms of teleportation performance. The study defines both an intermediate (preliminary) reward and a final reward, and uses a threshold value of 0.8 in the reward design. In addition, the quantum-state model considers Werner states under amplitude-damping noise and phase-noise effects.
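The link between singlet fraction, teleportation performance, and a thresholded reward can be illustrated with two textbook relations for Werner states: ideal entanglement swapping of two Werner pairs with fidelities F1 and F2 yields F1·F2 + (1 − F1)(1 − F2)/3, and a two-qubit state with singlet fraction F gives a teleportation fidelity of (2F + 1)/3. The sketch below applies them along a path. The binary `final_reward` is a hypothetical reward shape built around the 0.8 threshold, and the amplitude-damping and phase-noise models of the study are not reproduced.

```python
def swap_fidelity(f1, f2):
    """Fidelity after ideal entanglement swapping of two Werner pairs."""
    return f1 * f2 + (1.0 - f1) * (1.0 - f2) / 3.0


def path_fidelity(link_fidelities):
    """Fold swap_fidelity over a non-empty list of per-link fidelities."""
    f = link_fidelities[0]
    for nxt in link_fidelities[1:]:
        f = swap_fidelity(f, nxt)
    return f


def teleportation_fidelity(f):
    """Teleportation fidelity achievable with singlet fraction f (qubit case)."""
    return (2.0 * f + 1.0) / 3.0


def final_reward(f, threshold=0.8):
    """Hypothetical terminal reward: 1 if the end-to-end fidelity meets the threshold."""
    return 1.0 if f >= threshold else 0.0


two_hop = path_fidelity([0.95, 0.95])      # stays above the 0.8 threshold
three_hop = path_fidelity([0.9, 0.9, 0.9])  # drops below the threshold
```

The example shows why route length matters under such a reward: each additional swap lowers the end-to-end fidelity, so a longer path of moderately good links can fail a threshold that a shorter path passes.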
The proposed method is compared with conventional baselines such as Dijkstra and Monte Carlo, using a reinforcement learning framework designed to maximize routing efficiency. The overall setup is summarized in Table 4.
In complex networks with 16 nodes and white noise, PPO outperforms Monte Carlo by up to 13,000×. While Dijkstra performs well for additive errors, it fails when quantum noise is introduced. PPO exploits error cancellation under amplitude damping, needing just 60,000 actions, as opposed to 546 million for Monte Carlo on the same network size. PPO also adapts to reversible phase shifts, lowering the number of steps required to 39,000 compared with Monte Carlo’s 60 million. In dynamic network situations, PPO retains 93.4% functionality, considerably outperforming Monte Carlo’s 33.1% under changing error scenarios.
Comparative takeaway: Compared with classical shortest-path routing (Dijkstra) and sampling-heavy optimization (Monte Carlo), PPO introduces adaptive decision-making that explicitly accounts for quantum noise models, which improves robustness in amplitude damping and phase-noise settings. The main limitation is the reliance on extensive training interactions and the dependence of performance on the assumed noise model and topology configuration.
In another study, Le et al. [29] propose a deep reinforcement learning approach for entanglement routing in quantum networks. They introduce the Deep Quantum Routing Agent (DQRA), a learning-based routing system designed to improve network performance by increasing the number of communication requests successfully served within a given time window. The method combines a deep neural network for request scheduling with a qubit-preserving shortest-path strategy for route selection. The quantum network is modeled as a graph G = (V, E), where V denotes quantum nodes and E denotes entangled links. The routing objective is formulated to maximize the effective routing rate by selecting efficient paths for source–destination pairs under resource constraints. The DQRA framework represents the environment state using network topology, available qubit capacities, and pending requests. Actions correspond to scheduling and routing decisions for entanglement requests, while the reward function is designed to encourage serving more requests and penalize inefficient decisions or early termination. The authors train and evaluate the agent using deep learning-based variants, including a DQN-based scheduler and an additional DRN-based model, to support decision-making under dynamic conditions. The reported experiments on grid-based topologies indicate that the approach achieves high request success rates even under limited qubit capacity and demonstrates favorable scaling trends as network size increases.
Comparative takeaway: Compared with heuristic scheduling and static shortest-path routing, this DRL-based design improves adaptivity by jointly optimizing scheduling and routing decisions under resource constraints, while introducing added training complexity and dependence on the assumed topology and evaluation setup.
3.2. Heuristic & Algorithmic Approaches
In this subsection, we explore non-ML methods such as heuristic algorithms and collaborative routing strategies. One study [31] proposes the Collaboratively Optimized Selection of Paths (COSP) method, which is intended to improve entanglement routing in quantum networks by maximizing expected throughput, service rate, and quantum resource use. The technique begins with a network model in which the quantum network is represented as a graph G(V, E, C), where V comprises quantum processors and repeaters, E represents quantum channels, and C reflects channel use. The entanglement process is separated into four stages: request submission, route selection, link status broadcasting, and entanglement swapping, which ensures systematic distribution of quantum resources.
The COSP algorithm consists of multiple optimization strategies. The P1 algorithm organizes requests into a multilevel queue based on priority and source-destination distance, reducing resource conflicts while ensuring fairness by prioritizing the furthest and closest pairs within the same priority level. The P2 algorithm selects candidate paths using Dijkstra’s and Yen’s algorithms [48], incorporating linear programming to estimate resource consumption. Additionally, Monte Carlo Tree Search (MCTS) is used to optimize resource allocation, treating the problem as a sequential decision-making process in which each source-destination pair chooses a path to maximize expected throughput. The P4 algorithm further enhances reliability by implementing a “fail-retransmit” mechanism, which gives failed entanglement attempts another opportunity to succeed in the final request queue [49].
COSP uses probabilistic models to calculate the expected throughput (E) and resource usage (U) of each path, based on entanglement success rates and qubit consumption. The MCTS procedure is formulated as a Markov decision process (MDP), with the goal of maximizing the expected reward (throughput) via optimal path selection. The UCB1 formula is adjusted to balance exploration and exploitation, which improves decision-making efficiency in the search tree. The reported evaluation indicates that COSP improves network performance, increasing expected throughput and service rates, especially in high-concurrency settings, while maintaining fairness and effective resource usage. Table 5 summarizes the parameters and performance metrics.
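The exploration and exploitation balance used in the MCTS step can be illustrated with the standard UCB1 score, mean reward plus c·sqrt(ln N / n) with c = sqrt(2). The study states that it adjusts UCB1, and that adjustment is not reproduced here; the path identifiers and statistics below are hypothetical.

```python
import math


def ucb1_score(mean_reward, child_visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1: exploitation term plus an exploration bonus."""
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_path(stats, c=math.sqrt(2)):
    """stats maps a path id to (total_reward, visits); returns the id to expand next."""
    parent_visits = sum(visits for _, visits in stats.values())

    def score(path_id):
        total, visits = stats[path_id]
        mean = total / visits if visits else 0.0
        return ucb1_score(mean, visits, parent_visits, c)

    return max(stats, key=score)


# Hypothetical statistics: p1 has the higher mean, p2 has been tried far less often.
stats = {"p1": (6.0, 10), "p2": (1.0, 2)}
next_path = select_path(stats)           # exploration bonus favors p2
greedy_path = select_path(stats, c=0.0)  # pure exploitation favors p1
```

Setting c to zero reduces the rule to greedy selection, which makes the role of the exploration constant easy to see in small examples.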
The study reports that COSP achieves a 50% increase in average expected throughput over earlier methods, particularly under high-concurrency conditions. COSP also increases service rates by 55%, which suggests greater efficiency in serving communication requests.
Comparative takeaway: Compared with RL-based routing methods, COSP relies on explicit path search and resource estimation, which can provide more interpretable decision logic and avoid training overhead. However, the approach may incur higher online computation due to multi-stage optimization and MCTS, and performance depends on the assumed topology model and probabilistic resource estimates.
In another study [35], the authors address the challenge of deploying quantum repeaters in large-scale quantum networks at reasonable cost. Quantum repeaters are required to extend communication distance by mitigating fidelity degradation during transmission. To reduce the number of repeaters while preserving end-to-end connectivity, the study proposes two heuristic placement strategies, the Multi-Center Approach (MCA) and the Single-Center Approach (SCA). These heuristics aim to obtain near-optimal solutions substantially faster than Integer Linear Programming (ILP), which can become computationally expensive for large networks [50]. The experimental settings and evaluation metrics are summarized in Table 6.
MCA operates in two phases: center selection and center connection. In center selection, nodes are chosen as candidate repeater locations based on connectivity and coverage, with additional centers added when the distance between leaf-node pairs exceeds the maximum transmission range Lmax. The process continues until the network is covered by the selected centers. In the center connection phase, inter-center links are established, and intermediate nodes are introduced when distances exceed Lmax. The study further explores connection strategies, including an MST-based method that adds repeaters where needed and a more exhaustive variant that considers additional nodes to improve connectivity [50,51,52]. Figure 8 illustrates the overall methodology.
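The Lmax constraint in the center connection phase implies a simple lower bound on the number of intermediate repeaters a long link needs. The sketch below computes that bound under a simplifying assumption that is not stated in the study, namely that repeaters can be placed at any point along a straight link; the function names and distances are illustrative only.

```python
import math


def repeaters_needed(distance, l_max):
    """Minimum intermediate repeaters so that every hop is at most l_max long."""
    if l_max <= 0:
        raise ValueError("l_max must be positive")
    if distance <= l_max:
        return 0
    # ceil(distance / l_max) hops are required, which needs one fewer repeater.
    return math.ceil(distance / l_max) - 1


def total_repeaters(link_lengths, l_max):
    """Sum the per-link lower bounds over a set of inter-center links."""
    return sum(repeaters_needed(d, l_max) for d in link_lengths)


# Hypothetical inter-center distances in km with a 100 km transmission range.
count = total_repeaters([80, 200, 250], l_max=100)
```

When repeaters are restricted to existing network nodes, as in a placement heuristic over a fixed topology, the actual number can be higher than this bound.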
The reported results indicate that the heuristic methods substantially reduce computation time compared with ILP. For example, the study notes that ILP can take days to solve a network with 54 nodes, whereas SCA can produce a feasible solution within seconds, illustrating the practical advantage of heuristic placement in large-scale settings.
Comparative takeaway: Compared with RL-based routing methods that optimize online decisions, repeater-placement heuristics target the infrastructure layer by reducing required hardware while keeping connectivity constraints. The main benefit is scalability and fast runtime relative to ILP, while the limitation is that solution quality depends on heuristic design choices and network topology assumptions.
3.3. Entanglement- and QKD-Based Methods
Recent quantum networking studies increasingly rely on entanglement-routing optimization and QKD-oriented protocol design to improve reliability, throughput, and security under realistic constraints [32,33,34]. This subsection summarizes how these studies structure their methods and what their findings imply for future deployment.
In [32], Yiming et al. propose a two-stage framework for entanglement routing that aims to maximize both the number of served quantum-user pairs and the expected throughput. The offline stage focuses on selecting a set of quantum-user pairs and their primary routes by solving an ILP formulation. Because the offline problem is NP-complete, the authors relax binary decision variables to continuous values, solve the resulting linear program, and recover an integer solution using branch-and-bound. Candidate paths are generated using shortest-path-based selection (via Yen’s method) to limit resource usage and form a set of primary routing paths.
In the online stage, the objective shifts to maximizing expected throughput by optimizing qubit assignments along the selected routes. This stage is also formulated as an ILP and is NP-hard [53]. The study follows a similar relax-and-recover approach (continuous relaxation followed by branch-and-bound) to obtain a feasible integer solution and improve route-level qubit allocation. To enhance robustness, the authors incorporate a recovery-route mechanism in which each switch on a primary path maintains a precomputed local recovery path. When a link fails, a switch uses local information to trigger an alternative recovery path and perform entanglement swapping, thereby sustaining connectivity and improving stability under failures. The network model consists of a quantum network graph, quantum switches, and quantum links. The mathematical formulation can be expressed as
where G = (V, E) is an undirected graph, V is the set of quantum nodes, and E is the set of edges connecting them. Each quantum switch has qubits available for entanglement, and the success probability of entanglement swapping at each switch is denoted by q ∈ [0, 1]. Each link has an entanglement-generation success rate defined as follows:
Here, α is a constant that depends on the link material, and the other parameter is the length of the link.
For routing, the authors use the notion of expected throughput for a quantum-user pair. The expected throughput is defined for a route A = {v_0, v_1, …, v_l}, where v_0 and v_l are the two users of the pair. The definition depends on the number of qubits, the path A, the success rate of generating entanglement on each link, the swapping success probability q at each switch, and the number of edges l in the route. The shortest paths are determined using Yen’s method, which minimizes resource use and ensures efficient routing. When a link fails, the recovery-route algorithm creates alternate paths within a K-hop distance to improve network resilience and dependability. The swapping strategy governs entanglement swapping at switches, preferring quantum-user pairs with greater expected throughput to ensure connectivity and performance. To obtain near-optimal solutions, the branch-and-bound method is used, with an approximation ratio of 2, resulting in efficient resource use and high-performance quantum communication [54]. Figure 9 shows the overall problem-solving procedure of the study across three stages. Additionally, Table 7 summarizes the experimental setups.
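The dependence of route quality on link length, swapping probability, and hop count can be illustrated with a deliberately simplified model. The sketch assumes an exponential-decay link success probability exp(−α·L), a form commonly used in entanglement-routing models, and a single channel per link, so that a route with l links succeeds with the product of its link probabilities times q^(l−1). Both are assumptions made for this example; the expected-throughput definition of the study, which also accounts for the number of qubits assigned, is not reproduced.

```python
import math


def link_success(alpha, length):
    """Assumed per-link entanglement success probability: exp(-alpha * length)."""
    return math.exp(-alpha * length)


def path_success(link_probs, q):
    """Single-channel route: all links succeed and all l - 1 intermediate swaps succeed."""
    hops = len(link_probs)
    if hops == 0:
        return 0.0
    return math.prod(link_probs) * q ** (hops - 1)


# Hypothetical parameters: alpha in 1/km, lengths in km, swap success q = 0.9.
alpha = 0.01
short_route = path_success([link_success(alpha, 20)] * 2, q=0.9)
long_route = path_success([link_success(alpha, 20)] * 4, q=0.9)
```

Under this model every extra hop multiplies the success probability by one more link term and one more factor of q, which is why shortest-path candidates are a natural starting set for route selection.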
The authors compare their proposed algorithms, MULTI-R and MR-REC, with benchmark methods. MULTI-R reliably serves all 20 quantum-user pairs, exceeding FER, Q-PASS [55,56], and B1, whereas FER suffers in resource-constrained networks. MR-REC achieves the highest expected throughput, improving by up to 55% over FER, whereas MULTI-R falls behind but still outperforms FER. Despite having the maximum throughput, ALG-4 sacrifices fairness by serving fewer pairs. In terms of resilience, when 10% of the switches go offline, MR-REC shows a greater decline in throughput but stays stable, with losses remaining less than twice the ratio of offline switches.
Comparative takeaway: Compared with greedy and heuristic baselines (FER, Q-PASS, and hop-based routing), the proposed MULTI-R/MR-REC framework provides more systematic control over the trade-off between fairness (serving more user pairs) and throughput (expected ebits per unit time) by jointly optimizing path selection and qubit allocation. Unlike approaches that only optimize routing paths, it explicitly includes a recovery mechanism to handle link failures, improving resilience. However, relative to learning-based routing methods, it relies on ILP and branch-and-bound procedures, which introduce higher computational overhead and make scalability and runtime more sensitive to network size and optimization constraints.
In another study [34], the authors propose a multi-qubit GHZ-state-based QKD system that employs single-qubit transmission and quantum non-demolition (QND) measurements to transmit multiple classical bits. In the proposed QKD scheme, Alice generates a GHZ state of L+1 qubits and sends one qubit to Bob. She then prepares L auxiliary qubits and encodes key values (1 as |1⟩|1⟩ and 0 as |0⟩|0⟩). Alice performs a Bell state measurement (BSM) between the first entangled qubit and the first auxiliary qubit and reports the result to Bob. Bob applies suitable gates depending on the BSM results and measures his qubit using the QND measurement. Table 8 presents the overall QKD method and key features of the study, Table 9 shows the overall setup of the study, and Table 10 displays the experimental setup for the QKD scheme.
Comparative takeaway: Compared with standard single-qubit QKD protocols, the GHZ-based design improves protocol efficiency by enabling multiple key bits per transmitted qubit via multi-qubit entanglement and QND measurement. However, it requires more complex state preparation and control operations, and its practical benefit depends on the feasibility of maintaining GHZ entanglement and performing reliable measurements under realistic noise and resource constraints.
In another study [33], the authors aim to improve the fidelity of encoded Bell pairs for long- and short-distance communication, using the generalized network illustrated in Figure 10. The major phases of the process are as follows. To begin, a Bell state |Φ+⟩ is created using Hadamard and CNOT gates. The Bell state is encoded using a Quantum Repetition Code (QRC) with 2k + 1 qubits. The ancilla qubits are initialized to |0⟩ and entangled with the main qubits via CNOT gates. The encoded qubits are sent over a quantum channel, which may introduce bit-flip or phase-flip errors. Short-distance communication uses the stabilizer formalism to measure syndromes and correct errors locally. To measure syndromes across long distances, Alice and Bob employ a classical communication channel, which allows them to exchange measurement results and correct errors globally. Finally, the fidelity of the Bell state is determined after error correction to assess the protocol’s performance. The entire scheme was simulated using the IBM Qiskit QASM simulator.
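The benefit and cost of a (2k + 1)-qubit repetition code can be quantified with a short calculation. Assuming independent bit-flip errors with probability p per qubit and majority-vote decoding, a logical error occurs only when more than k qubits flip. These assumptions are made for this sketch and it does not reproduce the Qiskit simulation of the study; phase-flip errors follow the same count in the conjugate basis.

```python
from math import comb


def logical_error_rate(k, p):
    """Probability that more than k of the 2k + 1 qubits flip (majority vote fails)."""
    n = 2 * k + 1
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1, n + 1))


# Physical error rate of 10% with no encoding, a 3-qubit code, and a 5-qubit code.
rates = [logical_error_rate(k, 0.1) for k in (0, 1, 2)]
```

For p below 0.5 the logical error rate falls as k grows, while the qubit count per encoded qubit rises linearly, which is the overhead trade-off that the encoding introduces.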
Comparative takeaway: Compared with routing-centric or scheduling-centric methods, this work focuses on physical-layer robustness by improving entanglement fidelity through encoding and error correction. The approach can enhance reliability under noise, but it introduces additional qubit overhead (2k + 1 encoding) and operational complexity, which may increase resource demands when integrated into larger network-level protocols.