A Novel Path Planning Approach for Mobile Robot in Radioactive Environment Based on Improved Deep Q Network Algorithm

Abstract: The path planning problem of nuclear environment robots refers to finding a collision-free path under the constraints of path length and accumulated radiation dose. To solve this problem, the Improved Dueling Deep Double Q Network (ID3QN) algorithm, based on an asymmetric neural network structure, is proposed. To address the issues of overestimation and low sample utilization in the traditional Deep Q Network (DQN) algorithm, we optimized the neural network structure and used a double network to estimate action values. We also improved the action selection mechanism, adopted a prioritized experience replay mechanism, and redesigned the reward function. To evaluate the efficiency of the proposed algorithm, we designed simple and complex radioactive grid environments for comparison, and we compared the ID3QN algorithm with traditional algorithms and several deep reinforcement learning algorithms. The simulation results indicate that in the simple radioactive grid environment, the ID3QN algorithm outperforms traditional algorithms such as A*, GA, and ACO in terms of path length and accumulated radiation dose. Compared to other deep reinforcement learning algorithms, including DQN and some improved DQN algorithms, the ID3QN algorithm reduced the path length by 15.6%, decreased the accumulated radiation dose by 23.5%, and converged approximately 2300 episodes faster. In the complex radioactive grid environment, the ID3QN algorithm also outperformed the A*, GA, ACO, and other deep reinforcement learning algorithms in terms of path length and accumulated radiation dose. Furthermore, the ID3QN algorithm can plan an obstacle-free optimal path with a low radiation dose even in complex environments. These results demonstrate that the ID3QN algorithm is an effective approach for solving robot path planning problems in nuclear environments, thereby enhancing the safety and reliability of robots in such environments.


Introduction
Nuclear energy, regarded as a significant clean energy source, is extensively utilized worldwide. However, its development and utilization have also introduced severe environmental challenges [1]. The nuclear environment is characterized by unique conditions, including high temperature, high pressure, and intense radiation emitted by nuclear reactors, which pose substantial safety risks to workers. To safeguard personnel safety, enhance work efficiency, and ensure environmental stability, nuclear environment robots have been introduced into nuclear energy facilities. Designed specifically for the nuclear energy environment, these robots can adapt to extreme conditions and replace humans in performing dangerous tasks. In realizing the autonomous navigation of nuclear environment robots, path planning technology is particularly important. The path planning problem of nuclear environment robots refers to finding a collision-free path under the constraints of path length and accumulated radiation dose [2].
In recent years, path planning in nuclear environments has been a research hotspot of great interest, and numerous scholars have provided innovative ideas for the problem. Pei et al. [3] proposed a Dijkstra algorithm based on the cumulative dose to find the optimal evacuation path; they used an equilibrium model considering road resistance to formulate optimal evacuation path planning under a nuclear accident, and the obtained paths were rationally arranged to minimize the collective dose to all personnel. Chen et al. [4] proposed an improved version of the A* algorithm, which provides paths for the nuclear power plant environment that not only have a lower cumulative dose but also a suitable length. According to the experimental results, this improved A* algorithm has strong anti-interference ability and can effectively cope with environmental disturbances. Chao et al. [5] proposed a grid-based rapidly exploring random trees star (GB-RRT*) algorithm, which is more efficient than the traditional rapidly exploring random trees star (RRT*) algorithm in terms of convergence and the reliability of the minimum dose path; the GB-RRT* algorithm therefore provides an efficient realization of the minimum dose path for occupational workers. Zhang et al. [6] proposed a path planning scheme with double constraints in which both the time and dose values are taken into account. To better minimize the effect of the radiation dose on path planning, they introduced a radiation dose weighting factor and incorporated it into an ant colony algorithm for correction. According to the experimental results, the algorithm provides new ideas and effective solutions for path planning in nuclear accident emergency rescue. Ya et al. [7] proposed two bio-inspired minimum dose path planning methods using the particle swarm optimization (PSO) algorithm and the genetic algorithm (GA), respectively, which yielded optimal paths for robots in radioactive environments with low radiation exposure. Zhou et al. [8] proposed an improved hierarchical swarm ant colony algorithm known as the improved hybrid elephant swarm ant colony algorithm. With the improved algorithm, the paths obtained in radioactive environments are more reliable and converge faster, especially in environments containing concave obstacles, where it yielded good search results. The results showed that the application of this algorithm in nuclear accident emergency rescue has broad prospects. Wang et al. [9] improved the particle swarm algorithm and established a mathematical dose computation model for the path planning problem. The experimental results showed that the algorithm is able to reduce the effective dose value while ensuring that the overall convergence speed is not affected.
However, although these algorithms have achieved some results in path planning in nuclear environments, they still have limitations. For example, they often find only a local optimal solution, and it is difficult for them to find the global optimal solution for complex nonlinear problems; at the same time, they struggle to handle multiple constraints and rarely take all of the constraints into account. In contrast, path planning algorithms based on deep reinforcement learning (DRL) [10][11][12] have significant advantages. First, deep reinforcement learning can model and solve complex nonlinear problems and can find better path planning schemes. Second, through machine learning and autonomous decision making, deep reinforcement learning can perform path planning with multiple constraints considered, thereby leading to better trade-offs and optimization. To the best of our knowledge, there is currently no research on solving multiobjective path planning problems in radioactive environments using deep reinforcement learning. This article proposes improvements upon traditional deep reinforcement learning algorithms and applies them to path planning in nuclear environments.
At present, the most widely used deep reinforcement learning approach for path planning is the Deep Q Network (DQN) proposed by the DeepMind team [13]. This algorithm not only solves the dimensionality catastrophe of the traditional Q-learning algorithm, but it can also use the environmental information acquired by the robot as training data so that the network converges. However, the traditional DQN algorithm has the following drawbacks: (1) overestimation [14], i.e., the estimated value output by the network model is larger than the true value, and the degree of overestimation varies across points of the value function, which can cause the robot to choose a suboptimal strategy and ignore the optimal strategy in many cases, thus affecting the effectiveness of path planning; and (2) low sample utilization [15], which occurs because the DQN algorithm draws samples randomly and uniformly from the experience base for training, causing it to repeatedly select some samples while ignoring others, which reduces the convergence speed of the network and degrades its training. To solve the above problems, many scholars have proposed various improved algorithms.
To address the overestimation problem, Hado et al. [16] proposed the Double Deep Q Network (DDQN) algorithm, which solves the overestimation caused by the greedy strategy by utilizing a dual network to estimate the value of the action twice. Liu et al. [17] proposed the EN-DQN algorithm with exploratory noise, which selects the action based on the change of the network output value caused by the added noise and optimizes the network training effect by memorizing, through recurrent neural networks, the multistep changes affecting the agents' decision making. Wang et al. [18] introduced a competitive network to balance the effects of actions on Q values in the current and target networks, respectively, as well as to remove reward bias. Kim et al. [19] proposed a dynamic real-time fusion DDQN method, which trains the network parameters through a priori knowledge and adjusts the weights so that overestimation is reduced, ensuring that the selected policy is optimal. Dong et al. [20] proposed a dynamic target DDQN (DTDDQN) algorithm, which combines the DDQN with the averaged DQN algorithm to improve the network parameters and solve the overestimation problem.
To address the problem of low sample utilization, Schaul et al. [21] proposed a prioritized experience replay mechanism to improve the utilization of high-quality samples. Zhang et al. [22] proposed a DQN method for sample storage, which utilizes a replay pool to store network data to solve the problem of low sampling efficiency, and they verified the effectiveness of the algorithm through analyses in both simulated and real environments. Zhao et al. [23] proposed a secondary active sampling method, which selects samples based on both sequence cumulative returns and the TD error, and it uses the samples obtained from the two samplings for training to obtain the optimal policy. The work of [24] proposed the RSV-Dueling DQN method for state value reuse, which standardizes the reward values and then combines them with the Q values obtained by a dueling DQN to improve the sample utilization rate. Li et al. [25] proposed the IDQNPER algorithm, in which the stored samples are given weights and sent to the network for training in a prioritized order, while sequences with high similarity are removed to improve the network training effect. The aforementioned improved DQN algorithms have to some extent improved the performance of the DQN algorithm, but they still suffer from slow network convergence and limited training effectiveness.
This paper studies the global path planning problem in a static radioactive environment with known information about the radioactive sources and obstacles; that is, this information is detected by the sensors (e.g., radiation detectors and LiDARs) mounted on the mobile robot before path planning. We propose the Improved Dueling Deep Double Q Network (ID3QN) algorithm, based on an asymmetric neural network structure, to solve the global path planning problem of a mobile robot in a radioactive environment under the dual constraints of path length and cumulative radiation dose. The proposed algorithm not only accelerates convergence, but it also improves the quality of the path planning solution. In general, the main contributions of this paper are as follows: (1) Several improvements of the traditional DQN algorithm have been developed for the path planning of a mobile robot in a radioactive environment. In the proposed ID3QN algorithm, the neural network has been optimized into a decomposed network structure consisting of a 'value network' for estimating the state value and an 'advantage network' for estimating the advantage of each action relative to the average action; this decomposed structure can estimate the value of each action more accurately. Furthermore, we propose an improved action selection mechanism to balance exploration and exploitation and enhance the convergence speed of the algorithm. In addition, a prioritized experience replay mechanism has been introduced to maintain sample diversity and improve the utilization of important samples. Finally, the reward function has been redesigned by considering the path length and cumulative radiation dose to provide effective feedback for every action taken by the mobile robot. All these enhancement mechanisms improve the efficiency and flexibility of the proposed ID3QN algorithm and thus the solution quality of the dual-objective path planning problem in a complex radiation environment.
(2) The deterministic algorithm A*, the intelligent algorithms ACO and GA, the traditional DQN algorithm, the D3QN algorithm, and our proposed ID3QN algorithm were used in comparison experiments in two radioactive environments of different complexities, and the effectiveness and advantages of the proposed algorithm were demonstrated.
The remaining sections of this paper are structured as follows: Section 2 presents the ID3QN algorithm. Section 3 provides an analysis and comparison of the experimental results. In Section 4, we discuss the proposed algorithm and some advanced evolutionary algorithms. Finally, Section 5 presents a summary of our findings and suggestions for future research.

Target Network Optimization
The traditional DQN algorithm uses a single Q network to estimate the value function of actions. This approach may overestimate the values of certain actions, thereby leading to the selection of inappropriate actions and reducing the convergence speed of the algorithm. The target value in the traditional DQN is calculated as shown in Equation (1):

$y_i^{DQN} = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta_t^-) \quad (1)$

In Equation (1), $y_i^{DQN}$ is the target Q value in the DQN algorithm; it represents the expected reward obtained after performing an action. $r_t$ is the immediate reward, $\gamma$ is the discount factor, $s_{t+1}$ represents the next state, $a$ is the performed action, $\theta_t^-$ refers to the parameters of the target network at time step $t$, and $\theta_t$ refers to the parameters of the estimation network. $\max_a Q(s_{t+1}, a; \theta_t^-)$ represents the maximum Q value over the actions in state $s_{t+1}$ as estimated by the target network at time step $t$.
The ID3QN algorithm incorporates the advantages of the Double DQN and introduces a target network to avoid the overestimation issue. Specifically, ID3QN uses the current main network to select the action with the maximum Q value for the next state, and it uses the target network to evaluate that action, thereby reducing the overestimation of values. The target value in the ID3QN algorithm is calculated as shown in Equation (2):

$y_i^{ID3QN} = r_t + \gamma Q\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t); \theta_t^-\right) \quad (2)$

In Equation (2), $y_i^{ID3QN}$ is the target Q value in the ID3QN algorithm; it represents the expected reward obtained after performing an action. $r_t$ is the immediate reward, $\gamma$ is the discount factor, $s_{t+1}$ represents the next state, and $\theta_t^-$ and $\theta_t$ represent the parameters of the target network and the estimation network at time step $t$, respectively. $\arg\max_a Q(s_{t+1}, a; \theta_t)$ is the action with the maximum Q value in state $s_{t+1}$ as estimated by the main network at time step $t$.
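The double estimation described above can be sketched in a few lines. This is a minimal illustration of the target computation: the main network's Q values pick the action, and the target network's Q values price it (function and argument names are illustrative).

```python
import numpy as np

def ddqn_target(reward, q_main_next, q_target_next, gamma=0.9, done=False):
    """Double-DQN target: the main network selects the next action,
    while the target network evaluates it."""
    if done:
        return reward                              # terminal state: no bootstrap term
    a_star = int(np.argmax(q_main_next))           # action chosen by the main network
    return reward + gamma * q_target_next[a_star]  # value taken from the target network
```

Note that a plain DQN would instead use `max(q_target_next)`, letting the same network both select and evaluate, which is the source of the overestimation bias.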

Optimize Network Structure
While the symmetric network structure of the traditional DQN provides model stability, it struggles to explain model decisions and is less efficient at capturing differences between actions. Therefore, building upon this foundation, this paper adopts an asymmetric neural network structure to accurately model different actions and cater to the requirements of path planning in radioactive environments, thereby enhancing the model's performance and efficiency. This network decomposes the Q value function into two components: 'value' and 'advantage'. It separately calculates the value of states and the advantage of actions, and it then combines them to obtain the Q value for each action in the current state. Specifically, the network structure of ID3QN includes two branches: a 'value' network and an 'advantage' network. The 'value' network estimates the value of states, while the 'advantage' network estimates the advantage of each action relative to the average action. The Q value in the competing network of ID3QN is calculated as shown in Equation (3):

$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha) \right) \quad (3)$

Therein, $s$ represents the state, $a$ represents the action, $\theta$ represents the shared network parameters, $\alpha$ and $\beta$ respectively represent the parameters of the advantage network and the value network, $V(s; \theta, \beta)$ represents the value of state $s$, $A(s, a; \theta, \alpha)$ represents the advantage of the current action $a$, $|A|$ represents the size of the action space, and $\frac{1}{|A|}\sum_{a'} A(s, a'; \theta, \alpha)$ represents the average advantage over all actions. The final Q value denotes the estimated reward for selecting action $a$ in state $s$.
The method of decomposing the Q value function can estimate the value of each action more accurately while reducing variance, improving learning efficiency, and stabilizing training. In addition, ID3QN adopts a learnable parameter sharing mechanism, which reduces the number of parameters in the network, speeds up training, and improves computational efficiency. The symmetric network structure of the traditional DQN and the network structure of the proposed algorithm are illustrated in Figure 1.
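The aggregation of the two streams in Equation (3) is simple enough to sketch directly; subtracting the mean advantage makes the decomposition identifiable (the function name here is illustrative).

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the value and advantage streams:
    Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()  # one Q value per action
```

For example, with a state value of 2.0 and advantages [1.0, 3.0], the mean advantage 2.0 cancels, giving Q values [1.0, 3.0].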

Improved Action Selection Mechanism
When Equation (1) is used to approximate the value function as the optimization target in the DQN, the action corresponding to the maximum Q value in the next state is selected every time. Both action selection and evaluation are based on the parameters of the target value network, which can lead to overestimation of Q values, thus resulting in poor exploration ability, low training efficiency, and a negative impact on the convergence speed of the network, as well as on the effectiveness of path planning [14].
To address this issue, the ID3QN algorithm adopts an action selection strategy based on an obstacle memory unit and selection probabilities. Specifically, by modifying the calculation of the Q values, the number of times the robot enters obstacle states and repeated states during training is reduced, freeing capacity to explore new states. At the same time, the convergence speed of the network is improved, and the effectiveness of path planning is optimized. The obstacle memory unit is represented as Obstacle(s_t, a); after each action selection, it is determined whether the resulting state is an obstacle, and Obstacle(s_t, a) is updated accordingly: if it is an obstacle, Obstacle(s_t, a) is set to 1; otherwise, it is set to 0. The calculation of Q(s_t, a) is then modified as follows:

$Q'(s_t, a) = \begin{cases} -\infty, & \text{Obstacle}(s_t, a) = 1 \\ Q(s_t, a) \cdot P(s_t, a), & \text{otherwise} \end{cases} \quad (4)$

where $-\infty$ ('inf') masks actions that lead into obstacles. $P(s_t, a)$, the probability of selecting action $a$ in state $s_t$, is defined in terms of the probability adjustment parameter $\mu$ and the number of times $T$ that action $a$ has been selected in state $s_t$, such that $P(s_t, a)$ decreases as $T$ increases. Since the robot selects the action corresponding to the maximum Q value, a decreased $P(s_t, a)$ pushes it to explore other actions and new states, thereby avoiding getting stuck in local optima.
To balance the exploration and exploitation trade-off in decision making, this method still adopts an ε-greedy strategy for action selection. Under this strategy, the algorithm randomly selects actions with a certain probability to explore unknown states while also retaining the currently known optimal action for exploitation. The new action selection strategy is defined as follows:

$a_t = \begin{cases} \arg\max_{a \in A} Q(s_t, a), & \mu < \varepsilon \\ \text{random action } a \in A, & \text{otherwise} \end{cases} \quad (6)$

That is, if the random value $\mu$ is less than $\varepsilon$, the robot keeps the currently known optimal action; otherwise, it randomly selects an action.
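The whole selection mechanism can be sketched as below. Note the assumptions: the paper only states that P(s_t, a) decreases with the visit count T, so the decay `mu ** T` used here is one plausible form, not the paper's exact formula, and all names are illustrative.

```python
import numpy as np

def select_action(q_values, obstacle, visit_counts, mu=0.9, eps=0.9, rng=None):
    """Obstacle-masked, visit-penalized, epsilon-greedy action choice."""
    rng = rng or np.random.default_rng()
    q = np.array(q_values, dtype=float)
    finite = ~np.array(obstacle, dtype=bool)
    q[~finite] = -np.inf                         # mask actions that hit obstacles
    # discourage often-repeated actions (assumed decay form mu**T)
    q[finite] *= mu ** np.asarray(visit_counts, dtype=float)[finite]
    if rng.random() < eps:                       # exploit with probability eps
        return int(np.argmax(q))
    return int(rng.choice(np.flatnonzero(finite)))  # else explore a non-obstacle action
```

With positive Q values, the decay shrinks the score of frequently tried actions, steering the greedy step toward less-visited alternatives.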

Prioritized Experience Replay Mechanism
In traditional DQN experience replay, previous experiences (including the state, action, reward, and next state) are continuously stored in an experience pool during training. When the experience pool is full, new data overwrite old data. During learning, a batch of experiences is randomly sampled from the experience pool for training the neural network. The traditional sampling method is uniform sampling from the experience pool, which is simple to implement but cannot differentiate the importance of different samples. Additionally, the limited memory of the experience replay pool leads to low sampling efficiency. Therefore, it is necessary to improve the experience replay mechanism of the DQN to increase the frequency of sampling helpful samples and reduce the frequency of sampling unhelpful ones, thereby improving learning efficiency. The prioritized experience replay mechanism is an effective approach for this purpose. Specifically, the traditional experience replay mechanism can be modified into prioritized experience replay as follows: Step 1: Add a prioritized sample memory pool, which is specifically used for storing samples of path planning.
Step 2: During the learning process, whenever a better path is learned, update the prioritized sample memory pool; add the new sample to the prioritized sample memory pool and remove old samples from it.
Step 3: During training of the neural network, randomly sample a batch of data from the normal sample memory pool and combine it with all the data from the prioritized sample memory pool for training the neural network.
Step 4: Adopt prioritized sampling method by assigning higher priority to samples in the prioritized sample memory pool to increase the likelihood of being selected for training the neural network.
This method can improve learning efficiency, better differentiate sample importance, and increase sampling efficiency.
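The four steps above amount to maintaining two pools and always including the prioritized one in each training batch. This is a minimal sketch under that reading of the scheme; class and parameter names are illustrative, not from the paper.

```python
import random
from collections import deque

class TwoPoolReplay:
    """Normal pool plus a small prioritized pool of best-path samples."""
    def __init__(self, capacity=10000, priority_capacity=200):
        self.normal = deque(maxlen=capacity)       # old data overwritten when full
        self.priority = deque(maxlen=priority_capacity)

    def store(self, transition, is_better_path=False):
        # Step 1/2: better-path transitions go to the prioritized pool
        (self.priority if is_better_path else self.normal).append(transition)

    def sample(self, batch_size):
        # Steps 3/4: take every prioritized sample, then top up
        # with a random batch from the normal pool
        k = min(batch_size, len(self.normal))
        return list(self.priority) + random.sample(list(self.normal), k)
```

Because the prioritized pool is included wholesale, its samples are effectively given maximum selection probability, matching Step 4's intent.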

Improvement of the Reward Function
The agent in reinforcement learning must interact with the environment in order to obtain reward values and update the network parameters to optimize its behavior policy. Therefore, the reward function plays a crucial role in reinforcement learning path planning, and its quality directly affects the effectiveness of the algorithm and the convergence of the network. In the traditional DQN algorithm, the reward function is set as follows:

$r = \begin{cases} C_1, & \text{the robot reaches the target state} \\ C_2, & \text{the robot collides with an obstacle} \\ 0, & \text{otherwise} \end{cases} \quad (7)$

In Equation (7), when the robot selects an action and reaches the target state, we provide a positive reward by setting $C_1$ as a positive integer. However, if the robot reaches a state occupied by an obstacle after selecting an action, we consider it a collision and impose a penalty by setting $C_2$ as a negative number. This reward function means that the robot receives a reward value only when it reaches the goal point or fails to plan, and there is no timely and effective feedback in other states. Even with training, the robot still needs many rounds to plan the desired path, and the planned path is often not optimal. In order to plan a comprehensively optimal path with a short length and a small cumulative radiation dose, this paper redesigns the reward function, as shown in Equation (8).
In Equation (8), $C_1$ is chosen as a positive value while $C_2$ is assigned a negative value. $K_1$ is the parameter used to adjust the distance term, while $K_2$ is the parameter used to adjust the irradiation term. The restricted area refers to the region where the radiation dose exceeds the specified limit. Additionally, $J$ represents a constant in the equation. Furthermore, $D_{ig}$ denotes the distance between the current state and the endpoint, $D_{jg}$ denotes the distance between the next state and the endpoint, and $R_{ij}$ denotes the cumulative irradiation dose between the current position and the next position.
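A shaped reward along these lines can be sketched as follows. Caveat: the exact combination of K_1, K_2, and the constant J in Equation (8) is not fully reproduced here, so this sketch uses a simple difference-of-distances progress term minus a dose term and omits J; it is an illustration of the idea, not the paper's exact formula.

```python
import math

def shaped_reward(state, goal, next_state, dose_ij,
                  is_goal=False, is_obstacle=False, is_restricted=False,
                  C1=5.0, C2=-1.0, K1=1.0, K2=1.0):
    """Dense reward: progress toward the goal minus the radiation cost."""
    if is_goal:
        return C1                       # reached the endpoint
    if is_obstacle or is_restricted:
        return C2                       # collision or over-dose region penalty
    d_ig = math.dist(state, goal)       # distance from current cell to goal
    d_jg = math.dist(next_state, goal)  # distance from next cell to goal
    return K1 * (d_ig - d_jg) - K2 * dose_ij
```

Unlike the sparse reward of Equation (7), every step now yields feedback: moving closer to the goal is positive, and each step is charged for the dose accumulated along it.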
We use the Euclidean distance to measure the cost of movement between two points, which is calculated as follows:

$D_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \quad (9)$

where $(x_i, y_i)$ and $(x_j, y_j)$ respectively represent the coordinates of the current state and the next state, and $D_{ij}$ represents the distance between the two points. The cumulative radiation dose between the current position and the next position is defined as follows:

$R_{ij} = \frac{R_i(x_i, y_i) + R_j(x_j, y_j)}{2} \cdot D_{ij} \quad (10)$

where $R_i(x_i, y_i)$ represents the average radiation dose rate of grid $i$, $R_j(x_j, y_j)$ represents the average radiation dose rate of grid $j$, and $D_{ij}$ represents the Euclidean distance between the two positions.
To better evaluate the goodness of a path, we define the following evaluation function:

$F(n) = D(n) + R(n) \quad (11)$

In Equation (11), $F(n)$ represents the heuristic function of reaching the target point via any node $n$ from the starting point, $D(n)$ represents the cumulative distance of reaching the target point via any node $n$ from the starting point, and $R(n)$ represents the cumulative radiation dose of reaching the target point via any node $n$ from the starting point. When the robot plans a collision-free path, we calculate $F(n)$; if $F(n)$ is smaller than the cost of the previously planned path, we consider the new path to be better.
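The per-step costs and the path evaluation can be combined into a short sketch. One assumption to flag: the accumulated dose over a step is taken here as the mean of the two cells' dose rates times the step length, which matches the quantities named in the dose formula but may differ from the paper's exact expression.

```python
import math

def step_distance(p_i, p_j):
    """Euclidean movement cost between two grid cells."""
    return math.dist(p_i, p_j)

def step_dose(r_i, r_j, p_i, p_j):
    """Assumed dose model: mean dose rate of the two cells times the step length."""
    return 0.5 * (r_i + r_j) * step_distance(p_i, p_j)

def evaluate_path(points, dose_rates):
    """F = D + R over a whole path (list of cells and their dose rates)."""
    D = sum(step_distance(a, b) for a, b in zip(points, points[1:]))
    R = sum(step_dose(ra, rb, a, b)
            for (a, b), (ra, rb) in zip(zip(points, points[1:]),
                                        zip(dose_rates, dose_rates[1:])))
    return D + R
```

A newly planned collision-free path replaces the incumbent whenever its `evaluate_path` score is lower.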

Robot Path Planning Based on ID3QN
This paper proposes a path planning algorithm based on ID3QN, the overall process of which is shown in Figure 2 and mainly involves the following steps: Step 1: The sensors collect state information. The robot perceives the status of the surrounding environment, such as the positions of obstacles and its own current position, through the laser radar and other sensors, and it inputs this information into the ID3QN network.
Step 2: Calculate Q values and select actions. This paper proposes a state exploration mechanism based on obstacle memory units and probabilities. By improving the Q value calculation method and the action selection method, the final Q value and the current action a are obtained.
Step 3: Reward evaluation. After the robot executes an action, it obtains a reward R based on its new state and the changes in the environment. The reward R reflects the contribution of the robot's action to the path planning goal, such as avoiding obstacles, reducing the cumulative radiation dose, and reaching the target position as soon as possible.
Evaluate the quality of the selected actions based on the reward R to provide guidance for the next step of path planning.
Step 4: Update the experience pool. Store the current state information, selected actions, rewards, and other data in the experience pool. At the same time, use the improved prioritized experience replay mechanism to select samples from the experience pool for training, and iteratively update the network parameters.
Step 5: Iterate. By continually looping through the above process with the goal of maximizing the cumulative reward value, the optimal action corresponding to the optimal action value function can be obtained. In grid map path planning tasks, each state of the robot is unique and is represented by its two-dimensional grid coordinates. In the experiment, the robot adopts an eight-directional movement scheme [18], comprising left-up, up, right-up, left, right, left-down, down, and right-down, as shown in Figure 3; the corresponding movements are labeled as actions 0, 1, 2, 3, 4, 5, 6, and 7, respectively. The experiment utilizes the improved reward function proposed in this paper. The parameters in Equation (8) are assigned specific values for the purpose of maximizing the expected cumulative reward from the initial state to the target state, thereby directly influencing the outcome of path planning. In this study, when the robot reaches the endpoint, a reward value is assigned with C_1 set to 5 [20]. Conversely, encountering obstacles incurs a penalty, with C_2 set to −1 [20]. To strike a balance between the path length and the cumulative dose impact on path selection, K_1 and K_2 are both set to 1 when the robot is in free grid areas. Additionally, the constant J is set to e. The resulting reward function is expressed in Equation (15).
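The eight-action grid dynamics described above can be sketched as follows. The mapping of action indices 0–7 to the eight directions follows the listed order; the (row, column) convention and the bound clamping are illustrative assumptions.

```python
# Action indices 0-7 in the order listed in the paper:
# left-up, up, right-up, left, right, left-down, down, right-down
MOVES = {0: (-1, -1), 1: (-1, 0), 2: (-1, 1),
         3: (0, -1),              4: (0, 1),
         5: (1, -1),  6: (1, 0),  7: (1, 1)}

def step(pos, action, width=20, height=20):
    """Apply one of the eight moves, clamping the robot to the grid bounds."""
    r, c = pos
    dr, dc = MOVES[action]
    nr = min(max(r + dr, 0), height - 1)
    nc = min(max(c + dc, 0), width - 1)
    return (nr, nc)
```

Diagonal moves cost sqrt(2) under the Euclidean metric of Equation (9), so the eight-connected grid trades slightly longer per-step distances for smoother paths than four-connected movement.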

Experimental Environment and Parameter Configuration
Based on a decommissioned plant of a nuclear facility, an experimental environment was established, as shown in Figure 4. To establish the navigation map of the robot, following reference [2], the grid method was used to model the environment. The 3D space was first divided into obstacle grids and free grids, where the obstacle grids included nuclear facilities and other (non-nuclear) obstacles. In mapping the 3D scene to a 2D map, different colors were used to indicate the irradiation intensity: the black grid cells represent obstacles, the white cells represent free space, red indicates the strongest irradiation, yellow the next strongest, and green the lowest irradiation. With such a modeling approach, the robot can be provided with critical navigation information and assisted decision making to ensure its safe operation in the nuclear energy experimental environment. To demonstrate the adaptability of this algorithm to environments with different levels of complexity, a simple radioactive grid map and a complex radioactive grid map were set up, as illustrated in Figure 4a,b, respectively. In the simple radioactive grid map, four radioactive sources were placed with sparse obstacles. In the complex radioactive grid map, seven radioactive sources of varying intensities were set up, and the obstacles were more densely configured. In this paper, we conducted two sets of path planning experiments in the grid map environments. We compared the ID3QN algorithm with the traditional A* [4], GA [20], improved ACO [2], DQN [13], and D3QN [26] algorithms in both the simple and complex radioactive grid environments. We used algorithmic scores [27], path length, the number of corners, the number of collisions [20], and the accumulated radiation dose [2] as comparison indexes to verify the ID3QN algorithm's superiority, effectiveness, and robustness compared to the other algorithms.
The number of iterations for the DQN, D3QN, and ID3QN algorithms was set to 6000. The robot started from the starting point, and an iteration ended when it reached the end point or the number of movement steps exceeded the maximum. In the path planning experiments of this paper, a robot that successfully reaches the end point without collision is regarded as a successful path planning and receives one point, while any collision or timeout is regarded as a failed path planning and receives zero points [27].
The experimental configuration of this paper is shown in Table 1, and the experimentally relevant parameters for all the algorithms are shown in Table 2. In the experiment, several key parameters influence the training effectiveness of algorithms such as the ID3QN, D3QN, and DQN. These parameters include the neural network architecture, learning rate, sample size for random replay, experience pool capacity, exploration factor, discount factor, iteration count for updating the target network parameters, and probability exploration parameter. We carefully set these parameters based on the characteristics of the grid map environment and the insights gained from previous studies [20][21][22][23][24][25][26][27].
To begin with, our neural network adopts a three-layer fully connected structure comprising an input layer, a hidden layer with 256 neurons, and an output layer. The input layer consists of two neurons, which receive the robot's two-dimensional coordinates. The output layer comprises eight neurons representing the Q values of the eight actions in the action space, so it generates an eight-dimensional tensor. We used the rectified linear unit (ReLU) activation function [20], the Adam optimizer [27], and the mean squared error (MSE) loss function [27].
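For concreteness, the 2-256-8 fully connected forward pass described above can be sketched in NumPy. This is a minimal illustration with randomly initialized weights, not the authors' trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weight matrices for the 2 -> 256 -> 8 structure described in the text
# (random initialization here, purely for illustration).
W1, b1 = rng.standard_normal((2, 256)) * 0.1, np.zeros(256)
W2, b2 = rng.standard_normal((256, 8)) * 0.1, np.zeros(8)


def q_values(state_xy: np.ndarray) -> np.ndarray:
    """Forward pass: 2D grid coordinates -> Q values of the 8 actions."""
    h = np.maximum(0.0, state_xy @ W1 + b1)  # ReLU hidden layer, 256 units
    return h @ W2 + b2                        # eight-dimensional output tensor
```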
The DQN uses a single neural network to estimate the value function for each action. The ID3QN and D3QN introduce double estimation (a Q value and a target Q value) to reduce estimation bias, and they also include "advantage" and "value" branches that separately estimate the advantage of each action and the value of each state. The two models therefore have different neural network structures and different complexities. The time complexity of a neural network model depends on the number of layers, the number of neurons per layer, and the dimensionality of the input states [28], and it can be estimated as

$$T \sim O\!\left(\sum_{l=1}^{D}\left(n_l^{2} + C_{l-1}\, n_l\right)\right),$$

where $D$ is the number of layers (the depth of the network), $l$ indexes the $l$th layer, $n_l$ is the number of neurons in the $l$th layer, and $C_{l-1}$ is the number of features in the input state of the previous layer. The $n_l^{2}$ term accounts for the forward propagation complexity of the neural network, and the $C_{l-1}\, n_l$ term accounts for the backward propagation complexity of the gradient descent algorithm.
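The two mechanisms named above, dueling aggregation and double estimation, have simple closed forms that can be sketched directly. The function names below are illustrative; only the formulas are standard.

```python
import numpy as np


def dueling_q(value: float, advantages: np.ndarray) -> np.ndarray:
    # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    # Subtracting the mean advantage keeps the V and A branches identifiable.
    return value + (advantages - advantages.mean())


def double_dqn_target(reward: float, gamma: float,
                      q_main_next: np.ndarray,
                      q_target_next: np.ndarray) -> float:
    # Double estimation: the main network *selects* the next action,
    # while the target network *evaluates* it, which curbs the
    # overestimation bias of a single-network max.
    a_star = int(np.argmax(q_main_next))
    return reward + gamma * float(q_target_next[a_star])
```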
Since the DQN uses a single neural network, its overall time complexity is given in Equation (16). The network structures of the D3QN and ID3QN contain two branches, the advantage branch and the value branch, so their forward propagation time complexity is twice that of the DQN. Additionally, since the D3QN requires two neural networks (the main network and the target network) for double estimation, its overall time complexity becomes four times that of the DQN, as given in Equation (17).
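The scaling argument above can be checked with a toy cost model. The helper names below are our own, and constant factors are deliberately dropped; only the relative 4x factor from the text is reproduced.

```python
def layer_cost(n_prev: int, n_l: int) -> int:
    # Cost of one fully connected layer, proportional to n_prev * n_l
    # (constant factors for the forward and backward passes are dropped).
    return n_prev * n_l


def network_cost(layer_sizes) -> int:
    # Sum the per-layer costs over consecutive layer pairs.
    return sum(layer_cost(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


dqn_cost = network_cost([2, 256, 8])  # single-network DQN (2 -> 256 -> 8)
d3qn_cost = 4 * dqn_cost              # two branches x two networks, per the text
```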
Taking into account the properties of the 20 × 20 grid map, we set the following parameter values: learning rate 0.01, sample size for random replay 64, experience pool capacity 3200, discount factor 0.9, and iteration count for updating the target network parameters 10. The exploration factor is crucial for balancing exploration and exploitation during the early stages of training. We initially set it to 0.99 and gradually decreased it to 0.01 using a simulated annealing strategy. Through this annealing process, the robot continually explored unfamiliar environments and gathered more diverse samples for network training, ultimately achieving superior path planning results. Additionally, the probability exploration parameter was set to 0.6 to leverage the Q network's training results for action selection while avoiding getting trapped in local optima.
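The exploration-factor decay from 0.99 to 0.01 can be sketched as a schedule function. The paper describes a simulated-annealing strategy whose exact form is not given; an exponential decay over the 6000 episodes is shown here as a plausible stand-in, and the function name is our own.

```python
import math


def epsilon_schedule(episode: int, total: int = 6000,
                     eps_start: float = 0.99, eps_end: float = 0.01) -> float:
    """Anneal the exploration factor from eps_start down to eps_end.

    Exponential decay chosen so that epsilon reaches eps_end exactly
    at the final episode; the paper's annealing schedule may differ.
    """
    rate = math.log(eps_start / eps_end) / total
    return max(eps_end, eps_start * math.exp(-rate * episode))
```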
In the path planning experiment using the GA, several key parameters affect the results: the population size, maximum number of evolution generations, crossover factor, and mutation factor. We set these parameters based on the characteristics of grid map path planning and on the parameter settings reported by other scholars [7,29,30]. After repeated experimental tuning, we set the population size to 100, the maximum number of evolution generations to 500, the crossover factor to 0.8, and the mutation factor to 2.
In the path planning experiment using the ACO algorithm, several key parameters affect the results: the number of iterations, number of ants, pheromone influence factor, heuristic factor, pheromone evaporation coefficient, and pheromone reinforcement coefficient. We set these parameters based on the characteristics of grid map path planning and on the parameter settings reported by other scholars [2,31,32]. After repeated experimental tuning, we set the number of iterations to 500, the number of ants to 100, the pheromone influence factor to 3, the heuristic factor to 7, the pheromone evaporation coefficient to 0.3, and the pheromone reinforcement coefficient to 1.
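For reference, the GA and ACO settings from the two paragraphs above can be collected into configuration dictionaries. The values are transcribed from the text; the dictionary keys are illustrative names, not identifiers from the authors' code.

```python
# GA settings from the text.
GA_PARAMS = {
    "population_size": 100,
    "max_generations": 500,
    "crossover_factor": 0.8,
    "mutation_factor": 2,
}

# ACO settings from the text.
ACO_PARAMS = {
    "iterations": 500,
    "num_ants": 100,
    "pheromone_influence": 3,       # often denoted alpha
    "heuristic_factor": 7,          # often denoted beta
    "evaporation_coefficient": 0.3,
    "reinforcement_coefficient": 1,
}
```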

Comparison of Scenario 1
The ID3QN algorithm proposed in this paper was simulated and tested in the environment shown in Figure 4a and compared with the A*, GA, ACO, DQN, and D3QN algorithms. The comparison metrics are the algorithm scores [27], path length, number of corners [20], number of collisions [20], accumulated radiation dose [2], and training duration. From these data, the path planning route map was obtained and is presented in Figure 5, and the comparison metrics are summarized in Table 3.
According to the data in Table 3 and Figure 5, compared with the traditional A*, GA, and ACO algorithms, the ID3QN algorithm has the shortest path length (29.7990 m), the lowest cumulative irradiation dose (15.5711 mSv), and 10 inflection points, which is the best overall path planning result. The ID3QN algorithm proposed in this paper thus outperformed the traditional A*, GA, and ACO algorithms in terms of both path length and cumulative irradiation dose. In real environments, reducing the cumulative irradiation dose improves the safety and reliability of robots operating in nuclear environments. Compared with the DQN and D3QN algorithms, the ID3QN algorithm has the shortest path (only 29.7990 m), while the DQN algorithm has the longest (35.3137 m). In terms of cumulative irradiation dose, the ID3QN algorithm has the smallest (15.5711 mSv) and the DQN algorithm the largest (20.3404 mSv). In terms of inflection points, the ID3QN has the most and the DQN the fewest, because the main objectives of ID3QN's planning are the shortest path and the lowest cumulative irradiation dose, which increases the number of inflection points. In terms of collisions, the ID3QN algorithm has the fewest (25,908) and the DQN algorithm the most (79,789). In terms of algorithm scores, the ID3QN algorithm scores the highest (5025) and the DQN algorithm the lowest (7). Finally, in terms of training time, the ID3QN algorithm took the longest to train, followed by the D3QN algorithm, while the DQN algorithm took the shortest.
Based on these results, it is evident that the ID3QN algorithm surpasses the original DQN algorithm in path planning. This improvement comes from optimizing the neural network structure, enhancing action exploration, and refining the experience replay mechanism. Compared to the DQN algorithm, the ID3QN algorithm reduced the path length by 15.6%, the cumulative radiation dose by 23.5%, and the collision frequency by 67.5%, resulting in a 717-fold increase in the final score. The objective of path planning in this paper is to provide a comprehensively optimal path with the minimum path length and cumulative radiation dose for a robot operating in a static radiative environment. As shown in Table 3, the proposed ID3QN algorithm significantly improves the convergence speed and the quality of the planned path through the optimized neural network structure, improved action exploration strategy, and enhanced experience replay, but it incurs additional time cost compared to the traditional DQN algorithm. Nevertheless, since this paper aims to provide a multiobjective offline path planning method for robots entering dangerous radiative environments, obtaining shorter and safer paths at the expense of some training time is acceptable.
Figure 6a plots the single-episode reward values of the DQN, D3QN, and ID3QN algorithms. The reward values of all three algorithms stabilized within 6000 training episodes. However, the ID3QN algorithm converged at around 200 episodes, while the DQN and D3QN algorithms converged at around 2500 episodes, indicating that the ID3QN algorithm converges faster. Moreover, the reward value of the ID3QN algorithm after convergence is higher than those of the DQN and D3QN algorithms, which shows that the ID3QN algorithm performs better. Figure 6b shows the collision count during each training iteration for the three algorithms. Within 6000 training episodes, the collision counts of all three algorithms stabilized within a certain range, but the ID3QN algorithm converged at around 400 episodes with a much lower collision count than the DQN and D3QN algorithms. This indicates that the obstacle avoidance performance of the ID3QN algorithm is superior to that of the DQN and D3QN algorithms.

Comparison of Scenario 2
The ID3QN algorithm proposed in this paper was also tested in simulation in the complex grid map environment shown in Figure 4b, which adds several new radioactive sources and a large number of obstacles, to further validate its effectiveness and robustness. The ID3QN algorithm was again compared with the A*, GA, ACO, DQN, and D3QN algorithms, using the algorithm score [27], path length, number of corners, number of collisions [20], accumulated radiation dose [1], and training duration as metrics. From these data, the path planning route map is shown in Figure 7, and the comparison metrics are summarized in Table 4.
According to the data in Table 4 and Figure 7, in the complex grid map environment and compared with the traditional A*, GA, and ACO algorithms, the ID3QN algorithm has the shortest path length (29.7990 m), the lowest cumulative irradiation dose (18.3488 mSv), and 13 inflection points, although its overall training time was the longest. Overall, this is the best path planning result.
In the scenario with more radiation sources and obstacles, the performance scores and training speeds of the ID3QN, D3QN, and DQN algorithms all decreased. However, the ID3QN still performed well compared to the D3QN and DQN algorithms: with the highest score, the shortest path length, and the lowest accumulated radiation dose, it demonstrated superior performance. This indicates that under complex map conditions, the ID3QN retains good adaptability and can still plan a comprehensively optimal path. Figure 8a plots the reward values of the DQN, D3QN, and ID3QN algorithms in the complex radioactive grid environment. Within 6000 training episodes, the ID3QN algorithm still converged at around 800 episodes, and its reward value after convergence is higher than those of the DQN and D3QN algorithms, which shows that the ID3QN algorithm performs better.
Figure 8b illustrates the collision count during each training iteration for the DQN, D3QN, and ID3QN algorithms in the complex radioactive grid environment. Both the ID3QN and D3QN algorithms outperformed the DQN algorithm in obstacle avoidance in the complex environment. The ID3QN algorithm stabilized after 600 iterations, and its collision count remained lower than those of the DQN and D3QN algorithms, indicating superior obstacle avoidance performance.
In summary, the ID3QN algorithm proposed in this paper performs better in path planning, enabling the robot to obtain higher scores and plan better paths in fewer training episodes.

Discussion
The simulation results demonstrate that in environments of different complexities, the ID3QN algorithm performs significantly better than the A*, GA, ACO, DQN, and D3QN algorithms, obtaining a comprehensively optimal path with a shorter path length and a lower accumulated radiation dose. First, the ID3QN algorithm optimizes the neural network structure and uses a dual network for double estimation of action values. This method decomposes the Q value function, allowing more accurate estimation of the value of each action while reducing variance, and it improves learning efficiency and training stability. It effectively avoids overestimating Q values, which would otherwise bias the agent toward selecting overestimated actions and neglecting actions with genuinely higher values. This enables better decision making in path planning, reduces the number of training iterations, and improves training speed. Second, the algorithm adopts a prioritized experience replay mechanism by adding a priority replay sample pool, which makes effective use of important samples, better distinguishes the importance of samples, and improves both learning and sampling efficiency. Finally, considering the constraints of path length and accumulated radiation dose, the reward function was redesigned so that every decision made by the robot receives more informative feedback, thereby accelerating the convergence of the algorithm.
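The prioritized replay mechanism described above can be sketched minimally as follows. This is a deliberate simplification: the class name is our own, and production implementations typically use a sum-tree for O(log N) sampling plus importance-sampling weights to correct the bias that priority sampling introduces.

```python
import random


class PrioritizedReplay:
    """Minimal proportional prioritized-replay sketch."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = []
        self.priorities = []

    def push(self, transition, td_error: float, eps: float = 1e-3):
        if len(self.buffer) >= self.capacity:  # evict the oldest entry
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        # Larger TD error -> higher priority -> sampled more often;
        # eps keeps every stored sample reachable.
        self.priorities.append(abs(td_error) + eps)

    def sample(self, batch_size: int):
        # Draw transitions with probability proportional to priority.
        return random.choices(self.buffer, weights=self.priorities,
                              k=batch_size)
```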
The ID3QN algorithm proposed in this article offers a high-quality solution for path planning problems in radioactive environments. It outperforms traditional algorithms such as the A*, GA, and ACO in path planning effectiveness, and it converges faster in training than the DQN, D3QN, and related algorithms. To improve the quality of the optimal path, the algorithm incorporates strategies such as dual-network double estimation of action values and prioritized sample replay. However, these strategies result in slightly longer training times than the DQN and D3QN, and the algorithm has many parameters to tune. In the future, the algorithm's parameters can be optimized and its decision making assisted by effective evolutionary and decision-making algorithms, ultimately reducing the training time and speeding up the training process.

Conclusions
To address the path planning problem of mobile robots in radioactive environments, this paper proposes an improved approach based on the traditional DQN algorithm: the ID3QN algorithm, which uses an asymmetric neural network structure for path planning. By enhancing the action selection mechanism, introducing a prioritized experience replay mechanism, and improving the reward function, the path planning performance is greatly improved. The ID3QN algorithm was compared with traditional algorithms such as the A*, GA, and ACO, as well as deep reinforcement learning algorithms, to validate its effectiveness and robustness.
The experiments show that in the simple radioactive grid map environment, the proposed ID3QN algorithm outperforms traditional algorithms such as the A*, GA, and ACO in terms of path length and cumulative radiation dose, and the overall planned path is more optimal. Compared to the DQN and some improved DQN algorithms, the path length was reduced by 15.6%, the cumulative radiation dose by 23.5%, and the collision count by 67.5%, while the algorithm score improved 717-fold, enabling the planning of a collision-free optimal path in a shorter training session. In the complex radioactive grid map environment, the ID3QN algorithm also outperformed the A*, GA, ACO, and other algorithms in terms of the planned path, and it was more capable than the DQN and some improved DQN algorithms of stably planning a comprehensively optimal path with a shorter path length and a lower radiation dose.
In future work, we will consider how to improve the efficiency of the algorithm and reduce its training time.We will also aim to enhance the adaptability of the algorithm by optimizing it and extending its application to path planning problems involving dynamic obstacles.

2.1.6. Flowchart of the Proposed ID3QN Algorithm for Path Planning of a Mobile Robot in a Nuclear Environment

The flowchart of the ID3QN algorithm proposed in this paper for path planning of mobile robots in nuclear environments is shown in Algorithm 1.

Algorithm 1. The proposed ID3QN algorithm for path planning of a mobile robot in a nuclear environment.

1: Input: radiative environment information, obstacle information, starting point, endpoint
2: Output: optimal path
3: Initialize replay memory D and prioritized replay memory D' to capacity N
4: Initialize action-value function Q with random weights θ
5: Initialize target action-value function Q̂ with weights θ⁻ = θ
6: Initialize memory units for obstacles
7: for episode = 1 to K do
8:   Initialize state s = s_1
9:   for t = 1 to T do
10:    if RandomValue > ε then
11:    ...
       Execute action a_t in the emulator and observe reward r_t calculated by Equation (15)
       Store transition (s_t, a_t, r_t, s_{t+1}) in D
       Sample a random minibatch of transitions (s_t, a_t, r_t, s_{t+1}) from D and sample all transitions (s_t, a_t, r_t, s_{t+1}) from D'
       ...
20:    Use the MSE loss function and update all the parameters of the Q network through gradient backpropagation in the neural network
21:    Every C steps reset Q̂ = Q
22:    if s_{t+1} == s_end then
23:      Calculate the cost value F(n) by Equation (11)

Figure 4. Grid map of the nuclear environment. (a) Scenario 1: simple radioactive environment with equal-intensity radioactive sources and sparse obstacles. (b) Scenario 2: complex radioactive environment with radioactive sources of different intensities and dense obstacles.

Figure 7. Path planning routes for the six algorithms under Scenario 2.

Table 1 .
Experimental hardware and software configuration.

Table 3 .
Comparison of six algorithms under Scenario 1.

Table 4 .
Comparison of six algorithms under Scenario 2.