Double Q-Learning for Radiation Source Detection

Anomalous radiation source detection in urban environments is challenging due to the complex nature of background radiation. When a suspicious area is determined, a radiation survey is usually carried out to search for anomalous radiation sources. To locate the source with high accuracy and in a short time, different survey approaches have been studied such as scanning the area with fixed survey paths and data-driven approaches that update the survey path on the fly with newly acquired measurements. In this work, we propose reinforcement learning as a data-driven approach to conduct radiation detection tasks with no human intervention. A simulated radiation environment is constructed, and a convolutional neural network-based double Q-learning algorithm is built and tested for radiation source detection tasks. Simulation results show that the double Q-learning algorithm can reliably navigate the detector and reduce the searching time by at least 44% compared with traditional uniform search methods and gradient search methods.


Introduction
In the field of homeland security, detecting anomalous radiation sources in urban environments is an important yet challenging task due to the complexity of urban radiation background. When a suspicious area is determined, a radiation survey is usually carried out to search for anomalous radiation sources. To deliver a comprehensive and efficient survey, different survey approaches have been studied such as manually scanning of the area by human operated detectors and automatically scanning the area with robots under the navigation of pre-defined survey paths [1,2]. However, neither of the manual scanning method and the survey path method can achieve flexibility and efficiency at the same time. For instance, manual scanning carried out by humans is flexible to adjust survey strategies using new acquired measurements, but it requires lots of human efforts and may be dangerous for human surveyors. The pre-defined survey path method does not require human efforts during survey, but it cannot use new information gained during the survey and thus is not flexible to adjust survey paths. To address those issues, this study proposes the reinforcement learning algorithm as a data-driven approach to navigate automated robots equipped with radiation detectors for radiation source detection tasks.
Radiation exists everywhere in daily life. It arises from naturally occurring radioactive materials (NORMs) that are present in air, soil, and building materials [3,4]. Those NORMs contribute to the background radiation of the environment and are treated as a uniform radiation source. Besides those NORMs, there may also exist anomalous radiation sources in the environment such as wrongly disposed radioactive medical wastes, illicit radioactive materials, or even nuclear weapons [5,6]. They usually have high concentration with small footprint and are treated as point sources in the environment. The anomalous radiation source detection task is to identify and localize the anomalous radiation sources in an area. There are two major approaches that are commonly used to estimate the location and intensity of anomalous radiation sources from radiation measurements: the maximum likelihood estimation-based methods [7][8][9][10] and the Bayesian estimation-based methods [11][12][13]. In real applications, the radiation measurements are usually short in counts due to the limited detection time and the shielding of the radiation sources. The spectral comparison ratio method [14] and Gaussian process [15] have been studied to process the count-starved radiation measurements and suppress noise.
Besides those source parameter estimation studies, people also investigated into the optimized operations of detectors such as the placement of detectors and the optimized detection time. Klimenko et al., studied the problem of optimizing detection time for a predefined survey path using methods from the sequential testing theory [16]. Cortez et al., designed radiation surveys based on variances in acquired measurements and uncertainties in the radiation field [17]. Hutchinson et al., sequentially determined the detector's placement positions using the concept of maximum entropy sampling [18]. Lazna et al., proposed a circular path planning strategy to exploit the directional characteristics of detectors [19]. Ristic et al., designed the survey path incrementally by choosing detector positions and detection times maximizing the information gain in the Renyi divergence sense [20]. Besides those conventional heuristic-or information theory-based methods, neural network-based reinforcement learning provides another emerging tool to solve the radiation source detection task.
Reinforcement learning (RL) is one of the machine learning algorithms that studies the problem of how agents ought to take actions in a given environment such that a certain goal can be achieved or rewards can be maximized [21]. With the recent development of computation hardwares and deep neural networks (DNN), the combination of RL and DNN shows great success in handling complex tasks such as playing video games, driving vehicles, playing Go, and controlling robots, to name a few [22]. By properly defining the learning problem, agents are able to learn the optimal actions under complex environments without prior human knowledge. Q-learning is one of the RL algorithms that learns the optimal action policy for agents without requiring a model of the environment [23]. Mnih et al., combined DNN with Q-learning which greatly improved the representability of Q-learning approach [24]; in their paper, the deep Q-learning (DQN) algorithm was trained to play Atari games and achieved human-level performance. Furthermore, the double Q-learning algorithm was developed to solve DQN's issue of overestimating action values and achieved a more stable and reliable learning [25]. The above implement of Q-learning approaches show the potential to apply Q-learning to automated anomalous source detection tasks.
The major contribution of this paper is formulating the radiation source detection task as a reinforcement learning problem and constructing a double Q-learning algorithm to solve the radiation source detection task. In this study, a simulation environment is setup that an agent carrying a radiation detector (this agent with detector is called "the detector" in the remaining part of this paper for brevity) searches a predefined area. This area has background radiation and an anomalous radiation source. The goal of the detector is to find the source using as short time as possible. A convolutional neural network-based double Q-learning algorithm is built to navigate the detector looking for the radiation source. This paper is organized as follows: section two introduces the radiation source detection task, the simulation environment, and the double Q-learning algorithm for radiation source detection; section three presents training results of the algorithm and analyzes its performance in searching for radiation sources; section four summarizes this paper and discusses possible future steps.

Radiation Source Detection Task
According to the physics of radiation emission, radiation counts measured by radiation detectors during a unit time interval can be modeled by Poisson distribution: Here λ is the intensity of the radiation, and k is the radiation counts measured by the detector.
λ is contributed by two sources: the background radiation with intensity b and the anomalous radiation source with intensity I.
The d 2 in Equation (2) is the distance between the detector and the radiation source. It is from the geometric efficiency correction that the detected point source's radiation intensity is proportional to the point source's solid angle viewed from the detector's detection surface. This solid angle is further approximated to be squared inverse proportional to the distance (denoted by d) between the detector and the point radiation source. As shown in Figure 1, there may exist walls in the searching area that block the radiation source's signal from the detector. Thus, there is an indicator function in Equation (2) to differentiate the blocked/not blocked case.
In this work, a radiation detection task and its simulation environment was setup according to the radiation model above. As shown in Figure 1, a searching area was setup in the simulation environment. In this area, there existed an anomalous radiation source and a radiation detector. In real life, radiation detectors usually move in all directions with arbitrary step size; in the simulation environment, the detector's moving directions were limited to up, down, left, or right, and the step size was limited to 1 meter. Theses constrains simplified the movement control while kept a valid approximation to the real life scenario. With these constrains, the detector could only move along the dashed grids as shown in Figure 1. There might also have walls in the simulation environment that could totally block the source's radiation signal. In the radiation detection task, the detector aimed at finding the anomalous radiation source as soon as possible.

Q-Learning with Convolutional Neural Network
In this work, we applied the double Q-learning approach [25] from reinforcement learning to train the detector searching for radiation sources. The radiation source detection task was formulated as a finite discrete Markov decision process (MDP), in which the radiation detector interacted with the environment through a sequence of states (s), actions (a), and rewards (R). The goal of the detector was to move to the grid node closest to the source as soon as possible. This was achieved by training a convolutional neural network (CNN) to approximate the optimal action value function Q * (s, a) which is the maximum expected cumulative future reward starting from state s, action a, discounted by γ, maximized over action policy π, and accumulated for future time steps t (see Appendix A). As shown in Figure 2, the CNN takes the state s t as input and outputs four different values (x 1 , x 2 , x 3 , x 4 ) corresponding to four different actions (a 1 =up, a 2 =down, a 3 =left, a 4 =right). Each value represents the expected cumulative future reward starting from state s t and taking the corresponding action, for example x i = Q(s t , a i ), ∀i ∈ {1, 2, 3, 4}. Since the detector could only move along dashed grids (Figure 1), the state s t was constructed by several matrices, which preserved the detector's searching history. The output layer is a fully connected layer with 4 nodes, and each of the nodes represents its corresponding action's expected cumulative future reward given the input as current state. For the convolution layers or the pooling layer, the 'size' parameter specifies the size of the convolution or pooling kernel; the 'filters' parameter specifies the channel number of the convolution kernel; the 'strides' parameter specifies the stride of the sliding window for each dimension of input; the 'activation' parameter specifies the activation function applied to the output of the convolution results. For the pooling layer, we applied max pooling. For the fully connected layers, the 'size' parameter specifies the number of nodes in the layer, and the 'activation' parameter specifies the activation function applied to the output of the fully connected layer. The 'Relu' activation function is defined as Relu(x) = max(x, 0), and the 'Sigmoid' activation function is defined as Sigmoid(x) = e x 1+e x . These definitions and names are consistent with the names used by the 'tf.layers.conv2d' function in Tensorflow [26].
If the simulation area is m meters wide and n meters long, a Mean Measurement Matrix of size m × n can be created so that the (i, j) element of this matrix represents the mean value of radiation measurements acquired at position (i, j). Similarly, a Number of Measurements Matrix of size m × n can be created so that the (i, j) element of this matrix stores how many measurements have been taken at position (i, j). A Current Map Matrix can also be created to record the position of the detector. Being different from previous two matrices, the Current Map Matrix has size (m + 2) × (n + 2) to incorporate boundaries of the simulation area. In this matrix, all boundaries and walls are denoted by 1, all accessible positions are denoted by 0, and the current position of the detector was denoted by 2. As shown in Figure 3, at each time step t these three matrices can be represented by three images. In our algorithm, the state s t was constructed as a stacking of these three images: the Mean Measurement Matrix and the Number of Measurements Matrix were first padded by zeros to let them have size (m + 2) × (n + 2), and then these three images were stacked together. Representing the searching history using Mean measurement matrix, Number of measurements matrix, and Current map matrix. Plot (a) shows an example of a searching history. In plot (b), color represents the intensity of mean radiation measurements for each position. In plot (c), color represents the number of measurements been taken so far at each position. In plot (d), yellow represents the detector's current position, green represents boundaries and walls, and blue represents accessible positions. Plot (e) represents the signal that the detector observed throughout the search.
In our implementation, the simulated environment had an area of 10 m by 10 m, and we considered a typical hand-hold scintillation detector (a thallium activated cesium iodide detector with scintillator size 2 × 1 × 0.5 inch). The average background radiation level of the UIUC campus measured by this detector was 25 cps (counts per second); thus, b was set to be 25 cps in Equation (2). Plot (e) of Figure 3 shows an example of the signal the detector observed in one search. When training the reinforcement learning algorithm, the radiation source should be strong enough so that the detector can learn how to search for the source. Thus, I in Equation (2) was uniformly sampled from (3000, 7000) cps to simulate different anomalous radiation sources with different intensities. The double Q-learning algorithm was trained by iteratively conducting a large number of episodes, in which the detector interacted with the environment until it reached a certain terminal state, and the environment was reset for the next episode. In an episode, the simulation environment was randomly initialized so that the wall's position, the radiation source's position and intensity, and the detector's position were properly defined. In each time step inside the episode, the radiation detector chose one direction to move and then collected one second radiation measurement. This episode would terminate if the detector moves to the grid node closest to the source, or the total number of time steps is larger than a predefined time limit. In this study, the time limit was chosen to be 100 because it is the time needed to traverse the entire simulation area. The double Q-learning algorithm was trained for 1 million episodes. The reward in each time step was defined as follows: if the detector moves closer to the source.
Since there might exist walls, the Euclidean distance could not be directly used; the shortest-path distance was instead used to determine whether the detector moved closer or further to the source. This reward was set to be asymmetrical (0.5 vs. −1.5) to encourage the detector finding the source as soon as possible. The detailed training configuration is presented in Appendix B, and code is available upon request.

Evaluation
Three simulation experiments were carried out to evaluate the performance of the Q-learning algorithm. The first experiment was the radiation source searching test (Section 3.2), the second experiment was the detector trapping test (Section 3.3), and the third experiment was the radiation source estimation test (Section 3.4).
In the first experiment, the goal of the detector was to find the source as soon as possible. A successful case of finding a source was defined as the detector moving to the grid node closest to the source within 100 steps. For example, when source is at (5.2, 7.8), the search is successful if the detector moves to the grid (5, 8) within 100 steps. The metrics were the average searching time and the failure rate of finding the source. In this experiment, two different searching areas were tested as shown in Figure 4. One searching area does not have walls inside, and the other searching area has one wall inside. For both of the searching areas, the detector was always initialized at the lower left corner of the searching area. In each of the searching areas, we tested ten different source intensities from 50 × 2 0 cps to 50 × 2 9 cps. For each source intensity, we repeated the search 200 times with different random source positions. The performance of the Q-learning algorithm was compared with the gradient search algorithm (Appendix C). In the second experiment, we compared the detector trapping behavior between the Q-learning algorithm and the gradient search algorithm. As shown in Figure 5, there were two different kinds of simulation areas to be tested. In both of the areas, the detector started from the lower left corner, and the source was placed at the upper left corner. In order to find the source, the detector needed to move from the lower part to the upper part of the simulation area. The plot (a) in Figure 5 shows the first kind of simulation area, in which the wall was attached to the left edge of the area and had various lengths (0 m, 2 m, 4 m, 6 m, 8 m). The longer the wall is, the more radiation signal it will block. All these wall configurations were already seen by the Q-learning algorithm in the training stage. The plot (b) in Figure 5 shows the second kind of simulation area. This wall configuration was not seen by the Q-learning algorithm in the training stage. For each of the wall configurations, 20 repeated tests were performed. For each test, we computed the relative trapped time, which was defined as the percentage of time the detector stayed in the lower part of the simulation area (y ∈ [0, 4]) during the whole searching process. The larger the relative trapped time is, the severer the detector is trapped. In the third experiment, the goal of the detector was to estimate the accurate location and intensity of the radiation source. Similar to the first experiment, the detector was guided by a navigation algorithm to look for the radiation source. The navigation algorithms under comparison were the Q learning algorithm, the gradient search algorithm, and the uniform search algorithm [27] with three different searching densities (see Appendix C). After the search, we estimated the radiation source's location and intensity using measurements collected so far. The closer the detector is to the source, the more informative radiation measurements can be acquired by the detector, and consequently a more accurate estimation about the source can be made. There are a variety of algorithms to estimate the radiation source's location and intensity [7][8][9][10][11][12][13]. For simplicity, we used the maximum likelihood estimation (MLE) approach to do the estimation. Suppose {(x 1 , y 1 , m 1 ), (x 2 , y 2 , m 2 ), . . . , (x t , y t , m t )} are the measurements acquired till time t, the MLE estimator of source position (x * s , y * s ) and source intensity µ * s is calculated as follows: In this experiment, the simulation area did not have walls inside, and nine different source positions were tested ( Figure 6). The source intensity was set to be 500 cps. For each source position, 20 repeated tests were performed.

Training Result of the Double Q-Learning Algorithm
The training progress of the double Q-learning algorithm was monitored through a series of checkpoints. Between every 100 training episodes there was a checkpoint. In each checkpoint, 30 randomly initialized episodes were performed with CNN parameters fixed, and the average, minimum, and maximum reward in these episodes were recorded (Figure 7).

Figure 7.
Training curves of the episode reward. One episode is one round of the searching task, starting from initializing the simulation environment and ending with the detector triggering the termination condition (either finding the source or reaching the maximum time step inside an episode). Between every 100 training episodes, 30 randomly initialized episodes were evaluated with CNN parameters fixed. Plot (a) shows the average testing reward of the 30 episodes. Plot (b) shows the minimum testing reward of the 30 episodes. Plot (c) shows the maximum testing reward of the 30 episodes. In these three plots, dark blue lines represent the averaged testing results with window size of 100 episodes, while light blue shadows represent raw testing results.
The average reward converged to near zero after 400 k episodes of training. Note that the per-step reward function was asymmetric (Equation (4)), an episode reward with zero mean meant that for every three correct actions (move closer to the source) the detector would take one wrong action (move further away from the source) on average. The minimum reward had majority between −50 and 0 with mean value converging to −25. It delivered a much bigger fluctuation than the mean reward. This fluctuation was caused by rear scenarios that were not learned well by the algorithm. The maximum reward converged to 7 with much smaller fluctuation than the minimum reward.

Radiation Source Searching Test
To further analyze the Q-learning algorithm's performance in radiation source searching tasks, we selected the Q-learning model learned at the 842,500th episode and tested its performance under two different searching areas as shown in Figure 4. This model was selected because it achieved the biggest minimum reward during the training process (the middle plot of Figure 7). Metrics under evaluation were the average searching time and the failure rate of finding a radiation source. The Q-learning model was also compared with the gradient search approach.
As shown in Figure 8, the Q-learning algorithm outperformed the gradient search method in both simulation areas with walls and without walls, in all tested source intensities. Because walls could block radiation and introduce obstacles in detector's movements, adding walls made the searching time longer for both of the two algorithms. For simulation areas without walls, two algorithms' searching times reduced when the source intensity increased, and the Q-learning algorithm spent at least 25% less searching time than the gradient search method. For simulation areas with walls, the two algorithm's searching times did not depend on the source intensity, and the searching time of the Q-learning algorithm was 50% less than the searching time of the gradient search method.  Figure 9 shows that the failure rate of the gradient search algorithm and the Q-learning algorithm significantly dropped when the source intensity increased. Without walls, the failure rate of the two algorithms were the same and converged to below 2% for sources stronger than 800 cps. With walls, the failure rate of the Q-learning algorithm slightly increased and converged to 5% for sources stronger than 800 cps, but the gradient search method's failure rates were all above 30%. The gradient search method relied solely on radiation gradient to search for sources. If the detector and the source were in different sides of the wall, there would be no radiation gradient in the detector side, and the detector would just randomly move until it moved to the other side of the wall. This behavior significantly reduced the performance of the gradient search method, especially when the searching area had obstacles. On the contrary, the Q-learning method was able to design its optimal searching path based on the obstacles in the searching area. If the detector found the source did not exist in this side of the wall, it would take the shortest path to the other side of the wall and search for sources. This explained why the Q-learning algorithm performed equally well in areas with walls and without walls. Figure 9. Failure rate of the Q-learning algorithm (QL) and the gradient search algorithm (GS) in the radiation source searching test, for simulation areas with walls and without walls. Without walls, the two algorithms had the same failure rate; with walls, the Q-learning algorithm achieved much smaller failure rate than the gradient search method.

Detector Trapping Test
The previous experiment reveals the Q-learning algorithm's ability to avoid being trapped by the walls. In this experiment, we further analyzed the detector trapping issue of the Q-learning algorithm and the gradient search algorithm on two different simulation areas shown in Figure 5. Figure 10 shows the average relative trapped times (curves) and searching path examples (drawings) for the Q-learning algorithm and the gradient search algorithm in simulation areas illustrated by plot (a) of Figure 5. The Q-learning model under evaluation was taken from the 842,500th episode. As the wall length increased from 0 m to 8 m, the relative trapped time of the gradient search method increased from 0.5 to 0.9. From the searching path examples we can find that the gradient search method was easily trapped by long walls and wasted most of the searching time in the wrong side of the wall. When there was no wall, the Q-learning algorithm's relative trapped time was 0.27. When there were walls, its relative trapped times increased to 0.46 and were independent of the wall length. This demonstrates that the Q-learning algorithm was not trapped by walls. Example searching paths from the Q-learning algorithm show efficient searching strategies.
In the previous detector trapping test, one possible reason for the Q-learning algorithm not being trapped by walls is that the tested simulation areas are similar to the ones that have already been seen by the Q-learning algorithm during the training stage. In order to evaluate the detector trapping performance in new simulation areas, we tested the Q-learning algorithm and the gradient search method in the simulation area illustrated by plot (b) of Figure 5. This simulation area has different geometry compared to the ones being used to train the Q-learning algorithm. The overlapped two walls blocked all the source's radiation from the initialization point of the detector. In this test, the relative trapped time was defined as the percentage of the time the detector stayed in the lower 60% of the searching area. The gradient search algorithm's mean relative trapped time was 0.999, and plot (a) of Figure 11 shows a failure search path. For this geometry, the detector in the lower part of the simulation area could not receive any signal from the radiation source in the upper part of the simulation area, and the gradient method would just randomly pick one direction to move since it could not detect any meaningful radiation gradient. The gradient search algorithm was heavily trapped by this geometry, and almost all of its searches failed. The Q-learning model taken from the 842,500th episode had a mean relative trapped time of 1, which means that this Q-learning model was also entirely trapped by the new geometry. Plot (b) of Figure 11 shows a failure search path of this Q-learning model. We can see that this detector was trapped in the corner and did not search for the source at all. It is because the trained Q-learning algorithm was highly fitted to the training simulation areas and did not generalize well for the simulation area with new geometry. To teach the Q-learning algorithm how to search in this new geometry, we added this new geometry into the training stage and trained 6000 additional episodes starting from the 842,500-episode model. After this additional training, the new Q-learning algorithm was not trapped by walls any more and achieved a relative trapping time of 0.63. Plot (c) of Figure 11 shows a successful search path from the new Q-learning algorithm. In this search, the algorithm firstly realized that there was no source in the lower part of the area; then, it took the shortest path to the upper part of the area and finally found the source there.
This experiment demonstrates that the Q-learning algorithm is able to efficiently design its searching path based on searching area geometries and recent radiation measurements. For learned search area geometries, the Q-learning algorithm will not be trapped and can find the source efficiently; for new search area geometries, the Q-learning algorithm will fail. This issue can be solved by adding additional training episodes for the new search area geometries. In our experiment, the 6000 additional training episodes took less than 30 min on a Tesla K80 GPU. In real applications, the searching area geometry is usually available in advance (such as through satellite pictures from Google Maps), and the Q-learning model can be easily updated according to the target searching area geometry in short time. Figure 10. Detector trapping test 1. In this trapping test, simulation areas with different wall lengths were tested. The relative trapped time was defined as the percentage of the time the detector stayed in the lower 40% part of the simulation area during the whole searching process. The orange curve represents the mean relative trapped time of the gradient search method, and the blue curve represents the mean relative trapped time of the Q-learning method. The example searching paths for both algorithms in different wall lengths are also drawn in the figure. For all the searching path examples, the detector initialization point was at the lower left corner, the radiation source was at the upper left corner, and the color of the path represents the mean radiation intensity (blue is low, and yellow is high). The Q-learning algorithm was not trapped by walls, while the gradient search method was trapped by walls. Figure 11. Detector trapping test 2. In this trapping test, a simulation area with two overlapped long walls was tested. The relative trapped time was defined as the percentage of the time the detector stayed in the lower 60% part of the simulation area during the whole searching process. For all the searching path examples, the detector initialization point was at the lower left corner, the radiation source was at the upper left corner, and the color of the path represents the mean radiation intensity (blue is low, and yellow is high). Plot (a) shows an example search path from the gradient search method that was failed to find the source. Plot (b) shows an example search path from the Q-learning model taken from the 842,500th episode, and this search was also failed to find the source. Plot (c) shows an example search path from the Q-learning model that was trained for additional 6000 episodes on the new simulation area. After this additional training, the new Q-learning model was able to efficiently search for sources in the new geometry.

Radiation Source Estimation Test
In this experiment, we tested the Q-learning algorithm, the uniform search algorithm, and the gradient search algorithm's performance of estimating the source's position and intensity. Similar to the previous section, the Q-learning model under evaluation was taken from the 842,500th episode. Table 1 shows the mean estimation error for different searching methods. The simulation area was 10 m by 10 m. All of the five searching methods achieved source localization errors less than 15 cm (relative errors less than 1.5%) and source intensity estimation errors less than 7%. Averaging over all the 9 source positions shown in Figure 6, the mean searching time was 15.82 s for the Q-learning algorithm and 33.63 s for the gradient search algorithm. The Q-learning approach was 50% quicker than the gradient search algorithm and at least 44% quicker than the uniform search algorithms. Compared to the gradient search method, the Q-learning algorithm reduced the localization error by 35%, obtained similar source intensity estimation error, and used less searching time. Compared to the uniform search 1 approach, the Q-learning approach used less detection time and achieved a smaller localization error. The trade-off was a slightly bigger intensity estimation error (6.3% vs. 5%). The uniform search 2 and 3 approaches outperformed the Q-learning method, but they used much longer searching time and collected much more measurements. For source searching tasks with limited searching time, the Q-learning method would be a competitive alternative method compared to the uniform search method and the gradient search method, since it significantly reduced the searching time while maintained a low estimation error.  1,9]) represents the mean searching time for the i'th source shown in Figure 6.

Conclusions and Future Works
In this paper, the radiation source detection task was formulated as a reinforcement learning problem and a CNN-based double Q-learning algorithm was developed to navigate the detector searching for the radiation source. Simulation environments were set up to test the algorithm's performance. In the radiation source searching test, the Q-learning method used at least 25% less searching time and achieved a lower failure rate than the gradient search algorithm. In the detector trapping test, the Q-learning algorithm was less prone to the detector trapping issue than the gradient search method. In the radiation source estimation test, all of the uniform search methods, the gradient search method, and the Q-learning method achieved relative localization error lower than 1.5% and relative intensity estimation error lower than 7%, but the Q-learning approach reduced the mean searching time by at least 44% compared to other methods.
In the future, we are going to implement this Q-learning algorithm on a drone-based radiation detection platform, and field tests will be carried out. In real applications such as nuclear security and nuclear decommissioning, the detection platform's computation power and energy supply are usually limited. For security concerns, the detection platform may need to be isolated from Internet connection. These limitations require the navigation algorithm to run locally on light-weight mobile computation devices such as mobile phones. The Q-learning algorithm's computation burden is mostly from the training stage, which could be done in other powerful computers. Once the training is finished, the algorithm can be deployed efficiently on light-weight mobile devices for navigation.
The current implementation of the double Q-learning algorithm has fixed detection time (1 s for each time step), fixed step length (1 m for each action), and limited moving angles (4 directions).
In the future, we will explore the finer controlling of the detector such as adding detection time as another controllable parameter, accepting different step lengths, and supporting more moving angles. The current Q-learning algorithm was designed and trained based on the one-source searching scenario; consequently, it could only handle one-source searching tasks. However, it has the potential to be applied into multiple-source searching tasks. When more than one source present in the environment, the current algorithm can still move towards one of the sources, but it does not know what to do after finding the first source. In order to search for multiple sources, the Q-learning algorithm needs to remember the found sources and use their positions and intensities to adjust new radiation measurements.

Conflicts of Interest:
The authors declare no conflict of interest.
Finally, we obtain the Bellman equation [28] from the last line of above equation: The intuition of the Bellman equation is that the optimal policy will always choose the action that can maximize the expected cumulative future reward for the next time step. According to this Bellman equation, Watkins and Dayan proposed the following iterative updating algorithm known as Q-learning and proved its convergence [23]: α is a parameter between 0 and 1 controlling the learning rate. The convergence of the above updating rule requires that all possible pairs of (a, s) are repeatedly visited during the training process.

Appendix A.3. Neural Network for Q-Learning
Usually, interesting problems have a large number of states and actions, and thus the updating rule (Equation (A10)) is difficult to implement. Instead, a function approximator Q(s, a; θ) can be learned according to the Bellman equation. In this study, we use CNN as the function approximator, and θ represents the parameters (weights) of the CNN. In the Deep Q Network (DQN) paper [24], the target value is defined as The θ is the parameter of the CNN from previous iterations and hold as constant. The loss function is defined according to the Bellman equation: The second term in Equation (A13) represents the variance of the target value which does not depend on θ and is thus ignored. The gradient of the estimation loss with respect to θ is ∇ θ L(θ) = E s,a,S t+1 [ (Q(s, a; θ) − y)∇ θ Q(s, a; θ) ] (A14) At this point, θ can be learned using gradient descent approach: Experience replay [24] is used to calculate the expectation in the gradient (Equation (A14)) during the training process. In every training step, training data (S t , A t , R t , S t+1 ) are stored in a memory buffer. For each iteration, samples are uniformly selected from the memory buffer, and the CNN is updated through the batch stochastic gradient descent using those samples. This technique insures the independence of the training data between different training iterations and thus avoids oscillations or divergence in θ [24].
Double Q-learning [25] is used in this paper to further improve the performance. In DQN, Equation (A11) can be rewritten as following: y = R t + γQ(S t+1 , argmax a Q(S t+1 , a; θ ); θ ) (A16) Because the action a is greedily selected from Q with parameter θ and then evaluated using Q with the same parameter θ , it may lead to the overestimation of the action-value function. Double Q-learning approach addresses this issue by decoupling the selection from evaluation [25]: y DoubleQ = R t + γQ(S t+1 , argmax a Q(S t+1 , a; θ); θ ) (A17)

Appendix B. Training Details
When training the model, we applied the -greedy [24] policy with reduced from 1 to 0.1 linearly over the first one million steps. During training, the loss calculated according to the mean squared-error (Equation (A12)) might be very big and have large gradients. This large gradients might make the training process unstable. To control the fluctuation of gradients and stabilize the training process, we replaced the original loss in Equation (A12) with the following Huber loss: L(y,ŷ) = (y−ŷ) 2 2 , if|y −ŷ| ≤ 1.

15:
Terminate the episode if source is found 16: end for 17: end for s t , n t , and m t represent the mean measurement matrix, number of measurements matrix, and current map matrix at step t respectively. The total number of training episodes was M = 1 × 10 6 , the step limit for each episode was T = 100, the discount factor was γ = 0.9, and the θ was updated every 10 training steps. medium, and high searching densities represented by plot (a), (b), and (c) in Figure A2. The searching density controls the trade-off between searching time and searching coverage. Figure A2. The uniform search algorithms with different searching densities. Each blue dot represents a position to take one measurement. From (a) to (c), their total search times are 28 s, 37 s, and 54 s respectively.