# Detection of Static and Mobile Targets by an Autonomous Agent with Deep Q-Learning Abilities


## Abstract


## 1. Introduction

## 2. Problem Formulation

## 3. Decision-Making Policy and Deep Q-Learning Solution

#### 3.1. The Agent’s Actions and Decision Making

#### 3.2. Dynamic Programming Scheme with Prediction and Target Neural Networks

#### 3.3. Model-Free and Model-Based Learning

#### 3.4. The Choice of the Actions at the Learning Stage

#### 3.5. Outline of the Q-Max Algorithm

Algorithm 1. Generating the training data set

Input: domain $C=\left\{{c}_{1},{c}_{2},\dots ,{c}_{n}\right\}$,

set $\mathbb{A}=\left\{\uparrow ,\nearrow ,\to ,\searrow ,\downarrow ,\swarrow ,\leftarrow ,\nwarrow ,\odot \right\}$ of possible actions,

probability ${p}_{TA}$ of true alarms (Equation (3)),

rate $\alpha $ of false alarms and their probability ${p}_{FA}=\alpha {p}_{TA}$ (Equation (4)),

sensor sensitivity $\lambda $,

range $\left[{\xi}_{1},{\xi}_{2}\right]$ of possible numbers $0<{\xi}_{1}<{\xi}_{2}\le n-1$ of targets,

length $L\in \left(0,\infty \right)$ of the agent’s trajectory,

number $N\in \left(0,\infty \right)$ of agent trajectories,

initial probability map $P\left(0\right)$ on the domain $C$.

Output: a data set in the form of an $L\times N$ table of pairs $\left(c,P\right)$ of agent positions $c$ and corresponding probability maps $P$.

1. Create the $L\times N$ data table.
2. For each agent trajectory $j=1,\dots ,N$ do:
3. Choose the number $\xi \in \left[{\xi}_{1},{\xi}_{2}\right]$ of targets according to the uniform distribution on the interval $\left[{\xi}_{1},{\xi}_{2}\right]$.
4. Choose the target locations ${c}_{1},{c}_{2},\dots ,{c}_{\xi}\in C$ randomly according to the uniform distribution on the domain $C$.
5. Choose the initial agent position $c\left(0\right)\in C$ randomly according to the uniform distribution on the domain $C$.
6. For $l=0,\dots ,L-1$ do:
7. Save the pair $\langle c\left(l\right),P\left(l\right)\rangle $ as the $l$th entry of the $j$th trajectory in the data table.
8. Choose an action $\mathbb{a}\left(l\right)\in \mathbb{A}$ randomly according to the uniform distribution on the set $\mathbb{A}$.
9. Apply the chosen action and set the next position $c\left(l+1\right)=\mathbb{a}\left(c\left(l\right)\right)$ of the agent.
10. Calculate the next probability map $P\left(l+1\right)$ with Equations (20) and (21).
11. End for.
12. End for.
13. Return the data table.
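The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the map update of Equations (20) and (21) is abstracted as a caller-supplied `update_map` function, and each action in `actions` is represented as a callable that maps a cell index to the next cell index.

```python
import random

def generate_training_data(n, actions, xi_range, L, N, initial_map, update_map):
    """Algorithm 1 sketch: build N random trajectories of length L, storing
    (agent position, probability map) pairs. `update_map` stands in for the
    Bayesian map update of Equations (20)-(21)."""
    table = [[None] * L for _ in range(N)]          # the L x N data table
    for j in range(N):
        xi = random.randint(*xi_range)              # number of targets, uniform on [xi1, xi2]
        targets = random.sample(range(n), xi)       # target cells, uniform on the domain
        c = random.randrange(n)                     # initial agent position c(0)
        P = list(initial_map)                       # initial probability map P(0)
        for l in range(L):
            table[j][l] = (c, list(P))              # step 7: save the pair (c(l), P(l))
            a = random.choice(actions)              # step 8: uniform random action
            c = a(c)                                # step 9: next position c(l+1) = a(c(l))
            P = update_map(P, c, targets)           # step 10: Equations (20)-(21)
    return table
```

Here each trajectory is one row of the table; in the paper's notation, the pair stored at trajectory $j$, position $l$ is $\langle c(l),P(l)\rangle$.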

Algorithm 2. Training the prediction neural network

Network structure:

input layer: $2n$ neurons ($n$ agent positions and $n$ target location probabilities, both relative to the size $n$ of the domain),

hidden layer: $2n$ neurons,

output layer: $9$ neurons (in accordance with the number of possible actions).

Activation function: sigmoid function $f\left(x\right)=1/\left(1+{e}^{-x}\right)$.

Loss function: mean squared error (MSE).

Input: domain $C=\left\{{c}_{1},{c}_{2},\dots ,{c}_{n}\right\}$,

set $\mathbb{A}=\left\{\uparrow ,\nearrow ,\to ,\searrow ,\downarrow ,\swarrow ,\leftarrow ,\nwarrow ,\odot \right\}$ of possible actions,

probability ${p}_{TA}$ of true alarms (Equation (3)),

rate $\alpha $ of false alarms and their probability ${p}_{FA}=\alpha {p}_{TA}$ (Equation (4)),

sensor sensitivity $\lambda $,

discount factor $\gamma $,

objective probability map ${P}^{*}$ (obtained by using the value $\epsilon $),

number $r$ of iterations between updates of the target network weights,

initial value $\eta $ (Equation (22)) and its discount factor $\delta $,

learning rate $\rho $ (with respect to the type of optimizer),

number $M$ of epochs,

initial weights $w$ of the prediction network and initial weights ${w}^{\prime}=w$ of the target network,

training data set (that is, the $L\times N$ table of $\left(c,P\right)$ pairs created by Algorithm 1).

Output: the trained prediction network.

1. Create the prediction network.
2. Create the target network as a copy of the prediction network.
3. For each epoch $j=1,\dots ,M$ do:
4. For each pair $\left(c,P\right)$ from the training data set do:
5. For each action $\mathbb{a}\in \mathbb{A}$ do:
6. Calculate the value $Q\left(c,P,\mathbb{a};w\right)$ with the prediction network.
7. Calculate the probability $p\left(\mathbb{a}|Q;\eta \right)$ (Equation (22)).
8. End for.
9. Choose an action according to the probabilities $p\left(\mathbb{a}|Q;\eta \right)$.
10. Apply the chosen action and set the next position ${c}^{\prime}=\mathbb{a}\left(c\right)$ of the agent.
11. Calculate the next probability map ${P}^{\prime}$ with Equations (20) and (21).
12. If $P={P}^{*}$ or ${c}^{\prime}\notin C$, then
13. Set the immediate reward $R\left(\mathbb{a}\right)=0$.
14. Else
15. Calculate the immediate reward $R\left(\mathbb{a}\right)$ with respect to $P$ and ${P}^{\prime}$ (Equation (14)).
16. End if.
17. For each action $\mathbb{a}\in \mathbb{A}$ do:
18. If $P={P}^{*}$, then
19. Set $Q\left({c}^{\prime},{P}^{\prime},\mathbb{a};{w}^{\prime}\right)=0$.
20. Else
21. Calculate the value $Q\left({c}^{\prime},{P}^{\prime},\mathbb{a};{w}^{\prime}\right)$ with the target network.
22. End if.
23. End for.
24. Calculate the target value ${Q}^{+}=R\left(\mathbb{a}\right)+\gamma \underset{\mathbb{a}\in \mathbb{A}}{\mathrm{max}}Q\left({c}^{\prime},{P}^{\prime},\mathbb{a};{w}^{\prime}\right)$ (Equation (17)).
25. Calculate the temporal difference learning error as ${\Delta}_{l}\left(Q\right)={Q}^{+}-Q\left(c,P,\mathbb{a};w\right)$ for the chosen action $\mathbb{a}$ (Equation (19)), and set ${\Delta}_{l}\left(Q\right)=0$ for all other actions.
26. Update the weights $w$ of the prediction network by backpropagation with respect to the error ${\Delta}_{l}\left(Q\right)$.
27. Every $r$ iterations, set the weights of the target network to ${w}^{\prime}=w$.
28. End for.
29. End for.
30. Return the trained prediction network.
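The inner loop of Algorithm 2 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the prediction and target networks are abstracted as callables returning one Q-value per action, Equation (22) is assumed to be a Boltzmann (softmax) policy with temperature $\eta$, and `reward_fn`, `step_fn`, and `terminal_fn` are hypothetical placeholders for Equation (14), the action application with Equations (20) and (21), and the $P={P}^{*}$ test.

```python
import numpy as np

def softmax_policy(q_values, eta):
    """Action probabilities p(a | Q; eta); Equation (22) is assumed to be a
    softmax over the Q-values with temperature eta."""
    z = np.exp(np.asarray(q_values, dtype=float) / eta)
    return z / z.sum()

def td_update(q_pred, q_target, c, P, gamma, eta, reward_fn, step_fn, terminal_fn):
    """One pass over a single (c, P) training pair, mirroring the loop body of
    Algorithm 2. q_pred / q_target mimic the prediction and target networks:
    callables returning one Q-value per action."""
    q = np.asarray(q_pred(c, P), dtype=float)           # Q(c, P, a; w) for all actions
    a = np.random.choice(len(q), p=softmax_policy(q, eta))  # sample the action
    c_next, P_next = step_fn(c, P, a)                   # apply action, update the map
    if terminal_fn(P, c_next):                          # P = P* or agent left the domain
        R, q_next = 0.0, np.zeros(len(q))
    else:
        R = reward_fn(P, P_next)                        # immediate reward, Equation (14)
        q_next = np.asarray(q_target(c_next, P_next))   # target-network Q-values
    q_plus = R + gamma * np.max(q_next)                 # target value, Equation (17)
    delta = np.zeros(len(q))                            # TD error: chosen action only
    delta[a] = q_plus - q[a]
    return delta
```

The returned error vector would then be backpropagated through the prediction network; every $r$ such updates, the target network weights are overwritten with the prediction network weights.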

#### 3.6. The SPL Algorithm

## 4. Simulation Results

The simulations were run on a PC with an Intel® Core™ i7-10700 CPU and 16 GB of RAM. In the simulations, the detection was conducted over a gridded square domain of size $n={n}_{x}\times {n}_{y}$ cells, and it was assumed that the agent and each target could occupy only one cell at a time. On this equipment, we measured the run times of the simulations for different data sets; the results demonstrate that the suggested algorithms are implementable on ordinary computers and do not require specialized hardware.

#### 4.1. Network Training in the Q-Max Algorithm

#### 4.2. Detection by the Q-Max and SPL Algorithms

#### 4.3. Comparison between the Q-Max and SPL Algorithms and the Eig, Cov and Cog Algorithms

#### 4.4. Comparison between the SPL Algorithm and an Algorithm Providing the Optimal Solution

#### 4.5. Run Times and Mean Squared Error for Different Sizes of Data Sets

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Conflicts of Interest

## References


**Figure 6.** The change in the temporal difference learning error with respect to the number of training epochs. The solid line is associated with the training stage, and the dashed line is associated with the validation stage.

**Figure 7.** Discounted cumulative reward of detection by the Q-max algorithm (**a**) and cumulative payoff of detection by the SPL algorithm (**b**) compared with the results obtained by the random detection procedure. The solid line in both figures is associated with the suggested algorithms (Q-max and SPL), and the dashed line is associated with the random choice of actions.

**Figure 8.** Cumulative reward of detection by the Q-max algorithm for static targets (**a**) and cumulative payoff of detection by the SPL algorithm for static targets (**b**) compared with the results obtained by the COV algorithm.

**Figure 9.** The number of agent actions in detecting two static targets with the SPL/Q-max algorithms (black bars) and the COV algorithm (gray bars): (**a**) $\lambda =15$ and (**b**) $\lambda =10$.

**Figure 10.** Dependence of the detection probabilities on the number of planned actions for the SPL algorithm (solid line) and the DP algorithm (dotted line); the sensor sensitivity is $\lambda =15$, the false alarm rate is $\alpha =0.25$, and the termination time is $t=120$ min.

**Figure 11.** Dependence of the detection probabilities on the false alarm rate $\alpha $ for sensor sensitivities $\lambda =15$ (dotted line) and $\lambda =10$ (dashed line). The probability $0.95$ for the SPL algorithm and all values of $\alpha $ is depicted by the solid line. The termination time is $120$ min.

**Table 1.** Number of agent actions and the discounted cumulative information gain in detecting two static targets for the false alarm rate $\alpha =0.5$.

| Detection Algorithm | Number of Actions Up to Detection of the First Target | Number of Actions Up to Detection of the Second Target | Discounted Cumulative Information Gain |
|---|---|---|---|
| Random | $25$ | $45$ | $13.4$ |
| EIG | $17$ | $27$ | $17.1$ |
| COV | $17$ | $24$ | $17.5$ |
| COG | $18$ | $29$ | $16.1$ |
| Q-max | $15$ | $21$ | $20.5$ |
| SPL | $14$ | $21$ | $20.1$ |

**Table 2.** Number of agent actions and the discounted cumulative information gain in detecting two moving targets for the false alarm rate $\alpha =0.5$.

| Detection Algorithm | Number of Actions Up to Detection of the First Target | Number of Actions Up to Detection of the Second Target | Discounted Cumulative Information Gain |
|---|---|---|---|
| Random | $72$ | $105$ | $21.8$ |
| EIG | $50$ | $65$ | $27.1$ |
| COV | $49$ | $62$ | $28.7$ |
| COG | $55$ | $67$ | $26.2$ |
| Q-max | $32$ | $45$ | $33.2$ |
| SPL | $31$ | $43$ | $32.1$ |

**Table 3.** The number of agent actions in detecting two static targets for different values of the false alarm rate $\alpha $ and of the sensor sensitivity $\lambda $.

| Sensor Sensitivity | Algorithm | $\alpha =0.25$ | $\alpha =0.5$ | $\alpha =0.75$ |
|---|---|---|---|---|
| $\lambda =15$ | COV | $14$ | $25$ | $63$ |
| $\lambda =15$ | SPL/Q-max (average) | $13$ | $20$ | $45$ |
| $\lambda =5$ | COV | $64$ | $95$ | $242$ |
| $\lambda =5$ | SPL/Q-max (average) | $44$ | $54$ | $63$ |

**Table 4.** Number of planned agent actions in detecting two static targets by the SPL algorithm and the dynamic programming (DP) algorithm for different values of the false alarm rate $\alpha $ and of the sensor sensitivity $\lambda $.

| Sensor Sensitivity | Algorithm | Characteristic | $\alpha =0$ | $\alpha =0.05$ | $\alpha =0.1$ | $\alpha =0.25$ | $\alpha =0.5$ |
|---|---|---|---|---|---|---|---|
| $\lambda =15$ | DP | Run time | $0.4$ s | $1$ min | $120$ min | $120$ min | $120$ min |
| $\lambda =15$ | DP | Number of planned actions | $3$ | $5$ | $7$ | $7$ | $7$ |
| $\lambda =15$ | DP | Detection probabilities ${p}_{1}$ and ${p}_{2}$ | $1.0$, $1.0$ | $1.0$, $0.99$ | $0.99$, $0.96$ | $0.90$, $0.84$ | $0.84$, $0.68$ |
| $\lambda =15$ | SPL | Run time | $0.4$ s | $1$ min | $120$ min | $120$ min | $120$ min |
| $\lambda =15$ | SPL | Number of planned actions | $3$ | $5$ | $7$ | $13$ | $20$ |
| $\lambda =15$ | SPL | Detection probabilities ${p}_{1}$ and ${p}_{2}$ | $1.0$, $1.0$ | $1.0$, $0.99$ | $0.99$, $0.96$ | $0.99$, $0.95$ | $0.99$, $0.95$ |
| $\lambda =10$ | DP | Run time | $1$ min | $120$ min | $120$ min | $120$ min | $120$ min |
| $\lambda =10$ | DP | Number of planned actions | $5$ | $7$ | $7$ | $7$ | $7$ |
| $\lambda =10$ | DP | Detection probabilities ${p}_{1}$ and ${p}_{2}$ | $1.0$, $1.0$ | $0.96$, $0.95$ | $0.90$, $0.85$ | $0.85$, $0.65$ | $0.71$, $0.43$ |
| $\lambda =10$ | SPL | Run time | $1$ min | $120$ min | $120$ min | $120$ min | $120$ min |
| $\lambda =10$ | SPL | Number of planned actions | $5$ | $7$ | $15$ | $21$ | $32$ |
| $\lambda =10$ | SPL | Detection probabilities ${p}_{1}$ and ${p}_{2}$ | $1.0$, $1.0$ | $0.96$, $0.95$ | $0.97$, $0.95$ | $0.98$, $0.95$ | $0.99$, $0.95$ |

**Table 5.** Run times and mean squared error for different sizes of the data sets.

| Domain Size ${n}_{x}\times {n}_{y}$ | Number of Nonzero Weights in the Neural Network | Size of the Data Set | Run Time for One Epoch [Minutes] | Mean Squared Error |
|---|---|---|---|---|
| $10\times 10$ | $42,009$ | $5000$ | $4$ | $0.13$ |
| $10\times 10$ | $42,009$ | $10,000$ | $8$ | $0.12$ |
| $20\times 20$ | $648,009$ | $5000$ | $7$ | $0.15$ |
| $20\times 20$ | $648,009$ | $10,000$ | $14$ | $0.13$ |
| $40\times 40$ | $10,272,009$ | $5000$ | $10$ | $0.18$ |
| $40\times 40$ | $10,272,009$ | $10,000$ | $20$ | $0.15$ |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Matzliach, B.; Ben-Gal, I.; Kagan, E.
Detection of Static and Mobile Targets by an Autonomous Agent with Deep Q-Learning Abilities. *Entropy* **2022**, *24*, 1168.
https://doi.org/10.3390/e24081168
