# A Simulator and First Reinforcement Learning Results for Underwater Mapping


## Abstract


## 1. Introduction

#### 1.1. Related Work

#### 1.2. Open Issues and Contribution

#### 1.3. Structure of the Paper

## 2. Simulator

**Environment:** The map is represented as a grid and is loaded from a heat-map image, where the temperature encodes the height of the environment. When creating the environment, the length, width and height of the grid are passed as arguments. Some areas can be marked as litter. The 3D map then contains the following information: $0=$ not occupied, $1=$ occupied and $0.5=$ litter (Figure 1). A reduced-complexity 2.5D representation is also created, consisting of two matrices. One of these matrices, denoted by $\widehat{\mathbf{H}}$, stores the seafloor height for every 2D coordinate $i,j$. The other, denoted by $\mathbf{m}$, labels litter. The ground-truth litter map is defined by $m_{i,j}=0$ for plain surface and $m_{i,j}=1$ for litter. After the 3D and 2.5D representations are created, the agent can be placed at any position in the environment.
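As an illustration, the construction of the two representations might look as follows (function and variable names are illustrative sketches, not the simulator's actual API):

```python
import numpy as np

def build_maps(heightmap, depth, litter_mask):
    """Build the 3D occupancy grid (0 = free, 1 = occupied, 0.5 = litter)
    and the 2.5D representation (height matrix H_hat, litter map m).

    heightmap   : 2D integer array of seafloor heights (from a heat-map image)
    depth       : number of voxels along the vertical axis
    litter_mask : 2D boolean array marking litter cells
    """
    h, w = heightmap.shape
    grid = np.zeros((h, w, depth), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            top = int(heightmap[i, j])
            grid[i, j, :top] = 1.0            # occupied up to seafloor height
            if litter_mask[i, j] and top > 0:
                grid[i, j, top - 1] = 0.5     # mark the top voxel as litter
    H_hat = heightmap.astype(np.float32)      # 2.5D height matrix
    m = litter_mask.astype(np.float32)        # ground-truth litter map m
    return grid, H_hat, m
```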

**Map representation:** The real map must be differentiated from the beliefs about the map. This difference is caused by the limited range of the sensor and by sensor errors; see the sensor model below. In particular, the sensor can give different measurement outputs when measuring the same object from the same or a different pose. The recovery of a spatial world model from sensor data is therefore best modeled as an estimation-theory problem.

**Figure 1.** Left side: the ground-truth map, where the gray areas are litter. Right side: the map discovered by the agent. Yellow voxels indicate a high probability of no litter, and blue ones a high probability of litter. Top and center: 2D projection of the currently discovered map, where white represents litter and black no litter.

**Figure 2.** From the left: the first two images are real data from the OpenTopography website. The last two images are 2D images generated with GANs.

**Robot state:** The pose $\mathbf{P}$ of the agent consists of a vector ${\mathbf{P}}_{T}=[{p}_{x},{p}_{y},{p}_{z}]$ that stores the position of the agent and a rotation matrix ${\mathbf{P}}_{R}$ that stores the attitude of the agent relative to the world coordinates.

**Action and transitions:** The translation actions are performed in the agent's coordinates. Before the agent performs an action, a check is made to see whether the action is legal, i.e., whether it would cause the agent to leave the environment or bump against the bottom. If the action is illegal, a collision signal is sent. The actions are defined as follows:

- Translate: forward, backward, left, right, up and down.
- Rotate: clockwise or counterclockwise around each axis.
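A legality check along these lines can be sketched as follows (names and the rounding-to-voxel convention are assumptions, not the simulator's actual API):

```python
import numpy as np

def try_translate(p, R, action_vec, H_hat, shape):
    """Attempt a translation given in the agent's own frame.
    p: world position, R: attitude matrix, action_vec: unit step in agent
    coordinates, H_hat: 2.5D height matrix, shape: (length, width, height)
    of the grid. Returns (new_position, collided)."""
    p_new = p + R @ np.asarray(action_vec, dtype=float)
    i, j, z = int(round(p_new[0])), int(round(p_new[1])), int(round(p_new[2]))
    # Illegal if the agent would leave the operating area ...
    if not (0 <= i < shape[0] and 0 <= j < shape[1] and 0 <= z < shape[2]):
        return p, True                     # collision signal, pose unchanged
    # ... or bump against the seafloor.
    if z < H_hat[i, j]:
        return p, True
    return p_new, False
```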

**Sensor model**: The sensor model is based on a multi-beam sonar, where an overall angle $\widehat{b}$ is covered by the angles of the K combined beams. Every beam has an opening angle $\widehat{o}$, represented by L rays; see Figure 3.
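The beam-and-ray geometry can be precomputed as an array of rotation matrices, for example as below (a sketch assuming a planar fan of rays about a single axis; the actual sensor may spread rays in two dimensions):

```python
import numpy as np

def ray_rotations(K, L, b_hat, o_hat):
    """Precompute the 4D array S of per-ray rotation matrices S[k, l] for a
    multi-beam sonar: K beams spread over the overall angle b_hat, each beam's
    opening angle o_hat represented by L rays."""
    S = np.zeros((K, L, 3, 3))
    for k in range(K):
        beam_center = -b_hat / 2 + (k + 0.5) * b_hat / K
        for l in range(L):
            # Angle of ray l inside beam k, measured from the sensor axis.
            ray = beam_center - o_hat / 2 + (l + 0.5) * o_hat / L
            c, s = np.cos(ray), np.sin(ray)
            S[k, l] = np.array([[1, 0, 0],     # rotation about the x axis
                                [0, c, -s],
                                [0, s,  c]])
    return S
```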

**Simulated sensor and map updates:** To read the sensor data, the simulator checks whether the voxels within the agent's sensor range are occupied. A distance variable ${a}_{z}$ is defined along the ray's z coordinate and is increased in a loop with a step of $\frac{1}{u}$ voxels. At each step, the sensor rotation matrix from the 4D array, $\mathbf{S}_{k,l,\cdot,\cdot}$, is multiplied with the vector $[0,0,-{a}_{z}]$ to check whether the voxel at the current position along the ray is occupied; if not, ${a}_{z}$ is increased. The loop continues until a voxel is hit or the maximum sensor range is reached.
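The ray-marching loop described above might look like this (a sketch; the function name and the nearest-voxel rounding are assumptions):

```python
import numpy as np

def march_ray(p, S_kl, grid, max_range, u=3):
    """March along one ray until a voxel is hit or max_range is reached.
    p: agent position, S_kl: this ray's rotation matrix, grid: 3D map
    (nonzero = occupied or litter), u: steps per voxel. Returns the index
    of the first occupied voxel, or None if nothing is hit."""
    a_z = 0.0
    while a_z <= max_range:
        # The rotation matrix is multiplied with the vector [0, 0, -a_z].
        offset = S_kl @ np.array([0.0, 0.0, -a_z])
        i, j, z = np.round(p + offset).astype(int)
        if (0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]
                and 0 <= z < grid.shape[2]) and grid[i, j, z] != 0:
            return (i, j, z)
        a_z += 1.0 / u                       # advance 1/u voxels per step
    return None
```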

**Speed of the simulator:** The main motivation for building the simulator was speed. Figure 5 shows how the speed depends on the number of rays and on the map size. In this figure, $u$ is 3, and the ray length/range is 11. The tests were performed while choosing random agent actions, on a computer running Ubuntu 20.04 with an Intel Core i7-8565U CPU and 16 GiB of RAM. The lowest speed is around 100 steps per second, which means 10,000 s (about 3 h) are needed to simulate one million samples. This is acceptable for DRL.
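A throughput measurement of this kind can be sketched as follows (the `step_fn` interface is illustrative, not the simulator's actual API):

```python
import time

def measure_steps_per_second(step_fn, n_steps=1000):
    """Rough throughput benchmark: run n_steps simulator steps (e.g. with
    random actions) and report steps per second."""
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return n_steps / (time.perf_counter() - t0)
```

At the reported 100 steps per second, one million samples would take $10^6 / 100 = 10{,}000$ s, matching the figure quoted above.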

## 3. Background on DRL

**Deep reinforcement learning** uses deep neural networks (deep NNs) to estimate the Q-values. Denote by $\Theta$ the parameters of the NN and by $Q(s,a,\Theta)$ the corresponding approximate Q-value function. In the case of the DDDQN [3] and Rainbow [4] algorithms that will be used, two networks are employed. The reason is that taking an action with the highest noisy Q-value in the maximum inside the TD, using the "normal" parameters $\Theta$, would lead to overestimation of the Q-values. To solve this, the algorithms compute the maximum in the TD using another set of parameters $\Theta^{+}$, which define the so-called target network [23]:
$$\delta^{+}=R_{t+1}+\gamma \max_{a^{\prime}}Q\left(s_{t+1},a^{\prime},\Theta^{+}\right)-Q(s_{t},a_{t},\Theta)$$
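As a minimal illustration of the target computation (the helper name and the vector-of-Q-values interface are assumptions; DDDQN and Rainbow add dueling heads, distributional outputs, etc. on top):

```python
def td_target(r, q_next_target, gamma, done):
    """TD target R_{t+1} + gamma * max_a' Q(s_{t+1}, a', Theta+), where
    q_next_target is the vector of next-state Q-values produced by the
    target network (parameters Theta+). Terminal states bootstrap nothing."""
    return r if done else r + gamma * max(q_next_target)
```

The TD error $\delta^{+}$ is then this target minus the online network's $Q(s_t,a_t,\Theta)$.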

## 4. Application of DRL to Underwater Mapping

**State**: The state $s$ of the agent is a tuple composed of the pose $\mathbf{P}$, the belief $\mathbf{B}$, the entropy $\mathbf{H}$ and the height $\widehat{\mathbf{H}}$:
$$s=\left\langle \mathbf{P},\mathbf{B},\mathbf{H},\widehat{\mathbf{H}}\right\rangle$$
The state is normalized between −1 and 1 to help the neural network learn. To do so, the pose $\mathbf{P}$ is split into $\mathbf{P}_{T}$ (position) and $\mathbf{P}_{R}$ (rotation):
$$s^{N}=\left\langle \begin{array}{c}\mathbf{P}_{T}^{N}=\left(\frac{\mathbf{P}_{T}}{\widehat{\mathbf{P}}_{max}}-0.5\right)\cdot 2,\\ \mathbf{P}_{R}^{N}=\left(\frac{\mathbf{P}_{R}}{2\pi}-0.5\right)\cdot 2,\\ \mathbf{B}^{N}=\left(\mathbf{B}-0.5\right)\cdot 2,\\ \mathbf{H}^{N}=\left(-\frac{\mathbf{H}}{\log(0.5)}-0.5\right)\cdot 2,\\ \widehat{\mathbf{H}}^{N}=\left(\frac{\widehat{\mathbf{H}}}{\widehat{\mathbf{H}}_{max}}-0.5\right)\cdot 2\end{array}\right\rangle$$
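To make the scaling concrete, here is a small sketch of the normalization (the function name and the angle-based attitude input are illustrative; the paper stores a rotation matrix):

```python
import numpy as np

def normalize_state(P_T, P_R_angles, B, H, H_hat, P_max, H_hat_max):
    """Map each state component to [-1, 1]. P_R_angles: attitude expressed
    as angles in [0, 2*pi) (an assumption for this sketch). H: per-cell
    binary entropy in nats, whose maximum is -log(0.5)."""
    to_pm1 = lambda x: (x - 0.5) * 2.0
    return (
        to_pm1(P_T / P_max),
        to_pm1(P_R_angles / (2 * np.pi)),
        to_pm1(B),                        # beliefs already lie in [0, 1]
        to_pm1(-H / np.log(0.5)),         # scale entropy by its maximum
        to_pm1(H_hat / H_hat_max),
    )
```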

**Actions:** The actions are those defined in the robot model of Section 2.

**Reward:** The goal of the agent is to find the litter as fast as possible. In general, rewards can be defined based on the belief and the entropy of the map. To help the agent learn, rewards are provided both for finding litter and for exploring the map.

**Entropy-dependent exploration:** In this environment, states are similar at the beginning; e.g., for the first state of the trajectories, the entropy and belief are similar (uniform belief with high entropy). As the trajectories of the agent get longer and it discovers more of the map at various locations, the states become more unique. This poses an exploration problem. For this reason, instead of only exploring at the beginning and decreasing the exploration rate $\epsilon$ linearly as in traditional DRL [24], exploration is also made dependent on the entropy left on the map:
$$\epsilon=\frac{\theta}{H(s)}$$

**Modified PER:** Unlike classic DRL tasks (e.g., Atari games), here the agent receives nonzero rewards in almost every state. At the beginning of a trajectory, as the locations that are easy to discover are found, high Q-values are seen, which then decrease progressively. As a consequence, the TD error $\delta$ at the beginning of a trajectory is also greater, which increases the probability that PER chooses an early sample of a trajectory to train the NN, at the expense of later samples that may actually be more unique and, therefore, more relevant. To avoid this, the TD error is normalized by the Q-value before using it for PER:
$$\delta_{N}=\frac{\left(R+\gamma \max_{a}Q(s^{\prime},a,\Theta^{+})\right)-Q(s,a,\Theta)}{Q(s,a,\Theta)}$$

**Collision-free exploration:** By default, the agent can collide with the seafloor or with the borders of the map (operating area). Any such collision terminates the trajectory with a poor return, which is hypothesized to discourage the agent from revisiting positions where collisions are possible. This means both that the Q-values are estimated poorly for such positions and, in the final application of the learned policy, that those positions are insufficiently visited to map them accurately.
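The entropy-dependent exploration rate and the normalized TD error can be sketched as follows (function names, the clipping bounds on $\epsilon$, and the scalar interface are assumptions for illustration):

```python
import numpy as np

def entropy_epsilon(H_map, theta, eps_min=0.01, eps_max=1.0):
    """Exploration rate epsilon = theta / H(s), proportional to the total
    entropy left on the map; the clipping bounds are an assumption here."""
    return float(np.clip(theta / np.sum(H_map), eps_min, eps_max))

def normalized_td_error(r, gamma, q_next_target_max, q_sa, done=False):
    """PER priority from the TD error normalized by the Q-value:
    delta_N = (target - Q(s, a)) / Q(s, a)."""
    target = r if done else r + gamma * q_next_target_max
    return (target - q_sa) / q_sa
```

Early-trajectory samples with large Q-values thus no longer dominate the replay priorities, since their TD errors are scaled down proportionally.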

## 5. Experiments and Discussion

#### 5.1. DDDQN Results and Discussion

**Normalization of the TD error in PER:** For this experiment only, several simplifications are made. Only discovered voxels are rewarded and a penalty is given when the agent crashes, $R={R}_{H}+{R}_{C}$, while the litter component ${R}_{L}$ is ignored, meaning that only a coverage problem is solved. Two agents are run, one with the normalized TD errors (31) and one without, for the same number of steps (10 million). Collisions are allowed during validation.

**DDDQN versus LM:** Figure 12 compares the LM and DDDQN agents, this time on the full litter-discovery problem. As shown in Table 1, the DRL agent finds around $56\%$ of the litter, while the LM agent finds $98\%$ of the litter by the end of the trajectory. On the other hand, after 50 steps the DDDQN agent finds on average around $4\%$ more litter. Moreover, the large difference in variance is remarkable: in the worst trajectory, after 120 steps (half of the trajectory) the DDDQN agent had found at least 11 litter items, whereas the LM agent had found none.

**Figure 12.** The results of the agents on the 40 validation maps, for the $27\times 27\times 24$ maps. Left: LM agent. Right: DRL agent.

#### 5.2. Rainbow Experiments and Discussion

**LM baseline:** With the new configuration, the LM needs 350 steps to finish its trajectory. Figure 14 shows that the LM finds 52.7 litter items on average. During the first 31 steps, the agent does not find much litter. The reason is the poorer sensor, which must see the same region again to become sure that certain voxels are litter.

**Figure 14.** The results with the LM agent on $32\times 32\times 24$ maps with the $15\times 3$ sensor.

**The impact of entropy in the reward function:** An investigation was conducted to determine whether the entropy component is needed in the reward function. Figure 15a–c show results with no entropy component (corresponding to entropy parameter $\mathfrak{h}\to \infty$ in (25)), with $\mathfrak{h}=100$ (a low influence of entropy) and with $\mathfrak{h}=10$ (a larger influence of entropy). The litter parameter $\mathfrak{l}$ is always 1.

**Figure 15.** Rainbow agent performance for varying entropy influence on the rewards. From the left: (**a**) No entropy reward. (**b**) Entropy reward parameter $\mathfrak{h}=100$. (**c**) $\mathfrak{h}=10$.

**Deep versus shallow network:** To check whether the complexity of the network in Figure 16 is justified, a shallower network is used here, shown in Figure 17. This network structure is appropriate for Atari games, the usual DRL benchmark; it has fewer layers with larger convolutional kernels. The results in Figure 18 and Table 1 show that the shallow network finds on average only 11.88 litter items after 50 steps and 36.78 after 350 steps. The deeper network finds 19.3 litter items after 50 steps and 46.25 after 350 steps.

**Collision-free exploration:** For the final experiment, the following question was asked: could the collision and oscillation-avoidance measures, applied so far during validation, also help during training? Recall that, instead of simply avoiding collisions, the agent must additionally learn about them; therefore, the collision transitions are added to the PER as explained in Section 4.

**Table 1.** Numerical comparison between all the agents, in terms of how fast they find litter early on (after 50 steps) and the total amount of litter found at the end of the trajectory. The first three rows give the smaller-map results, and the remaining rows the larger-map results. The cyan background highlights the best agent for each type of map. In all experiments, the number of false-positive findings was between $0.15$ and $0.4$ litter items on average, i.e., very low.

| Agent | Litter (50 steps) | % (50 steps) | Litter (end) | % (end) |
|---|---|---|---|---|
| LM | 14.5 | 23% | 62.0 | 98% |
| DDDQN $\mathfrak{l}=10$, $\mathfrak{h}=100$ | 16.9 | 27% | 35.5 | 56% |
| Deep Rainbow $\mathfrak{l}=1$, $\mathfrak{h}=25$; no coll. | 30.1 | 48% | 56.4 | 90% |
| LM | 4.5 | 7% | 52.7 | 84% |
| Deep Rainbow $\mathfrak{l}=1$, no entropy ($\mathfrak{h}\to \infty$) | 17.2 | 27% | 39.3 | 62% |
| Deep Rainbow $\mathfrak{l}=1$, $\mathfrak{h}=10$ | 15.6 | 25% | 48.2 | 77% |
| Deep Rainbow $\mathfrak{l}=1$, $\mathfrak{h}=100$ | 19.3 | 31% | 46.3 | 73% |
| Shallow Rainbow $\mathfrak{l}=1$, $\mathfrak{h}=100$ | 11.9 | 19% | 36.8 | 58% |
| Deep Rainbow $\mathfrak{l}=1$, $\mathfrak{h}=25$; no coll. | 24.4 | 39% | 55.4 | 88% |

## 6. Conclusions and Future Work

#### 6.1. Summary and Main Findings

#### 6.2. Limitations and Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Singh, A.; Krause, A.; Guestrin, C.; Kaiser, W.J. Efficient informative sensing using multiple robots. *J. Artif. Intell. Res.* **2009**, *34*, 707–755.
2. Stachniss, C.; Grisetti, G.; Burgard, W. Information Gain-based Exploration Using Rao-Blackwellized Particle Filters. *Robot. Sci. Syst.* **2005**, *2*, 65–72.
3. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 1995–2003.
4. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
5. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. *arXiv* **2015**, arXiv:1511.05952.
6. Ando, T.; Iino, H.; Mori, H.; Torishima, R.; Takahashi, K.; Yamaguchi, S.; Okanohara, D.; Ogata, T. Collision-free Path Planning on Arbitrary Optimization Criteria in the Latent Space through cGANs. *arXiv* **2022**, arXiv:2202.13062.
7. Xue, Y.; Sun, J.Q. Solving the Path Planning Problem in Mobile Robotics with the Multi-Objective Evolutionary Algorithm. *Appl. Sci.* **2018**, *8*, 1425.
8. Dijkstra, E.W. A note on two problems in connexion with graphs. *Numer. Math.* **1959**, *1*, 269–271.
9. Hitz, G.; Galceran, E.; Garneau, M.È.; Pomerleau, F.; Siegwart, R. Adaptive continuous-space informative path planning for online environmental monitoring. *J. Field Robot.* **2017**, *34*, 1427–1449.
10. Popović, M.; Vidal-Calleja, T.; Hitz, G.; Chung, J.J.; Sa, I.; Siegwart, R.; Nieto, J. An informative path planning framework for UAV-based terrain monitoring. *Auton. Robot.* **2020**, *44*, 889–911.
11. Bottarelli, L.; Bicego, M.; Blum, J.; Farinelli, A. Orienteering-based informative path planning for environmental monitoring. *Eng. Appl. Artif. Intell.* **2019**, *77*, 46–58.
12. Zimmermann, K.; Petricek, T.; Salansky, V.; Svoboda, T. Learning for active 3D mapping. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1539–1547.
13. Wei, Y.; Zheng, R. Informative path planning for mobile sensing with reinforcement learning. In Proceedings of the IEEE INFOCOM 2020–IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 864–873.
14. Barratt, S. Active robotic mapping through deep reinforcement learning. *arXiv* **2017**, arXiv:1712.10069.
15. Hung, S.M.; Givigi, S.N. A Q-learning approach to flocking with UAVs in a stochastic environment. *IEEE Trans. Cybern.* **2016**, *47*, 186–197.
16. Li, Q.; Zhang, Q.; Wang, X. Research on Dynamic Simulation of Underwater Vehicle Manipulator Systems. In Proceedings of the OCEANS 2008–MTS/IEEE Kobe Techno-Ocean, Kobe, Japan, 8–11 April 2008; pp. 1–7.
17. Manhães, M.M.M.; Scherer, S.A.; Voss, M.; Douat, L.R.; Rauschenbach, T. UUV Simulator: A Gazebo-based package for underwater intervention and multi-robot simulation. In Proceedings of the OCEANS 2016 MTS/IEEE Monterey, Monterey, CA, USA, 19–23 September 2016.
18. Mo, S.M. Development of a Simulation Platform for ROV Systems. Master's Thesis, NTNU, Trondheim, Norway, 2015.
19. Hausi, A.D. Analysis and Development of Generative Algorithms for Seabed Surfaces. Bachelor's Thesis, Technical University of Cluj-Napoca, Cluj-Napoca, Romania, 2021.
20. 2010 Salton Sea Lidar Collection. Distributed by OpenTopography. 2012. Available online: https://portal.opentopography.org/datasetMetadata?otCollectionID=OT.032012.26911.2 (accessed on 21 February 2022).
21. Elfes, A. Occupancy grids: A stochastic spatial representation for active robot perception. *arXiv* **2013**, arXiv:1304.1098.
22. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
23. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
24. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. *arXiv* **2013**, arXiv:1312.5602.
25. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. *arXiv* **2014**, arXiv:1412.6980.
26. Bellemare, M.G.; Dabney, W.; Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 449–458.
27. Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy networks for exploration. *arXiv* **2017**, arXiv:1706.10295.
28. Ota, K.; Jha, D.K.; Kanezaki, A. Training larger networks for deep reinforcement learning. *arXiv* **2021**, arXiv:2102.07920.
29. Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How does batch normalization help optimization? In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31.
30. Gogianu, F.; Berariu, T.; Rosca, M.C.; Clopath, C.; Busoniu, L.; Pascanu, R. Spectral normalisation for deep reinforcement learning: An optimisation perspective. In Proceedings of the International Conference on Machine Learning, Shenzhen, China, 18–24 July 2021; pp. 3734–3744.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.

**Figure 4.** The lighter color visualizes which voxel is being measured. In the figures, the number of steps per voxel is $u=3$. From the left: (**a**) The belief of the upper voxel is updated. (**b**) Edge case: the ray goes through the upper voxel, but the belief of the lower-right voxel is updated. (**c**) No voxel is updated because the surface was not hit by a ray.

**Figure 5.** A graphical visualization of the simulator speed depending on the map size and on the number of rays. A sensor range of 11 was chosen, with $u=3$ steps per voxel.

**Figure 6.** Representation of the state: the belief, height, entropy, position of the agent and attitude of the agent are each a 2D matrix. The picture illustrates how the kernel of the convolutional network connects the spatial information of the state already in the first layer.

**Figure 7.** Examples of gridworld environments, where the white color represents obstacles and the blue pixel is the agent. Each subfigure is a different environment.

**Figure 8.** Solutions learned for a coverage problem in the environments above. The maps correspond one-to-one to those above. The lighter the gray color, the more entropy remains at that position at the end of the trajectory. Each subfigure is a different environment.

**Figure 9.** The structure of the PER for collision-free exploration. To save memory, every state is stored only once, linked to the next state in the trajectory. When a collision occurs, both the illegal transition $({s}_{2},{a}_{2.0},r=-1,\mathrm{done}=\mathrm{True},{s}_{2})$ and, immediately after it, the legal transition that is actually performed in the environment, $({s}_{2},{a}_{2.1},{r}_{2},\mathrm{done}=\mathrm{False},{s}_{3})$, are stored. The dummy next state ${s}_{2}$ in the illegal transition is irrelevant, because the transition is terminal, and the next-state Q-values are always zero.

**Figure 10.** Distribution of litter for the experiment. More litter (gray voxels) was placed in the deeper layers (valleys) than in the higher layers.

**Figure 19.** (**a**) The results of the collision-free Rainbow agent with the $3\times 15$ sensor on $32\times 32\times 24$ maps. (**b**) The results of a similar agent but with the $5\times 15$ sensor on $27\times 27\times 24$ maps (the version used for DDDQN).

**Figure 20.** Visualization of the best agent's trajectory using a sequence of snapshots (**b**–**i**). The first map (**a**) shows the ground truth (recall that litter is gray, low beliefs are yellow, and large beliefs are blue). The agent moves from the center directly to a valley, where it finds large quantities of litter (**b**). Note that in (**c**), the agent does not take the shortest path to the next valley, likely because it is more efficient to approach the valley from one side instead of driving directly into the middle and then measuring parts of the valley twice.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rosynski, M.; Buşoniu, L.
A Simulator and First Reinforcement Learning Results for Underwater Mapping. *Sensors* **2022**, *22*, 5384.
https://doi.org/10.3390/s22145384

**AMA Style**

Rosynski M, Buşoniu L.
A Simulator and First Reinforcement Learning Results for Underwater Mapping. *Sensors*. 2022; 22(14):5384.
https://doi.org/10.3390/s22145384

**Chicago/Turabian Style**

Rosynski, Matthias, and Lucian Buşoniu.
2022. "A Simulator and First Reinforcement Learning Results for Underwater Mapping" *Sensors* 22, no. 14: 5384.
https://doi.org/10.3390/s22145384