# Using Deep Reinforcement Learning with Automatic Curriculum Learning for Mapless Navigation in Intralogistics


## Abstract


## 1. Introduction

## 2. Related Work

#### 2.1. Model-Free Deep Reinforcement Learning Algorithms

#### 2.2. Deep Reinforcement Learning for Robot Navigation Tasks

#### 2.3. Curriculum Learning for Reinforcement Learning Tasks

## 3. Materials and Methods

#### 3.1. Maximum Entropy Reinforcement Learning–Soft Actor-Critic

#### 3.2. Simulation Environment

The simulation environment is built with the NVIDIA Isaac SDK™ [21]. The target dolly consists of a steel frame that can be loaded with a pallet. The dolly stands on four passive wheels, which make it transportable. Figure 1 illustrates the mobile robot and the dolly used in this paper. The simulated vehicle is a platform robot built specifically for load-carrier docking and actuated by a differential drive. The vehicle and dolly specifications are given in Appendix D. Notably, the dolly is only 21 cm wider than the vehicle, so very accurate steering is required to navigate successfully underneath it; this corresponds to a constricted goal space.
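Since the vehicle is actuated by a differential drive, its motion follows the standard unicycle-model kinematics. The sketch below is a minimal stdlib Python illustration, not the authors' simulator (the paper uses the NVIDIA Isaac SDK); the 0.63 m track width is an assumption taken from the 630 mm vehicle width in Appendix D.

```python
import math

def diff_drive_step(x, y, theta, v_left, v_right, track_width, dt):
    """Advance a differential-drive pose by one time step.

    Unicycle approximation: forward velocity and yaw rate follow
    from the two wheel speeds and the distance between the wheels.
    """
    v = (v_left + v_right) / 2.0              # forward velocity [m/s]
    omega = (v_right - v_left) / track_width  # yaw rate [rad/s]
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta

# Equal wheel speeds -> straight-line motion along the current heading.
pose = (0.0, 0.0, 0.0)
for _ in range(10):
    pose = diff_drive_step(*pose, v_left=1.0, v_right=1.0,
                           track_width=0.63, dt=0.1)
```

Unequal wheel speeds yield the curved trajectories needed to slip underneath the dolly, which is why the 21 cm lateral margin makes the steering so demanding.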

#### 3.3. Reinforcement Learning Problem Setup

#### 3.4. Automatic Curriculum Learning: Extension of NavACL to NavACL-Q

Algorithm 1: GetDynamicTask-Q.

#### 3.5. Pre-Training of the Feature Extractor

## 4. Results

#### 4.1. Training Results

#### 4.1.1. Pre-Trained Convolutional Encoders

#### 4.1.2. Performance of NavACL-Q SAC with Pre-Trained Convolutional Encoders

#### 4.2. Grid-Based Testing Scenarios

#### 4.3. Ablation Studies

#### 4.3.1. Ablation Studies: Effects of Automatic Curriculum Learning

#### 4.3.2. Ablation Studies: Effects of Pre-Trained Convolutional Encoder

#### 4.4. Comparison to a Map-Based Navigation Approach

## 5. Discussion

#### 5.1. Learned Behavior of the Agent

#### 5.2. Effects of Pre-Trained Feature Extractor

#### 5.3. Potential Improvements on NavACL-Q

#### 5.4. Effects of Problem Formulations on the Performance

## 6. Conclusions

## 7. Patents

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| AGV | Automated Guided Vehicle |
| RL | Reinforcement Learning |
| DRL | Deep Reinforcement Learning |
| ACL | Automatic Curriculum Learning |
| SLAM | Simultaneous Localization and Mapping |
| SAC | Soft Actor-Critic |
| NavACL-Q p.t. | NavACL-Q with pre-trained convolutional encoder using Soft Actor-Critic |
| NavACL-Q e.t.e. | NavACL-Q with Soft Actor-Critic, end-to-end learning |
| RND | Soft Actor-Critic with pre-trained convolutional encoder using random starts |
| PER | Prioritized Experience Replay |
| DQN | Deep Q-Network |
| LSTM | Long Short-Term Memory |

## Appendix A. Details for Training Via Soft Actor-Critic

**Figure A1.** Illustration of the encoder part for the stacked camera images. Four residual blocks are used. In the left panel, the architecture of the residual blocks is illustrated. The first two convolutions use $(3\times 3)$ filters; then the identity is concatenated to the output of these two convolutions. Finally, we down-sample the image by half using a convolution with a $(2\times 2)$ filter and a stride of 2, following [74].
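The down-sampling scheme in the caption can be checked with simple shape arithmetic. The sketch below assumes the $(3\times 3)$ convolutions use padding 1 (so they preserve spatial size), which the caption does not state explicitly; under that assumption, four residual blocks reduce the $80\times 80$ camera frames to $5\times 5$.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a 2-D convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def residual_block_out(size):
    # Two 3x3 convolutions with padding 1 preserve the spatial size...
    size = conv_out(size, kernel=3, padding=1)
    size = conv_out(size, kernel=3, padding=1)
    # ...then a 2x2 convolution with stride 2 halves it.
    return conv_out(size, kernel=2, stride=2)

size = 80  # 80x80 input frames
trace = [size]
for _ in range(4):  # four residual blocks
    size = residual_block_out(size)
    trace.append(size)
```

Under these assumptions `trace` reads `[80, 40, 20, 10, 5]`, i.e., each block halves the spatial resolution.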

Algorithm A1: Distributed Soft Actor-Critic—Worker Process.

**Distributed Soft Actor-Critic Hyperparameters**

| Parameter | Value |
|---|---|
| Discount factor $\gamma$ | $0.999$ |
| Target smoothing coefficient $\tau$ | 1 (hard update) |
| Target network update interval $\eta$ | 1000 |
| Initial temperature coefficient ${\alpha}_{0}$ | $0.2$ |
| Learning rates for network optimizer ${\lambda}_{Q}$, ${\lambda}_{\alpha}$, ${\lambda}_{\pi}$ | $2\times {10}^{-4}$ |
| Optimizer | Adam |
| Replay buffer capacity | ${2}^{20}$ (binary tree) |
| (PER) prioritization parameter $c$ | $0.6$ |
| (PER) initial prioritization weight ${b}_{0}$ | $0.4$ |
| (PER) final prioritization weight ${b}_{1}$ | $0.6$ |
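The three PER parameters follow the scheme of Schaul et al. [55]: TD errors are exponentiated with the prioritization parameter before normalization, and importance-sampling weights use an exponent annealed from $b_0$ to $b_1$. A hedged stdlib sketch of those two formulas (mapping $c$ and $b$ onto Schaul et al.'s $\alpha$ and $\beta$ is our assumption):

```python
def per_probabilities(td_errors, c=0.6, eps=1e-6):
    """Sampling probabilities P(i) proportional to (|delta_i| + eps)^c."""
    priorities = [(abs(d) + eps) ** c for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def is_weights(probs, b):
    """Importance-sampling weights w_i = (N * P(i))^(-b), max-normalized."""
    n = len(probs)
    w = [(n * p) ** (-b) for p in probs]
    w_max = max(w)
    return [x / w_max for x in w]

# Transitions with larger TD error are sampled more often but receive
# smaller importance weights; b anneals from 0.4 toward 0.6 over training.
probs = per_probabilities([0.5, 0.1, 2.0])
weights = is_weights(probs, b=0.4)
```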

Algorithm A2: Distributed Soft Actor-Critic—Master Process.

## Appendix B. Details for Training the NavACL-Q Algorithm

**NavACL-Q Hyperparameters**

| Parameter | Value |
|---|---|
| Batch size $m$ | 16 |
| Upper-confidence coefficient for easy task $\beta$ | $1.0$ |
| Upper-confidence coefficient for frontier task $\gamma$ | $0.1$ |
| Additional threshold for easy task $\chi$ | $0.95$ |
| Maximal number of trials to generate a task ${n}_{T}$ | 100 |
| Learning rate for ${f}_{\pi}$ | $4\times {10}^{-4}$ |
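For intuition, the easy/frontier task classification that these hyperparameters control can be sketched as follows. This is a hypothetical reconstruction following the NavACL scheme of Morad et al. [20]: an "easy" task must exceed both the upper-confidence bound $\mu + \beta\sigma$ and the additional threshold $\chi$, while a "frontier" task lies within $\gamma\sigma$ of the mean $\mu$; the exact rule used in NavACL-Q may differ.

```python
import statistics

def classify_task(pred, recent_preds, beta=1.0, gamma=0.1, chi=0.95):
    """Label a candidate task from its predicted success probability.

    mu/sigma summarize the success predictions f_pi on recently
    generated tasks; beta, gamma, chi are the NavACL-Q hyperparameters.
    Hypothetical reconstruction of the classification rule.
    """
    mu = statistics.mean(recent_preds)
    sigma = statistics.stdev(recent_preds)
    if pred >= max(mu + beta * sigma, chi):
        return "easy"
    if mu - gamma * sigma <= pred <= mu + gamma * sigma:
        return "frontier"
    return "random"

recent = [0.2, 0.4, 0.5, 0.6, 0.8]  # illustrative f_pi outputs
labels = [classify_task(p, recent) for p in (0.96, 0.5, 0.1)]
```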

## Appendix C. Arena Randomization

**Table A3.** Summary of the task randomization, including the initial pose of the AGV, the pose of the target dolly, and the obstacles.

| Description | Randomization | Induced Randomization with Respect to Geometric Property |
|---|---|---|
| Initial Robot Yaw-Rotation | Uniformly sampled from the interval $[-{90}^{\circ},{90}^{\circ}]$ | Relative Rotation: [1.5 m, 5 m] |
| Initial Dolly Yaw-Rotation | Uniformly sampled from the interval $[-{15}^{\circ},{15}^{\circ}]$ | |
| Number of Obstacles | 1 to 4 | Agent Clearance/Goal Clearance: [2 m, 8 m] |
| Position of Obstacles | Randomly placed left and right of the dolly, with a distance uniformly sampled from the interval [2 m, 5 m] | |
| Initial Robot Position | $-0.5$ m to $0.5$ m on the y- and x-axes | Agent-Goal Distance: [1.5 m, 5 m] |
| Initial Dolly Position | Uniformly sampled from a circle segment with radius 5 m and central angle ${30}^{\circ}$, where the center of the segment corresponds to the center of the robot, with a minimum distance of 1.5 m to the robot | |
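A stdlib sketch of how one randomized task could be drawn from the ranges in Table A3; the authors' actual arena generator is more involved (e.g., obstacle placement relative to the dolly and per-cell layouts), so this only mirrors the listed intervals.

```python
import math
import random

def sample_task(rng):
    """Sample one randomized task roughly following Table A3.

    Ranges are taken from the table; the exact placement logic of
    the simulator is not public, so this is only an illustration.
    """
    robot_yaw = rng.uniform(-90.0, 90.0)   # degrees
    dolly_yaw = rng.uniform(-15.0, 15.0)   # degrees
    n_obstacles = rng.randint(1, 4)
    # Dolly on a circle segment: radius 5 m, central angle 30 degrees,
    # centred on the robot, at least 1.5 m away.
    angle = math.radians(rng.uniform(-15.0, 15.0))
    radius = rng.uniform(1.5, 5.0)
    dolly_pos = (radius * math.cos(angle), radius * math.sin(angle))
    return robot_yaw, dolly_yaw, n_obstacles, dolly_pos

task = sample_task(random.Random(0))
```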

## Appendix D. Mobile Robot and Target Dolly Specification

**Mobile Robot**

| Parameter | Value |
|---|---|
| Length, Width, Height | 1273 mm × 630 mm × 300 mm |
| Maximum Speed | 1.2 m/s |
| LiDAR Sensors | 2× 128 beams, each FOV 225°, max distance 6 m |
| Frontal RGB Camera | $80\times 80\times 3$ pixels, FOV ${47}^{\circ}$ |

**Dolly**

| Parameter | Value |
|---|---|
| Length, Width | 1230 mm × 820 mm |

## References

1. Yang, S.; Li, J.; Wang, J.; Liu, Z.; Yang, F. Learning urban navigation via value iteration network. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 800–805.
2. Macek, K.; Vasquez, D.; Fraichard, T.; Siegwart, R. Safe vehicle navigation in dynamic urban scenarios. In Proceedings of the 2008 11th International IEEE Conference on Intelligent Transportation Systems, Beijing, China, 12–15 October 2008; pp. 482–489.
3. Huang, H.; Gartner, G. A survey of mobile indoor navigation systems. In Cartography in Central and Eastern Europe; Springer: Berlin/Heidelberg, Germany, 2009; pp. 305–319.
4. Thrun, S. Probabilistic robotics. Commun. ACM **2002**, 45, 52–57.
5. LaValle, S.M. Planning Algorithms; Cambridge University Press: Cambridge, UK, 2006.
6. Kam, M.; Zhu, X.; Kalata, P. Sensor fusion for mobile robot navigation. Proc. IEEE **1997**, 85, 108–119.
7. Kocić, J.; Jovičić, N.; Drndarević, V. Sensors and sensor fusion in autonomous vehicles. In Proceedings of the 2018 26th Telecommunications Forum (TELFOR), Belgrade, Serbia, 20–21 November 2018; pp. 420–425.
8. Zhang, J.; Springenberg, J.T.; Boedecker, J.; Burgard, W. Deep reinforcement learning with successor features for navigation across similar environments. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 2371–2378.
9. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv **2017**, arXiv:1712.01815.
10. Badia, A.P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z.D.; Blundell, C. Agent57: Outperforming the Atari human benchmark. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 507–517.
11. Nguyen, H.; La, H. Review of deep reinforcement learning for robot manipulation. In Proceedings of the 2019 Third IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 25–27 February 2019; pp. 590–595.
12. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation. arXiv **2016**, arXiv:1610.00633.
13. Kahn, G.; Villaflor, A.; Ding, B.; Abbeel, P.; Levine, S. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 5129–5136.
14. Ruan, X.; Ren, D.; Zhu, X.; Huang, J. Mobile robot navigation based on deep reinforcement learning. In Proceedings of the 2019 Chinese Control and Decision Conference (CCDC), Nanchang, China, 3–5 June 2019; pp. 6174–6178.
15. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
16. Camacho, E.F.; Alba, C.B. Model Predictive Control; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
17. NVIDIA Omniverse Platform. Available online: https://developer.nvidia.com/nvidia-omniverse-platform (accessed on 1 October 2021).
18. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv **2013**, arXiv:1312.5602.
19. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533.
20. Morad, S.D.; Mecca, R.; Poudel, R.P.; Liwicki, S.; Cipolla, R. Embodied visual navigation with automatic curriculum learning in real environments. IEEE Robot. Autom. Lett. **2021**, 6, 683–690.
21. NVIDIA Isaac SDK, Release 2021.1. Available online: https://developer.nvidia.com/isaac-sdk (accessed on 1 October 2021).
22. Van Hasselt, H.; Doron, Y.; Strub, F.; Hessel, M.; Sonnerat, N.; Modayil, J. Deep reinforcement learning and the deadly triad. arXiv **2018**, arXiv:1812.02648.
23. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
24. Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy networks for exploration. arXiv **2017**, arXiv:1706.10295.
25. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
26. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv **2017**, arXiv:1707.06347.
27. Abdolmaleki, A.; Springenberg, J.T.; Tassa, Y.; Munos, R.; Heess, N.; Riedmiller, M. Maximum a posteriori policy optimisation. arXiv **2018**, arXiv:1806.06920.
28. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv **2015**, arXiv:1509.02971.
29. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
30. Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. arXiv **2018**, arXiv:1810.12894.
31. Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1–18.
32. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 2778–2787.
33. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
34. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv **2018**, arXiv:1812.05905.
35. Todorov, E.; Erez, T.; Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 5026–5033.
36. Huang, D.; Cai, Z.; Wang, Y.; He, X. A real-time fast incremental SLAM method for indoor navigation. In Proceedings of the 2013 Chinese Automation Congress, Changsha, China, 7–8 November 2013; pp. 171–176.
37. Kim, T.G.; Ko, N.Y.; Noh, S.W. Particle filter SLAM for indoor navigation of a mobile robot using ultrasonic beacons. J. Korea Inst. Electron. Commun. Sci. **2012**, 7, 391–399.
38. Megalingam, R.K.; Teja, C.R.; Sreekanth, S.; Raj, A. ROS based autonomous indoor navigation simulation using SLAM algorithm. Int. J. Pure Appl. Math. **2018**, 118, 199–205.
39. Lin, P.T.; Liao, C.A.; Liang, S.H. Probabilistic indoor positioning and navigation (PIPN) of autonomous ground vehicle (AGV) based on wireless measurements. IEEE Access **2021**, 9, 25200–25207.
40. Tai, L.; Paolo, G.; Liu, M. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 31–36.
41. Koenig, N.; Howard, A. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan, 28 September–2 October 2004; Volume 3, pp. 2149–2154.
42. Marchesini, E.; Farinelli, A. Discrete deep reinforcement learning for mapless navigation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10688–10694.
43. Long, P.; Fan, T.; Liao, X.; Liu, W.; Zhang, H.; Pan, J. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6252–6259.
44. Zhelo, O.; Zhang, J.; Tai, L.; Liu, M.; Burgard, W. Curiosity-driven exploration for mapless navigation with deep reinforcement learning. arXiv **2018**, arXiv:1804.00456.
45. Xie, L.; Wang, S.; Rosa, S.; Markham, A.; Trigoni, N. Learning with training wheels: Speeding up training with a simple controller for deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6276–6283.
46. Chen, X.; Chen, H.; Yang, Y.; Wu, H.; Zhang, W.; Zhao, J.; Xiong, Y. Traffic flow prediction by an ensemble framework with data denoising and deep learning model. Phys. A Stat. Mech. Its Appl. **2021**, 565, 125574.
47. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
49. Kulhánek, J.; Derner, E.; De Bruin, T.; Babuška, R. Vision-based navigation using deep reinforcement learning. In Proceedings of the 2019 European Conference on Mobile Robots (ECMR), Prague, Czech Republic, 4–6 September 2019; pp. 1–8.
50. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
51. Chen, G.; Pan, L.; Xu, P.; Wang, Z.; Wu, P.; Ji, J.; Chen, X. Robot navigation with map-based deep reinforcement learning. In Proceedings of the 2020 IEEE International Conference on Networking, Sensing and Control (ICNSC), Nanjing, China, 30 October–2 November 2020; pp. 1–6.
52. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1995–2003.
53. Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. arXiv **2020**, arXiv:2003.04960.
54. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48.
55. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv **2015**, arXiv:1511.05952.
56. Ren, Z.; Dong, D.; Li, H.; Chen, C. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. **2018**, 29, 2216–2226.
57. Kim, T.H.; Choi, J. ScreenerNet: Learning self-paced curriculum for deep neural networks. arXiv **2018**, arXiv:1801.00904.
58. Narvekar, S.; Sinapov, J.; Leonetti, M.; Stone, P. Source task creation for curriculum learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, Singapore, 9–13 May 2016; pp. 566–574.
59. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight experience replay. arXiv **2017**, arXiv:1707.01495.
60. Fang, M.; Zhou, T.; Du, Y.; Han, L.; Zhang, Z. Curriculum-guided hindsight experience replay. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
61. Florensa, C.; Held, D.; Wulfmeier, M.; Zhang, M.; Abbeel, P. Reverse curriculum generation for reinforcement learning. In Proceedings of the Conference on Robot Learning, PMLR, Mountain View, CA, USA, 13–15 November 2017; pp. 482–495.
62. Ivanovic, B.; Harrison, J.; Sharma, A.; Chen, M.; Pavone, M. BaRC: Backward reachability curriculum for robotic reinforcement learning. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 15–21.
63. Hasselt, H. Double Q-learning. Adv. Neural Inf. Process. Syst. **2010**, 23, 2613–2621.
64. Achiam, J. Spinning Up in Deep Reinforcement Learning, 2018. Available online: https://spinningup.openai.com/en/latest/ (accessed on 1 May 2021).
65. Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A simple neural attentive meta-learner. arXiv **2017**, arXiv:1707.03141.
66. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. arXiv **2020**, arXiv:2003.05991.
67. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
68. NVIDIA TAO, Version 3.0. Available online: https://docs.nvidia.com/tao/tao-toolkit/text/overview.html (accessed on 1 July 2021).
69. DetectNet: Deep Neural Network for Object Detection in DIGITS. Available online: https://developer.nvidia.com/blog/detectnet-deep-neural-network-object-detection-digits/ (accessed on 1 August 2021).
70. Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv **2017**, arXiv:1711.00199.
71. Toromanoff, M.; Wirbel, E.; Moutarde, F. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7153–7162.
72. Raffin, A.; Hill, A.; Traoré, R.; Lesort, T.; Díaz-Rodríguez, N.; Filliat, D. Decoupling feature extraction from policy learning: Assessing benefits of state representation learning in goal based robotics. arXiv **2019**, arXiv:1901.08651.
73. Sax, A.; Zhang, J.O.; Emi, B.; Zamir, A.; Savarese, S.; Guibas, L.; Malik, J. Learning to navigate using mid-level visual priors. arXiv **2019**, arXiv:1912.11121.
74. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv **2014**, arXiv:1412.6806.
75. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 807–814.

**Figure 1.** Illustration of the dolly (blue) and the robot in our simulated warehouse environment. The lines connected to the robot’s chassis visualize the LiDAR distance-measuring beams. In this figure, NVIDIA Omniverse™ [17] is used for visualization. The front-facing camera is placed right in the center of the chassis of the vehicle (highlighted by the red square) and captures images with a resolution of $80\times 80$ pixels. Two additional LiDAR sensors are placed at the diagonal corners of the vehicle, each emitting 128 beams and covering a field of view of ${225}^{\circ}$.

**Figure 2.** An illustration of the critic network architecture, consisting of ResNet blocks [48] for feature extraction (highlighted by the yellow shape) and fully-connected layers for the LiDAR inputs and the historical actions and rewards. We concatenate the outputs of the three parts (illustrated by the ∪ symbol) to establish a learned sensor fusion. For the actor network, only the output layer is changed. The details of the ResNet blocks are shown in Appendix A.

**Figure 3.** An illustration of the designed training arena. It consists of nine cells of different sizes and layouts. For instance, the walls and floors feature different colors and patterns, and the light sources also differ in each cell. The initial poses of the robot, the target dolly, and the obstacles are randomized. For details, please refer to Appendix C.

**Figure 4.** Comparison of (**a**) the ground-truth image sequence and (**b**) the image sequence reconstructed by the auto-encoder.

**Figure 5.** Learning curves of the three variants: (**a**) the episodic return; (**b**) the docking success probability per episode. Both statistics are presented as a moving average over 500 episodes, where the solid line and shaded area illustrate, respectively, the mean and the variance over three runs.

**Figure 6.** (**a**) Schematic representation of the grid-based test scenario. The coordinate system to which the test scenario refers is shown in red. The fixed dolly position is marked with “D”. The blue grid represents the test zone, divided into $11\times 11$ positions. The grid and the dolly are scaled up in this illustration to improve visibility. (**b**) Graphical illustration of a ${0}^{\circ}$ rotation test, conducted in the simulated testing environment.

**Figure 7.** Color-coded illustration of the grid-based testing results of one fully trained NavACL-Q agent. The average performance for each position on the grid is represented by a colorized circle, where yellow indicates a high success rate and blue indicates near-zero success probability. (**a**) The testing result of NavACL-Q p.t.; (**b**) the testing result of RND; (**c**) the testing result of NavACL-Q e.t.e. A further summary of the statistics is available in Table 3.

**Figure 8.**Two-dimensional interpolation of the success probability estimated by ${f}_{\pi}$ at different stages of training, where red areas indicate high success probability estimates and blue areas indicate low success probability estimates. In this case, the plot is generated across the geometric properties Agent-Goal Distance and Relative Angle. The individual plots consist of the success predictions of the 10,000 tasks that followed the displayed episode.

**Figure 9.** Comparison of the task-selection histograms with respect to the Agent-Goal Distance geometric property for exemplary training outcomes. We recorded the statistics of the initial position in terms of Agent-Goal Distance at different training stages, each spanning 10,000 episodes. The histogram counts the number of initial states in the defined distance bins within each 10,000 episodes. (**a**) Task distribution of a NavACL-Q agent; (**b**) task distribution of an RND agent.

**Figure 10.** Similar to Figure 7, we demonstrate a color-coded illustration of the grid-based testing results of the baseline approach. Yellow indicates a high success rate and blue indicates near-zero success probability.

**Figure 11.** A selection of driven trajectories from three different initial positions, where the orange line represents the baseline trajectory, the blue line represents the NavACL-Q e.t.e. trajectory, the green line illustrates the NavACL-Q trajectory, and the red line depicts the trajectory of the RND case. Clipped trajectories indicate that the agent ended in a collision.

**Table 1.** Summary of the sensory observations and additional statistics that describe the state design of this work.

**Observation Components**

| Description | Dimensions |
|---|---|
| Sequence of the four most recent RGB camera images | ${\mathbf{R}}^{4\times 3\times 80\times 80}$ |
| Current LiDAR sensor input (front and back sensor concatenated) | ${\mathbf{R}}^{1\times 256}$ |
| History of the four previously taken actions | ${\mathbf{R}}^{4\times 2}$ |
| History of the four previously received rewards | ${\mathbf{R}}^{4\times 1}$ |

**Geometric Properties**

| Property | Description |
|---|---|
| Agent-Goal Distance | Euclidean distance from ${s}_{0}$ to ${s}_{g}$ |
| Agent Clearance | Distance from ${s}_{0}$ to the nearest obstacle |
| Goal Clearance | Distance from ${s}_{g}$ to the nearest obstacle |
| Relative Angle | The angle between the starting orientation and $\overrightarrow{{s}_{0},{s}_{g}}$ |
| Initial Q-Value | The predicted Q-value ${Q}_{\varphi}({s}_{0},{a}_{0})$ from the SAC critic network |
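The dimensions in Table 1 fix the flattened size of the network input; a quick arithmetic check (the component names below are ours, for illustration):

```python
from math import prod

# Shapes from Table 1 (camera stack, LiDAR, action history, reward history).
components = {
    "camera":  (4, 3, 80, 80),  # four stacked RGB frames, 80x80 pixels
    "lidar":   (1, 256),        # two 128-beam scans concatenated
    "actions": (4, 2),          # last four (linear, angular) commands
    "rewards": (4, 1),          # last four scalar rewards
}
sizes = {name: prod(shape) for name, shape in components.items()}
total = sum(sizes.values())  # flattened observation length
```

The camera stack dominates the observation with 76,800 values, which is why the paper devotes a pre-trained convolutional encoder to it while the remaining 268 values pass through small fully-connected layers.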

**Table 3.** Testing statistics: the average success rate of reaching the target for each ablation variant and the baseline approach. The average success rate is calculated as the mean success rate over the $11\times 11$ grid points from Figure 7.

| Relative Orientation of AGV to Target | NavACL p.t. | RND | NavACL e.t.e. | Baseline |
|---|---|---|---|---|
| ${0}^{\circ}$ | **86.6%** | 58.5% | 50.3% | 16.5% |
| $-{45}^{\circ}$ | **93.7%** | 55.6% | 25.5% | 3.3% |
| $+{45}^{\circ}$ | **88.0%** | 52.2% | 53.5% | 5.0% |
| $-{90}^{\circ}$ | **90.5%** | 43.2% | 8.9% | 0% |
| $+{90}^{\circ}$ | 18.5% | **45.5%** | 32.0% | 0% |
| $-{135}^{\circ}$ | **48.2%** | 36.9% | 2.2% | 0% |
| $+{135}^{\circ}$ | 11.9% | **37.1%** | 19.8% | 0% |
| $+{180}^{\circ}$ | 15.7% | **34.2%** | 8.0% | 0% |
| Mean of $\{{0}^{\circ},\pm {45}^{\circ},\pm {90}^{\circ}\}$ (Interpolated Tasks) | **75.5%** | 51.1% | 34.1% | 5.0% |
| Mean of $\{\pm {135}^{\circ},{180}^{\circ}\}$ (Extrapolated Tasks) | 25.3% | **36.1%** | 10.0% | 0% |
| Mean of All Orientations | **56.6%** | 45.4% | 25.0% | 3.1% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Xue, H.; Hein, B.; Bakr, M.; Schildbach, G.; Abel, B.; Rueckert, E.
Using Deep Reinforcement Learning with Automatic Curriculum Learning for Mapless Navigation in Intralogistics. *Appl. Sci.* **2022**, *12*, 3153.
https://doi.org/10.3390/app12063153

**AMA Style**

Xue H, Hein B, Bakr M, Schildbach G, Abel B, Rueckert E.
Using Deep Reinforcement Learning with Automatic Curriculum Learning for Mapless Navigation in Intralogistics. *Applied Sciences*. 2022; 12(6):3153.
https://doi.org/10.3390/app12063153

**Chicago/Turabian Style**

Xue, Honghu, Benedikt Hein, Mohamed Bakr, Georg Schildbach, Bengt Abel, and Elmar Rueckert.
2022. "Using Deep Reinforcement Learning with Automatic Curriculum Learning for Mapless Navigation in Intralogistics" *Applied Sciences* 12, no. 6: 3153.
https://doi.org/10.3390/app12063153