# MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments


*Keywords:* Deep Reinforcement Learning; OpenAI gym; exploration; unknown terrains


Department of Electrical and Computer Engineering, Democritus University of Thrace, 671 00 Xanthi, Greece

The Centre for Research & Technology, Information Technologies Institute, Hellas, 570 01 Thessaloniki, Greece

Department of Production and Management Engineering, Democritus University of Thrace, 671 00 Xanthi, Greece

Author to whom correspondence should be addressed.

Academic Editors: Ahmad Taher Azar, Anis Koubaa, Alaa Khamis, Ibrahim A. Hameed and Gabriella Casalino

Received: 5 October 2021 / Revised: 5 November 2021 / Accepted: 8 November 2021 / Published: 11 November 2021

(This article belongs to the Special Issue Deep Learning Techniques for Manned and Unmanned Ground, Aerial and Marine Vehicles)

This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an openai-gym compatible environment tailored to the exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be straightforwardly applied to a robotic platform without requiring an elaborate simulation model of the robot’s dynamics or an additional learning/adaptation phase. One of its core features is the controllable multi-dimensional procedural generation of terrains, which is key to producing policies with strong generalization capabilities. Four different state-of-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and their results are evaluated against the average human-level performance. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve, without this information being provided to the robot or Hilbert-curve-like trajectories being rewarded, directly or indirectly. The experimental analysis is concluded by evaluating the learned PPO policy side-by-side with frontier-based exploration strategies. A study of the performance curves revealed that the PPO-based policy was capable of performing adaptive-to-the-unknown-terrain sweeping without leaving expensive-to-revisit areas uncovered, underlining the capability of RL-based methodologies to tackle exploration tasks efficiently.

At this very moment, three different uncrewed spacecraft, PERSEVERANCE (USA), HOPE (UAE), and TIANWEN-1 (China), are on the surface of or in orbit around Mars. Never before has such a diverse array of scientific gear arrived at another planet at the same time, and with such broad ambitions [1]. On top of that, several lunar missions have been arranged for this year to enable extensive experimentation, investigation, and testing on an extraterrestrial body [2]. In this exponentially growing field of extraterrestrial missions, a task of paramount importance is the autonomous exploration/coverage of previously unknown areas. The effectiveness and efficiency of such autonomous explorers may significantly impact the timely accomplishment of crucial tasks (e.g., before fuel depletion) and, ultimately, the success (or not) of the overall mission.

Exploration/coverage of unknown territories translates into the online design of the robot’s path, taking sensory information as input, with the objective of mapping the whole area in the minimum possible time [3,4]. This setup shares the same properties and objectives as the well-known NP-complete Traveling Salesman Problem (TSP), with the even more restrictive property that the area to be covered is discovered incrementally during the operation.

A well-established family of approaches incorporates the concept of the next best pose, i.e., a turn-based, greedy selection of the next best position (also known as a frontier cell) from which to acquire a measurement, based on a heuristic strategy (e.g., [5,6,7]). Although this family of approaches has been extensively studied, some inherent drawbacks significantly constrain its broader applicability. For example, every deadlock that may arise during the previously described optimization scheme should have been predicted, and a corresponding mitigation plan should already be in place [8]; otherwise, the robot will become stuck in a locally optimal configuration [9]. On top of that, engineering a multi-term strategy that reflects the task at hand is not always trivial [10].

The recent breakthroughs in Reinforcement Learning (RL), in terms of both algorithms and hardware acceleration, have spawned methodologies capable of achieving above human-level performance in high-dimensional, non-linear setups, such as the game of Go [11], Atari games [12], multi-agent collaboration [13], robotic manipulation [14], etc. A milestone in the RL community was the standardization of several key problems under a common framework, namely openai-gym [15]. This release eased the comparison of different methodologies and ultimately led to a whole new series of RL frameworks with standardized algorithms (e.g., [16,17]), all tuned to tackle openai-gym compatible setups.

These breakthroughs motivated the application of RL methodologies to path-planning/exploration robotic tasks. Initially, the problem of navigating a single robot in previously unknown areas to reach a destination, while simultaneously avoiding catastrophic collisions, was tackled with RL methods [18,19,20]. The first RL methodology developed solely for the exploration of unknown areas was presented in [21] and successfully demonstrated the potential benefits of RL. Recently, RL methodologies have been proposed that seek to leverage the deployment of multi-robot systems to cover an operational area [22].

However, Ref. [22] assumes only a single geometry for the environment to be covered and is thus prone to overfitting rather than generalizing to different environments. This drawback is mitigated in Ref. [21] by introducing a learning scheme with 30 different environments during the training phase. Although such a methodology can adequately tackle the generalization problem, the RL agent’s performance is still bounded by the diversity of the human-imported environments.

The main contribution of this work is to provide a framework for learning exploration/coverage policies that possess strong generalization abilities due to the procedurally generated terrain diversity. The intuition behind such an approach to exploration tasks is the fact that most areas exhibit some kind of structure in their terrain topology, e.g., city blocks, trees in a forest, containers in ports, office complexes. Thereby, by training multiple times in such correlated and procedurally generated environments, the robot will grasp/understand the underlying structure and leverage it to efficiently complete its goal, even in areas to which it has never been exposed.

Within this scope, a novel openai-gym compatible environment for the exploration/coverage of unknown terrains has been developed and is presented. All the core elements that govern a real exploration/coverage setup have been included. MarsExplorer is one of the few RL environments in which any learned policy can be transferred to real-world robotic platforms, provided that a proper translation between the proprioceptive/exteroceptive sensor readings and the 2D perception (occupancy map), as well as an integration with the existing robotic systems (e.g., PID low-level control, safety mechanisms, etc.), is implemented.

Four state-of-the-art RL algorithms, namely A3C [23], PPO [24], Rainbow [25], and SAC [26], have been evaluated on the MarsExplorer environment. To better comprehend these evaluation results, the average human-level performance in the MarsExplorer environment is also reported. A follow-up analysis utilizing the best-performing algorithm (PPO) is conducted with respect to the different levels of difficulty. The visualization of the produced trajectories revealed that the PPO algorithm had learned to apply the famous space-filling Hilbert curve, with the additional capability of avoiding, on the fly, obstacles that might appear on the terrain. The analysis is concluded with a scalability study and a comparison with non-learning methodologies.

It should be highlighted that the objective is not to provide another highly realistic simulator but a framework upon which RL methods (and also non-learning approaches) can be efficiently benchmarked in exploration/coverage tasks. Although several wrappers are available for high-fidelity simulators (e.g., Gazebo [27], ROS [28]) that could be tuned to formulate an exploration/coverage setup, in practice the required execution time for each episode severely limits the type of algorithms that can be used (for example, PPO usually needs several millions of steps of environment interaction to converge). To the best of our knowledge, this is the first openai-gym compatible framework oriented towards the robotic exploration/coverage of unknown areas.

Figure 1 presents 4 sample snapshots that illustrate the performance of a trained RL robot inside the MarsExplorer environment. Figure 1a demonstrates the robot’s entry inside the unknown terrain, which is annotated with black color. Figure 1b illustrates all the so-far gained “knowledge”, which is depicted with either Martian soil or brown boxes to denote free space or obstructed positions, respectively. An attractive trait is depicted in Figure 1c, where the robot chose to perform a dexterous maneuver between two obstacles to be as efficient as possible in terms of the number of timesteps for the coverage task. Note that any collision with an obstacle would have resulted in the termination of the episode and, as a result, the acquisition of a large negative reward. Figure 1d illustrates the robot’s final position, along with all the information gained about the terrain (non-black region) during the episode.

All in all, the main contributions of this paper are:

- Develop an open-source (https://github.com/dimikout3/MarsExplorer (accessed on 4 November 2021)), openai-gym compatible environment tailored explicitly to the problem of exploration of unknown areas with an emphasis on generalization abilities.
- Translate the original robotics exploration problem to an RL setup, paving the way to apply off-the-shelf algorithms.
- Perform a preliminary study of various state-of-the-art RL algorithms, including A3C, PPO, Rainbow, and SAC, utilizing human-level performance as a baseline.
- Challenge the generalization abilities of the best-performing PPO-based agent by evaluating it under multi-dimensional difficulty settings.
- Present a side-by-side comparison with frontier-based exploration strategies.

The remainder of this paper is organized as follows: Section 2 presents the details of the openai-gym exploration environment, called MarsExplorer, along with an analysis of the key RL attributes inserted. Section 3 presents the experimental analysis, from the survey of the performance of state-of-the-art RL algorithms to the evaluation against standard frontier-based exploration. Finally, Section 4 summarizes the findings and draws the conclusions of this study.

This section identifies the fundamental elements that govern the family of setups that fall into the coverage/exploration class and translates them to the openai-gym framework [15]. In principle, the objective of the robot is to cover an area of interest in the minimum possible time while avoiding any non-traversable objects, the positions of which are revealed only when the robot is in close proximity [29,30].

Let us assume that the area to be covered is constrained within a square, which has been discretized into $n=\mathit{rows}\times \mathit{cols}$ identical grid cells:

$$\mathcal{G}=\left\{(x,y):x\in [1,\mathit{rows}],y\in [1,\mathit{cols}]\right\}$$

The robot cannot move freely inside this grid, as some grid cells are occupied by non-traversable objects (obstacles). Therefore, the map of the terrain is defined as follows:

$$\mathcal{M}\left(q\right)=\left\{\begin{array}{cl}0.3 & \text{free space}\\ 1 & \text{obstacle}\end{array}\right.\quad q=(x,y)\in \mathcal{G}$$

The values of $\mathcal{M}$ correspond to the morphology of the unknown terrain and are considered a priori unknown.
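To make the procedural-generation idea concrete, the following is a minimal sketch of how such a map $\mathcal{M}$ might be produced; this is not the actual MarsExplorer generator, and the grid size, obstacle count, and blob sizes are illustrative assumptions:

```python
import numpy as np

FREE, OBSTACLE = 0.3, 1.0

def generate_map(rows=21, cols=21, n_obstacles=9, max_blob=3, seed=None):
    """Procedurally generate a terrain map M (illustrative sketch):
    obstacles are square blobs placed at random grid positions,
    everything else is free space."""
    rng = np.random.default_rng(seed)
    M = np.full((rows, cols), FREE)
    for _ in range(n_obstacles):
        x = rng.integers(0, rows)
        y = rng.integers(0, cols)
        size = rng.integers(1, max_blob + 1)  # blob side length
        M[x:x + size, y:y + size] = OBSTACLE  # clipped at the borders
    return M

M = generate_map(seed=0)
```

Increasing the randomness of the placement and blob size is exactly what the difficulty vector of Section 3 controls.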

Keeping in mind that the discretization of the area into grid cells is mainly imposed by the robot’s movement capabilities, the action space is defined in the same grid context as well. The position of the robot is denoted by the corresponding $x,y$ cell of the grid, i.e., ${p}_{a}\left(t\right)=\left[{x}_{a}\left(t\right),{y}_{a}\left(t\right)\right]$. The possible next actions are then simply given by the Von Neumann neighborhood [31], i.e.,

$${\mathcal{A}}_{{p}_{a}}=\left\{(x,y):\left|x-{x}_{a}\right|+\left|y-{y}_{a}\right|\le 1\right\}$$

In the openai-gym framework, the formulation above is realized by a discrete space of size 4 (North, East, South, West).
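The mapping from the discrete action space to grid displacements can be sketched as follows; the specific assignment of indices to North/East/South/West is an assumption for illustration:

```python
# Map each discrete action (as in a gym Discrete(4) space) to a grid
# displacement. The index ordering below is illustrative.
ACTIONS = {
    0: (-1, 0),  # North
    1: (0, 1),   # East
    2: (1, 0),   # South
    3: (0, -1),  # West
}

def next_position(p, action):
    """Candidate next cell for a robot at p = (x, y)."""
    dx, dy = ACTIONS[action]
    return (p[0] + dx, p[1] + dy)
```

Note that the four displacements are exactly the cells at Manhattan distance 1, i.e., the Von Neumann neighborhood defined above.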

With each movement, the robot may acquire some information related to the formation of the environment that lies inside its sensing capabilities, according to the following lidar-like model:

$${y}_{q}\left(t\right)=\left\{\begin{array}{cl}1 & \text{if }\parallel {p}_{a}\left(t\right)-q\parallel \le d\text{ and there exists line-of-sight between }{p}_{a}\left(t\right)\text{ and }q\\ 0 & \text{otherwise}\end{array}\right.\quad \forall q\in \mathcal{G}$$

where d denotes the maximum scanning distance.
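A possible implementation of this sensing model, with line-of-sight checked via Bresenham's line algorithm over the grid, is sketched below; this is an illustrative choice, and the environment's internal ray-casting may differ:

```python
import numpy as np

def bresenham_line(p0, p1):
    """Grid cells on the straight line from p0 to p1 (Bresenham)."""
    (x0, y0), (x1, y1) = p0, p1
    cells = []
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx - dy
    x, y = x0, y0
    while True:
        cells.append((x, y))
        if (x, y) == (x1, y1):
            break
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
    return cells

def sense(M, p, d, obstacle=1.0):
    """Lidar-like model: y_q = True for cells q within range d of p
    with an unobstructed line of sight from p (the sensed cell itself
    may be an obstacle and is still detected)."""
    y = np.zeros(M.shape, dtype=bool)
    rows, cols = M.shape
    for qx in range(rows):
        for qy in range(cols):
            if np.hypot(qx - p[0], qy - p[1]) > d:
                continue
            line = bresenham_line(p, (qx, qy))
            # blocked if any cell strictly before q is an obstacle
            if all(M[cx, cy] != obstacle for cx, cy in line[:-1]):
                y[qx, qy] = True
    return y
```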

An auxiliary boolean matrix $D\left(t\right)$ is introduced to denote all the cells that have been discovered from the beginning up to timestep t. $D\left(t\right)$ annotates with one all cells that have been sensed and with zero all the others. Starting from a $\mathit{rows}\times \mathit{cols}$ zero matrix, its values are updated as follows:

$${D}_{q}\left(t\right)={D}_{q}(t-1)\vee {y}_{q}\left(t\right),\quad \forall q\in \mathcal{G}$$

where ∨ denotes the logical OR operator. The state is simply an aggregation of the acquired information over all past measurements of the robot (4). Having updated (5), the state $s\left(t\right)$ is a matrix of the same size as the grid to be explored (1), and its values are given by:

$${s}_{q}\left(t\right)=\left\{\begin{array}{cl}{\mathcal{M}}_{q} & \text{if }{D}_{q}\left(t\right)\\ 0\;(=\text{undefined}) & \text{otherwise}\end{array}\right.\quad \forall q\in \mathcal{G}$$

Finally, the robot’s position is declared by making the value of the corresponding cell equal to $0.6$, i.e., ${s}_{q={p}_{a}\left(t\right)}\left(t\right)=0.6$. Overall, state $s\left(t\right)$ is defined as a 2D matrix that takes values from the discrete set $\{0,0.3,0.6,1\}$. Figure 2 presents an illustrative example of a registration between the graphical environment (Figure 2a) and the corresponding state representation (Figure 2b).
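Given the map $\mathcal{M}$, the discovery mask $D(t)$, and the robot's position, the state matrix defined above can be rendered as in the following sketch:

```python
import numpy as np

FREE, ROBOT, OBSTACLE = 0.3, 0.6, 1.0

def build_state(M, D, p):
    """Render the state matrix: discovered cells show their map value,
    undiscovered cells are 0 (undefined), and the robot's cell is 0.6."""
    s = np.where(D, M, 0.0)
    s[p] = ROBOT
    return s
```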

Having in mind that the ultimate objective is to discover all grid cells, the instantaneous core reward, at each timestep t, is defined as the number of newly explored cells, i.e.,

$${r}_{\mathit{explor}}\left(t\right)=\sum _{q\in \mathcal{G}}{D}_{q}\left(t\right)-\sum _{q\in \mathcal{G}}{D}_{q}(t-1)$$

Intuitively, if ${\sum}_{k=0}^{T}{r}_{\mathit{explor}}\left(k\right)\to n$, then the robot has explored the whole grid (1) in T timesteps.

To force the robot to explore the whole area (7) while avoiding unnecessary movements, an additional penalty ${r}_{\mathit{move}}=0.5$ per timestep is applied. In essence, this negative reward aims to distinguish among policies that lead to the same number of discovered cells but require a different number of exploration steps. Please note that the value of ${r}_{\mathit{move}}$ should be less than 1, so that it has lower priority than the exploration of a single cell.

The action space, as defined previously, may include invalid next movements for the robot, i.e., movements out of the operational area (1) or crashing into a discovered obstacle. Thus, apart from solving the problem at hand, the robot should be able to recognize these undesirable states and avoid them at all costs. To that end, an additional penalty ${r}_{\mathit{invalid}}=n$ is introduced for the cases where the robot’s next movement leads to an invalid state. Along with such a reward, the episode is marked as “done”, indicating that a new episode should be initiated.

At the other end of the spectrum, a completion bonus ${r}_{\mathit{bonus}}=n$ is given to the robot when more than $\beta $% (e.g., 95%) of the cells have been explored. As in the previous case, this is also considered a terminal state.

Putting everything together, the reward is defined as:

$$r\left(t\right)=\left\{\begin{array}{cl}-{r}_{\mathit{invalid}} & \text{if next state is invalid}\\ {r}_{\mathit{explor}}\left(t\right)-{r}_{\mathit{move}}+{r}_{\mathit{bonus}} & \text{if }\frac{{\sum}_{q\in \mathcal{G}}{D}_{q}\left(t\right)}{n}\ge \beta \\ {r}_{\mathit{explor}}\left(t\right)-{r}_{\mathit{move}} & \text{otherwise}\end{array}\right.$$
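This reward logic can be sketched as a single step function, with $r_{move}=0.5$ and $r_{bonus}=r_{invalid}=n$ as defined above; the function signature is illustrative, not the environment's API:

```python
def reward(newly_discovered, n, coverage, action_valid,
           r_move=0.5, beta=0.95):
    """Step reward following Eq. (8). Returns (reward, done):
    exploration gain minus a movement penalty, plus a completion bonus
    r_bonus = n when coverage >= beta; an invalid move yields the
    terminal penalty -r_invalid = -n."""
    if not action_valid:
        return -n, True            # invalid next state: episode ends
    r = newly_discovered - r_move  # r_explor(t) - r_move
    if coverage >= beta:
        return r + n, True         # completion bonus, terminal state
    return r, False
```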

MarsExplorer was designed as an initial endeavor to bridge the gap between powerful existing RL algorithms and the problem of autonomous exploration/coverage of a previously unknown, cluttered terrain. This subsection presents the built-in key attributes of the designed framework.

This section presents an experimental evaluation of the MarsExplorer environment. The analysis begins with all the implementation details that are important for realizing the MarsExplorer experimental setup. For the first evaluation iteration, four state-of-the-art RL algorithms are applied and evaluated in a challenging version of MarsExplorer that requires the development of strong generalization capabilities in a highly randomized scenario, where the underlying structure is almost absent. Having identified the best-performing algorithm, a follow-up analysis is performed with respect to the difficulty vector values. The learned patterns and exploration policies for different evaluation instances are further investigated and graphically presented. The analysis is concluded with a scale-up study in two larger terrains and a comparison between the trained robot and two well-established frontier-based approaches.

Aside from its standardization as an openai-gym environment, MarsExplorer provides an API that allows manually controlled experiments, translating keyboard-arrow commands into next movements. Such a feature can be used to assess human-level performance in the exploration/coverage problem and to reveal important traits by comparing human and machine-made strategies.

The Ray/RLlib framework [34] was utilized to perform all the experiments. The fact that RLlib is a well-documented, highly robust library also eases follow-on development (e.g., applying a different RL pipeline), as it follows a common framework. Furthermore, such an experimental setup may also leverage the interoperability with other powerful frameworks from the Ray ecosystem, e.g., Ray/Tune for hyperparameter tuning.

Table 1 summarizes all the fixed parameters used for all the performed experiments. MarsExplorer admits the distinguishing property of stochastically deploying the obstacles at the beginning of each episode. This stochasticity can be controlled and ultimately determines the difficulty level of the MarsExplorer setup. The state-space of MarsExplorer has a strong resemblance to thoroughly studied 2D environments, e.g., ALE [35], with the key difference that the image is generated incrementally, based on the robot’s actions. Therefore, as standardized in the DQN algorithm’s application domain [12], a vision-inspired neural network architecture is incorporated as a first stage. Figure 3 illustrates the architecture of this pre-processor, which comprises 2 convolutional layers followed by a fully connected one. The vectorized output of the fully connected layer is forwarded to a “controller” architecture that depends on the enabled RL algorithm.
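In RLlib, such a vision-style pre-processor can be declared through the model config; the filter counts, kernel sizes, and strides below are illustrative assumptions, not the exact values used in the paper:

```python
# Sketch of the pre-processor of Figure 3 as an RLlib model config:
# two convolutional layers followed by a fully connected layer.
# All numeric values here are illustrative assumptions.
model_config = {
    "conv_filters": [
        [16, (4, 4), 2],   # 16 filters, 4x4 kernel, stride 2
        [32, (3, 3), 2],   # 32 filters, 3x3 kernel, stride 2
    ],
    "fcnet_hiddens": [256],      # fully connected layer
    "fcnet_activation": "relu",
}
```

The vectorized output of the fully connected layer then feeds whichever "controller" head the selected algorithm (PPO, Rainbow, A3C, SAC) defines.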

Apart from the details described in the previous subsection, for the comparison study, the formation (position and shape) of the obstacles was set randomly at the beginning of each episode. This choice was made to force the RL algorithms to develop novel generalization strategies to tackle such a challenging setup. The list of studied RL algorithms comprises the following model-free approaches: PPO [24], DQN-Rainbow [25], A3C [23], and SAC [26]. All hyperparameters of these algorithms are reported in Appendix A.

Figure 4 presents a comparison study among the approaches mentioned above. For each RL agent, the thick colored lines stand for the episode’s total reward, while the transparent surfaces around them correspond to the standard deviation. Moreover, the episode’s reward (score) is normalized in such a way that 0 stands for an initial invalid action by the robot, ${r}_{\mathit{invalid}}$ in (8), while 1 corresponds to the theoretical maximum reward, which is ${r}_{\mathit{bonus}}$ in (8) plus the number of cells.

To increase the qualitative comprehension of the produced results, the average human-level performance is also introduced. To approximate this value, 10 players were drawn from the pool of CERTH/ConvCAO employees to participate in the evaluation process. Each player had an initial warm-up phase of 15 episodes (non-ranked) and was then evaluated on 30 episodes. The average achieved score of the 300 human-controlled experiments is depicted with a green dashed line.

A clear-cut outcome is that the PPO algorithm achieves the highest average episodic reward, reaching an impressive 85.8% of human-level performance. DQN-Rainbow achieves the second-best performance; however, its average is only 50.04% of the PPO performance and 42.73% of the human-level performance.

Having identified the best-performing RL algorithm (PPO), the focus now shifts to producing some preliminary results related to the difficulty settings of MarsExplorer. As mentioned in the definition section, MarsExplorer allows setting the elements of the difficulty vector independently. More specifically, the difficulty vector comprises 3 elements $[{d}_{t},{d}_{m},{d}_{b}]$, where:

- ${d}_{t}$ denotes the **topology stochasticity**, which defines the obstacles’ placement on the field. The fundamental positions of the obstacles are equally arranged in a 3 columns–3 rows format. The radius of deviation around these fundamental positions is controlled by ${d}_{t}$. As the value of ${d}_{t}$ increases, the obstacles’ topology has a more unstructured formation. ${d}_{t}$ takes values from the discrete set $\{1,2,3\}$.
- ${d}_{m}$ denotes the **morphology stochasticity**, which defines the obstacles’ shape on the field. ${d}_{m}$ controls the area that might be occupied by each obstacle. The bigger the value of ${d}_{m}$, the larger the compound areas of obstacles that might appear on the MarsExplorer terrain. ${d}_{m}$ takes values from the discrete set $\{1,2\}$.
- ${d}_{b}$ denotes the **bonus rewards** that are assigned for the completion (${r}_{\mathit{bonus}}$) and failure (${r}_{\mathit{invalid}}$) of the mission (8). Only two values are allowed for this factor, $\{1,2\}$, corresponding to providing and not providing the bonus rewards, respectively.

Higher values of the difficulty vector’s elements correspond to less structured behavior in the formation of the obstacles. Thus, an agent that has been successfully trained in higher-difficulty setups may exhibit increased generalization abilities. Overall, the aggregation of the aforementioned elements’ domains generates 12 combinations of difficulty levels. Figure 5 shows the evolution of the average episodic reward for each of the 12 levels during the training of the PPO algorithm. To improve the readability of the graphs, the results are organized into 3 graphs, one for each level of ${d}_{t}$, with 4 plot lines each.

A study of the learning curves reveals that ${d}_{m}$ has the largest effect on the learned policy. Blue and red lines (cases where ${d}_{m}=1$), in all three figures, demonstrate a similar convergence rate and also the highest-performance policies. However, a serious degradation in the results is observed in the purple and gray lines (${d}_{m}=2$). As expected, when ${d}_{m}=2$ and also ${d}_{t}=3$ (purple and gray lines in Figure 5c), the final achieved performance reached only slightly above 0.6 on the normalized scale. It seems that ${d}_{b}$ does not greatly affect the overall performance, at least up to this difficulty vector, apart from the convergence rate depicted in the gray line of Figure 5c.

This section is devoted to the characteristics of the learned policy from the PPO algorithm. For each of the 12 levels of difficulty defined in the previous section, the best PPO policy was extracted and evaluated in a series of 100 experiments with randomly (controlled by the difficulty setting) generated obstacles. Figure 6 presents one heat map for each difficulty level. The blue colormap corresponds to the frequency of the robot visiting a specific cell of the terrain. The green colormap corresponds to the number of detected obstacles in each position during the robot’s exploration.

A critical remark is that, for each scenario, the arrangement of discovered obstacles matches the drawn distribution as described in the previous subsection, implying that the learned policy does not have any “blind spots”.

Examining the heatmap of the trajectories in each scenario, it is clear that the same family of trajectories has been generated in all cases, and with great confidence. The important conclusion here is that this pattern is the first order of the Hilbert curve, which has been utilized extensively in the space-filling domain (e.g., [36,37]). It must be highlighted that this pattern was neither imported into the simulator nor rewarded when achieved by the RL algorithm; the algorithm learned that this is the most effective strategy simply by interacting with the environment.
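For reference, the Hilbert pattern referred to above can be generated with the standard distance-to-coordinates conversion for a Hilbert curve of a given order; the order-1 curve visits the four quadrants in the characteristic U shape:

```python
def hilbert_d2xy(order, d):
    """Convert distance d along a Hilbert curve of the given order
    (over a 2**order x 2**order grid) to (x, y) coordinates, using the
    standard quadrant-rotation algorithm."""
    x = y = 0
    t = d
    s = 1
    while s < 2 ** order:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:          # reflect within the quadrant
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x          # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# order-1 curve: the U-shaped pattern observed in the heatmaps
curve = [hilbert_d2xy(1, d) for d in range(4)]
```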

It would be an omission not to mention the learned policy’s ability to adapt to changes in the obstacles’ distribution and, ultimately, find the most efficient obstacle-free route. This trait can be observed most clearly in Figure 6k,l, where the policy needed to be extremely dexterous and delicate to avoid encounters with obstacles.

The analysis is concluded with a scalability study and a comparison with non-learning methodologies. Two terrains with sizes $[42\times 42]$ and $[84\times 84]$ were used. The difficulty level was set to $[{d}_{t},{d}_{m},{d}_{b}]=[2,2,1]$, while 100 experiments were conducted for each scenario. Utility- and cost-based frontier-cell exploration methodologies [7] were employed to position the achieved PPO policy in the context of non-learning approaches. In these frontier-based approaches, the exploration policy falls into two categories based on the metric to be optimized:

- Cost: the next action is chosen based on the distance from the nearest frontier cell.
- Utility: the decision-making is governed by a frequently updated information potential field.
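The cost-based variant can be sketched as a breadth-first search from the robot's position over the discovered free cells, returning the nearest frontier cell; this is an illustrative sketch, not the implementation of [7]:

```python
import numpy as np
from collections import deque

def nearest_frontier(D, M, p, obstacle=1.0):
    """Cost-based frontier selection (illustrative): BFS from the robot
    position p over discovered, traversable cells, returning the closest
    frontier cell, i.e., a discovered cell adjacent to an undiscovered
    one. Returns None when the terrain is fully explored."""
    rows, cols = D.shape
    steps = ((-1, 0), (1, 0), (0, -1), (0, 1))

    def is_frontier(x, y):
        return any(0 <= x + dx < rows and 0 <= y + dy < cols
                   and not D[x + dx, y + dy] for dx, dy in steps)

    queue, seen = deque([p]), {p}
    while queue:
        x, y = queue.popleft()
        if is_frontier(x, y):
            return (x, y)
        for dx, dy in steps:
            nx, ny = x + dx, y + dy
            if (0 <= nx < rows and 0 <= ny < cols and (nx, ny) not in seen
                    and D[nx, ny] and M[nx, ny] != obstacle):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return None
```

Because BFS expands cells in order of path length, the first frontier cell popped is the one with the minimum travel cost, which is exactly the "Cost" criterion above.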

Figure 7 summarizes the results of this evaluation study by presenting the average exploration time for each algorithm (PPO, cost frontier-based, utility frontier-based) over 100 procedurally generated runs. A direct outcome is that the learning-based approach requires the robot to travel less distance to explore the same percentage of terrain as the non-learning approaches. The final remark is devoted to the “knee” that can be observed in almost all the final stages of the non-learning approaches. Such behavior is attributed to several distant sub-parts of the terrain being left unexplored, the exploration of which requires this extra effort. On the contrary, the learning-based approach (PPO) seems to handle this situation quite well, not leaving these expensive-to-revisit regions behind along its exploration path.

A new openai-gym environment called MarsExplorer, which bridges the gap between reinforcement learning and real-life exploration/coverage in the robotics domain, is presented. The environment transforms the well-known robotics problem of exploring/covering a completely unknown region into a reinforcement learning setup that can be tackled by a wide range of off-the-shelf, model-free RL algorithms. An essential feature of the whole solution is that trained policies can be straightforwardly applied to real-life robotic platforms without being trained/tuned on the robot’s dynamics. To achieve that, the same level of information abstraction between the robotic system and MarsExplorer is required. A detailed experimental evaluation was also conducted and presented. Four state-of-the-art RL algorithms, namely A3C, PPO, Rainbow, and SAC, were evaluated in a challenging version of MarsExplorer, and their training results were compared with human-level performance for the task at hand. The PPO algorithm achieved the best score, reaching 85.8% of human-level performance. Then, the PPO algorithm was utilized to study the effect of changes in the multi-dimensional difficulty vector on the overall performance. The visualization of the paths for all these difficulty levels revealed a rather important trait: the learned PPO policy performs a Hilbert curve with the extra ability to avoid any encountered obstacle. Lastly, a scalability study clearly indicates the ability of RL approaches to be extended to larger terrains, where the achieved performance is validated against non-learning, frontier-based exploration strategies.

Conceptualization, D.I.K., A.C.K., A.A.A. and E.B.K.; methodology D.I.K. and A.C.K.; software, D.I.K.; validation, D.I.K., A.C.K., A.A.A. and E.B.K.; formal analysis, A.C.K. and E.B.K.; investigation, D.I.K., A.C.K. and A.A.A.; resources, E.B.K.; data curation, D.I.K.; writing—original draft preparation, A.C.K. and A.A.A.; writing—review and editing, A.C.K., A.A.A. and E.B.K.; visualization, D.I.K.; supervision, E.B.K.; project administration, A.C.K. and E.B.K.; funding acquisition, E.B.K. All authors have read and agreed to the published version of the manuscript.

This research was funded by European Commission Of European Union’s Horizon 2020 research and innovation programme under grant number 833464 (CREST). Also, we gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research.

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

PPO hyperparameters:

Parameter | Value | Comments |
---|---|---|
$\gamma $ | 0.95 | Discount factor of the MDP |
$\lambda $ | $5\times {10}^{-5}$ | Learning rate |
Critic | True | Used a critic as a baseline |
GAE $\lambda $ | 0.95 | GAE($\lambda $) parameter |
KL coeff | 0.2 | Initial coefficient for KL divergence |
Clip | 0.3 | PPO clip parameter |
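Since training was run on top of RLlib, the PPO hyperparameters above map naturally onto an RLlib-style configuration dict. The key names below follow RLlib's PPO trainer options and the values are taken from the table; this is a hedged sketch, not the authors' exact configuration file:

```python
# Illustrative RLlib-style PPO configuration mirroring the table above.
ppo_config = {
    "gamma": 0.95,       # discount factor of the MDP
    "lr": 5e-5,          # learning rate
    "use_critic": True,  # critic used as a baseline
    "use_gae": True,     # implied by the GAE(lambda) entry
    "lambda": 0.95,      # GAE(lambda) parameter
    "kl_coeff": 0.2,     # initial coefficient for KL divergence
    "clip_param": 0.3,   # PPO clip parameter
}
```

With RLlib, a dict like this would typically be merged into the trainer's default configuration before launching training.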

Rainbow hyperparameters:

Parameter | Value | Comments |
---|---|---|
$\gamma $ | 0.95 | Discount factor of the MDP |
$\lambda $ | $5\times {10}^{-4}$ | Learning rate |
Noisy Net | True | Used a noisy network |
Noisy $\sigma $ | 0.5 | Initial value of noisy nets |
Dueling Net | True | Used dueling DQN |
Double DQN | True | Used double DQN |
$\epsilon $-greedy | [1.0, 0.02] | Epsilon-greedy range for exploration |
Buffer size | 50,000 | Size of the replay buffer |
Prioritized Replay | True | Prioritized replay buffer used |

SAC hyperparameters:

Parameter | Value | Comments |
---|---|---|
$\gamma $ | 0.95 | Discount factor of the MDP |
$\lambda $ | $3\times {10}^{-4}$ | Learning rate |
Twin Q | True | Used two Q-networks |
Q hidden | [256, 256] | Q-network hidden layer sizes |
Policy hidden | [256, 256] | Policy network hidden layer sizes |
Buffer size | 1,000,000 | Size of the replay buffer |
Prioritized Replay | True | Prioritized replay buffer used |

A3C hyperparameters:

Parameter | Value | Comments |
---|---|---|
$\gamma $ | 0.95 | Discount factor of the MDP |
$\lambda $ | $1\times {10}^{-4}$ | Learning rate |
Critic | True | Used a critic as a baseline |
GAE | True | Generalized Advantage Estimation |
GAE $\lambda $ | 0.99 | GAE($\lambda $) parameter |
Value loss | 0.5 | Value function loss coefficient |
Entropy coeff | 0.01 | Entropy coefficient |
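Both the PPO and A3C configurations rely on Generalized Advantage Estimation, GAE($\lambda$), to trade bias against variance in the advantage signal. A minimal sketch of the standard backward-recursion computation is shown below; the function name and the example values in the usage note are illustrative, while $\gamma$ and $\lambda$ follow the tables:

```python
def gae_advantages(rewards, values, gamma=0.95, lam=0.95):
    """Compute GAE(lambda) advantages by backward recursion.

    `values` must contain len(rewards) + 1 entries, the last one being
    the bootstrap value of the final state (0.0 for terminal states).
    Each advantage is A_t = sum_l (gamma * lam)^l * delta_{t+l}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

For example, with `rewards=[1.0, 1.0]`, `values=[0.5, 0.5, 0.0]`, and the tabulated $\gamma=\lambda=0.95$, the recursion yields advantages of roughly `[1.42625, 0.5]`.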

- Witze, A.; Mallapaty, S.; Gibney, E. All Aboard to Mars. 2020. Available online: https://www.nature.com/articles/d41586-020-01861-0 (accessed on 4 November 2021).
- Smith, M.; Craig, D.; Herrmann, N.; Mahoney, E.; Krezel, J.; McIntyre, N.; Goodliff, K. The Artemis Program: An Overview of NASA’s Activities to Return Humans to the Moon. In Proceedings of the 2020 IEEE Aerospace Conference, Big Sky, MT, USA, 7–14 March 2020; pp. 1–10. [Google Scholar]
- Shrestha, R.; Tian, F.P.; Feng, W.; Tan, P.; Vaughan, R. Learned map prediction for enhanced mobile robot exploration. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 1197–1204. [Google Scholar]
- Kapoutsis, A.C.; Chatzichristofis, S.A.; Doitsidis, L.; de Sousa, J.B.; Pinto, J.; Braga, J.; Kosmatopoulos, E.B. Real-time adaptive multi-robot exploration with application to underwater map construction. Auton. Robot. **2016**, 40, 987–1015. [Google Scholar] [CrossRef]
- Batinovic, A.; Petrovic, T.; Ivanovic, A.; Petric, F.; Bogdan, S. A Multi-Resolution Frontier-Based Planner for Autonomous 3D Exploration. IEEE Robot. Autom. Lett. **2021**, 6, 4528–4535. [Google Scholar] [CrossRef]
- Renzaglia, A.; Dibangoye, J.; Le Doze, V.; Simonin, O. Combining Stochastic Optimization and Frontiers for Aerial Multi-Robot Exploration of 3D Terrains. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4121–4126. [Google Scholar]
- Basilico, N.; Amigoni, F. Exploration strategies based on multi-criteria decision making for searching environments in rescue operations. Auton. Robot. **2011**, 31, 401–417. [Google Scholar] [CrossRef]
- Palacios-Gasós, J.M.; Montijano, E.; Sagüés, C.; Llorente, S. Distributed coverage estimation and control for multirobot persistent tasks. IEEE Trans. Robot. **2016**, 32, 1444–1460. [Google Scholar] [CrossRef]
- Koutras, D.I.; Kapoutsis, A.C.; Kosmatopoulos, E.B. Autonomous and cooperative design of the monitor positions for a team of UAVs to maximize the quantity and quality of detected objects. IEEE Robot. Autom. Lett. **2020**, 5, 4986–4993. [Google Scholar] [CrossRef]
- Popov, I.; Heess, N.; Lillicrap, T.; Hafner, R.; Barth-Maron, G.; Vecerik, M.; Lampe, T.; Tassa, Y.; Erez, T.; Riedmiller, M. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv **2017**, arXiv:1704.03073. [Google Scholar]
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature **2016**, 529, 484–489. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
- Baker, B.; Kanitscheider, I.; Markov, T.; Wu, Y.; Powell, G.; McGrew, B.; Mordatch, I. Emergent tool use from multi-agent autocurricula. arXiv **2019**, arXiv:1909.07528. [Google Scholar]
- Zhu, H.; Yu, J.; Gupta, A.; Shah, D.; Hartikainen, K.; Singh, A.; Kumar, V.; Levine, S. The Ingredients of Real World Robotic Reinforcement Learning. arXiv **2020**, arXiv:2004.12570. [Google Scholar]
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv **2016**, arXiv:1606.01540. [Google Scholar]
- Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; Zhokhov, P. OpenAI Baselines. 2017. Available online: https://github.com/openai/baselines (accessed on 4 November 2021).
- Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Goldberg, K.; Gonzalez, J.; Jordan, M.; Stoica, I. RLlib: Abstractions for distributed reinforcement learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3053–3062. [Google Scholar]
- Lei, X.; Zhang, Z.; Dong, P. Dynamic path planning of unknown environment based on deep reinforcement learning. J. Robot. **2018**, 2018, 5781591. [Google Scholar] [CrossRef]
- Wen, S.; Zhao, Y.; Yuan, X.; Wang, Z.; Zhang, D.; Manfredi, L. Path planning for active SLAM based on deep reinforcement learning under unknown environments. Intell. Serv. Robot. **2020**, 13, 263–272. [Google Scholar] [CrossRef]
- Zhang, K.; Niroui, F.; Ficocelli, M.; Nejat, G. Robot navigation of environments with unknown rough terrain using deep reinforcement learning. In Proceedings of the 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Philadelphia, PA, USA, 6–8 August 2018; pp. 1–7. [Google Scholar]
- Niroui, F.; Zhang, K.; Kashino, Z.; Nejat, G. Deep reinforcement learning robot for search and rescue applications: Exploration in unknown cluttered environments. IEEE Robot. Autom. Lett. **2019**, 4, 610–617. [Google Scholar] [CrossRef]
- Luis, S.Y.; Reina, D.G.; Marín, S.L.T. A Multiagent Deep Reinforcement Learning Approach for Path Planning in Autonomous Surface Vehicles: The Ypacaraí Lake Patrolling Case. IEEE Access **2021**, 9, 17084–17099. [Google Scholar] [CrossRef]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv **2017**, arXiv:1707.06347. [Google Scholar]
- Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv **2018**, arXiv:1801.01290. [Google Scholar]
- Zamora, I.; Lopez, N.G.; Vilches, V.M.; Cordero, A.H. Extending the OpenAI Gym for robotics: A toolkit for reinforcement learning using ROS and Gazebo. arXiv **2016**, arXiv:1608.05742. [Google Scholar]
- Lopez, N.G.; Nuin, Y.L.E.; Moral, E.B.; Juan, L.U.S.; Rueda, A.S.; Vilches, V.M.; Kojcev, R. gym-gazebo2, a toolkit for reinforcement learning using ROS 2 and Gazebo. arXiv **2019**, arXiv:1903.06278. [Google Scholar]
- Kapoutsis, A.C.; Chatzichristofis, S.A.; Kosmatopoulos, E.B. A distributed, plug-n-play algorithm for multi-robot applications with a priori non-computable objective functions. Int. J. Robot. Res. **2019**, 38, 813–832. [Google Scholar] [CrossRef]
- Burgard, W.; Moors, M.; Fox, D.; Simmons, R.; Thrun, S. Collaborative multi-robot exploration. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation (ICRA), Millennium Conference, Symposia Proceedings (Cat. No. 00CH37065), San Francisco, CA, USA, 24–28 April 2000; Volume 1, pp. 476–481. [Google Scholar]
- Gray, L. A mathematician looks at Wolfram’s new kind of science. Not. Am. Math. Soc. **2003**, 50, 200–211. [Google Scholar]
- Cobbe, K.; Hesse, C.; Hilton, J.; Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 2048–2056. [Google Scholar]
- Yin, H.; Chen, J.; Pan, S.J.; Tschiatschek, S. Sequential Generative Exploration Model for Partially Observable Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 10700–10708. [Google Scholar]
- Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Gonzalez, J.; Goldberg, K.; Stoica, I. Ray RLlib: A composable and scalable reinforcement learning library. arXiv **2017**, arXiv:1712.09381. [Google Scholar]
- Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res. **2013**, 47, 253–279. [Google Scholar] [CrossRef]
- Kapoutsis, A.C.; Chatzichristofis, S.A.; Kosmatopoulos, E.B. DARP: Divide areas algorithm for optimal multi-robot coverage path planning. J. Intell. Robot. Syst. **2017**, 86, 663–680. [Google Scholar] [CrossRef]
- Sadat, S.A.; Wawerla, J.; Vaughan, R. Fractal trajectories for online non-uniform aerial coverage. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 2971–2976. [Google Scholar]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).