MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments

This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an openai-gym compatible environment tailored to exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be straightforwardly applied to a robotic platform, without requiring an elaborate simulation model of the robot's dynamics or an additional learning/adaptation phase. One of its core features is the controllable multi-dimensional procedural generation of terrains, which is the key to producing policies with strong generalization capabilities. Four different state-of-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and an evaluation of their results against the average human-level performance is reported. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve, without providing this information to the environment or rewarding, directly or indirectly, Hilbert-curve-like trajectories. The experimental analysis is concluded by evaluating the PPO-learned policy side-by-side with frontier-based exploration strategies. A study of the performance curves revealed that the PPO-based policy was capable of performing adaptive-to-the-unknown-terrain sweeping without leaving expensive-to-revisit areas uncovered, underlining the capability of RL-based methodologies to tackle exploration tasks efficiently. The source code can be found at: https://github.com/dimikout3/MarsExplorer.


Motivation
At this very moment, three different uncrewed spaceships, PERSEVERANCE (USA), HOPE (UAE), and TIANWEN-1 (China), are on the surface or in the orbit of Mars. Never before had such a diverse array of scientific gear arrived at a foreign planet at the same time, and with such broad ambitions [1]. On top of that, several lunar missions have been arranged for this year to enable extensive experimentation, investigation, and testing on an extraterrestrial body [2]. In this exponentially growing field of extraterrestrial missions, a task of paramount importance is the autonomous exploration/coverage of previously unknown areas. The effectiveness and efficiency of such autonomous explorers may significantly impact the timely accomplishment of crucial tasks (e.g., before fuel depletion) and, ultimately, the success (or not) of the overall mission.
Exploration/coverage of unknown territories translates into the online design of the robot's path, taking the sensory information as input and aiming to map the whole area in the minimum possible time [3] [4]. This setup shares the same properties and objectives with the well-known NP-complete Traveling Salesman Problem (TSP), with the even more restrictive property that the area to be covered is discovered incrementally during the operation.

Related Work
The most well-established family of approaches is built around the next-best-pose process, i.e. a turn-based, greedy selection of the next best position (also known as frontier cell) at which to acquire a measurement, based on a heuristic strategy (e.g., [5], [6], [7]). Although this family of approaches has been extensively studied, some inherent drawbacks significantly constrain its broader applicability. For example, every deadlock that may arise during the previously described optimization scheme must be predicted in advance, with a corresponding mitigation plan already in place [8]; otherwise, the robot will get stuck in this locally optimal configuration [9]. On top of that, engineering a multi-term strategy that reflects the task at hand is not always trivial [10].
The recent breakthroughs in Reinforcement Learning (RL), in terms of both algorithms and hardware acceleration, have spawned methodologies capable of achieving above human-level performance in high-dimensional, non-linear setups, such as the game of Go [11], Atari games [12], multi-agent collaboration [13], robotic manipulation [14], etc. A milestone in the RL community was the standardization of several key problems under a common framework, namely openai-gym [15]. This release eased the evaluation among different methodologies and ultimately led to the generation of a whole new series of RL frameworks with standardized algorithms (e.g., [16], [17]), all tuned to tackle openai-gym compatible setups.
These breakthroughs motivated the application of RL methodologies to path-planning/exploration robotic tasks. Initially, the problem of navigating a single robot in previously unknown areas to reach a destination, while simultaneously avoiding catastrophic collisions, was tackled with RL methods [18], [19], [20]. The first RL methodology developed solely for the exploration of unknown areas was presented in [21], and successfully demonstrated the potential benefits of RL. Recently, RL methodologies have been proposed that seek to leverage the deployment of multi-robot systems to cover an operational area [22].
However, [22] assumes only a single geometry for the environment to be covered, and is thus prone to overfitting rather than being able to generalize to different environments. [21] mitigates this drawback by introducing a learning scheme with 30 different environments during the training phase. Although such a methodology can adequately tackle the generalization problem, the RL agent's performance is still bounded by the diversity of the human-imported environments.

Contributions
The main contribution of this work is to provide a framework for learning exploration/coverage policies that possess strong generalization abilities due to the procedurally generated terrain diversity. The intuition behind such an approach to exploration tasks is the fact that most areas exhibit some kind of structure in their terrain topology, e.g., city blocks, trees in a forest, containers in ports, office complexes. Thereby, by training multiple times in such correlated and procedurally generated environments, the robot will grasp/understand the underlying structure and leverage it to efficiently complete its goal, even in areas that it has never been exposed to.
Within this scope, a novel openai-gym compatible environment for exploration/coverage of unknown terrains has been developed and is presented. All the core elements that govern a real exploration/coverage setup have been included. MarsExplorer is one of the few RL environments where any learned policy can be transferred to real-world robotic platforms, provided that a proper translation from the proprioceptive/exteroceptive sensors' readings to a 2D perception (occupancy map), as depicted in figure 2, and an integration with the existing robotic systems (e.g., PID low-level control, safety mechanisms, etc.) are implemented.
Four state-of-the-art RL algorithms, namely A3C [23], PPO [24], Rainbow [25] and SAC [26], have been evaluated on the MarsExplorer environment. To better comprehend these evaluation results, the average human-level performance in the MarsExplorer environment is also reported. A follow-up analysis utilizing the best-performing algorithm (PPO) is conducted with respect to the different levels of difficulty. The visualization of the produced trajectories revealed that the PPO algorithm had learned to apply the famous space-filling Hilbert curve, with the additional capability of avoiding, on the fly, obstacles that might appear on the terrain. The analysis is concluded with a scalability study and a comparison with non-learning methodologies.
It should be highlighted that the objective is not to provide another highly realistic simulator but a framework upon which RL methods (and also non-learning approaches) can be efficiently benchmarked in exploration/coverage tasks. Although several wrappers for high-fidelity simulators are available (e.g. Gazebo [27], ROS [28]) that could be tuned to formulate an exploration/coverage setup, in practice the required execution time for each episode severely limits the type of algorithms that can be used (for example, PPO usually needs several million environment interactions to converge). To the best of our knowledge, this is the first openai-gym compatible framework oriented for robotic exploration/coverage of unknown areas.
Figure 1a demonstrates the robot's entry inside the unknown terrain, which is annotated with black color. Figure 1b illustrates all the so-far gained "knowledge", which is depicted either with Martian soil or with brown boxes to denote free space or obstructed positions, respectively. An attractive trait is depicted in figure 1c, where the robot chose to perform a dexterous maneuver between two obstacles to be as efficient as possible in terms of the number of timesteps for the coverage task. Note that any collision with an obstacle would have terminated the episode and, as a result, incurred an extreme negative reward. Figure 1d illustrates the robot's final position, along with all the gained information for the terrain (non-black region) during the episode.
All in all, the main contributions of this paper are:
• Develop an open-source, openai-gym compatible environment tailored explicitly to the problem of exploration of unknown areas with an emphasis on generalization abilities.
• Translate the original robotics exploration problem to an RL setup, paving the way to apply off-the-shelf algorithms.
• Perform a preliminary study on various state-of-the-art RL algorithms, including A3C, PPO, Rainbow, and SAC, utilizing the human-level performance as a baseline.
• Challenge the generalization abilities of the best performing PPO-based agent by evaluating multi-dimensional difficulty settings.
• Present a side-by-side comparison with frontier-based exploration strategies.

Paper outline
The remainder of this paper is organized as follows: Section 2 presents the details of the openai-gym exploration environment, called MarsExplorer, along with an analysis of its key RL attributes. Section 3 presents the experimental analysis, ranging from the comparison of state-of-the-art RL algorithms to the evaluation against standard frontier-based exploration. Finally, section 4 summarizes the findings and draws the conclusions of this study.

Environment
This section identifies the fundamental elements that govern the family of setups that fall into the coverage/exploration class and translates them to the openai-gym framework [15]. In principle, the objective of the robot is to cover an area of interest in the minimum possible time while avoiding any non-traversable objects, the positions of which are revealed only when the robot is in close proximity [29], [30].

Setup
Let us assume that the area to be covered is constrained within a square, which has been discretized into n = rows × cols identical grid cells (1). The robot cannot move freely inside this grid, as some grid cells are occupied by non-traversable objects (obstacles). Therefore, the map of the terrain, $M$, is defined over the grid cells (2). The values of $M$ correspond to the morphology of the unknown terrain and are considered a priori unknown.
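The two definitions referred to in the text as (1) and (2) can be written compactly. What follows is a reconstruction under stated assumptions rather than the authors' exact notation: the symbol $\mathcal{G}$ for the grid is hypothetical, and the numeric values 0.3 (free space) and 1 (obstacle) are inferred from the state value set {0, 0.3, 0.6, 1} given in the State space subsection, where 0.6 is reserved for the robot.

```latex
\mathcal{G} = \{(x, y) : x \in [1, \mathit{rows}],\; y \in [1, \mathit{cols}]\} \tag{1}

M(q) =
\begin{cases}
0.3 & \text{if cell } q \text{ is free space} \\
1   & \text{if cell } q \text{ is an obstacle}
\end{cases}
\qquad q = (x, y) \in \mathcal{G} \tag{2}
```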

Action space
Keeping in mind that the movement capabilities of the robot mainly impose the discretization of the area into grid cells, the action space is defined in the same grid context as well. The position of the robot is denoted by the corresponding x, y cell of the grid, i.e. $p_a(t) = [x_a(t), y_a(t)]$. Then, the possible next positions are simply given by the Von Neumann neighborhood [31] of $p_a(t)$, i.e. the four adjacent cells $[x_a(t) \pm 1, y_a(t)]$ and $[x_a(t), y_a(t) \pm 1]$ (3).
In the openai-gym framework, the formulation above is realized by a discrete space of size 4 (North, East, South, West).

State space
With each movement, the robot may acquire some information related to the formation of the environment that lies inside its sensing capabilities, according to a lidar-like model (4), where d denotes the maximum scanning distance.
An auxiliary boolean matrix $D(t)$ is introduced to denote all the cells that have been discovered from the beginning until timestep t. $D(t)$ marks with one all cells that have been sensed and with zero all the others. Starting from a zero matrix of size rows × cols, its values are updated at each timestep by combining the previous values with the newly sensed cells (5) through ∨, the logical OR operator. The state is simply an aggregation of the acquired information over all past measurements of the robot (4). Having updated (5), the state $s(t)$ is a matrix of the same size as the grid to be explored (1), whose values are given by (6). Finally, the robot's position is declared by making the value of the corresponding cell equal to 0.6, i.e. $s_q(t) = 0.6$ for $q = p_a(t)$. Overall, the state $s(t)$ is defined as a 2D matrix that takes values from the following discrete set: {0, 0.3, 0.6, 1}.
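The sensing, discovery, and state-construction steps described above can be sketched as follows. This is a minimal illustration, not the environment's actual code. Three things are assumed: the assignment 0 = undiscovered, 0.3 = free, 1 = obstacle (the text gives only the value set and the robot value 0.6), a sensing disc of integer radius d with occlusion ignored, and the function names themselves.

```python
UNKNOWN, FREE, ROBOT, OBSTACLE = 0.0, 0.3, 0.6, 1.0  # FREE/OBSTACLE/UNKNOWN mapping is assumed


def sense(pos, rows, cols, d):
    """Cells within Euclidean distance d (an integer) of pos; line of sight is ignored."""
    r0, c0 = pos
    return {
        (r, c)
        for r in range(max(0, r0 - d), min(rows, r0 + d + 1))
        for c in range(max(0, c0 - d), min(cols, c0 + d + 1))
        if (r - r0) ** 2 + (c - c0) ** 2 <= d * d
    }


def update_discovered(D, sensed):
    """Element-wise D(t) = D(t-1) OR sensed(t); returns a new boolean matrix."""
    return [
        [D[r][c] or ((r, c) in sensed) for c in range(len(D[0]))]
        for r in range(len(D))
    ]


def build_state(M, D, pos):
    """Reveal M only where D is true, then mark the robot's cell."""
    state = [
        [M[r][c] if D[r][c] else UNKNOWN for c in range(len(M[0]))]
        for r in range(len(M))
    ]
    state[pos[0]][pos[1]] = ROBOT
    return state
```

A real lidar-like model would additionally hide cells that lie behind obstacles; that refinement is left out here to keep the sketch short.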

Reward function
Having in mind that the ultimate objective is to discover all grid cells, the instantaneous core reward, at each timestep t, is defined as the number of newly explored cells, i.e. $r_{explor}(t) = \sum_{q} D_q(t) - \sum_{q} D_q(t-1)$ (7).
Intuitively, if $\sum_{k=0}^{T} r_{explor}(k) \to n$, then the robot has explored the whole grid (1) in T timesteps. To force the robot to explore the whole area (7) while avoiding unnecessary movements, an additional penalty $r_{move} = 0.5$ per timestep is applied. In essence, this negative reward aims to distinguish among policies that lead to the same number of discovered cells but need a different number of exploration steps. Please note that the value of $r_{move}$ should be less than 1, so that it has lower priority than the exploration of a single cell.
The action space, as defined previously, may include invalid next movements for the robot, i.e., moving out of the operational area (1) or crashing into a discovered obstacle. Thus, apart from the problem at hand, the robot should be able to recognize these undesirable states and avoid them at all costs. To that end, an additional penalty $r_{invalid} = n$ is introduced for the cases where the robot's next movement leads to an invalid state. Along with such a reward, the episode is marked as "done", indicating that a new episode should be initiated.
At the other end of the spectrum, a completion bonus $r_{bonus} = n$ is given to the robot when more than β% (e.g., 95%) of the cells have been explored. Similar to the previous case, this is also considered a terminal state.
Putting everything together, the reward is defined as:
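A minimal Python sketch of how these terms might be combined is given below. The branch structure is an assumption read off the prose: an invalid move yields only the penalty, both penalties enter with a negative sign, and the bonus is added on top of the step reward when the explored fraction exceeds β. The function name and signature are hypothetical.

```python
def step_reward(newly_explored, explored_total, n, invalid,
                r_move=0.5, beta=0.95, r_invalid=None, r_bonus=None):
    """Return (reward, done) for a single timestep.

    newly_explored: cells discovered at this timestep (the core reward)
    explored_total: cells discovered since the start of the episode
    n:              total number of grid cells (rows * cols)
    invalid:        True if the chosen move leaves the grid or hits a known obstacle
    """
    # By default both terminal magnitudes equal n, as stated in the text.
    r_invalid = n if r_invalid is None else r_invalid
    r_bonus = n if r_bonus is None else r_bonus

    if invalid:
        # Terminal failure: only the penalty is applied (assumed).
        return -r_invalid, True

    reward = newly_explored - r_move
    if explored_total / n > beta:
        # Terminal success: completion bonus on top of the step reward.
        return reward + r_bonus, True
    return reward, False
```

Keeping `r_move` below 1 means that a step revealing even a single new cell still yields a positive reward, which is the priority ordering described above.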

Key RL Attributes
MarsExplorer was designed as an initial endeavor to bridge the gap between powerful existing RL algorithms and the problem of autonomous exploration/coverage of a previously unknown, cluttered terrain. This subsection presents the built-in key attributes of the designed framework.
Straightforward applicability. One of the fundamental attributes of MarsExplorer is that any learned policy can be straightforwardly applied to an appropriate robotic platform with little effort required. This is possible because the policy calculates a high-level exploration path based on the perception of the environment (6). Thus, assuming that the sensors' readings can be smoothly integrated (for example, using a Kalman filter) to represent the environment as in (6), no elaborate simulation model of the robot's dynamics is required to adjust the RL algorithm to the specifics of the robotic platform.
Terrain Diversity. For each episode, the general dynamics are determined by a specific automated process that has different levels of variation. These levels correspond to the randomness in the number, size, and positioning of obstacles, the terrain scalability (size), the percentage of the terrain that the robot must explore to consider the problem solved, and the bonus reward it will receive in that case. This procedural generation [32] of terrains allows training in multiple/diverse layouts, ultimately forcing the RL algorithm to develop generalization capabilities, which are of paramount importance in real-life applications where unforeseen cases may appear.
Partial Observability. Due to the nature of the exploration/coverage setup, at each timestep, the robot is only aware of the location of the obstacles that have been sensed from the beginning of the episode (5). Therefore, any long-term plan should be agile enough to be adjusted on the fly, based on future information about the unknown obstacles' positions. Such a property renders the acquisition of a global exploration strategy quite tricky [33].
Fast Evaluation. By stripping the environment of any irrelevant physics dynamics and focusing only on the exploration/coverage aspect (1)-(8), MarsExplorer allows rapid execution of timesteps. This feature can be of paramount importance in the RL ecosystem, where the algorithms usually need millions of timesteps to converge, as it can enable fast experimental pipelines and prototyping.

Performance Evaluation
This section presents an experimental evaluation of the MarsExplorer environment. The analysis begins with all the implementation details that are important for realizing the MarsExplorer experimental setup. For the first evaluation iteration, four state-of-the-art RL algorithms are applied and evaluated in a challenging version of MarsExplorer that requires the development of strong generalization capabilities in a highly randomized scenario, where the underlying structure is almost absent. Having identified the best performing algorithm, a follow-up analysis is performed with respect to the difficulty vector values. The learned patterns and exploration policies for different evaluation instances are further investigated and graphically presented. The analysis is concluded with a scale-up study in two larger terrains and a comparison between the trained robot and two well-established frontier-based approaches.

Implementation details
Aside from the standardization as an openai-gym environment, MarsExplorer provides an API that allows manually controlled experiments, translating commands from keyboard arrows to next movements. Such a feature can be used to assess human-level performance in the exploration/coverage problem and reveal important traits by comparing human and machine-made strategies.
The Ray/RLlib framework [34] was utilized to perform all the experiments. The fact that RLlib is a well-documented, highly robust library also eases follow-on developments (e.g., applying a different RL pipeline), as it follows a common framework. Furthermore, such an experimental setup may also leverage the interoperability with other powerful frameworks from the Ray ecosystem, e.g., Ray/Tune for hyperparameter tuning.
... [35], only with the key difference that the image is generated incrementally and based on the robot's actions. Therefore, as has become standard since the application of the DQN algorithm [36], a vision-inspired neural network architecture is incorporated as a first stage. Figure 3 illustrates the architecture of this pre-processor, which comprises 2 convolutional layers followed by a fully connected one. The vectorized output of the fully connected layer is forwarded to a "controller" architecture that depends on the RL algorithm enabled.

State-of-the-art RL algorithms comparison
Apart from the details described in the previous subsection, for the comparison study, the formation (position and shape) of obstacles was set randomly at the beginning of each episode. This choice was made to force the RL algorithms to develop novel generalization strategies to tackle such a challenging setup. The list of studied RL algorithms comprises the following model-free approaches: PPO [24], DQN-Rainbow [25], A3C [23] and SAC [26]. All hyperparameters of these algorithms are reported in Appendix A. Figure 4 presents a comparison study among the approaches mentioned above. For each RL agent, the thick colored lines stand for the episode's total reward, while the transparent surfaces around them correspond to the standard deviation. Moreover, the episode's reward (score) is normalized in such a way that 0 stands for an initial invalid action by the robot, $r_{invalid}$ in (8), while 1 corresponds to the theoretical maximum reward, which is the $r_{bonus}$ in (8) plus the number of cells. To ease the qualitative comprehension of the produced results, the average human-level performance is also reported. To approximate this value, 10 players were drawn from the pool of CERTH/ConvCAO employees to participate in the evaluation process. Each player had an initial warm-up phase of 15 episodes (non-ranked), and after that, they were evaluated on 30 episodes. The average achieved score of the 300 human-controlled experiments is depicted with a green dashed line.
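The score normalization mentioned above can be sketched as an affine rescaling between the two stated anchor points. The linear form is an assumption (the text gives only the endpoints), and the function name is hypothetical; the invalid-action penalty enters as a negative return.

```python
def normalize_score(episode_return, n, r_invalid=None, r_bonus=None):
    """Map an episode return to [0, 1] given the two anchors stated in the text.

    0 corresponds to an immediate invalid action (return = -r_invalid),
    1 corresponds to the theoretical maximum (return = r_bonus + n).
    A linear mapping between the anchors is assumed.
    """
    r_invalid = n if r_invalid is None else r_invalid
    r_bonus = n if r_bonus is None else r_bonus
    low = -r_invalid
    high = r_bonus + n
    return (episode_return - low) / (high - low)
```

With the default magnitudes (both equal to n), an episode return of 0.5 n is mapped to 0.5.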
A clear-cut outcome is that the PPO algorithm achieves the highest average episodic reward, reaching an impressive 85.8% of the human-level performance. DQN-Rainbow achieves the second-best performance; however, the average is 50.04% and 42.73% of the PPO and human-level performance, respectively.

Multi-dimensional difficulty
Having identified the best performing RL algorithm (PPO), the focus now shifts to producing some preliminary results related to the difficulty settings of MarsExplorer. As mentioned in the description of the environment, MarsExplorer allows the elements of the difficulty vector to be set independently. More specifically, the difficulty vector comprises 3 elements $[d_t, d_m, d_b]$, where:
• $d_t$ denotes the topology stochasticity, which defines the obstacles' placement on the field. The fundamental positions of the obstacles are equally arranged in a 3-column by 3-row format. $d_t$ controls the radius of deviation around these fundamental positions. As the value of $d_t$ increases, the obstacles' topology becomes less structured. $d_t$ takes values from the discrete set {1, 2, 3}.
• $d_m$ denotes the morphology stochasticity, which defines the obstacles' shape on the field. $d_m$ controls the area that might be occupied by each obstacle. The bigger the value of $d_m$, the larger the compound areas of obstacles that might appear on the MarsExplorer terrain. $d_m$ takes values from the discrete set {1, 2}.
• $d_b$ denotes the bonus rewards that are assigned for the completion ($r_{bonus}$) and failure ($r_{invalid}$) of the mission (8). For this factor only two values are allowed, {1, 2}, which correspond to the cases of providing and not providing the bonus rewards, respectively.
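To make the roles of the topology and morphology elements more concrete, the sketch below generates obstacles from a 3 × 3 arrangement of anchor positions. It is purely illustrative and does not reproduce MarsExplorer's generator: the anchor spacing, the uniform integer jitter bounded by d_t, and the rectangular obstacles with side lengths up to d_m are all assumptions, as is the function name.

```python
import random


def generate_obstacles(rows, cols, d_t, d_m, rng=None):
    """Return a set of (row, col) obstacle cells (illustrative generator only)."""
    if rng is None:
        rng = random.Random()
    obstacles = set()
    # 3 x 3 evenly spaced anchor positions (spacing rule assumed).
    anchors = [((i + 1) * rows // 4, (j + 1) * cols // 4)
               for i in range(3) for j in range(3)]
    for anchor_row, anchor_col in anchors:
        # d_t bounds how far the obstacle may deviate from its anchor.
        row = anchor_row + rng.randint(-d_t, d_t)
        col = anchor_col + rng.randint(-d_t, d_t)
        # d_m bounds the extent of the obstacle (rectangular shape assumed).
        height = rng.randint(1, d_m)
        width = rng.randint(1, d_m)
        for r in range(row, row + height):
            for c in range(col, col + width):
                if 0 <= r < rows and 0 <= c < cols:
                    obstacles.add((r, c))
    return obstacles
```

Under these assumptions, d_t = 1 and d_m = 1 keep nine single-cell obstacles close to a regular lattice, whereas larger values let obstacles drift and merge into compound shapes.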
Higher values in the elements of the difficulty vector correspond to less structured behavior in the obstacles' formation. Thus, an agent that has been successfully trained in higher difficulty setups may exhibit increased generalization abilities. Overall, the combination of the aforementioned elements' domains generates 12 difficulty levels. Figure 5 shows the evolution of the average episodic reward for each one of the 12 levels during the training of the PPO algorithm. To improve the readability of the graphs, the results are organized into 3 graphs, one for each level of $d_t$, with 4 plot lines each.
A study of the learning curves reveals that $d_m$ has the largest effect on the learned policy. The blue and red lines (cases where $d_m = 1$), in all three figures, demonstrate a similar convergence rate and also the highest-performance policies. However, a serious degradation in the results is observed in the purple and gray lines ($d_m = 2$). As expected, when $d_m = 2$ and $d_t = 3$ (purple and gray lines in figure 5c), the final achieved performance reached only slightly above 0.6 on the normalized scale. $d_b$ does not seem to affect the overall performance much, at least up to this level of difficulty, apart from the convergence rate depicted in the gray line of figure 5c.
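The count of 12 levels and their grouping into 3 graphs of 4 lines each follows directly from the Cartesian product of the three value sets, as the short sketch below shows (variable names are illustrative).

```python
from itertools import product

D_T = (1, 2, 3)  # topology stochasticity
D_M = (1, 2)     # morphology stochasticity
D_B = (1, 2)     # bonus rewards provided / not provided

# Every combination [d_t, d_m, d_b] of the difficulty vector.
difficulty_levels = list(product(D_T, D_M, D_B))

# Group the levels per d_t value, mirroring the one-graph-per-d_t layout.
by_topology = {
    d_t: [level for level in difficulty_levels if level[0] == d_t]
    for d_t in D_T
}
```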

Learned policy evaluation
This section is devoted to the characteristics of the learned policy from the PPO algorithm. For each of the 12 levels of difficulty defined in the previous section, the best PPO policy was extracted and evaluated in a series of 100 experiments with randomly (controlled by the difficulty setting) generated obstacles. Figure 6 presents one heat map for each difficulty level. Blue colormap corresponds to the frequency of the robot visiting a specific cell of the terrain. Green colormap corresponds to the number of detected obstacles in each position during the robot's exploration.
A critical remark is that, for each scenario, the arrangement of discovered obstacles matches the drawn distribution as described in the previous subsection, implying that the learned policy does not have any "blind spots".
Examining the heatmap of the trajectories in each scenario, it is clear that the same family of trajectories has been generated in all cases and with great confidence. The important conclusion here is that this pattern is the first order of the Hilbert curve, which has been utilized extensively in the space-filling domain (e.g., [37], [38]). Note that such a pattern has neither been imported into the simulator nor rewarded when achieved by the RL algorithm; rather, the algorithm learned that this is the most effective strategy by interacting with the environment.
It would be an omission not to mention the learned policy's ability to adapt to changes in the obstacles' distribution and, ultimately, find the most efficient obstacle-free route. This trait can be observed more clearly in subfigures 6k and 6l, where the policy needed to be extremely dexterous and delicate to avoid encounters with obstacles.

Comparison with frontier-based methodologies for varying terrain sizes
Two well-established frontier-based approaches [7] were employed to position the achieved PPO policy in the context of non-learning approaches. In these frontier-based approaches, the exploration policy is divided into two categories based on the metric to be optimized:
• Cost: the next action is chosen based on the distance from the nearest frontier cell.
• Utility: the decision-making is governed by a frequently updated information potential field.
Figure 7 summarizes the results of this evaluation study by presenting the average exploration time for each algorithm (PPO, cost frontier-based, utility frontier-based) over 100 procedurally generated runs. A direct outcome is that the learning-based approach requires the robot to travel less distance to explore the same percentage of terrain as the non-learning approaches. The final remark is devoted to the "knee" that can be observed in almost all the final stages of the non-learning approaches. Such behavior is attributed to several distant sub-parts of the terrain remaining unexplored, the exploration of which requires this extra effort. On the contrary, the learning-based approach (PPO) seems to handle this situation quite well, not leaving such expensive-to-revisit regions unexplored along its exploration path.
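For readers unfamiliar with the cost-based variant, the sketch below shows one generic way to pick the nearest frontier cell on the state matrix. It is not the implementation evaluated here. Two definitions are assumed: a frontier cell is a discovered, traversable cell adjacent to an undiscovered one, and distance is the breadth-first path length over discovered free cells. The value conventions (0 for undiscovered, 1 for obstacle) and the function names are assumptions as well.

```python
from collections import deque

NEIGHBOURS = ((-1, 0), (0, 1), (1, 0), (0, -1))


def frontier_cells(state, unknown=0.0, obstacle=1.0):
    """Discovered, traversable cells that border at least one undiscovered cell."""
    rows, cols = len(state), len(state[0])
    found = set()
    for r in range(rows):
        for c in range(cols):
            if state[r][c] in (unknown, obstacle):
                continue
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and state[rr][cc] == unknown:
                    found.add((r, c))
                    break
    return found


def nearest_frontier(state, start, unknown=0.0, obstacle=1.0):
    """Breadth-first search from start; returns (cell, path_length) or None."""
    targets = frontier_cells(state, unknown, obstacle)
    rows, cols = len(state), len(state[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        cell, dist = queue.popleft()
        if cell in targets:
            return cell, dist
        for dr, dc in NEIGHBOURS:
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in seen
                    and state[nxt[0]][nxt[1]] not in (unknown, obstacle)):
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

A greedy rule of this kind always heads for the closest frontier, which is one way distant pockets of unexplored terrain can be left for the end of a run, consistent with the "knee" discussed above.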

Conclusions
A new openai-gym environment called MarsExplorer, which bridges the gap between reinforcement learning and real-life exploration/coverage in the robotics domain, is presented. The environment transforms the well-known robotics problem of exploration/coverage of a completely unknown region into a reinforcement learning setup that can be tackled by a wide range of off-the-shelf, model-free RL algorithms. An essential feature of the whole solution is that trained policies can be straightforwardly applied to real-life robotic platforms without being trained/tuned to the robot's dynamics. To achieve that, the same level of information abstraction between the robotic system and MarsExplorer is required. A detailed experimental evaluation was also conducted and presented. Four state-of-the-art RL algorithms, namely A3C, PPO, Rainbow, and SAC, were evaluated in a challenging version of MarsExplorer, and their training results were also compared with the human-level performance for the task at hand. The PPO algorithm achieved the best score, which was also 85.8% of the human-level performance. Then, the PPO algorithm was utilized to study the effect of changes in the multi-dimensional difficulty vector on the overall performance. The visualization of the paths for all these difficulty levels revealed an important trait: the PPO policy has learned to perform a Hilbert curve, with the extra ability to avoid any encountered obstacle. Lastly, a scalability study clearly indicates the ability of RL approaches to be extended to larger terrains, where the achieved performance is validated against non-learning, frontier-based exploration strategies.