A Deep Reinforcement Learning Approach for Active SLAM

In this paper, we formulate the active SLAM paradigm in terms of model-free Deep Reinforcement Learning, embedding the traditional utility functions based on the Theory of Optimal Experimental Design in the rewards, thereby relaxing the intensive computations of classical approaches. We validate this formulation in a complex simulation environment, using a state-of-the-art deep Q-learning architecture with laser measurements as network inputs. Trained agents become capable not only of learning a policy to navigate and explore in the absence of an environment model, but also of transferring their knowledge to previously unseen maps, which is a key requirement in robotic exploration.


Introduction
Simultaneous Localization and Mapping (SLAM) refers to the problem of incrementally building the map of a previously unseen environment while at the same time localizing the robot within it. Since its formulation more than three decades ago, it has significantly drawn the attention of the robotics community, and numerous approaches have been developed, based either on filtering techniques or on optimization methods where inertial and visual measurements may be jointly optimized. See [1][2][3][4] and references therein.
Active localization was first introduced by Burgard et al. [5], where it was proven that picking actions so as to minimize the localization uncertainty results in a better localization than a passive approach (e.g., Markov localization or adaptive Monte Carlo localization). Active SLAM extends this idea to the SLAM problem, and it can be defined as the paradigm of controlling a robot that is performing SLAM so as to reduce the uncertainty of its localization and of the map's representation [6,7]. Typically, it consists of three stages [8]: (i) the identification of all possible locations to explore (ideally infinite), (ii) the computation of the utility or reward generated by the actions that would take the robot from its current position to each of those locations, and (iii) the selection and execution of the optimal action. In the second step, the utility of each action is traditionally computed by quantifying the uncertainty in the estimation of the two target random variables: the robot's pose and the map's representation. This quantification of the covariance matrix can be done on the basis of either the Theory of Optimal Experimental Design (TOED) or Information Theory (IT). According to the former, four optimality criteria can be derived from the generic covariance matrix Σ ∈ R^(ℓ×ℓ) of the state vector x = (x_1, ..., x_ℓ)^T ∈ R^ℓ, with eigenvalues λ_1, ..., λ_ℓ:

• T-optimality criterion: captures the average variance, T-opt = (1/ℓ) ∑_{k=1}^{ℓ} λ_k.
• A-optimality criterion: captures the harmonic mean variance, which is sensitive to a lower-than-average value, A-opt = ((1/ℓ) ∑_{k=1}^{ℓ} λ_k^{-1})^{-1}.
• D-optimality criterion: captures the whole covariance (hyper)ellipsoid, D-opt = exp((1/ℓ) ∑_{k=1}^{ℓ} log λ_k).
• E-optimality criterion: captures the minimum eigenvalue, E-opt = min_k λ_k.
Within IT, Shannon's entropy (or its variation) is the most commonly used metric, which can be expressed for an ℓ-dimensional multivariate Gaussian distribution, X ∼ N(µ, Σ), as H(X) = (1/2) log((2πe)^ℓ det(Σ)). The proper choice of criterion is still an open research issue. Since each of them captures a different metric of Σ, they neither lead to the same behavior nor require the same computational resources. It is worth noticing that every criterion is right-unbounded and that the following relationship holds for every well-defined covariance matrix: E-opt ≤ A-opt ≤ D-opt ≤ T-opt. Moreover, the robot's state space representation (using quaternions, rotation matrices or Lie algebras) and the uncertainty representation have recently been shown to be of utmost importance for the properties of the criteria, and therefore for the performance of the algorithm [9]. While differential state representations allow the use of any criterion, absolute representations can only rely on D-opt (or, equivalently, Shannon's entropy), since the other criteria are unable to capture a true metric of the covariance matrix or to ensure key properties such as monotonicity during exploratory trajectories. Further discussion about the monotonicity of the optimality criteria under Lie algebra representations can be found in Appendix A, where the proof of their monotonicity under differential uncertainty representations is presented.
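To make the criteria concrete, the following minimal NumPy sketch (an illustration, not code from the paper) computes the four criteria and Shannon's entropy from an arbitrary covariance matrix; the example matrix is chosen arbitrarily:

```python
# Minimal NumPy sketch of the TOED criteria and Shannon's entropy of a
# covariance matrix Sigma; illustration only, not the paper's code.
import numpy as np

def toed_criteria(sigma):
    l = sigma.shape[0]
    lam = np.linalg.eigvalsh(sigma)           # eigenvalues of the covariance
    t_opt = lam.mean()                         # arithmetic mean (T-opt)
    a_opt = 1.0 / np.mean(1.0 / lam)           # harmonic mean (A-opt)
    d_opt = np.exp(np.mean(np.log(lam)))       # geometric mean (D-opt)
    e_opt = lam.min()                          # minimum eigenvalue (E-opt)
    entropy = 0.5 * np.log((2 * np.pi * np.e) ** l * np.linalg.det(sigma))
    return t_opt, a_opt, d_opt, e_opt, entropy

# E-opt <= A-opt <= D-opt <= T-opt holds for any well-defined covariance.
print(toed_criteria(np.diag([0.1, 0.2, 0.4])))
```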
Aside from traditional methods, such as particle filters [10] or model predictive control [11], where frontier-based or random-tree strategies are generally used, the problem of selecting the most informative action in robotic exploration has lately been addressed with Deep Neural Networks (DNN), often in the form of Convolutional Neural Networks (CNN). Notably, Deep Reinforcement Learning (DRL) is well suited for such problems, in which the agent must learn by directly interacting with the environment under partial observability and where its a priori knowledge is nonexistent. Thus, some DRL algorithms and strategies originally developed for other decision making problems (e.g., video games) have already been successfully applied to robotic navigation and exploration, such as Deep Q-Networks (DQN) [12] and their double [13], dueling [14] and Rainbow [15] successors, the Asynchronous Advantage Actor-Critic (A3C) architecture [16], the Soft Actor-Critic (SAC) [17] or the Prioritized Experience Replay (PER) buffer [18].
A first example of those methods appears in [19], where Tai and Liu used a large CNN to extract the feature maps of a front-view camera, which were later fed into a 2-layer DQN capable of selecting the best next action to execute. In this case, experiments were carried out in a complex simulator, but training only converged for topologically simple scenarios (e.g., a straight corridor). Moreover, the same simple scenarios were used in both training and testing stages, leaving the agent's generalization ability unclear. The work in [20] was among the first to show that navigation via DRL was also achievable using the A3C algorithm, notably improving convergence speed with respect to DQN. However, the acquired knowledge was never tested in environments different from the training one, hence the environment was again known in all cases. A critical generalization experiment, though, was carried out in [21], showing that trained agents do not perform as well in previously unseen maps, and could even exhibit behaviors comparable to random agents in certain scenarios. A similar approach is shown in [22], in which the obstacle avoidance problem is tackled using the D3QN architecture together with a FastSLAM back-end to map the environment and localize the robot within it.
The aforementioned works used simple extrinsic rewards to encourage obstacle avoidance. However, once navigation was proved to be achievable, motivation and curiosity capabilities [23] were added to the agents in order to expand the problem to robotic exploration. While some approaches, frequently called curiosity-driven, encourage the agent to visit user-defined or novel states [24] or to maximize the coverage of a known map [25], others motivate those actions that minimize the environment's uncertainty (i.e., that maximize the agent's knowledge of the environment). The latter, which will be referred to as uncertainty-based, have commonly been addressed by encouraging the visit of those states which are more difficult to predict [26,27], although they have recently been tackled on the basis of true uncertainty metrics. In [28,29], actions are selected so as to maximize the entropy reduction (information gain or Kullback-Leibler divergence) of a 2-dimensional occupancy grid map, either by learning in a supervised fashion with labelled cells or by introducing a metric of it in the DRL reward function, respectively. Another approach to solve the active localization problem that also acts directly on the reward function design was proposed in [30,31], where the entropy reduction was achieved by including in it the maximum likelihood (ML) of being in any state (belief accuracy). Recently, Chen et al. [32] trained both DQN and A2C agents using the underlying SLAM pose-graph through Graph Neural Networks (GNN). In this case, they combined DRL with exploration graphs in order to achieve true exploration policies based on the map's uncertainty (T-optimality), outperforming random and nearest-frontier agents. They showed good generalization results with higher dimensional state spaces. However, the simulation environment used was a simple grid world in which obstacle avoidance was not needed and only landmark positions were changed.
In the literature, experiments are usually conducted in extremely simplified simulation environments, such as grid worlds, where the agent moves between cells in a discrete fashion and sensors are never modelled. In these cases, the inputs to the DNN range from simple occupancy-grid maps [33] to combinations of frontier and robot ground-truth locations [32,34]. More complex and realistic scenarios, although seldom used, are required when images [19,22] or laser measurements [35] are fed to the network. In such cases, it is worth noticing the complexity tackled: not only in the interaction between the simulator and the training algorithm, but also in the physics of the robot's movement, the environment itself and the sensors modelled (and thus their probabilistic behavior). Knowledge transfer to real environments is yet to be evaluated due to the stage these approaches are at, and only a few of them have effectively addressed it for navigation, not exploration, purposes [19,35].
Over the past few years, the basics of DRL have proved to have great potential in navigation, mapping and even exploration tasks. In this work, we aim to study that potential for the active SLAM decision making problem. In this way, one of the main drawbacks of traditional active SLAM methods is relaxed by transferring the intensive computations to the DNN training phase, requiring only feed-forward propagation during evaluation. The paradigm is formulated as a multi-reward DRL problem, on the basis of classical multi-objective optimization approaches and TOED. A state-of-the-art DQN is embedded within a Robot Operating System (ROS) framework on top of the Gazebo simulator. Thus, an uncertainty-based reward and a complex simulator are jointly taken into account. Trained agents become capable of making quasi-optimal decisions in order to navigate and explore an environment, just from raw laser measurements. Transfer of the acquired knowledge is tested in previously unseen scenarios to assess the agents' generalization ability. Also, traditional extrinsic rewards and uncertainty-based rewards are compared to evidence the value of the proposed approach. Table 1 summarizes the differences between our approach and the other mentioned literature. It can be noticed that true uncertainty metrics are needed in order to tackle active SLAM.

Table 1. Comparison with prior methods.

Work                      Task                   Algorithm   Reward
Tai and Liu [19]          Navigation             DQN         Extrinsic
Mirowski et al. [20]      Navigation             A3C         Extrinsic
Wen et al. [22]           Navigation             D3QN        Extrinsic
Zhelo et al. [26]         Goal navigation        A3C         Intrinsic
Oh and Cavallaro [27]     Goal navigation        A3C         Intrinsic
Tai et al. [35]           Goal navigation        ADDPG       Intrinsic
Zhu et al. [33]           Frontier selection     A3C         Intrinsic
Niroui et al. [34]        Frontier selection     A3C         Intrinsic
Gottipati et al. [31]     Active localization    A2C         Posterior belief
Chaplot et al. [30]       Active localization    A3C         Posterior belief
Chen et al. [29]          Exploration            DQN         Entropy
Chen et al. [32]          Active SLAM            DQN, A2C    T-opt
Ours                      Active SLAM            D3QN        D-opt

This section provided a brief but comprehensive introduction to the problem tackled. The rest of the paper is organized as follows. In Section 2, the active SLAM problem is formulated and the proposed DRL approach is presented. The experimental setup (consisting of the simulation, SLAM and decision making modules) and the results obtained, together with their discussion, are shown in Sections 3 and 4. Finally, the conclusions are drawn in Section 5.

Traditional Approach
The active SLAM problem falls within a wider mathematical framework, Partially Observable Markov Decision Processes (POMDP), which generalize such decision making processes under both action and observation uncertainties. A POMDP is formally defined as a 7-tuple (S, A, T, Ω, O, R, γ) consisting of the agent's finite state space S, a finite set of actions A, a transition function between states T : S × A → Π(S), where Π(S) is the probability distribution over S, a finite observation space Ω and its conditional probabilities O : S × A → Π(Ω), a reward function R : S × A → R, and the discount factor γ, which allows working with finite rewards even when planning has an infinite time horizon. At every time step, the agent selects an action to execute a_t ∈ A based on the current policy π, generating a transition from s_t to s_{t+1} ≡ s', both contained in S, where a new observation o_t ∈ Ω is made. The goal of the agent is to find the optimal policy π* that maximizes a metric of the reward function, usually defined as the future expected discounted reward R_t = ∑_{t=0}^{∞} γ^t R(s_t, a_t). POMDP resolution can be based on several methods, such as Monte Carlo, dynamic programming or temporal differences. Reinforcement Learning (RL) algorithms, based on the latter and on the Bellman equations, solve the problem by iterating either directly over the policy or over a value function. This function is simply a mapping to real numbers from the value of being at a certain state or, in a more complex fashion, from the value of being at a certain state and executing a certain action according to π; these are denoted V(s) and Q(s, a), respectively. Both value- and policy-based approaches allow solving a POMDP when the process itself (i.e., T and O) is neither known nor modelled (model-free).
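As a toy illustration of the discounted return defined above (the reward values below are arbitrary assumptions):

```python
# Toy sketch of the discounted return R_t = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

print(discounted_return([1.0, 1.0, -10.0]))  # e.g., two safe steps, then a collision
```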
In any case, RL algorithms are just recurrences over the expected rewards, as shown in Algorithm 1 for Q-learning or, more specifically, in Equation (6), the Q-function's Bellman update. The design of the reward function is thus extremely important in order to achieve the desired goal, just as it is for cost functions in optimization problems.
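For reference, a minimal tabular sketch of the Bellman update referenced as Equation (6) is shown below; the learning rate α and the dictionary-based table are illustrative assumptions, not the paper's implementation:

```python
# Tabular Q-learning update (Bellman update, cf. Equation (6)); illustration only.
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99        # assumed learning rate and discount factor

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```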
Furthermore, active SLAM can be formulated as a multi-objective optimization problem [7] where the cost function J must contain the cost of performing a collision-free trajectory (F) and a metric of the covariance matrix (U), which represents the uncertainties in the estimates of the robot's pose and the map. Both terms should be scaled by task-dependent coefficients, and can be considered over an n-step horizon. Such a formulation can be straightforwardly generalized to the RL problem just by designing the reward function accordingly (see the sketch below), where the first term is a fully extrinsic reward that accounts for collision-free trajectories (i.e., navigation), and the second one is an intrinsic reward that can be defined inversely proportional to a metric of the covariance matrix f(Σ), e.g., the aforementioned utility functions. This definition motivates the robot not only to navigate the environment but also to explore it, in the sense of improving the estimates of the map's representation and of the robot's localization on it.
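Since the explicit expressions are not reproduced here, the following is a hedged sketch of the structure just described: a scaled two-term cost over an n-step horizon and its reward counterpart; the symbols α, β, η and the exact form of the intrinsic term are assumptions, not the paper's equations.

```latex
% Assumed structure of the multi-objective cost and the corresponding reward.
J = \sum_{k=t}^{t+n} \Big( \alpha\, \mathcal{F}(s_k, a_k) + \beta\, \mathcal{U}(\Sigma_k) \Big),
\qquad
R(s_t, a_t) = R_{\mathcal{F}}(s_t, a_t) + \eta\, R_{\mathcal{U}}(s_t, a_t),
\qquad
R_{\mathcal{U}} \propto \frac{1}{f(\Sigma_t)} .
```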

Neural Network Approach
In the deep Q-learning formulation, the analytical computation of the action-value function (Q-function) is approximated by a DNN whose coefficients θ are iteratively updated using the well-known back-propagation algorithm [36]. Let the loss function of this new optimization problem be the quadratic loss, L(θ) = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)², where n is the batch size, ŷ = Q(s, a|θ) is the Q-value estimated by the DNN, and y is the target Q-value, which can be computed using an auxiliary identical neural network with parameters θ*, following [13], as y = r + γ Q(s', argmax_a Q(s', a|θ) | θ*). Therefore, just by feeding the reward function from Equation (8) into the previous equation, the neural network is recursively taught to visit those states that contribute to an uncertainty reduction. For the sake of clarity, Algorithm 2 shows the proposed learning process, using a dueling double DQN with PER embedded in the Gazebo simulation environment.
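A minimal NumPy sketch of the double-DQN target and quadratic loss described above (array shapes and function names are illustrative assumptions):

```python
# Double-DQN target and quadratic loss; illustrative sketch, not the paper's code.
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, terminals, gamma=0.99):
    # Actions are selected with the online network and evaluated with the target one.
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - terminals) * next_values

def quadratic_loss(y, y_hat):
    # L = (1/n) * sum_i (y_i - y_hat_i)^2, with n the batch size.
    return np.mean((y - y_hat) ** 2)
```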
Algorithm 2 Dueling Double Deep Q-learning with PER in simulation environment.

Environment Setup
In order to evaluate the proposed approach, several experiments have been conducted in the Gazebo robotics simulator (http://gazebosim.org/). A Turtlebot robot has been used, consisting of two differential wheels and a modelled laser sensor that generates n rays equally distributed over the 180° front field of view of the robot, each of them with a minimum range of 0.1 m, a maximum range of 10 m, two-digit decimal precision and a Gaussian noise model with zero mean and small variance. Three different environments (Figure 1), adapted from [37], have been used during the training and testing stages. The first one, where the agent is trained, consists of a simple maze with 90° turns. The second and third environments have a more complex topology, including consecutive turns and dead ends, and have been used only during testing (i.e., the agents do not improve their policy anymore, see Figure 2). A training/testing stage consists of a certain number of episodes, in which the robot moves until either a collision occurs (i.e., it gets closer to an obstacle than a defined threshold of 0.2 m) or 500 steps (i.e., decisions) are consumed. A ROS framework has been used to connect the simulation environments with the SLAM algorithm and the decision making module, using the GymGazebo library (https://github.com/erlerobot/gym-gazebo), based on OpenAI Gym (https://gym.openai.com/). At every step, the DNN receives the n-dimensional sensed vector and estimates the Q-values of each possible action, namely going forward, turning right and turning left, where v is the linear velocity in m/s and ω is the angular velocity in rad/s; v = 0 in a_2 and a_3 to avoid continuous turning on the same spot. Using a fully extrinsic approach, after each decision the environment rewards the agent with a large negative value if a collision occurred, or with a small positive value otherwise. Note that the reward values have been designed empirically, and that the penalty on nonzero angular velocities prevents continuous spins and uneven trajectories.
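As an illustration of such a fully extrinsic reward, the sketch below uses assumed numerical values; the paper only specifies a large negative value on collision, a small positive value otherwise, and a penalty on nonzero angular velocity:

```python
# Illustrative extrinsic reward; the numeric values are assumptions, not the paper's.
def extrinsic_reward(min_laser_range, angular_velocity, collision_threshold=0.2):
    if min_laser_range < collision_threshold:
        return -200.0                  # large negative reward on collision
    reward = 1.0                       # small positive reward otherwise
    if angular_velocity != 0.0:
        reward -= 0.5                  # penalize spinning to smooth trajectories
    return reward
```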
Following Equation (8), the proposed uncertainty-based approach requires augmenting this reward, where η is a task-dependent scale factor and f(Σ) is the D-optimality criterion. Since this criterion, like any of the presented criteria, is right-unbounded, the tanh(·) function has been used to bound it in the interval [0, 1] and thus to maintain a balance between the collision-free trajectory term and the uncertainty term.
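One plausible form of the augmented reward of Equation (12) is sketched below; how the bounded term enters the sum and the default value of η are assumptions:

```python
# Augmented reward sketch: extrinsic term plus a tanh-bounded uncertainty term.
import numpy as np

def d_optimality(sigma):
    lam = np.linalg.eigvalsh(sigma)
    return np.exp(np.mean(np.log(lam)))     # geometric mean of the eigenvalues

def augmented_reward(r_extrinsic, sigma, eta=1.0):
    # Bound the right-unbounded criterion to [0, 1] and reward low uncertainty.
    bounded = np.tanh(eta * d_optimality(sigma))
    return r_extrinsic + (1.0 - bounded)
```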

SLAM
From the previous formulation, it is clear that recovering the covariance matrix is a key requirement. In this sense, a slightly modified gmapping [2] algorithm runs in the background during training, re-initializing its map every episode and communicating the covariance estimate to the learning module via the ROS architecture (see Algorithm 2). Although other SLAM algorithms (e.g., [4,38]) would be able to compute more accurate state and covariance estimates, gmapping has been chosen because of its simplicity and low computational load, requirements considered essential for the stage this study is at.
Every time the algorithm processes a laser scan, it recovers the distribution of the particles' state as a Gaussian with mean µ ∈ R^3 and covariance Σ ∈ R^(3×3):

µ = ∑_{i=1}^{n_p} w̄_i x_i,   Σ = ∑_{i=1}^{n_p} w̄_i (x_i − µ)(x_i − µ)^T

where n_p is the total number of particles, the index i represents a single particle, w̄_i = w_i / ∑_{j=1}^{n_p} w_j is its normalized weight and x_i = (x_i, y_i, θ_i)^T is its state vector. After its retrieval, the covariance matrix is referenced to the previous pose following the absolute formulation available in [9], and published in ROS.
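The weighted mean and covariance above can be computed directly from the particle set, as in the following NumPy sketch (independent of the actual gmapping implementation; angle wraparound is ignored for simplicity):

```python
# Weighted mean and covariance of the particle set (x, y, theta); NumPy sketch.
import numpy as np

def particle_gaussian(states, weights):
    # states: (n_p, 3) array of particle states; weights: (n_p,) raw weights.
    w = weights / weights.sum()                    # normalized weights
    mu = (w[:, None] * states).sum(axis=0)         # weighted mean, shape (3,)
    diff = states - mu
    sigma = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return mu, sigma                               # sigma has shape (3, 3)
```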
The main parameters of the algorithm have been tuned to achieve a balance between performance and low computational load. Since each particle has to carry its own map, only five particles have been used. The minimum effective number of particles, a key parameter to detect loop closures, has been set to one-fourth of the total number of particles.

Decision Making Module
Hitherto, the simulation environment and the SLAM algorithm have been described. Decision making is the third and last module of the approach, and it is carried out via deep Q-learning, due to the high computational load that multi-agent training (e.g., A3C) would require in Gazebo.
First of all, a vanilla DQN and a double DQN (DDQN henceforth) have been implemented in TensorFlow. The networks contain two hidden layers, each of them with 24 hidden units and LeakyReLU activations (f(x) = 0.01x for x < 0), and a linear output layer with a number of units equal to the dimension of the action set. Secondly, a dueling double DQN architecture (D3QN henceforth) with PER has been implemented, akin to the state-of-the-art network described by Hessel et al. [15]. It contains two parallel, separate streams that encode the value function V(s) and the advantage function A(s, a). Each stream contains only one hidden layer, with the same structure as in the previous networks, and an intermediate linear output layer (1-dimensional for V and 3-dimensional for A). Both streams are then non-trivially aggregated to generate the estimate Q(s, a) in the output layer, again of dimension equal to the size of the action set. In addition, the weights of this network have been non-randomly initialized to known values that encode a reasonable policy, so as to speed up convergence. In all cases, the target network is hard-updated every 10,000 steps, the training algorithm uses the RMSProp optimizer (ε = 1 × 10^−6, ρ = 0.9), and the agents follow an ε-greedy policy with ε decreasing over the interval (1, 0.02] during the first 50 training episodes. This policy is nevertheless changed to a greedy one during testing. The main parameters of the learning algorithm and the simulator are listed in Table 2.
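A minimal TensorFlow/Keras sketch of such a dueling architecture is given below; it illustrates the described structure rather than the exact network used, and the input dimension is an assumption:

```python
# Dueling DQN head sketch in TensorFlow/Keras: two streams aggregated into Q(s, a).
import tensorflow as tf
from tensorflow.keras import layers

def build_d3qn(n_inputs, n_actions=3):
    inputs = layers.Input(shape=(n_inputs,))
    # Two parallel streams, each with one hidden layer of 24 LeakyReLU units.
    v = layers.Dense(24)(inputs)
    v = layers.LeakyReLU(alpha=0.01)(v)
    v = layers.Dense(1)(v)                      # state-value stream V(s)
    a = layers.Dense(24)(inputs)
    a = layers.LeakyReLU(alpha=0.01)(a)
    a = layers.Dense(n_actions)(a)              # advantage stream A(s, a)
    # Non-trivial aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    q = layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([v, a])
    return tf.keras.Model(inputs, q)

model = build_d3qn(n_inputs=10)                 # assumed number of laser readings
model.compile(optimizer=tf.keras.optimizers.RMSprop(rho=0.9, epsilon=1e-6),
              loss="mse")
```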

Reinforcement Learning
To conceptually prove the validity of the approach, a preliminary, simple experiment has been performed in which decision making is done via tabular Q-learning. In this case, only five laser measurements have been used, to reduce the computational complexity of this tabular approach, and the reward function of Equation (12) has been modified so that f(Σ) equals Shannon's entropy and η = 1. The agent has been pre-trained in the first scenario to learn some general skills and then trained again in the second environment, which is more complex. Figure 3 shows the entropy evolution along episodes during the training phase for the traditional (left) and uncertainty-based (right) approaches. Color maps indicate the logarithmic occurrence of entropy (darker means lower). In the left figure, the entropy trend is erratic and any improvement is achieved accidentally, since the algorithm does not account for its minimization. The right figure, however, shows a decreasing entropy trend as training progresses, which matches the proposed formulation.

Deep Reinforcement Learning
Once the suitability of the approach has been conceptually proved, it has been extended to include the benefits of deep learning. First of all, the three agents (DQN, DDQN and D3QN) have been trained in the simplest environment with an extrinsic reward, in order to compare their performance, study their generalization ability and show that exploration is not achievable with such a formulation. Figure 4 contains their training curves (i.e., the cumulative reward after each episode of training) after roughly 30 h and 500 episodes (all experiments have been performed on a laptop with an Intel Core i7-7500U CPU, an Nvidia Quadro M520 (2 GB) GPU and 32 GB of RAM). Light-colored curves correspond to raw data while bold ones correspond to outlier-filtered moving averages. The DQN agent, depicted in blue, shows permanent instabilities due to memory saturation with useless experiences, resulting in continuous cycles of forgetting and relearning similar policies. The DDQN agent (red) mitigates them, although the converged maximum values are slightly lower (i.e., policies are less optimal and thus trajectories become rougher). Finally, the D3QN curve (green) has a significantly more stable behavior, attributable to the PER, and also higher converged maximum values. Note that, in this case, convergence is much faster because of the non-random initialization of the network weights. Table 3 shows the performance of each agent in the three scenarios during testing (generalization), in terms of success ratio (SR, i.e., the fraction of trials in which the agent senses more than 95% of the map at least once, where 95% is used instead of 100% due to constraints of the laser sensor), mean number of steps per episode and mean reward per episode. In addition, results have been averaged over five experiments with different simulation random seeds, and their standard deviation is presented in parentheses. The D3QN results are the most remarkable in the first environment, as it outperforms the reward that a human would obtain by manually controlling the robot (≈350). In the second environment, both DDQN and D3QN show good behavior. Although DDQN has a higher SR (and thus more mean steps), the higher mean reward obtained by D3QN proves the generation of more optimal trajectories: smoother movements and fewer spins. The third environment presents quite a challenge, not only for being unknown but also because of its complex topology. None of the agents trained with R_F is capable of traversing the whole map, leaving some areas such as the top right corner and the middle dead end unvisited (see Figure 1f). Besides that, the navigation performance of D3QN is exceedingly superior, with mean rewards several times higher than the ones obtained by DQN and DDQN.
The three trained agents performed well during generalization in the second environment, due to its similarity to the training one. After training, the agents have learned when to go forward and when to steer once an obstacle is close enough. As the complexity of the agent's architecture increases, the complexity of the learned skills grows too. For instance, it can be noticed how the D3QN agent learned the exact moment to steer so as to maximize rewards (i.e., minimize the traveled distance). In this sense, the second scenario keeps this topology and the agent is thus capable of transferring the learned skills. The third scenario, however, is more challenging. Several dead-end corridors appear, as well as a corridor with two possible ways. Therefore, the agent must be capable of choosing the right path or, at least, of returning when a wrong action is selected. Note that these are skills that were not taught in the first scenario. In the third environment, only the D3QN agent is capable of generalizing the learned skills to those needed in order to complete it.
For the purpose of showing the strong effect of not having a priori knowledge of the environment, a second experiment has been conducted in which the simplest trained agent was allowed to store information about the second environment during a few episodes before being tested on it (denoted DQN† in Table 3). Results of this retraining stage are shown in Figure 5. In the beginning, the agent worsens its performance while the network weights are adjusted but, after about 65 episodes, it is capable of outperforming agents with a more complex architecture, as also shown in Table 3. It is also worth noticing how the standard deviation of the reward significantly increases when the environment is unknown to the agents, since the input states are unfamiliar. This experiment has not been performed in the first scenario, since the DQN agent has already been trained on it. In the third environment, the amount of warm-up steps required for the neural network to adapt was too high, so it is not presented either. Despite demonstrating a superior behavior, the agent would be learning a whole new policy based on the information gathered in the new scenario and forgetting the old one.

Table 3. Mean evaluation results in all environments for the DQN, DDQN and D3QN agents trained with R_F. Superscript † refers to an agent that was allowed to train briefly before its evaluation on that scenario, while ‡ refers to an agent retrained with R_aug.

Env.  Agent  SR (%)  Steps  R
Lastly, an experiment has been carried out using the reward proposed in Equation (12). The previous D3QN agent has been trained again in the first environment during 250 episodes (reaching a tradeoff between performance and low training time). Evaluation results are shown in Table 3 under the D3QN‡ notation. Note that, for comparison purposes, the mean rewards shown correspond only to the extrinsic term of R_aug, although the complete function was used during training. Although in the first environment the mean extrinsic reward is significantly lower, in the second one the agent improves on the performance of its predecessors, achieving a 100% SR and obtaining mean rewards close to the ones that an agent with previous knowledge of the environment would obtain. In the third environment, the increase in mean steps and reward is also remarkable. However, the most notable fact is the nonzero SR, meaning that the agent is capable of visiting the whole scenario on some occasions: a fact that no agent had achieved before, due to the topological complexity of the environment and since exploration was not encouraged until now. In all cases, the inclusion of R_U has led to the selection of more optimal actions in terms of navigation and robotic exploration, but also to a significant increase in the standard deviation. This could be mitigated with longer training stages, since the optimization problem becomes more complex with multi-reward functions.
The previous data have proved the effectiveness of the proposed approach only in terms of navigation, though. Next, it is shown how uncertainty evolves with both the traditional and the proposed methods. Figure 6 shows the evolution of the D-optimality criterion during evaluation of the D3QN (red) and D3QN‡ (blue) agents in the three environments. This figure contains the mean behavior (bold curves) of 50 different episodes (shown in the background) and their confidence interval at 95% (light-colored areas). In the first two environments the evolution is similar for both agents; however, there is a big difference in how they achieve such behavior: while the D3QN‡ agent seeks the uncertainty decrease, D3QN achieves it accidentally due to the reduced topological complexity of the environments. In the third scenario, which is topologically more complex and where D3QN left several areas unvisited (and therefore missed loop closures), the use of D3QN‡ results in a significant reduction of D-opt (≈34%). Also, note how the increasing monotonicity property holds and how discontinuities occur due to loop closures. Figure 7 contains the maps generated online by the SLAM algorithm in the three environments after about 150 testing episodes. The images also contain the starting positions (red circles), the detection of loop closures (black stars), the robot's trajectory (red arrows) and its uncertainty in the X and Y directions (yellow ellipses). Maps are depicted for the D3QN (top) and D3QN‡ (bottom) agents. Firstly, the lower uncertainty achieved by the proposed agent in all cases is notable, as is the effectiveness of resampling in comparison to the traditional agent. Moreover, the trajectories followed are more optimal and angular velocity is applied more sparsely (a source of uncertainty increase). In the third environment, the D3QN‡ agent is strikingly capable of selecting different paths to explore a greater area and to resample the particles (i.e., close the loop) twice. Note that, although these maps have been hand-selected to represent a common behavior of the agents' policies, they do not always reproduce identically due to the randomness of the environment and the simulation.
Finally, a discussion on the complexity of this approach with respect to the traditional active SLAM algorithm is presented in Appendix B.

Conclusions
In this paper, we have presented a novel approach to solve the Active SLAM paradigm using model-free DRL. In contrast to prior approaches that try to address autonomous exploration with intrinsic rewards, we embedded true uncertainty metrics, stemming from traditional TOED, in the reward function. Although many works claim to succeed in exploration with extrinsic rewards, we have shown that performing Active SLAM with such rewards is not feasible. We embedded a state-of-the-art deep Q-learning architecture on top of the Gazebo simulator, and executed a lightweight SLAM back-end that allowed the retrieval of the robot's uncertainty and therefore the computation of D-optimality. Thus, we were able to truly perform Active SLAM in a complex environment with realistic models for the robot, the sensors and the environment physics, unlike most related works. Trained agents are able to select actions so as to reduce the uncertainty via quasi-optimal trajectories and place recognition. More interestingly, they are also able to transfer the acquired knowledge to previously unseen maps, a key requirement to solve Active SLAM.
As future work, we aim to embed a more complex state-of-the-art SLAM system to obtain a more precise uncertainty estimate with a state representation based on Lie algebras, e.g., ORB-SLAM2 [39], in order to accurately compute the optimality criteria, see [9]. To our understanding, the main limitations of our approach are the DRL architecture and the environments used. Thus, we plan to test actor-critic algorithms and to explore new inputs to feed the networks, e.g., images or the underlying pose-graphs. Finally, real-world experiments will be conducted in which our approach will be compared to classical active SLAM algorithms, although for now, further theoretical exploration with simulator validation seems more valuable.
For notational convenience, the hat operator of Equation (A2) converts the vector d_k to a member of the Lie algebra se(n); Equations (A5) and (A6) give its form for the 2D and 3D cases, respectively. Elements of Equations (A5) and (A6) from the tangent space se(n) can be associated to their underlying Lie group SE(n) by the group exponential map [43]. Note that an alternative representation is also possible by considering the "small" perturbation being expressed in the k-th reference frame.

Thus, the time complexity of solving the problem will be linear in the action set and logarithmic in the observation set, according to Θ(|A| + log(|O|)).
The traditional resolution of the problem may result in different complexities depending on the method selected. For example, value iteration for an MDP is known to be O(|A| × |S|²), where |S| is the dimension of the state set; and, for a POMDP, O(c^|A| × c^|O|) with c > 1, i.e., exponential in the action and observation sets. Note that the previous complexity corresponds to the most demanding stage, lower order terms being ignored, such as the search over the action set or the computation of the utility function, which would be, for example for D-opt, Ω(|S|^2.373), where Ω(·) denotes a lower bound.
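As a back-of-the-envelope comparison, using the laser discretization of Equation (A22) and illustrative set sizes (the specific numbers are assumptions):

```python
# Rough comparison of the complexity terms discussed above (illustrative numbers).
import math

n_rays = 10
abs_A = 3                                  # forward, turn left, turn right
abs_O = ((10.0 - 0.1) * 100) ** n_rays     # |O| = ((max - min) * 10^2)^n, cf. Eq. (A22)

drl_cost = abs_A + math.log(abs_O)         # Theta(|A| + log|O|): a few tens of operations
pomdp_exponent = abs_O                     # c^|O| with c > 1: astronomically large exponent
print(drl_cost, pomdp_exponent)
```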

Figure 1. Gazebo simulation environments (top) and their corresponding maps generated by the SLAM algorithm (bottom).

Figure 3. Logarithmic occurrence of entropy along episodes in the second scenario. Results are shown for the traditional (left) and uncertainty-based (right) RL reward functions. Darker colors denote lower occurrence.

Figure 4. Cumulative reward during training in environment 1 for the DQN (blue), DDQN (red) and D3QN (green) agents. Light-colored curves correspond to raw data while bold ones correspond to outlier-filtered moving averages.

Figure 5. Mean cumulative reward and its deviation for the DQN (blue), DDQN (red) and D3QN (green) agents in the second environment. Also, evolution of the cumulative reward for DQN† (orange) over the number of retraining episodes.

Figure 6. Evolution of the D-optimality criterion during evaluation for the agents trained with R_F (D3QN, red) and R_aug (D3QN‡, blue).

Figure 7. Maps built by the D3QN (top) and D3QN‡ (bottom) agents in the three environments during testing. Red circles indicate the starting position, red arrows the trajectory followed, black stars a resampling of the algorithm's particles (loop closures) and yellow ellipses a measure of the uncertainty in the robot's 2D position.
T_k = T̄_k exp(d̄̂_k), where d_k and d̄_k are related by the adjoint action on the Lie group.

dim(A) ≡ |A| = m, and the number of inputs directly affects the observation set O. For example, in the studied case where a laser scan is used, the dimension of the observation set would be dim(O) ≡ |O| = ((range_max − range_min) × 10²)^n = k^n (A22).

Algorithm 1. Q-learning.
Initialize Q(s, a) with arbitrary Q-values
for episode ← 1 to max episodes do
    Perceive s_t
    while s_t not terminal do
        Select a_t ← π(s_t)
        Take a_t, get r_t and perceive s_{t+1}
        Update Q(s_t, a_t) with the Bellman update of Equation (6)
    end while
end for

Algorithm 2. Dueling Double Deep Q-learning with PER in simulation environment.
Initialize DNN with parameters θ and θ* ← θ
Initialize ROS master, Gazebo and SLAM algorithm
for episode ← 1 to max episodes do
    Unpause simulation
    Reset robot pose, map and SLAM algorithm
    Perceive s_t
    Pause simulation
    while s_t not terminal do
        Select a_t ← π(s_t | ε)
        Unpause simulation
        Execute a_t during t_a seconds
        Recover Σ, compute the desired f(Σ) and r_t
        Perceive s_{t+1}
        Pause simulation
        if s_{t+1} is terminal then e_t = 1 else e_t = 0
        Store tuple (s_t, a_t, r_t, s_{t+1}, e_t) in M with max(p)
        Sample minibatch from M
        for each tuple in minibatch do
            if s_{i+1} is terminal then

Table 2. Training and simulation main (hyper)parameters.