Autonomous Exploration of Mobile Robots via Deep Reinforcement Learning Based on Spatiotemporal Information on Graph

In this paper, we address the problem of autonomous exploration in unknown environments for ground mobile robots with deep reinforcement learning (DRL). To explore unknown environments effectively, we construct an exploration graph that considers historical trajectories, frontier waypoints, landmarks, and obstacles. Meanwhile, to take full advantage of the spatiotemporal features and historical information in the autonomous exploration task, we propose a novel network called the Spatiotemporal Neural Network on Graph (Graph-STNN). Specifically, the proposed Graph-STNN extracts the spatial feature using a graph convolutional network (GCN) and the temporal feature using a temporal convolutional network (TCN). Then, a gated recurrent unit (GRU) synthesizes the spatial feature, the temporal feature, and the historical state information into the current state feature. Combined with DRL, Graph-STNN helps estimate the optimal target point from the extracted hybrid features. Simulation experiments show that our approach is more effective than the GCN-based approach and the information entropy-based approach. Moreover, Graph-STNN exhibits better generalization than the GCN-based, information entropy-based, and random methods. Finally, we validate our approach on the simulation platform Stage with an actual robot model.


Introduction
Prior information about the environment (usually a map) is always helpful for robots performing specific tasks, e.g., service, security, rescue, etc. To explore unknown environments, manual remote operation of mobile robots to gather information is probably not a good option due to possibly poor communication links [1]. On the contrary, an autonomous scheme that selects a target point to navigate toward in the unknown region is more popular. By repeatedly selecting and navigating to candidate points, the robot eventually covers the whole region [2,3]. However, this method requires expert human experience to choose suitable points and is thus less adaptable to different situations. Moreover, with the increasing dimensionality of the state space and the action space, the computational burden of the multi-objective optimization problem grows rapidly. Besides, these methods do not take the localization uncertainty into account during the exploration process, which may eventually lead to an inaccurate map [4].
There are also some deep learning-based methods. Typically, neural networks are used to map the sensor data or the global occupancy maps built by Simultaneous Localization and Mapping (SLAM) into the actions of robots [5][6][7]. However, such frameworks require significant computational resources, may take a long time to converge during the training phase, and may even fail to converge. Some works aim to combine the advantages of both strategies. Among traditional approaches, many mainly focus on environmental uncertainty evaluation and optimal target point selection. Makarenko et al. [12] proposed a utility function considering information gain, distance cost, and a covariance matrix-based localization effect, and used the function to evaluate the candidate points of the robot. Kaufman et al. [13,14] proposed to compute the expected entropy gain of an occupancy grid map and further applied it to select the target points. Carrillo et al. [15] proposed Shannon entropy for map uncertainty and Rényi entropy for robot pose uncertainty to estimate the cost of candidate points. Bai et al. [16] assessed the candidate points using mutual information obtained by a Bayesian optimization algorithm, which improved computational efficiency and significantly reduced the map entropy. The frontier waypoints, however, are not the only choice. Gonzalez et al. [17] proposed to use the points with the most observation information as the local goal; however, this results in a larger search space, which increases the computational burden. Traditional approaches have achieved positive results in some specific scenarios, but they share a common problem: it is hard to balance exploration effectiveness and computational burden as the exploration area increases. Moreover, these methods highly depend on human-designed expert experience; when the environment changes, the algorithm needs to be redesigned.
The learning-based approaches employ machine learning to replace the expert algorithm for target selection. Li et al. [9] measured the quality of mapping using Shannon entropy and defined the action space on a grid map; the candidates in the action space are evaluated by DRL using a fully convolutional Q-network. Niroui et al. [10] combined traditional boundary-based exploration methods with DRL to choose the best target points. Some other learning frameworks focus on learning the exploration actions in an end-to-end fashion. For instance, Tai et al. [5] employed CNN-based DRL to learn an exploration strategy end-to-end from RGBD camera data. Furthermore, Bai et al. [6] directly used a neural network to learn the mapping from the maps created during exploration to actions. Tai et al. [7] used the distances in ten directions, the targets, and previous actions as the inputs to obtain the robot's velocities. Usually, these learning-based methods incur high computational burdens and generalize poorly to new environments with different obstacles, scales, etc.
Recently, graph neural networks (GNNs) have shown powerful capabilities in solving problems in non-Euclidean spaces [18]. Since the exploration problem is closely related to the spatial structure of the environment, Chen et al. [8] first proposed a generalized graph representation for mobile robot exploration and solved the exploration problem using Deep Q-Learning (DQN) [19] with a graph convolutional network (GCN) [20] as the policy and/or value network. In addition, recurrent neural networks (RNNs) have achieved great success in processing sequential data [21], such as trajectory prediction [22,23] and speaker identification [24]. Inspired by these works, and based on an exploration graph constructed for a more general environment with obstacles, we propose a novel network that extracts spatiotemporal information to generate a more effective exploration strategy.

Our Approach
We propose a novel Graph-STNN incorporated with DRL to address the autonomous exploration of mobile robots. The overall framework of our approach is shown in Figure 1.
In the beginning, we calculate the frontier waypoints as candidates and take the optimal one as the target. Then, the robot updates the observed environment while navigating to successive targets via path planning until the whole map is covered.
The problem is first formulated in Section 3.1. Then we model the environment as an exploration graph in Section 3.2. In Section 3.3, we detail how the novel Graph-STNN generates the best target point. Next, Graph-STNN serves as a policy network trained with DRL, as described in Section 3.4.

Figure 1. The framework of our approach. The grid environment is modeled as an exploration graph consisting of historical trajectories, frontier waypoints, landmarks, and obstacles, shown in red, blue, yellow, and black, respectively. The red pentagram represents the currently selected target point, and the exploration graph represents the current state. In the training phase, the robot optimizes the network parameters to maximize the expected reward of the whole exploration progress. In the testing phase, our approach evaluates the frontier waypoints and selects the optimal one as the target point.

Problem Formulation
The exploration environments are two-dimensional grid maps, in which each historical trajectory, frontier waypoint, landmark, and obstacle corresponds to a grid. These four types of grids serve as the nodes in the exploration graph described in Section 3.2. Inspired by reference [8], to represent the uncertainty and occupancy probability of every grid, we construct a raster virtual map using the Expectation-Maximization (EM) exploration algorithm [25], where each grid represents a virtual landmark with an identical initial error covariance matrix. Meanwhile, the A-optimality metric [26] is used to measure the uncertainty of each virtual landmark $v_i$. Then, the uncertainty $U$ of the whole map is calculated by:

$U = \sum_{v_i \in V} \operatorname{tr}(\Sigma_{v_i})$ (1)

where $V$ represents the set of all virtual landmarks, and $\operatorname{tr}(\Sigma_{v_i})$ represents the trace of each virtual landmark's error covariance matrix $\Sigma_{v_i}$. To measure the robot's exploration speed, we use the convergence speed of the exploration rate. Specifically, we average the exploration rate over 50 tests at each moment and plot the average exploration rate over time.
In addition, to evaluate the exploration algorithm, we define the following exploration efficiency:

$\eta = \frac{\Delta U}{T}$ (2)

where $\Delta U$ is the uncertainty reduction in time $T$. Therefore, $\eta$ measures the reduction of map uncertainty per unit time. Finally, to compare the mapping accuracy of different methods, we define the following trajectory error:

$e = \frac{n_a}{n_g} \max_{v \in V_P} \operatorname{tr}(\Sigma_v)$ (3)

where $n_a$ is the total number of exploration steps, $n_g$ is the total number of grids in the map, and $V_P$ is the set of trajectory nodes. Similarly, we take the mean trajectory error over 50 identical test scenarios. Our approach adopts the frontier-based exploration scheme with path planning. The exploration problem can be viewed as a sequential decision-making problem, whereby the robot learns to act optimally from the changes to the environment that result from its selection of frontier nodes. Thus, our goal is to choose the target point that maximizes $\eta$ while minimizing $U$ and $e$.
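To make the three metrics concrete, the following is a minimal sketch of how they could be computed, assuming each virtual landmark and trajectory node carries an error covariance matrix; the function names are illustrative, not from the authors' code.

```python
import numpy as np

def map_uncertainty(covariances):
    """U: sum of the traces of all virtual landmarks' covariances (Equation (1))."""
    return sum(np.trace(cov) for cov in covariances)

def exploration_efficiency(delta_u, elapsed_time):
    """eta = dU / T: map-uncertainty reduction per unit time (Equation (2))."""
    return delta_u / elapsed_time

def trajectory_error(n_steps, n_grids, trajectory_covs):
    """e: step-to-grid ratio scaled by the worst trajectory-node uncertainty (Equation (3))."""
    return (n_steps / n_grids) * max(np.trace(cov) for cov in trajectory_covs)
```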

Exploration Graph
The observed information grows incrementally during the exploration progress, leading to increasing dimensions of the state and action spaces. Compared with learning from an occupancy map or sensor data, the graph offers a smaller state space and a more generalized representation of a SLAM-dependent mobile robot's environment [4]. Moreover, the graph-learning approach is more adaptable to dynamic environments, since the input is a variable-size graph and the output size adjusts to the number of nodes of the input graph.
For these reasons, we represent the graph topology of the exploration problem and model the exploration process as a generalized exploration graph, as shown in Figure 2. We denote the exploration graph as $G = (V, E)$, where $V$ represents the set of nodes and $E$ the set of edges.

Figure 2. Extracting the exploration graph. The historical trajectories, frontier waypoints, landmarks, and obstacles are shown in red, blue, yellow, and black, respectively. The current position in the trajectory is highlighted with a yellow border, and the frontier waypoints serve as the candidate target points.
There are four types of nodes in $V$: historical trajectories (red), frontier waypoints (blue), landmarks (yellow), and obstacles (black), where the obstacle nodes are first introduced in this paper and the frontier waypoint nodes serve as the candidate points. The edges $E$ between the nodes are constructed as follows: (i) adjacent trajectory nodes are connected to each other, (ii) each trajectory node is connected to the nearest landmark, (iii) each frontier waypoint is connected to the nearest landmark or the current position of the robot, (iv) the landmarks are fully connected, and (v) each obstacle is connected to the nearest trajectory node, as sketched below. After establishing the exploration graph, the robot can obtain the local target point that guides exploration of unknown areas from the candidates by network reasoning over the graph.
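The following sketch illustrates edge rules (i)–(v), assuming each node type is given as a list of 2-D grid coordinates and the current robot position is the last trajectory node; all names are illustrative.

```python
import numpy as np

def nearest(point, candidates):
    """Index of the candidate closest to `point` (Euclidean distance)."""
    dists = np.linalg.norm(np.asarray(candidates) - np.asarray(point), axis=1)
    return int(np.argmin(dists))

def build_exploration_edges(traj, frontiers, landmarks, obstacles):
    """Return undirected edges as (i, j) pairs over the global node order:
    trajectory, frontier, landmark, obstacle."""
    nT, nF, nL = len(traj), len(frontiers), len(landmarks)
    off_F, off_L, off_O = nT, nT + nF, nT + nF + nL
    edges = []
    # (i) adjacent trajectory nodes are connected to each other
    edges += [(i, i + 1) for i in range(nT - 1)]
    # (ii) each trajectory node is connected to its nearest landmark
    edges += [(i, off_L + nearest(p, landmarks)) for i, p in enumerate(traj)]
    # (iii) each frontier connects to the nearest landmark or the robot
    current = traj[-1]  # current robot position = last trajectory node
    for j, f in enumerate(frontiers):
        k = nearest(f, list(landmarks) + [current])
        edges.append((off_F + j, off_L + k if k < nL else nT - 1))
    # (iv) the landmarks are fully connected
    edges += [(off_L + a, off_L + b)
              for a in range(nL) for b in range(a + 1, nL)]
    # (v) each obstacle is connected to its nearest trajectory node
    edges += [(off_O + k, nearest(o, traj)) for k, o in enumerate(obstacles)]
    return edges
```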
As a starting point of node feature aggregation, we initialize the node feature as:

$x_i = [d, \phi, H_\gamma, H_\beta, \varphi]^T$ (4)

where $d$ is the distance between the node and the current position of the robot, $\phi$ the angle between the heading of the robot and the direction from the robot to the node, $H_\gamma$ the Shannon entropy, $H_\beta$ the Rényi entropy, and $\varphi$ the type flag. The correspondence between the flag $\varphi$ and the node types is shown in Table 1. The Shannon entropy and Rényi entropy [15] are used to measure the grid uncertainty and the pose uncertainty, respectively. The Shannon entropy is defined as:

$H_\gamma = -\big[P(m_{ij}) \log P(m_{ij}) + (1 - P(m_{ij})) \log(1 - P(m_{ij}))\big]$ (5)

where $m_{ij}$ is a Bernoulli random variable: $m_{ij} = 0$ indicates that the grid is free, and $m_{ij} = 1$ that the grid is occupied. $P(m_{ij} = 1)$ is the probability of the grid being occupied, abbreviated as $P(m_{ij})$. The Rényi entropy is defined as:

$H_\beta = \frac{1}{1-\beta} \log \sum_i P_i^\beta$ (6)

where $\beta = 1 + \frac{1}{\operatorname{tr}(\Sigma_{g_{ij}})}$, in which $\Sigma_{g_{ij}}$ denotes the error covariance matrix of the grid $g_{ij}$ and $P_i$ the probability associated with the random variable $m_{ij}$.
Then, we use a multi-layer perceptron (MLP) [27,28] to map the initial features into high-dimensional vectors as the input of the following network:

$F = \operatorname{MLP}(X)$ (7)

where $X = [x_1, \cdots, x_N]^T$, $\operatorname{MLP}(\cdot)$ is a multi-layer perceptron, and $N$ is the number of nodes.
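A minimal sketch of the feature initialization and embedding follows, assuming per-grid entropies and a single fully connected embedding layer (as described in Section 5); all dimensions are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

def shannon_entropy(p_occ):
    """Per-grid Shannon entropy of the Bernoulli occupancy variable (Equation (5))."""
    p = np.clip(p_occ, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def renyi_entropy(probs, cov_trace):
    """Per-grid Renyi entropy with order beta = 1 + 1/tr(Sigma) (Equation (6))."""
    beta = 1.0 + 1.0 / cov_trace
    return np.log(np.sum(np.asarray(probs) ** beta)) / (1.0 - beta)

embed = nn.Sequential(nn.Linear(5, 64), nn.ReLU())  # single FC embedding layer
X = torch.randn(10, 5)  # N x 5 initial features [d, phi, H_gamma, H_beta, flag]
F = embed(X)            # N x 64 embedded features, input of the encoders
```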

Graph-STNN
In this section, we propose the novel Graph-STNN, which extracts multiple features from the exploration graph to help choose the best frontier waypoint for robots to map the unknown environment effectively. An overview of our approach is shown in Figure 3, in which three different encoders are designed to extract information from the incremental exploration graph. Specifically, we design a spatial encoder to extract spatial features from neighboring nodes and a temporal encoder to extract temporal features from the historical robot trajectories. Besides, we design a state encoder to embed the spatiotemporal information and the historical state into the current state feature. The current state is represented as the exploration graph $s_T$. All frontier waypoints can be represented as $C_1 \sim C_{s_t}$, where $s_t$ is the number of frontier waypoints in the current state. The inputs of the network are the adjacency matrix $A$ and the original node feature matrix $X$ of the exploration graph. First, the original features encoded by a multi-layer perceptron (MLP), the spatial features extracted by the spatial encoder, and the temporal features extracted by the temporal encoder are concatenated. Then, the concatenated feature is input to the state encoder to obtain the state feature, which is the same for each node. Finally, the state feature combined with the original features and spatial features is input to a linear layer after dropout, and the output is the value of each frontier waypoint. The frontier waypoint with the largest value is chosen as the local target point.

Spatial Encoder
The spatial encoder is designed to encode the structural information around the frontier waypoints. As GCNs have achieved great success in processing graph-structured data, we use a GCN to model the spatial interaction. Specifically, we construct the spatial encoder from a two-layer GCN. Denoting the node features at layer $l$ as $F^{(l)}$, the node aggregation at layer $l$ is calculated as:

$F^{(l+1)} = \sigma_s\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} F^{(l)} W_S^{(l)}\right)$ (8)

where $\tilde{A} = A + I$, $I$ is the identity matrix, $A$ the adjacency matrix of the exploration graph, $\tilde{D}$ the degree matrix of $\tilde{A}$, $W_S^{(l)}$ the learnable weight matrix of layer $l$, and $\sigma_s(\cdot)$ the ReLU function. The input of the encoder is $F^{(0)} = F = [f_1, \cdots, f_N]^T$. The features of the two layers are combined using an attention mechanism, and the output spatial feature of node $i$ can be represented as:

$f_i^{\mathrm{Spatial}} = \sum_{l=1}^{2} \alpha_l f_i^{(l)}, \quad \alpha_l = \frac{\exp(W_1 f_i^{(l)} + b_1)}{\sum_{l'=1}^{2} \exp(W_1 f_i^{(l')} + b_1)}$ (9)

where $\exp(\cdot)$ is the exponential function, and $W_1$ and $b_1$ are the parameters of the attention mechanism. The spatial feature matrix $F_{\mathrm{Spatial}} = [f_1^{\mathrm{Spatial}}, \cdots, f_N^{\mathrm{Spatial}}]^T$ consists of all nodes' spatial features.
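The following is a minimal PyTorch sketch of this encoder under the propagation rule of Equation (8) and the layer attention of Equation (9); layer sizes are illustrative and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)  # W_S^(0)
        self.w2 = nn.Linear(dim, dim, bias=False)  # W_S^(1)
        self.attn = nn.Linear(dim, 1)              # W_1, b_1

    @staticmethod
    def normalize(adj):
        """D^{-1/2} (A + I) D^{-1/2}, as in the GCN propagation rule."""
        a_tilde = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
        return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

    def forward(self, feats, adj):
        a_hat = self.normalize(adj)
        f1 = torch.relu(a_hat @ self.w1(feats))    # layer-1 features
        f2 = torch.relu(a_hat @ self.w2(f1))       # layer-2 features
        # softmax attention over the two layers, computed per node
        scores = torch.cat([self.attn(f1), self.attn(f2)], dim=1)  # N x 2
        alpha = torch.softmax(scores, dim=1)
        return alpha[:, :1] * f1 + alpha[:, 1:] * f2
```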

Temporal Encoder
The temporal encoder is designed to encode the temporal information of the exploration process from the changes of the robot's position on the map. Thus, we only consider the historical trajectory nodes in the incremental graphs.
For the temporal encoder, we use a gated temporal convolutional network (gated TCN) [29]. Different from the original temporal convolutional network (TCN) [30], the gated TCN uses a gating mechanism to improve its information extraction ability. Compared with recurrent neural networks (RNNs) [21], the gated TCN processes its input in parallel, which reduces memory expense and the number of parameters. Therefore, our temporal encoder consists of two identical TCNs with different gates, i.e., Tanh and Sigmoid functions. We concatenate the initial features with the spatial features as:

$P' = \operatorname{Concat}(F, F_{\mathrm{Spatial}})$ (10)

The input of our temporal encoder is $P$, which consists of the trajectory nodes' features in $P'$. We denote the $t$-th node in $P$ as $p_t$. The update of $p_t^{(l)}$ at layer $l$ in the TCN is represented as:

$p_t^{(l+1)} = \sum_{k=1}^{K} f_k \, p_{t-(k-1)d_l}^{(l)}$ (11)

where $K$ is the one-dimensional convolution kernel size, $d_l$ the expansion factor controlling the jump connections of layer $l$, and $f_1 \sim f_K$ the learnable parameters of the convolution kernel. Denote the outputs of the two TCNs as $P_1$ and $P_2$, respectively. The final temporal feature is calculated as:

$F_{\mathrm{Temporal}} = \sigma_h(P_1) \odot \sigma_g(P_2)$ (12)

where $\sigma_h$ is the Tanh function, $\sigma_g$ the Sigmoid function, and $\odot$ the element-wise product.
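A minimal PyTorch sketch of the gated TCN follows: two identical dilated causal convolution stacks whose outputs are fused as tanh(P1) * sigmoid(P2), per Equation (12). The kernel size, dilations, and channel count are assumed for illustration.

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    def __init__(self, channels=64, kernel=2, dilations=(1, 2)):
        super().__init__()
        def stack():
            layers = []
            for d in dilations:
                # left-pad so the dilated convolution stays causal
                layers += [nn.ConstantPad1d(((kernel - 1) * d, 0), 0.0),
                           nn.Conv1d(channels, channels, kernel, dilation=d)]
            return nn.Sequential(*layers)
        self.branch_h, self.branch_g = stack(), stack()

    def forward(self, p):
        # p: (batch, channels, time) -- trajectory-node features over time
        return torch.tanh(self.branch_h(p)) * torch.sigmoid(self.branch_g(p))
```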

State Encoder
Although spatiotemporal information can be obtained, we still need global information to accomplish the exploration task effectively. Hence, a state encoder is proposed to further aggregate the global and historical environment information.
A similar idea of separating considerations of state and action was recently proposed by Wang et al. [31]. They use two MLPs at the end of the feature aggregation to extract the state value and the state-action value, respectively. Different from them, we explicitly divide the node feature into a state feature and an action feature. Notice that the state feature is independent of the robot's actions.
We construct the state encoder based on gated recurrent unit (GRU) [32]. We concatenate the initial feature, the spatial feature, and the temporal feature together as the input of the state encoder.
We then use a two-layer GRU to extract the state feature. The structure of the GRU is shown in Figure 4, and the computation of layer $l$ of the GRU is:

$h_T^{(l)} = (1 - z_T^{(l)}) \odot h_{T-1}^{(l)} + z_T^{(l)} \odot \tilde{h}_T^{(l)}$ (13)

where $z_T^{(l)}$ is the update gate defined as:

$z_T^{(l)} = \sigma\left(W_z^{(l)} [h_{T-1}^{(l)}, h_T^{(l-1)}]\right)$ (14)

and $\tilde{h}_T^{(l)}$ is the current memory information defined as:

$\tilde{h}_T^{(l)} = \tanh\left(W_h^{(l)} [r_T^{(l)} \odot h_{T-1}^{(l)}, h_T^{(l-1)}]\right)$ (15)

and $r_T^{(l)}$ is the reset gate defined as:

$r_T^{(l)} = \sigma\left(W_r^{(l)} [h_{T-1}^{(l)}, h_T^{(l-1)}]\right)$ (16)

where $\odot$ is the element-wise product, $h_T^{(l)}$ the hidden vector passed to the next layer and the output, and $W_z^{(l)}$, $W_r^{(l)}$, $W_h^{(l)}$ the learnable parameters in layer $l$ of the GRU. The three components work together to construct the state feature. The update gate passes the information from the previous step as well as the valuable information from the current input to the future. The reset gate determines the part of the previous message to be retained based on the current input and the output of the previous step. The current memory information consists of the current input information and the part of the previous information retained as determined by the reset gate. Similar to the spatial encoder, we also use an additional attention layer to aggregate the two layers' outputs into the final vector, which can be represented as:

$f^{\mathrm{State}} = \sum_{l=1}^{2} \alpha_l h_T^{(l)}, \quad \alpha_l = \frac{\exp(W_2 h_T^{(l)} + b_2)}{\sum_{l'=1}^{2} \exp(W_2 h_T^{(l')} + b_2)}$ (17)

where the concatenated feature $[F, F_{\mathrm{Spatial}}, F_{\mathrm{Temporal}}]$ and the hidden vector $h_{T-1}$ are the inputs of our state encoder at the current time $T$; $h_T^{(l)}$ is the output of layer $l$ of the GRU at time $T$, and $W_2$ and $b_2$ are the learnable parameters of the attention mechanism.
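As a concrete reference, the sketch below wires the two-layer GRU and layer attention together in PyTorch; torch.nn.GRU already implements the gate Equations (13)–(16) internally, and the dimensions (192-dimensional concatenated input, 64-dimensional hidden state) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    def __init__(self, in_dim=192, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # W_2, b_2

    def forward(self, x_t, h_prev):
        # x_t: (1, 1, in_dim) concatenated [F, F_Spatial, F_Temporal] at time T
        # h_prev: (2, 1, hidden) hidden state carried over from time T-1
        _, h = self.gru(x_t, h_prev)       # h: (2, 1, hidden), one row per layer
        scores = self.attn(h).squeeze(-1)  # (2, 1) attention logits per layer
        alpha = torch.softmax(scores, dim=0)
        state = (alpha.unsqueeze(-1) * h).sum(dim=0)  # fuse the two layers
        return state, h  # state feature + hidden state for the next step
```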
To reduce the risk of overfitting and improve the exploration ability, we apply dropout to each node feature, formed by concatenating the state feature, spatial feature, and temporal feature. The dropout rate decreases gradually with the number of exploration steps. Once the node features are computed, the final value of each candidate target point is determined by an additional MLP, and the candidate point with the largest value is selected as the target point.
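A sketch of this output head follows, assuming a linear decay of the dropout rate; the paper only states that the rate decreases with the number of exploration steps, so the schedule here is hypothetical.

```python
import torch
import torch.nn.functional as Fn

def evaluate_frontiers(frontier_feats, value_layer, step, p0=0.5, decay=0.01):
    """frontier_feats: (num_frontiers, dim) concatenated node features of the
    candidate frontier waypoints; value_layer: an nn.Linear(dim, 1)."""
    p = max(0.0, p0 - decay * step)          # dropout rate shrinks over steps
    h = Fn.dropout(frontier_feats, p=p, training=True)
    values = value_layer(h).squeeze(-1)      # one scalar value per candidate
    return int(torch.argmax(values))         # index of the selected target
```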

Policy Training with DRL
The proposed Graph-STNN serves as the value network, which helps estimate the optimal action in combination with NoisyNet DQN [33]. Moreover, a multi-step strategy is used to estimate the value function by taking longer trajectories that contain the rewards of multiple steps, which can be represented as:

$Q_{\mathrm{target}} = \sum_{k=0}^{m-1} \gamma^k r_{t+k} + \gamma^m \max_{a'} Q(s_{t+m}, a', \theta^-, \varsigma)$ (18)

where $Q(s_t, a_t, \theta, \varsigma)$ represents the value of taking action $a_t$ in the current state $s_t$; $\varsigma$ represents the Gaussian noise added in the final MLP; $\gamma$ is the discount factor; and $m$ is the number of steps to the future observation, which is an adjustable parameter. $\theta$ is the learnable parameter, optimized by minimizing the loss function $L$ (Equation (20) below), and the target network parameter $\theta^-$ is soft updated as:

$\theta^- \leftarrow \alpha\,\theta + (1-\alpha)\,\theta^-$ (19)

where $\alpha$ is the soft update coefficient. The state transition process in DRL is shown in Figure 5, in which the exploration sequence consists of a series of exploration graphs. Each exploration graph represents a state, consisting of the adjacency matrix and the feature matrix. In each step, the robot selects the best frontier waypoint as the local target according to the evaluation of Graph-STNN. Using the A* algorithm [34], the robot plans a path and navigates to the target. Then, the exploration graph is refreshed, and the state updates to $s_{t+1}$ with the environment reward $r_t$. The robot repeats this process until it finishes exploring the whole environment. Eventually, the goal is to maximize the total reward obtained over the iteration.

Figure 5. Extracting the exploration sequence. The exploration sequence consists of a series of exploration graphs, and each exploration graph represents a state. In the current state $s_t$, if the robot takes the action $a_t$ to the next state $s_{t+1}$, it obtains the immediate reward $r_t$.
Since state transitions within an episode are sequentially related in our approach, the number of steps to the future observation is the total number of steps of the whole exploration progress. Thus, we update the network once after each episode, and the policy network is trained with the following loss function:

$L = \frac{1}{T} \sum_{t=1}^{T} \left( \sum_{k=t}^{T} \gamma^{k-t} r_k - Q(s_t, a_t, \theta, \varsigma) \right)^2$ (20)

where $T$ denotes the number of states in the entire training episode.
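Under this episode-long horizon, the multi-step target reduces to the discounted Monte Carlo return. The following is a minimal sketch, assuming the per-step Q-value predictions and rewards of one episode are stored as tensors; it is a reconstruction consistent with the text, not the authors' training code.

```python
import torch

def episode_loss(q_values, rewards, gamma=0.99):
    """q_values: (T,) predicted Q(s_t, a_t); rewards: (T,) observed r_t."""
    returns, g = torch.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):  # G_t = r_t + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    return torch.mean((returns - q_values) ** 2)  # Equation (20)
```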
The environment reward is used to evaluate the action taken: the larger the reward obtained from the environment, the stronger the robot's tendency to adopt the corresponding strategy in the future. Furthermore, the ultimate goal of the robot is to maximize the expected reward over the whole exploration progress. Thus, to guide the selection of target points that encourage efficient exploration, the reward function is defined as:

$r_t = \omega_1 R_1 - \omega_2 R_2 - \omega_3 R_3 - \omega_4 R_4 - \omega_5 R_5 + \omega_6 R_6$ (21)

where $R_1 = \Delta U$ is the reduction of the whole map's uncertainty, calculated by Equation (1), from the current state to the next state; $R_2$ and $R_3$ are the robot's angle change and path length, according to the local path planned by the A* algorithm; $R_4 = e_{\mathrm{next}}$ is the maximum trajectory error, calculated by Equation (3), after transitioning to the next state; $R_5$ is the total number of exploration steps from the beginning to the next state; $R_6$ is the environmental exploration rate (the ratio of the known area to the entire map area); and $\omega_1 \sim \omega_6$ are positive constants.
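A direct transcription of Equation (21) follows, under the assumed sign convention that uncertainty reduction ($R_1$) and exploration rate ($R_6$) are rewarded while turning, path length, trajectory error, and step count are penalized; the weight vector `w` is a hypothetical parameter.

```python
def step_reward(dU, dAngle, path_len, e_next, n_steps, explored_ratio, w):
    """Weighted reward per Equation (21); w = [w1..w6], all positive."""
    return (w[0] * dU - w[1] * dAngle - w[2] * path_len
            - w[3] * e_next - w[4] * n_steps + w[5] * explored_ratio)
```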

Experimental Evaluation
In this section, we use the graph-based SLAM library GTSAM [35] to build the map of the environment during exploration.
Our experiments are designed to support our claims that our approach (i) chooses more reasonable target points by taking obstacle information into account; (ii) can better adapt to the exploration problem using temporal and spatial information, leading to more accurate exploration; (iii) has better robustness and generalizability in new environments than state-of-the-art approaches; and (iv) converges faster and obtains more rewards.

Experimental Setup
The exploration environment is a two-dimensional grid map with randomly distributed landmarks (their number proportional to the map area) and obstacles (their number proportional to the map edge length). It is assumed that landmarks can be recognized once they enter the detection range of the robot. In addition, the horizontal field of view of the robot is set to 360 degrees, and the measurement range is set to 5 m. The robot is able to rotate from −180 to 180 degrees, and the noise in the simulation is Gaussian. The grid size is 2 m × 2 m, and the initial occupancy probability of each grid is 0.5. In this setting, the density of landmarks in the map is 3 per 100 grids.
Each time the environment is reset, the positions of the landmarks and obstacles and the initial position of the robot are generated randomly. The robot has to avoid obstacles and landmarks while traveling to the target. First, we select the five frontier waypoints closest to the robot and the three closest to each landmark as candidates. The robot then evaluates the candidates with our algorithm, selects the best one, and uses the A* algorithm to plan a collision-free path to reach it. The exploration is finished when 90% of the area has been explored by the robot. Note that models incorporating obstacle avoidance and a path planning module are closer to the actual situation.
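A minimal sketch of this candidate-pruning rule follows, with hypothetical function names: keep the five frontier points nearest the robot plus the three nearest each landmark, deduplicated.

```python
import numpy as np

def select_candidates(frontiers, robot_pos, landmarks):
    """Return the pruned candidate set of frontier waypoints."""
    frontiers = np.asarray(frontiers, dtype=float)
    def k_nearest(point, k):
        d = np.linalg.norm(frontiers - np.asarray(point), axis=1)
        return set(np.argsort(d)[:k].tolist())
    keep = k_nearest(robot_pos, 5)        # five nearest to the robot
    for lm in landmarks:
        keep |= k_nearest(lm, 3)          # three nearest to each landmark
    return frontiers[sorted(keep)]
```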
An example exploration environment is shown in Figure 6, in which the frontier waypoints serve as candidate target points. The robot continually selects a new local target and navigates to it while updating the observed environment information until the whole map is covered.

Results and Analysis
In Graph-STNN, the first MLP only maps the initial features into a high-dimensional space, so we use a single fully connected layer to reduce computational complexity. Similarly, the output MLP maps the hybrid features into the evaluation of each frontier; it is constructed from one linear layer with Gaussian noise (adding uncertainty to training and improving training efficiency) and one fully connected layer. In recurrent neural networks, additional layers increase the level of abstraction of input observations over time [36]. However, more layers lead to exponential growth in time and memory overhead and a sharp decline in convergence quality and efficiency. Therefore, considering our actual computing capacity, we design the state encoder as a two-layer GRU. Finally, to determine the structure of the spatial encoder, we test one to three layers of GCN; the training performance is shown in Figure 7, and the test results are shown in Figure 8. As we can see, Graph-STNN with a two-layer GCN performs best, with the fastest convergence, the minimum trajectory error, and the highest efficiency.

The entropy-based method [15] and the uncertainty-based method [25] are the most popular traditional methods in this field and perform well in autonomous exploration. Very recently, the learning-based method proposed in reference [8] achieved great progress in exploration performance, which also inspired us. Therefore, to verify our approach, we compare our method with them. Specifically, we train the GCN network proposed in reference [8] in the same experimental setting (with obstacles) as a comparison, denoted GCN-obs for short. In addition, we also test the original GCN network of [8] in our experimental environment, which is trained in the environment without obstacles and denoted GCN-no_obs.

Figure 9 shows the convergence of rewards for Graph-STNN and GCN-obs under the same reward function settings, with rewards averaged every ten thousand steps. As shown, the proposed Graph-STNN converges more quickly. We test the trained strategies in randomly generated maps of sizes 20 m × 20 m and 40 m × 40 m over 50 episodes; the comparison results are shown in Figures 10 and 11. To evaluate the performance, we use three metrics: the average exploration speed, $e$ (the average trajectory error, which measures the mapping accuracy), and $\eta$ (the average exploration efficiency, which measures the reduction of map uncertainty per unit time). The experimental values of $e$ and $\eta$ for the different methods in the 20 m × 20 m and 40 m × 40 m environments are shown in Table 2. The results show that Graph-STNN achieves the shortest path, the smallest trajectory error, and the highest exploration efficiency, while the Random method performs worst on all three metrics. In addition, the Uncertainty method of reference [25] and the Entropy method of reference [15] perform better than the Random method, but both are still not good enough.
Intuitively, it is evident in Figure 11 that Graph-STNN, which takes obstacle information into consideration, performs better than the model without it (GCN-no_obs) and is closer to the actual situation. This is because we include obstacles as nodes in the graph, which enriches the graphical representation. Meanwhile, by adding path planning, the approach can learn the path changes caused by obstacle avoidance. This supports claim (i): our approach chooses more reasonable target points by taking obstacle information into account.
Furthermore, as can be seen in Figure 11a, over the 50 test episodes, Graph-STNN finishes exploring the environment fastest and maps with the highest accuracy and efficiency compared with the GCN-based and information entropy-based methods. Meanwhile, the advantage is more significant than in the 20 m × 20 m environment, showing higher robustness and generalization, as stated in claim (iii). Eventually, the results verify that global spatiotemporal information plays a more critical role in the exploration problem than structural information alone, as stated in claim (ii).
To verify the roles of the spatial encoder, the temporal encoder, the state encoder, and their combinations in Graph-STNN, we train networks with different components separately. The reward curves during the training process are shown in Figure 12, in which the rewards are averaged every ten thousand steps. As shown in the figure, the approaches that consider state features and action features separately converge faster and obtain more rewards, as stated in claim (iv). The corresponding quantitative results are summarized in Table 3. The results show that the convergence speed of the models without spatial or temporal information (no temporal encoder, no spatial encoder, no spatial and temporal encoder) is close to that of Graph-STNN; however, they obtain fewer rewards and have worse efficiency. Besides, we can see that without state features (no state encoder, no spatial and state encoder, no temporal and state encoder), the convergence speed, mapping accuracy, and exploration efficiency are all poor, and these models struggle to converge. The reason is that spatial and temporal information tends to guide the robot to reduce the environment's uncertainty, which leads to fast exploration. Meanwhile, the state information tends to balance the exploration speed and the map accuracy, since it always contains the global information of the exploration progress.
To further check the model's robustness and generalization, we directly apply the trained models to 30 m × 30 m and 40 m × 40 m environments. As shown in Figures 14 and 15, the complete Graph-STNN achieves the best results compared to the models using the temporal, spatial, and state features alone or in part. These results once again confirm claim (ii): Graph-STNN can better adapt to the exploration problem using temporal and spatial information, leading to more accurate exploration. As we can see, the state encoder promotes exploration speed and efficiency and plays the major role in improving the model's generalization, while the temporal and spatial encoders mainly improve the mapping accuracy.
Finally, we use a physical robot model and a lifelike environment to further validate our approach in Stage. Stage is a simulation tool for researching mobile robots and intelligent sensor systems, which can simulate objects such as mobile robots, sensors, and obstacles. As shown in Figure 16a, the size of our scene is 20 m × 20 m, with dense obstacles. The robot achieves omnidirectional motion by differential rotation of its two wheels, and it carries a 2D lidar with a field of view (FOV) of 360 degrees and a range of 10 m. We then test our autonomous exploration algorithm on this platform. First, the robot builds the map with the SLAM algorithm Karto [37], and the frontier waypoints are obtained from the map. The test result is shown in Figure 16b, in which the established map, the robot's historical trajectories, and the robot's current path, obtained via the Mobile Robot Local Planning Benchmark (MRPB) [38], are visualized in Rviz. As we can see, our approach can guide a robot equipped with a 2D lidar to explore the unknown environment effectively, which means the proposed method has great potential in actual applications.
Figure 16. The test result of our autonomous exploration algorithm in Stage. (a) The Stage scene. In this scene, the black squares are obstacles, the green square is the robot's position, the blue squares are the robot's historical trajectories, and the light blue area is the radar coverage area. (b) The explored map. In this map, the black grids are obstacles, the white grids are free areas, and the dark green grids are unknown areas. The blue curve shows the robot's historical trajectories, and the blue line indicates that a closed loop has been detected. The red and green curves are the robot's current path.

Conclusions
This research develops a novel DRL-based approach called Graph-STNN for mobile robot autonomous exploration in unknown environments. We model the exploration problem as an exploration graph that considers the historical trajectories, frontier waypoints, landmarks, and obstacles. We employ a GCN as the spatial encoder to extract topological structure information, a TCN as the temporal encoder to extract sequence information from trajectories, and a GRU to capture the dynamic change of the spatiotemporal information and obtain the current state feature. The simulation experiments show that our approach chooses better target points by considering obstacle avoidance. Besides, the results also show that the method improves exploration efficiency by using both temporal and spatial information. In addition, it exhibits better robustness and generalization by using hybrid global information. Next, we plan to validate the network on physical robots; DRL-based path planning and heterogeneous graphs will also be considered.