Article

A Semantic Topology Graph to Detect Re-Localization and Loop Closure of the Visual Simultaneous Localization and Mapping System in a Dynamic Environment

1 School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2 Advanced Manufacturing and Automatization Engineering Laboratory, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(20), 8445; https://doi.org/10.3390/s23208445
Submission received: 5 September 2023 / Revised: 2 October 2023 / Accepted: 10 October 2023 / Published: 13 October 2023
(This article belongs to the Section Sensors and Robotics)

Abstract

Simultaneous localization and mapping (SLAM) plays a crucial role in the field of intelligent mobile robots. However, the traditional Visual SLAM (VSLAM) framework is based on strong assumptions about static environments, which do not hold in dynamic real-world environments. Both the correctness of re-localization and the recall of loop closure detection decrease when the mobile robot loses frames in a dynamic environment. Thus, in this paper, a re-localization and loop closure detection method with a semantic topology graph based on ORB-SLAM2 is proposed. First, we use YOLOv5 for object detection and label the recognized dynamic and static objects. Secondly, the topology graph is constructed using the position information of static objects in space. Then, we propose a weight expression for the topology graph to calculate the similarity of topologies in different keyframes. Finally, re-localization and loop closure detection are determined based on the value of the topology similarity. Experiments on public datasets show that the semantic topology graph is effective in improving the correct rate of re-localization and the accuracy of loop closure detection in a dynamic environment.

1. Introduction

Simultaneous localization and mapping (SLAM) is one of the core problems in mobile robotics research [1,2]. Compared to laser sensors, vision sensors have the advantages of fine perception, low price, smaller size, and lighter weight. Thus, Visual SLAM (VSLAM) has made great progress in the last few decades. Among the various VSLAM algorithms, feature-based algorithms are widely used in long-term robot deployment because of their high efficiency and scalability. However, most existing SLAM systems rely on hand-crafted visual features, such as SIFT [3], the Shi–Tomasi method [4], and ORB [5], which may not provide consistent feature detection and association results in dynamic environments. For example, when either the scene or the viewpoint changes, ORB-SLAM2 frequently fails to recognize previously visited scenes because of the reduced visual feature information [6]. Thus, the mobile robot needs to use re-localization and loop closure detection to identify whether a scene has been visited before. Generally, re-localization is accomplished using loop closure detection to correlate information. Firstly, the bag-of-words model is used to extract re-localization candidate frames with high similarity to the current frame. Then, frame-to-frame matching is used to match the local features of the current frame against the candidate frames until all candidate frames are traversed or the information is successfully associated. However, in dynamic environments, the number of matched local feature point pairs is often insufficient because dynamic objects can vastly impair the performance of the Visual SLAM system [7]; as a result, Visual SLAM often fails in re-localization and loop closure detection. To address this challenging topic, some researchers have worked on feature removal [8,9], enhancing the stability of the system's visual odometry by removing feature points belonging to dynamic objects.
In a real dynamic scene, re-localization and loop closure detection will fail when feature information is limited [10]. Therefore, the mobile robot often needs to return to a previous place to extract more feature information and complete the feature matching required for re-localization and loop closure detection. The combination of Visual SLAM and deep learning can solve this problem better than traditional methods [11]. The main idea is to extract image features using a network trained in advance [12]. However, the disadvantage of deep learning is that it is computationally intensive and requires high-performance equipment. Furthermore, it is difficult to construct suitable models to store high-dimensional feature vectors. Thus, in recent years, researchers have attempted to introduce semantic information into Visual SLAM [13,14]. Using semantic information to describe the environment can effectively simplify the process of saving and comparing environmental information [15,16].
ORB-SLAM2 is the most widely used Visual SLAM framework [6]. Re-localization and loop closure detection in ORB-SLAM2 are mainly accomplished by matching feature points between the current frame and candidate keyframes. Therefore, the number of extracted feature points determines the accuracy and speed of re-localization and loop closure detection: the more feature points are extracted, the more accurate the localization and the faster the matching. However, fast-moving dynamic objects disturb feature point extraction and thus degrade the accuracy of re-localization and loop closure detection. In contrast, the relative positions and distances of static objects do not change, irrespective of the motion of dynamic objects in the environment. Thus, we can build a semantic topology graph from the relative positions of static objects to assist re-localization and loop closure detection, instead of relying only on static feature points.
Therefore, to address the above problems, this paper proposes a Visual SLAM method with a semantic topology graph based on ORB-SLAM2. The proposed method mitigates the failure of re-localization and loop closure detection in dynamic environments caused by limited feature information. The specific improvements are as follows:
(1)
The method of object detection is used to obtain information about static objects in the dynamic environment.
(2)
Limited feature information and frame loss may occur due to the occlusion caused by dynamic objects. Thus, this paper proposes to construct a topological graph by exploiting the invariant spatial positions of static objects; by judging the similarity of semantic topological graphs, similar keyframes can be found and the pose information can be determined quickly.
(a)
Semantic nodes are obtained from the central points of static objects.
(b)
The edges between nodes are obtained using the Delaunay triangulation method.
(c)
An innovative topological graph similarity comparison algorithm is proposed.
This paper is organized as follows: Section 2 presents the related work, and Section 3 is a description of the method, which first summarizes the system structure of this paper and then describes the algorithm of this paper in detail. Section 4 is the experimental section, which first verifies the feasibility of the method by using different data and then tests and evaluates the re-localization and loop closure detection. Section 5 is the conclusion.

2. Related Work

2.1. Re-Localization

ORB-SLAM2, as a sparse feature-based VSLAM method, is prone to tracking loss during pose estimation. The re-localization function of ORB-SLAM2 is activated when frames are lost. Re-localization is realized by discriminative coordinate regression: RANSAC is first used to obtain the fundamental matrix describing the relationship between two image positions, the essential matrix is then obtained from it jointly with the camera intrinsic parameters, and the PnP algorithm is finally used to solve the camera pose [17]. The purpose is to obtain a sufficient number of matched feature points between the preceding and following sequence frames so that the rotation and translation can be solved, the lost camera pose can be estimated, and the tracking process can be resumed [18]. The core idea of re-localization is to find the keyframe that is closest to the current frame among the previous keyframes. Firstly, the BoW vector of the current frame is calculated. Then, the keyframes with high BoW similarity and more than 15 matched feature points are selected as the sequence of candidate keyframes. Finally, re-localization is deemed successful if the number of inliers matching the current frame with a candidate frame is greater than 50. The specific process is shown in Figure 1. Due to the presence of dynamic objects in the environment, incorrect matching information often occurs during feature matching.
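The following Python snippet is a simplified, pseudocode-style sketch of this candidate-selection and verification flow, not actual ORB-SLAM2 code; the BoW matching and PnP/RANSAC steps (bow_match_count, pnp_ransac) are hypothetical callbacks supplied by the caller.

```python
def relocalize(current_frame, keyframes, bow_match_count, pnp_ransac,
               min_bow_matches=15, min_inliers=50):
    """Return an estimated pose if re-localization succeeds, otherwise None.

    bow_match_count(frame, kf) -> number of BoW feature matches (hypothetical callback)
    pnp_ransac(frame, kf)      -> (pose, inlier_count) from PnP + RANSAC (hypothetical callback)
    """
    # Step 1: keep only keyframes whose BoW similarity yields more than 15 matches.
    candidates = [kf for kf in keyframes
                  if bow_match_count(current_frame, kf) > min_bow_matches]
    # Step 2: traverse the candidates; succeed once more than 50 inliers support a pose.
    for kf in candidates:
        pose, inliers = pnp_ransac(current_frame, kf)
        if inliers > min_inliers:
            return pose
    return None
```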

2.2. Loop Closure Detection (LCD)

Loop closure detection (LCD) is the ability of a mobile robot to recognize a scene that has been reached before and to close the loop. The basic process is to compute the similarity between keyframes and determine whether the robot has passed through the same place, i.e., "returned to the origin". The LCD problem is the process of determining the correlation between current and historical data to identify whether a location has been reached before, and its essence is to reduce the cumulative error in map construction [19]. With the development of computer vision, LCD algorithms based on appearance information became mainstream early on, among which BoVW is the most common [20]. BoVW has been widely used in the LCD of VSLAM systems because of its high detection efficiency and retrieval accuracy. However, the presence of dynamic objects interferes with the judgment of LCD [21]. In recent years, the continuous development of deep learning technology in image recognition, computer vision, and mobile robotics has provided new solutions for the LCD module in SLAM systems. DS-SLAM [22] employed SegNet [23] to segment dynamic objects and removed the feature points located in dynamic regions to alleviate dynamic interference in loop detection. The authors of Reference [24] proposed realizing LCD by integrating visual–spatial–semantic information with features of a topological graph and a convolutional neural network. They first built semantic topological graphs based on semantic information and then used random walk descriptors to characterize the topological graphs for graph matching; finally, they calculated the geometric similarity and the appearance similarity to determine loop closure. Reference [25] presents a strategy that models the visual scene as a semantic sub-graph by preserving only the semantic and geometric information from object detection. The authors used a sparse Kuhn–Munkres algorithm to speed up the search for correspondences among nodes, and the shape similarity and the Euclidean distance between objects in 3D space were combined to measure image similarity through graph matching. However, topology building in these approaches is relatively complex, and these studies only address LCD without considering the re-localization of a mobile robot. Inspired by the above work, a re-localization and loop closure detection method based on a semantic topology graph is proposed in this paper.

3. Methodology

3.1. Method Overview

To solve the problem of re-localization and loop closure detection failing due to the interference of dynamic objects, we constructed a topology graph for assisted localization based on ORB-SLAM2 using the static regions obtained from object detection. Five threads run in parallel in the proposed system: tracking, semantic topology graph, local mapping, loop closing, and full BA. The framework of the system is shown in Figure 2. The raw RGB images are processed in the tracking thread and the semantic topology graph thread simultaneously. The tracking thread first extracts ORB feature points and waits for the image whose dynamic properties have been classified via object detection. Then, ORB feature point outliers are judged and removed based on the classification of the dynamic properties; the ORB feature points in the highly dynamic and low dynamic regions are considered outliers. The semantic topology graph thread is mainly designed to build a semantic topology graph using the results of object detection, and its detailed process is shown in Figure 3. The RGB images are first classified for dynamic properties by object detection, as described in Section 3.2. Then, we establish the semantic topology graph from the low dynamic (static) regions, as described in Section 3.3. In order to reduce the computational load, we create a semantic topology graph only for keyframes. Furthermore, the semantic topology graphs are saved as a bag of topology graphs, which is convenient for the loop-closing and re-localization threads.
This paper proposes to use the invariant spatial positions of static objects to construct the semantic topology graph. By judging the similarity of semantic topologies, we can find similar keyframes and then quickly determine the pose information. The process of the proposed solution is presented in Figure 3. Firstly, we extract semantic information using the YOLOv5 network, and the low dynamic regions identified by the classification of dynamic properties are saved as static regions. Secondly, the topology graph is obtained using Delaunay triangulation, and we acquire the information of the semantic nodes and edges in the topology separately. Finally, we obtain the weight of the topology from the information of the semantic nodes and edges, which is used to compute the similarity.

3.2. Object Detection

The robustness of SLAM is improved by identifying and eliminating dynamic objects using semantic information. For instance, among SLAM systems of the same type, DynaSLAM [26], Detect-SLAM [27], and DS-SLAM [22] introduce instance segmentation, object detection, and semantic segmentation, respectively. Each of these methods has its own advantages, and they are capable of greatly improving the performance of SLAM by detecting and removing dynamic objects. Table 1 shows the approximate relationship between the segmentation accuracy and efficiency of the different methods. As part of the VSLAM optimization process, dynamic objects are usually detected and then treated as outliers. However, semantic segmentation networks are computationally expensive, making them impractical for real-time or robotic applications. Thus, object detection methods are widely used in the preprocessing stage. In Reference [8], for instance, YOLOv4 is adopted to predict the classes and bounding boxes of objects in real time, and a dynamic object probability model is then added to enhance the real-time performance of the ORB-SLAM2 system. Similarly, Theodorou, C. et al. [9] proposed a VSLAM system based on ORB-SLAM3 and YOLOR. This system uses the object detection models YOLOX and YOLOR to detect moving objects and extract feature points. On the basis of the detection results and the semantic data in the image, a module is introduced that can remove dynamic objects. The results show that the accuracy of this system is significantly improved in dynamic indoor environments.
In this paper, we are more concerned with efficiency than with segmentation accuracy, so we decided to introduce the YOLOv5 network. YOLOv5 is a robust and efficient model that can detect target objects well in images [28]. Compared with YOLOv3 and YOLOv4, this model has higher accuracy and better real-time performance. The main structure of the YOLOv5 network consists of a feature extractor, a multi-scale feature fusion module, a prediction head, and a post-processing module. The feature extractor uses the CSPNet structure, which can quickly extract image features [29]. The multi-scale feature fusion module improves detection accuracy by fusing feature maps of different resolutions. The prediction head consists of a classifier and a regressor, which are used to quickly predict the location and category of the detection box. The post-processing module removes redundant detection results by performing non-maximum suppression on the predictions.
We used a YOLOv5 model trained on the MS COCO dataset [30]. The MS COCO training data include more than 80 different classes of objects, which is basically sufficient for the use of VSLAM in dynamic environments. The result of object detection using YOLOv5 is shown in Figure 4. As can be seen from the detection results, various categories of objects, such as people, chairs, monitors, keyboards, mice, and cups, were successfully detected. Based on everyday experience, we carried out a simple classification of motion for the various categories of objects, as shown in Table 2. Furthermore, although the person and the chair overlap in the picture, the YOLOv5 algorithm can still detect them successfully. At the same time, some relatively small objects, such as keyboards, mice, and books, were also successfully detected. Therefore, the overall detection results meet the application requirements of VSLAM.
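As an illustration of this preprocessing step, the sketch below loads a COCO-pretrained YOLOv5 model through the official torch.hub interface and keeps only detections whose classes are treated as static; the dynamic/static split shown here follows Table 2 and is an assumption for illustration, not necessarily the exact class lists used in the paper.

```python
import torch

# Load a COCO-pretrained YOLOv5 model from the official repository
# (downloads weights on first use).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

HIGHLY_DYNAMIC = {'person'}            # assumed split following Table 2
MEDIUM_DYNAMIC = {'chair', 'book'}

def detect_static_objects(image_path):
    """Run YOLOv5 and keep only detections treated as static (low dynamic) objects."""
    results = model(image_path)
    # results.xyxy[0]: one row per detection -> [x1, y1, x2, y2, confidence, class_id]
    static_boxes = []
    for x1, y1, x2, y2, conf, cls_id in results.xyxy[0].tolist():
        name = results.names[int(cls_id)]
        if name not in HIGHLY_DYNAMIC and name not in MEDIUM_DYNAMIC:
            static_boxes.append((name, (x1, y1, x2, y2)))
    return static_boxes

# Example (hypothetical file name): static_boxes = detect_static_objects('frame_000001.png')
```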

3.3. Establishment of Topology Graph

Firstly, we obtained the object detection frame by detecting the original frame with YOLOv5 and retaining the static objects. Secondly, we computed the center points of the static object boxes, which are used as the vertices of the Delaunay triangulation. Finally, the topology graph was obtained via Delaunay triangulation. The process is shown in Figure 5.

3.3.1. Establishment of Semantic Nodes

The result of image object detection contains the object class, the coordinates of the detection frame, and a confidence value. The center point of the object detection box is used as the vertex of the topology graph. Thus, for each detected object in an image, we represent it as a triple:

$$o_i = (m_i, z_i, c_i),$$

where $m_i = (u_i, v_i)$, and $u_i$ and $v_i$ are the horizontal and vertical coordinates of the center point of the box enclosing object $o_i$; $z_i$ is the depth at the position $(u_i, v_i)$ in the depth image that corresponds to the current image; and $c_i$ is the class label of the object. In accordance with the bounding box, the geometric center $m_i = (u_i, v_i)$ of the detected object is defined as:

$$u_i = \frac{x_{i2} - x_{i1}}{2} + x_{i1}, \qquad v_i = \frac{y_{i2} - y_{i1}}{2} + y_{i1},$$

where $(x_{i1}, y_{i1})$ and $(x_{i2}, y_{i2})$ are the coordinates of the upper-left and lower-right corners of the bounding box in the pixel coordinate system. In order to facilitate the matrix similarity calculation, numerical labels are used to represent the categories of objects identified by object detection; the correspondence is shown in Table 3. Therefore, $c_i$ takes the value of the class label number of the corresponding class in Table 3.
The set of vertex information in each image is represented as:
$$O = \{ o_i \mid i = 1, 2, \ldots, n \},$$

where n represents the number of categories detected in the image.
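A minimal sketch of Equations (1)–(3) is given below: it builds the vertex matrix from the static bounding boxes, the aligned depth image, and class label numbers. The CLASS_LABELS dictionary is a small hypothetical excerpt of Table 3 used only for illustration.

```python
import numpy as np

CLASS_LABELS = {'tv': 1, 'keyboard': 2, 'chair': 3, 'book': 4, 'mouse': 5}  # excerpt of Table 3

def build_semantic_nodes(static_boxes, depth_image):
    """static_boxes: list of (class_name, (x1, y1, x2, y2)) for static objects.
       depth_image: depth image aligned with the current RGB frame.
       Returns an n x 4 matrix whose rows are o_i = (u_i, v_i, z_i, c_i)."""
    nodes = []
    for name, (x1, y1, x2, y2) in static_boxes:
        u = (x2 - x1) / 2.0 + x1                 # horizontal center of the bounding box
        v = (y2 - y1) / 2.0 + y1                 # vertical center of the bounding box
        z = float(depth_image[int(v), int(u)])   # depth value at (u, v)
        c = CLASS_LABELS.get(name, 0)            # class label number (0 if not in the excerpt)
        nodes.append([u, v, z, c])
    return np.array(nodes)
```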

3.3.2. Establishment of Edges

We obtained the topology graph using Delaunay triangulation over all nodes in the image [31]. The generated graph structure is similar to a sparse mesh, as shown in Figure 6a. In Delaunay triangulation, for a given set of discrete points, the triangulation is constructed such that no point lies inside the circumcircle of any triangle, which maximizes the minimum angle over all angles of the triangles. Therefore, only adjacent feature points are connected in the graph. Furthermore, Delaunay triangulation reduces the complexity of constructing the topology graph [32]. On the other hand, the topology graph obtained by Delaunay triangulation does not change if the relative positions of the vertices do not change. This is one of the reasons why the Delaunay triangulation method is used to construct the topology in this paper: the positions of static objects do not change in the dynamic environment, so the relative positions of the static objects detected by object detection do not change either. In other words, the static topology obtained by Delaunay triangulation is unique if the positions of the static objects remain unchanged in an image. At the same time, Delaunay triangulation is local; that is, adding, deleting, or moving a vertex only affects the neighboring triangles. As shown in Figure 6, Figure 6b has one more target point P than Figure 6a, and the edges between the other points do not change, except for those connecting point P to its neighboring points. For instance, in Figure 6b, the yellow lines are newly added and the red lines remain unchanged, which means that the topology between the other points is not changed. Therefore, the similarity of frames can be judged in the next step by comparing the similarity of their topology graphs, on account of the uniqueness and locality of the topology graph.
Because the relative distances between static objects in the scene do not change with the positions of the bounding boxes, in order to better calculate the similarity of the topology structure, we calculated the length of each edge between vertices. The relative distance $d_{ij}$ between $o_i$ and $o_j$ in three-dimensional space is the Euclidean distance:

$$d_{ij} = \begin{cases} \| o_i - o_j \|_2, & o_i \text{ is connected to } o_j \\ 0, & o_i \text{ is not connected to } o_j, \end{cases}$$

where $\| o_i - o_j \|_2 = \sqrt{(X_i - X_j)^2 + (Y_i - Y_j)^2 + (Z_i - Z_j)^2}$ and $(X_i, Y_i, Z_i)$ is the camera coordinate of point $o_i$.

The camera coordinate of point $o_i$ is computed as follows:

$$Z_i = \frac{z_i}{F}, \qquad X_i = \frac{(u_i - c_x) Z_i}{f_x}, \qquad Y_i = \frac{(v_i - c_y) Z_i}{f_y},$$

where $z_i$ is the depth value at the position $(u_i, v_i)$ of point $o_i$ in the aligned depth image, $F$ is the scaling factor of the depth image, and $f_x$, $f_y$, $c_x$, $c_y$ are the intrinsic parameters of the camera.

Thus, for the current graph, the correlation matrix $D$ is used to describe the relationships of the edges in the topology:

$$D = \{ d_i \mid i = 1, 2, \ldots, n \}, \qquad d_i = (d_{i1}, d_{i2}, \ldots, d_{in}),$$

where n indicates the number of categories in the image.
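The sketch below illustrates Equations (4)–(6): it back-projects the node centers to camera coordinates, triangulates the 2D centers with scipy's Delaunay implementation, and fills the correlation matrix with the 3D distances of connected node pairs (0 for unconnected pairs). The intrinsics and depth scale are placeholder values for illustration, not calibrated parameters.

```python
import numpy as np
from scipy.spatial import Delaunay

FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5   # assumed pinhole intrinsics
DEPTH_SCALE = 5000.0                           # assumed scaling factor F of the depth image

def build_correlation_matrix(nodes):
    """nodes: n x 4 matrix with rows (u, v, z, c). Returns the n x n matrix D."""
    n = len(nodes)
    # Back-project the node centers to camera coordinates (Equation (5)).
    Z = nodes[:, 2] / DEPTH_SCALE
    X = (nodes[:, 0] - CX) * Z / FX
    Y = (nodes[:, 1] - CY) * Z / FY
    points_3d = np.stack([X, Y, Z], axis=1)

    D = np.eye(n)                              # diagonal set to 1 as in Equation (8)
    if n >= 3:                                 # Delaunay needs at least 3 non-collinear points
        tri = Delaunay(nodes[:, :2])           # triangulate the 2D centers
        for simplex in tri.simplices:          # each simplex is a triangle (i, j, k)
            for a in range(3):
                i, j = simplex[a], simplex[(a + 1) % 3]
                d = np.linalg.norm(points_3d[i] - points_3d[j])
                D[i, j] = D[j, i] = d          # edge weight = 3D Euclidean distance
    return D
```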

3.3.3. The Topology Graph Representation

The topology graph $H = (O, D)$ is established for each frame, where O and D represent the set of vertex information and the correlation matrix of the edges in the topology graph, respectively. In particular, the vertices of our topology graph are built on the basis of the object detection results. Every vertex has its own attributes, as described in Equation (3), and the number of vertices equals the number of bounding boxes. All vertices are connected to each other using undirected edges, and the edges of the topological graph represent the distances between two objects.

Thus, based on Equations (3) and (6), we can obtain $H_p = (O_p, D_p)$ for frame $I_p$, where $O_p$ is an $n \times 4$ matrix and n represents the number of categories obtained by object detection in frame $I_p$. $O_p$ is defined as:

$$O_p = \begin{bmatrix} o_1 \\ o_2 \\ \vdots \\ o_n \end{bmatrix} = \begin{bmatrix} u_1 & v_1 & z_1 & c_1 \\ u_2 & v_2 & z_2 & c_2 \\ \vdots & \vdots & \vdots & \vdots \\ u_n & v_n & z_n & c_n \end{bmatrix}$$

$D_p$ is an $n \times n$ correlation matrix with weights, where n again represents the number of categories obtained by object detection in frame $I_p$. Based on Equation (6), $D_p$ is defined as:

$$D_p = \begin{bmatrix} 1 & d_{12} & \cdots & d_{1n} \\ d_{21} & 1 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 1 \end{bmatrix}$$

3.4. Similarity Comparison Algorithm

In order to calculate the similarity of the topologies in two images more efficiently, we define $W_p$ to denote the weight of the topology in frame $I_p$. $W_p$ is defined as:

$$W_p = D_p O_p = \begin{bmatrix} 1 & d_{12} & \cdots & d_{1n} \\ d_{21} & 1 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 1 \end{bmatrix} \cdot \begin{bmatrix} u_1 & v_1 & z_1 & c_1 \\ u_2 & v_2 & z_2 & c_2 \\ \vdots & \vdots & \vdots & \vdots \\ u_n & v_n & z_n & c_n \end{bmatrix} = \begin{bmatrix} \omega_{11} & \omega_{12} & \omega_{13} & \omega_{14} \\ \omega_{21} & \omega_{22} & \omega_{23} & \omega_{24} \\ \vdots & \vdots & \vdots & \vdots \\ \omega_{n1} & \omega_{n2} & \omega_{n3} & \omega_{n4} \end{bmatrix} = \{ \omega_i \mid i = 1, 2, \ldots, n \},$$

where $\omega_i = (\omega_{i1}, \omega_{i2}, \omega_{i3}, \omega_{i4})$ and n is the number of nodes in frame $I_P$. Therefore, $W_P$ is an $n \times 4$ matrix. Similarly, the weight $W_Q$ of the topology graph in frame $I_Q$ is:

$$W_Q = D_Q O_Q = \{ \omega_i \mid i = 1, 2, \ldots, k \},$$

where k denotes the number of nodes in frame $I_Q$.

This paper uses the cosine similarity to express the relative difference between the topologies of the two frames $I_P$ and $I_Q$. Since the number of nodes in each frame is uncertain, the numbers of nodes in frames $I_P$ and $I_Q$ are not necessarily equal. Thus, before calculating the cosine similarity, we need to ensure that $W_P$ and $W_Q$ have the same dimension. The dimension is labeled N, which is defined as:

$$N = \begin{cases} n, & n \geq k \\ k, & n < k \end{cases}$$

We first calculate the cosine angle of $W_P$ and $W_Q$ with the same dimension. Then, we take the weighted average of the cosine angle matrix. We consider the value of the weighted average $S(I_P, I_Q)$ as the similarity of the matrices $W_P$ and $W_Q$. $S(I_P, I_Q)$ is defined as:

$$S(I_P, I_Q) = \frac{1}{N} \cdot \frac{W_P \cdot W_Q}{\| W_P \| \times \| W_Q \|}$$

In summary, a semantic topology graph of static objects is established using the above method. Ultimately, the similarity of two images is judged by calculating the value of $S(I_P, I_Q)$.
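A sketch of Equations (9)–(12) is given below under one plausible reading of the similarity formula: the shorter weight matrix is zero-padded to the common dimension N and the cosine similarities of corresponding rows are averaged. The actual implementation in the paper may differ in these details.

```python
import numpy as np

def topology_similarity(O_p, D_p, O_q, D_q):
    """Compute S(I_P, I_Q) from the node matrices O and correlation matrices D of two frames."""
    W_p = D_p @ O_p                              # n x 4 weight matrix of frame I_P (Equation (9))
    W_q = D_q @ O_q                              # k x 4 weight matrix of frame I_Q (Equation (10))

    N = max(len(W_p), len(W_q))                  # common dimension (Equation (11))
    W_p = np.vstack([W_p, np.zeros((N - len(W_p), 4))])   # zero-pad the shorter matrix
    W_q = np.vstack([W_q, np.zeros((N - len(W_q), 4))])

    similarity = 0.0
    for wp, wq in zip(W_p, W_q):
        norm = np.linalg.norm(wp) * np.linalg.norm(wq)
        if norm > 0:
            similarity += np.dot(wp, wq) / norm  # cosine of one pair of corresponding rows
    return similarity / N                        # average over N rows (Equation (12))
```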

4. Experiment

4.1. Datasets

In this study, the effect of the proposed algorithm was evaluated using the Technical University of Munich (TUM) RGB-D datasets and the OpenLORIS-Scene datasets, which are listed in Table 4. The TUM datasets are among the most widely used SLAM datasets in the literature. The TUM RGB-D datasets [33,34] contain both high- and low-dynamic office sequences recorded with a Microsoft Kinect sensor at full frame rate (30 Hz); RGB images (640 × 480) and depth images were recorded, and the ground-truth trajectory was obtained with a high-accuracy motion capture system.
The OpenLORIS-Scene datasets by Xuesong Shi et al. are designed to test the real-world practicality of lifelong SLAM algorithms for service robots [35]. OpenLORIS-Scene is a recently published dataset that provides real-world robotic data with more challenging factors and significant environmental changes, such as blurred, featureless images and dim lighting. Environmental changes are likely to be the main challenge for re-localization. We mainly used the Office series of OpenLORIS-Scene to evaluate the algorithm, specifically seven sequences containing dynamic objects (persons) in a university office with benches and cubicles. Sample images of the TUM RGB-D and OpenLORIS-Scene datasets are shown in Figure 7.

4.2. Calculated Similarity

We selected four keyframes from “freiburg3_w_xyz” to verify the feasibility of the proposed topology in this paper, as shown in Figure 8. Figure 8a is the first keyframe, and the topology graph is constructed based on the object detection results. Similarly, Figure 8b–d are the topology graphs of the 22nd, 44th, and 197th keyframes, respectively.
In order to verify the correctness of the proposed method, we used the 22nd, 44th, and 197th keyframes to calculate the similarity with the first keyframe. At the same time, we also carried out the similarity calculation with the first keyframe and itself. The obtained results are shown in Table 5.

4.3. Re-Localization Evaluation

Re-localization is often formulated as a pipeline of image retrieval followed by relative pose estimation, similar to LCD but often with a much larger database of candidate images and with more emphasis on high recall, as opposed to the high precision emphasized in LCD. In this subsection, we follow the experiments of Reference [35].
  • Correctness Score of Re-Localization (CS-R)
In order to better evaluate the performance of re-localization, the authors of Reference [35] proposed a score to evaluate its correctness. This score is called the correctness score of re-localization (CS-R) and is defined as follows:
$$\mathrm{CS\text{-}R}_{\varepsilon,\phi,\tau} = e^{-\frac{t_0 - t_{\min}}{\tau}} \cdot C_{\varepsilon,\phi}(P_0),$$

where $\tau$ is a scaling factor, and Reference [35] suggests setting $\tau = 60$ s. The absolute trajectory error (ATE) threshold $\varepsilon$ and absolute orientation error (AOE) threshold $\phi$ should be set according to the area of the scene and the expected drift of the SLAM algorithm. Furthermore, $(t_0 - t_{\min})$ is the initialization time of the algorithm.
$C_{\varepsilon,\phi}(P_0)$ is the correctness of the pose at time $t_0$. For each pose $P_k$ estimated at time $t_k$, given the ground-truth pose at that time, we assess the correctness of the estimate according to its ATE and AOE:

$$C_{\varepsilon,\phi}(P_k) = \begin{cases} 1, & \text{if } ATE(P_k) \leq \varepsilon \text{ and } AOE(P_k) \leq \phi \\ 0, & \text{otherwise} \end{cases}$$
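The following small sketch evaluates Equations (13) and (14); $\varepsilon = 0.3$ and $\tau = 60$ s follow the values quoted from Reference [35], while $\phi$ is left as a user-supplied threshold, and the example numbers are purely illustrative.

```python
import math

def correctness(ate, aoe, epsilon, phi):
    """Equation (14): 1 if both error thresholds are satisfied, 0 otherwise."""
    return 1.0 if (ate <= epsilon and aoe <= phi) else 0.0

def cs_r(ate, aoe, t0, t_min, epsilon, phi, tau=60.0):
    """Equation (13): correctness score of re-localization.
    ate, aoe   -- absolute trajectory / orientation error of the first pose P_0
    t0 - t_min -- initialization time of the algorithm in seconds
    tau = 60 s follows the suggestion of Reference [35]."""
    return math.exp(-(t0 - t_min) / tau) * correctness(ate, aoe, epsilon, phi)

# Example with illustrative numbers: a correct pose re-localized 2 s after t_min.
score = cs_r(ate=0.1, aoe=0.2, t0=3.0, t_min=1.0, epsilon=0.3, phi=1.0)
```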
Reference [35] applied Equation (13) to verify the correctness of re-localization on the office2 and office7 sequences of OpenLORIS-Scene. The authors of Reference [35] suggest setting $\varepsilon = 0.3$ and choosing $\phi$ accordingly. Thus, we performed a similar comparison test based on Reference [35], and the results were compared with the data provided in that reference, as shown in Table 6.
In terms of the correct rate, our method is better than the traditional methods. However, the proposed method depends on the results of object detection; our experimental results are therefore slightly lower than those of DXSLAM [36], a global feature-based image retrieval and group matching method.
  • Re-localization test
To test whether the algorithms could continuously achieve re-localization in changed scenes, we input the seven office sequences in turn. We first recorded whether the different methods could accomplish re-localization on the seven sets of data. The results are shown in Table 7, where "•" and "∘" indicate successful and unsuccessful re-localization, respectively. The experimental results show that most algorithms fail to re-localize on the fifth sequence. Our analysis suggests that the lighting is too dark, which affects feature point and object extraction, so re-localization eventually fails.
Then, we documented whether the algorithms could consistently perform correct pose estimation. The results are shown in Figure 9. The office sequence has seven sets of data, and each black dot on the top line in Figure 9 represents the start of one data sequence. For the four algorithms in Figure 9, the blue dots indicate successful re-localization, while the red crosses indicate unsuccessful re-localization; the blue and red lines indicate correct and incorrect pose estimation, respectively. The experimental results show that re-localization fails on the second, fifth, and seventh sequences for most algorithms. Similarly, the lighting problem affected these results.
Finally, we evaluated the average accuracy over the seven sets of data of the office sequence. The results are shown in Table 8. A larger average accuracy means a more robust approach. The statistical results show that the average accuracy of the method in this paper is 60.8%, which is slightly higher than that of the other methods, indicating the good robustness of our proposed method.

4.4. Loop Closure Detection Evaluation

In this study, in order to verify the effectiveness of the proposed method, the VSLAM system was improved based on the ORB-SLAM2 algorithm and deployed on a mobile robot platform. As shown in Figure 10, the software control of the mobile robot is realized by the ROS Kinetic software system (Ubuntu 18.04). The vision sensor of this mobile robot is an Intel RealSense D435I depth camera, which is capable of acquiring RGB and depth information with high resolution and low latency. The camera employs the latest depth perception technology, which not only enables high-precision depth measurement in both indoor and outdoor environments but also supports application scenarios such as dynamic object tracking and gesture recognition. Its pixel resolution and depth perception range are 1280 × 720 and 0.105–10 m, respectively.
We conducted experiments with DS-SLAM, DynaSLAM, ORB-SLAM2, and our method and plotted the precision–recall curves. The results are shown in Figure 11. It can be seen from the figure that our results are slightly better than those of the other methods, and high precision is maintained even at high recall. In order to evaluate the accuracy of the algorithms in loop closure detection, we calculated the loop closure detection accuracy of the different methods on the mobile robot experimental platform. We determined the number of correct loop closures in the detection results by manual labeling and then computed the accuracy as the number of correct loop closures divided by the total number of detections. The results are shown in Table 9. From the statistical results, it can be seen that the accuracy of the traditional methods BoW and ORB-SLAM2 is lower due to the interference of dynamic objects, while our method is slightly better than the DS-SLAM and DynaSLAM algorithms designed for dynamic scenes. This shows that the algorithm in this paper improves the accuracy of loop closure detection.
At the same time, we plotted the trajectories of the mobile robot platform. Figure 12a shows the keyframe trajectory without loop closure. Figure 12b shows a loop closure detection result obtained by the proposed method, indicated by the red lines; it can be seen that the proposed method could still detect a large number of loop closures under the influence of dynamic scenes. Finally, Figure 12c shows the keyframe trajectory after applying the proposed method.

5. Conclusions

Re-localization and loop closure detection often fail in dynamic environments due to limited feature point information, and the correctness of re-localization and the recall of loop closure detection are also low when the mobile robot loses frames in a dynamic environment. Therefore, this paper proposes a re-localization and loop closure detection method with a semantic topology graph based on ORB-SLAM2. We conducted experiments on the public TUM and OpenLORIS-Scene datasets and on a self-built mobile robot platform. The results show that our method can improve the feasibility, accuracy, and stability of the VSLAM system in dynamic scenes. However, the proposed method relies heavily on the results of object detection: whether the detected objects are sufficiently reliable greatly impacts our experimental performance. At the same time, the detector model may fail to predict correct results when there are significant differences between the training scenes and the actual scenes. In future work, we may employ self-supervised or unsupervised deep learning approaches to overcome this issue. On the other hand, the bag of topology graphs takes up storage space, so we will also further improve the real-time performance of the system.

Author Contributions

Conceptualization, Y.W. and Y.Z.; methodology, Y.W.; validation, L.H., W.W. and G.G.; formal analysis, Y.W.; investigation, S.T.; resources, S.T.; data curation, L.H.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W.; visualization, Y.W.; supervision, Y.Z.; project administration, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Doctoral Talent Train Project of Chongqing University of Posts and Telecommunications (grant number BYJS202115).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study; the data are available in publicly accessible repositories that do not issue DOIs. These data can be found here: https://cvg.cit.tum.de/data/datasets/rgbd-dataset/download (accessed on 1 October 2023); https://shimo.im/docs/HhJj6XHYhdRQ6jjk/read (accessed on 1 October 2023).

Acknowledgments

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part i. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [CrossRef]
  2. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  3. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  4. Shi, J.; Tomasi, C. Good features to track. In Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600. [Google Scholar]
  5. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G.R. Orb: An efficient alternative to sift or surf. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
  6. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  7. Tourani, A.; Bavle, H.; Sanchez-Lopez, J.L.; Voos, H. Visual SLAM: What are the current trends and what to expect? Sensors 2022, 22, 9297. [Google Scholar] [CrossRef]
  8. Ai, Y.b.; Rui, T.; Yang, X.q.; He, J.l.; Fu, L.; Li, J.b.; Lu, M. Visual SLAM in dynamic environments based on object detection. Def. Technol. 2021, 17, 1712–1721. [Google Scholar] [CrossRef]
  9. Theodorou, C.; Velisavljevic, V.; Dyo, V. Visual SLAM for Dynamic Environments Based on Object Detection and Optical Flow for Dynamic Object Removal. Sensors 2022, 22, 7553. [Google Scholar] [CrossRef]
  10. Wang, Y.; Bu, H.; Zhang, X.; Cheng, J. YPD-SLAM: A Real-Time VSLAM System for Handling Dynamic Indoor Environments. Sensors 2022, 22, 8561. [Google Scholar] [CrossRef]
  11. Mokssit, S.; Licea, D.B.; Guermah, B.; Ghogho, M. Deep learning techniques for visual slam: A survey. IEEE Access 2023, 11, 20026–20050. [Google Scholar] [CrossRef]
  12. Wang, S.; Lv, X.; Liu, X.; Ye, D. Compressed holistic convnet representations for detecting loop closures in dynamic environments. IEEE Access 2020, 8, 60552–60574. [Google Scholar] [CrossRef]
  13. Ge, G.; Zhang, Y.; Wang, W.; Jiang, Q.; Hu, L.; Wang, Y. Text-mcl: Autonomous Mobile Robot Localization in Similar Environment Using Text-Level Semantic Information. Machines 2022, 10, 169. [Google Scholar] [CrossRef]
  14. Yang, S.; Fan, G.; Bai, L.; Zhao, C.; Li, D. SGC-VSLAM: A semantic and geometric constraints VSLAM for dynamic indoor environments. Sensors 2020, 20, 2432. [Google Scholar] [CrossRef]
  15. Singh, G.; Wu, M.; Do, M.V.; Lam, S.-K. Fast semantic-aware motion state detection for visual slam in dynamic environment. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23014–23030. [Google Scholar] [CrossRef]
  16. Shao, C.; Zhang, L.; Pan, W. Faster r-cnn learning-based semantic filter for geometry estimation and its application in vslam systems. IEEE Trans. Intell. Transp. Syst. 2022, 23, 5257–5266. [Google Scholar] [CrossRef]
  17. Sweeney, C.; Fragoso, V.; Höllerer, T.; Turk, M. gDLS: A Scalable Solution to the Generalized Pose and Scale Problem. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 16–31. [Google Scholar]
  18. Memon, A.R.; Wang, H.; Hussain, A. Loop closure detection using supervised and unsupervised deep neural networks for monocular slam systems. Robot. Auton. Syst. 2020, 126, 103470. [Google Scholar] [CrossRef]
  19. Ma, J.; Wang, S.; Zhang, K.; He, Z.; Huang, J.; Mei, X. Fast and robust loop-closure detection via convolutional auto-encoder and motion consensus. IEEE Trans. Ind. Inform. 2022, 18, 3681–3691. [Google Scholar] [CrossRef]
  20. Williams, B.; Cummins, M.; Neira, J.; Newman, P.; Reid, I.; Tardos, J. A comparison of loop closing techniques in monocular slam. Robot. Auton. Syst. 2009, 57, 1188–1197. [Google Scholar] [CrossRef]
  21. Zhang, G.; Yan, X.; Ye, Y. Loop closure detection via maximization of mutual information. IEEE Access 2019, 7, 124217–124232. [Google Scholar] [CrossRef]
  22. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar] [CrossRef]
  23. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  24. Yuan, Z.; Xu, K.; Zhou, X.; Deng, B.; Ma, Y. Svg-loop: Semantic-visual-geometric information-based loop closure detection. Remote. Sens. 2021, 13, 3520. [Google Scholar] [CrossRef]
  25. Qin, C.; Zhang, Y.; Liu, Y.; Lv, G. Semantic loop closure detection based on graph matching in multi-objects scenes? J. Vis. Commun. Image Represent. 2021, 76, 103072. [Google Scholar] [CrossRef]
  26. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping and inpainting in dynamic scenes. arXiv 2018, arXiv:1806.05620. [Google Scholar] [CrossRef]
  27. Zhong, F.; Wang, S.; Zhang, Z.; Chen, C.; Wang, Y. Detect-slam: Making object detection and slam mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1001–1010. [Google Scholar]
  28. Wang, J.; Chen, Y.; Gao, M.; Dong, Z. Improved yolov5 network for real-time multi-scale traffic sign detection. arXiv 2021, arXiv:2112.08782. [Google Scholar] [CrossRef]
  29. Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar] [CrossRef]
  30. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  31. Dai, W.; Zhang, Y.; Li, P.; Fang, Z.; Scherer, S. Rgb-d slam in dynamic environments using point correlations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 373–389. [Google Scholar] [CrossRef]
  32. Barber, C.B.; Dobkin, D.P.; Huhdanpaa, H. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 1996, 22, 469–483. [Google Scholar] [CrossRef]
  33. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of rgb-d slam systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  34. Sturm, J.; Burgard, W.; Cremers, D. Evaluating egomotion and structure-from-motion approaches using the tum rgb-d benchmark. In Proceedings of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS), Vilamoura, Algarve, Portugal, 7–12 October 2012. [Google Scholar]
  35. Shi, X.; Li, D.; Zhao, P.; Tian, Q.; Tian, Y.; Long, Q.; Zhu, C.; Song, J.; Qiao, F.; Song, L.; et al. Are we ready for service robots? The openloris-scene datasets for lifelong slam. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3139–3145. [Google Scholar]
  36. Li, D.; Shi, X.; Long, Q.; Liu, S.; Yang, W.; Wang, F.; Wei, Q.; Qiao, F. Dxslam: A robust and efficient visual slam system with deep features. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October–24 January 2020. [Google Scholar]
Figure 1. The process of re-localization.
Figure 2. The framework of the proposed Visual SLAM system.
Figure 3. The process of the proposed solution.
Figure 4. The result of the object detection using YOLOv5.
Figure 5. The obtaining process of topological graph.
Figure 6. The process of creating topological edges.
Figure 7. The images of the TUM RGB-D and OpenLORIS-Scene. (ac), respectively, are “freibureg2_desk_with_person”, “freiburg3_walking_rpy”, and “freiburg3_walking_xyz” in TUM RGB-D; (dj) are in turn the pictures of the seven sequences in OpenLORIS-Scene.
Figure 8. The topology graph of different frames.
Figure 9. Whether the algorithms consistently performed correct pose estimation.
Figure 10. Service robot platform physical picture and Intel RealSense D435I depth camera.
Figure 11. Precision–recall curve.
Figure 12. The keyframe trajectory with loop closure detection for the city center. (a) The keyframe trajectory without loop closure detection. (b) The results of loop closure detection. The red lines indicate the detected closed loop. (c) The keyframe trajectory with loop closure detection using the method of this paper.
Table 1. Brief relationship between segmentation accuracy and efficiency of different methods.
Method | Segmentation Accuracy | Segmentation Efficiency
object detection | low | high
semantic segmentation | middle | middle
instance segmentation | high | low
Table 2. Classification of the dynamic properties of common objects in life.
Classification | Objects
Highly Dynamic | People
Medium Dynamic | Chairs, Books
Low Dynamic | Desks, TVs
Table 3. The value of the $c_i$ class label.
Class | Class Label Number | Applicable Dataset
tv | 1 | fre2, fre3, office
keyboard | 2 | fre2, fre3, office
chair | 3 | fre2, fre3, office
book | 4 | fre2, fre3, office
mouse | 5 | fre2
teddy_bear | 6 | fre2
potted_plant | 7 | fre2
cup | 8 | fre2, office
vase | 9 | fre2
car | 10 | fre2, office
desk | 11 | office
water_dispenser | 12 | office
bucket | 13 | office
door | 14 | office
bag | 15 | office
bookshelf | 16 | office
“fre2” represents “freiburg2_desk_with_person”; “fre3” represents “freiburg3_walking_rpy” and ”freiburg3_walking_xyz”; “office” represents OpenLORIS-Scene.
Table 4. The datasets used in this paper.
Dataset | Sequence | Camera | Images
TUM | freiburg2_desk_with_person | RGB-D | 4067
TUM | freiburg3_walking_rpy | RGB-D | 910
TUM | freiburg3_walking_xyz | RGB-D | 859
OpenLORIS-Scene (Office) | Office1 | RGB-D | 809
OpenLORIS-Scene (Office) | Office2 | RGB-D | 899
OpenLORIS-Scene (Office) | Office3 | RGB-D | 360
OpenLORIS-Scene (Office) | Office4 | RGB-D | 870
OpenLORIS-Scene (Office) | Office5 | RGB-D | 1589
OpenLORIS-Scene (Office) | Office6 | RGB-D | 1080
OpenLORIS-Scene (Office) | Office7 | RGB-D | 1141
Table 5. The similarity values between different keyframes and the first keyframe.
Similarity | 1st | 22nd | 44th | 197th
1st | 100% | 95% | 55% | 68%
Table 6. The value of CS-R for different methods.
Office2,7 | ORB-SLAM2 | DS-SLAM | DXSLAM | Dynamic-SLAM | Ours
CS-R | 0.997 | 0.996 | 0.999 | 0.996 | 0.998
Table 7. Whether algorithms could re-localize.
Office Sequence | Sequence 1 | Sequence 2 | Sequence 3 | Sequence 4 | Sequence 5 | Sequence 6 | Sequence 7
DXSLAM | | | | | | |
ORB-SLAM2 | | | | | | |
DS-SLAM | | | | | | |
Ours | | | | | | |
“•” is successful re-localization; “∘” is unsuccessful re-localization.
Table 8. The average accuracy on the office sequence for different methods.
Office Sequence | ORB-SLAM2 | DS-SLAM | DXSLAM | Ours
Average accuracy | 52.5% | 28.6% | 54.6% | 60.8%
Table 9. The accuracy for different methods on the mobile robot experimental platform.
Method | BoW | ORB-SLAM2 | DS-SLAM | DynaSLAM | Ours
Accuracy | 61.8% | 69.3% | 76.2% | 78.4% | 80.1%

