A Semantic Information-Based Optimized vSLAM in Indoor Dynamic Environments

: In unknown environments, mobile robots can use visual-based Simultaneous Localization and Mapping (vSLAM) to complete positioning tasks while building sparse feature maps and dense maps. However, the traditional vSLAM works in the hypothetical environment of static scenes and rarely considers the dynamic objects existing in the actual scenes. In addition, it is difﬁcult for the robot to perform high-level semantic tasks due to its inability to obtain semantic information from sparse feature maps and dense maps. In order to improve the ability of environment perception and accuracy of mapping for mobile robots in dynamic indoor environments, we propose a semantic information-based optimized vSLAM algorithm. The optimized vSLAM algorithm adds the modules of dynamic region detection and semantic segmentation to ORB-SLAM2. First, a dynamic region detection module is added to the vision odometry. The dynamic region of the image is detected by combining single response matrix and dense optical ﬂow method to improve the accuracy of pose estimation in dynamic environment. Secondly, the semantic segmentation of images is implemented based on BiSeNet V2 network. For the over-segmentation problem in semantic segmentation, a region growth algorithm combining depth information is proposed to optimize the 3D segmentation. In the process of map building, semantic information and dynamic regions are used to remove dynamic objects and build an indoor map containing semantic information. The system not only can effectively remove the effect of dynamic objects on the pose estimation, but also use the semantic information of images to build indoor maps containing semantic information. The proposed algorithm is evaluated and analyzed in TUM RGB-D dataset and real dynamic scenes. The results show that the accuracy of our algorithm outperforms that of ORB-SLAM2 and DS-SLAM in dynamic scenarios.


Introduction
Simultaneous Localization and Mapping (SLAM) [1] is considered to be the basis for intelligent mobile robots to perceive their surroundings and perform tasks such as navigation, obstacle avoidance, and path planning. Currently, SLAM has been used in many industries, such as auto pilot and virtual reality [2]. The main problem solved by SLAM is that of mobile robots using sensors to build maps while estimating their pose in unknown environments [3]. At present, the main research directions of SLAM can be divided into two categories, laser SLAM and vSLAM. Compared with the expensive sensors of laser SLAM, vSLAM is popular among researchers for its price advantage. The sensors commonly used for vSLAM are monocular cameras, binocular cameras, and depth cameras. Furthermore, vSLAM can use the image information captured by the camera for semantic perception of the surrounding environment [4].
At present, many achievements have been made in traditional vSLAM. However, these vSLAM schemes are worked under the assumption of a static environment. Dynamic objects in real scenes can degrade the performance of vSLAM [5]. In addition to this, the dynamic objects in the maps are not conducive to robot navigation or man-machine interaction [6]. Therefore, the effect of dynamic objects on vSLAM cannot be ignored.
With the development of deep learning, many scholars have integrated methods such as object detection and semantic segmentation into vSLAM. These methods can improve the scenario understanding of vSLAM and provide new directions for the development of vSLAM [7]. In terms of tracking, semantic information can assist inter-frame matching to obtain more robust results. In terms of mapping, vSLAM can build semantic maps by combining semantic information in images. The maps with semantic information can give mobile robots a higher understanding of the environment.
For the effect of dynamic objects on vSLAM, a semantic information-based optimized vSLAM is proposed with ORB-SLAM2 [8] as the base framework. In the process of frame matching, in order to improve the accuracy of pose estimation in dynamic sceneries, the detection of dynamic regions that are based on homography matrix and dense optical flow methods are used to remove feature points in dynamic regions. In the process of mapping, the semantic information in the image is obtained using the deep learning method. Then, a region growing algorithm combined with depth information is used to improve the accuracy of object semantic segmentation. Finally, the 3D information of dynamic objects is removed using semantic information and dynamic region detection to avoid building dynamic objects into the map. The main contributions of this paper are as follows: 1. A strategy that enables vSLAM initialization in dynamic environments is proposed, which is able to initialize local maps and poses robustly and efficiently.
2. An efficient method for inter-frame pose estimation in dynamic environments based on homography matrix and dense optical flow method is proposed. 3. A method of optimized semantic segmentation based on depth images is proposed for accurately annotating semantic objects in key frames.
The remainder of this paper is organized as follows: Section 2 discusses the related research on vSLAM in dynamic environments. Section 3 presents the proposed method for localization and building maps for vSLAM in indoor dynamic scenes. Section 4 validates the proposed method using publicly available datasets and real scenario. Finally, Section 5 summarizes this study.

Related Work
At present, many vSLAM schemes already have excellent performance in static environments. Some examples are monocular camera schemes [9,10], binocular camera schemes [11] and RGB-D camera schemes [12]. With the popularity of a series of depth cameras such as Kinect and Releasense, RGB-D cameras are widely used in vSLAM schemes. Compared to monocular and binocular cameras, depth cameras can perform direct ranging. In addition, depth cameras can compensate for the high computational complexity of pure vision schemes. ORB-SLAM2 enjoys widely acclaim in the vSLAM field as a complete open-source scheme. ORB-SLAM2 determines the correspondence between the 2D ORB points of the current frame and the 3D points of the local map by feature matching. Then, the 3D coordinates of the matched points in the local map are projected to the current frame to estimate the camera pose by minimize the reprojection error. Finally, a global sparse map is built by bundle adjustment (BA). In order to run vSLAM in a specific static environment, some researchers had designed variants of the vSLAM scheme. Cai et al. [13] proposed an affine transformation-based ORB feature extraction method (Affine-ORB) which is used and applied to existing robot vision SLAM methods. Li et al. [14] designed an RGB-D SLAM system that works specifically in structured environments, aiming to improve tracking and mapping accuracy by relying on geometric features extracted from the surroundings. Nebiker et al. [15] designed a mobile mapping system installed on a motorized tricycle to count on-street parking statistics using RGB-D camera and GPS system. Chen et al. [16] built a SLAM system using RGB-D cameras that are suitable for working in complex orchard environments. Zhang et al. [17] combined the Grid-based Motion Statistics (GMS) with the RANSAC algorithm to jointly filter the feature matching set acquired by ORB to improve the accuracy of ORB-SLAM2 poses. However, all of the above-mentioned works work in static environments. When dynamic objects are present in the environment, the system performance of vSLAM will deteriorate several times or even tens of times [18].
Dynamic objects usually affect the inter-frame pose estimation and mapping of vSLAM systems. The matching relationship between feature points on dynamic objects is usually an important factor that causes pose estimation errors. Wei et al. [19] combined a GMS feature point matching method with a k-means clustering algorithm to reject feature points on dynamic objects. Yang et al. [20] cluster regions in each image frame using depth information and intensity information. Then, dynamic feature points are eliminated by checking the consistency of motion of feature points in each region.
In recent years, many scholars consider the incorporation of deep learning into vSLAM as the key to solve the dynamic environment problem. The semantic information in the images obtained by deep learning further improves the perception and understanding of the environments. Bescos et al. [21] proposed DynaSLAM based on ORB-SLAM2, which used the Mask-RCNN [22] network to achieve dynamic detection and background restoration functions. Yu et al. [23] proposed the DS-SLAM, which uses SegNet [24] to perform semantic segmentation to obtain semantic masks. Later, Xi et al. [25] improved the segmentation accuracy of DS-SLAM through PSPNet network [26] to achieve higher pose estimation accuracy in dynamic environments. Wang [27] et al. proposed Dynamic-SLAM, which uses convolutional neural networks to build Single Shot MultiBox Detector (SSD) [28] to detect dynamic objects. Sun et al. [29] took the lead in applying weakly supervised semantic segmentation to the dynamic detection of vSLAM. The drawback of this approach is that motion detection takes too much time, which affects the real-time performance of SLAM. Zhang et al. proposed the VDO-SLAM [30], which does not rely on the a priori information of motion. Instead, it adopts the dense optical flow method and semantic segmentation to estimate the motion of objects in the scene, which builds a globally consistent scene map with high robustness and accuracy. Soares et al. [31] proposed Crowed-SLAM, which uses the YOLOv3 [32] Tiny network to detect crowds in images and can deal well with the crowded environment problem in SLAM.
From the above analysis, there have been many research results on vSLAM. However, the dynamic objects in the scene greatly affect the conventional vSLAM development scheme. Many scholars have started to study operational vSLAM systems in dynamic environments. Among them, the focus of the research is on dynamic feature point elimination and the building of point cloud maps.
1. For dynamic feature point elimination, the existing methods can be classified into two types. One is to eliminate dynamic feature points using multi-viewpoint geometry [33] or new matching methods. Another approach is based on deep learning to acquire dynamic objects and then eliminate the feature points in dynamic regions. However, deep learningbased approaches require a large amount of computation and usually do not enable realtime operation.
2. For point cloud map building, the existing methods eliminate dynamic objects in local maps by semantic segmentation of key frames. This approach not only reduces the impact of dynamic objects on global optimization and loopback detection, but also enables to build maps with semantic information. However, most of the segmentation networks cannot extract the objects accurately due to the influence of semantic segmentation accuracy.

Overview
The vSLAM tracking module and mapping module will be affected when objects are moving within the field of view of the camera during indoor data acquisition. Therefore, vSLAM should have the ability of robust inter-frame matching and dynamic object detection in dynamic environments. To solve the above problems, we optimized a vSLAM system based on ORB-SLAM2 that run in dynamic environments and generate semantic maps, as shown in Figure 1. Compared with ORB-SLAM2, our system has made the following improvements with the following four modules.
2. Motion detection module. It calculates the homography of the current color image frame and the previous image frame. Then, the image of the current frame is transformed using the homography, and the optical flow field is calculated using the dense optical flow method. Finally, the dynamic region for the tracking module is obtained based on the optical flow field. In addition, the semantic masks are combined with the dynamic regions to acquire the dynamic targets for the map building module. 3. Semantic segmentation module. The semantic information is obtained by semantic segmentation of color images using BiSeNet V2 network. Then, the results of semantic segmentation are optimized using region growing algorithm and depth image. Finally, the semantic information in the image is mapped to the point cloud. 4. Map building module. The tracking module selects key frames, and then the depth image and color image of the dynamic target are removed from the key frames to build the point cloud of the current frame and add semantic information. Finally, the mapping module uses the pose information to join the point clouds from keyframes, and if a looping is detected, it performs global pose optimization and outputs a point cloud map and pose information with semantic information.

Tracking Module
The initialization performance of vSLAM has a great impact on the subsequent pose estimation and mapping [18]. Conventional RGB-D SLAM usually adopts the first frame 1.
Tracking module. ORB feature points are extracted from the color image. Then, the dynamic regions in the image are acquired based on the dynamic detection module, and the ORB feature points that contained in the regions are removed. Finally, the remaining feature points are matched, and the feature matching set is used to estimate the pose of the camera and to determine whether the current image is a key frame.
When the system is in the first frame of operation, the dynamic regions in the image cannot be acquired, so only the first frame is semantic segmented and then the dynamic target is directly acquired, and then the pose estimation is done for the first and second frames to complete the initialization.

2.
Motion detection module. It calculates the homography of the current color image frame and the previous image frame. Then, the image of the current frame is transformed using the homography, and the optical flow field is calculated using the dense optical flow method. Finally, the dynamic region for the tracking module is obtained based on the optical flow field. In addition, the semantic masks are combined with the dynamic regions to acquire the dynamic targets for the map building module.

3.
Semantic segmentation module. The semantic information is obtained by semantic segmentation of color images using BiSeNet V2 network. Then, the results of semantic segmentation are optimized using region growing algorithm and depth image. Finally, the semantic information in the image is mapped to the point cloud.

4.
Map building module. The tracking module selects key frames, and then the depth image and color image of the dynamic target are removed from the key frames to build the point cloud of the current frame and add semantic information. Finally, the mapping module uses the pose information to join the point clouds from keyframes, and if a looping is detected, it performs global pose optimization and outputs a point cloud map and pose information with semantic information.

Tracking Module
The initialization performance of vSLAM has a great impact on the subsequent pose estimation and mapping [18]. Conventional RGB-D SLAM usually adopts the first frame k 1 as the key frame after the initialization condition is satisfied. The local map is initialized by projecting k_1 into the 3D space based on the depth information acquired by the camera. Then, Perspective-n-Point (PnP) is used to calculate the matching relationship between the feature points of the current frame k 2 and the local map points of k 1 to complete the initial pose. However, when vSLAM initializes in a dynamic environment using traditional methods, objects that are moving in the camera's field of view are not only appended to the local map, but feature points located on dynamic objects in two adjacent frames can also significantly affect inter-frame matching. Therefore, we propose a strategy for vSLAM initialization in dynamic environments that can robustly and efficiently initialize local map and pose. First, the ORB feature points of k 1 are extracted. At the same time, the semantic segmentation function in the semantic segmentation module (Section 3.4) is used to obtain the semantic information of dynamic objects (e.g., pedestrians) in k 1 . The semantic information will be used to eliminate the key points located on dynamic objects in k 1 . Next, the matching relationship between the key points of k 1 and k 2 is determined whether the initialization condition of ORB-SLAM2 is satisfied. If not, k 1 is released. k 2 is reinitialized as k 1 . On the contrary, the dynamic objects in k 1 that are located on the 3D local map will be eliminated. The matching relationship of key points in k 2 and k 1 is calculated according to the dynamic detection module (Section 3.3) to obtain the initial pose and eliminate the dynamic feature points in k 2 . After initialization, ORB feature extraction, dynamic feature rejection, feature matching set building and pose estimation are performed on the subsequent images.

Dynamic Detection Module
In the odometry of vSLAM, the conventional tracking module determines the interframe poses based on the matching relationship pairs of feature points in the previous frame k n−1 and the current frame k n . When objects are moving within the camera field of view, inter-frame matching is easily affected. The specific reason is that feature points located on dynamic objects usually produce incorrect matching relationships. Additionally, when the camera is in motion, the positions of stationary objects, background and dynamic objects in the image all change, which makes it more difficult to determine dynamic regions. Considering the necessity of real-time inter-frame pose estimation for odometry, we propose a method for inter-frame pose estimation based on the homography and dense optical flow method, which can solve the above problems efficiently and effectively. As shown in Figure 2, the homography is used to align k n to the moment where k n−1 is located. Then, the dynamic region in k n is obtained using the dense optical flow method, and the feature points located in the dynamic region in k n are eliminated, as shown in Figure 3. Finally, the inter-frame pose is re-estimated based on the static feature points in k n−1 and k n .
traditional methods, objects that are moving in the camera's field of view are n appended to the local map, but feature points located on dynamic objects in two a frames can also significantly affect inter-frame matching. Therefore, we propose a s for vSLAM initialization in dynamic environments that can robustly and efficien tialize local map and pose. First, the ORB feature points of 1 are extracted. At th time, the semantic segmentation function in the semantic segmentation module ( 3.4) is used to obtain the semantic information of dynamic objects (e.g., pedestrians The semantic information will be used to eliminate the key points located on d objects in 1 . Next, the matching relationship between the key points of 1 and termined whether the initialization condition of ORB-SLAM2 is satisfied. If not, leased. 2 is reinitialized as 1 . On the contrary, the dynamic objects in 1 that cated on the 3D local map will be eliminated. The matching relationship of key po

Dynamic Detection Module
In the odometry of vSLAM, the conventional tracking module determines th frame poses based on the matching relationship pairs of feature points in the pr frame −1 and the current frame . When objects are moving within the came of view, inter-frame matching is easily affected. The specific reason is that feature located on dynamic objects usually produce incorrect matching relationships. Ad ally, when the camera is in motion, the positions of stationary objects, backgrou dynamic objects in the image all change, which makes it more difficult to determ namic regions. Considering the necessity of real-time inter-frame pose estimat odometry, we propose a method for inter-frame pose estimation based on the ho phy and dense optical flow method, which can solve the above problems efficien effectively. As shown in Figure 2, the homography is used to align to the m where −1 is located. Then, the dynamic region in is obtained using the den cal flow method, and the feature points located in the dynamic region in are nated, as shown in Figure 3. Finally, the inter-frame pose is re-estimated based static feature points in −1 and .

Homography Calculation
In order to accurately determine the homology matrix between adjacent frames, the matching relationships located on dynamic objects are eliminated as much as possible when inter-frame matching is performed. When the first inter-frame matching is performed, the dynamic feature points in 1 are eliminated based on the semantic information from the dynamic semantic recognition module. When the nth inter-frame matching is performed, the dynamic feature points in 1 are eliminated in the − 1th interframe matching based on the dynamic region information. Therefore, the dynamic feature points in −1 are not involved in the calculation of the homology matrix of the current two adjacent frames. Based on this, the matching relationship is constructed using the static feature points in −1 and the feature points in . Finally, the RANSAC algorithm is used to obtain the results of robust homology matrix between adjacent frames −1 and .

Dynamic Region Detection
The dense pyramidal optical flow method is applied to detect dynamic regions in images. By building a dense two-dimensional optical flow field on the image, richer motion information in the image is obtained. In motion scenes, if the speed of dynamic objects is too fast, the difference between the two images is more obvious, and the single-layer optical flow can easily reach the local optimum and cannot be tracked correctly. The same image is scaled by image pyramids to obtain images at different resolutions, as shown in Figure 4. When calculating the optical flow, the result of the optical flow calculated in the previous layer is used as the initial value of the optical flow in the next layer. By calculating the optical flow in this coarse-to-fine way, the effect of object motion uncertainty on the optical flow can be effectively solved, and the results are shown in Figure 5b. After obtaining the optical flow field, the Munsell color system [34] optical flow is used to interpolate Figure 5b, and the results are shown in Figure 5c. To determine the dynamic regions in the image, a dynamic optical flow threshold θ is defined. The region where the optical flow amplitude is higher than is considered as a moving region. On the contrary, the optical flow amplitude in the static region is lower than . The optical flow fields are binarized using , as shown in Figure 5d. The optical flow amplitude is defined according to Formula (1).
where represents the optical flow amplitude, represents the horizontal optical flow value, and represents the vertical optical flow value.

Homography Calculation
In order to accurately determine the homology matrix between adjacent frames, the matching relationships located on dynamic objects are eliminated as much as possible when inter-frame matching is performed. When the first inter-frame matching is performed, the dynamic feature points in k 1 are eliminated based on the semantic information from the dynamic semantic recognition module. When the nth inter-frame matching is performed, the dynamic feature points in k 1 are eliminated in the n − 1th inter-frame matching based on the dynamic region information. Therefore, the dynamic feature points in k n−1 are not involved in the calculation of the homology matrix of the current two adjacent frames. Based on this, the matching relationship is constructed using the static feature points in k n−1 and the feature points in k n . Finally, the RANSAC algorithm is used to obtain the results of robust homology matrix between adjacent frames k n−1 and k n .

Dynamic Region Detection
The dense pyramidal optical flow method is applied to detect dynamic regions in images. By building a dense two-dimensional optical flow field on the image, richer motion information in the image is obtained. In motion scenes, if the speed of dynamic objects is too fast, the difference between the two images is more obvious, and the single-layer optical flow can easily reach the local optimum and cannot be tracked correctly. The same image is scaled by image pyramids to obtain images at different resolutions, as shown in Figure 4. When calculating the optical flow, the result of the optical flow calculated in the previous layer is used as the initial value of the optical flow in the next layer. By calculating the optical flow in this coarse-to-fine way, the effect of object motion uncertainty on the optical flow can be effectively solved, and the results are shown in Figure 5b. After obtaining the optical flow field, the Munsell color system [34] optical flow is used to interpolate Figure 5b, and the results are shown in Figure 5c. To determine the dynamic regions in the image, a dynamic optical flow threshold θ is defined. The region where the optical flow amplitude is higher than θ is considered as a moving region. On the contrary, the optical flow amplitude in the static region is lower than θ. The optical flow fields are binarized using θ, as shown in Figure 5d. The optical flow amplitude is defined according to Formula (1).
where f represents the optical flow amplitude, f u represents the horizontal optical flow value, and f v represents the vertical optical flow value.

Semantic Segmentation Module
Conventional SLAM mapping module can obtain a complete 3D map of the surrounding environment. However, it cannot obtain semantic information from the map, which makes it difficult to perform high-level semantic tasks. At the same time, loopback detection and global map optimization will be affected when there are dynamic objects in the local map. Therefore, we add a separate semantic segmentation thread to semantically segment the keyframes. In addition, a method is proposed to optimize semantic segmentation based on depth images. Finally, the dynamic region detection of the tracking module is combined to eliminate the dynamic objects.

BiSeNet V2
In the semantic segmentation thread, the BiSeNet V2 network [35] is used to perform real-time semantic segmentation of the original image to obtain the semantic information of the image. It can meet the need of real-time operation while ensuring relatively accurate

Semantic Segmentation Module
Conventional SLAM mapping module can obtain a complete 3D map of the surrounding environment. However, it cannot obtain semantic information from the map, which makes it difficult to perform high-level semantic tasks. At the same time, loopback detection and global map optimization will be affected when there are dynamic objects in the local map. Therefore, we add a separate semantic segmentation thread to semantically segment the keyframes. In addition, a method is proposed to optimize semantic segmentation based on depth images. Finally, the dynamic region detection of the tracking module is combined to eliminate the dynamic objects.

BiSeNet V2
In the semantic segmentation thread, the BiSeNet V2 network [35] is used to perform real-time semantic segmentation of the original image to obtain the semantic information of the image. It can meet the need of real-time operation while ensuring relatively accurate

Semantic Segmentation Module
Conventional SLAM mapping module can obtain a complete 3D map of the surrounding environment. However, it cannot obtain semantic information from the map, which makes it difficult to perform high-level semantic tasks. At the same time, loopback detection and global map optimization will be affected when there are dynamic objects in the local map. Therefore, we add a separate semantic segmentation thread to semantically segment the keyframes. In addition, a method is proposed to optimize semantic segmentation based on depth images. Finally, the dynamic region detection of the tracking module is combined to eliminate the dynamic objects.

BiSeNet V2
In the semantic segmentation thread, the BiSeNet V2 network [35] is used to perform real-time semantic segmentation of the original image to obtain the semantic information of the image. It can meet the need of real-time operation while ensuring relatively accurate Appl. Sci. 2023, 13, 8790 8 of 18 detection accuracy. To apply the BiSeNet V2 network to VSLAM in indoor environments, the network was trained on the Microsoft Common Objects in Context [36] (MS COCO) dataset. The trained semantic segmentation model was used to perform semantic segmentation experiments in dynamic environment datasets and real scenes. As shown in Figure 6, the BiSeNet V2 semantic segmentation network can effectively perform the semantic segmentation task and obtain the semantic information of the images in the indoor dynamic environment. The trained semantic segmentation model was used to perform semantic segmentation experiments in dynamic environment datasets and real scenes. As shown in Figure  6, the BiSeNet V2 semantic segmentation network can effectively perform the semantic segmentation task and obtain the semantic information of the images in the indoor dynamic environment.
(a) Original image (b) Semantic segmentation results Figure 6. Dynamic area detection. The effect of semantic segmentation of BiSeNet V2 network. In (a), the first row is the publicly available data in the TUM RGB-D dataset [25], and the second row is the data of the actual dynamic scene collected. The effect of BiSeNet V2 on the semantic segmentation of the image is shown in (b). In this figure, light blue indicates pedestrians, dark blue indicates monitors, red indicates books, and pink indicates chairs.

Refine Segmentation
The depth camera can provide color and depth images. The two-dimensional image and semantic information of each key frame can be mapped to a three-dimensional point cloud based on the camera internal parameters, as shown in Figure 7. However, due to the over-segmentation problem in obtaining the semantic information of the image, the semantic information in two dimensions is mapped to the three-dimensional point cloud with large deviations, as shown in Figure 7c. Therefore, a region growing algorithm with depth information is given to optimize 3D semantic annotation, as shown in Table 1. As shown in Figure 8, the accuracy of optimized 3D semantic annotation by the algorithm in Table 1 has significantly improved.

Refine Segmentation
The depth camera can provide color and depth images. The two-dimensional image and semantic information of each key frame can be mapped to a three-dimensional point cloud based on the camera internal parameters, as shown in Figure 7. However, due to the over-segmentation problem in obtaining the semantic information of the image, the semantic information in two dimensions is mapped to the three-dimensional point cloud with large deviations, as shown in Figure 7c. Therefore, a region growing algorithm with depth information is given to optimize 3D semantic annotation, as shown in Table 1. As shown in Figure 8, the accuracy of optimized 3D semantic annotation by the algorithm in Table 1 has significantly improved.
where p, q is 0 or 1; I(u, v) represents the gray value at pixel (u, v), m pq represents the geometric moment of the contour of the mask. P center represents the mask center point.
cloud based on the camera internal parameters, as shown in Figure 7. However, due to the over-segmentation problem in obtaining the semantic information of the image, the semantic information in two dimensions is mapped to the three-dimensional point cloud with large deviations, as shown in Figure 7c. Therefore, a region growing algorithm with depth information is given to optimize 3D semantic annotation, as shown in Table 1. As shown in Figure 8, the accuracy of optimized 3D semantic annotation by the algorithm in Table 1 has significantly improved.

Main Steps Operation Content
Step 1 Select a semantic mask from the semantic segmentation results of the image by the BiSeNet V2 network. After the initial seed point (u 0 , v 0 ) is determined by Equation (2), the corresponding 3D coordinates (x 0 , y 0 , z 0 ) of the seed point are determined according to the camera internal reference. Add the three-dimensional coordinates (x 0 , y 0 , z 0 ) to the semantic point cloud set P.
Step 2 Taking the seed point (u 0 , v 0 ) as the center. In the range of semantic mask, search the spatial coordinates (x i , y i , z i ) that correspond to the eight-pixel points (u i , v i ) around the seed point. Calculate the Euclidean distance between (x i , y i , z i ) and the spatial coordinates of the seed point (x 0 , y 0 , z 0 ). If the distance is less than the threshold T, (u i , v i ) is pressed into the stack and (x i , y i , z i ) is added to P.
Step 3 Take a pixel from the stack as a new seed point (u 0 , v 0 ), and return to step 2. When the stack is empty, the regional growth of the current semantic mask ends, and the point cloud set P is returned as the optimized result.
Step 4 Return to Step 1, optimize the next semantic mask, and end the algorithm after all masks are completed.
In step 2 of the algorithm in Table 1, the Euclidean distance threshold T needs to be determined. The fixed threshold value is only applicable to a certain class of cases. Therefore, we give a method to dynamically select the Euclidean distance threshold T of the by random sampling within the mask.
First, i pixel points in the semantic mask are randomly selected, and at the same time, the adjacent eight pixels of each random pixel point are also selected. The average Euclidean distance of the corresponding spatial coordinates between random pixel points and their adjacent pixels points is calculated by Equation (3). Then, ρ m is sorted, and after eliminating the largest and smallest terms in ρ m , the average value of the Euclidean distance is calculated. Finally, the Euclidean distance threshold T in step 2 in Table 1 is calculated by Equation (7).
where ρ m represents the average value of the Euclidean distance of the spatial coordinates corresponding to the random pixel points and their adjacent pixel points; (x m , y m , z m ) represent the spatial coordinates corresponding to the random pixel points, (x m n , y m n , z m n ) represent the spatial coordinates corresponding to the adjacent 8 pixels of each random pixel point, and ρ represents the average value of Euclidean distance.
where ε represents the dynamic coefficient, and ρ represents the average value of Euclidean distance.  Select a semantic mask from the semantic segmentation results of the image by the BiSeNet V2 network. After the initial seed point ( 0 , 0 ) is determined by Equation (2), the corresponding 3D coordinates ( 0 , 0 , 0 ) of the seed point are determined according to the camera internal reference. Add the three-dimensional coordinates ( 0 , 0 , 0 ) to the semantic point cloud set .
tep 2 Taking the seed point ( 0 , 0 )as the center. In the range of semantic mask, search the spatial coordinates ( , , ) that correspond to the eight-pixel points ( , ) around the seed point. Calculate the Euclidean distance between ( , , ) and the spatial coordinates of the seed point ( 0 , 0 , 0 ). If the

Mapping Module
Aiming to remove dynamic objects from the local maps, the motion regions in the images obtained in Section 3.3 are used to determine whether the semantic masks in the BiSeNet V2 network are dynamic object masks. Specifically, the proportion ρ i of dynamic regions in each semantic mask is first calculated according to Equation (5). if ρ i is higher than the threshold ϕ, the semantic mask is determined to be dynamic, otherwise it is static, as shown in Figure 9. Then, the local three-dimensional semantic map is generated after removing the dynamic object mask areas in the color image and depth image, as shown in Figure 10. Each frame of the point cloud can be concatenated to form a global map based on the pose information. Finally, the global map is optimized to generate an accurate 3D semantic map based on loopback detection and global BA in ORB-SLAM2.
where M i represents the number of pixels of one of semantic mask in the image and F i represents the number of pixels in the dynamic region in M i .  After constructing a local map containing semantic information for the key frames, the global map is joined by the camera pose. Let the 3D coordinate points constructed in frame be the point set , and the local point cloud map is constructed by stitching the 3D coordinate points of a single frame. When loopback is detected, the vSLAM system optimizes the overall motion pose and outputs the global point cloud map with pose information.

Experimental Data and Evaluation Criteria
The feasibility of the proposed method was verified by testing the TUM RGB-D dataset and real scenarios using Ubuntu 18.04 on a computer (i7-9700K CPU, 16 GB RAM and Nvidia GeForce RTX 2060 GPU). The TUM RGB-D dataset consists of colour and  After constructing a local map containing semantic information for the key frame the global map is joined by the camera pose. Let the 3D coordinate points constructed frame be the point set , and the local point cloud map is constructed by stitching t 3D coordinate points of a single frame. When loopback is detected, the vSLAM syste optimizes the overall motion pose and outputs the global point cloud map with pose i formation. After constructing a local map containing semantic information for the key frames, the global map is joined by the camera pose. Let the 3D coordinate points constructed in frame i be the point set P i , and the local point cloud map is constructed by stitching the 3D coordinate points of a single frame. When loopback is detected, the vSLAM system optimizes the overall motion pose and outputs the global point cloud map with pose information.

Experimental Data and Evaluation Criteria
The feasibility of the proposed method was verified by testing the TUM RGB-D dataset and real scenarios using Ubuntu 18.04 on a computer (i7-9700K CPU, 16 GB RAM and Nvidia GeForce RTX 2060 GPU). The TUM RGB-D dataset consists of colour and depth images (640 × 480) acquired by a Microsoft Kinect sensor at a full frame rate (30 Hz). Among them, the ground-truth trajectory information of the sensor is acquired by a highprecision motion-capture system with eight high-speed tracking cameras (100 Hz). This high precision motion-capture system acquires frame-by-frame relative errors in the ground truth data which are less than the 1 mm and 0.5 deg measured by the Kinect optical centre, and the absolute error of the entire motion-capture area is less than 10 mm and 0.5 deg [37]. Finally, the five indoor dynamic scenes were selected for testing in the dataset, namely freiburg3_sitting_static, freiburg3_walking_halfsphere, freiburg3_walking_rpy, freiburg3_walking_static, and freiburg3_walking_xyz. The real scenario used Intel Re-alSense D435 to collect the data and the device parameters are shown in Table 2. In order to verify the experimental results, the results of the experiments are evaluated by the following two criteria: absolute trajectory error (ATE) and relative pose error (RPE). ATE is defined as the error between the true trajectory and the estimated trajectory of the SLAM, calculated by Equation (6). RPE is defined as the error between the true position transformation and the position transformation estimated by SLAM within a certain period of time, calculated by Equation (7).
where T gt,i . represents the true pose of the i-th frame. T esti,i represents the SLAM estimate of the i-th frame. e i represents the absolute trajectory error of the i-th frame. T esti,i+∆t represents the true pose of the i-th frame to i + ∆t-th frame. T esti,i+∆t represents the SLAM estimate of the i-th frame to i + ∆t-th frame. f i represents the relative positional error of the i-th frame to i + ∆t-th frame.

Evaluation Using TUM RGB-D Dataset
Our algorithm was compared with ORB-SLAM2 and DS-SLAM using the TUB RGB-D dataset. Tables 3 and 4 show the absolute trajectory error and relative pose error of the three algorithms in different scenarios. Compared with ORB-SLAM2, our algorithm has significantly improved. Similarly, comparing with DS-SLAM, which has dynamic object processing capability, our algorithm also has some improvement. To visualize the estimated trajectory data, the trajectory data of freiburg3_walking_halfsphere scenario and freiburg3_walking_xyz scenario are visualized as shown in Figure 11.
Real-time response is an important factor in evaluating vSLAM systems. In this part, we experimentally evaluate the average frame rate (FPS) of different systems. Table 5 shows the average FPS of each sequence used by ORB-SLAM2 and Ours pairs for the evaluation of the TUM dataset. Ours achieves an average frame rate of 30.38 FPS, while ORB-SLAM2 achieves 25.24 FPS. In addition, the BiSeNet V2 network performs semantic segmentation of images much faster than the semantic segmentation network used in DS-SLAM, so Ours outperforms other vSLAM systems that use GPU acceleration, for example DS-SLAM (13.08 FPS on fr3-w-h).

Evaluation Using Real Environment
In addition to the TUM RGB-D dataset, we also compared our algorithm with the open-source ORB-SLAM2 in a real environment. In the real environment, we add constantly moving pedestrians used to interfere with SLAM. The point cloud map of the dynamic environment constructed by ORB-SLAM2, and our algorithm is shown in Figures 12-14. Affected by the dynamic object, the pose estimation of the camera had large errors, as shown in Figure 13a. When the map was constructed, point cloud information was mapped to the wrong location. Meanwhile, the information of dynamic objects is also retained in the map, as shown in Figure 13. Compared to ORB-SLAM2, our algorithm greatly improves these drawbacks. In scenes with dynamic objects, our algorithm is able to estimate the pose and construct the map very well. In addition, the algorithm is also able to obtain semantic maps while perfectly removing dynamic objects, as shown in Figure 13a,b. To visualize the estimated trajectory data, the trajectory data of freiburg3_walk-ing_halfsphere scenario and freiburg3_walking_xyz scenario are visualized as shown in Figure 11.  form. These results indicate that the accuracy of our algorithm's bit pose estimation performs well in the high dynamic scenario.

M2 DS-SLAM
Real-time response is an important factor in evaluating vSLAM systems. In this part, we experimentally evaluate the average frame rate (FPS) of different systems. Table 5 shows the average FPS of each sequence used by ORB-SLAM2 and Ours pairs for the evaluation of the TUM dataset. Ours achieves an average frame rate of 30.38 FPS, while ORB-SLAM2 achieves 25.24 FPS. In addition, the BiSeNet V2 network performs semantic segmentation of images much faster than the semantic segmentation network used in DS-SLAM, so Ours outperforms other vSLAM systems that use GPU acceleration, for example DS-SLAM (13.08 FPS on fr3-w-h).

Evaluation Using Real Environment
In addition to the TUM RGB-D dataset, we also compared our algorithm with the open-source ORB-SLAM2 in a real environment. In the real environment, we add constantly moving pedestrians used to interfere with SLAM. The point cloud map of the dynamic environment constructed by ORB-SLAM2, and our algorithm is shown in Figures  12-14. Affected by the dynamic object, the pose estimation of the camera had large errors, as shown in Figure 13a. When the map was constructed, point cloud information was mapped to the wrong location. Meanwhile, the information of dynamic objects is also retained in the map, as shown in Figure 13. Compared to ORB-SLAM2, our algorithm greatly improves these drawbacks. In scenes with dynamic objects, our algorithm is able to estimate the pose and construct the map very well. In addition, the algorithm is also able to obtain semantic maps while perfectly removing dynamic objects, as shown in Figures 13a, b. (a) Point cloud map created by ORB-SLAM2.
(b) Point cloud map created by our algorithm.  In terms of pose accuracy, our method is higher than the ORB-SLAM2 system in all scenes; the reason for this is that our system can well remove the feature points in the im-age which have movement, thus improving the matching accuracy between images. In addition, by jointly using dynamic target and area growing algorithms, we get more ac-curate locations of dynamic points in the depth image, thus ensuring that dynamic object features are less involved in the mapping process. The test results of real scenes show that the point cloud maps created by our system are more effective than ORB-SLAM2. Our system compares to DS-SLAM in terms of pose accuracy, we use homography combined with optical flow method to determine dynamic feature points, while DS-SLAM uses fundamental matrix combined with polar line constraints. Homography and fundamental matrix produce different errors in facing different motion scenes, but the homography combined with optical flow method can acquire dynamic regions in the image, while the fundamental matrix and polar line constraint can only filter feature points. In terms of pose accuracy, our method is higher than the ORB-SLAM2 system in all scenes; the reason for this is that our system can well remove the feature points in the image which have movement, thus improving the matching accuracy between images. In addition, by jointly using dynamic target and area growing algorithms, we get more accurate locations of dynamic points in the depth image, thus ensuring that dynamic object features are less involved in the mapping process. The test results of real scenes show that the point cloud maps created by our system are more effective than ORB-SLAM2. Our system compares to DS-SLAM in terms of pose accuracy, we use homography combined with optical flow method to determine dynamic feature points, while DS-SLAM uses fundamental matrix combined with polar line constraints. Homography and fundamental matrix produce different errors in facing different motion scenes, but the homography combined with optical flow method can acquire dynamic regions in the image, while the fundamental matrix and polar line constraint can only filter feature points.
Due to the elimination of dynamic feature points, our system can achieve faster tracking with local optimization compared to ORB-SLAM2. while the semantic segmentation speed of the SegNet network of DS-SLAM is far less rapid than that of the BiSeNet V2 In terms of pose accuracy, our method is higher than the ORB-SLAM2 system in all scenes; the reason for this is that our system can well remove the feature points in the image which have movement, thus improving the matching accuracy between images. In addition, by jointly using dynamic target and area growing algorithms, we get more accurate locations of dynamic points in the depth image, thus ensuring that dynamic object features are less involved in the mapping process. The test results of real scenes show that the point cloud maps created by our system are more effective than ORB-SLAM2. Our system compares to DS-SLAM in terms of pose accuracy, we use homography combined with optical flow method to determine dynamic feature points, while DS-SLAM uses fundamental matrix combined with polar line constraints. Homography and fundamental matrix produce different errors in facing different motion scenes, but the homography combined with optical flow method can acquire dynamic regions in the image, while the fundamental matrix and polar line constraint can only filter feature points.
Due to the elimination of dynamic feature points, our system can achieve faster tracking with local optimization compared to ORB-SLAM2. while the semantic segmentation speed of the SegNet network of DS-SLAM is far less rapid than that of the BiSeNet V2 Due to the elimination of dynamic feature points, our system can achieve faster tracking with local optimization compared to ORB-SLAM2. while the semantic segmentation speed of the SegNet network of DS-SLAM is far less rapid than that of the BiSeNet V2 network, which results in a situation that the tracking module needs to wait for the semantic segmentation module, finally leading to a decrease in the overall system slow-down. Moreover, as can be seen from Figure 1, our semantic segmentation network is not directly involved in the tracking module, so there is no tracking thread waiting for the semantic segmentation results.

Conclusions
To address indoor dynamic environments, we design a semantic information based VSLAM scheme, which is improved in several aspects, such as dynamic region detection, 3D semantic annotation, and geometric segmentation of depth images. Experimental results on the TUM RGB-D dataset show that our algorithm is better than ORB-SLAM2 and DS-SLAM in terms of bit pose estimation in dynamic scenes. This shows that the VSLAM scheme with the addition of a dynamic region detection module can well improve the ability of the camera to localize and build images in dynamic environments. In addition, the experimental results of real dynamic scenes show that our algorithm can build point cloud in unknown environments that contains semantic information rich enough to ensure the downstream work. Our scheme improves on the ORB-SLAM2 algorithm with the goal of removing dynamic information from the environment. However, due to the limitation of manual feature description algorithms such as ORB, feature matches usually still exist when there is no co-vision between neighbouring frames when the camera is moving intensely, which is disastrous for position estimation and mapping. Therefore, filtering the feature matches or using a data-driven approach to build a more robust feature descriptor is worthy of in-depth research.