1. Introduction
SLAM is a key technology enabling self-awareness in intelligent mobile devices, with applications in fields such as mobile robots and drones [1,2]. While GNSS-based navigation systems have been widely adopted in open environments, their positioning accuracy severely degrades in denied environments such as indoor spaces, urban canyons, and underground tunnels due to signal occlusion or interference [3,4]. This limitation has driven significant research interest in alternative localization methods, particularly visual SLAM, which leverages the ubiquity and low cost of cameras to achieve environment-aware positioning without relying on external signals [5]. A growing number of visual SLAM systems have been developed to address localization in GNSS-denied environments, including PTAM [6], ORB-SLAM2 [7], VINS-Mono [8], and ORB-SLAM3 [9].
However, visual SLAM’s applicability is limited by its reliance on a static environment assumption, which restricts its use in real-world scenarios. Dynamic environments in real applications introduce numerous erroneous correspondences, resulting in inaccurate state estimation in visual SLAM systems [10,11]. Although algorithms that generalize SLAM systems perform well in certain scenarios, they still encounter issues when dealing with dynamic objects. In recent years, various methods for handling dynamic objects have been presented, including geometry-based, deep learning-based, and hybrid approaches.
Most existing geometry-based methods for dealing with dynamic objects rely on information collected from rigid objects and motion consistency. Kim et al. [12] proposed a dense visual odometry method that used an RGB-D camera to estimate the sensor’s ego-motion. Wang et al. [13] utilized an RGB-D algorithm for detecting moving objects in an indoor environment, achieving improved performance by taking geometric constraints and mathematical models into account. Geometry-based methods usually depend on predefined thresholds to eliminate dynamic feature points, often leading to over-detection or under-detection.
Deep learning-based methods for removing dynamic features rely on semantic information extracted from images [14]. Li et al. [15] proposed a system that used semantic segmentation and mixed information fusion to improve localization accuracy and remove dynamic feature points. Xiao et al. [16] proposed a dynamic object detection method using an SSD object detector with prior knowledge and a compensation algorithm to enhance the recall rate. They also developed a feature-based visual SLAM system that minimized pose estimation errors by selectively tracking dynamic objects. Liu et al. [17] presented a dynamic SLAM algorithm based on the ORB-SLAM3 framework, which achieved better real-time tracking accuracy in complex environments. However, the reliance on prior knowledge can lead to errors, such as mistakenly recognizing a static object as a dynamic one [18]. Simple networks may not recognize complex dynamic objects effectively, while complex networks may compromise the system’s real-time performance.
Combining geometric methods with deep learning can fully leverage the strengths of both approaches. Dynamic regions are first extracted using deep learning, and the extracted semantic information is then refined using geometric constraints. Bescos et al. [19] presented the DynaSLAM framework, an extension of ORB-SLAM2 that combines multi-view geometry and Mask R-CNN to remove dynamic points. Zhao et al. [20] proposed a workflow that enabled the accurate segmentation of objects in dynamic environments. Yang et al. [21] first identified predefined dynamic targets using object detection models and then verified the depth information of dynamic pixels through multi-view constraints. Wu et al. [22] proposed YOLO-SLAM, which filtered dynamic feature points with Darknet19-YOLOv3 and geometric constraints. Wang et al. [23] proposed an algorithm combining semantic segmentation and geometric constraints to improve SLAM accuracy in dynamic indoor environments.
Although the above-mentioned methods have achieved significant improvements in accuracy, they still have the following limitations:
- (1)
Systems that rely on semantic information often face difficulties in detecting unknown dynamic features. For example, chairs are typically labeled as static, so the system may fail to identify their motion if they are moved.
- (2)
Misclassifying objects of potentially dynamic classes that are currently stationary as dynamic can lead to excessive feature point removal and compromise data integrity. For example, a person sitting still on a chair may be mistakenly identified as a dynamic object.
- (3)
The complexity of semantic segmentation networks leads to inadequate real-time performance and reduced processing efficiency.
EMS-SLAM is an improved RGB-D SLAM algorithm that addresses the limitations of the current methods. It incorporates geometric constraints and semantic information to suppress dynamic points. In addition to adapting to complex environments, it also meets the requirements of real-time applications. The EMS-SLAM system framework is shown in Figure 1. The main contributions of this work are summarized below.
- (1)
A feature matching algorithm leveraging graph-cut RANSAC for accurately estimating the pose of cameras;
- (2)
A degeneracy-resistant geometric constraint method that combines epipolar constraints with multi-view geometric constraints, effectively resolving the degeneracy issues inherent in purely epipolar approaches;
- (3)
EMS-SLAM, a real-time SLAM system driven by an RGB-D sensor, is constructed. A new parallel thread is integrated into EMS-SLAM to extract semantic information;
- (4)
A dynamic feature suppression algorithm is proposed that combines the advantages of geometric and semantic constraints to efficiently suppress dynamic points. The proposed method can greatly enhance the system’s capability to adapt to complex environments and ensure a stable and accurate performance.
2. Method
2.1. System Framework
The widely used ORB-SLAM2 system is a high-accuracy visual SLAM system that comprises three main threads: tracking, local mapping, and loop closing. Building on this foundation, EMS-SLAM has been enhanced for dynamic environments by incorporating additional functional modules.
Figure 1 illustrates the EMS-SLAM architecture, which consists of four synchronized threads: tracking, object detection, local mapping, and loop closing.
In EMS-SLAM, RGB-D camera frames are simultaneously processed by the tracking and object detection threads during operation. Within the tracking thread, the RGB image is used to extract ORB feature points. These features are then processed using a GC-RANSAC matching method to ensure the accurate identification of inliers and outliers. The fundamental matrix between frames is computed from the inlier points and is then used as a prior for the geometric constraints. The object detection thread employs a MobileNetV3-SSD model to detect objects in the scene and filter out dynamic ones. Since object detection is computationally intensive, the tracking thread has to wait for the detection results after calculating the geometric information. To reduce this latency, a lightweight object detection model is used instead of a semantic segmentation model, which significantly reduces the processing delay.
Subsequently, the tracking thread accurately filters out dynamic points using the dynamic point suppression strategy and utilizes the remaining static points to estimate the camera’s pose. The system then transitions to the local mapping and loop closure threads. Finally, bundle adjustment optimization is carried out to improve the trajectory’s accuracy and system performance.
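As a minimal illustration of this parallel frame flow (not the authors' implementation; `detect_fn`, `extract_and_match_fn`, and `suppress_and_pose_fn` are hypothetical placeholders), the hand-off between the two threads could look as follows:

```python
# Sketch of the tracking/detection hand-off described above. The detector runs in
# its own thread; the tracker sends each frame to it early, performs feature
# extraction and geometric computation, and only then blocks on the detections.
import queue
import threading

frame_queue = queue.Queue(maxsize=1)      # RGB frames handed to the detector
detection_queue = queue.Queue(maxsize=1)  # detection results handed back

def detection_worker(detect_fn):
    """Consume frames and publish detections (detect_fn is a placeholder)."""
    while True:
        frame = frame_queue.get()
        if frame is None:                 # shutdown signal
            break
        detection_queue.put(detect_fn(frame))

def start_detection_thread(detect_fn):
    thread = threading.Thread(target=detection_worker, args=(detect_fn,), daemon=True)
    thread.start()
    return thread

def track_frame(rgb, depth, extract_and_match_fn, suppress_and_pose_fn):
    """One tracking iteration: geometry first, then wait for the semantics."""
    frame_queue.put(rgb)                          # hand the frame to the detector early
    geometry = extract_and_match_fn(rgb, depth)   # ORB features, GC-RANSAC, F matrix
    detections = detection_queue.get()            # wait only after the geometry step
    return suppress_and_pose_fn(geometry, detections, depth)
```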
2.2. Object Detection
In EMS-SLAM, we used MobileNetV3-SSD to extract semantic information from the input RGB image. The SSD algorithm, proposed by Liu et al. [24] in 2016, effectively balances detection speed and accuracy. MobileNetV3 [25], a lightweight model introduced by Google in 2019, builds upon its predecessors to achieve high detection accuracy with minimal memory usage. Because SSD’s base network, VGG16, has numerous parameters and is unsuitable for embedded platforms, we replaced it with MobileNetV3 to construct the MobileNetV3-SSD object detection model.
We selected NCNN as the inference framework for efficient inference on mobile devices and embedded platforms. NCNN’s optimizations allow the MobileNetV3-SSD model to perform object detection with low latency and high computational efficiency. The training of the model was carried out on the PASCAL VOC 2007 dataset, which ensured its generalizability across different scenarios.
2.3. Feature Matching Method Based on GC-RANSAC
In SLAM systems, the accuracy of the initial pose estimation is important [26,27]. However, in dynamic environments, moving objects can cause feature point mismatches, compromising the pose estimation accuracy. To improve the accuracy of the initial pose estimation, we propose a matching method based on the GC-RANSAC algorithm [28]. Our method separates inliers and outliers using geometric and spatial consistency to improve the feature point identification accuracy.
We employed the epipolar constraint to determine whether a pair of matching points was an inlier or an outlier. Between two consecutive frames, a pair of matching points $(p_i, p'_i)$ must satisfy the epipolar constraint ${p'_i}^{\mathrm{T}} F p_i = 0$, where $l'_i = F p_i$ is the epipolar line corresponding to $p_i$ in the other view, computed from the fundamental matrix $F$. The iterative process of the RANSAC algorithm classifies the matching point pairs into inliers or outliers according to this constraint. The classification is achieved by optimizing the following energy function:

$$E(L) = \sum_{i} E_u(L_i) + \lambda \sum_{(i, j) \in \mathcal{G}} E_p(L_i, L_j),$$

where $L_i \in \{0, 1\}$ denotes the label assignment for the matching point $p_i$, and $\mathcal{G}$ represents the neighborhood graph.
The unary term $E_u$ represents the geometric relationship, defined as:

$$E_u(L_i) = \begin{cases} 1 - K\big(d(p_i, \theta), \epsilon\big), & L_i = 1, \\ K\big(d(p_i, \theta), \epsilon\big), & L_i = 0, \end{cases}$$

and

$$K\big(d(p_i, \theta), \epsilon\big) = \exp\left(-\frac{d^2(p_i, \theta)}{2\epsilon^2}\right),$$

where $\theta$ is the parameter of the fundamental matrix $F$, $d(p_i, \theta)$ represents the distance between the fundamental matrix and the matching point, $\epsilon$ is the inlier–outlier threshold, and $K(\cdot)$ is a Gaussian kernel function. $L_i = 1$ indicates that the matching point $p_i$ is an inlier; otherwise, it is an outlier.
The pairwise term $E_p$ is used to describe the spatial consistency between neighboring feature points, defined as:

$$E_p(L_i, L_j) = \begin{cases} 0, & L_i = L_j, \\ 1, & L_i \neq L_j. \end{cases}$$

By minimizing this energy function, the GC-RANSAC algorithm effectively separates inliers and outliers [29]. We then used the inliers to estimate the fundamental matrix $F$, providing an accurate prior for the epipolar constraints and enhancing the accuracy of filtering out dynamic feature points.
Figure 2 illustrates an example of inlier selection using our proposed method, with the procedure summarized in Algorithm 1.
Algorithm 1 Feature Matching Method Based on GC-RANSAC
Input: $I_{k-1}$: Previous frame; $I_k$: Current frame; $I_r$: Reference frame.
Output: $F$: The fundamental matrix.
1: Match features between frames $I_{k-1}$ and $I_k$
2: Get static feature points $P_s$ by GC-RANSAC
3: for each static feature point $p_i$ in $P_s$ do
4:     $X_i$ ← FindCorresponding3DPoint($p_i$, $I_r$)
5: end for
6: Estimate fundamental matrix $F$ by GC-RANSAC
7: return $F$
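A minimal Python sketch of this matching step is given below. It is not the authors' code: OpenCV's `cv2.USAC_ACCURATE` flag (available in OpenCV ≥ 4.5), which performs graph-cut-based local optimization, stands in for the GC-RANSAC implementation of [28], and the function name and parameter values are ours.

```python
# Hedged sketch of the feature-matching step of Algorithm 1, using OpenCV's
# USAC framework as a stand-in for GC-RANSAC.
import cv2
import numpy as np

def estimate_fundamental_gc(img_prev, img_curr):
    """Match ORB features and robustly estimate F, returning the inlier matches."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Graph-cut-style robust estimation of the fundamental matrix; the inlier
    # mask separates consistent static matches from mismatches/dynamic points.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.USAC_ACCURATE, 1.0, 0.999)
    inliers = inlier_mask.ravel().astype(bool)
    return F, pts1[inliers], pts2[inliers]
```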
2.4. Epipolar Constraints
In dynamic environments, current networks can recognize object classes but cannot identify their motion status. Therefore, to eliminate dynamic points originating from objects of nominally static categories, we make full use of epipolar constraints to determine the motion state of feature points. The distance between each feature point and its epipolar line is calculated, and points whose distance exceeds a certain threshold are regarded as dynamic.
Figure 3 illustrates the optical centers of the two cameras, denoted as $O_1$ and $O_2$, and the epipolar lines $l_1$ and $l_2$. The spatial point $P$ corresponds to the feature points $p_1$ and $p_2$ observed in the previous and current frames, respectively:

$$p_1 = [u_1, v_1], \qquad p_2 = [u_2, v_2],$$

where $(u_1, v_1)$ and $(u_2, v_2)$ are the pixel coordinates of the feature points in the two frames. Their corresponding homogeneous coordinates are:

$$P_1 = [u_1, v_1, 1]^{\mathrm{T}}, \qquad P_2 = [u_2, v_2, 1]^{\mathrm{T}}.$$

The epipolar line $l_2$ in the current frame can be determined as follows:

$$l_2 = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = F P_1 = F \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix},$$

where $X$, $Y$, and $Z$ represent the line vector components, and $F$ represents the fundamental matrix. The epipolar constraint can be expressed as:

$$P_2^{\mathrm{T}} F P_1 = P_2^{\mathrm{T}} l_2 = 0.$$

The deviation distance $D$ of feature point $p_2$ to the epipolar line $l_2$ is defined as follows:

$$D = \frac{\left| P_2^{\mathrm{T}} F P_1 \right|}{\sqrt{X^2 + Y^2}}.$$
The spatial point can be static or dynamic. If it is static, the corresponding feature point in the current frame is $p_2$, and its deviation distance is $D$. On the other hand, if it is dynamic and moves to $P'$, the corresponding feature point is $p'_2$, and its deviation distance is $D'$.
Figure 3 shows that if a point $P$ is static, its feature point $p_2$ should lie on the epipolar line $l_2$. However, due to noise in the environment, its deviation is not always exactly zero but remains within an empirical threshold. When the point moves, the distance $D'$ between its feature point and the epipolar line can be calculated and used to determine the dynamic state of the feature.
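This check can be summarized by the short sketch below (our notation and a placeholder threshold, not the paper's values):

```python
# Minimal sketch of the epipolar check in Section 2.4: a match is flagged as
# dynamic when the point-to-epipolar-line distance D exceeds an empirical threshold.
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from p2 (current frame) to the epipolar line l2 = F @ P1."""
    P1 = np.array([p1[0], p1[1], 1.0])
    P2 = np.array([p2[0], p2[1], 1.0])
    X, Y, _ = F @ P1                       # line vector components of l2
    return abs(P2 @ F @ P1) / np.sqrt(X**2 + Y**2)

def is_dynamic_by_epipolar(F, p1, p2, threshold=1.0):
    # threshold is a placeholder empirical value, not the paper's setting
    return epipolar_distance(F, p1, p2) > threshold
```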
When point $P$ moves along the epipolar direction to $P''$, its corresponding feature point still lies on the epipolar line $l_2$. In such cases, the epipolar constraint cannot effectively identify these dynamic points. To address this issue, we introduced multi-view geometric constraints, building on the epipolar constraints, to detect the dynamic feature points that would otherwise be missed.
2.5. Multi-View Geometric Constraints
Dynamic objects cause significant changes in the angle and depth of feature points between the reference frame and the current frame. Dynamic feature points can therefore be eliminated using angular and depth information. This principle is shown in
Figure 4.
Figure 4 shows that the keyframe feature point $p_r$ in the reference frame is projected onto the current frame to obtain the feature point $p_c$ and its projected depth $d_{proj}$, generating the corresponding 3D point $P$. The measured depth $d_c$ of $p_c$ is then read from the depth image of the current frame, and the depth difference is defined as $\Delta d = d_{proj} - d_c$. The angular deviation $\beta$ is the angle between the reprojections of $p_r$ and $p_c$, which is determined using the cosine rule.
By combining the depth information and the angular deviation, a feature point is identified as dynamic under the multi-view geometric constraints when the depth difference exceeds the threshold $\tau_d$ or the angular deviation exceeds the threshold $\tau_\beta$, where $\tau_d$ and $\tau_\beta$ are the empirical depth and angular thresholds, respectively.
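A compact sketch of this test is shown below; the symbols, the world-coordinate convention, and the threshold values are our own assumptions for illustration:

```python
# Hedged sketch of the multi-view check in Section 2.5: a point is dynamic if its
# projected-vs-measured depth difference or the viewing-angle deviation exceeds
# its threshold.
import numpy as np

def is_dynamic_multiview(p_3d, cam_center_ref, cam_center_cur,
                         depth_projected, depth_measured,
                         depth_thresh=0.1, angle_thresh_deg=30.0):
    """p_3d: 3D point recovered from the reference keyframe, in world coordinates."""
    delta_d = abs(depth_projected - depth_measured)

    # Angle between the two viewing rays toward the 3D point (cosine rule).
    v_ref = p_3d - cam_center_ref
    v_cur = p_3d - cam_center_cur
    cos_beta = v_ref @ v_cur / (np.linalg.norm(v_ref) * np.linalg.norm(v_cur))
    beta = np.degrees(np.arccos(np.clip(cos_beta, -1.0, 1.0)))

    return delta_d > depth_thresh or beta > angle_thresh_deg
```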
2.6. Dynamic Feature Suppression Method
This paper proposes a dynamic feature suppression method based on geometric information, aimed at reducing the over-reliance on deep learning. First, the object category information obtained from the detection module is used as prior knowledge to distinguish between moving and non-moving objects. Since different object categories exhibit distinct motion characteristics, the system assigns each feature point a weight based on its object category, with values ranging from 0 to 1. Indoor objects were classified into three categories based on their motion status: active motion, passive motion, and absolutely static, with different weights assigned to each category. The classification details are shown in
Table 1.
Subsequently, epipolar constraints were used to calculate the distance between each feature point and its associated epipolar line, which was then combined with the feature weight for a more refined judgment. If the distance was below the empirical threshold, the point underwent multi-view geometric analysis to examine the changes in depth and angle, further determining whether it was dynamic or static. By applying these rules, the system can accurately determine the status of all feature points in the current frame. Algorithm 2 provides a detailed explanation of the suppression strategy for dynamic feature points.
Algorithm 2 Dynamic Feature Rejection Strategy
Input: $I_{k-1}$: Previous frame; $I_k$: Current frame; $P_{k-1}$: Previous frame's feature points; $P_k$: Current frame's feature points; $D_k$: Depth image; standard empirical thresholds; $F$: The fundamental matrix.
Output: $P_s$: Static feature points in the current frame.
1: ReferenceFrames ← ComputeReferenceFrames($I_k$)
2: for each matched pair ($p_{k-1}^i$, $p_k^i$) in ($P_{k-1}$, $P_k$) do
3:     if a dynamic object exists and $p_k^i$ lies in a dynamic area then
4:         if the epipolar line distance and the movement probability are within their thresholds then
5:             Compute3DPointInReferenceFrames($p_k^i$)
6:             if the depth and angle are within their thresholds then
7:                 $P_s$ ← $P_s \cup \{p_k^i\}$
8:             end if
9:         end if
10:     else
11:         if the epipolar line distance is within the threshold then
12:             Compute3DPointInReferenceFrames($p_k^i$)
13:             if the depth and angle are within their thresholds then
14:                 $P_s$ ← $P_s \cup \{p_k^i\}$
15:             end if
16:         end if
17:     end if
18: end for
19: return $P_s$
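For illustration only, the branching logic of Algorithm 2 can be condensed as follows; the movement-probability weight and the thresholds are placeholders in the spirit of Table 1, not the values used in the paper:

```python
# Illustrative sketch of Algorithm 2's decision flow for a single feature point.
def keep_feature(in_dynamic_area, move_prob, epi_dist, multiview_static,
                 epi_thresh=1.0, prob_thresh=0.5):
    """Return True if the feature point is retained as static."""
    if in_dynamic_area:
        # Points inside a detected dynamic box must show both a small epipolar
        # error and a low motion prior before the multi-view test is consulted.
        if epi_dist > epi_thresh or move_prob > prob_thresh:
            return False
    else:
        if epi_dist > epi_thresh:
            return False
    # Final confirmation by the multi-view depth/angle consistency test.
    return multiview_static
```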
3. Results
This section presents an evaluation of the performance of the system using the publicly available TUM RGB-D [30] and Bonn RGB-D [31] datasets. Ablation and comparative experiments were performed to validate the effectiveness of each module in dynamic scenes. Additionally, we tested the algorithm on self-collected data. Finally, an in-depth analysis of the system’s real-time performance was carried out. We used the relative pose error (RPE) and absolute trajectory error (ATE) to evaluate the accuracy of SLAM. The optimal results for each sequence are highlighted in bold. To account for uncertainties in object detection and feature extraction, each algorithm was run ten times per sequence, and the median value was taken as the final result. All experiments were conducted on a computer running Ubuntu 18.04, equipped with an Intel i7-9750H (2.60 GHz) CPU, an NVIDIA GeForce GTX 1660 GPU, and 16 GB of RAM.
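For reference, these metrics follow the standard TUM-benchmark definitions; a sketch in our own notation (ground-truth poses $Q_i$, estimated poses $P_i$, rigid alignment $S$, frame offset $\Delta$, and $M$ relative-pose pairs) is:

```latex
% Standard TUM-style trajectory error metrics (notation introduced here, not the paper's).
\begin{align}
  \text{ATE RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}
      \bigl\lVert \operatorname{trans}\!\bigl(Q_i^{-1}\, S\, P_i\bigr) \bigr\rVert^{2}},\\[4pt]
  \text{RPE RMSE} &= \sqrt{\frac{1}{M}\sum_{i=1}^{M}
      \bigl\lVert \operatorname{trans}\!\bigl(
        (Q_i^{-1} Q_{i+\Delta})^{-1}\,(P_i^{-1} P_{i+\Delta})
      \bigr) \bigr\rVert^{2}}.
\end{align}
```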
3.1. Evaluation on the TUM RGB-D Dataset
In 2012, the TUM group released the RGB-D dataset, which has since become a standard benchmark in the SLAM field. The dataset features various dynamic scenes, each with an accurate ground-truth trajectory obtained using a high-accuracy motion capture system. We tested EMS-SLAM on five dynamic scene sequences. Among these, the sitting sequence featured a person sitting in a chair with slight limb movement, representing a low-dynamic environment, while the walking sequences featured people walking continuously, representing a high-dynamic environment.
EMS-SLAM was compared with ORB-SLAM2 to assess the performance improvements, as shown in Figure 5 and Figure 6. As indicated by the ATE results in Table 2, EMS-SLAM achieved a 96.36% improvement in average RMSE accuracy over ORB-SLAM2 in the high-dynamic sequences, while the improvement in the low-dynamic sequences was 55.17%. This difference is due to the smaller size and limited movement of dynamic objects in the low-dynamic sequences. The RPE results in Table 3 and Table 4 indicate that EMS-SLAM also significantly improved the RMSE accuracy compared with ORB-SLAM2: in the high-dynamic sequences, the improvements were 93.07% and 86.92%, while in the low-dynamic sequences, they were 47.31% and 46.99%, respectively.
These findings highlight the system’s effectiveness in high-dynamic sequences, with limited improvement in low-dynamic scenarios due to the minimal impact of dynamic objects on SLAM. The ATE results presented in Figure 5 highlight the differences between the ground-truth and estimated trajectories; EMS-SLAM significantly improved the accuracy compared with ORB-SLAM2. Figure 6 presents the RPE results, where EMS-SLAM showed clearly reduced errors.
The proposed algorithm was further evaluated by comparing its performance with other mainstream SLAM systems, including DSLAM [32], RS-SLAM [33], DS-SLAM [34], RDS-SLAM [32], RDMO-SLAM [35], Blitz-SLAM [36], YOLO-SLAM [22], and SG-SLAM [37]. Table 5 presents the experimental results for the five dynamic scene sequences. Aside from the fr3_w_static sequence, where EMS-SLAM performed slightly worse than RDMO-SLAM, it achieved a higher average accuracy than the other algorithms across the remaining sequences.
3.2. Evaluation on the Bonn RGB-D Dataset
The University of Bonn’s Robotics and Photogrammetry Laboratory released the Bonn RGB-D Dynamic Dataset in 2019. It features 24 dynamic sequences covering various scenarios, such as walking people and moving objects, recorded with ground-truth camera trajectories from an OptiTrack Prime 13 motion capture system. Nine representative sequences were selected to evaluate the system performance. In the crowd sequence, three people walk randomly indoors, while the moving_no_box sequence captures a box being moved from the ground to a table without obstructing the line of sight. In the person_tracking sequence, the camera follows a slowly moving person, and in the synchronous sequence, two people move at the same speed and in the same direction. These highly dynamic scenes present significant challenges for traditional SLAM systems.
The EMS-SLAM system was compared with three standard SLAM systems to verify its robustness and accuracy: ORB-SLAM2, SG-SLAM, and YOLO-SLAM. The evaluation results for the nine sequences are presented in Table 6. In the crowd3 sequence, frequent mutual occlusions, overlapping actions, and changing viewpoints among the people in the scene led to false positives and missed detections, affecting the overall performance of the algorithm. In the synchronous2 sequence, the dynamic objects moved synchronously at similar speeds and in the same direction, so the motion differences relied upon by the epipolar constraints became indistinct, which in turn affected the accuracy of the motion estimation. Although EMS-SLAM performed slightly worse than the other algorithms on the crowd3 and synchronous2 sequences, it significantly outperformed them in the remaining sequences. These results not only further highlight the EMS-SLAM system’s exceptional accuracy and robustness in dynamic environments, but also showcase its generalization ability across various complex dynamic scenarios.
3.3. Effectiveness Validation of Each Module
3.3.1. Ablation Experiments
Ablation experiments were performed on the TUM RGB-D dataset to evaluate the contributions of the multi-view geometric and feature matching modules. Trajectory estimation results were compared across three configurations: EMS-SLAM without the improved feature matching module, EMS-SLAM without the multi-view geometric module, and the complete EMS-SLAM system.
The results of the ablation experiments are presented in detail in Table 7. The complete system’s performance in the highly dynamic sequences was 36.37% higher than that of the configuration without GC-RANSAC, whereas in the low-dynamic sequences it was 30.23% lower. Compared with the system without the multi-view geometric algorithm, the complete system achieved an average improvement of 24.68% in the highly dynamic sequences and 10.23% in the low-dynamic sequences. The main reason is that the smaller dynamic areas and motion magnitudes reduce the impact of the geometric constraints in low-dynamic sequences.
3.3.2. Comparative Experiments
Comparative experiments were conducted to evaluate the effectiveness of the dynamic feature suppression strategy. Three different approaches were compared: a semantic-only algorithm, a geometric-only algorithm, and the combined algorithm, as listed below. The results, shown in Figure 7, illustrate the effectiveness of the different approaches in detecting dynamic points.
- (1)
EMS-SLAM (S): Removes dynamic feature points only using semantic information.
- (2)
EMS-SLAM (G): Removes dynamic feature points only using geometric information.
- (3)
EMS-SLAM (S+G): Removes dynamic feature points by combining semantic and geometric information.
The results, shown in Figure 7 and Table 8, demonstrate the comparative performance of each method. ORB-SLAM2, which does not handle dynamic features, indiscriminately extracted feature points in dynamic areas. EMS-SLAM (S), using only semantic information, misclassified many static feature points as dynamic, such as those on a computer monitor. EMS-SLAM (G), relying only on geometric information, still misidentified some dynamic features as static, such as those on the knee of a seated person. In contrast, EMS-SLAM (S+G), which combines semantic and geometric information, effectively removed all dynamic features associated with moving objects while retaining the static features, outperforming both of the other configurations.
The two algorithms based on a single source of information had different strengths and weaknesses across the sequences, whereas the combined algorithm delivered the best results in most test sequences. In the fr3_w_rpy sequence, however, complex camera movements blurred the dynamic objects, and the geometric algorithm achieved significantly lower errors than the semantic and combined algorithms. Based on the experimental results in Table 8, in high-dynamic scenes, the combination of semantic and geometric methods improved the RMSE accuracy by 79.17% and 8.70% compared with the semantic-only and geometric-only methods, respectively. In low-dynamic scenes, the combined method yielded results that were nearly identical to those obtained using the semantic or geometric method alone. This suggests that the integration of semantic and geometric information effectively compensates for the limitations of each method, thereby enhancing the overall performance of the system.
3.4. Evaluation Using Our Collected Dataset
EMS-SLAM was tested on the dynamic scene data collected with a Kinect v2 camera to validate its performance in complex dynamic scenarios. The data, collected in a laboratory setting with people walking, included two dynamic environments named sdjz_w_static_01 and sdjz_w_static_02. The datasets contained color and depth images at a resolution of 640 × 480. The trajectories of the EMS-SLAM and ORB-SLAM2 were evaluated by keeping the Kinect v2 stationary while collecting data. As a result, the Kinect v2 sensor’s actual trajectory was a fixed point in space, and the error was computed as the distance between its estimated and actual trajectory.
Figure 8 shows the feature points extracted by ORB-SLAM2 and EMS-SLAM for the two sequences. Compared with ORB-SLAM2, EMS-SLAM effectively removed all of the feature points on dynamic objects. As shown in Table 9, the RMSE accuracy of the ATE improved by 94.45% and 96.25% for the sdjz_w_static_01 and sdjz_w_static_02 sequences, respectively.
3.5. Timing Analysis
The performance of EMS-SLAM was compared with that of other systems by measuring the average processing time per frame on the TUM dataset, as shown in Table 10. Systems such as DS-SLAM, YS-SLAM [38], and RDS-SLAM utilize pixel-level semantic segmentation networks, significantly increasing their average frame processing time. While YOLO-SLAM employs an object detection algorithm, its real-time performance is limited by hardware constraints. In contrast, EMS-SLAM uses a lightweight object detection network, greatly enhancing the inference efficiency and processing speed on resource-constrained devices. The average frame processing time for EMS-SLAM increased by only 6.22 milliseconds compared with ORB-SLAM2, fully satisfying the real-time requirements of SLAM.
4. Discussion
The EMS-SLAM algorithm proposed in this paper has made significant progress in the field of real-time SLAM, particularly in dynamic environments. By integrating a feature matching method based on GC-RANSAC, we enhanced the initial camera pose estimation, which is crucial for achieving accurate localization in complex scenarios. Additionally, our degeneracy-resistant geometric constraint method, which combines epipolar constraints with multi-view geometric constraints, further strengthens the robustness of the algorithm, addressing common limitations faced by traditional SLAM methods. The inclusion of a parallel object detection thread ensures that the system can handle dynamic objects in real-time, making it suitable for a wide range of practical applications.
However, during the experiments, we observed that the performance of EMS-SLAM may be compromised when images are blurred, particularly in the extraction of semantic information. This issue is particularly pronounced when the camera experiences significant motion or operates in low-light conditions, both of which are common challenges in real-world applications. In these scenarios, feature extraction and matching become unreliable, potentially leading to reduced accuracy in the final pose estimation.
Future improvements will focus on addressing this limitation by enhancing the semantic information extraction process. This could involve integrating more advanced image processing techniques, such as deep learning-based feature extraction methods, which are better equipped to handle blurry or low-quality images.