Article

MSGS-SLAM: Monocular Semantic Gaussian Splatting SLAM

1 Houston International Institute, Dalian Maritime University, Dalian 116026, China
2 College of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China
3 Information Science and Technology College, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(9), 1576; https://doi.org/10.3390/sym17091576
Submission received: 12 July 2025 / Revised: 16 August 2025 / Accepted: 10 September 2025 / Published: 20 September 2025
(This article belongs to the Section Engineering and Materials)

Abstract

With the iterative evolution of SLAM (Simultaneous Localization and Mapping) technology in the robotics domain, the SLAM paradigm based on three-dimensional Gaussian distribution models has emerged as the current state-of-the-art technical approach. This research proposes a novel MSGS-SLAM system (Monocular Semantic Gaussian Splatting SLAM), which innovatively integrates monocular vision with three-dimensional Gaussian distribution models within a semantic SLAM framework. Our approach exploits the inherent spherical symmetries of isotropic Gaussian distributions, enabling symmetric optimization processes that maintain computational efficiency while preserving geometric consistency. Current mainstream three-dimensional Gaussian semantic SLAM systems typically rely on depth sensors for map reconstruction and semantic segmentation, which not only significantly increases hardware costs but also limits the deployment potential of systems in diverse scenarios. To overcome this limitation, this research introduces a depth estimation proxy framework based on Metric3D-V2, which effectively addresses the inherent deficiency of monocular vision systems in depth information acquisition. Additionally, our method leverages architectural symmetries in indoor environments to enhance semantic understanding through symmetric feature matching. Through this approach, the system achieves robust and efficient semantic feature integration and optimization without relying on dedicated depth sensors, thereby substantially reducing the dependency of three-dimensional Gaussian semantic SLAM systems on depth sensors and expanding their application scope. Furthermore, this research proposes a keyframe selection algorithm based on semantic guidance and proxy depth collaborative mechanisms, which effectively suppresses pose drift errors accumulated during long-term system operation, thereby achieving robust global loop closure correction. Through systematic evaluation on multiple standard datasets, MSGS-SLAM achieves comparable technical performance to existing three-dimensional Gaussian model-based semantic SLAM systems across multiple key performance metrics including ATE RMSE, PSNR, and mIoU.

1. Introduction

Visual Simultaneous Localization and Mapping (SLAM) [1] is a pivotal research topic in computer vision, aiming to construct a three-dimensional map of an unknown environment while simultaneously tracking the camera pose in real time. Traditional SLAM systems [2,3,4] have predominantly treated map construction as the core problem, which fundamentally shapes the design of all processing modules within a SLAM pipeline. However, existing work primarily employs sparse representations such as voxels and point clouds, which fall short for dense map representation. Moreover, these approaches fail to exploit the geometric symmetries inherent in real-world environments, such as bilateral and rotational symmetries, leading to suboptimal use of spatial regularity.
Under the impetus of Neural Radiance Fields (NeRF) technology [5], NeRF-based SLAM methods have achieved significant progress in acquiring dense geometric information and constructing high-quality map representations [6]. These methods effectively capture dense photometric information about environments through differentiable rendering techniques, not only generating highly realistic three-dimensional global maps but also demonstrating good robustness against noise and outliers. However, NeRF-SLAM methods employing Multi-Layer Perceptrons (MLPs) [7] as implicit neural representations of scenes exhibit significant limitations [8]. On one hand, they suffer from over-smoothing effects at object boundaries, leading to geometric detail loss. This over-smoothing particularly disrupts preservation of local symmetries crucial for accurate scene understanding, as implicit continuous representation tends to blur symmetric structures. This limitation results in the lack of effective decoupling mechanisms for scene object representation, constraining semantic segmentation and scene editing capabilities. On the other hand, when confronted with extended scenes, they exhibit catastrophic forgetting phenomena, leading to substantial degradation in representation accuracy.
Against this backdrop, our exploration turns toward 3D Gaussian-based semantic SLAM systems [9,10,11,12]. This approach brings multiple significant advantages to SLAM systems: first, its efficient rasterization mechanism substantially improves rendering speed by eliminating view-dependency and adopting simplified spherical Gaussian distributions (rather than complex ellipsoidal forms), achieving real-time dense photometric optimization. Second, this representation possesses linear scaling capabilities, supporting arbitrary increases in map capacity and local editing while maintaining overall rendering quality. Additionally, semantic maps significantly enhance the system’s scene understanding capabilities, while this architectural design facilitates the integration of multimodal semantic information, enabling appearance, geometry, and semantic signals to collaboratively promote camera tracking and scene reconstruction, better meeting the requirements of robotic navigation and mixed reality applications. Moreover, we introduce Metric3D-V2 [13], a state-of-the-art monocular depth estimation model, to provide proxy depth information, further expanding the application scope and research value of this method in the robotics domain. This innovative integration enables real-time rendering of appearance, semantic colors, and proxy depth.
Compared to 3D Gaussian-based semantic RGB-D SLAM systems (such as SGS-SLAM), the system proposed in this research demonstrates comparable performance in key indicators including rendering efficiency, reconstruction quality, and semantic segmentation accuracy. This method significantly enhances the system’s depth reasoning robustness under heterogeneous material surfaces (such as transparent or highly reflective media) and complex lighting conditions, effectively overcoming depth acquisition distortion problems of traditional RGB-D sensors at special optical interfaces, thereby achieving stable scene understanding and reconstruction capabilities in variable environments. Furthermore, the multimodal integration strategy that fuses proxy depth information with explicit spatial representation and semantic features can significantly improve camera pose estimation accuracy and convergence efficiency. More importantly, we combine proxy depth, geometric, and semantic criteria, employing multi-level adjustment methods for keyframe selection, a process based on the identification of previously observed objects in the trajectory. Through extensive experiments in synthetic and real scene benchmarks, we comprehensively compare our method with implicit NeRF-based methods and novel 3D Gaussian-based semantic RGB-D methods, systematically evaluating performance in map construction, tracking, and semantic segmentation.
Overall, our work presents several key contributions, summarized as follows:
We propose MSGS-SLAM, a monocular semantic SLAM system based on 3D Gaussian splatting. The system integrates Metric3D-V2 [13] for proxy depth estimation to address scale ambiguity, while utilizing four-dimensional semantic Gaussian representation for unified geometric, appearance, and semantic optimization. This approach achieves high-fidelity reconstruction comparable to RGB-D methods while requiring only monocular input.
We develop innovative technical components including uncertainty-aware proxy depth fusion, semantic-guided keyframe selection, and multi-level loop detection mechanisms. The system employs adaptive optimization strategies that leverage appearance, geometric, and semantic information to ensure robust tracking and global consistency correction.
Our monocular approach eliminates the requirement for specialized depth sensors (e.g., LiDAR, structured light sensors) that typically cost $200–2000 compared to standard RGB cameras ($10–100), enabling deployment with ubiquitous consumer devices, existing video datasets, and resource-constrained environments. This fundamental hardware simplification expands applicability to historical data analysis, large-scale urban modeling, and scenarios where RGB-D sensors are impractical due to cost, power, or environmental constraints.
The paper is organized as follows: Section 2 reviews related work in NeRF-based SLAM, 3D Gaussian splatting, and semantic SLAM. Section 3 presents our methodology including proxy depth fusion, semantic Gaussian representation, and loop detection mechanisms. Section 4 provides comprehensive experimental evaluation and ablation studies. Section 5 concludes with discussions and future directions.

2. Related Work

2.1. NeRF-Based SLAM

Neural Radiance Fields (NeRF) have significantly advanced dense SLAM systems through their ability to model complex scenes with implicit neural representations. iNeRF [14] pioneered pose estimation by inverting neural radiance fields, establishing the foundation for NeRF-based SLAM. iMAP [15] introduced the first real-time system using NeRF as the sole map representation, though limited to 2 Hz mapping rates. Despite enabling dense reconstruction, iMAP’s computational bottleneck stemmed from per-frame neural network optimization, severely limiting scalability to larger environments. NICE-SLAM [16] addressed scalability through hierarchical scene representations, introducing multi-resolution voxel grids that improved efficiency but still suffered from memory consumption issues in extended sequences. Co-SLAM [17] improved performance via joint coordinate and sparse parametric encodings. Co-SLAM’s key contribution was decoupling coordinate and color learning, improving convergence speed but not fundamentally addressing over-smoothing at object boundaries. More recent methods include NICER-SLAM [18] for dense RGB SLAM and Point-SLAM [19] utilizing neural point clouds.
Despite these advances, NeRF-based methods face inherent limitations: over-smoothing at object boundaries, catastrophic forgetting in extended scenes, and computational inefficiency due to MLP architectures. The implicit representation’s continuous nature makes it particularly prone to losing fine geometric details at material boundaries, which are crucial for accurate semantic understanding and scene interpretation. These constraints limit real-time applicability in robotics applications.

2.2. 3D Gaussian Splatting SLAM

3D Gaussian Splatting (3DGS) [20] offers superior rendering efficiency through explicit scene representation. The key advantage lies in its differentiable rasterization process, avoiding expensive volumetric sampling required by NeRF methods. MonoGS [21] demonstrated the first monocular 3DGS SLAM, achieving real-time performance at 3 fps with high-fidelity reconstruction. However, MonoGS relied on traditional depth estimation techniques that suffer from significant scale drift in long sequences, limiting practical deployment in extended navigation scenarios. SplaTAM [22] extended this to RGB-D settings, delivering up to 2× performance improvements over existing methods. SplaTAM’s success was attributed to accurate RGB-D depth information, but this hardware requirement significantly increases system cost and limits deployment flexibility. GS-SLAM [23] contributed adaptive expansion strategies and coarse-to-fine tracking. While introducing effective Gaussian densification mechanisms, it lacked semantic awareness, making it unsuitable for applications requiring scene understanding. Recent developments include MM3DGS SLAM [24] for multi-modal fusion and various efficiency-focused variants.
However, existing Gaussian-based SLAM systems lack semantic understanding capabilities, limiting their application in scene interpretation and object-level tasks. This limitation is particularly problematic for robotic applications requiring environment understanding beyond geometric reconstruction, such as navigation planning around specific object categories.

2.3. Semantic SLAM

Traditional semantic SLAM systems [25] integrate semantic information into conventional frameworks. Early approaches like SLAM++ [26] and Kimera [27] focused on object-aware mapping with explicit representations, but suffered from storage limitations and reconstruction quality issues. SLAM++ represented scenes as collections of object models, providing semantic understanding, but also being limited to predefined categories and struggling with novel geometries. Kimera improved with mesh-based reconstruction, yet explicit representation inherently limited scalability and required significant memory resources.
Recent neural implicit methods have shown promise. SNI-SLAM [28] introduced hierarchical semantic representation with cross-attention for multi-modal feature fusion. SNI-SLAM’s contribution was integrating semantic and geometric features through attention mechanisms, enabling robust tracking. However, its NeRF-based implicit representation inherited computational bottlenecks and over-smoothing issues, affecting real-time performance. SGS-SLAM [29] pioneered semantic Gaussian Splatting, incorporating appearance, geometry, and semantic features through multi-channel optimization, achieving state-of-the-art performance in semantic segmentation [30] while maintaining real-time capabilities. SGS-SLAM demonstrated superior boundary preservation compared to implicit methods but remained dependent on RGB-D sensors, limiting applicability in scenarios where depth sensors are unavailable or unreliable.

2.4. Depth Estimation for Monocular SLAM

Monocular SLAM faces the fundamental challenge of scale ambiguity. Traditional solutions rely on bundle adjustment [31] and geometric constraints, while recent learning-based depth estimation methods offer new opportunities. Classical approaches like structure-from-motion require sufficient parallax and texture for reliable depth triangulation, often failing in texture-sparse environments. Recent deep learning methods such as MiDaS [32] produce relative depth maps lacking the metric scale consistency required for accurate SLAM applications. However, most existing approaches either sacrifice accuracy for real-time performance or require complex multi-stage processing. Methods like MonoDepth2 [33] achieve real-time inference but produce inconsistent depth estimates across frames, leading to accumulated drift, while accuracy-focused methods require computationally prohibitive post-processing.
Our MSGS-SLAM addresses these limitations by integrating proxy depth estimation with semantic understanding in a unified 3D Gaussian framework, achieving robust monocular SLAM without compromising real-time performance. Specifically, our approach leverages Metric3D-V2’s metric-scale depth estimates with uncertainty quantification to handle unreliable regions, combining monocular scalability with RGB-D-level reconstruction quality.

3. Method

The overall architecture of the MSGS-SLAM system is shown in Figure 1, comprising four key modules: proxy depth estimation, four-dimensional Gaussian representation, multi-channel tracking and mapping, and semantic-guided loop detection. Unlike traditional RGB-D-based semantic SLAM systems, MSGS-SLAM innovatively integrates monocular vision with four-dimensional Gaussian distribution models, converting monocular camera observations into depth-informed representations through proxy depth estimation, thereby overcoming the inherent scale ambiguity problem of monocular SLAM systems.

3.1. Proxy Depth Fusion

The fundamental challenge of monocular visual SLAM lies in the inability of a monocular camera to directly provide depth information, resulting in scale ambiguity and geometric uncertainty in the reconstructed scene. To address this bottleneck, one of the core innovations of MSGS-SLAM is a depth estimation proxy framework based on Metric3D-V2 [13]. Compared to traditional RGB-D sensors and other proxy depth frameworks, this architecture offers significant advantages in monocular settings: first, the model provides zero-shot depth estimation without requiring depth ground truth from the target scenes; second, it maintains a consistent metric scale, resolving the inherent scale ambiguity of monocular SLAM; and third, it runs fast enough to meet the requirements of real-time SLAM applications. Furthermore, our depth estimation approach leverages symmetric consistency constraints, exploiting the bilateral and reflective symmetries commonly found in architectural structures to enhance depth prediction accuracy and reduce estimation variance in symmetric regions. For each monocular image frame $I_t$, we use the pre-trained Metric3D-V2 model to estimate the depth map $\hat{D}_t$:
$$\hat{D}_t = \mathrm{Metric3D\text{-}V2}(I_t)$$
Unlike depth measurements provided by traditional RGB-D sensors, proxy depth carries inherent uncertainty. To address this, we introduce an uncertainty quantification mechanism $U(\hat{D}_t)$ that estimates the reliability of the depth values:
$$U(\hat{D}_t) = \exp\!\left(\alpha \cdot \big\|\nabla \hat{D}_t\big\|_2\right) \cdot \beta \cdot \mathrm{Conf}(\hat{D}_t) \cdot \gamma \cdot \theta(S_t)$$
where $\|\nabla \hat{D}_t\|_2$ is the depth gradient norm, capturing depth-discontinuity regions (which typically carry higher uncertainty); $\mathrm{Conf}(\hat{D}_t)$ is the confidence map output by Metric3D-V2; and $\theta(S_t)$ is a semantic segmentation accuracy modulation function, defined as:
$$\theta(S_t) = \frac{1}{1 + \exp\!\big(-\lambda\,(\mathrm{mIoU}(S_t) - \phi)\big)}$$
The adjustment coefficients $\alpha$, $\beta$, and $\gamma$ balance the different uncertainty sources: $\alpha$ controls sensitivity to geometric discontinuities (recommended range 0.5–1.0), $\beta$ weights the depth estimator's internal confidence (typically 0.7–0.9), and $\gamma$ incorporates semantic consistency (usually 0.3–0.7). For environments with challenging depth estimation conditions, increasing $\alpha$ and $\gamma$ while reducing $\beta$ provides a more robust uncertainty assessment.
This function adopts a sigmoid form, where $\mathrm{mIoU}(S_t)$ denotes the mean Intersection over Union of the current frame's semantic segmentation, $\phi$ is the threshold parameter, and $\lambda$ controls the curvature of the function. This uncertainty quantification provides important guidance for subsequent scene reconstruction and optimization.
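For illustration, the following minimal NumPy sketch computes the per-pixel uncertainty map defined above. It assumes the confidence map is taken directly from Metric3D-V2's output and that a frame-level mIoU estimate is available; the defaults for $\alpha$, $\beta$, $\gamma$ follow Section 4.1, while the values of $\lambda$ and $\phi$ and all variable names are illustrative assumptions rather than the paper's released implementation.

```python
import numpy as np

def depth_uncertainty(depth, conf, frame_miou,
                      alpha=0.7, beta=0.8, gamma=0.5, lam=10.0, phi=0.6):
    """Per-pixel uncertainty U(D_hat) combining depth gradients, the
    estimator's confidence map, and a frame-level semantic modulation term."""
    # Depth gradient norm: discontinuities typically carry higher uncertainty.
    gy, gx = np.gradient(depth)
    grad_norm = np.sqrt(gx ** 2 + gy ** 2)
    # Sigmoid modulation by the current frame's semantic segmentation quality.
    theta = 1.0 / (1.0 + np.exp(-lam * (frame_miou - phi)))
    # Product form, following the equation above.
    return np.exp(alpha * grad_norm) * (beta * conf) * (gamma * theta)

# Toy example with synthetic inputs (160x120 frame).
depth = np.random.rand(120, 160) * 5.0   # proxy depth map in metres
conf = np.random.rand(120, 160)          # Metric3D-V2 confidence map
U = depth_uncertainty(depth, conf, frame_miou=0.72)
```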

3.2. Four-Dimensional Semantic Gaussian Representation and Multi-Channel Rendering

Traditional monocular SLAM systems, while advantageous in terms of hardware cost and deployment convenience, are fundamentally limited by their lack of scene understanding and semantic perception capabilities, severely restricting their potential in practical applications such as intelligent navigation, scene interaction, and complex environment adaptation. In MSGS-SLAM, we propose a four-dimensional semantic Gaussian representation model that explicitly represents scenes through a set of three-dimensional Gaussian functions and innovatively integrates semantic information as a fourth-dimensional attribute of the Gaussian representation. Each Gaussian distribution is modeled through an influence function $f(\cdot)$. For computational simplicity, we adopt isotropic Gaussian distributions:
$$f^{3D}(x) = \sigma \exp\!\left(-\frac{\|x - \mu\|^2}{2r^2}\right)$$
where $\sigma \in [0, 1]$ is the opacity, $\mu \in \mathbb{R}^3$ is the center position, and $r$ is the radius. Each Gaussian distribution also carries RGB color information $c_i = [r_i, g_i, b_i]^T$ and semantic label information $s_i = [s_i^1, s_i^2, \ldots, s_i^K]^T$, where $K$ is the total number of semantic categories. Unlike existing monocular SLAM systems, this multi-channel Gaussian representation allows the system to simultaneously optimize geometric, appearance, and semantic information, achieving more comprehensive scene understanding.
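The attributes carried by each primitive can be summarized by the following minimal sketch; the class and field names are illustrative only and do not reflect the actual implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticGaussian:
    """One isotropic Gaussian primitive with appearance and semantic channels."""
    mu: np.ndarray        # (3,) centre position
    r: float              # isotropic radius
    sigma: float          # opacity in [0, 1]
    color: np.ndarray     # (3,) RGB colour c_i
    sem: np.ndarray       # (K,) semantic label scores s_i

    def influence(self, x: np.ndarray) -> float:
        """f_3D(x): opacity-scaled isotropic Gaussian falloff at point x."""
        return float(self.sigma *
                     np.exp(-np.sum((x - self.mu) ** 2) / (2.0 * self.r ** 2)))
```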
To optimize the Gaussian distribution parameters for scene representation, we need to project the three-dimensional Gaussians onto the two-dimensional image plane through differentiable rendering. Similar to traditional RGB-D semantic Gaussian SLAM systems (such as SGS-SLAM), we adopt volumetric rendering methods and additionally incorporate a proxy depth channel. The Gaussian center $\mu$, radius $r$, and depth $d$ are projected into the camera coordinate system through the standard point rendering formulas:
$$\mu^{2D} = K\,\frac{E_t\,\mu}{d}, \qquad r^{2D} = \frac{l\,r}{d}, \qquad d = \left(E_t\,\mu\right)_z$$
where $K$ is the camera intrinsic matrix, $E_t$ is the camera extrinsic matrix at time $t$, and $l$ is the focal length.
During rendering, the pixel-level rendered color $C_{\mathrm{pix}}$, depth $D_{\mathrm{pix}}$, and semantic information $S_{\mathrm{pix}}$ are computed through front-to-back volumetric rendering:
$$C_{\mathrm{pix}} = \sum_{i=1}^{n} c_i\, f_{i,\mathrm{pix}}^{2D} \prod_{j=1}^{i-1}\left(1 - f_{j,\mathrm{pix}}^{2D}\right)$$
$$D_{\mathrm{pix}} = \sum_{i=1}^{n} d_i\, f_{i,\mathrm{pix}}^{2D} \prod_{j=1}^{i-1}\left(1 - f_{j,\mathrm{pix}}^{2D} \cdot w_d(d_j, \hat{D}_j)\right)$$
$$S_{\mathrm{pix}} = \sum_{i=1}^{n} s_i\, f_{i,\mathrm{pix}}^{2D} \prod_{j=1}^{i-1}\left(1 - f_{j,\mathrm{pix}}^{2D}\right)$$
where $w_d(d_i, \hat{D}_i) = \exp\!\left(-\kappa\,|d_i - \hat{D}_i|^2\right)$ is the depth fusion weight factor, $\hat{D}_i$ is the proxy depth estimate, and $\kappa$ is the balance parameter. This design enables the system to adaptively adjust the volumetric rendering weights based on the consistency between proxy depth and rendered depth, improving the accuracy and robustness of depth rendering. Compared to traditional RGB-D systems, this depth fusion mechanism handles regions of uncertain depth more flexibly, achieving precise reconstruction of complex geometric structures.
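A minimal per-pixel sketch of this multi-channel front-to-back compositing is given below. It assumes the rasterizer has already produced the per-pixel Gaussian influences $f^{2D}_{i,\mathrm{pix}}$ sorted front to back and that the pixel's single proxy depth value plays the role of $\hat{D}_i$; all names are illustrative.

```python
import numpy as np

def composite_pixel(f2d, colors, depths, sems, proxy_depth, kappa=1.0):
    """Front-to-back compositing of colour, depth, and semantics for one pixel.
    f2d:    (n,) 2-D Gaussian influences, sorted front to back
    colors: (n,3), depths: (n,), sems: (n,K) per-Gaussian attributes."""
    C = np.zeros(3)
    S = np.zeros(sems.shape[1])
    D = 0.0
    T_c = 1.0   # transmittance for colour / semantics
    T_d = 1.0   # transmittance for depth (modulated by the fusion weight)
    for i in range(len(f2d)):
        C += colors[i] * f2d[i] * T_c
        S += sems[i] * f2d[i] * T_c
        D += depths[i] * f2d[i] * T_d
        # Depth fusion weight: down-weights Gaussians whose depth disagrees
        # with the pixel's proxy depth estimate.
        w_d = np.exp(-kappa * (depths[i] - proxy_depth) ** 2)
        T_c *= (1.0 - f2d[i])
        T_d *= (1.0 - f2d[i] * w_d)
    return C, D, S
```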
The introduction of semantic information not only enhances the system’s scene understanding capability but also serves as additional constraints during optimization, promoting accurate reconstruction of geometric and appearance features. During mapping optimization, semantic loss serves as a key component in multi-channel optimization, guiding the system to accurately model critical features such as object boundaries and material changes. Moreover, our approach exploits symmetric correspondences in semantic regions, utilizing bilateral and mirror symmetries commonly present in architectural environments to enforce consistent semantic labeling across symmetric scene elements. This semantic-guided optimization strategy significantly improves the system’s reconstruction quality and segmentation accuracy in complex environments, providing rich environmental representations for downstream tasks such as scene understanding, object interaction, and autonomous navigation. Compared to existing monocular SLAM and NeRF-based systems, MSGS-SLAM achieves more efficient rendering speed and more precise object-level geometric reconstruction through semantic Gaussian representation, effectively overcoming the over-smoothing problems commonly present in neural implicit representations.
By setting $d_i = 1$, we can compute the silhouette value $\mathrm{Sil}_{\mathrm{pix}} = D_{\mathrm{pix}}(d_i = 1)$, which is used to determine pixel visibility in the current view. This is crucial for camera pose estimation and scene reconstruction, especially when dealing with thin objects or complex geometric shapes, where silhouette information can provide additional structural constraints, effectively avoiding geometric degeneracy problems.

3.3. Semantic-Guided System Tracking

In the tracking phase, we propose a keyframe selection algorithm based on semantic guidance and proxy depth collaborative mechanisms. Unlike traditional SLAM systems that rely solely on photometric error or geometric residual minimization tracking methods, which are prone to losing track in scenes with sparse textures, lighting changes, or significant viewpoint variations, MSGS-SLAM innovatively incorporates semantic information as additional constraints, significantly enhancing the system’s robustness in complex environments. Specifically, we utilize high-level semantic representations generated by the DINOv2 visual feature extractor [34], providing the system with scene understanding capabilities that enable the tracking process to focus not only on low-level geometric and appearance features but also to identify and utilize semantic consistency in scenes.
First, the system captures and stores keyframes at fixed time intervals, ensuring temporal consistency of trajectory sampling. Subsequently, keyframes are filtered based on dual geometric and semantic constraints to establish precise associations between current frames and historical observations.
At the geometric constraint level, the system randomly samples pixels from the current frame, extracts their corresponding three-dimensional Gaussian distributions $G_{\mathrm{sample}}$, and projects them to the keyframe viewpoints, generating projected distributions $G_{\mathrm{proj}}$. Based on the geometric visibility of the projected distributions, we define an overlap rate metric:
$$\eta = \frac{1}{\left|G_{\mathrm{proj}}\right|} \sum_{G_i \in G_{\mathrm{proj}}} \mathbb{1}\!\left\{\, 0 \le \mathrm{width}(G_i) \le W,\; 0 \le \mathrm{height}(G_i) \le H \,\right\}$$
This metric quantifies the degree of geometric overlap between the current frame and a keyframe, where $W$ and $H$ denote the image width and height, respectively. Based on this, we introduce a geometric overlap rate threshold $T_{\mathrm{geo}}$ and retain the keyframes satisfying:
$$KF_{\mathrm{geo}} = \left\{\, KF_i \mid \eta(KF_i) > T_{\mathrm{geo}} \,\right\}$$
Building upon this, the system further introduces semantic consistency constraints to refine the keyframe candidate set. Using the semantic features extracted by DINOv2, we compute the mean Intersection over Union (mIoU) between the keyframe semantic map $S_{\mathrm{pix}}^{KF_i}$ and the current frame semantic map $S_{\mathrm{pix}}^{\mathrm{cur}}$:
$$\mathrm{mIoU}(KF_i) = \frac{\sum_{\mathrm{pix}} \left( S_{\mathrm{pix}}^{KF_i} \cap S_{\mathrm{pix}}^{\mathrm{cur}} \right)}{\sum_{\mathrm{pix}} \left( S_{\mathrm{pix}}^{KF_i} \cup S_{\mathrm{pix}}^{\mathrm{cur}} \right)}$$
Unlike traditional methods, we prioritize keyframes with lower semantic similarity, setting a threshold $T_{\mathrm{sem}}$ to filter out frames with high semantic overlap:
$$KF_{\mathrm{final}} = \left\{\, KF_i \in KF_{\mathrm{geo}} \mid \mathrm{mIoU}(KF_i) < T_{\mathrm{sem}} \,\right\}$$
This innovative design enables the system to optimize map construction from diverse perspectives, particularly for views containing different semantic content, allowing more effective constraint of three-dimensional spatial structure. Finally, appropriate keyframes are randomly sampled from remaining candidates to establish associations with current frames.
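The two-stage filter can be summarized by the following minimal sketch. It assumes precomputed callables for the overlap rate and the semantic mIoU; the threshold defaults follow Section 4.1, while the keyframe data layout is an assumption made for illustration.

```python
import random

def select_keyframes(keyframes, cur_sem, overlap_fn, miou_fn,
                     t_geo=0.08, t_sem=0.65, max_kf=20):
    """Two-stage keyframe filtering: geometric overlap first, then preference
    for keyframes whose semantic content differs from the current frame."""
    # Stage 1: keep keyframes with sufficient geometric overlap (eta > T_geo).
    geo = [kf for kf in keyframes if overlap_fn(kf) > t_geo]
    # Stage 2: keep keyframes with low semantic similarity (mIoU < T_sem).
    final = [kf for kf in geo if miou_fn(kf["sem"], cur_sem) < t_sem]
    # Randomly sample the keyframes that will constrain the current frame.
    return random.sample(final, min(max_kf, len(final)))
```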
To overcome cumulative errors in long-term tracking, we compute time-varying uncertainty scores for each keyframe:
$$U(t) = \rho \cdot e^{-\tau\,(t_{\mathrm{current}} - t)} \cdot \omega(t)$$
where $\rho$ is the uncertainty baseline coefficient for adjusting the overall uncertainty level; $\tau$ is the temporal decay coefficient controlling the weight decay rate of historical observations; $t_{\mathrm{current}}$ and $t$ are the current timestamp and the keyframe timestamp, respectively; and $\omega(t)$ is a modulation factor based on depth estimation confidence:
$$\omega(t) = \exp\!\left(-\delta \cdot \frac{1}{N} \sum_{i} U(\hat{D}_t)_i\right)$$
where $\delta$ is the depth uncertainty sensitivity parameter controlling the system's response strength to depth estimation uncertainty, and $N$ is the number of pixel samples used to compute the average depth uncertainty in the current frame. This uncertainty score is not only used to weight the mapping loss $L_{\mathrm{mapping}}$ but also reflects the growth of reconstruction uncertainty caused by accumulated camera tracking errors along the trajectory.
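A minimal sketch of this score is shown below; the temporal decay $\tau = 0.1$ follows Section 4.1, while the values of $\rho$ and $\delta$ and the function name are illustrative assumptions.

```python
import numpy as np

def keyframe_uncertainty(t_current, t_kf, depth_uncertainty_map,
                         rho=1.0, tau=0.1, delta=1.0):
    """Time-varying keyframe uncertainty score U(t): decays with keyframe age
    and with the frame's average proxy-depth uncertainty."""
    omega = np.exp(-delta * float(np.mean(depth_uncertainty_map)))
    return rho * np.exp(-tau * (t_current - t_kf)) * omega
```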
During pose estimation, the first frame's camera pose is set as the identity matrix in the global reference coordinate system. Combined with proxy depth estimation, this setting not only establishes a unified coordinate system but also effectively resolves the scale ambiguity problem inherent in monocular SLAM. For a new time step $t+1$, the initial camera pose is predicted under a constant velocity assumption:
$$E_{t+1} = E_t \cdot \left( E_{t-1}^{-1}\, E_t \right)$$
This prediction model assumes approximately uniform camera motion over short time periods, estimating velocity by analyzing relative displacement between the previous two frames. Compared to complex prediction methods such as Kalman filtering, this linear extrapolation model has low computational overhead, with average prediction errors controlled within 2.5% in experiments, providing good initial estimates for subsequent optimization.
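The prediction step amounts to re-applying the last observed relative motion, as in the following minimal sketch with 4x4 homogeneous extrinsics (the function name is illustrative).

```python
import numpy as np

def predict_pose(E_prev, E_curr):
    """Constant-velocity pose prediction with 4x4 homogeneous extrinsics:
    re-apply the relative motion observed between the last two frames."""
    rel = np.linalg.inv(E_prev) @ E_curr   # E_{t-1}^{-1} E_t
    return E_curr @ rel                    # E_{t+1} = E_t (E_{t-1}^{-1} E_t)
```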
The current pose is then iteratively optimized by minimizing a multi-objective tracking loss function:
$$L_{\mathrm{tracking}} = \sum_{\mathrm{pix}\,:\,\mathrm{Sil}_{\mathrm{pix}} > T_{\mathrm{sil}}} \left( \lambda_D\, w_{\mathrm{depth}}\, \big|\hat{D}_{\mathrm{pix}} - D_{\mathrm{pix}}\big| + \lambda_C\, \big|I_{\mathrm{pix}} - C_{\mathrm{pix}}\big| + \lambda_S\, \big|S_{\mathrm{pix}}^{GT} - S_{\mathrm{pix}}\big| \right)$$
This loss function integrates information from three dimensions: geometric depth, appearance color, and semantic labels, forming a comprehensive multi-objective optimization framework. Unlike traditional RGB-D Gaussian semantic SLAM systems, MSGS-SLAM adopts an adaptive weighting strategy, dynamically adjusting the contribution of each loss term based on data reliability. Specifically, $w_{\mathrm{depth}} = U(\hat{D}_t)$ is an adaptive weight based on proxy depth uncertainty, allowing the system to reduce the influence of the depth constraint in regions where the depth estimate is unreliable; $T_{\mathrm{sil}}$ is the silhouette threshold ensuring that optimization uses only reliable, visible portions of the map; and $\lambda_D$, $\lambda_C$, and $\lambda_S$ are predefined weight coefficients balancing the contributions of the depth, appearance, and semantic losses.
In implementation, we employ the Adam optimizer [35] with adaptive learning rates, setting reasonable iteration counts (10–20 iterations) and early stopping strategies to effectively control computation time while ensuring optimization quality, enabling the system to achieve real-time pose estimation (30–40 Hz). By appropriately adjusting weight coefficients, such as increasing depth loss weight in indoor scenes ( λ D = 1.2–1.5) while enhancing color loss weight in texture-rich outdoor scenes ( λ C = 0.8–1.0), the system can adapt to different environmental characteristics.
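A schematic PyTorch sketch of the masked tracking loss is given below. It assumes the rasterizer returns per-pixel color, depth, semantic, and silhouette maps in the indicated layout; the default weights mirror the tracking configuration in Section 4.1, and in practice this loss would be minimized with Adam over the camera pose parameters for 10 to 20 iterations per frame, as described above.

```python
import torch

def tracking_loss(render, proxy_depth, image, sem_gt, w_depth,
                  t_sil=0.95, lam_d=0.8, lam_c=1.0, lam_s=0.3):
    """Masked L1 tracking loss over depth, colour, and semantic channels.
    `render` holds per-pixel 'color' (H,W,3), 'depth' (H,W), 'sem' (H,W,K),
    and 'sil' (H,W) tensors from the differentiable rasterizer."""
    mask = render["sil"] > t_sil                            # reliable, visible pixels only
    l_depth = (w_depth * (proxy_depth - render["depth"]).abs())[mask].sum()
    l_color = (image - render["color"]).abs().sum(dim=-1)[mask].sum()
    l_sem = (sem_gt - render["sem"]).abs().sum(dim=-1)[mask].sum()
    return lam_d * l_depth + lam_c * l_color + lam_s * l_sem
```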
This semantic-guided tracking strategy addresses the problem of traditional purely geometric methods being prone to losing track. When photometric information is insufficient or geometric structure is ambiguous, semantic information provides high-level scene understanding, ensuring the system maintains stable tracking in complex environments. Experimental results demonstrate that this multimodal fusion tracking strategy significantly improves the system’s robustness and pose estimation accuracy in challenging environments with occlusions, lighting changes, and dynamic objects.

3.4. Semantic-Guided System Mapping

Based on the four-dimensional semantic Gaussian representation model described in Section 3.2, this section focuses on parameter optimization and map update strategies during the system mapping process. Unlike traditional RGB-D systems, MSGS-SLAM introduces proxy depth confidence as a key guidance signal during the mapping phase to compensate for the uncertainty of monocular visual depth estimation.
Starting from the first frame, all pixels participate in map initialization, with each pixel generating a Gaussian primitive whose depth is provided by Metric3D-V2. To improve initialization quality, we introduce semantic consistency constraints, applying shared depth priors to pixels of the same semantic category, expressed through the following formula:
$$P(D_i \mid S_i = k) \propto \exp\!\left(-\frac{(D_i - \mu_k)^2}{2\sigma_k^2}\right)$$
where $\mu_k$ and $\sigma_k^2$ are the depth mean and variance for category $k$, respectively, obtained through statistical learning. This design enables the system to exploit semantic prior knowledge to improve initialization accuracy in regions with uncertain depth.
During map updates in subsequent frames, the system adopts a semantic-aware dynamic Gaussian densification strategy, deciding whether to introduce new Gaussian distributions in specific regions based on three key conditions: (1) the silhouette value $\mathrm{Sil}_{\mathrm{pix}}$ falls below the threshold $T_{\mathrm{sil}}$, indicating highly uncertain visibility; (2) the difference between the proxy depth $\hat{D}_{\mathrm{pix}}$ and the rendered depth $D_{\mathrm{pix}}$ exceeds the threshold $T_{\mathrm{depth}}$, i.e., $|\hat{D}_{\mathrm{pix}} - D_{\mathrm{pix}}| > T_{\mathrm{depth}}$, suggesting the presence of new geometric entities; and (3) the semantic segmentation result $S_{\mathrm{pix}}$ is inconsistent with the rendered semantics $S_{\mathrm{render}}$, indicating potential object boundaries or new objects.
Based on proxy depth estimation accuracy, we categorize depth information into three types: high-confidence regions ($U(\hat{D}_{\mathrm{pix}}) > \tau_{\mathrm{high}}$), medium-confidence regions ($\tau_{\mathrm{low}} < U(\hat{D}_{\mathrm{pix}}) \le \tau_{\mathrm{high}}$), and low-confidence regions ($U(\hat{D}_{\mathrm{pix}}) \le \tau_{\mathrm{low}}$). For different confidence regions, the system adopts adaptive optimization strategies: high-confidence regions rely mainly on depth constraints; medium-confidence regions balance depth, appearance, and semantic constraints; and low-confidence regions rely primarily on semantic and appearance constraints while introducing spatial smoothness regularization.
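How the densification conditions and confidence buckets might be combined is sketched below. The OR-combination of the three triggers and the numeric depth threshold are assumptions; $T_{\mathrm{sil}} = 0.55$ and the $\tau_{\mathrm{high}}/\tau_{\mathrm{low}}$ defaults follow Section 4.1.

```python
import numpy as np

def needs_new_gaussian(sil, d_render, d_proxy, sem_render, sem_label,
                       t_sil=0.55, t_depth=0.10):
    """Per-pixel densification test: uncertain visibility, depth disagreement,
    or semantic inconsistency flags a pixel for new Gaussian primitives."""
    low_visibility = sil < t_sil
    depth_mismatch = np.abs(d_proxy - d_render) > t_depth
    sem_mismatch = sem_render != sem_label
    return low_visibility | depth_mismatch | sem_mismatch

def confidence_bucket(u, tau_high=0.85, tau_low=0.35):
    """Map per-pixel depth confidence to an optimization regime:
    2 = depth-driven, 1 = balanced, 0 = semantics/appearance-driven."""
    return np.where(u > tau_high, 2, np.where(u > tau_low, 1, 0))
```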
After densification, we propose a semantically enhanced multi-channel optimization loss function:
$$L_{\mathrm{mapping}} = U_t \sum_{\mathrm{pix}} \left( \lambda_D\, w_{\mathrm{depth}}(\hat{D}_{\mathrm{pix}})\, \big|\hat{D}_{\mathrm{pix}} - D_{\mathrm{pix}}\big| + \lambda_C\, L_C + \lambda_S\, L_S + \lambda_R\, L_{\mathrm{reg}} + \lambda_E\, L_{\mathrm{edge}} \right)$$
where $w_{\mathrm{depth}}(\hat{D}_{\mathrm{pix}}) = \min\!\left(1, \nu \cdot U(\hat{D}_{\mathrm{pix}})\right)$ is an adaptive weight based on the proxy depth confidence $U$. The newly introduced $L_{\mathrm{edge}}$ term is a semantic edge preservation term, defined as:
$$L_{\mathrm{edge}} = \sum_{(i,j) \in \mathcal{N}} \psi(S_i, S_j) \cdot \exp\!\left(-\zeta\, \|D_i - D_j\|^2\right)$$
where $(i,j) \in \mathcal{N}$ denotes adjacent pixel pairs, $\psi(S_i, S_j)$ is a semantic similarity measure, and $\zeta$ is a parameter controlling distance sensitivity. This term encourages the system to maintain depth discontinuity at object boundaries while maintaining depth smoothness within the same semantic region, thereby improving the accuracy of reconstructed object boundaries. $\lambda_R L_{\mathrm{reg}}$ is a regularization term that suppresses excessive expansion of Gaussian distributions. $L_C$ and $L_S$ are weighted SSIM losses for the appearance and semantic images:
$$L_{\mathrm{img}} = \sum_{\mathrm{pix}} \omega\, \big|I_{\mathrm{pix}}^{GT} - I_{\mathrm{pix}}\big| + (1 - \omega)\left(1 - \mathrm{ssim}\!\left(I_{\mathrm{pix}}^{GT}, I_{\mathrm{pix}}\right)\right)$$
$\lambda_D$, $\lambda_C$, $\lambda_S$, $\lambda_R$, $\lambda_E$, and $\omega$ are predefined hyperparameters, and $U_t$ is the time-varying uncertainty score. To optimize the joint depth-semantic representation, we introduce a bidirectional information flow mechanism: semantic labels guide depth reasoning (applying depth consistency constraints to regions with similar semantics), while depth gradients assist semantic boundary refinement (enhancing semantic boundary detection at depth discontinuities).
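A minimal PyTorch sketch of the $L_{\mathrm{edge}}$ term over 4-connected neighbour pairs is shown below. The instantiation of $\psi$ as a binary same-label indicator and the value of $\zeta$ are assumptions, since neither is specified in closed form above.

```python
import torch

def edge_loss(depth, sem_labels, zeta=2.0):
    """L_edge over 4-connected neighbour pairs: the term is large where pixels
    sharing a semantic label also have similar depth (psi = same-label indicator)."""
    def pair_term(d_a, d_b, s_a, s_b):
        same_class = (s_a == s_b).float()    # assumed instantiation of psi
        return (same_class * torch.exp(-zeta * (d_a - d_b) ** 2)).sum()
    l_h = pair_term(depth[:, :-1], depth[:, 1:], sem_labels[:, :-1], sem_labels[:, 1:])
    l_v = pair_term(depth[:-1, :], depth[1:, :], sem_labels[:-1, :], sem_labels[1:, :])
    return l_h + l_v
```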
In implementation, we adopt dynamic batching strategies, adaptively adjusting optimization step sizes based on semantic and depth confidence to improve computational efficiency while ensuring convergence quality. Low-confidence regions use smaller step sizes (0.001–0.005) for cautious updates, while high-confidence regions use larger step sizes (0.01–0.05) to accelerate convergence. Additionally, we introduce semantic-guided gradient clipping mechanisms to prevent optimization instability at object boundaries.
Experimental results demonstrate that this multi-channel optimization strategy, which fuses semantic information with proxy depth confidence, significantly improves the system’s reconstruction accuracy in texture-sparse, lighting-varying, and geometrically complex regions. Furthermore, for typical depth sensing challenging areas such as highly reflective surfaces and transparent objects, the semantic-guided mapping strategy can effectively utilize contextual information for reasonable compensation, providing reliable environmental representations for downstream tasks such as scene understanding and autonomous navigation.

3.5. Semantic Loop Detection and System Optimization

To suppress pose drift errors accumulated during long-term system operation, MSGS-SLAM constructs a loop detection mechanism based on semantic keyframe sets. Traditional SLAM systems typically focus only on correcting pose tracking errors; however, this approach struggles to sufficiently correct mapping errors in long sequences, leading to reduced global consistency. Addressing this problem, we design a hierarchical optimization strategy that fuses semantic information, including local semantic BA optimization and global semantic BA optimization, achieving joint correction of tracking and mapping errors.
First, we cluster keyframes based on semantic content, forming semantic keyframe sets with high discriminability:
$$KF_{\mathrm{sem}} = \left\{\, KF_i \mid S(KF_i) > T_{\mathrm{cluster}} \,\right\}$$
where $S(KF_i)$ is the semantic richness score of keyframe $KF_i$, obtained through a weighted combination of scene semantic entropy and semantic boundary density:
$$S(KF_i) = \theta \cdot H\!\left(S^{KF_i}\right) + (1 - \theta) \cdot E\!\left(S^{KF_i}\right)$$
$H(S^{KF_i})$ is the entropy of the semantic label distribution, $E(S^{KF_i})$ is the semantic boundary density function, and $\theta$ is the balance parameter. The threshold $T_{\mathrm{cluster}}$ is set to 0.9, ensuring that only keyframes with rich semantic information and strong discriminability enter the clustering set. This high-threshold strategy significantly improves the accuracy of loop candidates and reduces the probability of false matches.
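Since the entropy and boundary-density terms are not given in closed form above, the following NumPy sketch instantiates them as label-histogram entropy and the fraction of 4-neighbour pixel pairs whose labels differ; these choices and the default $\theta = 0.5$ are assumptions for illustration.

```python
import numpy as np

def semantic_richness(sem_map, theta=0.5):
    """Semantic richness S(KF): weighted sum of label-distribution entropy and
    semantic boundary density over an (H, W) integer label map."""
    _, counts = np.unique(sem_map, return_counts=True)
    p = counts / counts.sum()
    entropy = float(-(p * np.log(p + 1e-12)).sum())        # H: label entropy
    # E: fraction of 4-neighbour pixel pairs whose labels differ.
    edges_h = (sem_map[:, :-1] != sem_map[:, 1:]).sum()
    edges_v = (sem_map[:-1, :] != sem_map[1:, :]).sum()
    boundary_density = float(edges_h + edges_v) / sem_map.size
    return theta * entropy + (1.0 - theta) * boundary_density
```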
When the system detects that the similarity between the current frame and historical semantic keyframes exceeds a predetermined threshold, loop candidate detection is triggered:
$$\mathrm{Loop}(F_{\mathrm{current}}, KF_{\mathrm{sem}}) = \mathrm{True} \quad \text{if} \quad \max_{i}\, \mathrm{mIoU}(F_{\mathrm{current}}, KF_i) > T_{\mathrm{loop}}$$
To enhance detection robustness, we design a multi-level semantic similarity assessment mechanism that comprehensively considers low-level feature consistency, mid-level semantic distribution, and high-level scene structure:
$$\mathrm{SimScore}(F_a, F_b) = \alpha_1 \cdot \mathrm{mIoU}(F_a, F_b) + \alpha_2 \cdot \mathrm{DINOSim}(F_a, F_b) + \alpha_3 \cdot \mathrm{StructSim}(F_a, F_b)$$
where mIoU measures pixel-level semantic label consistency, DINOSim computes visual feature similarity based on DINOv2, and StructSim evaluates topological consistency of scene structural semantic graphs. This multi-level similarity measure effectively overcomes the problem of single features being susceptible to lighting and viewpoint changes, improving loop detection accuracy in complex environments.
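A minimal sketch of the fused score is given below; the weights $\alpha_1$ to $\alpha_3$, the cosine-similarity instantiation of DINOSim over pooled DINOv2 features, and the assumption that StructSim is precomputed are all illustrative choices not specified above.

```python
import numpy as np

def sim_score(miou, feat_a, feat_b, struct_sim, a1=0.4, a2=0.4, a3=0.2):
    """Multi-level similarity: pixel-level mIoU, DINOv2 feature cosine
    similarity, and a precomputed structural-graph similarity."""
    cos = float(np.dot(feat_a, feat_b) /
                (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-12))
    return a1 * miou + a2 * cos + a3 * struct_sim
```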
For detected loop candidates, we first apply local semantic BA optimization, performing fine pose adjustments for specific loop regions:
$$L_{\mathrm{local}} = \sum_{i \in N(\mathrm{loop})} \left( \lambda_G\, \big\|E_i - E_i^{\mathrm{pred}}\big\|_{\Sigma_i}^2 + \lambda_S \sum_{j \in V(i)} \big\|S_j^{GT} - S(E_i, G_j)\big\|^2 + \lambda_D \sum_{j \in V(i)} \big\|D_j^{GT} - D(E_i, G_j)\big\|^2 \right)$$
where $N(\mathrm{loop})$ represents the keyframe set in the loop region, $V(i)$ represents the set of Gaussian distributions observed by keyframe $i$, and $S(E_i, G_j)$ and $D(E_i, G_j)$ represent the semantic and depth rendering results of Gaussian distribution $G_j$ under camera pose $E_i$, respectively. This loss function explicitly incorporates semantic information into the BA optimization process, ensuring consistency between geometric and semantic representations.
After local optimization, the system triggers global semantic BA optimization, achieving global loop correction through pose graph optimization:
$$E_1^*, \ldots, E_T^* = \arg\min_{E_1, \ldots, E_T} \sum_{(i,j) \in C} \big\|E_i^{-1} E_j - E_{ij}\big\|_{\Sigma_{ij}}^2 + \xi \sum_{i=1}^{T} \sum_{k \in M(i)} \big\|S_k^{GT} - S(E_i, G_k)\big\|^2$$
where $C$ represents the constraint set, including consecutive-frame constraints and loop constraints, $E_{ij}$ is the relative pose measurement, $\Sigma_{ij}$ is the covariance matrix, and $M(i)$ represents the set of map points visible at pose $i$. The second term is a global semantic consistency constraint, ensuring that the optimized poses generate rendering results consistent with the ground truth semantic labels. This semantic-guided global optimization strategy not only corrects camera poses but also synchronously optimizes the three-dimensional Gaussian map, achieving joint correction of tracking and mapping errors.
Experimental results demonstrate that this semantic-guided hierarchical optimization strategy significantly reduces cumulative drift in long sequences. Particularly in scenes containing large amounts of repetitive textures and low-discriminability regions, our method shows clear advantages, successfully detecting and correcting loops that traditional geometric methods cannot identify, effectively improving system robustness and global map consistency.

4. Experiments

4.1. Experimental Setup

  • Datasets. We conduct systematic evaluations on both synthetic and real-world scene datasets to validate the effectiveness and generalization capability of our method. To ensure a fair comparison with existing SLAM systems, we select the Replica dataset [36] and ScanNet dataset [37] as benchmark testing platforms. Specifically, the camera trajectories and semantic maps in the Replica dataset are derived from precise simulation environments, while the reference camera poses in the ScanNet dataset are reconstructed using the BundleFusion algorithm. All two-dimensional semantic labels used in the experiments are directly provided by the original datasets, ensuring consistency and reliability of the evaluation benchmarks.
  • Metrics. To evaluate system performance, we adopt a multi-dimensional quantitative metric suite: for scene reconstruction quality, we report PSNR [38], Depth L1, SSIM [39], and LPIPS [40]; for camera pose estimation accuracy, we use the root mean square error of the absolute trajectory error (ATE RMSE); and for semantic segmentation, we use mean Intersection over Union (mIoU) [41].
  • Baselines. We compare the tracking and mapping performance of our system with current state-of-the-art methods, including SplaTAM, SGS-SLAM, MonoGS, and SNI-SLAM. For semantic segmentation accuracy evaluation, we select SGS-SLAM and SNI-SLAM, which represent the cutting-edge level in the field of three-dimensional Gaussian semantic scene understanding, as benchmark standards.
  • Implementation Details. In this section, we detail the experimental setup and key hyperparameter configurations. To address the complexity of hyperparameter tuning in practical applications, we provide principled guidelines for parameter selection based on extensive empirical validation across diverse scene types. All experiments are conducted on a workstation equipped with a 3.80 GHz Intel i7-10700K CPU and NVIDIA RTX 3090 GPU.
  • Hyperparameter Selection Methodology. Our hyperparameter values were determined through systematic grid search validation on a held-out subset of Replica scenes (Room0, Office0) followed by cross-validation on ScanNet scenes. We evaluated 125 different parameter combinations using a multi-objective optimization approach that balances reconstruction quality (PSNR), tracking accuracy (ATE), and semantic performance (mIoU). The reported values represent the Pareto-optimal configuration that achieves the best trade-off across all metrics.
During the tracking phase, the silhouette threshold is set to $T_{\mathrm{sil}} = 0.95$, and the loss weights are configured as $\lambda_D = 0.8$, $\lambda_C = 1.0$, and $\lambda_S = 0.3$, with a camera parameter learning rate of $3 \times 10^{-3}$. The loss weights prioritize appearance consistency ($\lambda_C = 1.0$) over depth ($\lambda_D = 0.8$) and semantic ($\lambda_S = 0.3$) guidance; this was determined through ablation studies showing that equal weighting ($\lambda_C = \lambda_D = 1.0$) leads to unstable convergence in proxy-depth regions. For indoor scenes with reliable depth estimation, increase $\lambda_D$ to 1.2–1.5; for outdoor environments with varying lighting, adjust $\lambda_C$ to 0.8–1.0.
Keyframe selection employs a geometric overlap threshold $T_{\mathrm{geo}} = 0.08$ and a semantic threshold $T_{\mathrm{sem}} = 0.65$, with each frame associated with at most 20 keyframes. The geometric threshold of 0.08 was empirically determined to prevent redundant keyframes while maintaining sufficient coverage, validated through trajectory coverage analysis. The semantic threshold $T_{\mathrm{sem}} = 0.65$ encourages diverse semantic content based on semantic entropy measurements; use 0.5–0.6 for semantically rich scenes and 0.7–0.8 for sparse environments.
During the mapping phase, the silhouette threshold is lowered to $T_{\mathrm{sil}} = 0.55$, and the loss weights are configured as $\lambda_D = 0.8$, $\lambda_C = 1.0$, $\lambda_S = 0.15$, and $\lambda_E = 0.5$. The lower threshold allows map densification, while the reduced $\lambda_S = 0.15$ prevents over-smoothing during reconstruction, as determined through boundary preservation analysis. The edge preservation weight $\lambda_E = 0.5$ was optimized to balance geometric accuracy with semantic boundary clarity.
Key parameters for proxy depth estimation include the uncertainty quantification parameters $\alpha = 0.7$, $\beta = 0.8$, and $\gamma = 0.5$, and the depth confidence thresholds $\tau_{\mathrm{high}} = 0.85$ and $\tau_{\mathrm{low}} = 0.35$. These parameters balance geometric gradients ($\alpha$), confidence maps ($\beta$), and semantic accuracy ($\gamma$), validated through depth estimation error analysis against ground truth depth maps. The confidence thresholds were set based on Metric3D-V2's accuracy distribution, where 85% confidence corresponds to sub-centimeter accuracy. Practitioners should adjust the thresholds to the deployment environment: increase $\alpha$ and $\gamma$ while reducing $\beta$ for challenging depth conditions.
  • Parameter Sensitivity Analysis. We conduct a sensitivity analysis showing that our method is robust to ±20% variations in most parameters, with the tracking weights ($\lambda_D$, $\lambda_C$) being most critical for convergence. The regularization parameter $\omega$ in Equation (20) was set to 0.2 through SSIM optimization, while the temporal decay $\tau$ in Equation (13) was set to 0.1 based on keyframe retention analysis. These values, while not globally optimal due to computational constraints, represent practical choices validated across multiple scene types and lighting conditions.

4.2. Evaluation of Mapping, Tracking, and Semantic Segmentation on the Replica Dataset

We evaluate our MSGS-SLAM system on the Replica dataset, which provides high-quality synthetic scenes with ground truth camera poses and semantic labels. Our evaluation encompasses three key aspects: rendering quality (PSNR, SSIM, LPIPS, Depth L1), tracking accuracy (ATE), and computational efficiency (FPS). We compare against state-of-the-art methods including SplaTAM, SGS-SLAM, MonoGS, and SNI-SLAM across eight scenes (Room0-2, Office0-3).
Figure 2 presents visual comparisons of reconstruction quality across representative Replica scenes. Our MSGS-SLAM produces noticeably sharper geometric details and more accurate color reproduction compared to baseline methods. Particularly evident are the improved edge preservation and texture clarity in complex indoor environments, demonstrating the effectiveness of our semantic-guided Gaussian representation and proxy depth integration. The visual results align with our quantitative metrics, showing consistent quality improvements across diverse scene configurations.
  • Rendering Quality Analysis. As demonstrated in Table 1, our MSGS-SLAM achieves superior performance across all rendering metrics. Our method attains the highest average PSNR of 34.48 dB, surpassing SGS-SLAM by 0.36 dB, with the most significant improvement in Office3 (31.60 vs. 31.29 dB). The SSIM score of 0.957 and lowest LPIPS of 0.108 indicate excellent structural preservation and perceptual quality. Notably, our Depth L1 error of 0.345 substantially outperforms monocular competitor MonoGS (27.23), validating the effectiveness of Metric3D-V2 proxy depth integration.
  • Tracking Accuracy and Efficiency. Table 2 shows our method achieves the best trajectory accuracy with an ATE RMSE of 0.408 cm and an ATE Mean of 0.319 cm, significantly outperforming monocular methods such as MonoGS (12.823 cm mean error). Despite operating on monocular input, our approach maintains competitive performance with RGB-D methods while achieving real-time capability (2.05 SLAM FPS). The superior tracking performance stems from our semantic-guided keyframe selection and uncertainty-aware proxy depth optimization.
  • Semantic Segmentation Performance. Our semantic evaluation in Table 3 demonstrates significant improvements with 92.68% average mIoU, surpassing SGS-SLAM by 0.80% and SNI-SLAM by 4.34%. The consistent performance across scenes, particularly 93.51% in Office0, validates our four-dimensional semantic Gaussian representation. Unlike RGB-D systems limited by depth sensor constraints, our proxy depth approach maintains robust semantic understanding across varying conditions.
Figure 3 illustrates the semantic segmentation and depth estimation capabilities of our MSGS-SLAM system across different Replica scenes. The rendered RGB images exhibit high visual fidelity, while the semantic results demonstrate precise object boundary delineation and consistent category classification. The depth estimation maps reveal smooth transitions and accurate geometric structure recovery, validating the effectiveness of our Metric3D-V2 integration. Notably, the ground truth depth comparison confirms our method’s ability to preserve fine-scale geometric details while maintaining semantic coherence throughout the reconstruction process.
The Replica evaluation validates MSGS-SLAM’s effectiveness across all metrics. Superior rendering quality confirms our semantic-guided 3D Gaussian representation captures fine details, while excellent tracking accuracy demonstrates successful monocular scale ambiguity resolution. The outstanding semantic performance establishes our method as a practical solution bridging monocular accessibility with RGB-D-level performance.

4.3. Evaluation of Mapping, Tracking, and Semantic Segmentation on the ScanNet Dataset

Figure 4 showcases reconstruction quality comparisons across challenging ScanNet scenes, highlighting our method’s robustness in real-world environments. Despite the complex lighting conditions and surface material variations inherent in natural scenes, our MSGS-SLAM maintains superior visual fidelity and geometric accuracy compared to baseline approaches. The results demonstrate effective handling of problematic areas such as reflective surfaces, low-texture regions, and occlusion boundaries, where traditional monocular methods typically struggle. This visual validation supports our quantitative findings and confirms the practical applicability of our semantic-guided proxy depth integration.
The ScanNet dataset evaluation presents significantly more challenging conditions compared to the controlled synthetic Replica environment. Real-world scenes introduce complex factors including dynamic lighting variations, surface reflectance heterogeneity, motion blur, and sensor noise that thoroughly test our monocular approach’s robustness. Unlike Replica’s idealized conditions, ScanNet scenes contain realistic imperfections such as incomplete coverage, varying texture density, and challenging geometric configurations that stress-test our proxy depth estimation and semantic reasoning capabilities.
  • Rendering Quality Analysis. Table 4 demonstrates MSGS-SLAM’s superior rendering quality in natural scenes, achieving 19.48 dB PSNR and 0.738 SSIM while maintaining competitive depth accuracy (6.25 L1 error) comparable to RGB-D methods despite using proxy depth estimation.
  • Tracking Accuracy and Efficiency. Our method achieves the best trajectory accuracy (8.05 cm mean ATE) in Table 5, demonstrating effective scale ambiguity resolution and robust performance under real-world visual complexity while maintaining practical real-time operation (2.08 SLAM FPS).
  • Semantic Segmentation Performance. Table 6 shows our 69.15% mIoU outperforms existing semantic SLAM methods, validating our four-dimensional Gaussian representation’s effectiveness in maintaining scene understanding despite the additional challenges of natural lighting and surface variations.
The ScanNet validation confirms MSGS-SLAM’s practical viability, maintaining consistent performance advantages across rendering, tracking, and semantic metrics in real-world conditions that would challenge traditional monocular SLAM approaches.

4.4. Ablation Study

To systematically validate the contribution of each key component in our MSGS-SLAM framework, we conduct comprehensive ablation experiments on the ScanNet scene 0000_00, with each configuration evaluated five times to ensure statistical reliability and minimize experimental variance.
  • Effectiveness of Proxy Depth Guidance: As shown in Table 7, introducing Metric3D-V2 proxy depth guidance significantly improves PSNR from 18.07 dB to 19.13 dB (5.86% improvement) while reducing depth L1 error from 8.34 to 6.12 (26.6% reduction). This enhancement stems from our innovative proxy depth fusion mechanism with uncertainty quantification, successfully addressing monocular vision’s scale ambiguity problem.
  • Critical Role of Semantic Loss: Table 8 confirms semantic information’s importance in scene reconstruction. Integrating semantic loss achieves 9.8% PSNR improvement and dramatically enhances mIoU from 61.85% to 69.85% (13.0% improvement). This performance leap benefits from our four-dimensional semantic Gaussian representation, enabling collaborative optimization of geometric, appearance, and semantic information.
  • Robustness of Semantic-Guided Loop Detection: Table 9 demonstrates significant trajectory accuracy improvement through semantic-guided loop detection. Our method reduces ATE RMSE from 12.34 cm to 11.15 cm while improving global consistency from 0.742 to 0.824. This advantage originates from our multi-level semantic similarity assessment, effectively overcoming traditional geometric methods’ limitations in challenging scenarios.
  • Synergistic Effects of Multi-modal Loss: Table 10 shows our complete multi-modal loss function achieves optimal performance balance. Compared to color-only configurations, our full system significantly improves PSNR, depth accuracy, and perceptual quality, validating effective synergy among depth constraints, appearance consistency, and semantic guidance.
Our method exhibits excellent generalization across synthetic (Replica) and real-world (ScanNet) datasets, maintaining consistent performance advantages. MSGS-SLAM strikes a strong balance between rendering quality and depth estimation accuracy, and between real-time performance and reconstruction precision, providing a practical solution for monocular semantic SLAM.

5. Conclusions

In this paper, we propose MSGS-SLAM, a novel monocular semantic Gaussian splatting SLAM system that integrates single-camera visual input with three-dimensional Gaussian distribution models. The isotropic Gaussian primitives at the core of this representation are spherically symmetric, which underpins the balanced and efficient optimization behavior of the system. Our first innovation introduces a Metric3D-V2-based proxy depth estimation framework to address monocular SLAM's scale ambiguity problem. The second contribution is our four-dimensional semantic Gaussian representation model that explicitly encodes geometric, appearance, and semantic information within a unified volumetric representation. Our third innovation is the semantic-guided keyframe selection and loop detection mechanism that achieves robust global consistency correction. Experimental evaluation demonstrates competitive performance, with PSNR reaching 34.48 dB, mIoU achieving 92.68% on the Replica dataset, and an ATE RMSE of 0.408 cm while operating at 2.05 SLAM FPS. Our method addresses the critical limitation of existing Gaussian semantic SLAM systems' dependence on depth sensors, improving depth estimation accuracy by 26.6% and semantic segmentation performance by 13.0%.
Compared to existing approaches, MSGS-SLAM offers several distinct advantages: (1) unlike RGB-D methods such as SGS-SLAM that require specialized depth sensors (typically $200–2000 vs. $10–100 for RGB cameras), our monocular approach achieves orders-of-magnitude hardware cost reduction while maintaining comparable reconstruction quality; (2) in contrast to NeRF-based semantic SLAM methods like SNI-SLAM that suffer from over-smoothing and computational inefficiency, our explicit Gaussian representation preserves sharp object boundaries and achieves real-time performance; and (3) compared to traditional monocular SLAM systems like MonoGS that struggle with scale ambiguity, our proxy depth integration provides metric-scale reconstruction without additional sensors.

Author Contributions

Conceptualization, M.Y. and F.W.; Methodology, M.Y., S.G. and F.W.; Software, M.Y. and S.G.; Validation, M.Y. and S.G.; Formal analysis, M.Y. and F.W.; Investigation, M.Y. and S.G.; Data curation, S.G.; Writing—original draft, M.Y.; Writing—review & editing, F.W.; Visualization, S.G.; Supervision, F.W.; Project administration, F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [CrossRef]
  2. Grisetti, G.; Stachniss, C.; Burgard, W. Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans. Robot. 2007, 23, 34–46. [Google Scholar] [CrossRef]
  3. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  4. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  5. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 405–421. [Google Scholar]
  6. Li, M.; Zhou, Y.; Jiang, G.; Deng, T.; Wang, Y.; Wang, H. DDN-SLAM: Real-time dense dynamic neural implicit SLAM. arXiv 2024, arXiv:2401.01545. [Google Scholar] [CrossRef]
  7. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  8. Li, M.; He, J.; Wang, Y.; Wang, H. End-to-end RGB-D SLAM with multi-MLPs dense neural implicit representations. IEEE Robot. Autom. Lett. 2023, 8, 7138–7145. [Google Scholar] [CrossRef]
  9. Liu, S.; Deng, T.; Zhou, H.; Li, L.; Wang, H.; Wang, D.; Li, M. MG-SLAM: Structure Gaussian splatting SLAM with Manhattan world hypothesis. IEEE Trans. Autom. Sci. Eng. 2025, 22, 17034–17049. [Google Scholar] [CrossRef]
  10. Li, M.; Chen, W.; Cheng, N.; Xu, J.; Li, D.; Wang, H. GARAD-SLAM: 3D Gaussian splatting for real-time anti-dynamic SLAM. arXiv 2025, arXiv:2502.03228. [Google Scholar]
  11. Zhou, Y.; Guo, Z.; Li, D.; Guan, R.; Ren, Y.; Wang, H.; Li, M. DSOSplat: Monocular 3D Gaussian SLAM with direct tracking. IEEE Sens. J. 2025. [Google Scholar] [CrossRef]
  12. Li, M.; Liu, S.; Deng, T.; Wang, H. DenseSplat: Densifying Gaussian splatting SLAM with neural radiance prior. arXiv 2025, arXiv:2502.09111. [Google Scholar]
  13. Hu, M.; Yin, W.; Zhang, C.; Cai, Z.; Long, X.; Chen, H.; Wang, K.; Yu, G.; Shen, C.; Shen, S. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10579–10596. [Google Scholar] [CrossRef] [PubMed]
  14. Yen-Chen, L.; Florence, P.; Barron, J.T.; Rodriguez, A.; Isola, P.; Lin, T.-Y. iNeRF: Inverting neural radiance fields for pose estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 1323–1330. [Google Scholar]
  15. Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. iMAP: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6229–6238. [Google Scholar]
  16. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural implicit scalable encoding for SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12786–12796. [Google Scholar]
  17. Wang, H.; Wang, J.; Agapito, L. Co-SLAM: Joint coordinate and sparse parametric encodings for neural real-time SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13293–13302. [Google Scholar]
  18. Deng, Z.; Yunus, R.; Deng, Y.; Cheng, J.; Pollefeys, M.; Konukoglu, E. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 82–91. [Google Scholar]
  19. Sandström, E.; Li, Y.; Gool, L.V.; Oswald, M.R. Point-SLAM: Dense neural point cloud-based SLAM. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 18–23 October 2023; pp. 18433–18444. [Google Scholar]
  20. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  21. Matsuki, H.; Murai, R.; Kelly, P.H.J.; Davison, A.J. Gaussian Splatting SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 18039–18048. [Google Scholar]
  22. Keetha, N.; Karhade, J.; Jatavallabhula, K.M.; Yang, G.; Scherer, S.; Ramanan, D.; Luiten, J. SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 21357–21366. [Google Scholar]
  23. Yan, C.; Qu, D.; Xu, D.; Zhao, B.; Wang, Z.; Wang, D.; Li, X. GS-SLAM: Dense visual SLAM with 3D Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–24 June 2024; pp. 19221–19230. [Google Scholar]
  24. Sun, L.C.; Bhatt, N.P.; Liu, J.C.; Fan, Z.; Wang, Z.; Humphreys, T.E. MM3DGS SLAM: Multi-modal 3D Gaussian splatting for SLAM using vision, depth, and inertial measurements. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–24 June 2024; pp. 23403–23413. [Google Scholar]
  25. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  26. Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.; Davison, A.J. SLAM++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 1352–1359. [Google Scholar]
  27. Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An open-source library for real-time metric-semantic localization and mapping. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1689–1696. [Google Scholar]
  28. Zhu, S.; Wang, G.; Blum, H.; Liu, J.; Song, L.; Pollefeys, M.; Wang, H. SNI-SLAM: Semantic neural implicit SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 21167–21177. [Google Scholar]
  29. Li, M.; Liu, S.; Zhou, H.; Zhu, G.; Cheng, N.; Deng, T.; Wang, H. SGS-SLAM: Semantic Gaussian splatting for neural dense SLAM. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 168–185. [Google Scholar]
  30. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  31. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment—A modern synthesis. In Vision Algorithms: Theory and Practice; Springer: Berlin/Heidelberg, Germany, 2000; pp. 298–372. [Google Scholar]
  32. Dönmez, A.; Köseoğlu, B.; Araç, M.; Günel, S. vemb-slam: An efficient embedded monocular SLAM framework for 3D mapping and semantic segmentation. In Proceedings of the 2025 7th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (ICHORA), Ankara, Türkiye, 23–24 May 2025; pp. 1–10. [Google Scholar]
  33. Wang, H.; Yang, M.; Zheng, N. G2-MonoDepth: A general framework of generalized depth inference from monocular RGB+X data. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3753–3771. [Google Scholar] [CrossRef] [PubMed]
  34. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 27380–27400. [Google Scholar]
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  36. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The replica dataset: A digital replica of indoor spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar] [CrossRef]
  37. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  38. Horé, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  39. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  40. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  41. Everingham, M.; Gool, L.V.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
Figure 1. Overview of the MSGS-SLAM framework. The system integrates monocular RGB input with proxy depth estimation (Metric3D-V2) and semantic understanding (DINOv2) through two main processes: tracking (top) for camera pose optimization and mapping (bottom) for 4D semantic Gaussian scene reconstruction. Multiple loss functions (L_tracking, L_mapping, L_edge, L_img, L_local) jointly optimize the system for robust monocular semantic SLAM performance.
Figure 2. Visual evaluation of reconstruction quality between our MSGS-SLAM approach and baseline methods across multiple scenarios from the Replica dataset.
Figure 3. Segmentation and depth estimation results of MSGS-SLAM in the Replica dataset.
Figure 4. Visual evaluation of reconstruction quality between our MSGS-SLAM approach and baseline methods across multiple scenarios from the ScanNet dataset.
Table 1. Quantitative comparison of our method and the baselines in training view rendering on the Replica dataset. Bold values indicate the best performance and underlined values indicate the second-best performance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Methods | Metrics | Avg. | Room0 | Room1 | Room2 | Office0 | Office1 | Office2 | Office3
SplaTAM | PSNR ↑ | 33.52 | 31.72 | 33.12 | 34.34 | 37.33 | 38.11 | 31.06 | 28.98
SplaTAM | SSIM ↑ | 0.953 | 0.952 | 0.952 | 0.965 | 0.961 | 0.963 | 0.946 | 0.931
SplaTAM | LPIPS ↓ | 0.117 | 0.093 | 0.123 | 0.095 | 0.109 | 0.121 | 0.129 | 0.151
SplaTAM | Depth L1 ↓ | 0.533 | 0.485 | 0.542 | 0.501 | 0.518 | 0.555 | 0.567 | 0.563
SGS-SLAM | PSNR ↑ | 34.12 | 31.81 | 33.56 | 34.38 | 37.60 | 38.22 | 31.98 | 31.29
SGS-SLAM | SSIM ↑ | 0.954 | 0.954 | 0.976 | 0.958 | 0.961 | 0.957 | 0.948 | 0.947
SGS-SLAM | LPIPS ↓ | 0.111 | 0.088 | 0.117 | 0.087 | 0.107 | 0.109 | 0.126 | 0.143
SGS-SLAM | Depth L1 ↓ | 0.359 | 0.325 | 0.315 | 0.385 | 0.331 | 0.394 | 0.375 | 0.388
MonoGS | PSNR ↑ | 31.27 | 29.05 | 30.95 | 31.55 | 34.35 | 34.67 | 29.73 | 28.59
MonoGS | SSIM ↑ | 0.910 | 0.895 | 0.908 | 0.915 | 0.921 | 0.925 | 0.902 | 0.904
MonoGS | LPIPS ↓ | 0.208 | 0.185 | 0.201 | 0.195 | 0.224 | 0.235 | 0.198 | 0.217
MonoGS | Depth L1 ↓ | 27.23 | 24.85 | 26.12 | 25.78 | 29.45 | 31.20 | 26.90 | 26.33
SNI-SLAM | PSNR ↑ | 28.97 | 25.18 | 27.95 | 28.94 | 33.89 | 30.05 | 28.25 | 28.53
SNI-SLAM | SSIM ↑ | 0.928 | 0.877 | 0.905 | 0.935 | 0.962 | 0.925 | 0.945 | 0.945
SNI-SLAM | LPIPS ↓ | 0.343 | 0.395 | 0.378 | 0.335 | 0.275 | 0.330 | 0.338 | 0.352
SNI-SLAM | Depth L1 ↓ | 1.167 | 1.123 | 1.386 | 1.247 | 1.095 | 0.874 | 1.152 | 1.092
Ours | PSNR ↑ | 34.48 | 31.75 | 33.85 | 34.65 | 37.85 | 38.18 | 32.25 | 31.60
Ours | SSIM ↑ | 0.957 | 0.955 | 0.954 | 0.968 | 0.964 | 0.966 | 0.947 | 0.954
Ours | LPIPS ↓ | 0.108 | 0.085 | 0.114 | 0.090 | 0.104 | 0.106 | 0.123 | 0.144
Ours | Depth L1 ↓ | 0.345 | 0.314 | 0.306 | 0.369 | 0.415 | 0.386 | 0.365 | 0.380
Table 2. Quantitative comparison in terms of ATE and FPS between our method and the baselines on the Replica dataset. The values represent the average outcomes across eight scenes. Bold values indicate the best performance and underlined values indicate the second-best performance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Methods | ATE Mean [cm] ↓ | ATE RMSE [cm] ↓ | Track. FPS [f/s] ↑ | Map. FPS [f/s] ↑ | SLAM FPS [f/s] ↑
SplaTAM | 0.349 | 0.451 | 5.47 | 3.94 | 2.13
SGS-SLAM | 0.328 | 0.416 | 5.20 | 3.53 | 2.11
MonoGS | 12.823 | 14.533 | 0.85 | 0.62 | 0.51
SNI-SLAM | 0.516 | 0.631 | 17.25 | 3.60 | 2.94
Ours | 0.319 | 0.408 | 4.58 | 3.26 | 2.05
Table 3. Quantitative comparison of our method against existing semantic SLAM methods on the Replica dataset. Bold values indicate the best performance and underlined values indicate the second-best performance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Methods | Avg. mIoU ↑ | Room0 [%] ↑ | Room1 [%] ↑ | Room2 [%] ↑ | Office0 [%] ↑
SGS-SLAM | 91.88 | 92.02 | 92.17 | 90.94 | 92.25
SNI-SLAM | 88.34 | 89.35 | 88.21 | 87.10 | 88.63
Ours | 92.68 | 92.58 | 92.74 | 91.89 | 93.51
Table 4. Quantitative comparison of our method and the baselines in training view rendering on the ScanNet dataset. Bold indicates the best performance and underlined indicates the second-best performance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Methods | Metrics | Avg. | 0000 | 0059 | 0106 | 0169 | 0181 | 0207
SplaTAM | PSNR ↑ | 18.64 | 18.92 | 18.59 | 17.15 | 21.62 | 16.29 | 19.20
SplaTAM | SSIM ↑ | 0.693 | 0.637 | 0.764 | 0.674 | 0.752 | 0.660 | 0.673
SplaTAM | LPIPS ↓ | 0.429 | 0.505 | 0.361 | 0.454 | 0.357 | 0.495 | 0.410
SplaTAM | Depth L1 ↓ | 11.33 | 11.45 | 9.87 | 12.56 | 9.23 | 13.78 | 11.09
SGS-SLAM | PSNR ↑ | 18.96 | 19.06 | 18.45 | 17.98 | 19.57 | 18.82 | 19.68
SGS-SLAM | SSIM ↑ | 0.726 | 0.721 | 0.704 | 0.678 | 0.751 | 0.728 | 0.774
SGS-SLAM | LPIPS ↓ | 0.383 | 0.390 | 0.407 | 0.439 | 0.352 | 0.386 | 0.324
SGS-SLAM | Depth L1 ↓ | 6.48 | 6.16 | 7.12 | 7.89 | 5.84 | 6.55 | 5.32
MonoGS | PSNR ↑ | 16.83 | 17.24 | 15.89 | 15.67 | 17.48 | 16.92 | 17.78
MonoGS | SSIM ↑ | 0.634 | 0.645 | 0.612 | 0.598 | 0.661 | 0.638 | 0.683
MonoGS | LPIPS ↓ | 0.587 | 0.598 | 0.623 | 0.641 | 0.559 | 0.587 | 0.513
MonoGS | Depth L1 ↓ | 18.76 | 19.84 | 21.35 | 22.67 | 17.23 | 18.94 | 12.53
SNI-SLAM | PSNR ↑ | 17.86 | 17.42 | 16.89 | 16.73 | 18.35 | 17.94 | 19.83
SNI-SLAM | SSIM ↑ | 0.658 | 0.641 | 0.625 | 0.614 | 0.686 | 0.663 | 0.719
SNI-SLAM | LPIPS ↓ | 0.512 | 0.528 | 0.551 | 0.567 | 0.489 | 0.514 | 0.423
SNI-SLAM | Depth L1 ↓ | 7.34 | 7.85 | 8.42 | 9.16 | 6.78 | 7.23 | 4.60
Ours | PSNR ↑ | 19.48 | 19.13 | 19.05 | 18.76 | 21.06 | 19.55 | 19.33
Ours | SSIM ↑ | 0.738 | 0.728 | 0.725 | 0.712 | 0.748 | 0.732 | 0.775
Ours | LPIPS ↓ | 0.368 | 0.378 | 0.359 | 0.424 | 0.342 | 0.373 | 0.318
Ours | Depth L1 ↓ | 6.25 | 6.12 | 6.95 | 7.75 | 5.72 | 6.45 | 4.48
Table 5. Quantitative comparison in terms of ATE and FPS between our method and the baselines on the ScanNet dataset. The values represent the average outcomes across six scenes. Bold indicates the best performance and underlined indicates the second-best performance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Methods | ATE Mean [cm] ↓ | ATE RMSE [cm] ↓ | Track. FPS [f/s] ↑ | Map. FPS [f/s] ↑ | SLAM FPS [f/s] ↑
SplaTAM | 8.23 | 9.87 | 4.85 | 3.21 | 1.98
SGS-SLAM | 9.54 | 11.37 | 4.67 | 3.05 | 1.89
MonoGS | 18.75 | 21.42 | 0.73 | 0.54 | 0.46
SNI-SLAM | 12.84 | 15.67 | 14.32 | 3.12 | 2.58
Ours | 8.05 | 9.65 | 4.95 | 3.15 | 2.08
Table 6. Quantitative comparison of our method against existing semantic SLAM methods on the ScanNet dataset. Bold indicates the best performance and underlined indicates the second-best performance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Methods | Avg. mIoU ↑ | 0000 [%] ↑ | 0059 [%] ↑ | 0106 [%] ↑ | 0169 [%] ↑
SGS-SLAM | 68.52 | 69.45 | 67.89 | 68.21 | 68.53
SNI-SLAM | 64.89 | 65.85 | 63.68 | 64.31 | 65.72
Ours | 69.15 | 69.85 | 68.37 | 68.94 | 69.18
Table 7. Evaluation of reconstruction without depth guidance. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Configuration | Mapping (ms) ↓ | PSNR (dB) ↑ | Depth L1 ↓
W/o proxy depth guidance | 28.4 | 18.07 | 8.34
W/ proxy depth guidance | 31.7 | 19.13 | 6.12
Table 8. Evaluation of reconstruction without semantic loss. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Configuration | Mapping (ms) ↓ | PSNR (dB) ↑ | mIoU (%) ↑ | SSIM ↑
W/o semantic loss | 26.8 | 17.42 | 61.85 | 0.665
W/ semantic loss | 31.7 | 19.13 | 69.85 | 0.728
Table 9. Evaluation of tracking without loop closure. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Configuration | ATE RMSE (cm) ↓ | Tracking FPS ↑ | PSNR (dB) ↑ | Global Consistency ↑
W/o loop closure | 12.34 | 5.28 | 18.25 | 0.742
W/ semantic loop closure | 11.15 | 5.08 | 19.13 | 0.824
Table 10. Evaluation of reconstruction without depth loss. ↑ and ↓ indicate whether higher or lower values represent better performance, respectively.
Configuration | Mapping (ms) ↓ | PSNR (dB) ↑ | Depth L1 ↓ | LPIPS ↓
W/o depth loss | 29.5 | 17.85 | 7.94 | 0.425
W/o color & depth loss | 24.1 | 16.93 | 9.17 | 0.468
W/ full loss (Ours) | 31.7 | 19.13 | 6.12 | 0.378
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
