Article

Sem-SLAM: Semantic-Integrated SLAM Approach for 3D Reconstruction

Key Laboratory of IoT Monitoring and Early Warning, Ministry of Emergency Management, School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7881; https://doi.org/10.3390/app15147881
Submission received: 24 June 2025 / Revised: 9 July 2025 / Accepted: 12 July 2025 / Published: 15 July 2025
(This article belongs to the Special Issue Applications of Data Science and Artificial Intelligence)

Abstract

Under the upsurge of research on the integration of Simultaneous Localization and Mapping (SLAM) and neural implicit representation, existing methods exhibit obvious limitations in environmental semantic parsing and scene understanding. To address this, this paper proposes a SLAM system that integrates a full attention mechanism and a multi-scale information extractor. The system constructs a more accurate 3D environmental model by fusing semantic, shape, and geometric orientation features. Meanwhile, to deeply mine the semantic information in images, a pre-trained, frozen 2D segmentation algorithm is employed to extract semantic features, providing powerful support for 3D environmental reconstruction. Furthermore, a multi-layer perceptron and interpolation techniques are utilized to extract multi-scale features, distinguishing information at different scales. This enables the effective decoding of semantic, RGB, and Truncated Signed Distance Field (TSDF) values from the fused features, achieving high-quality information rendering. Experimental results demonstrate that this method significantly outperforms baseline methods in mapping and tracking accuracy on the Replica and ScanNet datasets. It also shows superior performance in semantic segmentation and real-time semantic mapping tasks, offering a new direction for the development of SLAM technology.

1. Introduction

Traditional Simultaneous Localization and Mapping (SLAM) techniques predominantly generate maps with geometric information. However, their deficiency in environmental semantic comprehension restricts their extensive deployment in intricate scenarios. In recent times, propelled by the rapid advancements in deep learning technologies, semantic SLAM has emerged as a focal area of research. Semantic SLAM [1], an advanced technology integrating semantic understanding and map building, is designed to augment a robot’s localization and navigation capabilities within uncharted and complex environments. By fusing deep learning techniques with SLAM, this approach enables the effective segmentation and precise recognition of environmental semantic information. As a result, it furnishes more comprehensive and accurate data for robot localization and navigation in complex settings.
Semantic segmentation is a fundamental task in computer vision that aims to classify each pixel in an image into specific semantic categories, such as objects, scenes, or parts of objects. This technique bridges the gap between low-level image features and high-level semantic understanding, enabling machines to perceive the environment in a more human-like manner. In the context of this research, semantic segmentation plays a crucial role in the SLAM system. By integrating semantic segmentation with SLAM, our system can not only build geometric maps of the environment but also understand the semantic information of different objects and scenes. This semantic information is vital for complex tasks such as disaster rescue, where identifying obstacles, trapped individuals, and other critical elements accurately can significantly improve the efficiency and safety of rescue operations.
In the critical domain of disaster rescue, SLAM technology plays a pivotal role. It empowers rescue personnel to expeditiously and precisely acquire information about the disaster site, thereby providing invaluable support for rescue operations [2]. Notably, this technology incorporates environmental semantic information into the map construction process. It achieves the accurate segmentation of crucial semantic elements such as obstacles in rescue corridors and trapped individuals, facilitating the generation of maps replete with rich semantic content. These maps not only delineate the physical space layout but also present rescue workers with a more intuitive and detailed perspective of the disaster scene. Consequently, the efficiency and safety of rescue operations are substantially enhanced. Rescue robots can execute precise navigation, effectively circumvent obstacles, and choose the optimal route to reach the disaster-stricken area.
Neural Radiance Fields (NeRF), a neural network-based implicit representation technology, has emerged as a revolutionary approach in the field of SLAM [3]. At its core, NeRF represents a 3D scene as a continuous volumetric field, where neural networks learn to map 3D coordinates to view-dependent radiance and density values. This enables the technology to capture intricate lighting and geometric details with high fidelity. In the context of SLAM, NeRF has demonstrated remarkable potential in achieving full-density optimization and the joint learning of geometry and semantics, thus providing more comprehensive environmental understanding [4].
Nevertheless, such methods are plagued by several drawbacks, including high computational costs, substantial resource consumption, real-time challenges, reliance on high-quality data, and limited pose estimation accuracy. These limitations impede their widespread application in scenarios with stringent real-time requirements, such as disaster rescue. Notably, the integration of NeRF and SLAM technologies can utilize per-frame image data to construct high-quality 3D scene models and achieve precise spatial perception, 3D reconstruction, and dynamic localization. The combination of these two technologies not only enables real-time updates of the environmental model but also significantly enhances the SLAM system’s understanding of the environment by providing semantic information such as object categories and locations [5].
In contrast to existing methods, most of which struggle with high computational costs, resource consumption, and real-time performance issues, this paper fully exploits the advantages of NeRF and SLAM technologies. By leveraging cross-attention and self-attention mechanisms for information extraction, we construct a 3D scene reconstruction system with better real-time performance.
Unlike other methods that lack comprehensive information extraction, we propose an innovative multi-scale information extraction scheme. Using multi-layer perceptrons as fundamental units, we build a multi-scale feature extractor to obtain semantic, RGB, and Truncated Signed Distance Field (TSDF) values, ultimately realizing information decoding through rendering techniques. This enables a more detailed and accurate understanding of the environment.
Existing NeRF-based SLAM approaches often have limitations in semantic feature fusion. In response, we implement a full-attention cross-modal feature fusion method, which effectively integrates geometric, shape, and semantic features.
Moreover, many current methods do not use a well-designed loss function system. We employ color loss, appearance loss, feature loss, and depth loss to guide the optimization of the network, leading to better mapping accuracy and object recognition. Extensive experiments on the Replica and ScanNet datasets demonstrate that our method surpasses existing NeRF-based SLAM approaches in mapping quality while exhibiting exceptional real-time semantic mapping capabilities. To address the limitations of existing methods, we summarize the key contributions as follows:
  • We propose a NeRF-based visual semantic SLAM system. To accurately construct a semantic map for mapping, we introduce a multi-scale information extractor to extract both coarse-grained and fine-grained information from the camera pose. This enables a more comprehensive and detailed understanding of the spatial information related to the camera, laying a solid foundation for precise mapping.
  • We implement a full-attention cross-modal feature fusion method to integrate geometric, shape, and semantic features based on cross-attention and self-attention. This innovative structure can fully utilize the functions of features at various levels. By effectively combining different types of features, it enhances the system’s ability to perceive and analyze the environment, resulting in more accurate and meaningful semantic representations.
  • Additionally, we employ color loss, appearance loss, feature loss, and depth loss to guide the optimization of the network. These carefully designed loss functions play a crucial role in adjusting the network parameters, helping the system to converge towards better solutions. As a result, we can obtain more optimal results in terms of mapping accuracy, object recognition, and overall system performance.
  • We conduct extensive evaluations on two challenging datasets, Replica [6] and ScanNet [7]. The experimental results clearly demonstrate that, compared with existing NeRF-based SLAM methods, our method exhibits state-of-the-art performance in mapping, tracking, and semantic segmentation. It not only shows higher accuracy in mapping the environment but also achieves more precise object tracking and semantic segmentation, highlighting the superiority and practical value of our proposed Sem-SLAM.
The remainder of this paper is organized as follows: In Section 2, we review typical methods for NeRF and Semantic SLAM. In Section 3, we present our proposed method, including the forward feature fusion network and the backward semantic mapping network. Section 4 describes the experimental settings and results, including the datasets, evaluation metrics, and qualitative and quantitative comparisons with existing methods. Finally, in Section 5, we summarize the paper and discuss future work.

2. Related Work

Implicit Neural Representation
The application of implicit neural representations in SLAM has emerged as a prominent research area in recent years. These methods leverage deep learning techniques to represent environmental map information. They are capable of providing high-fidelity dense maps and have demonstrated promising results in indoor environments [8,9]. However, such approaches also confront several challenges, such as issues regarding robustness in complex or dynamic scenarios [10]. To address these problems, researchers have proposed a series of improvement methods to enhance the performance of implicit neural SLAM systems:
To better handle RGB images without depth information, a design of hierarchical feature volumes has been proposed to effectively fuse shape cues at different scales [11]. This approach enables more comprehensive utilization of visual information, facilitating more accurate map construction. The DF-SLAM system represents scenes using dictionary factors. It encodes geometric and appearance information as a combination of basis and coefficient factors, thereby enhancing the ability to reconstruct details [12]. This innovative encoding method improves the precision of scene reconstruction, especially in capturing fine-grained features. NICE-SLAM [13] exploits local information to improve the reconstruction effect in large-scale scenes. By focusing on local details and integrating them into the overall reconstruction, it achieves better results in complex and extensive environments. The GO-SLAM [14] framework emphasizes the consistency of global optimization for pose estimation and 3D reconstruction. It supports efficient loop-closure detection and online full bundle adjustment, resolving the interference of dynamic environments on reconstruction. This framework effectively addresses the challenges posed by dynamic elements in the environment, ensuring more stable and accurate SLAM performance. Additionally, to enhance the system’s robustness, a SLAM system based on event data has been proposed [15]. This system can maintain stability under complex conditions such as illumination changes and motion blur. The use of event-based data provides an alternative and more robust way to perceive the environment, expanding the applicability of SLAM systems in challenging scenarios.
Notwithstanding the remarkable progress achieved, the existing neural implicit SLAM methods still have certain inadequacies. Currently, most methods primarily rely on color information within images, neglecting other crucial environmental features. This narrow focus limits their comprehensive understanding and accurate representation of the environment.
Although NeRF-SLAM research in 2024–2025 appears to yield fruitful results, it still faces critical bottlenecks in core technologies, remaining far from practical application. In terms of computational efficiency, most methods require several seconds to process each frame even with high-end GPUs, making them impractical for real-time applications such as autonomous driving and augmented reality. The limited scene generalization capability of RGB-D-based algorithms is another major weakness—every transition between environments (e.g., indoor to outdoor, daylight to nighttime) demands model retraining, with prohibitive time and computational costs that hinder adaptation to dynamic real-world conditions.
Hardware compatibility is equally problematic. The heavy demand for computing power restricts these algorithms to large-scale servers, rendering deployment on edge devices virtually impossible and severely limiting their applicability. Dynamic scene handling remains a critical weakness for such methods: rapidly moving objects or sudden environmental changes can effectively blind the system, generating extensive visual artifacts that directly compromise localization and mapping accuracy.
The scientific rigor of evaluation systems is also questionable. Heavy reliance on synthetic data in numerous studies exacerbates the domain gap when models encounter real-world complexity, exposing the fragility of system robustness.
Semantic SLAM
The integration of semantic SLAM and NeRF aims to enhance the quality of scene understanding and reconstruction by fusing semantic information. In the context of SLAM in dynamic environments, which confront challenges such as tracking drift and mapping errors, DDN-SLAM [16] has pioneered the introduction of semantic features, giving rise to the first real-time dense dynamic neural implicit SLAM system. By integrating semantic information, this system effectively addresses the challenges in complex dynamic scenarios, significantly improving the accuracy and stability of localization and mapping. It thus sets a new milestone for the development of SLAM technology.
Concurrently, the SNI-SLAM [17] approach takes an alternative route and proposes a semantic SLAM framework based on neural implicit representation. This system not only achieves high-precision semantic mapping and high-quality surface reconstruction but also, through the introduction of a hierarchical semantic representation strategy, enables top-down structured semantic mapping. As a result, it ensures the robustness and accuracy of camera tracking. This innovation not only enriches the theoretical system of semantic SLAM but also provides a powerful technical support for practical applications.
In the pursuit of efficient and accurate SLAM solutions, the EC-SLAM [18] system represents maps using sparse parameter encoding and Truncated Signed Distance Fields (TSDF), thereby achieving enhanced efficiency and accelerated convergence. This approach not only ensures map accuracy but also significantly reduces computational costs, rendering real-time applications of SLAM technology feasible. To address the deficiencies in the tracking performance of NeRF-SLAM systems, SLAIM [19] proposes a novel coarse-to-fine tracking model. By optimizing the tracking strategy, this model effectively improves the tracking performance of NeRF-SLAM, marking a new breakthrough in SLAM technology based on neural radiance fields.
Furthermore, the iDF-SLAM [8] system combines a feature-based deep neural tracker with a NeRF-like neural implicit mapper, enabling a tight integration of front-end tracking and back-end mapping. This system can learn and utilize scene-specific features for camera tracking, thus maintaining stable localization and mapping capabilities in complex environments. Nevertheless, challenges remain, such as how to more precisely interpret and fuse semantic information, and how to achieve high-quality environmental reconstruction. In response to these issues, the CG-SLAM [20] system presents an innovative approach based on uncertainty-aware 3D Gaussian fields. Through in-depth analysis of 3D Gaussian splatting techniques, this method constructs consistent and stable 3D Gaussian fields, providing a solid foundation for tracking and mapping.
In conclusion, this paper proposes to combine the cross-modal full cross-attention mechanism with the multi-scale information fusion reconstruction paradigm. By deeply fusing semantic, shape, and geometric features, the performance of the neural implicit SLAM system is enhanced. This approach enables the system to capture a more complete set of environmental characteristics, thereby improving its robustness and accuracy in complex scenarios.

3. Methodology

We propose an innovative cross-modal attention mechanism that not only integrates geometric, shape, and semantic features but also significantly enhances feature representation through a hybrid architecture combining cross-attention and self-attention mechanisms. Unlike frameworks such as SNI-SLAM, which predominantly rely on single-type feature fusion, our approach incorporates multi-scale feature extraction and full-attention mechanisms, enabling more comprehensive capture of hierarchical environmental information. Specifically, the cross-attention mechanism facilitates deep interaction between heterogeneous feature modalities, while the self-attention mechanism emphasizes critical elements within each feature type, endowing the system with superior adaptability and robustness in complex scenarios.
As Figure 1 describes, our approach conducts dense semantic modeling of scenes using consecutive RGBD frames. This section primarily encompasses three main components: forward feature fusion, inverse semantic mapping, and loss functions. In the feature fusion component, we elaborate on how to fuse appearance, semantic, and shape features through cross-attention and self-attention mechanisms. Cross-attention enables the exchange of information across different feature modalities, while self-attention helps in highlighting the significant elements within each feature type. By combining these two attention mechanisms, we can effectively integrate diverse features, enhancing the representational power of the data. The inverse semantic mapping component details how to represent multi-semantic features through a hierarchical approach. This hierarchical structure allows for a more organized and comprehensive handling of semantic information. Specifically, it facilitates semantic modeling, where the system can identify and classify different objects and their relationships in the scene. Additionally, it is utilized for color prediction, enabling the system to estimate the color of objects based on the learned semantic and geometric information. Moreover, depth estimation is also achieved through this hierarchical semantic representation, as the system can infer the distance of objects from the camera, contributing to a more complete 3D understanding of the scene.
Attention mechanisms are powerful tools in machine learning, assigning weights to input elements to prioritize critical features and improve performance. In SLAM, the complex environment with varying lighting and geometric features benefits from attention mechanisms, which can capture these details effectively.
The cross-attention mechanism enables deep interaction between heterogeneous feature modalities, integrating complementary information. The self-attention mechanism highlights key elements within each feature type through adaptive weights. Empirical studies, like in automated fault diagnosis of air-handling units, prove the effectiveness of attention-based approaches [21]. We apply attention mechanisms to SLAM to enhance environmental perception and scene reconstruction.

3.1. Forward Feature Fusion

In the context of this research, the Forward Feature Fusion method is adopted due to its strong relevance and multiple advantages. In 3D scene reconstruction and semantic understanding tasks, different types of features (geometric, semantic, and appearance features) carry complementary information. However, traditional methods often struggle to fully exploit the potential of these features, resulting in limited model performance.
The Forward Feature Fusion method proposed in this paper addresses these limitations. It fuses geometric, semantic, and appearance features through cross-attention and self-attention mechanisms, enabling the model to capture the inherent relationships among different features. This not only improves the feature representation ability but also makes the model more adaptable to complex scenes. Moreover, by integrating pre-trained semantic segmentation networks with attention mechanisms [22], we can further enhance the fusion effect of different features, achieving better performance in 3D scene reconstruction and semantic understanding.
Subsequently, we obtain appearance features via a Multi-Layer Perceptron (MLP) network. Regarding geometric features, we process information such as rays using an MLP network to acquire them. By performing calculations with cross-attention and self-attention, we obtain the fused features. This approach effectively combines the complementary information of different feature types, enabling the model to better understand and represent the scene in the reconstruction process.
In the network we proposed, the cross-attention module plays a crucial role. It skillfully merges geometric features, appearance features, and semantic features, fully exploring and leveraging the respective advantages of these three types of features. Specifically, this module effectively integrates different types of features through specific operation mechanisms. During this process, the output of the cross-attention module is directly used as the input of the self-attention network. The self-attention module is designed with the original intention of further highlighting the key and important information within these three types of features. It can adaptively assign attention weights based on the internal correlations of the features themselves, so that the important feature information can be strengthened. After being processed by the self-attention module, we deeply fuse these three types of features and finally generate a feature map that contains rich information. This feature map integrates information from multiple aspects such as geometry, appearance, and semantics, providing a more comprehensive and representative input for subsequent related tasks.
While pretrained 2D semantic segmentation networks have been widely adopted, our key innovation lies in their organic integration with cross-attention and full-attention mechanisms. The pretrained networks provide reliable semantic priors, whose fusion effectiveness with other features (e.g., geometric and shape features) is further enhanced through self-attention and full-attention mechanisms. This groundbreaking integration has yielded significant improvements in 3D scene reconstruction and semantic understanding-achievements unattainable by traditional approaches relying solely on pretrained networks.
Cross-Attention
In this paper, we adopt a fully cross-attention mechanism that encompasses all three types of features. This design aims to explore the interconnections among geometric features $f_{geo}$, shape features $f_{shape}$, and semantic features $f_{sem}$, thereby enhancing feature expression and model performance.
In the fully cross-attention mechanism, we compute attention scores for each pair of feature types. For example, considering the pair of geometric and shape features, the score is given by $\frac{f_{geo} f_{shape}^{\top}}{\|f_{shape}\|_2^2}$. Here, $f_{geo} f_{shape}^{\top}$ measures the similarity between the feature vectors, and $\|f_{shape}\|_2^2$ serves as a scaling factor to prevent gradient vanishing during softmax computation. Next, we apply the softmax function to normalize the scores, resulting in a distribution of attention weights. For instance, $\mathrm{softmax}\!\left(\frac{f_{geo} f_{shape}^{\top}}{\|f_{shape}\|_2^2}\right)$ yields the attention weights of geometric features relative to shape features. Finally, we multiply these weights with the third feature type, such as $\mathrm{softmax}\!\left(\frac{f_{geo} f_{shape}^{\top}}{\|f_{shape}\|_2^2}\right) f_{sem}$, to obtain the cross-attention output $T_{sem}$, which allows semantic features to incorporate information from both geometric and shape features. Similarly, we can compute $T_{geo}$ and $T_{shape}$, completing the comprehensive cross-fusion of all three feature types:
$T_{sem} = \mathrm{softmax}\!\left(\frac{f_{geo} f_{shape}^{\top}}{\|f_{shape}\|_2^2}\right) f_{sem}$
$T_{geo} = \mathrm{softmax}\!\left(\frac{f_{sem} f_{shape}^{\top}}{\|f_{shape}\|_2^2}\right) f_{geo}$
$T_{shape} = \mathrm{softmax}\!\left(\frac{f_{geo} f_{sem}^{\top}}{\|f_{sem}\|_2^2}\right) f_{shape}$
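To make the computation concrete, the following PyTorch sketch implements the three cross-attention formulas above literally. The tensor shapes, the reading of $\|f_{shape}\|_2^2$ as a single scalar scaling factor, and the helper name cross_modal_attention are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(f_a, f_b, f_c):
    """Weight f_c by the similarity between f_a and f_b.

    Implements softmax(f_a f_b^T / ||f_b||_2^2) f_c. Here f_a, f_b, f_c are
    assumed to be (N, d) feature matrices sampled at the same N points; the
    text does not say whether the norm is global or per-row, so a single
    scalar norm is used for simplicity.
    """
    scale = f_b.pow(2).sum()                            # squared L2 norm of f_b
    weights = F.softmax(f_a @ f_b.t() / scale, dim=-1)  # (N, N) attention weights
    return weights @ f_c

# Toy shapes for illustration only.
N, d = 64, 32
f_geo, f_shape, f_sem = (torch.randn(N, d) for _ in range(3))

T_sem   = cross_modal_attention(f_geo, f_shape, f_sem)
T_geo   = cross_modal_attention(f_sem, f_shape, f_geo)
T_shape = cross_modal_attention(f_geo, f_sem, f_shape)
```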
Self-Attention
The output of the cross-attention module is then fed into the self-attention module. This module is designed to further explore the internal dependencies among the features. For the output $T_{geo}$ from the cross-attention module, we obtain queries $Q_{T_{geo}} = T_{geo} W^{Q}_{T_{geo}}$, keys $K_{T_{geo}} = T_{geo} W^{K}_{T_{geo}}$, and values $V_{T_{geo}} = T_{geo} W^{V}_{T_{geo}}$ through linear projections, where $W^{Q}_{T_{geo}}$, $W^{K}_{T_{geo}}$, and $W^{V}_{T_{geo}}$ are learnable weight matrices. The self-attention output is given by:
$O^{SA}_{T_{geo}} = \mathrm{softmax}\!\left(\frac{Q_{T_{geo}} K_{T_{geo}}^{\top}}{\sqrt{d}}\right) V_{T_{geo}}$
where $d$ is the feature dimension, $\frac{Q_{T_{geo}} K_{T_{geo}}^{\top}}{\sqrt{d}}$ computes the attention scores, and the softmax function normalizes them to obtain the attention weights. These weights are then multiplied with the value vectors to produce the self-attention output. Similarly, we can derive $O^{SA}_{T_{sem}}$ and $O^{SA}_{T_{shape}}$:
$O^{SA}_{T_{sem}} = \mathrm{softmax}\!\left(\frac{Q_{T_{sem}} K_{T_{sem}}^{\top}}{\sqrt{d}}\right) V_{T_{sem}}$
$O^{SA}_{T_{shape}} = \mathrm{softmax}\!\left(\frac{Q_{T_{shape}} K_{T_{shape}}^{\top}}{\sqrt{d}}\right) V_{T_{shape}}$
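A minimal single-head PyTorch version of this self-attention step is sketched below; the bias-free linear projections and single-head layout are illustrative assumptions, since the paper does not specify head count or layer details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention applied to one cross-attention output T."""

    def __init__(self, d):
        super().__init__()
        # Learnable projection matrices W^Q, W^K, W^V from the text.
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, T):                          # T: (N, d)
        Q, K, V = self.W_Q(T), self.W_K(T), self.W_V(T)
        scores = Q @ K.t() / self.scale            # scaled dot-product scores
        return F.softmax(scores, dim=-1) @ V       # O^SA

# One module per fused modality (T_geo, T_sem, T_shape).
sa_geo = SelfAttention(d=32)
O_SA_geo = sa_geo(torch.randn(64, 32))
```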

3.2. Backward Semantic Mapping

There are usually two common rendering methods in existing models. One approach [23] employs independent rendering networks to process different features. The other method [24] utilizes a decoder network to obtain geometric and color information from a single feature. However, neither of these two methods takes into account the information of different scales covered in the images.
In our work, we integrate the multi-scale mesh feature extraction method into the rendering process and combine it with a multi-layer perceptron (MLP) to obtain Signed Distance Function (SDF), RGB, and semantic values from geometric, appearance, and semantic features. It has been found that the introduction of multi-scale mesh feature extraction improves the performance of implicit semantic modeling and provides a more meticulous and rich semantic understanding.
The multi-scale extractor addresses the limitations of traditional NeRF-SLAM methods in handling multi-scale information in complex environments. Specifically, previous NeRF-SLAM methods were usually limited to single-scale information processing and performed poorly when dealing with large scenes or areas rich in details. In contrast, our multi-scale extractor significantly improves the accuracy of environmental understanding and reconstruction by simultaneously extracting coarse-grained and fine-grained information. For example, coarse-grained information helps to grasp the overall scene structure (e.g., room layout), while fine-grained information can capture local detailed features (e.g., object edges). This multi-scale processing capability enables the system to obtain more accurate environmental representations at various scales, thereby enhancing the quality of 3D reconstruction and the accuracy of semantic understanding.
In this paper, we define multi-scale discrete feature meshes for each feature type (semantic, RGB, geometric). The meshes at each scale have different resolutions, and the corresponding features are extracted from the given spatial coordinates through trilinear interpolation.
Meanwhile, we concatenate the multi-scale features under the same feature type to form a hybrid representation. This concatenation operation can combine features of different scales, thus enhancing the model’s ability to capture both the macroscopic structure and local details. Eventually, we concatenate the semantic, RGB, and geometric features into a global feature vector.
Finally, the global feature is input into the MLP network, which decodes it to generate semantic labels, RGB colors, and geometric densities, respectively. The semantic labels are used to represent the semantic information of the scene, the RGB colors are used to represent the appearance of the scene, and the geometric densities are used to represent the geometric structure of the scene.
We then provide the following mathematical derivation. Let $M^k$ denote the mesh at scale $k$, where $k \in \{1, \ldots, K\}$ and $K$ is the total number of scales. For a given spatial coordinate $p$, the feature $f^k(p)$ at scale $k$ can be obtained through trilinear interpolation:
$f^k(p) = \sum_{i=1}^{8} w_i^k(p) \cdot f_{v_i^k}$
where $w_i^k(p)$ is the interpolation weight of the $i$-th vertex $v_i^k$ of the voxel containing $p$ at scale $k$, and $f_{v_i^k}$ is the feature value at vertex $v_i^k$. The interpolation weights satisfy the following boundary conditions:
$\sum_{i=1}^{8} w_i^k(p) = 1, \quad \text{for } p \text{ inside the voxel}$
$w_i^k(p) \geq 0, \quad i = 1, 2, \ldots, 8$
$w_i^k(p) = 1 \ \text{if } p = v_i^k, \quad w_j^k(p) = 0 \ \text{for } j \neq i$
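The sketch below shows one way to realize this per-scale lookup in PyTorch; the (D, H, W, C) grid layout, the (x, y, z) ordering of p, and the function name are assumptions made for illustration. In practice, torch.nn.functional.grid_sample can perform the same trilinear lookup in batch.

```python
import torch

def trilinear_interpolate(grid, p):
    """Interpolate a vertex-feature grid at a continuous coordinate.

    grid: (D, H, W, C) features stored at the voxel vertices of one scale k.
    p:    (3,) coordinate in voxel units, ordered (x, y, z), inside the grid.
    Returns the (C,) interpolated feature; the eight weights sum to 1.
    """
    p0 = p.floor().long()                          # lower corner of the voxel
    t = p - p0.float()                             # fractional offsets in [0, 1)
    feat = torch.zeros(grid.shape[-1])
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                w = (t[0] if dx else 1 - t[0]) \
                  * (t[1] if dy else 1 - t[1]) \
                  * (t[2] if dz else 1 - t[2])     # weight w_i of this vertex
                feat = feat + w * grid[p0[2] + dz, p0[1] + dy, p0[0] + dx]
    return feat

# Example: an 8-channel feature grid at one (hypothetical) scale.
grid_k = torch.randn(16, 16, 16, 8)
f_k = trilinear_interpolate(grid_k, torch.tensor([3.4, 7.1, 9.8]))
```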
We choose trilinear interpolation in the multi-scale feature extractor for several reasons. Here, we provide a comparison with alternative interpolation methods to justify this choice.
Comparison with Nearest-Neighbor Interpolation: Nearest-neighbor interpolation is the simplest method, which assigns the value of the nearest vertex to the interpolation point. While it is computationally efficient, it often results in a blocky or pixelated appearance, especially when the feature mesh has a low resolution. This is because it does not consider the contribution of neighboring vertices, leading to a lack of smoothness in the interpolated features. In contrast, trilinear interpolation uses the values of all eight vertices of the voxel, which can generate smoother and more accurate feature values, thus better preserving the details of the multi-scale features.
Comparison with Bilinear Interpolation: Bilinear interpolation is commonly used in 2D scenarios. It considers the contribution of four neighboring points in a 2D plane. However, in our 3D multi-scale feature extraction task, the spatial information is three-dimensional. Bilinear interpolation cannot fully utilize the 3D spatial information, which may lead to information loss in the depth dimension. Trilinear interpolation, on the other hand, extends the concept of bilinear interpolation to 3D space, taking into account the contributions of all vertices in the 3D voxel. This allows it to better adapt to the 3D nature of our multi-scale features and capture the spatial relationships more comprehensively.
For a specific feature type $t$ (e.g., semantic, RGB, geometric), the hybrid feature $h_t(p)$ is formed by concatenating the multi-scale features:
$h_t(p) = [\, f_t^1(p);\ \ldots;\ f_t^K(p) \,]$
Then, we concatenate the semantic, RGB, and geometric hybrid features into a global feature vector $g(p)$:
$g(p) = [\, h_{sem}(p);\ h_{rgb}(p);\ h_{geo}(p) \,]$
The MLP network $N_{\theta}$ decodes the global feature to generate semantic labels $\mathrm{sem}(p)$, RGB colors $\mathrm{rgb}(p)$, and geometric densities $\sigma_{geo}(p)$:
$\mathrm{sem}(p) = N_{sem}(g(p))$
$\mathrm{rgb}(p) = N_{rgb}(g(p))$
$\sigma_{geo}(p) = N_{geo}(g(p))$
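The decoding stage can be sketched as follows; the hidden width, depth, shared trunk, and the sigmoid on the color head are illustrative choices of ours, since the text only specifies that an MLP maps the global feature to semantic, RGB, and density outputs.

```python
import torch
import torch.nn as nn

class GlobalFeatureDecoder(nn.Module):
    """Decode the concatenated global feature g(p) into sem(p), rgb(p), sigma_geo(p)."""

    def __init__(self, d_sem, d_rgb, d_geo, n_classes, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(d_sem + d_rgb + d_geo, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.head_sem = nn.Linear(hidden, n_classes)   # semantic logits
        self.head_rgb = nn.Linear(hidden, 3)           # RGB color
        self.head_geo = nn.Linear(hidden, 1)           # geometric density

    def forward(self, h_sem, h_rgb, h_geo):
        g = torch.cat([h_sem, h_rgb, h_geo], dim=-1)   # global feature g(p)
        z = self.trunk(g)
        return self.head_sem(z), torch.sigmoid(self.head_rgb(z)), self.head_geo(z)

# h_sem / h_rgb / h_geo are the per-type hybrid features, i.e. each type's
# features concatenated across the K scales (dimensions are placeholders).
dec = GlobalFeatureDecoder(d_sem=32, d_rgb=32, d_geo=32, n_classes=40)
sem, rgb, sigma = dec(torch.randn(10, 32), torch.randn(10, 32), torch.randn(10, 32))
```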
To integrate these multi-scale mesh features with the TSDF representation, we use the geometric density $\sigma_{geo}(p)$ to update the TSDF value $d(p)$ [25]. The TSDF value represents the signed distance from the point $p$ to the nearest surface, and we can use the following relationship to update it:
$d(p) = \alpha_{geo} \cdot \mathrm{Sigmoid}^{-1}\!\left(\alpha_{geo} \cdot \sigma_{geo}(p)\right)$
where $\alpha_{geo}$ is the learnable parameter for geometric rendering, controlling the sharpness of surface boundaries (suggested initialization range: $\alpha_{geo} \in [1, 10]$).
The rendering method follows the approach of StyleSDF. Along the ray $r(t) = o + t d$, $N$ points $\{p_n\}_{n=1}^{N}$ are sampled. The multi-scale feature network $D_{\theta}(p_n)$ generates the RGB value $rgb(p_n)$, the semantic value $sem(p_n)$, and the Truncated Signed Distance Field (TSDF) value $d(p_n)$ for each point. Subsequently, the Signed Distance Function (SDF) value is converted into volume densities:
$\sigma_{geo}(p_n) = \frac{1}{\alpha_{geo}} \cdot \mathrm{Sigmoid}\!\left(\frac{d(p_n)}{\alpha_{geo}}\right)$
$\sigma_{sem}(p_n) = \frac{1}{\alpha_{sem}} \cdot \mathrm{Sigmoid}\!\left(\frac{d(p_n)}{\alpha_{sem}}\right)$
where $\alpha_{geo}$ and $\alpha_{sem}$ are learnable parameters for geometric and semantic rendering, respectively, controlling the sharpness of surface boundaries and semantic distributions. The geometric volume density $\sigma_{geo}(p_n)$ is used to render color and depth:
$\widehat{rgb} = \sum_{n=1}^{N} \exp\!\left(-\sum_{i=1}^{n-1} \sigma_{geo}(p_i)\right)\left(1 - \exp(-\sigma_{geo}(p_n))\right) \cdot rgb(p_n)$
$\widehat{depth} = \sum_{n=1}^{N} \exp\!\left(-\sum_{i=1}^{n-1} \sigma_{geo}(p_i)\right)\left(1 - \exp(-\sigma_{geo}(p_n))\right) \cdot d_n$
where $d_n$ denotes the depth of point $p_n$ relative to the camera pose. The semantic volume density $\sigma_{sem}(p_n)$ is used for semantic rendering:
$\widehat{sem} = \sum_{n=1}^{N} \exp\!\left(-\sum_{i=1}^{n-1} \sigma_{sem}(p_i)\right)\left(1 - \exp(-\sigma_{sem}(p_n))\right) \cdot sem(p_n)$
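A compact PyTorch rendering sketch for a single ray is given below. It follows the formulas above as reconstructed (note that StyleSDF's original formulation negates the SDF inside the sigmoid, so sign conventions may differ), and all shapes and parameter values are placeholders.

```python
import torch

def render_ray(d_vals, rgb_vals, sem_vals, depths, alpha_geo, alpha_sem):
    """Render color, depth, and semantics for one ray from N sampled points.

    d_vals: (N,) TSDF values, rgb_vals: (N, 3), sem_vals: (N, C) semantic
    scores, depths: (N,) point depths d_n; alpha_* are the learnable
    sharpness parameters.
    """
    sigma_geo = torch.sigmoid(d_vals / alpha_geo) / alpha_geo   # SDF -> density
    sigma_sem = torch.sigmoid(d_vals / alpha_sem) / alpha_sem

    def weights(sigma):
        # Transmittance up to point n times the opacity at point n.
        accum = torch.cumsum(torch.cat([sigma.new_zeros(1), sigma[:-1]]), dim=0)
        return torch.exp(-accum) * (1.0 - torch.exp(-sigma))

    w_geo, w_sem = weights(sigma_geo), weights(sigma_sem)
    rgb_hat = (w_geo[:, None] * rgb_vals).sum(dim=0)
    depth_hat = (w_geo * depths).sum(dim=0)
    sem_hat = (w_sem[:, None] * sem_vals).sum(dim=0)
    return rgb_hat, depth_hat, sem_hat

# Placeholder sample: 64 points, 20 semantic classes, alpha values set arbitrarily.
out = render_ray(torch.randn(64), torch.rand(64, 3), torch.randn(64, 20),
                 torch.linspace(0.1, 5.0, 64), alpha_geo=0.5, alpha_sem=0.5)
```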

3.3. Loss Function

This section mainly introduces the loss functions involved in this paper, including feature loss, semantic loss, depth loss, and image loss. The design of the image loss and semantic loss mainly employs cross-entropy loss, L1 loss, and Structural Similarity (SSIM) loss, while the depth loss and feature loss only use the L1 norm.
Specifically, our loss function consists of the following parts:
For the image loss $l_{rgb}$, we use the L1 loss and the Structural Similarity (SSIM) loss to measure the difference between the rendered image and the real image. The L1 loss computes the absolute difference between the two images, and the SSIM loss measures their structural similarity. The image loss is calculated as follows:
$l_{rgb} = (1 - \lambda_{rgb}) \cdot L_1(I_{render}, I_{gt}) + \lambda_{rgb} \cdot \mathrm{SSIM}(I_{render}, I_{gt})$
where $I_{render}$ is the rendered image, $I_{gt}$ is the ground-truth image, and $\lambda_{rgb}$ is the weight coefficient.
For the semantic loss $l_{sem}$, we likewise use the L1 loss and the SSIM loss to measure the difference between the rendered semantic image and the ground-truth semantic image:
$l_{sem} = (1 - \lambda_{sem}) \cdot L_1(S_{render}, S_{gt}) + \lambda_{sem} \cdot \mathrm{SSIM}(S_{render}, S_{gt})$
where $S_{render}$ is the rendered semantic image, $S_{gt}$ is the ground-truth semantic image, and $\lambda_{sem}$ is the weight coefficient.
For the depth loss $l_{depth}$, we use the L1 loss to calculate the difference between the rendered depth image and the ground-truth depth image:
$l_{depth} = L_1(D_{render}, D_{gt})$
where $D_{render}$ is the rendered depth image and $D_{gt}$ is the ground-truth depth image.
In order to enhance the perception of feature details, we adopt a feature loss to constrain the intermediate features in the rendering process and the feature extraction process. For the feature loss $l_{feat}$, the L1 loss is used to calculate the difference between the rendered feature sequence and the feature sequence obtained from feature extraction:
$l_{feat} = L_1(F_{render}, F_{in})$
where $F_{render}$ is the rendered feature sequence and $F_{in}$ is the feature sequence obtained from feature extraction.
The overall loss function of our model is defined as follows:
$L = \lambda_r l_{rgb} + \lambda_s l_{sem} + \lambda_d l_{depth} + \lambda_f l_{feat}$
where $\lambda_r$, $\lambda_s$, $\lambda_d$, and $\lambda_f$ are the weight coefficients for the image loss, semantic loss, depth loss, and feature loss, respectively. The overall loss function combines the different components to guide the training process and improve the performance of the model.
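A hedged PyTorch sketch of the combined objective is shown below. The weight values are placeholders rather than the tuned settings, the pytorch_msssim package is assumed for the SSIM term, and we use 1 − SSIM so that the term decreases as similarity improves, which is our reading of the SSIM loss in the formulas above.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # third-party SSIM implementation (assumed)

def total_loss(I_r, I_gt, S_r, S_gt, D_r, D_gt, F_r, F_in,
               lam_rgb=0.2, lam_sem=0.2, w=(1.0, 1.0, 0.5, 0.5)):
    """Weighted sum of image, semantic, depth, and feature losses.

    Image-like inputs are (B, C, H, W) tensors scaled to [0, 1]; all weight
    values here are illustrative placeholders, not the paper's settings.
    """
    l_rgb = (1 - lam_rgb) * F.l1_loss(I_r, I_gt) \
            + lam_rgb * (1 - ssim(I_r, I_gt, data_range=1.0))
    l_sem = (1 - lam_sem) * F.l1_loss(S_r, S_gt) \
            + lam_sem * (1 - ssim(S_r, S_gt, data_range=1.0))
    l_depth = F.l1_loss(D_r, D_gt)              # depth term
    l_feat = F.l1_loss(F_r, F_in)               # intermediate-feature term
    w_r, w_s, w_d, w_f = w
    return w_r * l_rgb + w_s * l_sem + w_d * l_depth + w_f * l_feat
```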
The selection of these loss weights ( λ values) is based on extensive experiments. We conducted a series of experiments to explore different weight combinations and evaluated their effects on the model’s performance using metrics such as mean Intersection over Union (mIoU) and Absolute Trajectory Error (ATE). Specifically, we performed a grid search on the weight space and selected the optimal combination that achieved the best performance on both the Replica and ScanNet datasets. The ablation experiments in Section 4 also provide strong evidence for the effectiveness of the chosen weights.

4. Experiments

4.1. Experimental Settings

In our experimental demonstration, we showcase two typical scenes from each dataset to validate the effectiveness of the proposed method.
The Replica dataset [6], comprising 18 high-fidelity synthetic scenes reconstructed from real-world environments, provides detailed geometric, appearance, and semantic information. Collected through high-resolution RGB-D scanning and refined by manual post-processing, this synthetic dataset allows for precise control over testing variables, making it ideal for isolating specific aspects of the method.
Additionally, we utilized the ScanNet dataset [7], a large-scale real-world 3D scene dataset. It contains 21 RGB-D scans of indoor scenes, including living rooms, offices, and kitchens. The data was collected using a handheld consumer-level RGB-D camera, with manual annotation to provide semantic labels for over 40 common object categories. This dataset’s real-world nature captures the complexity and variability in actual environments, such as occlusions, irregular lighting, and clutter. The combination of synthetic (Replica) and real-world (ScanNet) datasets enables a comprehensive evaluation of Sem-SLAM across diverse scenarios. Both datasets are equipped with semantic ground truth annotations, enabling a comprehensive evaluation across various scenarios.
To rigorously evaluate the performance of our proposed method, we selected the mean Intersection over Union (mIoU) and the Absolute Trajectory Error (ATE) as the key evaluation metrics. mIoU, a widely-recognized standard in semantic segmentation, directly gauges the object classification accuracy of our model. By quantifying the overlap between the predicted and ground truth masks, mIoU provides a numerical measure of segmentation quality, where higher values signify superior performance. This metric aligns seamlessly with our overarching objective of enhancing semantic modeling within the SLAM framework. Conversely, ATE serves as a crucial indicator of trajectory estimation accuracy. In the context of SLAM, precise camera pose estimation is indispensable for reliable scene reconstruction and navigation. ATE quantifies the discrepancy between the estimated and ground truth trajectories, with lower values denoting more accurate pose estimation. As such, ATE is instrumental in achieving high-precision SLAM performance.
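For reference, a minimal NumPy implementation of the mIoU metric used here is sketched below; the handling of an ignore label and the skipping of absent classes follow common practice and are our assumptions rather than details stated in the text.

```python
import numpy as np

def mean_iou(pred, gt, n_classes, ignore_label=-1):
    """Mean Intersection over Union between predicted and ground-truth label maps."""
    valid = gt != ignore_label
    ious = []
    for c in range(n_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                              # class absent from this scene
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```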
To ensure the objectivity of the results, we conducted five independent runs on each scene and calculated the average value. In this study, SNI-SLAM served as our baseline model. We compared its performance with the method presented in this paper. Meanwhile, we also compared it with other existing NeRF-SLAM methods, such as NIDS-SLAM [26].

4.2. Qualitative Evaluations

As shown in Figure 2, we selected two rooms from the Replica dataset to conduct camera pose trajectory experiments. The trajectory represents the camera’s movement path in the environment, which is a crucial indicator for evaluating the positioning accuracy of the SLAM system. We aim to evaluate the accuracy of the proposed method in obtaining the camera pose trajectory. In the experiment, the blue line represents the trajectory obtained by our Sem-SLAM, and the black line represents the ground truth trajectory. The results show that the trajectory obtained by our method highly coincides with the ground truth trajectory, which effectively verifies the effectiveness of introducing the multi-scale information extractor to extract both coarse-grained and fine-grained information of the camera pose. With the help of this innovative structure, relevant trajectory information can be accurately captured, making the trajectory estimation approach the true value.
In addition, we carried out a multi-view semantic mapping experiment for the same room. As shown in Figure 3, throughout the mapping process, relying on the full-attention cross-modal feature fusion mechanism, our algorithm can efficiently integrate the geometric, shape, and semantic features extracted by the cross-attention and self-attention modules. By fully exploiting the complementary advantages of features at various levels, the algorithm is able to accurately capture the subtle structures and semantic attributes of objects in the scene.
However, due to the limitations of the viewing angles, there are some perceptual blind spots in certain areas, making it difficult to achieve a complete recognition in one go. However, as the mapping process progresses, the algorithm can, based on the accumulated multi-view information, dynamically fill in the missing details through an iterative optimization strategy. In this way, it gradually improves the integrity and accuracy of the map, and ultimately generates a more detailed and accurate semantic map.
Finally, we conducted a detailed comparative experiment on semantic modeling across two rooms, as illustrated in Figure 4. By juxtaposing the generated semantic images with their ground truth counterparts, we observed a striking resemblance between them. Our algorithm, empowered by the multi-scale information extractor and full-attention cross-modal feature fusion approach, excels not only in accurately reconstructing semantic information at a holistic level but also in meticulously capturing fine-grained details.
The interpolation results presented in the third column vividly demonstrate that, within specific object modeling regions, the algorithm employs interpolation techniques to refine and enhance the details. Meanwhile, the image in the lower right corner showcases the algorithm’s capability to generate more elaborate object contours. Guided by the joint optimization of color loss, appearance loss, feature loss, and depth loss, the algorithm achieves remarkable precision and completeness in modeling extensive areas such as walls and floors.

4.3. Quantitative Results

To comprehensively evaluate the performance of the proposed method, during the quantitative experiment phase, key metrics such as the mean Intersection over Union (mIoU) and the Absolute Trajectory Error (ATE) were selected for comparison with existing SLAM methods based on NeRF.
Meanwhile, two different environmental scenarios were carefully selected from the Replica dataset [6] and the ScanNet dataset [7], and the experimental results were averaged to ensure the reliability and generalizability of the outcomes.
As shown in Table 1, in the comparison of the mean Intersection over Union, the proposed method innovatively introduces a multi-scale information extractor. This extractor can efficiently extract information at different scales from the camera pose. Moreover, by employing a full-attention feature fusion method, it seamlessly integrates geometric, shape, and semantic features based on cross-attention and self-attention mechanisms. This unique architecture enables the model to fully exploit the value of features at various levels, significantly enhancing its semantic modeling ability. The experimental results demonstrate that, compared with existing NeRF-based SLAM methods, the proposed method has prominent advantages in terms of reconstruction accuracy and integrity, strongly proving its superior performance in semantic modeling tasks.
As illustrated in Table 2, in the evaluation of the Absolute Trajectory Error, the proposed method, by virtue of the precise capture of camera pose information by the multi-scale information extractor and the comprehensive integration of environmental perception information through cross-modal feature fusion, reduces the trajectory estimation error. The experimental data shows that, on the same dataset, the Absolute Trajectory Error of the proposed method is lower than that of the comparative methods. Moreover, the confidence values in the table indicate that our method achieves a confidence of 0.92, which is higher compared to other methods, further highlighting its superiority in positioning accuracy and emphasizing the crucial role of the method in optimizing trajectory estimation.

4.4. Runtime Analysis

We compared the running speed (FPS), number of parameters (param), and memory usage (Memory) of our method with SNI-SLAM and NIDS-SLAM in Table 3. Compared with existing methods, our method shows a certain increase in running speed, while the number of parameters and memory usage during the training phase are similar. Compared with the baseline SNI-SLAM, our method has a slightly larger parameter count after adding the multi-scale and full-attention mechanisms, but it still does not exceed that of NIDS-SLAM and runs faster. This demonstrates the advantages of our method in terms of running time and computational requirements. However, to better meet the demands of real-time applications, further algorithm optimization is still needed in future work. We can further reduce the number of parameters and memory usage while preserving the semantic mapping function, and improve the runtime to meet real-time requirements.

4.5. Ablation Study on Loss Function

To thoroughly analyze the effectiveness of each component and loss function in the proposed method, we meticulously designed ablation experiments, complemented by a comprehensive analysis through quantitative experiments.
Building upon the quantitative experiments, our ablation study primarily focused on two key metrics: the mean Intersection over Union (mIoU) and the Absolute Trajectory Error (ATE). As revealed by the data in Table 4, we incrementally incorporated different loss functions into the model for comparison. When only the color loss ( l r g b ) and depth loss ( l d e p t h ) were utilized, the model achieved an ATE of 20.59 and an mIoU of 60.51 on the Replica dataset, and an ATE of 22.54 with an mIoU of 63.24 on the ScanNet dataset. Upon introducing the semantic loss ( l s e m ) to form the combination of l r g b + l s e m + l d e p t h , a significant performance boost was observed. Specifically, on the Replica dataset, the ATE decreased to 17.52, and the mIoU increased to 67.48; on the ScanNet dataset, the ATE reached 14.35, and the mIoU climbed to 70.45. Notably, when the full set of loss functions l r g b + l s e m + l f e a t + l d e p t h (i.e., our proposed method) was employed, the performance was further substantially optimized. On the Replica dataset, the ATE plummeted to 9.47, and the mIoU soared to 78.25; on the ScanNet dataset, the ATE dropped to 8.53, and the mIoU reached an impressive 80.56.
In summary, the results of the ablation experiments strongly confirm the critical significance of the synergistic effect of the multi-scale information extractor, the full-attention cross-modal feature fusion structure, and various loss functions (color loss, appearance loss, feature loss, depth loss) in our proposed method. This synergy plays a pivotal role in enhancing the model’s positioning accuracy and semantic modeling performance in SLAM tasks. Each component is indispensable, and together they form an efficient algorithmic system.

4.6. Ablation Study on Segmentation Backbone Errors

To investigate the sensitivity of the system to errors in the pre-trained segmentation backbone, we conducted ablation experiments on noisy segmentations. We introduced different levels of noise to the segmentation results of the pre-trained backbone and evaluated the system’s performance using the mean Intersection over Union (mIoU) metric.
We used Gaussian noise and salt-and-pepper noise to simulate errors in the segmentation results. Specifically, we controlled the intensity of noise by adjusting the variance of Gaussian noise and the proportion of salt-and-pepper noise. The experimental results are shown in Table 5.
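The noise injection can be implemented along the lines of the NumPy sketch below; flipping a fraction of labels for salt-and-pepper noise and perturbing one-hot scores with Gaussian noise before re-taking the argmax are our illustrative realizations of the two noise types, not necessarily the exact procedure used in the experiments.

```python
import numpy as np

def perturb_segmentation(label_map, n_classes, mode="salt_pepper", level=0.1, rng=None):
    """Inject synthetic errors into a 2D integer segmentation label map.

    level: fraction of flipped pixels (salt-and-pepper) or noise std (Gaussian).
    """
    rng = rng or np.random.default_rng(0)
    if mode == "salt_pepper":
        noisy = label_map.copy()
        flip = rng.random(label_map.shape) < level          # pixels to corrupt
        noisy[flip] = rng.integers(0, n_classes, size=int(flip.sum()))
        return noisy
    # "gaussian": perturb one-hot scores, then re-assign labels.
    scores = np.eye(n_classes)[label_map] \
        + rng.normal(0.0, level, (*label_map.shape, n_classes))
    return scores.argmax(axis=-1)
```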
The results show that, as the noise level increases, the mIoU of the system decreases, indicating that the system is sensitive to errors in the pre-trained segmentation backbone. The performance degradation is more significant when the noise level reaches 0.3. Among the two types of noise, salt-and-pepper noise has a relatively greater impact on the system performance. These findings suggest that improving the accuracy of the pre-trained segmentation backbone or adding noise-resistant mechanisms can enhance the robustness of the system.
Moreover, we evaluated the impact of different loss weight combinations on the model performance. Using the grid search method, we tested multiple weight combinations and recorded the corresponding mean Intersection over Union (mIoU) and Absolute Trajectory Error (ATE) metrics. Based on common experience in designing semantic SLAM weights, we mainly tested three typical weight combinations: balanced, semantics-prioritized, and depth-prioritized. According to the comprehensive performance on the Replica and ScanNet datasets, we selected the depth-prioritized weight combination. Experimental results in Table 6 show that weight combinations with different characteristics have their own advantages in specific scenarios, and the selected weights can effectively balance different loss components and improve the overall performance of the model.

4.7. Cross-Attention Weights Temporal Stability Experiment

To further validate the temporal stability of cross-attention weights, we conducted experiments using the Absolute Trajectory Error (ATE) metric. ATE measures the absolute difference between the estimated trajectory and the ground truth trajectory. To calculate it, we first align the estimated trajectory with the ground truth using a rigid-body transformation (usually via the Procrustes algorithm), then compute the root mean square error (RMSE) of the point-wise distances between the two trajectories.
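A compact NumPy version of this ATE computation is given below; it uses the closed-form Kabsch/Umeyama alignment (rotation and translation, no scale) described above, and assumes the two trajectories are already associated frame by frame.

```python
import numpy as np

def ate_rmse(est, gt):
    """ATE-RMSE after rigidly aligning the estimated trajectory to the ground truth.

    est, gt: (N, 3) arrays of corresponding camera positions.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)                        # cross-covariance SVD
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflection
    R = Vt.T @ D @ U.T                                       # best-fit rotation
    t = mu_g - R @ mu_e                                      # best-fit translation
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```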
Tests on consecutive frames of Replica and ScanNet datasets show ATE ranges of 12.1–12.5 and 13.2–13.6, respectively. The experimental results show that the ATE values are stable within a certain range. These results indirectly demonstrate that the cross-attention weights have good temporal stability, and the model can maintain consistent performance in dynamic environments.

4.8. Failure Cases

Despite its merits, our method fails in three scenarios:
Complex Lighting: Rapid changes or extreme contrasts distort color and depth data, disrupting feature matching and causing inaccurate results.
Ultra-high-speed Movement: Speeds over 10 m/s cause motion blur and aliasing, preventing stable feature tracking and real-time results.
Repetitive Environments: Repetitive patterns yield insufficient unique features, leading to inaccurate pose estimation and poor map quality.

5. Conclusions

The SLAM method proposed in this paper incorporates a multi-scale information extractor and full-attention cross-modal feature fusion technology, which has shown remarkable performance in both qualitative and quantitative experiments.
In qualitative experiments, the estimated trajectory closely matches the ground truth, highlighting the high precision of the multi-scale information extractor in capturing camera pose information. For semantic mapping, the full-attention cross-modal feature fusion enables the algorithm to fully exploit features at various levels, capturing object details and making effective inferences based on existing information.
Quantitative and ablation experiments further validate the method’s advantages. It outperforms existing methods in metrics such as mean Intersection over Union (mIoU) and Absolute Trajectory Error (ATE). Through ablation experiments, we revealed the synergistic effects between color, appearance, feature, and depth losses and the network structure, confirming the critical role of each component in improving positioning accuracy and semantic modeling performance.
We also evaluated the impact of different loss weight combinations on model performance. By testing balanced, semantics-prioritized, and depth-prioritized weight combinations, we selected the optimal combination that effectively balances different loss components and improves overall model performance. Experiments on cross-attention weights temporal stability show that the model can maintain consistent performance in dynamic environments. However, our method still faces limitations in certain scenarios, such as complex lighting, ultra-high-speed movement, and repetitive environments. These limitations may be due to the limitations of the dataset or the characteristics of the environment. In the future, we aim to overcome these limitations through further research and development.
In conclusion, this study demonstrates excellent performance in camera pose estimation and semantic mapping tasks. To extend the system to larger and more unpredictable outdoor environments, future work will focus on exploring multi-sensor fusion strategies. Integrating an Inertial Measurement Unit (IMU) with visual sensors can compensate for vision limitations in rapid motion or occlusion scenarios. Meanwhile, incorporating millimeter-wave radar or LiDAR can enhance the system’s robustness against illumination changes, adverse weather, and moving occlusions. Subsequent research will focus on validating the adaptability of this method under large-scale illumination variations, such as by employing multi-sensor fusion to address unpredictable interference factors. We believe these advancements will significantly enhance the system’s practicality in complex outdoor environments.

Author Contributions

Conceptualization, Y.Z.; methodology, S.L.; software, C.Z. and Q.L.; validation, J.H. and S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key R&D Program of China [grant number 2024YFC3016003].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, M.; Ma, Y.; Qiu, Q. SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization. In Proceedings of the 2023 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–8 December 2023; pp. 312–317. [Google Scholar] [CrossRef]
  2. Sha, Y.; Zhu, S.; Guo, H.; Wang, Z.; Wang, H. Towards Autonomous Indoor Parking: A Globally Consistent Semantic SLAM System and A Semantic Localization Subsystem. arXiv 2024, arXiv:2410.12169. [Google Scholar]
  3. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  4. Mao, Y.; Yu, X.; Wang, K.; Wang, Y.; Xiong, R.; Liao, Y. NGEL-SLAM: Neural Implicit Representation-based Global Consistent Low-Latency SLAM System. arXiv 2023, arXiv:2311.09525. [Google Scholar]
  5. Tao, Y.; Bhalgat, Y.; Fu, L.F.T.; Mattamala, M.; Chebrolu, N.; Fallon, M.F. SiLVR: Scalable Lidar-Visual Reconstruction with Neural Radiance Fields for Robotic Inspection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 17983–17989. [Google Scholar]
  6. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The Replica dataset: A digital replica of indoor spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar]
  7. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  8. Ming, Y.; Ye, W.; Calway, A. iDF-SLAM: End-to-End RGB-D SLAM with Neural Implicit Mapping and Deep Feature Tracking. arXiv 2022, arXiv:2209.07919. [Google Scholar]
  9. Yang, F.; Wang, Y.; Tan, L.; Li, M.; Shan, H.; Liao, P. DNIV-SLAM: Dynamic Neural Implicit Volumetric SLAM. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Singapore, 25–28 October 2025; pp. 33–47. [Google Scholar]
  10. Xin, Z.; Yue, Y.; Zhang, L.; Wu, C. HERO-SLAM: Hybrid Enhanced Robust Optimization of Neural SLAM. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8610–8616. [Google Scholar]
  11. Li, H.; Gu, X.; Yuan, W.; Yang, L.; Dong, Z.; Tan, P. Dense RGB SLAM With Neural Implicit Maps. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–21. [Google Scholar]
  12. Wei, W.; Wang, J.; Deng, S.; Liu, J. DF-SLAM: Dictionary Factors Representation for High-Fidelity Neural Implicit Dense Visual SLAM System. arXiv 2024, arXiv:2404.17876. [Google Scholar]
  13. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12786–12796. [Google Scholar]
  14. Zhang, Y.; Tosi, F.; Mattoccia, S.; Poggi, M. GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  15. Qu, D.; Yan, C.; Wang, D.; Yin, J.; Chen, Q.; Xu, D.; Zhang, Y.; Zhao, B.; Li, X. Implicit Event-RGBD Neural SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 19584–19594. [Google Scholar]
  16. Li, M.; Zhou, Y.; Jiang, G.; Deng, T.; Wang, Y.; Wang, H. DDN-SLAM: Real-time Dense Dynamic Neural Implicit SLAM. arXiv 2024, arXiv:2401.01545. [Google Scholar] [CrossRef]
  17. Zhu, S.; Wang, G.; Blum, H.; Liu, J.; Song, L.; Pollefeys, M.; Wang, H. SNI-SLAM: Semantic Neural Implicit SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 21167–21177. [Google Scholar] [CrossRef]
  18. Li, G.; Chen, Q.; Yan, Y.; Pu, J. EC-SLAM: Real-time Dense Neural RGB-D SLAM System with Effectively Constrained Global Bundle Adjustment. arXiv 2024, arXiv:2404.13346v1. [Google Scholar]
  19. Cartillier, V.; Schindler, G.; Essa, I. SLAIM: Robust Dense Neural SLAM for Online Tracking and Mapping. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 2862–2871. [Google Scholar] [CrossRef]
  20. Hu, J.; Chen, X.; Feng, B.; Li, G.; Yang, L.; Bao, H.; Zhang, G.; Cui, Z. CG-SLAM: Efficient dense RGB-D SLAM in a consistent uncertainty-aware 3D Gaussian field. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 93–112. [Google Scholar]
  21. Wang, S. Automated fault diagnosis detection of air handling units using real operational labelled data and transformer-based methods at 24-hour operation hospital. Build. Environ. 2025, 282, 113257. [Google Scholar] [CrossRef]
  22. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  23. Johari, M.M.; Carta, C.; Fleuret, F. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17408–17419. [Google Scholar]
  24. Yang, X.; Li, H.; Zhai, H.; Ming, Y.; Liu, Y.; Zhang, G. Vox-Fusion: Dense tracking and mapping with voxel-based neural implicit representation. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Singapore, 17–21 October 2022; pp. 499–507. [Google Scholar]
  25. Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; Wang, W. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 27171–27183. [Google Scholar]
  26. Haghighi, Y.; Kumar, S.; Thiran, J.P.; Gool, L.V. Neural Implicit Dense Semantic SLAM. arXiv 2023, arXiv:2304.14560. [Google Scholar]
Figure 1. Our proposed method (Sem-SLAM). The method first uses a forward feature fusion network to integrate multi-scale and multi-modal features from the input data. Then, a backward semantic mapping network takes the fused features to perform environmental semantic modeling, including semantic map construction. This two-step process enables efficient and accurate environmental modeling.
Figure 2. Qualitative comparison of scene reconstruction between our Sem-SLAM and the baseline. The ground truth images and details are rendered with the ReplicaViewer software (SDK https://github.com/facebookresearch/Replica-Dataset/tree/main (accessed on 10 July 2025)) [6]. Two rooms were selected from the Replica dataset for the camera pose trajectory experiments. The trajectory represents the camera's movement path in the environment and is a crucial indicator of the positioning accuracy of the SLAM system. In this figure, if only the blue predicted trajectory is visible, there is no deviation between the trajectories; the black segments represent the trajectory deviation.
Figure 3. A multi-view semantic mapping experiment for the same room. The red and green lines represent the camera trajectories during operation: the red line is the actual operating trajectory, and the green line is the ground-truth trajectory.
Figure 4. Qualitative comparison of the scene reconstruction of our method. The ground truth images and details are rendered with the ReplicaViewer software. We visualize two selected scenes of the Replica dataset [6].
Table 1. Comparison of the proposed method and existing NeRF-based SLAM methods in terms of the mean Intersection over Union (mIoU) metric on the Replica and ScanNet datasets.
Methods | Replica Room 0 | Replica Room 1 | Replica Avg. | ScanNet 0000 | ScanNet 0106 | ScanNet Avg.
NIDS-SLAM | 60.23 | 30.51 | 54.64 | 70.78 | 63.46 | 10.16
SNI-SLAM | 58.46 | 50.48 | 64.78 | 84.62 | 55.84 | 53.64
Sem-SLAM | 88.56 | 90.26 | 81.45 | 86.21 | 78.96 | 80.47
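For completeness, the mIoU metric reported in Table 1 can be computed from per-pixel predicted and ground-truth label maps as in the minimal NumPy sketch below; the array shapes and the treatment of classes absent from both maps are assumptions, not necessarily the exact evaluation protocol.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union between two integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious)) if ious else 0.0
```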
Table 2. The comparison of the proposed method and existing NeRF-based SLAM methods in terms of the Absolute Trajectory Error (ATE) metric on the Replica and ScanNet datasets.
Methods | Replica Room 0 | Replica Room 1 | Replica Avg. | ScanNet 0000 | ScanNet 0106 | ScanNet Avg. | Confidence
NIDS-SLAM | 9.5 | 12 | 10.8 | 12.2 | 7.70 | 9.55 | 89%
SNI-SLAM | 8.3 | 9.9 | 9.2 | 10.1 | 5.5 | 8.55 | 90%
Sem-SLAM | 8.0 | 9.0 | 8.1 | 8.6 | 7.8 | 8.0 | 92%
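The ATE values in Table 2 follow the usual definition of the root-mean-square translational error between the estimated and ground-truth camera trajectories. A minimal sketch, assuming time-synchronized and pre-aligned (N, 3) position arrays, is:

```python
import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """RMSE of the translational error between two (N, 3) camera trajectories
    that are assumed to be time-synchronized and already aligned (e.g., via
    a Umeyama similarity transform)."""
    errors = est_xyz - gt_xyz
    return float(np.sqrt((errors ** 2).sum(axis=1).mean()))
```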
Table 3. Runtime and memory comparison on Replica.
Methods | Param. | Memory | SLAM FPS
NIDS-SLAM | 12.5 M | 1.5 GB | 1.8
SNI-SLAM | 6.3 M | 1.1 GB | 2.12
Sem-SLAM | 8.2 M | 1.3 GB | 2.2
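The parameter counts and frame rates in Table 3 can be reproduced with simple instrumentation such as the sketch below; `model` and `process_frame` are placeholders for the actual network and per-frame SLAM step, and memory accounting is not shown.

```python
import time
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def measure_fps(process_frame, frames) -> float:
    """Average end-to-end frames per second over a sequence of input frames."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    return len(frames) / (time.perf_counter() - start)
```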
Table 4. Ablation study on loss function (the ↓ means the smaller, the better, and the ↑ means the larger, the better).
Loss Terms | Replica ATE ↓ | Replica mIoU ↑ | ScanNet ATE ↓ | ScanNet mIoU ↑
$l_{rgb} + l_{depth}$ | 20.59 | 60.51 | 22.54 | 63.24
$l_{rgb} + l_{sem} + l_{depth}$ | 17.52 | 67.48 | 14.35 | 70.45
$l_{rgb} + l_{sem} + l_{feat} + l_{depth}$ | 9.47 | 78.25 | 8.53 | 80.56
Table 5. Ablation study results on noisy segmentations.
Noise Type | Noise Level | Replica mIoU | ScanNet mIoU
No Noise | 0 | 78.25 | 80.56
Gaussian | 0.1 | 75.32 | 77.43
Gaussian | 0.3 | 68.74 | 70.21
Salt-and-Pepper | 0.1 | 74.18 | 76.39
Salt-and-Pepper | 0.3 | 66.57 | 68.42
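The perturbations in Table 5 correspond to standard corruption models for the 2D segmentation inputs. One plausible reading, sketched below, adds Gaussian noise to the per-pixel class scores before the argmax and applies salt-and-pepper noise by randomly relabeling a fraction of pixels; the exact protocol used in the experiments may differ.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def gaussian_noise_segmentation(logits, sigma):
    """Perturb per-pixel class scores of shape (H, W, C) with Gaussian noise
    of standard deviation `sigma`, then re-take the per-pixel argmax."""
    noisy = logits + rng.normal(0.0, sigma, size=logits.shape)
    return noisy.argmax(axis=-1)

def salt_and_pepper_segmentation(labels, ratio, num_classes):
    """Replace a fraction `ratio` of pixels in an integer label map (H, W)
    with uniformly random class indices."""
    corrupted = labels.copy()
    mask = rng.random(labels.shape) < ratio
    corrupted[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
    return corrupted
```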
Table 6. Performance comparison of different loss weight combinations (the ↓ means the smaller, the better, and the ↑ means the larger, the better).
Weight Combination | Replica mIoU ↑ | Replica ATE ↓ | ScanNet mIoU ↑ | ScanNet ATE ↓
Balanced ($\lambda_r = 0.4$, $\lambda_s = 0.3$, $\lambda_d = 0.2$, $\lambda_f = 0.1$) | 70.23 | 12.45 | 72.15 | 13.67
Semantics-prioritized ($\lambda_r = 0.3$, $\lambda_s = 0.4$, $\lambda_d = 0.2$, $\lambda_f = 0.1$) | 73.45 | 10.89 | 75.32 | 11.23
Depth-prioritized ($\lambda_r = 0.3$, $\lambda_s = 0.2$, $\lambda_d = 0.4$, $\lambda_f = 0.1$) | 78.22 | 9.45 | 82.41 | 8.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
