Semantic Mapping in Public Indoor Environments Using Improved Instance Segmentation and Continuous-Frame Dynamic Constraint

Lu, Yumin; Feng, Xueyu; Guo, Zonghuan; Wang, Jianchao; Zhou, Lin; Lin, Yingcheng

doi:10.3390/electronics15071392

by Yumin Lu ¹, Xueyu Feng ¹, Zonghuan Guo ², Jianchao Wang ², Lin Zhou ²,* and Yingcheng Lin ¹,*

¹ The School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
² Seres Group Co., Ltd., Chongqing 401335, China
* Authors to whom correspondence should be addressed.

Electronics 2026, 15(7), 1392; https://doi.org/10.3390/electronics15071392

Submission received: 11 February 2026 / Revised: 20 March 2026 / Accepted: 24 March 2026 / Published: 26 March 2026

(This article belongs to the Section Artificial Intelligence)

Abstract

Reliable semantic perception is crucial for service robots operating in complex public indoor environments. However, existing semantic mapping approaches often face the dual challenges of high computational overhead and semantic redundancy in maps. To address these limitations, this paper proposes a low-resource semantic mapping framework based on improved instance segmentation and dynamic constraints from consecutive frames. First, we design the lightweight model MS-YOLO, which adopts MobileNetV4 as its backbone network and incorporates the SHViT neck module, effectively optimizing the balance between detection accuracy and computational cost. Second, we propose a consecutive frame dynamic constraint method that eliminates redundant object annotations through consecutive frame stability verification. Experimental results relating to both fusion and custom datasets demonstrate that compared to YOLOv8n-seg, MS-YOLO achieves improvements in accuracy, recall, and mAP@0.5, while reducing the number of parameters by 11.7% and floating-point operations (FLOPs) by 32.2%. Furthermore, compared to YOLOv11n-seg and YOLOv5n-seg, its FLOPs are reduced by 17.2% and 25.5%, respectively. Finally, the successful deployment and field validation of this system on the Jetson Orin NX platform demonstrate its real-time capability and engineering practicality for edge computing in public indoor service robots.

Keywords:

service robot; semantic mapping; SLAM; instance segmentation; YOLOv8

1. Introduction

With the rapid advancement of artificial intelligence, robotics, and computer vision technologies, service robots have gradually transitioned from laboratory environments to practical applications. Currently, they are widely deployed in indoor public spaces such as hospitals, schools, libraries, hotels, and shopping malls to perform tasks including patrols, guidance, cleaning, and security. This has not only significantly enhanced service efficiency but also effectively reduced labor costs [1,2,3,4]. To reliably execute tasks in these complex, dynamic environments, robots must possess robust environmental perception, mapping, and autonomous navigation capabilities. Among these, mapping serves as a core module, providing fundamental environmental representation for robot localization and navigation.

Currently, indoor mobile robots primarily rely on metric maps represented by occupancy grids. Such maps focus on describing the geometric structure of the environment, containing only spatial information about obstacles and navigable areas, and lack semantic descriptions of entity attributes within the environment [5]. However, in practical service scenarios, natural language remains the most intuitive means of interaction between humans and robots. Traditional metric maps struggle to support advanced task comprehension and interaction based on semantic concepts, thereby limiting the intelligence level of service robots [6]. For instance, in public indoor areas like hospitals and shopping malls, robots frequently interact with humans. Relying solely on geometric information clearly fails to meet practical demands. Consequently, developing methods to construct semantic maps that integrate spatial structure and semantic information has become a research hotspot [7].

Semantic mapping builds upon geometric mapping by attaching semantic labels to entities in the environment, further integrating external knowledge bases and reasoning mechanisms to achieve advanced semantic understanding [8]. The typical semantic mapping process comprises two key components: spatial mapping and semantic extraction. Regarding spatial mapping, Simultaneous Localization and Mapping (SLAM) technology has been widely applied for pose estimation and map construction in unseen environments [9]. The core challenge in constructing semantic maps lies in extracting semantic information about environmental entities [10]. Currently, semantic extraction primarily falls into two categories: scene-oriented and object-oriented approaches. Scene-oriented methods focus on classifying and identifying functional areas such as rooms and corridors. Because LiDAR data is largely insensitive to lighting conditions and highly stable, it is frequently employed for such tasks. Based on sensor type, these methods can be categorized into those using 2D LiDAR [11,12,13] and 3D LiDAR [14,15,16]. However, LiDAR provides limited texture information, making it difficult to distinguish objects with similar geometric features but different semantic meanings. In contrast, visual sensors offer rich texture and color details, making them more suitable for object-oriented recognition (e.g., identifying doors, trash cans). Although semantic information can be supplemented through geometric model matching [17,18] or human–machine interaction [19,20], purely vision-based semantic SLAM still faces challenges such as sparse mapping, high computational overhead, and poor environmental adaptability. While integrating inertial measurement units (IMUs) [21] and odometers [22] can enhance stability, practical application remains limited in public indoor environments with drastic lighting variations.

To address the limitations of single-sensor systems, Martins et al. proposed a solution integrating visual and laser SLAM, leveraging deep learning to extract semantic information and fuse it into geometric maps, thereby significantly enhancing map accuracy [23]. However, such methods typically require substantial computational resources, making deployment on computationally constrained public service robots challenging. Furthermore, during mapping, perspective shifts and pose drift can cause the system to repeatedly detect and label the same object at different times and locations, leading to semantic redundancy in the map [24]. With the advancement of deep learning, semantic acquisition methods based on object detection [25] and instance segmentation [26] have gradually become mainstream. Algorithms represented by the YOLO series are widely applied in semantic SLAM. However, standard YOLO models feature complex architectures and large parameter counts, posing challenges for direct deployment on mobile robot platforms [27]. To address this issue, lightweight networks such as MobileNet [28,29], ShuffleNet [30,31], EfficientNet [32], and FasterNet [33] have been proposed, enabling real-time perception for embedded systems. For instance, Zhang F et al. achieved model lightweighting by replacing the YOLO backbone with MobileNetV3 [34], while Zhang D et al. introduced FasterNet to enhance YOLOv8n-seg and reduce resource consumption [35].

To address the application requirements of service robots in indoor public areas, this paper proposes a low-resource semantic information extraction method based on improved instance segmentation. It also designs a continuous-frame dynamic constraint approach to resolve multi-view redundant labeling issues in semantic mapping. The main contributions are as follows:

Dataset Construction: Two specialized datasets were constructed for common objects in public indoor areas. This involved combining self-collected data with public datasets and applying comprehensive data augmentation techniques.
Synergistic Architectural Optimization for Edge Perception: We designed the lightweight instance segmentation model, MS-YOLO, specifically optimized for edge devices. By reconstructing the backbone with MobileNetV4 [36] and innovatively integrating the SHViT [37] module into the C2f neck to form a complementary “local-global” feature extractor, this architecture resolves the conflict between high-resolution perception and limited computational budgets. This synergy achieves a significant 32.2% reduction in FLOPs without compromising segmentation accuracy.
Application-Driven Strategy for Semantic Landmark Registration: To overcome the pervasive multi-view redundant labeling problem in semantic SLAM, we developed a computationally efficient Continuous-Frame Dynamic Constraint method. This method tightly couples TF coordinate transformations with the spatiotemporal consistency verification of deep point clouds, ensuring the uniqueness and stability of semantic landmarks in global maps without relying on computationally heavy tracking algorithms.
End-to-End System-Level Implementation and Validation: We successfully integrated the optimized perception and mapping modules into a complete ROS 2 framework and deployed it on a low-power edge platform. Real-world testing in public indoor scenarios demonstrates that our system achieves reliable, real-time semantic map generation, providing a highly practical and verifiable engineering solution for service robots.

2. Datasets

Public indoor areas (such as hospitals, laboratory buildings, and shopping malls) typically contain a series of representative objects. These objects include both everyday service facilities—such as doors, rubbish bins, and benches—and fire safety equipment installed to ensure security in high-traffic environments. These objects hold significant semantic value for service robots performing navigation, interaction, and assistance tasks. To ensure the constructed dataset comprehensively reflects the semantic characteristics of public indoor scenes, this study selected four object categories—doors, rubbish bins, benches, and fire extinguishers—as the target classes for instance segmentation in the first phase.

It is worth noting that while mainstream indoor datasets such as NYUv2 and ScanNet provide massive amounts of data, their class distributions heavily skew towards domestic and residential objects (e.g., beds, sofas, and desks). They lack critical public safety and functional landmarks essential for service robots in public areas. Because no publicly available dataset currently covers all the aforementioned categories simultaneously, the first-phase experimental dataset (i.e., Indoor1) was constructed using a multi-source fusion approach. Specifically, instance segmentation data for the four target categories (doors, rubbish bins, benches, and fire extinguishers) were obtained from the Roboflow platform (accessible online at: https://roboflow.com, accessed on 4 December 2025). The dataset comprises 767 images of doors, 202 images of rubbish bins, 459 images of benches, and 489 images of fire extinguishers, totaling 1917 images. All data were divided into training, validation, and test sets in a 7:2:1 ratio to ensure reliable model training and evaluation. Partial examples of the constructed indoor common object dataset are shown in Figure 1.
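The 7:2:1 division can be illustrated with a short sketch. The file names, the random seed, and the choice of a plain (non-stratified) shuffle are assumptions made for illustration; the paper does not state how the split was drawn.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and split them into train/val/test lists by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder absorbs rounding
    return train, val, test

# Hypothetical file names standing in for the 1917 Indoor1 images.
names = [f"img_{i:04d}.jpg" for i in range(1917)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))
```

A class-stratified split would be a reasonable alternative when the per-class counts are as uneven as they are here (202 rubbish-bin images versus 767 door images).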

To further validate the feasibility of the proposed method in real-world scenarios while evaluating the model’s generalization capability and stability in practical applications, this study constructed a self-collected dataset (Indoor2) in the second phase. Data collection was conducted at the Third Experimental Building of Chongqing University’s Huxi Campus, yielding 1905 raw images covering five target categories: doors, rubbish bins, fire extinguishers, fire hydrants, and evacuation equipment boxes. To enhance the performance, robustness, and generalization capability of the deep learning model, various data augmentation techniques were applied to the raw images. These included brightness adjustments, random cropping, arbitrary angle rotation, and mirror flipping. The effects of data augmentation are illustrated in Figure 2. Following these processes, the dataset size expanded to 5000 images, significantly increasing data diversity while preserving the original category distribution characteristics.

During the annotation phase of the self-collected dataset, this study employs the Labelme tool for polygonal annotation of images to precisely delineate target contours. This annotation method effectively captures shape features of objects, aiding instance segmentation models in learning more detailed boundary information. To ensure rigorous annotation quality and mitigate subjective bias, a cross-review mechanism was implemented. The dataset was independently annotated by two researchers. Subsequently, a third senior researcher conducted consistency checks and resolved inter-annotator disagreements through Intersection-over-Union (IoU) evaluations, ensuring high-fidelity ground truth generation. An annotation schematic is shown in Figure 3a. Annotation files were initially saved in JSON format and converted to TXT format prior to model training to meet the input requirements of the YOLO instance segmentation network. Figure 3b presents an example of a model training result. Ultimately, this high-quality annotated dataset provides a robust training foundation for the model in instance segmentation tasks within public indoor areas. The complete dataset and the source code for the proposed system are publicly available at: https://huggingface.co/datasets/lwarwick480-create/indoor (accessed on 21 March 2026).
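The JSON-to-TXT conversion mentioned above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the standard Labelme fields (`shapes`, `points`, `imageWidth`, `imageHeight`) and the usual YOLO segmentation label layout (class index followed by normalised polygon coordinates). The class-to-index mapping `CLASS_IDS` is hypothetical, since the paper does not list the index order.

```python
# Assumed class order; the paper does not state which index each class receives.
CLASS_IDS = {"door": 0, "rubbish_bin": 1, "fire_extinguisher": 2}

def labelme_to_yolo_seg(labelme, class_ids):
    """Convert one Labelme annotation dict into YOLO segmentation label lines."""
    w, h = labelme["imageWidth"], labelme["imageHeight"]
    lines = []
    for shape in labelme["shapes"]:
        if shape.get("shape_type", "polygon") != "polygon":
            continue  # only polygon annotations carry a usable contour
        if shape["label"] not in class_ids:
            continue  # skip labels outside the target classes
        coords = []
        for x, y in shape["points"]:
            # Normalise to [0, 1] and clamp points drawn slightly outside the image.
            coords.append(f"{min(max(x / w, 0.0), 1.0):.6f}")
            coords.append(f"{min(max(y / h, 0.0), 1.0):.6f}")
        lines.append(" ".join([str(class_ids[shape["label"]])] + coords))
    return lines

example = {
    "imageWidth": 640,
    "imageHeight": 480,
    "shapes": [{"label": "door", "shape_type": "polygon",
                "points": [[64, 48], [320, 48], [320, 432], [64, 432]]}],
}
yolo_lines = labelme_to_yolo_seg(example, CLASS_IDS)
print("\n".join(yolo_lines))
```

In practice each image's lines would be written to a TXT file sharing the image's base name.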

3. Method

3.1. Overall Structure

The system comprises two parallel pipelines: “Geometric Mapping” and “Semantic Landmark Generation,” as shown in Figure 4, aiming to achieve deep integration of environmental geometric structures and semantic information within a unified map coordinate system.

First, within the Geometric Mapping pipeline, the system employs the Cartographer algorithm as its foundational SLAM framework. By fusing multi-source sensor data—including 2D LiDAR (N10 LiDAR Raysense, Shenzhen, China), IMU, and odometry—the system performs real-time estimation of the robot’s position and posture while constructing an occupancy grid map of the environment. It is important to note that this geometric map construction relies strictly on the aforementioned raw physical sensor data and operates entirely independently of the custom visual datasets described in Section 2. This process provides an accurate global positioning reference and environmental geometric foundation for subsequent semantic overlay.

Subsequently, within the semantic landmark generation pipeline, the system synchronously captures color images and depth information using an RGB-D camera (Astra Pro Plus, Orbbec, Shenzhen, China). The color images are fed into the proposed lightweight instance segmentation network, MS-YOLO, to extract real-time object category labels and instance masks. The datasets constructed in Section 2 are exclusively utilized to train and validate this MS-YOLO model, equipping it with the necessary prior knowledge to recognize target objects in public indoor environments. Combined with the corresponding depth images, the system obtains 3D depth point clouds within the masks. To achieve co-domain alignment between semantic and geometric data, the system employs real-time pose output from Cartographer. Through Transform Frame (TF) coordinate transformation, the semantic point cloud is sequentially mapped from the camera coordinate system to the robot base coordinate system and then to the global map coordinate system.

During the point cloud processing stage, the coordinate-transformed semantic point cloud undergoes further processing including filtering for noise reduction, spatial clustering, and minimum bounding box fitting. This ultimately generates structured semantic landmarks containing position, category, and size information. Finally, through the Landmark Insertion module, these semantic objects are registered into the occupancy grid map, forming a final semantic map that combines both geometric reachability and semantic interpretability.

3.2. Low-Resource-Consumption Instance Segmentation Algorithm

3.2.1. MS-YOLO Instance Segmentation

YOLOv8-seg is a widely adopted instance segmentation framework that combines high detection accuracy with fast inference speed. Its engineering implementation features parameter and structural optimizations tailored for practical deployment. Its overall architecture comprises three components: a backbone network, a feature fusion neck, and a segmentation head. The backbone network extracts multi-level features bottom-up through convolutional blocks, Cross-Stage Partial with 2-way Fusion (C2f), and Spatial Pyramid Pooling-Fast (SPPF) modules, enhancing the receptive field and representational capacity. The neck employs Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) structures for cross-scale feature fusion, accommodating detection and segmentation needs for objects of varying sizes. The segmentation head outputs both bounding boxes and pixel-level masks based on decoupled detection, enabling instance-level object localization and segmentation. YOLOv8-seg offers multiple scale variants—n/s/m/l/x [38]—to balance accuracy and speed. Among these, YOLOv8n-seg represents the lightest configuration, significantly reducing computational overhead while maintaining respectable accuracy.

Although YOLOv8n-seg is lighter than the other variants, its parameter count and computational demands remain high for resource-constrained platforms such as service robots. To address this issue, this paper proposes an improved lightweight instance segmentation model, MS-YOLO, whose overall architecture is shown in Figure 5. The specific improvements are as follows: replacing the original backbone network of YOLOv8n-seg with the lighter MobileNetV4 network, while adjusting the channel count appropriately to reduce computational complexity and enhance feature extraction efficiency; and integrating the SHViT module with the original C2f module to construct a new composite module, C2f_SHViTBlock, thereby strengthening the model’s focus on target channel features. Through these optimizations, MS-YOLO enhances detection and segmentation accuracy while effectively reducing parameter size and computational load, making it more suitable for deployment on such platforms.

3.2.2. MobileNetV4 Backbone Network

This paper introduces MobileNetV4 as the backbone network based on YOLOv8n-seg. As shown in Figure 5, MobileNetV4 is designed for efficient computation scenarios. Its core innovation lies in proposing the Universal Inverted Bottleneck (UIB) [36] module. Combined with a structured network search strategy, it achieves a better trade-off between efficiency and performance. This module has become a standard building block for efficient networks [39,40]. This network maintains strong feature representation capabilities while preserving a small number of parameters and computational complexity, making it suitable for deployment on resource-constrained platforms such as service robots.

The structure of the UIB module is shown in Figure 6. Compared to the traditional IB (Inverted Bottleneck), UIB introduces two optional DW (Depthwise Convolution) layers between the expansion layer and the projection layer. During the NAS (Neural Architecture Search) process, it can be instantiated into four structural variants: ExtraDW (Extra DepthWise IB), IB, ConvNeXt-like, and FFN. Additionally, MobileNetV4 incorporates the FusedIB structure in the network’s stem phase, fusing standard convolutions with pointwise convolutions to further reduce memory access and computational overhead.

In terms of overall configuration, MobileNetV4 offers multiple scales—MNv4-Conv-S, MNv4-Conv-M, MNv4-Hybrid-M, MNv4-Conv-L, and MNv4-Hybrid-L—to accommodate varying computational demands. This paper selects the most compact MNv4-Conv-S as the backbone network. Its stage division and key hyperparameters, as shown in Table 1, better align with the real-time and energy constraints of the resource-constrained edge computing environment discussed herein.

To ensure seamless integration with the YOLOv8n-seg framework, this implementation retains the first five stages of MNv4-Conv-S while replacing its final module with the SPPF from YOLOv8n-seg. This achieves efficient multi-scale context aggregation and stable stride configurations, ensuring compatibility with subsequent FPN and PAN necks and the decoupled segmentation head.

3.2.3. C2f_SHViTBlock

The C2f module is a critical component in the YOLOv8 architecture, responsible for early feature refinement and propagation across stage-specific feature aggregation. Its structure follows the “cv1 splitting—Bottleneck extraction—Concat aggregation—cv2 compression” workflow: First, cv1 (typically a 1 × 1 convolution) performs channel-wise expansion and feature splitting. One branch bypasses directly to preserve the original representation, while the other enters multiple Bottleneck layers for sequential semantic extraction, with optional shortcut activation to ensure gradient flow and information recycling. Subsequently, the outputs from each Bottleneck layer and the bypass features are concatenated along the channel dimension. Finally, cv2 compresses the aggregated channel count to the target dimension. This design promotes stable training and multi-scale information fusion through parallel residual paths while achieving effective channel reorganization and model compression via two pointwise convolutions. However, edge devices like service robots face severe constraints in inference environments—limited computational and storage budgets, power/thermal boundaries, and strict real-time requirements—while perception tasks often operate at high resolutions and in multi-object interference scenarios. To balance expressive power and inference efficiency under resource constraints, this paper introduces SHViT as the attention unit. By employing single-head self-attention and lightweight projections, it reduces redundant computations and parameter overhead. While preserving global dependency modeling capabilities, it enhances complementarity with local priors from convolutions, achieving stronger discriminative power at lower resource costs to meet practical edge deployment demands.

As shown on the left side of Figure 7, SHViT adopts a serial structure centered on “lightweight convolution—single-head self-attention—forward reconstruction,” supplemented by residual connections at each level to stabilize training and suppress degradation. First, DWConv [41,42] performs spatial mixing within local regions while preserving edge and texture information. Then, Single-Head Self-Attention (SHSA) is introduced to model global dependencies. Finally, a lightweight Feedforward Network (FFN) performs nonlinear re-calibration in the channel dimension. This design introduces explicit long-range relationships with minimal parameter and computational overhead without altering the input/output tensor scale, thereby balancing expressiveness for high-resolution scenes with computational efficiency for edge deployment.

The focus of SHViT lies in the SHSA layer, as shown on the upper right of Figure 7. It applies a single-head attention layer to a portion $C_{p}$ of the input channels for spatial feature aggregation, while leaving the remaining channels unchanged. The SHSA layer can be described by the following formulas:

$$\mathrm{SHSA}(X) = \mathrm{Concat}(\tilde{X}_{att}, X_{res})\, W^{O} \tag{1}$$

$$\tilde{X}_{att} = \mathrm{Attention}(X_{att} W^{L}, X_{att} W^{M}, X_{att} W^{N}) \tag{2}$$

$$\mathrm{Attention}(L, M, N) = \mathrm{Softmax}\left(L M^{T} / \sqrt{d_{lm}}\right) N \tag{3}$$

$$X_{att}, X_{res} = \mathrm{Split}(X, [C_{p}, C - C_{p}]) \tag{4}$$

Here, $W^{L}$, $W^{M}$, $W^{N}$, and $W^{O}$ are projection weights, and $d_{lm}$ (default 16) denotes the query and key dimensions. To achieve consistent memory access, the initial $C_{p}$ channels ($C_{p} = rC$, where $r$ defaults to 1/4.67) serve as a representative for the entire feature map. Furthermore, the final SHSA projection is applied to all channels, not just the initial $C_{p}$ channels, ensuring effective propagation of attention features to the remaining channels. SHSA can be interpreted as sequentially stretching previously parallel-computed redundant heads along the block axis.
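A minimal NumPy sketch of Equations (1)–(4) may help make the data flow concrete. It treats the feature map as a flattened array of tokens and uses plain matrix projections; the tensor sizes, random weights, and the identity output projection in the example are illustrative assumptions, and the actual SHViT layer additionally uses convolutional projections and normalization that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shsa(X, W_L, W_M, W_N, W_O, c_p):
    """Single-head self-attention on the first c_p channels of X (n_tokens x C)."""
    X_att, X_res = X[:, :c_p], X[:, c_p:]                  # Eq. (4): channel split
    L, M, N = X_att @ W_L, X_att @ W_M, X_att @ W_N        # query, key, value
    d_lm = L.shape[-1]
    weights = softmax(L @ M.T / np.sqrt(d_lm))             # Eq. (3): attention map
    X_att_tilde = weights @ N                              # Eq. (2)
    return np.concatenate([X_att_tilde, X_res], axis=1) @ W_O   # Eq. (1)

rng = np.random.default_rng(0)
n_tokens, C, d_lm = 49, 64, 16        # e.g. a 7x7 map with 64 channels (illustrative)
c_p = int(C / 4.67)                   # partial ratio r = 1/4.67
X = rng.standard_normal((n_tokens, C))
W_L = rng.standard_normal((c_p, d_lm)) * 0.1
W_M = rng.standard_normal((c_p, d_lm)) * 0.1
W_N = rng.standard_normal((c_p, c_p)) * 0.1
# An identity W_O is used only to expose the untouched bypass channels.
out = shsa(X, W_L, W_M, W_N, np.eye(C), c_p)
print(out.shape, c_p)
```

The attention map has size n_tokens × n_tokens regardless of C, while the projection weights scale with $C_{p}$ rather than C, which is where the parameter saving of the partial-channel design comes from.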

This “partial channel + single-head” strategy significantly reduces the size of the attention matrix and the number of projection parameters. It preserves global aggregation capabilities across spatial positions while maintaining original details and gradient pathways through the bypass branch. Combined with the preceding depthwise convolution, it forms a complementary “local-global” modeling approach. Compared to multi-head or full-channel attention, SHSA offers superior resource efficiency, making it well-suited for lightweight, high-precision instance segmentation and detection tasks on resource-constrained devices such as service robots.

The proposed C2f_SHViTBlock is illustrated in the lower right corner of Figure 7: Without altering the C2f interface or scale, the original Bottleneck is equivalently replaced with SHViTBlock. This upgrades the single-path gradient flow (convolution only) to a multi-path information flow (“convolution + single-head attention”), enhancing spatial-channel interactions and long-range dependencies while maintaining controllable parameters and model size. It can be seamlessly integrated into YOLOv8.

3.3. Semantic Landmark Construction

To achieve stable and accurate annotation of semantic objects in the environment on occupancy grid maps, this paper constructs semantic landmarks in three steps: initial semantic point cloud acquisition, cross-frame consistency constraints, and minimum bounding rectangle generation for landmarks. First, during mapping, RGB-D cameras synchronously capture color and depth images. Instance segmentation from the previous section outputs object masks to filter the pixel set $\Omega = \{(u, v)\}$ belonging to the target. The pinhole camera model describes the relationship between a 2D pixel $\mathbf{u}$ and the 3D projection point $X^{c}$ in camera coordinates, expressed by Equation (5):

$$\mathbf{u} = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z^{c}} \begin{bmatrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^{c} \\ y^{c} \\ z^{c} \end{bmatrix} = \frac{1}{z^{c}} K X^{c} \tag{5}$$

Here, $K$ denotes the camera intrinsic matrix, and $X^{c} = [x^{c}, y^{c}, z^{c}]^{T}$ represents the coordinates of a point in the camera coordinate system. Since the RGB-D camera directly provides $(u, v, z^{c})$, the corresponding 3D point for each pixel in $\Omega$ is obtained using the back-projection formula:

$$X^{c} = z^{c} K^{-1} \mathbf{u} \tag{6}$$
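The back-projection of Equation (6) can be sketched in a few lines of NumPy. The intrinsic values, image size, and depth in metres below are illustrative assumptions, not the calibration of the camera used in the paper; pixels with zero depth are treated as invalid, which is a common convention for RGB-D sensors.

```python
import numpy as np

def back_project(mask, depth, K):
    """Return an (N, 3) array of camera-frame points for masked pixels with valid depth."""
    v, u = np.nonzero(mask & (depth > 0))       # row index is v, column index is u
    z = depth[v, u]
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).astype(float)  # 3 x N homogeneous
    rays = np.linalg.inv(K) @ pixels            # K^{-1} u for every pixel
    return (rays * z).T                         # X^c = z^c K^{-1} u

# Illustrative intrinsics (fx, fy, cx, cy) and a flat depth image at 2 m.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)
mask = np.zeros((480, 640), dtype=bool)
mask[240, 320] = mask[240, 420] = True          # two pixels of a hypothetical instance mask
points = back_project(mask, depth, K)
print(points)
```

The pixel at the principal point maps to a point on the optical axis, and the pixel 100 columns to its right maps to x = 2 × 100 / 500 = 0.4 m at the same depth.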

The labels resulting from instance segmentation are assigned as semantic labels to the 3D points, forming an initial semantic point cloud. To express this point cloud in the map coordinate system, the system uses the robot pose that Cartographer outputs during mapping. To ensure precise spatio-temporal alignment across these multi-source sensors, our system leverages the ROS 2 middleware framework. Temporally, timestamp synchronization between the RGB-D sensor streams and the Cartographer pose outputs is managed using the ROS message_filters mechanism. Spatially, the extrinsic parameters between the Astra Pro Plus depth camera, the 2D LiDAR, and the robot’s base chassis were pre-calibrated and are continuously broadcast via the ROS tf2 transformation tree. Combined with TF coordinate transformation, this yields a homogeneous transformation from the camera to the map. Based on this, the point cloud can be mapped to the map coordinate system, as illustrated in Figure 8.
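The camera-to-map mapping amounts to chaining two homogeneous transforms, which the following NumPy sketch illustrates outside of ROS. The helper names and the numeric poses are hypothetical; for simplicity the camera axes are assumed to be already aligned with the base axes, whereas a real optical frame would need an additional fixed rotation in the extrinsic transform.

```python
import numpy as np

def make_transform(yaw, translation):
    """Homogeneous 4x4 transform: rotation about z by yaw (radians) plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def camera_to_map(points_cam, T_map_base, T_base_cam):
    """Map (N, 3) camera-frame points into the map frame via the base frame."""
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])   # N x 4
    return (T_map_base @ T_base_cam @ homo.T).T[:, :3]

# Assumed extrinsics: camera mounted 0.1 m ahead of and 0.5 m above the base origin.
T_base_cam = make_transform(0.0, [0.1, 0.0, 0.5])
# Assumed robot pose in the map: at (2, 3), heading rotated by 90 degrees.
T_map_base = make_transform(np.pi / 2, [2.0, 3.0, 0.0])

points_map = camera_to_map(np.array([[1.0, 0.0, 0.0]]), T_map_base, T_base_cam)
print(points_map)
```

In the deployed system the two matrices would come from the tf2 tree (static extrinsics) and the Cartographer pose at the image timestamp, rather than being constructed by hand.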

During robotic motion, the same object may be detected multiple times across consecutive frames. Direct overlaying risks creating redundant annotations on the map. To address this, this paper proposes a consistency constraint of sequential semantic point clouds across consecutive frames. While rigorous 3D matching metrics (e.g., ICP or Chamfer distance) offer higher precision, we adopt a lightweight spatial similarity approximation. This intentional trade-off prioritizes real-time performance on edge devices, as the prior semantic filtering by MS-YOLO primarily necessitates rapid spatial co-location verification rather than complex structural matching. As shown in Algorithm 1, processing through this consistency constraint mechanism yields a preliminary semantic map as depicted in Figure 9a. However, this map still cannot distinguish between objects of the same class but different instances (e.g., multiple door instances).

Algorithm 1. Consistency Constraint of Sequential Semantic Point Clouds

Input: Sequence of semantic point clouds $\{P_{t}\}$, distance threshold $\theta$, stable duration $T$
Output: Updated semantic map $M'$

1:  Initialize $P_{last} \leftarrow \varnothing$
2:  Initialize $t_{last} \leftarrow 0$
3:  Initialize map $M' \leftarrow M$
4:  for each incoming point cloud $P_{t}$ at time $t$ do
5:      $\tilde{P}_{t} \leftarrow \mathrm{TransformToMapCoordinate}(P_{t})$
6:      if $P_{last} = \varnothing$ or $\big|\, |\tilde{P}_{t}| - |P_{last}| \,\big| > \Delta_{size}$ then
7:          $P_{last} \leftarrow \tilde{P}_{t}$
8:          $t_{last} \leftarrow t$
9:          continue
10:     end if
11:     $S(\tilde{P}_{t}, P_{last}) \leftarrow \frac{1}{N} \sum_{k=1}^{N} \| \tilde{p}_{t}^{k} - p_{last}^{k} \|_{2}$
12:     if $S(\tilde{P}_{t}, P_{last}) > \theta$ then
13:         $P_{last} \leftarrow \tilde{P}_{t}$
14:         $t_{last} \leftarrow t$
15:     else if $(t - t_{last}) \geq T$ then
16:         $M' \leftarrow \mathrm{InsertSemanticMarkers}(M', \tilde{P}_{t})$
17:     end if
18: end for
19: return $M'$
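Algorithm 1 can be sketched as a small stateful filter. This is an illustrative reading of the listing, not the authors' implementation: the class name and parameter values are hypothetical, the input clouds are assumed to be already transformed to the map frame, and, because the listing does not say how points are paired when the two clouds differ in size, the sketch pairs the first N points of each, where N is the smaller size.

```python
import numpy as np

class ConsistencyFilter:
    """Insert a semantic point cloud only after it has stayed stable for a given duration."""

    def __init__(self, theta, stable_duration, delta_size):
        self.theta = theta                # distance threshold (theta)
        self.T = stable_duration          # required stable duration (T)
        self.delta_size = delta_size      # allowed difference in point count
        self.p_last = None
        self.t_last = 0.0
        self.markers = []                 # stands in for the semantic map M'

    def update(self, p_t, t):
        """Process one map-frame cloud of shape (N, 3); return True if it was inserted."""
        if self.p_last is None or abs(len(p_t) - len(self.p_last)) > self.delta_size:
            self.p_last, self.t_last = p_t, t          # lines 6-9: reset the reference
            return False
        n = min(len(p_t), len(self.p_last))            # assumed pairing rule
        s = np.linalg.norm(p_t[:n] - self.p_last[:n], axis=1).mean()   # line 11
        if s > self.theta:                             # lines 12-14: cloud moved, reset
            self.p_last, self.t_last = p_t, t
        elif t - self.t_last >= self.T:                # lines 15-16: stable long enough
            self.markers.append(p_t)
            return True
        return False

cloud = np.array([[1.0, 2.0, 0.5], [1.1, 2.0, 0.5], [1.0, 2.1, 0.5]])
flt = ConsistencyFilter(theta=0.05, stable_duration=1.0, delta_size=50)
results = [flt.update(cloud, t) for t in (0.0, 0.5, 1.0, 1.5)]
results.append(flt.update(cloud + 1.0, 2.0))           # a displaced cloud resets the timer
print(results)
```

Following the listing literally, a cloud that remains stable keeps satisfying the insertion condition on later frames, so the insertion step itself presumably has to merge or overwrite an existing marker at the same location.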

To achieve instance-level differentiation, Euclidean clustering was performed on similar point clouds, with results shown in Figure 9b. Subsequently, on the ground plane of the map coordinate system, a minimum bounding rectangle was fitted to each instance point cluster. This yielded the instance’s center position, dimensions, and orientation, serving as a compact representation of the final semantic landmark. This format facilitates retrieval during human–computer interaction, as demonstrated in Figure 9c.
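The clustering and rectangle-fitting steps can be illustrated with a self-contained sketch. It is not the authors' implementation: the clustering below is a brute-force region growing suitable only for small clouds (a k-d tree would normally be used), the minimum-area rectangle is found by testing each convex-hull edge direction, and the tolerance and test points are illustrative.

```python
import numpy as np

def euclidean_clusters(points, tol):
    """Label points so that points linked by hops shorter than tol share a cluster."""
    labels = -np.ones(len(points), dtype=int)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            dist = np.linalg.norm(points - points[i], axis=1)
            for j in np.nonzero((dist < tol) & (labels == -1))[0]:
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, pts)))
    if len(pts) <= 2:
        return np.array(pts, dtype=float)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1], dtype=float)

def min_bounding_rectangle(xy):
    """Return (center, (side_a, side_b), angle) of the minimum-area rectangle of 2D points."""
    hull = convex_hull(xy)
    best = None
    for i in range(len(hull)):
        edge = hull[(i + 1) % len(hull)] - hull[i]
        angle = np.arctan2(edge[1], edge[0])
        c, s = np.cos(angle), np.sin(angle)
        R = np.array([[c, s], [-s, c]])          # rotates this edge onto the x-axis
        rot = hull @ R.T
        lo, hi = rot.min(axis=0), rot.max(axis=0)
        area = np.prod(hi - lo)
        if best is None or area < best[0]:
            center = R.T @ ((lo + hi) / 2)       # rectangle center back in the map frame
            best = (area, center, tuple(hi - lo), angle)
    return best[1], best[2], best[3]

# Two hypothetical same-class instances projected onto the ground plane, 10 m apart.
grid = np.array([[x, y] for x in (0, 0.5, 1, 1.5, 2) for y in (0, 0.5, 1)], dtype=float)
points = np.vstack([grid, grid + [10.0, 0.0]])
labels = euclidean_clusters(points, tol=0.6)
center, size, angle = min_bounding_rectangle(points[labels == 0])
print(labels.max() + 1, center, sorted(size))
```

Each cluster then yields one landmark record (class, center, dimensions, orientation), which is the compact representation described above.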

4. Experimental Design and Analysis of Results

4.1. Experimental Environment and Parameters

Hardware Configuration: Processor is Intel Corporation (Santa Clara, CA, USA) Xeon Platinum 8481C, 25 cores; Graphics card is NVIDIA Corporation (Santa Clara, CA, USA) RTX 4090, 24 GB × 1 VRAM. Software Configuration: Programming language is Python 3.8.10; Deep learning framework: PyTorch 2.0.0 (CUDA 11.8).

Parameter Settings: Training hyperparameters were empirically tuned to balance model convergence and computational efficiency. The input image size was set to $640 \times 640$ to ensure high-resolution feature extraction within GPU memory limits. Training was fixed at 100 epochs, which was experimentally observed to be sufficient for full feature learning while preventing overfitting. A batch size of 16 was chosen to optimize training speed and memory utilization on the RTX 4090. The initial learning rate was set to 0.01 to escape local optima early on, utilizing a cosine annealing strategy for smooth decay in later stages. Other parameters followed the YOLOv8 default settings.
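The shape of a cosine annealing schedule with these settings can be sketched as below. The initial rate (0.01) and epoch count (100) come from the text; the final learning-rate fraction is an assumption, since the paper does not state it, and the exact schedule used by the training framework may differ in detail (for example, warm-up epochs).

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.01, final_fraction=0.01):
    """Cosine decay from lr0 at epoch 0 to lr0 * final_fraction at the last epoch."""
    lr_min = lr0 * final_fraction          # assumed floor, not given in the paper
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * progress))

for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_lr(epoch), 6))
```

The rate stays close to its initial value during the first epochs, falls fastest around the midpoint, and flattens out again near the end, which is the "smooth decay in later stages" referred to above.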

The above environment and parameter settings ensure efficient experimental operation with hardware acceleration, providing a stable foundation for model convergence and performance evaluation.

4.2. Evaluation Indicators

To comprehensively evaluate the performance of the MS-YOLO instance segmentation model proposed in this paper, Precision, Recall, Mean Average Precision (mAP), and Giga Floating-point Operations (GFLOPs) were selected as the primary evaluation metrics.

Precision measures the proportion of samples predicted as positive by the model that are actually positive, reflecting its ability to avoid false positives during detection. Higher precision indicates fewer false positives. Its calculation formula is shown in (7):

P = \frac{T_{P}}{T_{P} + F_{P}}

(7)

Here, $T_{P}$ denotes the number of targets correctly detected, while $F_{P}$ represents the number of targets incorrectly detected.

Recall measures the proportion of all actual positive samples correctly detected by the model, reflecting its ability to minimize missed detections. A higher Recall indicates a higher target detection rate for the model. Its calculation formula is shown in (8):

R = \frac{T_{P}}{T_{P} + F_{N}}

(8)

Here, $F_{N}$ denotes the number of actual targets that were not detected.
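As a worked illustration of Equations (7) and (8), the following minimal sketch computes both metrics from raw counts. The function names and the example counts are illustrative assumptions and do not correspond to values reported in this paper.

```python
def precision(tp: int, fp: int) -> float:
    """Equation (7): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation (8): share of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
    print(precision(90, 10))  # 0.9
    print(recall(90, 30))     # 0.75
```

With these hypothetical counts, the detector is right 90% of the time when it fires but finds only 75% of the targets, which shows why the two metrics are reported together.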

mAP is the most commonly used comprehensive metric in object detection and instance segmentation tasks, reflecting the model’s overall detection accuracy at different Recall levels. First, the average precision (AP) is calculated for a single category. AP is computed based on the area under the Precision–Recall (P-R) curve, as shown in Formula (9):

AP = \int_{0}^{1} P(R) \, dR

(9)

Then, the mAP is calculated by averaging the AP values across all categories:

mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i}

(10)

Here, n denotes the number of classes. This paper employs two common forms: mAP@0.5 (mAP calculated at an IoU threshold of 0.5); mAP@[0.5:0.95] (AP calculated at IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, then averaged).
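Equations (9) and (10) can be approximated numerically as sketched below. The integration uses the monotone precision envelope with rectangle sums, which is one common convention; evaluation toolkits differ in their interpolation details, so this is not claimed to reproduce the exact values reported here. The function names, the example curve, and the per-class AP values are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area under the P-R curve; `recalls` must be sorted in ascending order."""
    # Pad the curve so that it spans recall 0 to 1.
    r: List[float] = [0.0, *recalls, 1.0]
    p: List[float] = [1.0, *precisions, 0.0]
    # Make precision non-increasing as recall grows (precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over each recall increment.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """Equation (10): mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# IoU thresholds used for mAP@[0.5:0.95]: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

if __name__ == "__main__":
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
    print(mean_average_precision({"door": 0.9, "bench": 0.8}))
```

For mAP@[0.5:0.95], the same computation would be repeated at each threshold in `IOU_THRESHOLDS` and the resulting values averaged.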

Intersection over Union (IoU) is defined as the ratio of the intersection area between the predicted bounding box and the ground truth bounding box to the union area:

IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}

(11)
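Equation (11) can be illustrated for both axis-aligned boxes and pixel masks. The sketch below is a minimal example under stated assumptions: boxes are given as (x_min, y_min, x_max, y_max), masks are represented as sets of pixel coordinates, and the function names are hypothetical.

```python
from typing import Set, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Pixel = Tuple[int, int]


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """IoU of two masks given as sets of pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))           # 1/7
    print(mask_iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)}))  # 1/3
```

A prediction counts as a true positive at mAP@0.5 only if its IoU with a ground-truth instance is at least 0.5, so the two overlapping boxes in the example (IoU of 1/7) would not match.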

GFLOPs denote the number of floating-point operations, in billions, required for a single forward pass of the model (not the number of operations executed per second), serving as a key metric of computational complexity and a hardware-independent proxy for inference cost. For a single convolutional layer, the count is calculated as shown in (12), and the model total is the sum over all layers:

GFLOPs = \frac{2 \times H_{out} \times W_{out} \times C_{in} \times K_{h} \times K_{w} \times C_{out}}{10^{9}}

(12)

Here, $H_{out}$ and $W_{out}$ denote the height and width of the output feature map, respectively, while $C_{in}$ and $C_{out}$ represent the input and output channel counts, respectively. $K_{h}$ and $K_{w}$ denote the height and width of the convolutional kernel, respectively.
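As a worked example of Equation (12), the sketch below evaluates the cost of a single standard convolution. The layer dimensions are taken from the first row of Table 1 (a 3 × 3 stride-2 convolution mapping a 224 × 224 × 3 input to 32 channels, which gives a 112 × 112 output); the assumptions are that the convolution is ungrouped and that bias terms are ignored. The function name is hypothetical.

```python
def conv_gflops(h_out: int, w_out: int, c_in: int, c_out: int,
                k_h: int, k_w: int) -> float:
    """Equation (12): GFLOPs of one standard (ungrouped) convolution layer."""
    # Each output element needs c_in * k_h * k_w multiply-add pairs,
    # counted as two floating-point operations each.
    return 2 * h_out * w_out * c_in * k_h * k_w * c_out / 1e9


if __name__ == "__main__":
    # Stem convolution from Table 1: 224x224x3 -> 112x112x32, 3x3 kernel.
    print(conv_gflops(112, 112, 3, 32, 3, 3))
```

For this layer the formula gives about 0.0217 GFLOPs. Depthwise or grouped convolutions, which lightweight backbones rely on heavily, require fewer operations than this formula suggests because each output channel sees only a subset of the input channels.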

In the experiments conducted in this paper, the Precision, Recall, and mAP metrics were obtained from the results.csv file generated after training. The GFLOPs values were reported automatically by the training framework when the model was built for training and inference.

4.3. Experimental Demonstration

To validate the effectiveness and superiority of the proposed MS-YOLO instance segmentation model in public indoor scenarios, this paper conducts systematic comparative experiments between it and the original YOLOv8n-seg model on the fusion dataset Indoor1 and the self-collected dataset Indoor2. The comparison primarily covers three aspects: detection performance, segmentation performance, and model complexity. The experimental results are shown in Table 2, Table 3, and Table 4, respectively.

From the perspective of detection performance (Table 2), on the Indoor2 dataset, MS-YOLO outperformed YOLOv8n-seg across all four metrics: Precision, Recall, mAP@0.5, and mAP@[0.5:0.95]. Specifically, Precision increased from 0.975 to 0.983, Recall rose to 0.972, and mAP@0.5 reached 0.989, indicating that the improved model possesses stronger object recognition and localization capabilities on the self-collected real-world data. On the Indoor1 fusion dataset, MS-YOLO demonstrates even larger performance gains, particularly in Recall and high-threshold mAP, achieving improvements of 7.0% and 8.9%, respectively. This indicates that the proposed lightweight backbone architecture and enhanced feature extraction module exhibit superior robustness and generalization capabilities under complex data distribution conditions.

The comparison of instance segmentation performance (Table 3) demonstrates that MS-YOLO achieves higher segmentation accuracy than the original model on both datasets. On the Indoor2 dataset, MS-YOLO achieves an mAP@0.5 of 0.987, with segmentation precision and recall improving to 0.981 and 0.970, respectively, indicating its ability to delineate object boundaries more accurately. On the Indoor1 dataset, MS-YOLO demonstrates even more pronounced advantages in segmentation, with mAP@[0.5:0.95] improving from 0.691 to 0.782. This indicates that the improved model fits instance contours considerably better under multi-scale objects and complex background conditions.

As shown in the model complexity comparison results (Table 4), MS-YOLO achieves significant lightweighting while maintaining or even improving segmentation accuracy. Compared to YOLOv8n-seg, MS-YOLO reduces the number of parameters from 3.259 million to 2.876 million, a decrease of approximately 11.7%. FLOPs are lowered from 12.1 G to 8.2 G, representing a reduction of 32.2%. These results demonstrate that the proposed optimization strategy effectively reduces computational overhead without compromising model performance, making it particularly suitable for deployment on resource-constrained edge platforms such as Jetson Orin NX.
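The relative reductions can be checked directly from the rounded values quoted in the text. The following sketch is a simple arithmetic illustration; the function name is hypothetical.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0


if __name__ == "__main__":
    # Parameter counts in millions and FLOPs in G, as quoted in the text.
    print(relative_reduction(3.259, 2.876))  # parameters
    print(relative_reduction(12.1, 8.2))     # FLOPs
```

With these rounded inputs, the parameter reduction evaluates to about 11.75%, in line with the roughly 11.7% stated in the abstract, and the FLOPs reduction evaluates to about 32.2%.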

Based on experimental results, MS-YOLO outperforms the baseline YOLOv8n-seg model in detection accuracy, segmentation performance, and computational efficiency, providing a more efficient and reliable perception foundation for subsequent semantic mapping tasks of service robots in public indoor areas.

To further justify our hyperparameter settings, particularly the selection of 100 training epochs, we analyzed the training convergence process of MS-YOLO. As illustrated by the Mask evaluation indicator curves in Figure 10a, metrics including precision, recall, and mAP increase rapidly during the early stages of training, indicating that the model swiftly learns the key features required to distinguish target objects from complex indoor backgrounds. Simultaneously, the validation loss curves (Figure 10b) reveal a consistent and steady decline across bounding box (Box_loss), segmentation (Seg_loss), classification (Cls_loss), and distribution focal loss (Dfl_loss). As training approaches 100 epochs, all loss values plateau and stabilize. This steady convergence indicates that the model has learned the dataset’s features without obvious signs of overfitting or underfitting, supporting the choice of the 100-epoch configuration.

4.4. Ablation Experiments

To evaluate the independent and synergistic contributions of the MobileNetV4 backbone and the C2f_SHViTBlock module to model performance, ablation experiments were conducted on the Indoor1 dataset. These experiments compared instance segmentation metrics (Precision, Recall, mAP@0.5) and model complexity variations across different structural combinations. The results are shown in Table 5.

Without incorporating MobileNetV4 and C2f_SHViTBlock, the model corresponds to the original YOLOv8n-seg, achieving Precision, Recall, and mAP@0.5 of 0.930, 0.854, and 0.893, respectively. Its parameter count and FLOPs were 3.259 M and 12.1 G. After replacing only the backbone network with MobileNetV4, the model’s parameters decreased to 3.147 M, and FLOPs were significantly reduced to 8.3 G, resulting in a computational cost reduction of approximately 31.4%. On this basis, Precision and mAP@0.5 improved to 0.942 and 0.916, respectively, indicating that the lightweight backbone maintains relatively stable feature representation capabilities while significantly reducing computational complexity.

When the SHViTBlock was introduced only in the C2f module, with the original backbone retained, the parameter count decreased from 3.259 M to 2.954 M. Precision and Recall improved to 0.955 and 0.871, respectively, with mAP@0.5 reaching 0.917. This demonstrates that the SHViTBlock effectively enhances feature modeling capabilities, particularly in expressing target semantic information more fully in complex backgrounds. However, with the backbone structure unchanged, FLOPs decreased only slightly, from 12.1 G to 11.4 G, indicating relatively limited computational compression.

When both the MobileNetV4 backbone and C2f_SHViTBlock are integrated, the model achieves an optimal balance between performance and complexity. Precision, Recall, and mAP@0.5 improve to 0.969, 0.888, and 0.923, respectively, while the number of parameters decreases to 2.876 M and FLOPs drop to 8.2 G. These results demonstrate that MobileNetV4 primarily handles significant computational compression, while C2f_SHViTBlock effectively compensates and stabilizes model accuracy on this foundation. The two components form a complementary relationship, aligning with the edge deployment requirements of resource-constrained scenarios such as public indoor service robots.

4.5. Model Comparison Experiments

To comprehensively evaluate the overall accuracy and computational complexity of different models, we conducted instance segmentation performance experiments using a diverse set of baseline models. In addition to YOLOv5n-seg and YOLOv11n-seg, we introduced YOLOv3-tiny-seg as an earlier lightweight baseline, YOLACT as a representative real-time one-stage model, and Mask R-CNN as a classic high-precision two-stage baseline. The experimental results (as shown in Table 6) demonstrate that MS-YOLO achieves the best results across all segmentation accuracy metrics, highlighting its advantages in complex indoor scene instance segmentation tasks.

Specifically, MS-YOLO outperforms both the YOLO series comparison models and YOLACT in Precision, Recall, mAP@0.5, and mAP@[0.5:0.95]. Notably, even compared with the heavyweight Mask R-CNN model, which typically excels at fine boundary delineation, MS-YOLO achieves slightly better performance, improving mAP@0.5 from 0.905 (Mask R-CNN) to 0.923 and the more stringent mAP@[0.5:0.95] from 0.778 to 0.782. This demonstrates MS-YOLO’s superior stability and generalization capabilities in target boundary delineation and fine-grained segmentation.

In terms of computational complexity, MS-YOLO requires only 8.2 GFLOPs. This is lower than other lightweight models such as YOLOv11n-seg (9.9 G, a 17.2% reduction), YOLOv5n-seg (11.0 G, a 25.5% reduction), and YOLOv3-tiny-seg (32.7 G), and it represents a far larger reduction in computational overhead compared to YOLACT (68.4 G) and Mask R-CNN (190.0 G). This reduction in computational demand is accompanied by higher segmentation accuracy, demonstrating that the proposed MS-YOLO strikes a favorable balance between accuracy and efficiency and is well suited to deployment in practical scenarios with limited computational resources.

Regarding the significance of the differences observed in Table 2, Table 3, Table 4, Table 5 and Table 6, it should be noted that in deep learning tasks, performance metrics are evaluated deterministically over large-scale validation sets. The consistent improvements achieved by MS-YOLO across multiple metrics and two distinct data distributions indicate robust performance gains. Furthermore, from an engineering perspective, achieving superior segmentation accuracy while securing a 32.2% reduction in FLOPs constitutes a highly significant practical advantage for real-time deployment on resource-constrained robotic platforms.

4.6. Semantic Mapping Results

Previous experiments have demonstrated that the MS-YOLO instance segmentation model proposed in this paper exhibits superior deployment adaptability on computationally constrained service robot platforms, providing a reliable perceptual foundation for edge-side semantic mapping. To validate the feasibility of our proposed low-computational-overhead, human–machine interaction-oriented semantic mapping method, we completed full-system deployment and real-world testing on edge devices. The hardware platform utilizes the Jetson Orin NX Super (NVIDIA, Santa Clara, CA, USA) as the computational unit (see Figure 11a), integrated with an Ackermann chassis (Wheeltec, Dongguan, China) and an Astra Pro Plus depth camera to form the mobile robot platform for experimental use (see Figure 11b). The software system is built upon ROS 2.

The experiment selected three indoor scenes for mapping trials, as shown in Figure 12, to cover diverse spatial configurations and object arrangements. It is crucial to clarify that these mapping trials were conducted online in real physical environments. The sensor data used to construct occupancy grid maps (Figure 13) and semantic maps (Figure 14) were collected in real-time during the robot’s navigation. This online mapping process is completely independent of the static image datasets described in Section 2, which were utilized solely for the offline training of the MS-YOLO model. Results demonstrate that within each scene, the system successfully constructs an occupancy grid map while simultaneously detecting, integrating, and updating semantic landmarks, ultimately generating a complete semantic map as shown in Figure 14. The spatial positions of target categories align with geometric structures such as walls and passageways, indicating the solution possesses strong edge-side implementability and environmental adaptability.

While Figure 14 visually confirms the successful construction of semantic maps, quantitative evaluation is essential to fully validate the system’s efficiency and map quality. However, as highlighted by recent studies [10], traditional mapping evaluation metrics such as Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) are primarily centered around the pose estimation of geometric SLAM systems. Since mature SLAM algorithms can readily achieve centimeter-level accuracy in typical indoor environments, further evaluating trajectory precision offers limited practical value for service robot navigation. Currently, the academic community lacks a universally recognized evaluation standard specifically for semantic mapping.

Therefore, rather than relying on geometric trajectory metrics, we focus on the practical effectiveness of our Continuous-Frame Dynamic Constraint algorithm in eliminating semantic redundancy. To quantitatively evaluate this, we define the Redundancy Elimination Rate (RER):

RER = \frac{N_{raw} - N_{final}}{N_{raw}} \times 100\%

(13)

where $N_{raw}$ is the number of raw semantic bounding boxes continuously detected, and $N_{final}$ is the final number of unique landmarks registered on the map. The evaluation results across the three scenes are presented in Table 7.
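Equation (13) reduces to a one-line computation, sketched below. The counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not taken from Table 7, and the function name is an assumption.

```python
def redundancy_elimination_rate(n_raw: int, n_final: int) -> float:
    """Equation (13): percentage of raw detections removed as redundant."""
    if n_raw <= 0:
        raise ValueError("n_raw must be positive")
    return (n_raw - n_final) / n_raw * 100.0


if __name__ == "__main__":
    # Hypothetical scene: 400 raw detections collapse into 10 landmarks.
    print(redundancy_elimination_rate(400, 10))  # 97.5
```

Because a static landmark is typically observed in many consecutive frames, the number of raw detections far exceeds the number of physical objects, so a high RER is expected whenever duplicates are merged correctly.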

As shown in Table 7, our algorithm effectively filters out transient multi-view redundancies, consistently achieving an RER of over 97%. This compression keeps the map concise while retaining each unique landmark. Furthermore, the end-to-end system maintains an inference speed of approximately 38 FPS on the Jetson Orin NX Super. Since 30 FPS is generally sufficient for real-time robotic applications, our system balances low map data overhead with real-time perception.

Beyond quantitative metrics, comparing the occupancy grid maps (Figure 13) with our semantic maps (Figure 14) further highlights the system’s advantages for human–computer interaction. While traditional grid maps primarily express geometric passability, our semantic maps explicitly integrate key object categories, locations, and instance boundaries. This rich and highly readable representation directly supports semantic-based task planning and complex instruction execution for service robots.

5. Discussion

The experimental results demonstrate that the proposed framework addresses the dual challenges of high computational overhead and semantic map redundancy on edge devices. By reconstructing the backbone with MobileNetV4 and introducing the SHViT module, MS-YOLO achieved a 32.2% reduction in FLOPs compared to the baseline while maintaining a high mAP@0.5 of 0.989 on the Indoor2 dataset. Furthermore, the continuous-frame dynamic constraint method effectively eliminated transient “ghosting” observations, achieving a redundancy elimination rate (RER) of over 97%. This indicates that deploying semantic perception and mapping on resource-constrained service robot platforms (e.g., Jetson Orin NX) is not only theoretically feasible but also practical from an engineering standpoint.

Despite these promising results, this study involves several intentional engineering trade-offs and limitations. First, regarding landmark representation, projecting 3D point clusters into 2D minimum bounding rectangles inherently discards vertical geometric details. While sufficient for planar robot navigation, it limits applications requiring full 3D manipulation. Second, the current system is optimized for static functional landmarks (e.g., doors, fire equipment). In highly complex and dynamic scenes—such as those with extremely dense crowds causing prolonged, severe occlusions—the semantic extraction process may be interrupted. Currently, the system relies on downstream local planners (e.g., TEB or DWA) to handle transient dynamic pedestrians, which is not reflected in the global semantic map.

To address these identified limitations, our future research will focus on two main directions. First, we plan to enhance the model’s robustness against severe occlusions by introducing temporal tracking mechanisms and multi-sensor fusion (e.g., integrating LiDAR intensity features with visual semantics). Second, we aim to deeply integrate the generated semantic maps into the robot’s active SLAM and autonomous decision-making systems, exploring how semantic priors can optimize global path planning and complex human–robot interaction tasks in crowded public environments.

6. Conclusions

This paper addresses the dual challenges of high computational overhead and semantic redundancy faced by existing semantic mapping methods on computationally constrained platforms. We propose an efficient semantic mapping system based on an improved lightweight instance segmentation network (MS-YOLO) and a Continuous-Frame Dynamic Constraint strategy.

The core innovation lies in achieving real-time, robust semantic environment representation on resource-constrained edge computing devices. Methodologically, MS-YOLO significantly reduces computational complexity (FLOPs and parameter counts) while maintaining high segmentation accuracy for typical indoor objects, outperforming the baseline YOLOv8n-seg model. Simultaneously, the proposed dynamic constraint method based on depth point clouds effectively filters transient observation noise, eliminating “ghosting” phenomena and ensuring the global uniqueness of semantic landmarks.

End-to-end deployment testing on a self-built hybrid dataset and the Jetson Orin NX platform successfully validated the system’s engineering applicability. The framework ensures stable operation in complex public indoor environments, generating highly readable semantic maps that achieve an excellent balance between real-time performance and spatial accuracy. Ultimately, this lightweight mapping solution provides a robust perceptual foundation for advanced robot navigation and human–robot interaction tasks in practical service domains.

Author Contributions

Conceptualization, Y.L. (Yumin Lu), L.Z. and Y.L. (Yingcheng Lin); methodology, Y.L. (Yumin Lu) and X.F.; software, Y.L. (Yumin Lu), Z.G. and J.W.; validation, X.F., Z.G. and J.W.; formal analysis, Y.L. (Yumin Lu) and X.F.; investigation, Y.L. (Yumin Lu) and J.W.; resources, L.Z. and Y.L. (Yingcheng Lin); data curation, Y.L. (Yumin Lu); writing—original draft preparation, Y.L. (Yumin Lu); writing—review and editing, Y.L. (Yumin Lu), L.Z., and Y.L. (Yingcheng Lin); visualization, X.F. and Z.G.; supervision, L.Z. and Y.L. (Yingcheng Lin); project administration, L.Z. and Y.L. (Yingcheng Lin); funding acquisition, L.Z. and Y.L. (Yingcheng Lin). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Innovation Key R&D Program of Chongqing (No. CSTB2025TIAD-STX0036).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study, including the source codes and the custom datasets, are publicly available to ensure full reproducibility. The complete source code for the MS-YOLO network and the semantic mapping system can be accessed on GitHub at: https://github.com/lwarwick480-create/Semantic-Mapping-using-Improved-YOLO.git (accessed on 19 March 2026). The custom Indoor1 and Indoor2 datasets, due to their large scale, are hosted and maintained long-term on Hugging Face, accessible at: https://huggingface.co/datasets/lwarwick480-create/indoor (accessed on 21 March 2026).

Conflicts of Interest

Authors Zonghuan Guo, Jianchao Wang and Lin Zhou were employed by the company Seres Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Pena, A.; Tejada, J.C.; Gonzalez-Ruiz, J.D.; Sepúlveda-Cano, L.M.; Chiclana, F.; Caraffini, F.; Gongora, M. An evolutionary intelligent control system for a flexible joints robot. Appl. Soft Comput. 2023, 135, 110043.
2. Müller, C. Market for Professional and Domestic Service Robots Booms in 2018. Available online: https://ifr.org/post/market-for-professional-and-domestic-service-robots-booms-in-2018 (accessed on 13 January 2021).
3. Lee, H.; Kang, D.; Kwak, S.S.; Choi, J. Designing Robotic Cabinets That Assist Users’ Tidying Behaviors. In Proceedings of the 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Naples, Italy, 31 August–4 September 2020; pp. 253–258.
4. Borghi, M.; Mariani, M.M.; Vega, R.P.; Wirtz, J. The impact of service robots on customer satisfaction online ratings: The moderating effects of rapport and contextual review factors. Psychol. Mark. 2023, 40, 2355–2369.
5. Kaleci, B.; Turgut, K.; Dutagaci, H. 2dlasernet: A deep learning architecture on 2d laser scans for semantic classification of mobile robot locations. Eng. Sci. Technol. Int. J. 2022, 28, 101027.
6. Alhmiedat, T.; Marei, A.M.; Messoudi, W.; Albelwi, S.; Bushnag, A.; Bassfar, Z.; Alnajjar, F.; Elfaki, A.O. A slam-based localization and navigation system for social robots: The pepper robot case. Machines 2023, 11, 158.
7. Yue, Y.; Zhao, C.; Wu, Z.; Yang, C.; Wang, Y.; Wang, D. Collaborative semantic understanding and mapping framework for autonomous systems. IEEE/ASME Trans. Mechatron. 2021, 26, 978–989.
8. Nüchter, A.; Hertzberg, J. Towards semantic maps for mobile robots. Robot. Auton. Syst. 2008, 56, 915–926.
9. Hong, Y.-T.; Huang, H.-P. A comparison of outdoor 3D reconstruction between visual SLAM and LiDAR SLAM. In Proceedings of the 2023 International Automatic Control Conference (CACS), Penghu, Taiwan, 26–29 October 2023; pp. 1–6.
10. Song, X.; Liang, X.; Huaidong, Z. Semantic mapping techniques for indoor mobile robots: Review and prospect. Meas. Control 2024, 58, 377–393.
11. Park, S.; Park, S.K. 2dpca-based method for place classification using range scan. Electron. Lett. 2011, 47, 1364–1366.
12. Premebida, C.; Faria, D.R.; Souza, F.A.; Nunes, U. Applying probabilistic mixture models to semantic place classification in mobile robotics. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 4265–4270.
13. Goeddel, R.; Olson, E. Learning semantic place labels from occupancy grids using CNNs. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 3999–4004.
14. Swadzba, A.; Wachsmuth, S. A detailed analysis of a new 3d spatial feature vector for indoor scene classification. Robot. Auton. Syst. 2014, 62, 646–662.
15. Swadzba, A.; Wachsmuth, S. Indoor scene classification using combined 3d and gist features. In Proceedings of the Asian Conference on Computer Vision (ACCV), Queenstown, New Zealand, 8–12 November 2010; pp. 201–215.
16. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
17. Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.; Davison, A.J. SLAM++: Simultaneous localization and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 1352–1359.
18. Günther, M.; Wiemann, T.; Albrecht, S.; Hertzberg, J. Model-based furniture recognition for building semantic object maps. Artif. Intell. 2017, 247, 336–351.
19. Lu, Y.; Zheng, H.; Chand, S.; Xia, W.; Liu, Z.; Xu, X.; Wang, L.; Qin, Z.; Bao, J. Outlook on human-centric manufacturing towards Industry 5.0. J. Manuf. Syst. 2022, 62, 612–627.
20. Adel, A. Future of industry 5.0 in society: Human-centric solutions, challenges and prospective research areas. J. Cloud Comput. 2022, 11, 40.
21. Mur-Artal, R.; Tardós, J.D. Visual-inertial monocular SLAM with map reuse. IEEE Robot. Autom. Lett. 2017, 2, 796–803.
22. Zhou, Y.; Chen, Q.; Niu, X. Kinematic measurement of the railway track centerline position by GNSS/INS/Odometer integration. IEEE Access 2019, 7, 157241–157253.
23. Martins, R.; Bersan, D.; Campos, M.F.M.; Nascimento, E.R. Extending Maps with Semantic and Contextual Object Information for Robot Navigation: A Learning-Based Framework Using Visual and Depth Cues. J. Intell. Robot. Syst. 2020, 99, 555–569.
24. Ma, T.; Jiang, G.; Ou, Y.; Xu, S. Semantic geometric fusion multi-object tracking and lidar odometry in dynamic environment. Robotica 2024, 42, 891–910.
25. Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. Sg-slam: A real-time rgb-d visual slam toward dynamic scenes with semantic and geometric information. IEEE Trans. Instrum. Meas. 2023, 72, 1–12.
26. Bista, S.R.; Hall, D.; Talbot, B.; Zhang, H.; Dayoub, F.; Sunderhauf, N. Evaluating the impact of semantic segmentation and pose estimation on dense semantic slam. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5328–5335.
27. Dilmi, W.; El Ferik, S.; Ouerdane, F.; Khaldi, M.K.; Saif, A.-W.A. Technical Aspects of Deploying UAV and Ground Robots for Intelligent Logistics Using YOLO on Embedded Systems. Sensors 2025, 25, 2572.
28. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
29. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
31. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
32. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
33. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
34. Zhang, F.; Tian, C.; Li, X.; Yang, N.; Zhang, Y.; Gao, Q. MTD-YOLO: An Improved YOLOv8-Based Rice Pest Detection Model. Electronics 2025, 14, 2912.
35. Zhang, D.; Lu, R.; Guo, Z.; Yang, Z.; Wang, S.; Hu, X. Algorithm for Locating Apical Meristematic Tissue of Weeds Based on YOLO Instance Segmentation. Agronomy 2024, 14, 2121.
36. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4 - Universal Models for the Mobile Ecosystem. arXiv 2024, arXiv:2404.10518.
37. Yun, S.; Ro, Y. SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design. arXiv 2024, arXiv:2401.16456.
38. Wu, T.; Dong, Y. YOLO-SE: Improved YOLOv8 for Remote Sensing Object Detection and Recognition. Appl. Sci. 2023, 13, 12977.
39. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976.
40. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
41. Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Shen, C. Conditional positional encodings for vision transformers. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
42. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366.

Figure 1. Training samples from the Indoor Common Objects Dataset (Indoor1). (a) Door; (b) rubbish bin; (c) Bench; (d) Fire extinguisher.

Figure 2. Examples of data augmentation. (a) Original image. (b) Brightness enhancement. (c) Random cropping. (d) Arbitrary rotation. (e) Image flipping.

Figure 3. Examples of image annotation and model training results: (a) example of Labelme manual annotation; (b) example of model training results.

Figure 4. Overall System Architecture: Grid-map Generation and Semantic Landmark Generation Pipelines.

Figure 5. MS-YOLO network structure.

Figure 6. Universal Inverted Bottleneck (UIB) blocks.

Figure 7. C2f_SHViTBlock, SHViT and SHSA network structure.

Figure 8. Initial Semantic Point Cloud Generation Process.

Figure 9. (a) Semantic map after consistency constraint, unable to distinguish between different instances of the same category; (b) Instance segmentation results based on Euclidean clustering; (c) Semantic landmarks (center, size, and orientation) generated from minimum bounding rectangles.
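
The two steps named in the Figure 9 caption, Euclidean clustering followed by a minimum bounding rectangle per cluster, can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it works on 2D points (for example, a semantic point cloud projected onto the ground plane), the function names and the `radius` threshold are hypothetical, and the rectangle search simply tries every convex-hull edge direction.

```python
import math
from collections import deque

def euclidean_clusters(points, radius):
    """Group 2D points; points linked by a chain of neighbours
    closer than `radius` (assumed threshold) share one cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, members = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append([points[k] for k in members])
    return clusters

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_bounding_rectangle(pts):
    """Return (center, (width, height), orientation in radians) of the
    smallest-area rectangle aligned with some convex-hull edge."""
    hull = convex_hull(pts)
    best = None
    for k in range(len(hull)):
        (x0, y0), (x1, y1) = hull[k], hull[(k + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        us = [x * c + y * s for x, y in hull]    # coordinates along the edge
        vs = [-x * s + y * c for x, y in hull]   # coordinates across the edge
        w, h = max(us) - min(us), max(vs) - min(vs)
        if best is None or w * h < best[0]:
            uc, vc = (max(us) + min(us)) / 2, (max(vs) + min(vs)) / 2
            center = (uc * c - vc * s, uc * s + vc * c)
            best = (w * h, center, (w, h), theta)
    _, center, size, theta = best
    return center, size, theta

# Two well-separated objects of the same category (made-up coordinates).
points = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (11.0, 11.0), (10.0, 11.0)]
for cluster in euclidean_clusters(points, radius=2.1):
    print(min_bounding_rectangle(cluster))
```

Each printed tuple corresponds to one landmark in the sense of panel (c): a center, a size, and an orientation. A production system would more likely rely on a point-cloud library with a spatial index rather than the quadratic neighbour search used here.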

Figure 10. (a) Evaluation metric curves for the mask; (b) loss curves of the prediction process.

Figure 11. (a) Main control board Jetson Orin NX Super; (b) Mobile robot with Ackermann steering chassis.

Figure 12. Actual scenes from three different environments: (a) experimental site A; (b) experimental site B; (c) experimental site C.

Figure 13. Occupancy grid maps of three different scenes: (a) experimental site A; (b) experimental site B; (c) experimental site C.

Figure 14. Semantic maps of three different scenes: (a) experimental site A; (b) experimental site B; (c) experimental site C.

Table 1. Architecture specification of MNv4-Conv-S.

Input	Block	$DW K_{1}$	$DW K_{2}$	Expanded Dim	Output Dim	Stride
$224^{2} \times 3$	Conv2D	-	$3 \times 3$	-	32	2
$112^{2} \times 32$	FusedIB	-	$3 \times 3$	32	32	2
$56^{2} \times 32$	FusedIB	-	$3 \times 3$	96	64	2
$28^{2} \times 64$	ExtraDW	$5 \times 5$	$5 \times 5$	192	96	2
$14^{2} \times 96$	IB	-	$3 \times 3$	192	96	1
$14^{2} \times 96$	IB	-	$3 \times 3$	192	96	1
$14^{2} \times 96$	IB	-	$3 \times 3$	192	96	1
$14^{2} \times 96$	IB	-	$3 \times 3$	192	96	1
$14^{2} \times 96$	ConvNext	$3 \times 3$	-	384	96	1
$14^{2} \times 96$	ExtraDW	$3 \times 3$	$3 \times 3$	576	128	2
$7^{2} \times 128$	ExtraDW	$5 \times 5$	$5 \times 5$	512	128	1
$7^{2} \times 128$	IB	-	$5 \times 5$	512	128	1
$7^{2} \times 128$	IB	-	$5 \times 5$	384	128	1
$7^{2} \times 128$	IB	-	$3 \times 3$	512	128	1
$7^{2} \times 128$	IB	-	$3 \times 3$	512	128	1
$7^{2} \times 128$	Conv2D	-	$1 \times 1$	-	960	1
$7^{2} \times 960$	AvgPool	-	$7 \times 7$	-	960	1
$1^{2} \times 960$	Conv2D	-	$1 \times 1$	-	1280	1
$1^{2} \times 1280$	Conv2D	-	$1 \times 1$	-	1000	1
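
The Input column of Table 1 can be cross-checked against the Stride column. The sketch below is a sanity check only, assuming "same"-style padding so that each block divides the spatial size by its stride; the `ROWS` list and `trace_resolution` are illustrative names, with strides copied from the convolutional rows of the table (first Conv2D through the 960-channel Conv2D).

```python
# (block name, stride) for the rows of Table 1 that precede the AvgPool layer.
ROWS = [
    ("Conv2D", 2), ("FusedIB", 2), ("FusedIB", 2), ("ExtraDW", 2),
    ("IB", 1), ("IB", 1), ("IB", 1), ("IB", 1), ("ConvNext", 1),
    ("ExtraDW", 2), ("ExtraDW", 1),
    ("IB", 1), ("IB", 1), ("IB", 1), ("IB", 1),
    ("Conv2D", 1),
]

def trace_resolution(input_size, rows):
    """Return the spatial size before the first block and after every block,
    assuming each block's output side equals input side // stride."""
    sizes = [input_size]
    for _name, stride in rows:
        sizes.append(sizes[-1] // stride)
    return sizes

sizes = trace_resolution(224, ROWS)
for (name, stride), size_in, size_out in zip(ROWS, sizes, sizes[1:]):
    print(f"{name:<9} stride {stride}: {size_in} x {size_in} -> {size_out} x {size_out}")
```

Five stride-2 blocks reduce the 224 × 224 input by a factor of 32, giving the 7 × 7 map that enters the 7 × 7 average pool. The pool collapses that map to 1 × 1, so the two 1 × 1 convolutions after it operate on a single spatial position.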

Table 2. Comparison of detection box-related metrics.

Model	Dataset	P	R	mAP@0.5	mAP@(50–90)
YOLOv8n-seg	Indoor1	0.955	0.816	0.913	0.716
MS-YOLO	Indoor1	0.98	0.886	0.942	0.805
YOLOv8n-seg	Indoor2	0.975	0.966	0.978	0.960
MS-YOLO	Indoor2	0.983	0.972	0.989	0.965

Table 3. Comparison of segmentation mask-related metrics.

Model	Dataset	P	R	mAP@0.5	mAP@(50–90)
YOLOv8n-seg	Indoor1	0.930	0.854	0.893	0.691
MS-YOLO	Indoor1	0.969	0.888	0.923	0.782
YOLOv8n-seg	Indoor2	0.974	0.965	0.975	0.960
MS-YOLO	Indoor2	0.981	0.970	0.987	0.966

Table 4. Comparison of model complexity.

Model	Parameters (Millions)	Model Size/MB	FLOPs/G
YOLOv8n-seg	3.259	6.8	12.1
MS-YOLO	2.876	6.1	8.2

Table 5. Comparison of ablation experiments.

MobileNetV4	C2f_SHViTBlock	P	R	mAP@0.5	Params (M)	Model Size/MB	FLOPs/G
×	×	0.930	0.854	0.893	3.259	6.8	12.1
√	×	0.942	0.862	0.916	3.1469	6.6	8.3
×	√	0.955	0.871	0.917	2.954	6.2	11.4
√	√	0.969	0.888	0.923	2.876	6.1	8.2

Table 6. Comparison of Parameters Related to Different Models.

Model	P	R	mAP@0.5	mAP@(50–90)	FLOPs/G
YOLOv3tiny-seg	0.907	0.826	0.866	0.709	32.7
YOLOv5n-seg	0.913	0.837	0.884	0.688	11.0
YOLOv11n-seg	0.963	0.876	0.917	0.765	9.9
Mask R-CNN	0.954	0.866	0.905	0.778	190
YOLACT	0.948	0.858	0.896	0.729	68.4
MS-YOLO	0.969	0.888	0.923	0.782	8.2
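
The relative savings quoted in the abstract follow directly from Tables 4 and 6. As a minimal sketch (the helper name and dictionary layout are illustrative, and only values printed in those two tables are used), the percentages can be recomputed as follows:

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Values copied from Table 4 (parameters, FLOPs) and Table 6 (FLOPs).
MS_YOLO = {"params_m": 2.876, "flops_g": 8.2}
BASELINE_PARAMS_M = {"YOLOv8n-seg": 3.259}
BASELINE_FLOPS_G = {"YOLOv8n-seg": 12.1, "YOLOv11n-seg": 9.9, "YOLOv5n-seg": 11.0}

params_cut = {name: relative_reduction(value, MS_YOLO["params_m"])
              for name, value in BASELINE_PARAMS_M.items()}
flops_cut = {name: relative_reduction(value, MS_YOLO["flops_g"])
             for name, value in BASELINE_FLOPS_G.items()}

for name, pct in params_cut.items():
    print(f"Parameters vs {name}: -{pct:.2f}%")
for name, pct in flops_cut.items():
    print(f"FLOPs vs {name}: -{pct:.2f}%")
```

The results are approximately 11.75% fewer parameters and 32.2% fewer FLOPs than YOLOv8n-seg, and 17.2% and 25.5% fewer FLOPs than YOLOv11n-seg and YOLOv5n-seg, in line with the figures stated in the abstract (the parameter figure appears there as 11.7%).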

Table 7. Evaluation of semantic redundancy elimination and inference speed.

Scene	$N_{r a w}$	$N_{f i n a l}$	RER	FPS
Scene(a)	289	6	0.979	38
Scene(b)	306	6	0.980
Scene(c)	257	6	0.976
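
The RER column can be reproduced from the two count columns if the redundancy elimination rate is taken to be the fraction of raw annotations removed, that is, $(N_{raw} - N_{final}) / N_{raw}$. That definition is an assumption inferred from the column names; the function name below is likewise illustrative.

```python
def redundancy_elimination_rate(n_raw, n_final):
    """Fraction of raw object annotations removed (assumed definition)."""
    if n_raw <= 0:
        raise ValueError("n_raw must be positive")
    return (n_raw - n_final) / n_raw

# (N_raw, N_final) pairs copied from Table 7.
SCENES = {"Scene(a)": (289, 6), "Scene(b)": (306, 6), "Scene(c)": (257, 6)}

rer = {scene: redundancy_elimination_rate(n_raw, n_final)
       for scene, (n_raw, n_final) in SCENES.items()}
for scene, value in rer.items():
    print(f"{scene}: RER = {value:.4f}")
```

Under this assumed definition the three scenes give roughly 0.9792, 0.9804, and 0.9767, which agree with the tabulated values to within 0.001.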

